University of Toronto
Department of Economics

ECO2010H1F Mathematics and Statistics for PhD Students
Instructor: Prof. Martin Burda
Summer 2022

Lecture Notes

Contents

Part I  Topology and Real Analysis

1 Methods of Proofs
  1.1 Preliminaries
  1.2 Proof Techniques
2 Set Theory
  2.1 Sets and Subsets
  2.2 Functions
  2.3 Sequences
3 Metric Spaces
  3.1 Metric
  3.2 Examples of the Metric Function
  3.3 Compactness
  3.4 Completeness
4 Analysis in Metric Spaces
  4.1 Continuity
  4.2 Fixed Points
  4.3 Convergence
5 Vector Spaces
  5.1 Normed Space
  5.2 Linear Combination
  5.3 Linear Independence
  5.4 Basis and Dimension
  5.5 Vector Spaces Associated with Matrices
6 Linear Algebra in Vector Spaces
  6.1 Linear Transformations
  6.2 Inner Product Spaces
  6.3 Orthogonal Bases
  6.4 Projections
  6.5 Euclidean Spaces
  6.6 Separating Hyperplane Theorem
7 Correspondences
  7.1 Continuity of Correspondences
  7.2 Kakutani's Fixed Point Theorem
8 Continuity
  8.1 Convexity
  8.2 The Maximum Theorem
  8.3 Semicontinuity

Part II  Optimization

9 Constrained Optimization
  9.1 Equality Constraints
  9.2 Extreme Value Theorem
  9.3 Inequality Constraints
  9.4 Application: The Consumer Problem
  9.5 Envelope Theorem for Unconstrained Optimization
  9.6 Envelope Theorem for Constrained Optimization
10 Dynamic Optimization
  10.1 Motivation: Labor Supply
  10.2 Approaches to Dynamic Optimization
  10.3 2 Periods
  10.4 T Periods
  10.5 General Problem
  10.6 Lagrangean Approach
  10.7 Maximum Principle
  10.8 Infinite Horizon
11 Dynamic Programming
  11.1 Finite Horizon: Key Concepts
  11.2 Infinite Horizon
12 DP Application - Economic Growth
  12.1 Euler Equation
  12.2 Hamiltonian
  12.3 Dynamic Programming
  12.4 Infinite Horizon
13 DP Application - Labor Supply
  13.1 Intertemporal Labor Supply
  13.2 Finite Horizon with No Uncertainty
  13.3 Finite Horizon with Uncertainty
  13.4 Infinite Horizon without Uncertainty
  13.5 Infinite Horizon with Uncertainty
  13.6 Example: Intertemporal Labor Supply Problem with Finite T
14 Dynamic Optimization in Continuous Time
  14.1 Discounting in Continuous Time
  14.2 Continuous Time Optimization
  14.3 Parallels with Discrete Time
  14.4 Application - Optimal Growth
15 Numerical Optimization
  15.1 Golden Search
  15.2 Nelder-Mead
  15.3 Newton-Raphson Method
  15.4 Quasi-Newton Methods

Part III  Statistical Analysis

16 Introduction to Probability
  16.1 Randomness and Probability
  16.2 Inference
17 Measure-Theoretic Probability
  17.1 Elements of Measure Theory
  17.2 Random Variable
  17.3 Conditional Probability and Independence
  17.4 Bayes Rule
18 Random Variables and Distributions
  18.1 Moments
  18.2 Transformations of Random Variables
  18.3 Parametric Distributions
19 Statistical Properties of Estimators
  19.1 Sampling Distribution
  19.2 Finite Sample Properties
  19.3 Convergence in Probability and Consistency
  19.4 Convergence in Distribution and Asymptotic Normality
20 Stochastic Orders and Delta Method
  20.1 A-notation
  20.2 O-notation
  20.3 o-notation
  20.4 Op-notation
  20.5 Delta Method
21 Regression with Matrix Algebra
  21.1 Multiple Regression in Matrix Notation
  21.2 MLR Assumptions in Matrix Notation
  21.3 Projection Matrices
  21.4 Unbiasedness of OLS
  21.5 Variance-Covariance Matrix of the OLS Estimator
  21.6 Gauss-Markov in Matrix Notation
22 Maximum Likelihood
  22.1 Method of Maximum Likelihood
  22.2 Examples
  22.3 Information Matrix Equality
  22.4 Asymptotic Properties of MLE
23 Generalized Method of Moments
  23.1 Motivation
  23.2 GMM Principle
  23.3 Example: Euler Equation Asset Pricing Model
24 Testing of Nonlinear Hypotheses
  24.1 Wald Test
  24.2 Likelihood Ratio Test
  24.3 Lagrange Multiplier Test
  24.4 LM Test in Auxiliary Regressions
  24.5 Test Comparison
25 Bootstrap Approximation
  25.1 The Empirical Distribution Function
  25.2 Bootstrap
26 Elements of Bayesian Analysis
  26.1 The Normal Linear Regression Model
  26.2 Bernstein-Von Mises Theorem
  26.3 Posterior Sampling
27 Markov Chain Monte Carlo
  27.1 Acceptance Sampling
  27.2 Metropolis-Hastings Algorithm
28 Neural Networks and Machine Learning
  28.1 Artificial Neural Networks
  28.2 Feed-Forward Network
  28.3 Machine Learning for NNs
  28.4 Cross-Validation
  28.5 Model Fit
  28.6 Recurrent Neural Networks

Part IV  References

ECO2010, Summer 2022

Part I  Topology and Real Analysis

1 Methods of Proofs

1.1 Preliminaries

Consider the statements A and B. We often construct statements using the two quantifiers ∃ (read as "there exists") and ∀ (read as "for all"). More specifically,

(∃x ∈ A)[B(x)] means "there exists an x in the set A such that B(x)" and
(∀x ∈ A)[B(x)] means "for all x in the set A, B(x)".

We can construct new statements out of existing statements by using the connectives "and" (∧), "or" (∨), and "not" (¬): A ∧ B means "A and B", A ∨ B means "A or B", ¬A means "not A".

Terminology:
Implication: A ⟹ B, i.e. "A implies B".
Inverse: ¬A ⟹ ¬B.
Converse: B ⟹ A.
Contrapositive: ¬B ⟹ ¬A.
Equivalence: A ⟺ B, which means "A is equivalent to B", or {A ⟹ B} ∧ {B ⟹ A}.

Statements:
A Theorem or Proposition is a statement that we prove to be true.
A Conjecture is a statement that does not (yet) have a proof.
A Lemma is a theorem we use to prove another theorem.
A Corollary is a theorem whose proof follows directly from a Theorem.
A Definition is a statement that is true because its terms are interpreted in such a way as to make the statement true.
An Axiom or Assumption is a statement that is taken to be true without proof.
A Tautology is a statement which is true without assumptions (for example, x = x).
A Contradiction is a statement that cannot be true (for example, "A is true and A is false").

1.2 Proof Techniques

1.2.1 Direct Proof

A direct proof establishes the conclusion by a logical combination of the hypothesis with axioms, definitions, and previously proved theorems.

Example:

Definition 1. An integer x is called even (respectively odd) if there is another integer k for which x = 2k (respectively x = 2k + 1).

Theorem 2. The sum of two even integers is always even.

Proof. Let x and y be any two even integers, so there exist integers a and b such that x = 2a and y = 2b. Then x + y = 2a + 2b = 2(a + b), which is even.

1.2.2 Constructive Proof

A constructive proof demonstrates the existence of an object by creating it, or by providing a method for creating it.

Example:

Definition 3. A rational number r is any number that can be expressed as the quotient p/q of two integers p and q, with q ≠ 0.

Theorem 4. For every positive rational number x, there exists a rational number y such that 0 < y < x.

Proof. Let x be any positive rational number. Then there exist positive integers a and b such that x = a/b. Let y = a/(2b). Then y > 0 is also rational and

y = a/(2b) < a/b = x.

1.2.3 Proof by Contradiction

A proof by contradiction shows that if the conclusion were false, a logical contradiction would occur. We start by assuming the hypothesis is true and the conclusion is false, and then we derive the contradiction.
To prove a logical implication A ⟹ B, assume A and ¬B and derive a contradiction from this assumption.

Example:

Theorem 5. There is no greatest even integer.

Note that here, for simplicity, the Theorem states only B, while A is a vacuous statement (anything).

Proof. Suppose ¬B, i.e. there is a greatest even integer N. Then for every even integer n, N ≥ n. Now let M = N + 2. Then M is an even integer by definition, and also M > N. This contradicts the supposition that N ≥ n for every even integer n. Hence the supposition is false and the statement in the Theorem is true.

1.2.4 Proof by Contrapositive

An implication and its contrapositive are equivalent statements:

{A ⟹ B} ⟺ {¬B ⟹ ¬A}

Proving the contrapositive is often more convenient or intuitive than proving the original statement.

Example:

Definition 6. Two integers are said to have the same parity if they are both odd or both even.

Theorem 7. If x and y are two integers for which x + y is even, then x and y have the same parity.

The contrapositive version of the Theorem is: "If x and y are two integers with opposite parity, then their sum must be odd."

Proof.
1. Assume x and y have opposite parity.
2. Suppose (without loss of generality) that x is even and y is odd. Thus there are integers k and m for which x = 2k and y = 2m + 1.
3. Compute the sum x + y = 2k + 2m + 1 = 2(k + m) + 1, which is an odd integer by definition.

1.2.5 Proof by Induction

The induction technique is useful for proving theorems involving an infinite number of cases based on some discrete method of categorizing them. Examples are theorems of the form "For every integer n ≥ n₀, a property P(n) holds", where P(n) depends on n.

There are two types of induction, weak induction and strong induction. Both involve proving a base case, for example proving the statement P(n₀), and then an inductive step in which we build up the proof for the other cases.
In the inductive step, we assume the result holds for one or more cases and then prove it for a new case. The difference between weak and strong induction lies in the form of this assumption, which is also called the inductive hypothesis.

In weak induction, the inductive hypothesis assumes the result is true for one case. For example, in the inductive step of a proof on the positive integers, we assume P(n) is true for some n and prove that P(n + 1) is also true.

Example:

Theorem 8. For any positive integer n,

1 + 2 + ⋯ + n = n(n + 1)/2    (1)

Proof. For the base case, take n = 1. Note that 1 = 2/2 = 1(1 + 1)/2. For the inductive step, suppose the result (1) holds for some positive integer n. Then, by adding (n + 1) to both sides we obtain

1 + 2 + ⋯ + n + (n + 1) = n(n + 1)/2 + (n + 1)
                        = n(n + 1)/2 + 2(n + 1)/2
                        = (n + 1)(n + 2)/2    (2)

Notice that the final equation (2) in the proof is just equation (1) of the theorem with n replaced by n + 1. Thus we have shown the result holds for n + 1 if it holds for n. The base case tells us it holds for n = 1. Then the inductive step says it also holds for n = 2. Once it holds for n = 2, the inductive step says it also holds for n = 3. This chain of consequences continues for all positive integers, and thus we have proved the theorem.

Strong induction employs a stronger inductive hypothesis that assumes the result is true for multiple cases. For example, we might assume P(k) is true for all integers 1 ≤ k ≤ n and prove P(n + 1) is also true.

Example:

Theorem 9 (The Fundamental Theorem of Arithmetic). Every integer n ≥ 2 can be factored into a product of primes

n = p₁p₂⋯p_r

in exactly one way.

This theorem actually contains two assertions: (1) the number n can be factored into a product of primes in some way, and (2) there is only one such way, aside from rearranging the factors. We will only prove the first assertion.
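The identity of Theorem 8 can be spot-checked numerically. A finite check is of course no substitute for the induction argument, but it illustrates case by case what the theorem asserts (the range of n below is arbitrary):

```python
# Spot-check of Theorem 8: 1 + 2 + ... + n = n(n + 1)/2.

def gauss_sum(n: int) -> int:
    """Closed form n(n + 1)/2 from Theorem 8."""
    return n * (n + 1) // 2

# Compare against direct summation for the first few hundred cases.
for n in range(1, 301):
    assert sum(range(1, n + 1)) == gauss_sum(n)
```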
Recall that a prime is a natural number greater than 1 that has no positive divisors other than 1 and itself. A natural number greater than 1 that is not a prime number is called a composite number.

Proof. For the base cases, consider n = 2, 3, 4. Certainly 2 = 2, 3 = 3, and 4 = 2·2, so each can be factored into primes. For the inductive step, let us suppose all integers 2 ≤ k ≤ n can be factored into primes. Then either n + 1 is a prime number, in which case it is its own factorization into primes, or it is composite. If n + 1 is composite, it is the product n + 1 = n₁n₂ of two integers with 2 ≤ n₁, n₂ ≤ n. By the inductive hypothesis, we know n₁ and n₂ can be factored into primes:

n₁ = p₁p₂⋯p_r
n₂ = q₁q₂⋯q_s

Thus n + 1 = n₁n₂ = p₁p₂⋯p_r q₁q₂⋯q_s, and n + 1 can be factored into a product of primes.

Mathematica demonstration: Proof by Induction
http://demonstrations.wolfram.com/ProofByInduction/

Reference:
Daniel Solow. How to Read and Do Proofs. John Wiley & Sons, Inc., New York, fourth edition, 2005. ISBN 0-471-68058-3.
William J. Turner. A Brief Introduction to Proofs. 2010. http://persweb.wabash.edu/facstaff/turnerw/Writing/proofs.pdf

2 Set Theory

2.1 Sets and Subsets

A set is a collection of distinct objects, considered as an object in its own right. We express the notion of membership by ∈, so that x ∈ A means "x is an element of the set A" and y ∉ A means "y is not an element of the set A."

B is a subset of A, written B ⊆ A, if every x ∈ B satisfies x ∈ A. If B ⊆ A and there exists an element in A which is not in B, we say that B is a proper subset of A and write B ⊂ A.

Two sets are equal if they contain the same elements: we write A = B if A ⊆ B and B ⊆ A, and A ≠ B otherwise.
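The membership, subset, and equality notions just defined can be illustrated with Python's built-in set type; the sets A and B below are arbitrary examples:

```python
# Membership, subset, proper subset, and equality for finite sets.
A = {1, 2, 3, 4}
B = {2, 4}

assert 2 in A and 5 not in A      # membership: 2 ∈ A, 5 ∉ A
assert B.issubset(A)              # B ⊆ A
assert B < A                      # proper subset: B ⊆ A and B ≠ A
assert A == {4, 3, 2, 1}          # equality: same elements, order irrelevant
```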
We usually define a set explicitly by saying "A is the set of all elements x such that P(x)", written as

A = {x : P(x)}

where P(x) denotes a meaningful property for each x. Sometimes we also write A = {x ∈ S : P(x)} for a set S to whose elements the property P applies in a given case.

Example: ℝ₊ = {x ∈ ℝ : x ≥ 0} defines the set of all non-negative real numbers.

The power set of A, denoted P(A), is the set of all subsets of A. A collection is a subset of P(A), i.e. a set of sets. A family is a set of collections. We have the set / collection / family hierarchy in place for cases in which we need to distinguish between several levels of set composition in the same context. A class is a collection of sets that can be unambiguously defined by a property that all its members share.

2.1.1 Sets of Numbers

The following are some of the most important sets we will encounter:

The set of natural numbers ℕ = {1, 2, 3, ...}
The set of integers ℤ = {..., −2, −1, 0, 1, 2, ...} and the set of non-negative integers ℤ₊ = {0, 1, 2, ...}
The set of rational numbers (or quotients) ℚ = {m/n : m, n ∈ ℤ, n ≠ 0}
The set of real numbers ℝ, which can be constructed as a completion of ℚ in such a way that a sequence defined by a decimal or binary expansion, e.g. {3, 3.1, 3.14, 3.141, 3.1415, ...}, converges to a unique real number. ℝ = ℚ ∪ 𝕀 where 𝕀 is the set of irrational numbers. Hence ℚ ⊂ ℝ. (There are several different ways to define the set of real numbers - if interested in the details, see the reference literature for this section.)

2.1.2 Set Operations

For sets A and B we define:

1. A ∩ B, the intersection of A and B, by A ∩ B = {x : [x ∈ A] ∧ [x ∈ B]}
2. A ∪ B, the union of A and B, by A ∪ B = {x : [x ∈ A] ∨ [x ∈ B]}
3. A \ B, the difference between A and B, by A \ B = {x ∈ A : x ∉ B}
4. A △ B, the symmetric difference between A and B, by A △ B = (A \ B) ∪ (B \ A)
5. Aᶜ, the complement of A, by Aᶜ = {x ∈ S : x ∉ A}
6. ∅, the empty set, by ∅ = Sᶜ
7. A and B to be disjoint if A ∩ B = ∅

2.1.3 Ordered Pair and Cartesian Product

An ordered pair is an ordered list (a, b) consisting of two objects a and b, whereby for any two ordered pairs (a, b) and (a′, b′) it holds that (a, b) = (a′, b′) if and only if a = a′ and b = b′.

If A and B are nonempty sets, then the Cartesian product, denoted A × B, is the set of all ordered pairs {(a, b) : a ∈ A and b ∈ B}. While in the set {a, b} there is no preference given to a over b in terms of order, i.e. {a, b} = {b, a}, the order structure allows us to distinguish between e.g. first and second sets.

Mathematica demonstration: Graph Products
http://demonstrations.wolfram.com/GraphProducts/

2.1.4 Relation

In this section we assume that X and Y are non-empty sets. A subset R of X × Y is called a relation from X to Y. If X = Y then we say that R is a relation on X (i.e. R ⊆ X × X). If (x, y) ∈ R then we think of R as associating x with y, and express this as xRy.

A relation R on X is:
reflexive if xRx for each x ∈ X;
complete if either xRy or yRx holds for each x, y ∈ X;
symmetric if, for any x, y ∈ X, xRy implies yRx;
antisymmetric if, for any x, y ∈ X, xRy and yRx imply x = y;
transitive if xRy and yRz imply xRz for any x, y, z ∈ X.

Let R be a reflexive relation on X. The asymmetric part of R is defined as the relation P_R on X with x P_R y if and only if xRy but not yRx. The relation I_R ≡ R \ P_R on X is called the symmetric part of R.

2.1.5 Equivalence Relation

A relation ∼ on X is called an equivalence relation if it is reflexive, symmetric, and transitive.
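The three defining properties can be verified mechanically on a finite example. The sketch below (an assumed illustration, using congruence modulo 3 on {0, ..., 8}) checks reflexivity, symmetry, and transitivity by brute force:

```python
# Check that congruence modulo 3 is an equivalence relation on a finite set.
X = range(9)
R = {(x, y) for x in X for y in X if (x - y) % 3 == 0}

reflexive  = all((x, x) in R for x in X)
symmetric  = all((y, x) in R for (x, y) in R)
transitive = all((x, z) in R
                 for (x, y) in R for (y2, z) in R if y == y2)

assert reflexive and symmetric and transitive
```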
For any x ∈ X, the equivalence class of x relative to ∼ is defined as the set

[x]_∼ ≡ {y ∈ X : y ∼ x}

A quotient set of X relative to ∼, denoted X/∼, is the class of all equivalence classes relative to ∼, that is,

X/∼ ≡ {[x]_∼ : x ∈ X}

One typically uses an equivalence relation to simplify a situation in such a way that all things that are indistinguishable from a particular perspective are put together in a set and treated as a single entity.

Example: Let X be the set of all the people in the world. "Being a sibling of" is an equivalence relation on X (under the convention that any person is also a sibling of oneself). Furthermore, the set of all Capricorns (or of any other particular sign) is an equivalence class on X under the relation "having the same star sign".

2.1.6 Partition

An equivalence relation can be used to decompose a set into subsets such that the members of each subset share the same equivalence property, while members of different subsets are "distinct". A partition of X is a class of pairwise disjoint subsets of X whose union is X.

Theorem 10 (Partition Theorem). For any equivalence relation ∼ on X, the quotient set X/∼ is a partition of X.

2.1.7 Order Relations

A relation ≽ on X is called a preorder on X if it is transitive and reflexive. A relation ≽ on X is called a partial order on X if it is an antisymmetric preorder on X.

Example 1: In individual choice theory, a preference relation ≽ on a set X of choice alternatives is defined as a preorder on X. Reflexivity is assumed and transitivity follows from agent rationality. The strict preference relation ≻ is defined as the asymmetric part of ≽ (it is transitive but not reflexive). The indifference relation ∼ is defined as the symmetric part of ≽ (it is an equivalence relation on X). For any x ∈ X, the equivalence class [x]_∼ is called the indifference class of x. This is a generalization of an indifference curve that passes through x. An implication of the Partition Theorem is that no two distinct indifference sets can have a point in common.
This is a generalization of the result that two indifference curves cannot cross.

Example 2: In social choice theory, we often work with multiple preference relations on a given choice alternative set X. Suppose there are n individuals in the population, and let ≽ᵢ denote the preference relation of the i-th individual. The Pareto dominance relation ≽ on X is then defined as x ≽ y iff x ≽ᵢ y for each i = 1, ..., n. Here ≽ is a preorder on X in general, and a partial order on X if each ≽ᵢ is antisymmetric.

2.2 Functions

Let X and Y be non-empty sets. A function f that maps X into Y, denoted f : X → Y, is a relation f ⊆ X × Y such that:
(1) ∀x ∈ X, ∃y ∈ Y s.t. x f y;
(2) ∀y, z ∈ Y with x f y and x f z, we have y = z.

Such a function is also often called a map. This is a set-theoretic formulation of the concept of a function used in calculus.

The set X is called the domain of f. The set Y is called the co-domain of f. The range of f is defined as

f(X) = {y ∈ Y : x f y for some x ∈ X}

Example: for the function f : ℝ → ℝ defined by f(x) = x², the co-domain of f is ℝ but the range is ℝ₊, i.e. the interval [0, ∞), since f does not map to any negative number.

The set of all functions that map X into Y is denoted by Y^X. For example, ℝ^[0,1] is the set of all real-valued functions on [0, 1]. A function whose domain and co-domain are identical, i.e. f ∈ X^X, is called a self-map on X.

The notation y = f(x) refers to y as the image (or value) of x under f. The image of a set A ⊆ X under f ∈ Y^X, denoted f(A), is defined as the collection of all elements y in Y with y = f(x) for some x ∈ A, that is,

f(A) ≡ {f(x) : x ∈ A}

The range of f is thus the image of the domain.

The inverse image of a set B ⊆ Y under f, denoted f⁻¹(B), is defined as the set of all x in X whose images under f belong to B, that is,

f⁻¹(B) ≡ {x ∈ X : f(x) ∈ B}

Note that f⁻¹ may or may not satisfy the definition of a function.
If f⁻¹ is a function, we say that f is invertible and f⁻¹ is the inverse of f.

Example: f(x) = x² is not invertible, since the point 1 does not have a unique image under f⁻¹ (both −1 and 1 map to 1 under f).

If f(X) = Y, that is, the range of f equals its co-domain, then we say that f maps X onto Y, and refer to it as a surjection (or a surjective function/map). If f maps distinct points in its domain to distinct points in its co-domain, that is, if x ≠ y implies f(x) ≠ f(y) for all x, y ∈ X, then we say that f is an injection (or a one-to-one, or an injective function/map). Finally, if f is both injective and surjective, then it is called a bijection (or a bijective function/map).

Let X, Z, and Y be non-empty sets. Let f : X → Z and g : Z → Y be functions. We define the composition function g ∘ f : X → Y by (g ∘ f)(x) = g(f(x)) for every x ∈ X.

Example: suppose that the functions f : ℝ → ℝ and g : ℝ → ℝ are defined by f(x) = x² and g(x) = x − 1. Then

(g ∘ f)(x) = x² − 1
(f ∘ g)(x) = (x − 1)²

Note that (g ∘ f)(x) = (f ∘ g)(x) does not hold in general; in this example, equality holds only if x = 1.

2.3 Sequences

A sequence in a set X is a function f : ℕ → X. We usually represent this function as (x₁, x₂, ...), denoted by (x_m), where x_i ≡ f(i) for each i ∈ ℕ. The set of all sequences in X is X^ℕ, commonly denoted as X^∞.

A subsequence (x_{m_k}) of a sequence (x_m) ∈ X^∞ is a sequence composed of terms of (x_m) that appear in (x_{m_k}) in the same order as in (x_m). For example, (x_{m_k}) ≡ (1, 1/3, 1/5, ...) is a subsequence of (x_m) ≡ (1, 1/2, 1/3, ...).

Reference: Ok, E. (2007): Chapter A, Sections 1.1 - 1.5

3 Metric Spaces

The defining characteristic of a set is embodied in the concept of membership. Every set partitions the population of all existing things into two groups: those things that are members (i.e. elements) of the set and those things that are not. Sets do not intrinsically involve any concept of distance between (i.e. nearness or farness of) their elements.
We can define such a concept using further mathematical structure and call it a metric. In this context, a set that is endowed with a metric (or distance) function is often referred to as a space, and each of its elements is referred to as a point in that space.

3.1 Metric

Let X be a non-empty set. A metric on X is a function d : X × X → ℝ₊ with the following properties:
(i) d(x, y) = 0 iff x = y;
(ii) d(x, y) = d(y, x) for all x, y ∈ X (Symmetry);
(iii) d(x, y) + d(y, z) ≥ d(x, z) for all x, y, z ∈ X (Triangle Inequality).

If X is a set and d is a metric on X, then (X, d) is a metric space. When the existence of d is apparent from the context, we denote a metric space simply by X.

3.2 Examples of the Metric Function

3.2.1 Discrete Spaces

For an arbitrary set X, with x, y ∈ X, let

d(x, y) = 0 if x = y,  d(x, y) = 1 if x ≠ y.

This metric is called a discrete metric.

3.2.2 Spaces of Sequences

For 1 ≤ p < ∞, let ℓ_p denote the set of all real-valued infinite sequences (x_m) such that Σᵢ |xᵢ|^p < ∞. Then, for (x_m), (y_m) ∈ ℓ_p,

d((x_m), (y_m)) = ( Σ_{i=1}^∞ |xᵢ − yᵢ|^p )^{1/p}

is a metric on ℓ_p.

Let ℓ_∞ denote the set of all bounded real sequences, i.e. sequences (x_m) with sup{|x_m| : m ∈ ℕ} < ∞. Then

d((x_m), (y_m)) = sup{|x_m − y_m| : m ∈ ℕ}

is a metric on ℓ_∞.
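As an informal illustration of the metric axioms, the following sketch checks properties (i)-(iii) numerically for a finite-dimensional analogue of the p-metric (the choice p = 2, the dimension, and the random sample points are all arbitrary assumptions for the demonstration):

```python
import itertools
import random

# Finite-dimensional analogue of the l_p distance, here with p = 2.
def d_p(x, y, p=2.0):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

random.seed(0)
pts = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(20)]

# Randomized check of the three metric axioms (with a float tolerance).
for x, y, z in itertools.product(pts, repeat=3):
    assert d_p(x, x) == 0                              # (i)  identity
    assert abs(d_p(x, y) - d_p(y, x)) < 1e-12          # (ii) symmetry
    assert d_p(x, y) + d_p(y, z) >= d_p(x, z) - 1e-12  # (iii) triangle inequality
```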
Let x; y 2 Rn : Then d1 (x; y) = n X i=1 jxi yi j is a metric on Rn (often called the Manhattan metric, or the taxicab metric) and v u n uX d2 (x; y) = t (xi yi )2 i=1 is also a metric on Rn : 3.2.5 Message Spaces Assume that we want to send messages in a language of N symbols (letters, numbers, punctuation marks, space, etc.) We assume that all messages have the same length K (if they are too short or too long, we either …ll them out or break them into pieces). We let X be the set of all messages, i.e. all sequences of symbols from the language of length K. If x = (x1 ; x2 ; :::; xK ) and y = (y1 ; y2 ; :::; yK ) are two messages, we de…ne d(x; y) = the number of indices n such that xn 6= yn Then d is a metric on the given message space. It is usually referred to as the Hamming metric, and is used in coding theory where it serves as a measure of how much a message gets distorted during transmission. 13 3. METRIC SPACES 3.2.6 ECO2010, Summer 2022 Further Examples Mathematica demonstration 1 : Metric Examples (left) http://demonstrations.wolfram.com/DistanceFunctions/ Mathematica demonstration 2: Radial Metric (right) http://demonstrations.wolfram.com/RadialMetric/ 3.3 3.3.1 Compactness Open and Closed Sets Let X be a metric space. For any x 2 X and " > 0; de…ne the "-neighborhood of x in X as the set N";X (x) fy 2 X : d(x; y) < "g: A neighborhood of x in X is any subset of X that contains at least one "-neighborhood of x in X: A subset S of X is said to be open in X (or an open subset of X) if, for each x 2 S; there exists an " > 0 such that N";X (x) S: A subset S of X is said to be closed in X (or a closed subset of X) if XnS is open in X: Note that changing a metric on a given set X would in general yield di¤erent classes of open (and hence closed) sets. 
3.3.2 Interior, Closure, Boundary

Let X be a metric space and S ⊆ X. The largest open set in X that is contained in S is called the interior of S (relative to X) and is denoted int_X(S). The smallest closed set in X that contains S is called the closure of S (relative to X) and is denoted cl_X(S). The boundary of S (relative to X) is defined as

bd_X(S) ≡ cl_X(S)\int_X(S).

3.3.3 Convergent Sequences

The properties of closedness and openness of a set in a metric space can also be characterized by means of sequences in that space. Let X be a metric space, x ∈ X, and let (xm) ∈ X^∞ be a sequence in X. We say that (xm) converges to x if, for each ε > 0, there exists M such that d(xm, x) < ε for all m ≥ M. We refer to x as the limit of (xm). Equivalent notation: d(xm, x) → 0, or xm → x, or lim xm = x. A sequence can converge to at most one limit.

3.3.4 Sequential Characterization of Closed Sets

Proposition (C.1). A set S in a metric space X is closed iff every sequence in S that converges in X converges to a point in S.

Note that closedness of S depends on X. For example, S = (0, 1) ⊆ R is closed in X = (0, 1) ⊆ R but is not closed in Y = [0, 1] ⊆ R.

3.3.5 Bounded and Connected Sets

A subset S of a metric space X is called bounded (in X) if there exists an ε > 0 such that S ⊆ N_{ε,X}(x) for some x ∈ X. If S is not bounded, then it is said to be unbounded. Let X be a metric space and S ⊆ X.
We say that S is connected if there do not exist two non-empty and disjoint open subsets O and U of S such that O ∪ U = S.

3.3.6 Compact Sets

Let X be a metric space and S ⊆ X. A class O of subsets of X is said to cover S if S ⊆ ∪O. If all members of such a class O are open in X, then we say that O is an open cover of S. A metric space X is said to be compact if every open cover of X has a finite subset that also covers X. A subset S of X is said to be compact in X (or a compact subset of X) if every open cover of S has a finite subset that also covers S.

Example 1: For more intuition, consider an example of a space that is not compact: the open interval (0, 1) ⊆ R. Let O be the collection

O ≡ { (1/i, 1) : i = 2, 3, ... }

and observe that

(0, 1) = (1/2, 1) ∪ (1/3, 1) ∪ ...,

that is, O is an open cover of (0, 1). However, O does not have a finite subset that covers (0, 1), because the greatest lower bound of any finite subset of O is bounded away from 0. We then conclude that (0, 1) is not a compact metric space.

Example 2: Now consider the closed interval [0, 1] ⊆ R and prove its compactness by contradiction. Assume there exists an open cover O of [0, 1] no finite subset of which covers [0, 1]. Then either [0, 1/2] or [1/2, 1] is not covered by any finite subset of O; pick the interval with this property and call it [a1, b1]. Then either [a1, (a1 + b1)/2] or [(a1 + b1)/2, b1] is not covered by any finite subset of O; pick the interval with this property and call it [a2, b2]. Continuing in this way inductively, we obtain two sequences (am) and (bm) in [0, 1] such that, for each m = 1, 2, ...,

(i) am ≤ am+1 < bm+1 ≤ bm;
(ii) bm − am = 1/2^m;
(iii) [am, bm] is not covered by any finite subset of O.

Using (i) and (ii), there exists c ∈ R such that {c} = ∩_{i=1}^∞ [ai, bi], for which c = lim am = lim bm. Take any O ∈ O with c ∈ O. Then [am, bm] ⊆ O for m large enough. This contradicts property (iii), and we conclude that [0, 1] is compact.
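The failure of finite subcovers in Example 1 can be illustrated numerically. A minimal Python sketch (not part of the notes; the sample index set F is arbitrary): any finite subfamily {(1/i, 1) : i ∈ F} leaves the initial segment (0, 1/max F] of (0, 1) uncovered.

```python
def covered(t, indices):
    """Check whether t in (0,1) lies in the union of the intervals (1/i, 1)."""
    return any(1 / i < t < 1 for i in indices)

# any finite subfamily {(1/i,1) : i in F} misses the points in (0, 1/max(F)]
F = [2, 5, 17, 100]
gap_point = 1 / (2 * max(F))           # a point below 1/max(F)
assert not covered(gap_point, F)       # uncovered by the finite subfamily
assert covered(gap_point, F + [1000])  # covered once a large enough i is added
```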
3.3.7 The Heine-Borel Theorem

The example above generalizes to any Euclidean space in the following Theorem.

Theorem 11 (The Heine-Borel Theorem). For any −∞ < a < b < ∞, the n-dimensional cube [a, b]^n is compact.

The proof of the Theorem follows the logic of the previous example, using n-dimensional cubes instead of intervals.

Proposition (C.4). Any closed subset of a compact metric space X is compact.

To prove the Proposition, let S be a closed subset of X. If O is an open cover of S (with sets open in X), then O ∪ {X\S} is an open cover of X. Since X is compact, there exists a finite subset of O ∪ {X\S}, say O′, that covers X. Then O′\{X\S} is a finite subset of O that covers S. QED.

An implication of the Heine-Borel Theorem and Proposition C.4 is that any n-dimensional prism [a1, b1] × ... × [an, bn] is a compact subset of R^n.

3.4 Completeness

3.4.1 Cauchy Sequences

Intuitively speaking, by a Cauchy sequence we mean a sequence the terms of which eventually get arbitrarily close to one another. Formally, a sequence (xm) in a metric space X is called a Cauchy sequence if, for any ε > 0, there exists an M ∈ N such that d(xk, xl) < ε for all k, l ≥ M. For instance, (1, 1/2, 1/3, 1/4, ...) is a Cauchy sequence in R. As another example, (−1, 1, −1, 1, ...) is not a Cauchy sequence in R.

Note that for any sequence (xm) in a metric space X, the condition that consecutive terms of the sequence get closer and closer, that is, d(xm, xm+1) → 0, is a necessary but not sufficient condition for (xm) to be Cauchy. For example, (ln(1), ln(2), ln(3), ...) is a divergent real sequence (not Cauchy), but ln(m + 1) − ln(m) = ln(1 + 1/m) → 0.

Note that Cauchy sequences are bounded. Moreover, any convergent sequence (xm) in X is Cauchy. On the other hand, a Cauchy sequence need not be convergent. For example, consider (1, 1/2, 1/3, 1/4, ...) as a sequence in the metric space (0, 1] (not in R).
This sequence does not converge in (0, 1] because its limit 0 does not belong to the space. Nonetheless, the sequence converges in [0, 1] (or in R). Yet if we know that (xm) is Cauchy, and that it has a convergent subsequence, say (x_{mk}), then we can conclude that (xm) converges. We summarize these properties of Cauchy sequences in the following Proposition.

Proposition 1. Let (xm) be a sequence in a metric space X.
(a) If (xm) is convergent, then it is Cauchy.
(b) If (xm) is Cauchy, then {x1, x2, ...} is bounded, but (xm) need not converge in X.
(c) If (xm) is Cauchy and has a subsequence that converges in X, then it converges in X as well.

3.4.2 Complete Metric Spaces

Suppose that we are given a sequence (xm) in some metric space, and we need to check if this sequence converges. Doing this directly requires us to "guess" a candidate limit x for the sequence, and then to show that we actually have xm → x (or that this never holds for any choice of x), which may not be feasible or efficient. A better alternative is to check whether or not the sequence at hand is Cauchy. If it is not Cauchy, then it cannot be convergent by the Proposition above. If it is Cauchy, and if we knew that in the given metric space all Cauchy sequences converge, then we would have our result. In such a space a sequence is convergent iff it is Cauchy, and hence convergence can always be tested by using the "Cauchyness" condition. This property is called completeness.

A metric space X is said to be complete if every Cauchy sequence in X converges to a point in X. We have seen that (0, 1] ⊆ R is not complete. Another example of an incomplete metric space is the set of rational numbers Q ⊆ R. Indeed, for any x ∈ R\Q we can find a sequence (xm) ∈ Q^∞ with lim xm = x. Then (xm) is Cauchy, but it does not converge in Q.
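The two real sequences discussed above, (1/m) and (ln m), can be compared numerically. A minimal Python sketch (not part of the notes): consecutive terms get close in both sequences, but only the first has uniformly small tails over a fixed window, consistent with (ln m) not being Cauchy.

```python
import math

def consecutive_gap(x, m):
    # |x_{m+1} - x_m|: small for both sequences, so this alone cannot detect Cauchyness
    return abs(x(m + 1) - x(m))

def tail_spread(x, M, tail=1000):
    # max pairwise distance over the terms x_M, ..., x_{M+tail-1};
    # this stays small for a Cauchy sequence
    vals = [x(m) for m in range(M, M + tail)]
    return max(vals) - min(vals)

harmonic = lambda m: 1 / m         # Cauchy in R
logseq = lambda m: math.log(m)     # divergent, not Cauchy

# consecutive terms get close in BOTH sequences...
assert consecutive_gap(harmonic, 10**6) < 1e-9
assert consecutive_gap(logseq, 10**6) < 1e-5
# ...but only the harmonic sequence has a uniformly small tail window
assert tail_spread(harmonic, 10**6) < 1e-6
assert tail_spread(logseq, 10**6) > 9e-4
```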
Metric spaces that are complete include R, R^n, ℓp, and B(T). There is a tight connection between the closedness of a set and its completeness as a metric subspace.

Proposition 2. Let X be a metric space, and Y a metric subspace of X. If Y is complete, then it is closed in X. Conversely, if Y is closed in X and X is complete, then Y is complete.

This is a useful observation that lets us obtain other complete metric spaces from the ones we already know to be complete. One example is the space C[0, 1], which is a metric subspace of B[0, 1]. Since B[0, 1] is complete and C[0, 1] is closed in B[0, 1], C[0, 1] is also complete.

Reference: Ok, Chapter C, Sections 1, 3, 5

4 Analysis in Metric Spaces

The theory behind many applications in economics is developed in general metric spaces. Examples include:

- Existence of competitive equilibrium in general equilibrium theory
- Existence of a solution of a dynamic optimization problem
- Existence of a Nash equilibrium in game theory

4.1 Continuity

Let (X, dX) and (Y, dY) be two metric spaces. The map f : X → Y is continuous at x ∈ X if, for any ε > 0, there exists a δ > 0 such that dX(x, y) < δ implies dY(f(x), f(y)) < ε. The map f is continuous if it is continuous at every x ∈ X.

Theorem 12. Let X and Y be two metric spaces, and f : X → Y a continuous function. If S is a compact subset of X, then f(S) is a compact subset of Y.

The following result is foundational for optimization theory.

Theorem 13 (Weierstrass Extreme Value Theorem). If X is a compact metric space and φ ∈ R^X is a continuous function, then there exist x, y ∈ X with φ(x) = sup φ(X) and φ(y) = inf φ(X).

A metric space X is connected if there do not exist two non-empty and disjoint open subsets O and U such that O ∪ U = X. Example: any interval I ⊆ R is connected in R.

Theorem 14 (Intermediate Value Theorem). Let X be a connected metric space and φ : X → R a continuous function.
If φ(x) ≤ γ ≤ φ(y) for some x, y ∈ X, then there exists z ∈ X such that φ(z) = γ.

4.2 Fixed Points

Definition 15 (Fixed point). Let (X, d) be a metric space and f : X → X be a function (or a self-map). We call s ∈ X a fixed point of the function f if s = f(s).

Let X be a metric space. A self-map Φ on X is said to be a contraction (or a contractive self-map) if there exists a real number 0 < K < 1 such that

d(Φ(x), Φ(y)) ≤ K d(x, y) for all x, y ∈ X.

The infimum of the set of all such K is called the contraction coefficient of Φ.

Example: let f ∈ R^R with f(t) = λt. Then f is a contraction iff |λ| < 1.

4.2.1 Blackwell's Contraction Lemma

The contraction property for self-maps in metric spaces can be checked using the following Lemma.

Lemma 16 (Blackwell's Contraction). Let T be a non-empty set, and X ⊆ B(T) a set that is closed under addition of positive constant functions, that is, f ∈ X implies f + α ∈ X for any α > 0. Assume that Φ is an increasing self-map on X, that is, Φ(f) ≤ Φ(g) for any f, g ∈ X with f ≤ g. If there exists 0 < δ < 1 such that

Φ(f + α) ≤ Φ(f) + δα for all (f, α) ∈ X × R+,

then Φ is a contraction.

Example: Blackwell's Contraction Lemma is typically used to show that the Bellman Equation mapping on a space of functions in dynamic programming, which we will encounter later in the course, is a contraction.

4.2.2 Banach Fixed Point Theorem

Theorem 17 (The Banach Fixed Point Theorem). Let X be a complete metric space. If Φ ∈ X^X is a contraction, then there exists a unique x* ∈ X such that Φ(x*) = x*.

By the Banach Fixed Point Theorem, the Bellman Equation in the Example above has a unique solution. Note that the exact definition of the domain and co-domain of Φ matters: a metric space is complete if and only if every contraction on it has a fixed point.

4.2.3 Brouwer Fixed Point Theorem

Existence of a fixed point can also be shown in settings where we cannot prove the contractive property of a self-map.
We can guarantee a fixed point of a continuous self-map, as long as the domain of this map is sufficiently well-behaved.

Theorem 18 (The Brouwer Fixed Point Theorem). For any given n ∈ N, let S be a non-empty, closed, bounded, and convex subset of R^n. If Φ is a continuous self-map on S, then there exists s* ∈ S such that Φ(s*) = s*.

Example: an important result due to Nash (1950) says that every finite strategic form game has a mixed strategy equilibrium. The proof is based on Kakutani's generalization of the Brouwer Fixed Point Theorem.

4.3 Convergence

4.3.1 Uniform Convergence

Let T be a metric space. Let B(T) denote the set of all bounded real-valued functions defined on T, endowed with the sup-metric d∞. Denote by CB(T) the space of all continuous and bounded real-valued functions defined on T, whereby CB(T) ⊆ B(T). Recall that C(T) denotes the space of all continuous real-valued functions defined on T.

Let (φm) be a sequence in B(T) and let

lim_{m→∞} sup{|φm(x) − φ(x)| : x ∈ T} = 0,

that is, φm converges to some φ with respect to d∞. We refer to φ as the uniform limit of φm, and say that (φm) converges to φ uniformly.

Uniform convergence is a global notion of convergence. For sequences of functions it involves convergence over the entire domain of the functions, not only at a particular point x. The following theorem says that uniform convergence preserves continuity in the limit.

Theorem 19. If φm ∈ C(T) and φm → φ uniformly, then φ ∈ C(T).

Furthermore, uniform convergence allows us to interchange the order of taking limits. Let (x^k) be a sequence in T such that x^k → x, and let (φm) be a sequence in C(T) such that φm → φ uniformly. Then

lim_{k→∞} lim_{m→∞} φm(x^k) = lim_{k→∞} φ(x^k) = φ(x) = lim_{m→∞} φm(x) = lim_{m→∞} lim_{k→∞} φm(x^k).

4.3.2 Pointwise Convergence

In contrast, pointwise convergence requires only that

lim_{m→∞} |φm(x) − φ(x)| = 0 for each x ∈ T.

Here we say that φm → φ pointwise.
This mode of convergence is local, at a particular point x. Uniform convergence implies pointwise convergence, while the converse is not true. As an example, consider the sequence (fm) in C[0, 1] defined by fm(t) = t^m. Then (fm) converges to 1_{1} pointwise but not uniformly, where 1_{1} is the indicator function of {1} in [0, 1]. If d∞(fm, 1_{1}) → 0 were the case, then 1_{1} would have to be continuous, which is not the case.

Pointwise convergence does not necessarily allow us to interchange the order of taking limits. As an example, let (t_k) = (0, 1/2, ..., 1 − 1/k, ...). Then

lim_{k→∞} lim_{m→∞} fm(t_k) = 0 ≠ 1 = lim_{m→∞} lim_{k→∞} fm(t_k).

Mathematica demonstration: convergence that is pointwise but not uniform
http://demonstrations.wolfram.com/ANoncontinuousLimitOfASequenceOfContinuousFunctions/

Reference: Ok, Chapter C, Section 6.1; Aliprantis, C. D. and K. C. Border (2006)

5 Vector Spaces

We will now endow metric spaces with further structure consisting of additional properties and algebraic operations.

Definition 20 (Vector space). A vector space (also called linear space) over R is a set V of arbitrary elements (called vectors) on which two binary operations are defined:
(i) vector addition: if u, v ∈ V, then u + v ∈ V
(ii) scalar multiplication: if a ∈ R and v ∈ V, then av ∈ V
which satisfy the following axioms:
C1: u + v = v + u for all u, v ∈ V
C2: (u + v) + w = u + (v + w) for all u, v, w ∈ V
C3: there exists 0 ∈ V such that v + 0 = v = 0 + v for all v ∈ V
C4: for all v ∈ V there exists (−v) ∈ V such that v + (−v) = 0 = (−v) + v
C5: 1v = v for all v ∈ V
C6: a(bv) = (ab)v for all a, b ∈ R and all v ∈ V
C7: a(u + v) = au + av for all a ∈ R and all u, v ∈ V
C8: (a + b)v = av + bv for all a, b ∈ R and all v ∈ V

5.1 Normed Space

The notion of distance in a vector space is introduced through a function called a norm.

Definition 21 (Norm). If V is a vector space, then a norm on V is a function from V to R, denoted v ↦
‖v‖, which satisfies the following properties for all u, v ∈ V and all a ∈ R:
(i) ‖v‖ ≥ 0
(ii) ‖v‖ = 0 iff v = 0
(iii) ‖av‖ = |a| ‖v‖
(iv) ‖u + v‖ ≤ ‖u‖ + ‖v‖

Definition 22 (Normed vector space). A vector space in which a norm has been defined is called a normed vector space.

Note that the algebraic operations of vector addition and scalar multiplication are used in defining a norm. Thus, a norm cannot be defined in a general metric space, where these operations may not be defined. In the next theorem, we establish formally the relationship between a metric space and a normed vector space.

Theorem 23. Let V be a vector space. Then
(i) If (V, d) is a metric space, then (V, ‖·‖) is a normed vector space with the norm ‖·‖ : V → R defined as ‖x‖ = d(x, 0) for all x ∈ V.
(ii) If (V, ‖·‖) is a normed vector space, then (V, ρ) is a metric space with the metric ρ : V × V → R defined as ρ(x, y) = ‖x − y‖ for all x, y ∈ V.

Recall that any metric space having the property that all Cauchy sequences are also convergent sequences is said to be a complete metric space (i.e. limits are preserved within such spaces).

Definition 24 (Banach space). A complete normed vector space is called a Banach space.

Examples of Banach spaces include C[a, b], the set of all real-valued functions continuous on the closed interval [a, b], and R^n, the Euclidean space. Note that in a vector space of functions, such as C[a, b], each "vector" (i.e. element of the space) is a function.

We will now introduce the concept of a subspace.

Definition 25. Suppose that V is a vector space over R, and that W is a subset of V. Then we say that W is a subspace of V if W forms a vector space over R under the vector addition and scalar multiplication defined in V.

Example: Consider the vector space R^2 of all points (x, y), where x, y ∈ R. Let L be a line through the origin 0 = (0, 0).
Suppose that L is represented by the equation αx + βy = 0; in other words,

L = {(x, y) ∈ R² : αx + βy = 0}.

Note that if (x, y), (u, v) ∈ L, then αx + βy = 0 and αu + βv = 0, so that α(x + u) + β(y + v) = 0, whence (x, y) + (u, v) = (x + u, y + v) ∈ L. Thus, every line in R² through the origin is a vector space over R.

5.2 Linear Combination

Our ultimate goal in this Section is to determine a subset B of vectors in V and describe every element of V in terms of elements of B in a unique way. The first step in this direction is summarized below.

Definition 26 (Linear combination). Suppose that v1, ..., vr are vectors in a vector space V over R. By a linear combination of the vectors v1, ..., vr we mean an expression of the type

c1v1 + ... + crvr, where c1, ..., cr ∈ R.

Example: In R², every vector (x, y) is a linear combination of the two vectors i = (1, 0) and j = (0, 1), as any (x, y) = xi + yj.

Example: In R^4, the vector (2, 6, 0, 9) is not a linear combination of the two vectors (1, 2, 0, 4) and (1, 1, 1, 3).

Mathematica demonstration: Linear combinations of curves in complex plane
http://demonstrations.wolfram.com/LinearCombinationsOfCurvesInComplexPlane/

Let us investigate the collection of all vectors in a vector space that can be represented as linear combinations of a given set of vectors in V.

Definition 27 (Span). Suppose that v1, ..., vr are vectors in a vector space V over R. The set

span{v1, ..., vr} = {c1v1 + ... + crvr : c1, ..., cr ∈ R}

is called the span of the vectors v1, ..., vr.

Example: The three vectors i = (1, 0, 0), j = (0, 1, 0), and k = (0, 0, 1) span R³.

Example: The two vectors (1, 2, 0, 4) and (1, 1, 1, 3) do not span R^4.

5.3 Linear Independence

Definition 28. Suppose that v1, ..., vr are vectors in a vector space V over R.
(LD) We say that v1, ..., vr are linearly dependent if there exist c1, ..., cr ∈ R, not all zero, such that c1v1 + ... + crvr = 0.
(LI) We say that v1, ..., vr are linearly independent if they are not linearly dependent; in other words, if the only solution of c1v1 + ... + crvr = 0 in c1, ..., cr ∈ R is given by c1 = ... = cr = 0.

5.4 Basis and Dimension

In this section, we complete the task of uniquely describing every element of a vector space V in terms of the elements of a suitable subset B.

Definition 29. Suppose that v1, ..., vr are vectors in a vector space V over R. We say that {v1, ..., vr} is a basis for V if the following two conditions are satisfied:
(B1) span{v1, ..., vr} = V
(B2) The vectors v1, ..., vr are linearly independent.

Proposition 3. Suppose that {v1, ..., vr} is a basis for a vector space V over R. Then every element u ∈ V can be expressed uniquely in the form u = c1v1 + ... + crvr, where c1, ..., cr ∈ R.

Example: In R^n, the vectors e1, ..., en, where

ej = (0, ..., 0, 1, 0, ..., 0), with j − 1 zeros before the 1 and n − j zeros after it, for every j = 1, ..., n,

are linearly independent and span R^n. Hence {e1, ..., en} is a basis for R^n. This is known as the standard basis for R^n.

Example: In the vector space M2,2(R) of all 2 × 2 matrices with entries in R, the set

{ [1 0; 0 0], [0 1; 0 0], [0 0; 1 0], [0 0; 0 1] }

is a basis.

Example: In the vector space Pk of polynomials of degree at most k and with coefficients in R, the set {1, x, x², ..., x^k} is a basis.

Mathematica demonstration: B-spline basis functions
http://demonstrations.wolfram.com/CalculatingAndPlottingBSplineBasisFunctions/

We have shown earlier that a vector space can have many bases. For example, any collection of three vectors not on the same plane is a basis for R³. In the following discussion, we attempt to find out some properties of bases.
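The unique-coordinates property of Proposition 3 can be illustrated computationally: expressing u in a basis amounts to solving a linear system. A minimal pure-Python sketch (not part of the notes; `solve` and the example basis are illustrative).

```python
def solve(A, b):
    """Solve the square system A c = b by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]              # partial pivoting
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# coordinates of u = (3, 5, 4) in the (non-standard) basis below
basis = [(1, 0, 1), (1, 1, 0), (0, 1, 1)]
A = [[basis[j][i] for j in range(3)] for i in range(3)]  # basis vectors as columns
c = solve(A, [3, 5, 4])                                  # -> [1.0, 2.0, 3.0]
```

Since the basis vectors are linearly independent, the system has exactly one solution, matching the uniqueness claim.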
However, we will restrict our discussion to the following simple case.

Definition 30. A vector space V over R is said to be finite-dimensional if it has a basis containing only finitely many elements.

Example: The vector spaces R^n, M2,2(R), and Pk that we have discussed earlier are all finite-dimensional.

Proposition 4. Suppose that V is a finite-dimensional vector space over R. Then any two bases for V have the same number of elements.

Definition 31. Suppose that V is a finite-dimensional vector space over R. Then we say that V is of dimension n if a basis for V contains exactly n elements.

Example: The vector space R^n has dimension n.
Example: The vector space M2,2(R) has dimension 4.
Example: The vector space Pk has dimension (k + 1).

Proposition 5. Suppose that V is an n-dimensional vector space over R. Then any set of n linearly independent vectors in V is a basis for V.

5.5 Vector Spaces Associated with Matrices

Consider the m × n matrix A with rows r1, ..., rm and columns c1, ..., cn, so that A can be written either as the rows r1, ..., rm stacked on top of one another or as the columns c1, ..., cn placed side by side. We also consider the system of homogeneous equations Ax = 0. We will now introduce three vector spaces that arise from the matrix A.

Definition 32. Suppose that A is an m × n matrix with entries in R.
(RS) The subspace span{r1, ..., rm} of R^n, where r1, ..., rm are the rows of the matrix A, is called the row space of A.
(CS) The subspace span{c1, ..., cn} of R^m, where c1, ..., cn are the columns of the matrix A, is called the column space of A.
(NS) The solution space of the system of homogeneous linear equations Ax = 0 is called the nullspace of A.

Our aim now is to find a basis for the row space of a given matrix A with entries in R. This task is made considerably easier by the following result.

Proposition 6. Suppose that the matrix B can be obtained from the matrix A by elementary row operations. Then the row space of B is identical to the row space of A.
To find a basis for the row space of A, we can now reduce A to row echelon form and consider the non-zero rows that result from this reduction. It is easily seen that these non-zero rows are linearly independent.

Example: Let

A =
[ 1 3 5 1  5 ]
[ 1 4 7 3 −2 ]
[ 1 5 9 5 −9 ]
[ 0 3 6 2 −1 ]

The matrix A can be reduced to row echelon form as

[ 1 3 5 1  5 ]
[ 0 1 2 2 −7 ]
[ 0 0 0 1 −5 ]
[ 0 0 0 0  0 ]

It follows that the non-zero rows v1 = (1, 3, 5, 1, 5), v2 = (0, 1, 2, 2, −7), and v3 = (0, 0, 0, 1, −5) form a basis for the row space of A. A basis for the column space of A can be found as a basis for the row space of A′ (the transpose of A).

Proposition 7. For any matrix A with entries in R, the dimension of the row space is the same as the dimension of the column space.

Definition 33. The rank of a matrix A, denoted by rank(A), is equal to the common value of the dimension of its row space and the dimension of its column space.

Reference: Harrison and Waldron (2011), chapter 5

6 Linear Algebra in Vector Spaces

6.1 Linear Transformations

Definition 34. Let V and W be vector spaces over R. The function T : V → W is called a linear transformation if the following two conditions are satisfied:
(LT1) For every u, v ∈ V, we have T(u + v) = T(u) + T(v).
(LT2) For every u ∈ V and c ∈ R, we have T(cu) = cT(u).

Definition 35. A linear transformation T : R^n → R^m is said to be a linear operator if n = m. In this case, we say that T is a linear operator on R^n.

6.2 Inner Product Spaces

Definition 36. Suppose that V is a vector space over R. By a real inner product on V, we mean a function ⟨·,·⟩ : V × V → R which satisfies the following conditions:
(IP1) For every u, v ∈ V, we have ⟨u, v⟩ = ⟨v, u⟩.
(IP2) For every u, v, w ∈ V, we have ⟨u, v + w⟩ = ⟨u, v⟩ + ⟨u, w⟩.
(IP3) For every u, v ∈ V and c ∈ R, we have c⟨u, v⟩ = ⟨cu, v⟩.
(IP4) For every u ∈ V, we have ⟨u, u⟩ ≥ 0, and ⟨u, u⟩ = 0 if and only if u = 0.
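The conditions (IP1)-(IP4) can be spot-checked numerically for a concrete inner product, here the Euclidean dot product on R^4. A minimal Python sketch (not part of the notes); a finite spot-check illustrates the axioms but does not prove the "only if" part of (IP4).

```python
import random

def ip(u, v):
    # the Euclidean dot product, one concrete real inner product on R^n
    return sum(a * b for a, b in zip(u, v))

random.seed(1)
rand_vec = lambda: [random.uniform(-5, 5) for _ in range(4)]
u, v, w = rand_vec(), rand_vec(), rand_vec()
c = 2.5
add = lambda x, y: [a + b for a, b in zip(x, y)]
scale = lambda s, x: [s * a for a in x]

assert abs(ip(u, v) - ip(v, u)) < 1e-9                       # IP1 symmetry
assert abs(ip(u, add(v, w)) - (ip(u, v) + ip(u, w))) < 1e-9  # IP2 additivity
assert abs(c * ip(u, v) - ip(scale(c, u), v)) < 1e-9         # IP3 homogeneity
assert ip(u, u) >= 0 and ip([0] * 4, [0] * 4) == 0           # IP4 positivity
```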
The properties (IP1)-(IP4) describe, respectively, symmetry, additivity, homogeneity, and positivity. The inner product is thus a whole class of operations which satisfy these properties. An inner product ⟨·,·⟩ always defines a norm by the formula

‖u‖ = √⟨u, u⟩.

We can check that in such a case all the conditions of a norm are satisfied. However, the converse is not true; that is, not every norm gives rise to an inner product, as the requirements for defining a norm are weaker than those for defining an inner product. While norms are used to evaluate the "size" of individual elements or the "distance" between pairs of elements in vector spaces, inner products are used for transformations of pairs of elements.

Definition 37. A vector space over R with an inner product defined on it is called a real inner product space. A real inner product space is a normed vector space, but the converse does not hold.

Definition 38. A complete inner product space is called a Hilbert space.

Important special cases of a Hilbert space include:
- the n-dimensional Euclidean space R^n,
- L2, the set of all functions f : R → R with finite integral of f² over R,
- the set of all k × k matrices with real elements,
- the set of all polynomials of degree at most k.

Example: Consider the vector space M2,2(R) of all 2 × 2 matrices with entries in R. For matrices

U = [u11 u12; u21 u22], V = [v11 v12; v21 v22]

in M2,2(R), let

⟨U, V⟩ = u11v11 + u21v21 + u12v12 + u22v22.

It is easy to check that conditions (IP1)-(IP4) are satisfied.

Example: Consider the vector space P2 of all polynomials with real coefficients and of degree at most 2. For polynomials p = p(x) = p0 + p1x + p2x² and q = q(x) = q0 + q1x + q2x² in P2, let

⟨p, q⟩ = p0q0 + p1q1 + p2q2.

It can be checked that conditions (IP1)-(IP4) are satisfied.
Example: An important special case of a vector space over R is C[a, b], the collection of all real-valued functions continuous on the closed interval [a, b]. For f, g ∈ C[a, b], let

⟨f, g⟩ = ∫_a^b f(x)g(x) dx.

It can be checked that conditions (IP1)-(IP4) are satisfied.

6.3 Orthogonal Bases

Using the inner product, we now add the extra ingredient of orthogonality to the above discussion of bases. Suppose that V is a finite-dimensional real inner product space. A basis {v1, ..., vr} of V is said to be an orthogonal basis of V if ⟨vi, vj⟩ = 0 for every i, j = 1, ..., r satisfying i ≠ j. It is said to be an orthonormal basis if it satisfies the extra condition that ‖vi‖ = 1 for every i = 1, ..., r.

Example: The standard basis {e1, ..., en}, where ej has a 1 in the j-th position and zeros elsewhere for every j = 1, ..., n, is an orthonormal basis of R^n with respect to the Euclidean inner product.

Mathematica demonstration: Orthogonal functions
http://demonstrations.wolfram.com/OrthogonalityOfTwoFunctionsWithWeightedInnerProducts/

6.4 Projections

We can use the machinery of orthogonal vectors to provide a solution to a very practical problem.

The Projection Problem: Given a finite-dimensional subspace V of a real inner product space W, together with a vector b ∈ W, find the vector v ∈ V which is closest to b in the sense that ‖b − v‖² is minimized.

Note that the quantity ‖b − v‖² is minimized exactly when ‖b − v‖ is minimized. The square term has the virtue of avoiding square roots in the calculation. To solve the projection problem, we need the following key concept:

Definition 39.
Let v1, ..., vn be an orthogonal basis for the subspace V of the inner product space W. For any b ∈ W, the (parallel) projection of b onto the subspace V is the vector

proj_V b = (⟨v1, b⟩/⟨v1, v1⟩) v1 + (⟨v2, b⟩/⟨v2, v2⟩) v2 + ... + (⟨vn, b⟩/⟨vn, vn⟩) vn.

It may appear that proj_V b depends on the basis vectors v1, ..., vn, but this is not the case.

Theorem 40. Let v1, ..., vn be an orthogonal basis for the subspace V of the inner product space W. For any b ∈ W, the vector v* = proj_V b is the unique vector in V that minimizes ‖b − v‖².

As a special case, we obtain the least-squares projection

ŷ = Py = proj_V y.

Mathematica demonstration: Projection
http://demonstrations.wolfram.com/ThreeOrthogonalProjectionsOfPolyhedra/

6.5 Euclidean Spaces

An important special case of a real inner product space is the n-dimensional Euclidean space R^n.

Definition 41. Suppose that u = (u1, ..., un) and v = (v1, ..., vn) are vectors in R^n. The Euclidean dot product of u and v is defined by

u · v = u1v1 + ... + unvn.

The Euclidean norm of u is defined by

‖u‖ = (u · u)^(1/2) = (u1² + ... + un²)^(1/2),

and the Euclidean distance between u and v is defined by

‖u − v‖ = ((u1 − v1)² + ... + (un − vn)²)^(1/2).

The dot product is a special case of the inner product. We can define an inner product of two vectors in R^n in a number of ways. Examples include

⟨u, v⟩ = u · v or ⟨u, v⟩ = Au · Av.

In R² and R³, we say that two non-zero vectors are perpendicular if their dot product is zero. We now generalize this idea to vectors in R^n.

Definition 42. Two vectors u, v ∈ R^n are said to be orthogonal if u · v = 0.

Example: Suppose that u, v ∈ R^n are orthogonal. Then

‖u + v‖² = (u + v) · (u + v) = u · u + 2u · v + v · v = ‖u‖² + ‖v‖².

This is an extension of Pythagoras' theorem.
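The projection formula of Definition 39 is easy to compute directly. A minimal Python sketch (not part of the notes; the plane V and the vector b are illustrative), which also verifies the defining property that the residual b − proj_V b is orthogonal to V.

```python
def ip(u, v):
    # Euclidean inner product on R^n
    return sum(a * b for a, b in zip(u, v))

def proj(b, basis):
    """Project b onto span(basis); the basis vectors must be mutually orthogonal."""
    coeffs = [ip(v, b) / ip(v, v) for v in basis]
    return [sum(c * v[i] for c, v in zip(coeffs, basis)) for i in range(len(b))]

# an orthogonal basis of a plane V in R^3, and a vector b outside V
v1, v2 = (1, 0, 1), (1, 0, -1)   # ip(v1, v2) = 0
b = (3, 4, 5)
p = proj(b, [v1, v2])            # -> [3.0, 0.0, 5.0]
residual = [bi - pi for bi, pi in zip(b, p)]
# the residual b - proj_V(b) is orthogonal to every basis vector of V
assert abs(ip(residual, v1)) < 1e-9 and abs(ip(residual, v2)) < 1e-9
```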
6.6 Separating Hyperplane Theorem

The Separating Hyperplane Theorem is a result concerning disjoint convex sets in n-dimensional Euclidean space. In economic analysis, the Theorem is used in proving, e.g., the Second Fundamental Theorem of Welfare Economics and the existence of a competitive equilibrium for aggregate excess demand correspondences.

Definition 43. A set S ⊆ R^n is convex if the following condition holds: whenever A, B ∈ S are points in S, the line segment [A, B] is contained entirely in S.

(Figure: a convex set and a non-convex set in R².)

Definition 44. A hyperplane of R^n is its subspace of dimension n − 1.

Theorem 45 (Separating Hyperplane Theorem). Let A and B be two disjoint non-empty convex subsets of R^n. Then there exist a non-zero vector v and a real number c such that

⟨x, v⟩ ≥ c and ⟨y, v⟩ ≤ c

for all x ∈ A and y ∈ B.

(Figure: convexity (left) and violation of convexity (right) in R².)

Reference: Harrison and Waldron (2011), chapter 6 and section 7.6

7 Correspondences

Recall the definition of a function from a set A to R^K.

Definition 46. Given a set A ⊆ R^N, a function g : A → R^K is a rule that assigns an element g(x) ∈ R^K to every x ∈ A.

A correspondence generalizes the concept of a function by mapping into sets.

Definition 47 (M.H.1). Given a set A ⊆ R^N, a correspondence f : A → R^K is a rule that assigns a set f(x) ⊆ R^K to every x ∈ A.

(Figure: example of a correspondence.)

7.1 Continuity of Correspondences

Intuitively, a correspondence is continuous if small changes in x produce small changes in the set f(x). The formal definition of continuity for correspondences is more involved, though. It encompasses two different concepts:

- Upper hemicontinuity
- Lower hemicontinuity

Definition 48. Given A ⊆ R^N and Y ⊆ R^K, the graph of the correspondence f : A → Y is the set

{(x, y) ∈ A × Y : y ∈ f(x)}.

Definition 49 (M.H.2). Given A ⊆ R^N and the closed set Y ⊆ R^K, the correspondence f : A →
Y has a closed graph if for any two sequences x^m → x ∈ A and y^m → y, with x^m ∈ A and y^m ∈ f(x^m) for every m, we have y ∈ f(x).

Definition 50 (M.H.3). Given A ⊆ R^N and the closed set Y ⊆ R^K, the correspondence f : A → Y is upper hemicontinuous (uhc) if it has a closed graph and the images of compact sets are bounded; that is, for every compact set B ⊆ A the set f(B) = {y ∈ Y : y ∈ f(x) for some x ∈ B} is bounded.

Heuristically, a correspondence is upper hemicontinuous when the following holds: if a convergent sequence of points in the domain maps to a sequence of sets in the range that contains a convergent sequence of points, then the image of the limit point in the domain must contain the limit of that sequence in the range.

(Figure: an upper hemicontinuous correspondence that is not lower hemicontinuous.)

Definition 51 (M.H.4). Given A ⊆ R^N and a compact set Y ⊆ R^K, the correspondence f : A → Y is lower hemicontinuous (lhc) if for every sequence x^m → x ∈ A with x^m ∈ A for all m, and every y ∈ f(x), we can find a sequence y^m → y and an integer M such that y^m ∈ f(x^m) for m > M.

Lower hemicontinuity captures the idea that if a sequence in the domain converges, then for any given point in the image of the limit we can find a sequence in the images that converges to the given point.

(Figure: a lower hemicontinuous correspondence that is not upper hemicontinuous.)

7.2 Kakutani's Fixed Point Theorem

Theorem 52 (M.I.2). Suppose that A ⊆ R^N is a non-empty, compact, convex set, and that f : A → A is an upper hemicontinuous correspondence from A into itself with the property that the set f(x) ⊆ A is non-empty and convex for every x ∈ A. Then f(·) has a fixed point; that is, there is an x ∈ A such that x ∈ f(x).

Kakutani's Fixed Point Theorem generalizes the Brouwer Fixed Point Theorem.
Areas of application include: the proof of existence of a Nash equilibrium in strategic games, and the proof of existence of a Walrasian competitive equilibrium.

Illustration of Kakutani's Fixed Point Theorem: [figure omitted]

Reference: Mas-Colell, Whinston and Green (1995), Appendix M.H.

8 Continuity

8.1 Convexity

Theorem 53. Let A = (a₁, …, aₙ) and B = (b₁, …, bₙ) be any two points in S ⊆ R^n, and let a, b be the corresponding column vectors. Then a function f : R^n → R is:

(i) convex iff f((1 − t)a + tb) ≤ (1 − t)f(a) + tf(b) for all a, b ∈ S and all t ∈ [0, 1];
(ii) concave iff f((1 − t)a + tb) ≥ (1 − t)f(a) + tf(b) for all a, b ∈ S and all t ∈ [0, 1].

When ≤ or ≥ is replaced by < or >, respectively, these become strictly convex or strictly concave.

Example of a convex function f(x, y) = x² + y² of two variables: [figure omitted]

A function f : R^n → R is quasiconcave iff f((1 − t)a + tb) ≥ min{f(a), f(b)} for all a, b ∈ S and all t ∈ [0, 1]. A function f : R^n → R is quasiconvex iff −f is quasiconcave. Every concave function on R^n is quasiconcave, but the converse does not hold.

8.2 The Maximum Theorem

Let T be any metric space. Recall that we denote by C(T) the set of all continuous functions defined on T, by B(T) the set of all bounded functions defined on T, and by CB(T) the set of all continuous and bounded functions on T. When T is compact, CB(T) = C(T), and we will consider C(T) as metrized by the sup-metric d∞. In line with the reference source for this section (Ok, 2007), we will denote a correspondence by the symbol Γ.

Let X denote a metric space of "decision variables" and let Θ denote a metric space of "parameters". Our goal is to determine how the maximum value and the maximizing vectors in a constrained maximization problem involving X depend on the exogenous parameters θ ∈ Θ.
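The defining inequality of convexity in Theorem 53 can be spot-checked numerically; a minimal sketch for the convex function f(x, y) = x² + y² discussed above, assuming numpy is available:

```python
import numpy as np

rng = np.random.default_rng(1)

# f(x, y) = x^2 + y^2, the convex function from the example above.
def f(p):
    return p[0] ** 2 + p[1] ** 2

# Verify f((1-t)a + t*b) <= (1-t)f(a) + t*f(b) on random points and weights.
for _ in range(1000):
    a, b = rng.normal(size=2), rng.normal(size=2)
    t = rng.uniform()
    lhs = f((1 - t) * a + t * b)
    rhs = (1 - t) * f(a) + t * f(b)
    assert lhs <= rhs + 1e-12

print("convexity inequality holds at all sampled points")
```

A sampling check of this kind can refute convexity but never prove it; here it simply illustrates the definition.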
Let φ denote the "objective function", defined as a map on X × Θ, and let Γ(θ) be a non-empty subset of X, representing a constraint on x. A general optimization problem is then defined as

max_{x ∈ Γ(θ)} φ(x, θ)

whereby the constraint Γ(θ) depends on the given parameter θ. For each θ ∈ Θ, let φ*(θ) denote the maximized value of the objective function, and let σ(θ) denote the maximal choice, as functions of the parameter θ.

The Maximum Theorem tells us that when solving a constrained optimization problem, if the objective function is continuous and the correspondence defining the constraint set is continuous, compact-valued, and non-empty, then the problem has a solution and:

(a) the correspondence σ(θ) defining the optimal choice set is upper hemicontinuous (if it is a function, it is a continuous function);
(b) the optimized function φ*(θ) is continuous in the parameters.

The Theorem also appears in the literature under the name "Theorem of the Maximum" or "Berge's Theorem".

Theorem 54 (The Maximum Theorem). Let Θ and X be two metric spaces, Γ : Θ → X a compact-valued correspondence, and φ ∈ C(X × Θ). Define

σ(θ) := arg max{φ(x, θ) : x ∈ Γ(θ)} for all θ ∈ Θ

and

φ*(θ) := max{φ(x, θ) : x ∈ Γ(θ)} for all θ ∈ Θ,

and assume that Γ is continuous at θ ∈ Θ. Then:

(a) σ : Θ → X is compact-valued, upper hemicontinuous, and closed at θ;
(b) φ* : Θ → R is continuous at θ.

8.2.1 Application 1: The Demand Correspondence

Consider an agent whose income is ι > 0 and whose utility function over n-vectors of commodity bundles is u : R^n₊ → R. The standard choice problem of this consumer is

max u(x) s.t.
x ∈ B(p, ι), where p ∈ R^n₊ denotes the price vector in the economy, and B : R^{n+1}₊ → R^n is the budget correspondence of the consumer:

B(p, ι) := {x ∈ R^n₊ : p · x ≤ ι}.

The optimum choice of the individual is conditional on the parameter vector (p, ι), and given by the demand correspondence d : R^{n+1}₊ → R^n defined by

d(p, ι) := arg max{u(x) : x ∈ B(p, ι)}.

Since B is continuous, it follows from the Maximum Theorem that d is compact-valued, closed, and upper hemicontinuous. Moreover, the indirect utility function u* : R^{n+1}₊ → R defined by

u*(p, ι) := max{u(x) : x ∈ B(p, ι)}

is continuous.

8.2.2 Application 2: Strategic Games

A strategic game models strategic interaction of a group of m players for m ≥ 2. Denote the set of players by {1, …, m}. Denote by X_i the action space of player i, defined as a nonempty set which contains all actions that are available to player i = 1, …, m. The outcome of the game is obtained when all players choose an action. Hence, an outcome is any m-vector (x₁, …, x_m) ∈ X₁ × X₂ × ⋯ × X_m = X, where X is called the outcome space of the game.

In a strategic game, the payoffs of a player depend not only on her own actions but also on the actions taken by other players. The payoff function of player i, denoted π_i, is thus defined on the entire outcome space: π_i : X → R. Each player i is assumed to have a complete preference relation on X which is represented by π_i. Formally, an m-person strategic game is then defined as a set G := {(X₁, π₁), …, (X_m, π_m)}.

We say that a strategic game G is finite if each action space X_i is finite. A well-known example of a finite strategic game is the prisoners' dilemma. The game is described by means of a payoff matrix of the type

              quiet        confess
  quiet      −1, −1        −6, 0
  confess     0, −6        −5, −5

Both players have strong incentives to play the game non-cooperatively by choosing confess, so the decentralized behavior dictates the outcome (confess, confess). However, at the cooperative outcome (quiet, quiet) both players are strictly better off.
Thus the prisoners' dilemma illustrates the fact that decentralized individualistic behavior need not lead to a socially optimal outcome in strategic situations.

For any strategic game G, let

X_{-i} := {(ω¹, …, ω^{m−1}) : ω^j ∈ X_j for j < i and ω^{j−1} ∈ X_j for j > i}

be the collection of all action profiles of all the players except player i. Denote by (a, x_{-i}) the outcome x ∈ X where the action taken by player i is a and the actions taken by the other players are x_{-i} ∈ X_{-i}.

Definition 55 (Nash Equilibrium). Let G := {(X_i, π_i)_{i=1,…,m}} be a strategic game. We say that an outcome x* ∈ X is a Nash equilibrium if

x*_i ∈ arg max{π_i(x_i, x*_{-i}) : x_i ∈ X_i} for all i = 1, …, m.

At a NE there is no incentive for any player to change her action, given the actions of the other players at this outcome. A Nash equilibrium x* is said to be symmetric if x*₁ = ⋯ = x*_m. We denote the set of all NE and symmetric NE of a game G by NE(G) and NE_sym(G), respectively.

If each X_i is a nonempty compact subset of a Euclidean space, then we say that the strategic game G is a compact Euclidean game. If, in addition, π_i ∈ C(X) for each i = 1, …, m, then we say that G is a continuous and compact Euclidean game. If, instead, each X_i is convex and compact, and each π_i(·, x_{-i}) is quasiconcave for any given x_{-i} ∈ X_{-i}, then G is called a convex and compact Euclidean game. Finally, a compact Euclidean game which is both convex and continuous is called a regular Euclidean game.

Theorem 56 (Nash's Existence Theorem).
If G := {(X_i, π_i)_{i=1,…,m}} is a regular Euclidean game, then NE(G) ≠ ∅.

To prove the Theorem, for each i = 1, …, m define the best response correspondence b_i : X_{-i} → X_i by

b_i(x_{-i}) := arg max{π_i(x_i, x_{-i}) : x_i ∈ X_i}

and the correspondence b : X → X by

b(x) := b₁(x_{-1}) × ⋯ × b_m(x_{-m}).

The proof proceeds by showing that Kakutani's Fixed Point Theorem applies to b. Since each X_i is compact and convex, so is X. To show that b is convex-valued, fix an arbitrary x ∈ X and 0 ≤ λ ≤ 1 and note that, for any y, z ∈ b(x), we have

π_i(y_i, x_{-i}) = π_i(z_i, x_{-i}) ≥ π_i(w_i, x_{-i})

for any w_i ∈ X_i. Thus by using the quasiconcavity of π_i(·, x_{-i}) on X_i we find

π_i(λy_i + (1 − λ)z_i, x_{-i}) ≥ π_i(w_i, x_{-i}) for any w_i ∈ X_i,

that is, λy_i + (1 − λ)z_i ∈ b_i(x_{-i}). Since this holds for each i, we conclude that λy + (1 − λ)z ∈ b(x), and hence b is convex-valued. Furthermore, by the Maximum Theorem, b is upper hemicontinuous. QED.

8.3 Semicontinuity

Continuity is a sufficient but not necessary condition for a function to attain a maximum (or minimum) over a compact set. Other sufficient conditions rely on further continuity concepts; we will now introduce some of them.

Let X be any metric space, and φ ∈ R^X. We say that φ is upper semicontinuous at x ∈ X if, for any ε > 0, there exists δ > 0 such that

d(x, y) < δ implies φ(y) ≤ φ(x) + ε for all y ∈ X,

and φ is lower semicontinuous at x ∈ X if, for any ε > 0, there exists δ > 0 such that

d(x, y) < δ implies φ(y) ≥ φ(x) − ε for all y ∈ X.

The function φ is said to be upper (lower) semicontinuous if it is upper (lower) semicontinuous at each x ∈ X.

8.3.1 Baire Theorem

Semicontinuity leads to the following generalization of the Weierstrass Theorem.

Theorem 57 (Baire). Let X be a compact metric space, and φ ∈ R^X. If φ is upper semicontinuous, then there exists x ∈ X with φ(x) = sup φ(X). If φ is lower semicontinuous, then there exists x ∈ X with φ(x) = inf φ(X).

Example: Consider the function g(t) := f(t) + log(1 + t), where f ∈ R^[0,2] is defined as

f(t) = t² − 2t for 0 ≤ t < 1,
f(t) = 2t − t² for 1 ≤ t ≤ 2.

Does g attain a maximum or minimum?
Since the function is discontinuous at t = 1, the Weierstrass Theorem is not applicable. Instead, we observe that it is upper semicontinuous on [0, 2], and hence by the Baire Theorem its maximum exists.

[figure omitted: graph of g(t) = f(t) + log(1 + t) on [0, 2]]

8.3.2 Caristi's Fixed Point Theorem

The following Theorem generalizes the Banach fixed point theorem by allowing for discontinuous maps.

Theorem 58 (Caristi). Let Φ be a self-map on a complete metric space X. If

d(x, Φ(x)) ≤ φ(x) − φ(Φ(x)) for all x ∈ X

for some lower semicontinuous φ ∈ R^X that is bounded from below, then Φ has a fixed point in X.

Reference: Ok, pp. 229, 234, 238

Part II: Optimization

9 Constrained Optimization

9.1 Equality Constraints

Let f(x) be a function in n variables, and let

g₁(x) = b₁, g₂(x) = b₂, …, g_m(x) = b_m    (3)

be m equality constraints, given by functions g₁, …, g_m and constants b₁, …, b_m. The problem of finding the maximum or minimum of f(x) when x satisfies the constraints (3) is called the Lagrange problem with objective function f and constraint functions g₁, …, g_m. We will describe a general method for solving Lagrange problems.

Definition 59. The Lagrangian or Lagrange function is the function

L(x, λ) = f(x) − λ₁(g₁(x) − b₁) − λ₂(g₂(x) − b₂) − ⋯ − λ_m(g_m(x) − b_m)

in the n + m variables x₁, …, xₙ, λ₁, …, λ_m. The variables λ₁, …, λ_m are called Lagrange multipliers.

To solve the Lagrange problem, we solve the system of equations consisting of the n first order conditions

∂L(x, λ)/∂x₁ = 0, ∂L(x, λ)/∂x₂ = 0, …, ∂L(x, λ)/∂xₙ = 0

together with the m equality constraints (3). The solutions give us candidates for optimality.

Example: Find the candidates for optimality for the following Lagrange problem: maximize/minimize the function f(x₁, x₂) = x₁ + 3x₂ subject to the constraint x₁² + x₂² = 10.
Solution: To formulate this problem as a standard Lagrange problem, let g(x₁, x₂) = x₁² + x₂² be the constraint function and let b = 10. We form the Lagrangian

L(x, λ) = f(x) − λ(g(x) − b) = x₁ + 3x₂ − λ(x₁² + x₂² − 10).

The FOCs become

∂L(x, λ)/∂x₁ = 1 − 2λx₁ = 0 ⟹ x₁ = 1/(2λ)
∂L(x, λ)/∂x₂ = 3 − 2λx₂ = 0 ⟹ x₂ = 3/(2λ)
∂L(x, λ)/∂λ = −(x₁² + x₂² − 10) = 0 ⟹ x₁² + x₂² = 10.

Solving this system of equations, we obtain the candidates for optimality (x₁, x₂, λ) = (1, 3, 1/2) and (x₁, x₂, λ) = (−1, −3, −1/2). Since f(1, 3) = 10 and f(−1, −3) = −10, if there is a maximum point then it must be (1, 3), and if there is a minimum point then it must be (−1, −3).

Definition 60 (NDCQ). The non-degenerate constraint qualification (or NDCQ) condition is satisfied if

rank | ∂g₁(x)/∂x₁  ∂g₁(x)/∂x₂  ⋯  ∂g₁(x)/∂xₙ |
     | ∂g₂(x)/∂x₁  ∂g₂(x)/∂x₂  ⋯  ∂g₂(x)/∂xₙ |
     |      ⋮            ⋮               ⋮     |
     | ∂gₘ(x)/∂x₁  ∂gₘ(x)/∂x₂  ⋯  ∂gₘ(x)/∂xₙ |  = m.

The NDCQ says that the constraints are independent at x.

Theorem 61. Let the functions f, g₁, …, g_m in a Lagrange problem be defined on a subset S ⊆ R^n. If x* is a point in the interior of S that solves the Lagrange problem and satisfies the NDCQ condition, then there exist unique numbers λ₁, …, λ_m such that x*, λ₁, …, λ_m satisfy the first order conditions.

This theorem implies that any solution of the Lagrange problem can be extended to a solution of the first order conditions. This means that if we solve the system of equations that includes the first order conditions and the constraints, we obtain candidates for optimality.

Mathematica demonstration: Lagrange multipliers in 2D
http://demonstrations.wolfram.com/LagrangeMultipliersInTwoDimensions/

9.2 Extreme Value Theorem

Theorem 62 (Extreme Value Theorem). Let f(x) be a continuous function defined on a closed and bounded subset S ⊆ R^n. Then f has a maximum point and a minimum point in S.
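The candidate search in the Lagrange example above can also be carried out symbolically; a quick sketch, assuming sympy is available:

```python
import sympy as sp

x1, x2, lam = sp.symbols("x1 x2 lam", real=True)

# Lagrangian of the example: f = x1 + 3*x2, constraint x1^2 + x2^2 = 10.
L = x1 + 3*x2 - lam * (x1**2 + x2**2 - 10)

# First order conditions (the derivative in lam reproduces the constraint).
foc = [sp.diff(L, v) for v in (x1, x2, lam)]
candidates = sp.solve(foc, (x1, x2, lam))

sols = {tuple(s) for s in candidates}
assert (1, 3, sp.Rational(1, 2)) in sols
assert (-1, -3, sp.Rational(-1, 2)) in sols
print("candidates:", candidates)
```

This recovers exactly the two candidates (1, 3, 1/2) and (−1, −3, −1/2) found by hand.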
Theorem 63 (Sufficient Conditions for EVT). Let x* be a point satisfying the constraints, and assume that there exist λ₁, …, λ_m such that (x₁*, …, xₙ*, λ₁, …, λ_m) satisfy the first order conditions. Then:

(i) If L(x, λ) is convex in x (with λᵢ = λᵢ* fixed), then x* is a solution to the minimum Lagrange problem.
(ii) If L(x, λ) is concave as a function in x (with λᵢ = λᵢ* fixed), then x* is a solution to the maximum Lagrange problem.

The Lagrangian: an example (cont.): Returning to our earlier example: maximize/minimize the function f(x₁, x₂) = x₁ + 3x₂ subject to the constraint x₁² + x₂² = 10.

Solution: The Lagrangian for this problem is

L(x, λ) = x₁ + 3x₂ − λ(x₁² + x₂² − 10).

We found one solution to the FOCs as (x₁, x₂, λ) = (1, 3, 1/2), and when we fix λ = 1/2 we obtain the Lagrangian

L(x, 1/2) = x₁ + 3x₂ − (1/2)(x₁² + x₂²) + 5.

This function is concave in x, and hence (x₁, x₂) = (1, 3) is a maximum point. Similarly, (x₁, x₂) = (−1, −3) is a minimum point, since L(x, λ) is convex in x when we fix λ = −1/2.

9.3 Inequality Constraints

Consider the following optimization problem with inequality constraints:

max f(x) subject to
g₁(x) ≤ b₁
g₂(x) ≤ b₂
⋮
g_m(x) ≤ b_m

To solve this problem, we can use a variation of the Lagrange method. Just as in the case of equality constraints, the Lagrangian is given by

L(x, λ) = f(x) − λ₁(g₁(x) − b₁) − ⋯ − λ_m(g_m(x) − b_m).

In the case of inequality constraints, we solve the Kuhn-Tucker conditions, which consist of the first order conditions

∂L(x, λ)/∂x₁ = 0, ∂L(x, λ)/∂x₂ = 0, …, ∂L(x, λ)/∂xₙ = 0

and the complementary slackness conditions given by

λⱼ ≥ 0 for j = 1, 2, …, m
λⱼ = 0 whenever gⱼ(x) < bⱼ.

When we solve the Kuhn-Tucker conditions together with the inequality constraints we obtain candidates for a maximum (consider −f(x) for a minimum).

Necessary condition:

Theorem 64.
Assume that x* solves the optimization problem with inequality constraints. If the NDCQ condition holds at x*, then there are unique Lagrange multipliers λ* such that (x*, λ*) satisfy the Kuhn-Tucker conditions.

Given a point x* satisfying the constraints, the NDCQ condition holds if the rows of the matrix

| ∂g₁(x*)/∂x₁  ∂g₁(x*)/∂x₂  ⋯  ∂g₁(x*)/∂xₙ |
| ∂g₂(x*)/∂x₁  ∂g₂(x*)/∂x₂  ⋯  ∂g₂(x*)/∂xₙ |
|      ⋮             ⋮                ⋮      |
| ∂gₘ(x*)/∂x₁  ∂gₘ(x*)/∂x₂  ⋯  ∂gₘ(x*)/∂xₙ |

corresponding to the binding constraints, i.e., those with gⱼ(x*) = bⱼ, are linearly independent.

An example: Maximize the function f(x₁, x₂) = x₁² + x₂² + x₂ − 1 subject to g(x₁, x₂) = x₁² + x₂² ≤ 1.

Solution: The Lagrangian is

L = x₁² + x₂² + x₂ − 1 − λ(x₁² + x₂² − 1).

The FOCs are

2x₁ − 2λx₁ = 0 ⟹ 2x₁(1 − λ) = 0    (4)
2x₂ + 1 − 2λx₂ = 0 ⟹ 2x₂(1 − λ) = −1    (5)

From (4) we get x₁ = 0 or λ = 1, but λ = 1 is not possible by the second equation, and hence x₁ = 0. From (5) we get x₂ = −1/(2(1 − λ)), since λ ≠ 1.

The complementary slackness conditions are λ ≥ 0 and λ = 0 if x₁² + x₂² < 1. There are two cases:

Case 1: x₁² + x₂² < 1 and λ = 0. In this case, x₂ = −1/2 from (5), and this satisfies the inequality. Hence, the point (x₁, x₂, λ) = (0, −1/2, 0) is a candidate for a maximum.

Case 2: x₁² + x₂² = 1 and λ ≥ 0. Since x₁ = 0 from (4), we obtain x₂ = ±1. We solve for λ in each case and check that λ ≥ 0. Thus,

f(0, −1/2) = −1.25, f(0, 1) = 1, f(0, −1) = −1.

By the extreme value theorem, the function f has a maximum on the closed and bounded set given by x₁² + x₂² ≤ 1 (a circular disk with radius one), and therefore (x₁, x₂) = (0, 1) is a maximum point.

9.4 Application: The Consumer Problem

Define the consumer problem (CP) as:

max_{x ∈ R^n₊} U(x) s.t. p′x ≤ w

where p is a vector of prices and w is the consumer's total wealth. Given p and w, specify the budget set as

B(p, w) = {x ∈ R^n₊ : p′x ≤ w}

and note that it is compact.

Theorem 65. If U(·) is continuous and p ≫ 0, then CP has a solution.
Proof: Continuous functions on compact sets achieve their maximum. □

Form the Lagrangian

L(x, λ, μ; p, w) = U(x) + λ(w − p′x) + Σᵢ₌₁ⁿ μᵢ xᵢ.

The Kuhn-Tucker conditions are:

∂U(x)/∂xₖ = λpₖ − μₖ    (6)
λ(w − p′x) = 0
λ ≥ 0, μᵢ ≥ 0 and μᵢ xᵢ = 0 for all i

plus the original constraints. We can use the Kuhn-Tucker conditions to characterize (so-called Marshallian) demand. Using the FOCs (6) we can write:

∂U(x)/∂xₖ ≤ λpₖ, with equality if xₖ > 0.

Hence, for all goods xⱼ and xₖ consumed in positive quantity, their Marginal Rate of Substitution equals the ratio of their prices:

MRSⱼₖ = (∂U(x)/∂xⱼ) / (∂U(x)/∂xₖ) = pⱼ / pₖ.

Define the indirect utility function as:

v(p, w) = max U(x) s.t. p′x ≤ w,

so that v(p, w) is the value of the consumer problem at the optimum. Assume that U(x) is continuous and concave, and that an interior solution to the CP exists. Then v(p, w) is differentiable at (p, w), and by the Envelope Theorem

∂v(p, w)/∂w = λ ≥ 0.

The Lagrange multiplier λ gives the value (in terms of utility) of having an additional unit of wealth. Because of this, λ is sometimes called the shadow price of wealth or the marginal utility of wealth (or income).

9.5 Envelope Theorem for Unconstrained Optimization

In an optimization problem we often want to know how the value of the objective function will change if one or more of the parameter values changes. Let's consider a simple example: a price-taking firm choosing the profit-maximizing amount to produce of a single product. If the market price of its product changes, how much will the firm's profit increase or decrease? The optimization problem is

max_{q ∈ R₊} π(q, p) = pq − C(q)    (7)

where π(q, p) is the firm's profit, q is the number of units of the firm's product (a decision variable), p is the market price (a parameter), and C(q) is the firm's cost function, assumed convex and differentiable. The F.O.C.
for (7) is

∂π(q, p)/∂q = p − C′(q) = 0    (8)

yielding C′(q) = p: the firm chooses the output level q at which its marginal cost equals the market price, which the firm takes as a given parameter. Denote the solution to (7) by q*(p). The value function for (7) is

v(p) = π(q*(p), p) = p q*(p) − C(q*(p)).

Then,

v′(p) = ∂π/∂p + (∂π/∂q)(∂q*(p)/∂p) = q* + (∂π/∂q)(∂q*(p)/∂p)    (9)

which takes into account the firm's output response to a price change, ∂q*(p)/∂p. However, this term is multiplied by ∂π/∂q which, from (8), is 0 at the optimum. (9) thus becomes

v′(p) = q*

which implies that at the optimal solution the derivative of the value function v(p) with respect to a parameter equals only the partial derivative of the objective function π(q, p) with respect to that parameter. This result is summarized in the following theorem:

Theorem 66 (Envelope Theorem for Unconstrained Optimization). For the maximization problem max_{x ∈ R} f(x, θ), if f : R² → R is continuously differentiable and if the solution function x* : R → R is differentiable on an open set U ⊆ R, then the derivative of the value function satisfies

v′(θ) = ∂f/∂θ evaluated at x*(θ).

Here is the Envelope Theorem for n variables and m parameters:

Theorem 67 (Multivariate Envelope Theorem for Unconstrained Optimization). For the maximization problem max_{x ∈ R^n} f(x, θ), if f : R^n × R^m → R is continuously differentiable and if the solution function x* : R^m → R^n is differentiable on an open set U ⊆ R^m, then the partial derivatives of the value function satisfy

∂v(θ)/∂θᵢ = ∂f/∂θᵢ evaluated at x*(θ), for i = 1, …, m.

Proof. For each i = 1, …, m we have

∂v(θ)/∂θᵢ = (∂/∂θᵢ) f(x*(θ), θ) = ∂f/∂θᵢ + Σⱼ₌₁ⁿ (∂f/∂xⱼ)(∂xⱼ*(θ)/∂θᵢ) = ∂f/∂θᵢ

because ∂f/∂xⱼ = 0 at xⱼ*(θ) according to the optimization problem's F.O.C. □
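The firm example can be checked numerically; a sketch assuming scipy is available, with a hypothetical quadratic cost C(q) = q² so that the FOC C′(q) = p gives q*(p) = p/2 and v(p) = p²/4:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Firm problem (7) with the hypothetical cost C(q) = q^2.
def v(p):
    """Return the value function v(p) and the maximizer q*(p)."""
    res = minimize_scalar(lambda q: -(p * q - q**2),
                          bounds=(0, 100), method="bounded")
    return -res.fun, res.x

p = 6.0
value, q_star = v(p)

# Envelope theorem: v'(p) = q*(p); check by central finite differences.
h = 1e-5
v_prime = (v(p + h)[0] - v(p - h)[0]) / (2 * h)

assert abs(q_star - p / 2) < 1e-4     # q*(6) = 3
assert abs(v_prime - q_star) < 1e-3   # v'(p) equals q*, not the total derivative
print(f"q* = {q_star:.4f}, numerical v'(p) = {v_prime:.4f}")
```

The finite-difference slope of the value function matches q*(p) even though q* itself moves with p, which is exactly the content of (9).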
9.6 Envelope Theorem for Constrained Optimization

Now let's add a constraint to the multivariate maximization problem:

max_{x ∈ R^n} f(x, θ) subject to g(x, θ) ≤ 0    (10)

where both f and g are continuously differentiable and we assume that the solution function x*(θ) exists and is differentiable on an open set U ⊆ R^n. The F.O.C. for (10) is

∇f(x, θ) = λ∇g(x, θ), i.e. ∂f(x, θ)/∂xⱼ = λ ∂g(x, θ)/∂xⱼ for j = 1, …, n.    (11)

Let's assume that the constraint is binding and λ > 0. Then the solution function x*(θ) satisfies the constraint with equality: g(x*(θ), θ) = 0 for all θ such that (x*(θ), θ) ∈ U. We therefore have the following proposition, which will be instrumental in the proof of the Envelope Theorem.

Proposition 8. If g : R^n × R → R is continuously differentiable and x*(θ) : R → R^n satisfies g(x*(θ), θ) = 0 for all (x, θ) ∈ U, then

Σⱼ₌₁ⁿ (∂g/∂xⱼ)(∂xⱼ*(θ)/∂θ) = −∂g/∂θ.

Proof. The 1st-degree Taylor polynomial equation for g(Δx, Δθ) = 0 is

(∂g/∂x₁)Δx₁ + ⋯ + (∂g/∂xₙ)Δxₙ + (∂g/∂θ)Δθ + R(Δx, Δθ) = 0

where for the remainder function R it holds that lim_{(Δx,Δθ)→0} R(Δx, Δθ)/‖(Δx, Δθ)‖ = 0. Dividing by Δθ and taking the limit,

lim_{Δθ→0} [ Σⱼ₌₁ⁿ (∂g/∂xⱼ)(Δxⱼ/Δθ) + ∂g/∂θ ] = 0

so that

Σⱼ₌₁ⁿ (∂g/∂xⱼ)(∂xⱼ*(θ)/∂θ) = −∂g/∂θ. □

Now the proof of the Envelope Theorem becomes straightforward.

Theorem 68 (Multivariate Envelope Theorem with One Constraint). For the maximization problem max_{x ∈ R^n} f(x, θ) subject to g(x, θ) ≤ 0, if f : R^n × R^m → R and g : R^n × R^m → R are continuously differentiable, if the solution function x* : R^m → R^n is differentiable on an open set U ⊆ R^m, and if λ* is the value of the Lagrange multiplier in the F.O.C., then the partial derivatives of the value function satisfy

∂v(θ)/∂θᵢ = ∂f/∂θᵢ − λ* ∂g/∂θᵢ, evaluated at x*(θ), for i = 1, …, m.

Proof. For i = 1, …, m, evaluating all partial derivatives at (x*(θ), θ), we have

∂v(θ)/∂θᵢ = ∂f/∂θᵢ + Σⱼ₌₁ⁿ (∂f/∂xⱼ)(∂xⱼ*(θ)/∂θᵢ)
          = ∂f/∂θᵢ + λ* Σⱼ₌₁ⁿ (∂g/∂xⱼ)(∂xⱼ*(θ)/∂θᵢ)    (from the F.O.C. (11))
          = ∂f/∂θᵢ − λ* ∂g/∂θᵢ    (by the Proposition above). □
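Theorem 68 can be illustrated with a small symbolic check on a hypothetical instance (assuming sympy is available): maximize log(x) subject to x − θ ≤ 0, so the constraint binds, x*(θ) = θ, and λ* = f′(x*)/g′(x*) = 1/θ:

```python
import sympy as sp

x, theta = sp.symbols("x theta", positive=True)

# Hypothetical instance of Theorem 68: max log(x) s.t. g = x - theta <= 0.
f = sp.log(x)
g = x - theta

# The constraint binds, so x*(theta) = theta and, by FOC (11),
# lambda* = (df/dx) / (dg/dx) evaluated at x*.
lam = (sp.diff(f, x) / sp.diff(g, x)).subs(x, theta)

# Value function and its derivative in the parameter.
v = f.subs(x, theta)
lhs = sp.diff(v, theta)

# Envelope formula: dv/dtheta = df/dtheta - lambda* * dg/dtheta at x*(theta).
rhs = (sp.diff(f, theta) - lam * sp.diff(g, theta)).subs(x, theta)

assert sp.simplify(lhs - rhs) == 0
print("dv/dtheta =", lhs, "= envelope formula =", rhs)
```

Here ∂f/∂θ = 0 and ∂g/∂θ = −1, so the whole derivative 1/θ of the value function comes through the multiplier term, matching the shadow-price interpretation of λ.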
Now we can extend the proof to maximization problems with multiple constraints, obtaining the general version of the Envelope Theorem.

Theorem 69 (Multivariate Envelope Theorem with Many Constraints). For the maximization problem max_{x ∈ R^n} f(x, θ) subject to g(x, θ) ≤ 0, if f : R^n × R^m → R and g : R^n × R^m → R^ℓ are continuously differentiable, if the solution function x* : R^m → R^n is differentiable on an open set U ⊆ R^m, and if λ₁*, …, λ_ℓ* are the values of the Lagrange multipliers in the F.O.C.s, then the partial derivatives of the value function satisfy

∂v(θ)/∂θᵢ = ∂f/∂θᵢ − Σₖ₌₁^ℓ λₖ* ∂gₖ/∂θᵢ, evaluated at x*(θ), for i = 1, …, m.

Applications of the Envelope Theorem include Hotelling's Lemma, Shephard's Lemma, Roy's Identity, and the solution to the Bellman equation in Dynamic Programming.

10 Dynamic Optimization

10.1 Motivation: Labor Supply

Labor supply is part of a lifetime decision-making process. There are some stylized facts about the pattern of wages and employment over the lifecycle:

1. Earnings and wages rise with age after schooling, then decline before retirement.
2. Hours worked generally rise with age, then fall before retirement, then go to zero at retirement.
3. Individuals have been retiring earlier, on average, over the last 20 years.

Variations in health status, family composition and real wages, anticipated or otherwise, provide incentives for individuals to vary the timing of their labor market earnings for income-smoothing and taste purposes. But these responses are not adequately captured in the static model.

In the static labor supply model, the household solves the problem:

max_{C,l} U(C, l) s.t. C ≤ wh + Y, h = H − l, C, l ≥ 0

where C, l, h are consumption, leisure and hours of work, respectively, w is the wage rate, Y is non-labor income, and H is the time endowment (e.g. 24 hours per day).
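The static problem above can be solved numerically for a hypothetical Cobb-Douglas specification of U (a sketch assuming scipy is available; all parameter values are illustrative only):

```python
import numpy as np
from scipy.optimize import minimize

# Static labor supply model with hypothetical Cobb-Douglas utility
# U(C, l) = a*log(C) + (1 - a)*log(l).
a, w, Y, H = 0.5, 20.0, 50.0, 24.0

def neg_U(z):
    C, l = z
    return -(a * np.log(C) + (1 - a) * np.log(l))

# Budget constraint C <= w*(H - l) + Y, with hours h = H - l.
res = minimize(neg_U, x0=[100.0, 12.0],
               constraints=[{"type": "ineq",
                             "fun": lambda z: w * (H - z[1]) + Y - z[0]}],
               bounds=[(1e-6, None), (1e-6, H)])
C, l = res.x

# Cobb-Douglas splits "full income" w*H + Y in shares a and 1 - a, and
# at an interior optimum the MRS U_l/U_C equals the wage w.
assert abs(C - a * (w * H + Y)) < 0.5
assert abs((1 - a) * C / (a * l) - w) < 0.1
print(f"C* = {C:.2f}, l* = {l:.2f} hours of leisure")
```

At the numerical optimum the budget binds and the leisure-consumption trade-off is priced by the wage, which is the interior-solution condition derived next.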
At an interior solution (denoted by *), the marginal rate of substitution between leisure and consumption equals the wage rate:

U_l(C*, l*) / U_C(C*, l*) = w.

In the static model people make a one-shot decision about work and consumption. Time does not play a role. Thus the model is not suited for the study of the dynamics of labor supply over the life-cycle.

In the intertemporal labor supply model:

1. Utility depends not only on present labor supply and consumption, but on the whole sequence of labor supply and consumption over the life-cycle:

U(C₀, C₁, …, C_T, l₀, l₁, …, l_T) = Σ_{t=0}^{T} β^t U(C_t, l_t)

where β ∈ (0, 1) is a discount factor.

2. Individuals need to form expectations about the future: hence, they maximize expected rather than deterministic utility.

3. The budget constraint is dynamic: individuals can shift resources between periods by using savings. Consequently, today's decisions influence future choice sets.

10.2 Approaches to Dynamic Optimization

Solving the intertemporal labor supply model, and similar dynamic models in economics, requires the use of dynamic optimization methods. There are three distinct approaches to dynamic optimization:

1. Calculus of variations, a centuries-old extension of calculus to infinite-dimensional space. It was generalized under the stimulus of the space race in the late 1950s to develop optimal control theory, the most common technique for dealing with models in continuous time.

2. Dynamic programming, which was also developed in the mid-20th century, primarily to deal with optimization in discrete time.

3. The Lagrangean approach, which extends the Lagrangean technique of static optimization to dynamic problems.

We will start with the Lagrangean approach, building on the concepts familiar from static optimization, and then move on to dynamic programming.

10.3 Two Periods

Suppose you embark on a two-day hiking trip with w units of food.
Your problem is to decide how much food to consume on the first day, and how much to save for the second day. It is conventional to label the first period with 0. Let c₀ denote consumption on the first day and c₁ denote consumption on the second day. The optimization problem is

max_{c₀,c₁} U(c₀, c₁) s.t. c₀ + c₁ = w.

Optimality requires that daily consumption be arranged so as to equalize the marginal utility of consumption on the two days, that is,

∂U(c₀, c₁)/∂c₀ = ∂U(c₀, c₁)/∂c₁.

Otherwise, the intertemporal allocation of food could be rearranged so as to increase total utility. Under optimality, marginal benefit equals marginal cost, where the marginal cost of consumption in period 0 is the consumption foregone in period 1. This is the fundamental intuition of dynamic optimization: optimality requires that resources be allocated over time in such a way that there are no favorable opportunities for intertemporal trade.

Typically, the utility function is assumed to be separable:

U(c₀, c₁) = u₀(c₀) + u₁(c₁)

and stationary:

U(c₀, c₁) = u(c₀) + βu(c₁)

where β represents the discount rate of future consumption. Then the optimality condition is

u′(c₀) = βu′(c₁).

Assuming that u is concave, we can deduce that

c₀ > c₁ ⟺ β < 1.

Consumption is higher in the first period if future consumption is discounted.

This model can be extended in various ways. For example, if it is possible to borrow and lend at interest rate r, the two-period optimization problem is

max_{c₀,c₁} U(c₀, c₁) s.t. c₁ = (1 + r)(w − c₀)

assuming separability and stationarity. Forming the Lagrangean

L = u(c₀) + βu(c₁) − λ(c₁ − (1 + r)(w − c₀))

the first-order conditions for optimality are

∂L/∂c₀ = u′(c₀) − λ(1 + r) = 0
∂L/∂c₁ = βu′(c₁) − λ = 0.

Eliminating λ, we conclude that optimality requires that

u′(c₀) = βu′(c₁)(1 + r).    (12)

The left-hand side is the marginal benefit of consumption today.
For an optimal allocation, this must be equal to the marginal cost of consumption today, which is the interest foregone (1 + r) times the marginal benefit of consumption tomorrow, discounted at the rate β. Assuming u is concave, we can deduce from (12) that

c₀ > c₁ ⟺ β(1 + r) < 1.

The balance of consumption between the two periods depends upon the interaction of the rate of time preference β and the interest rate r. Alternatively, the optimality condition can be expressed as

u′(c₀) / (βu′(c₁)) = 1 + r.    (13)

The quantity on the left-hand side is the intertemporal marginal rate of substitution. The quantity on the right can be thought of as the marginal rate of transformation, the rate at which savings in the first period can be transformed into consumption in the second period.

10.4 T Periods

To extend the model to T periods, let c_t denote consumption in period t and w_t the remaining wealth at the beginning of period t. Then

w₁ = (1 + r)(w₀ − c₀)
w₂ = (1 + r)(w₁ − c₁)
⋮
w_T = (1 + r)(w_{T−1} − c_{T−1})

where w₀ denotes the initial wealth. The optimal pattern of consumption through time solves the problem

max_{c_t, w_{t+1}} Σ_{t=0}^{T−1} β^t u(c_t) s.t. w_t = (1 + r)(w_{t−1} − c_{t−1}), t = 1, 2, …, T.

This is a standard equality constrained optimization problem with T constraints. Assigning multipliers (λ₁, λ₂, …, λ_T) to the T constraints, the Lagrangean is

L = Σ_{t=0}^{T−1} β^t u(c_t) − Σ_{t=1}^{T} λ_t (w_t − (1 + r)(w_{t−1} − c_{t−1})).

This can be rewritten as

L = u(c₀) − λ₁(1 + r)c₀ + λ₁(1 + r)w₀
  + Σ_{t=1}^{T−1} [β^t u(c_t) − λ_{t+1}(1 + r)c_t + (λ_{t+1}(1 + r) − λ_t)w_t]
  − λ_T w_T.

The first order conditions for optimality are

∂L/∂c₀ = u′(c₀) − λ₁(1 + r) = 0
∂L/∂c_t = β^t u′(c_t) − λ_{t+1}(1 + r) = 0, t = 1, 2, …, T − 1
∂L/∂w_t = (1 + r)λ_{t+1} − λ_t = 0, t = 1, 2, …, T − 1
∂L/∂w_T = −λ_T = 0.

Together, these equations imply

β^t u′(c_t) = λ_{t+1}(1 + r) = λ_t    (14)

in every period t = 0, 1, …, T − 1, and therefore

β^{t+1} u′(c_{t+1}) = λ_{t+1}.    (15)
Substituting (15) into (14), we get

β^t u′(c_t) = β^{t+1} u′(c_{t+1})(1 + r)

i.e.

u′(c_t) = βu′(c_{t+1})(1 + r), t = 0, 1, …, T − 1    (16)

which is identical to (12), the optimality condition for the two-period problem. An optimal consumption plan requires that consumption be allocated through time so that the marginal benefit of consumption in period t, u′(c_t), is equal to its marginal cost, which is the interest foregone (1 + r) times the marginal benefit of consumption tomorrow discounted at the rate β. Assuming u is concave, (16) implies that

c_t > c_{t+1} ⟺ β(1 + r) < 1.

The choice of c_t determines the level of wealth remaining, w_{t+1}, to provide for consumption in future periods. The Lagrange multiplier λ_t associated with the constraint w_t = (1 + r)(w_{t−1} − c_{t−1}) measures the shadow price of this wealth w_t at the beginning of period t. (14) implies that this shadow price is equal to β^t u′(c_t). Additional wealth in period t can either be consumed or saved, and its value in these two uses must be equal. Consequently, its value must be equal to the discounted marginal utility of consumption in period t. Note that the final first-order condition at T is λ_T = 0, since any wealth left over is assumed to be worthless.

10.5 General Problem

The general finite horizon dynamic optimization problem can be depicted as follows. Starting from an initial state s₀, the decision maker chooses some action a₀ ∈ A₀ in the first period. This generates a contemporaneous return or benefit f₀(a₀, s₀) and leads to a new state s₁. The scheme proceeds similarly until the terminal time T. In each period t, the transition to the new state is determined by the transition equation

s_{t+1} = g_t(a_t, s_t).

Assuming separability, the objective of the decision maker is to choose that sequence of actions a₀, a₁, …, a_{T−1} which maximizes the discounted sum of the contemporaneous returns f_t(a_t, s_t) plus the value of the terminal state v(s_T).
Therefore, the general dynamic optimization problem is

max_{a_t ∈ A_t} Σ_{t=0}^{T−1} β^t f_t(a_t, s_t) + β^T v(s_T)
s.t. s_{t+1} = g_t(a_t, s_t), t = 0, …, T − 1, given s₀.    (17)

The variables in this optimization problem are of two types:

1. a_t ∈ A_t is known as the control variable. It is immediately under the control of the decision-maker.

2. s_t is known as the state variable. It is determined indirectly through the transition equation.

10.6 Lagrangean Approach

The dynamic optimization problem (17) falls into the regular framework of constrained optimization problems solved by Lagrangean methods that we have encountered. In forming the Lagrangean, it is useful to multiply each constraint (transition equation) by β^{t+1}, giving the equivalent problem

max_{a_t ∈ A_t} Σ_{t=0}^{T−1} β^t f_t(a_t, s_t) + β^T v(s_T)
s.t. β^{t+1}(s_{t+1} − g_t(a_t, s_t)) = 0, t = 0, …, T − 1, given s₀.

Assigning multipliers λ₁, …, λ_T to the T constraints (transition equations), the Lagrangean is

L = Σ_{t=0}^{T−1} β^t f_t(a_t, s_t) + β^T v(s_T) − Σ_{t=0}^{T−1} β^{t+1} λ_{t+1}(s_{t+1} − g_t(a_t, s_t)).

To facilitate derivation of the first-order conditions, it is convenient to rewrite the Lagrangean as

L = Σ_{t=0}^{T−1} β^t f_t(a_t, s_t) + Σ_{t=0}^{T−1} β^{t+1} λ_{t+1} g_t(a_t, s_t) − Σ_{t=1}^{T} β^t λ_t s_t + β^T v(s_T)

  = f₀(a₀, s₀) + βλ₁ g₀(a₀, s₀)
  + Σ_{t=1}^{T−1} β^t [f_t(a_t, s_t) + βλ_{t+1} g_t(a_t, s_t) − λ_t s_t]
  − β^T λ_T s_T + β^T v(s_T).    (18)

Note that the gradients of the constraints are linearly independent, since each period's a_t appears in only one transition equation. A necessary condition for optimality is that a_t must be chosen in each period t = 0, 1, …, T − 1 such that

∂L/∂a_t = β^t (∂/∂a_t) f_t(a_t, s_t) + β^{t+1} λ_{t+1} (∂/∂a_t) g_t(a_t, s_t) = 0.
DYNAMIC OPTIMIZATION Similarly, in periods t = 1; : : : ; T @ L= @st t ECO2010, Summer 2022 1 the resulting st must satisfy @ ft (at ; st ) + @st t+1 @ gt (at ; st ) @st t =0 while the terminal state sT must satisfy @ L= @sT T T + v 0 (sT ) = 0 The sequence of actions a0 ; a1 ; : : : ; aT 1 and states s1 ; s2 ; : : : ; sT must also satisfy the transition equations st+1 = gt (at ; st ); t = 0; 1; : : : ; T 1 These necessary conditions can be rewritten as @ ft (at ; st ) + @at @ ft (at ; st ) + @st @ gt (at ; st ) = 0, t = 0; 1; : : : ; T 1 @at @ gt (at ; st ) = t , t = 1; : : : ; T 1 t+1 @st st+1 = gt (at ; st ); t = 0; : : : ; T 1 t+1 T 0 = v (sT ) (19) (20) (21) (22) If, in addition, ft and gt are increasing in st , and ft , gt and v are all concave, then t 0 for every t and the Lagrangean will be concave. To interpret these conditions, observe that a marginal change in at has two e¤ects: 1. It changes the instantaneous return in period t by 2. It changes the state in the next period st+1 by measured by t+1 . @ @at ft (at ; st ) @ @at gt (at ; st ); the value of which is In terms of interpreting the evolution of the state, observe that a marginal change in st has two e¤ects: 1. It changes the instantaneous return in period t by @ @st ft (at ; st ) 2. It alters the attainable state in the next period st+1 by is t+1 @s@ t gt (at ; st ): @ @st gt (at ; st ), the value of which The value of the total e¤ect discounted to the current period t is expressed by the shadow price t of st : Thus the shadow price in each period t measures the present and future consequences of a marginal change in st . The terminal condition (22), T = v 0 (sT ), is known as a transversality condition. The necessary and su¢ cient conditions (19)–(22) constitute a simultaneous system of 3T equations in 3T unknowns, which in principle can be solved for the optimal solution. 59 10. 
10.7 Maximum Principle

Some economy of notation and additional economic interpretation can be gained by defining the Hamiltonian

H_t(a_t, s_t, λ_{t+1}) = f_t(a_t, s_t) + β λ_{t+1} g_t(a_t, s_t)

which measures the total discounted return in period t. The Hamiltonian augments the single-period return f_t(a_t, s_t) to account for the future consequences of current decisions, aggregating the direct and indirect effects of the choice of a_t in period t. Then the Lagrangean (18) becomes

L = H_0(a_0, s_0, λ_1) + Σ_{t=1}^{T−1} β^t [H_t(a_t, s_t, λ_{t+1}) − λ_t s_t] − β^T λ_T s_T + β^T v(s_T)

Optimality requires

∂L/∂a_t = β^t ∂H_t(a_t, s_t, λ_{t+1})/∂a_t = 0,  t = 0, 1, …, T−1
∂L/∂s_t = β^t [∂H_t(a_t, s_t, λ_{t+1})/∂s_t − λ_t] = 0,  t = 1, 2, …, T−1
∂L/∂s_T = −β^T λ_T + β^T v'(s_T) = 0

which can be rewritten as

∂H_t(a_t, s_t, λ_{t+1})/∂a_t = 0,  t = 0, 1, …, T−1
∂H_t(a_t, s_t, λ_{t+1})/∂s_t = λ_t,  t = 1, 2, …, T−1
λ_T = v'(s_T)

The Maximum Principle characterizes the solution to the above constrained optimization problem in terms of the Hamiltonian. Along the optimal path, a_t should be chosen in such a way as to maximize the total benefits in each period. In a limited sense, the Maximum Principle transforms a dynamic optimization problem into a sequence of static optimization problems. These static problems are related by two intertemporal equations: the transition equation and the corresponding equation determining the evolution of the shadow price λ_t.

10.8 Infinite Horizon

Many economic problems have no fixed terminal date and are more appropriately or conveniently modeled as having an infinite horizon. In such a case, the dynamic optimization problem becomes

max_{a_t ∈ A_t} Σ_{t=0}^{∞} β^t f_t(a_t, s_t)  s.t.  s_{t+1} = g_t(a_t, s_t),  t = 0, 1, …   (23)

given s_0. To ensure that the total discounted return is finite, we assume that f_t is bounded for every t and β < 1.
An optimal solution to (23) must also be optimal over any …nite period, provided the future consequences are correctly taken into account. That is, (23) is equivalent to max at 2At T X1 t ft (at ; st ) + T vT (sT ) s.t. st+t = gt (at ; st ); t = 0; 1; : : : ; T 1 t=0 where vT (sT ) = max at 2At 1 X t T ft (at ; st ) s.t. st+t = gt (at ; st ); t = T; T + 1; : : : t=T It follows that the in…nite horizon problem (23) must satisfy the same intertemporal optimality conditions as its …nite horizon counterpart. 61 11. DYNAMIC PROGRAMMING 11 11.1 ECO2010, Summer 2022 Dynamic Programming Finite Horizon: Key Concepts Dynamic programming (DP) is an approach to dynamic optimization which facilitates incorporation of uncertainty and lends itself to computation. Again consider the general dynamic optimization problem max at 2At T X1 t ft (at ; st ) + T v(sT ) s.t. st+1 = gt (at ; st ); t = 0; : : : ; T The (maximum) value function for this problem is (T 1 X t v0 (s0 ) = max ft (at ; st ) + T v(sT ) : st+1 = gt (at ; st ); t = 0; : : : ; T at 2At t=0 By analogy, we de…ne the value function for intermediate periods (T 1 X t vt (st ) = max f (a ; s ) + T t v(sT ) : s +1 = g (a ; s ); at 2At 1 t=0 1 ) = t; t + 1; : : : ; T =t 1 ) The value function measures the best that can be done given the current state and remaining time. Note that vt (st ) = max fft (at ; st ) + vt+1 (st+1 ) : st+1 = gt (at ; st )g at 2At = max fft (at ; st ) + vt+1 (gt (at ; st ))g (24) at 2At This fundamental recurrence relation, known as the Bellman equation, makes explicit the trade-o¤ between present and future values. It embodies the principle of optimality: "An optimal policy has the property that, whatever the initial state and decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the …rst decision" (Bellman, 1957). 
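The principle of optimality can be checked directly on a small example: the Bellman recursion (24) and brute-force maximization over entire action plans return the same value. The toy "cake-eating" primitives below (reward √a, transition s′ = s − a, three periods, zero terminal value) are illustrative assumptions, not from the notes:

```python
import itertools, math

# Toy problem (assumed for illustration): state s = cake remaining, action
# a <= s units eaten, reward sqrt(a), transition s' = s - a, horizon T,
# terminal value zero.
beta, T, S = 0.9, 3, 6

def plan_value(plan):
    """Discounted return of a full plan (a_0,...,a_{T-1}), or None if infeasible."""
    s, total, disc = S, 0.0, 1.0
    for a in plan:
        if a > s:
            return None
        total += disc * math.sqrt(a)
        s, disc = s - a, disc * beta
    return total

# Direct approach: enumerate every feasible action plan of length T.
best = max(val for val in (plan_value(p) for p in itertools.product(range(S + 1), repeat=T))
           if val is not None)

# Recursive approach: backward induction on v_t(s) = max_a { sqrt(a) + beta*v_{t+1}(s-a) }.
v = [[0.0] * (S + 1) for _ in range(T + 1)]   # v[T][s] = 0: terminal condition
for t in range(T - 1, -1, -1):
    for s in range(S + 1):
        v[t][s] = max(math.sqrt(a) + beta * v[t + 1][s - a] for a in range(s + 1))

print(abs(v[0][S] - best) < 1e-9)  # principle of optimality: both agree
```

The recursion examines far fewer combinations than the enumeration, which is exactly the computational point of the Bellman equation.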
Assuming vt+1 is di¤erentiable and letting t+1 0 = vt+1 (st+1 ) the …rst-order condition for the maximization in the Bellman equation (24) is @ ft (at ; st ) + @at t+1 @ gt (at ; st ) = 0, @at t = 0; 1; : : : ; T 1 (25) which is the same F.O.C. derived using the Lagrangean approach. This optimality condition is called the Euler equation. The derivative of the value function t = vt0 (st ) 62 11. DYNAMIC PROGRAMMING ECO2010, Summer 2022 follows an analogous recursion which can be shown as follows. Let the policy function de…ne the solution of the Euler equation (25), denoted by at = ht (st ) (26) Substituting (26) into (24) yields vt (st ) = ft (ht (st ); st ) + vt+1 (gt (ht (st ); st )) Assuming h and v are di¤erentiable, t @ @ @ ft ht + ft + @at @st @st @ @ @ = ft + t+1 gt + ft + @st @st @at = vt0 (st ) = @ @ @ gt ht + gt @at @st @st @ @ gt ht t+1 @at @st t+1 (27) Using the Euler equation (25), the term in the brackets of (27) equals zero, and therefore t = @ ft + @st t+1 @ gt , @st t = 1; 2; : : : ; T 1 This is equivalent to the recursion we previously derived using the Lagrangean approach. Coupled with the transition equation and the boundary conditions, the optimal policy is characterized by @ ft (at ; st ) + @at @ ft (at ; st ) + @st @ gt (at ; st ) = 0, t = 0; 1; : : : ; T 1 @at @ gt (at ; st ) = t , t = 1; : : : ; T 1 t+1 @st st+1 = gt (at ; st ); t = 0; : : : ; T 1 t+1 T = v 0 (sT ) which is the same set of dynamic optimality condition as obtained from the Lagrangean approach. It is reassuring that the characterization of an optimal solution is the same regardless of the solution method adopted. The DP approach provides a more elegant derivation of the basic Euler equation characterizing the optimal solution than does the Lagrangean approach, although the latter is more easily and directly related to static optimization with which we are familiar. 
The main attraction of DP is that it offers an alternative solution method, backward induction, which is particularly amenable to computation. Using backward induction we solve the problem by computing the value function starting from the terminal state and proceeding iteratively back to the initial state, recovering the optimal solution in the process. We will see backward induction derived in detail in the application of DP to a labor supply problem.

11.2 Infinite Horizon

In the stationary infinite horizon problem

max Σ_{t=0}^{∞} β^t f(a_t, s_t)  s.t.  s_{t+1} = g(a_t, s_t),  t = 0, 1, …

the value function is

v(s_0) = max { Σ_{t=0}^{∞} β^t f(a_t, s_t) : s_{t+1} = g(a_t, s_t), t = 0, 1, … }

Note that v, f, and g are assumed to have the same functional form in all time periods. The Bellman equation

v(s) = max_a { f(a, s) + β v(g(a, s)) }   (28)

must hold in all periods and all states, so we can dispense with the time subscript t. The first-order conditions for (28) can be used to derive the Euler equation characterizing the optimal solution.

In many economic models, it is possible to dispense with the separate transition equation by identifying the control variable in period t with the state variable in the subsequent period. For example, in an economic growth model, we can view the choice in each period as: given the capital stock today, select the capital stock tomorrow, with consumption determined as the residual. Letting x_t denote the decision variable, the optimization problem becomes

max_{x_1, x_2, …} Σ_{t=0}^{∞} β^t f(x_t, x_{t+1})  s.t.  x_{t+1} ∈ G(x_t),  t = 0, 1, …,  given x_0.

The Bellman equation for this problem is

v(x) = max_{y ∈ G(x)} { f(x, y) + β v(y) }   (29)

For a stationary infinite horizon problem, the Bellman equation (28) or (29) defines a functional equation, an equation in which the unknown is the function v. From another perspective, the Bellman equation defines an operator v → T(v) on the space of value functions.
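Viewing the Bellman equation as an operator suggests computing v by applying the operator repeatedly. A minimal sketch for a problem in the form (29), under assumed primitives chosen only for illustration (f(x, y) = √(x − y), G(x) = {0, …, x}, a small discrete grid):

```python
import math

# Sketch: solving the functional equation (29) by iterating the operator
# T(v)(x) = max_{y in G(x)} { f(x, y) + beta*v(y) } on a grid.
beta = 0.9
grid = range(21)

def T(v):
    """One application of the Bellman operator to a value function on the grid."""
    return [max(math.sqrt(x - y) + beta * v[y] for y in range(x + 1)) for x in grid]

v = [0.0] * len(grid)       # any bounded starting guess
for _ in range(400):        # iterate v <- T(v); beta^400 is negligible
    v = T(v)

v_fixed = T(v)              # at a fixed point, T(v) is (numerically) v itself
print(max(abs(v_fixed[x] - v[x]) for x in grid) < 1e-8)
```

Because the operator is a contraction with modulus β (as formalized later via the Contraction Mapping Theorem), the iterates converge to the unique fixed point regardless of the starting guess.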
Under appropriate conditions, this operator has a unique …xed point, which is the unique solution of the functional equation. On this basis, we can guarantee the existence and uniqueness of the optimal solution to an in…nite horizon problem, and also deduce some of the properties of the optimal solution. In those cases in which we want to go beyond the Euler equation and these deducible properties to obtain an explicit solution, we need to …nd the solution v of its functional equation (29). In the in…nite horizon problem, we cannot use backward induction since there is no terminal time T: There are at least three practical approaches to solving the Bellman equation in in…nite horizon problems: 64 11. DYNAMIC PROGRAMMING ECO2010, Summer 2022 1. Informed guess and verify: 2. Value function iteration; 3. Policy function iteration. In simple cases, it may be possible to guess the functional form of the value function, and then verify that it satis…es the Bellman equation. Given that the Bellman equation has a unique solution, we can be con…dent that our veri…ed guess is the only possible solution. In other cases, we can proceed by successive approximation. For the value function iteration solution, given a particular value function v 1 , (28) de…nes another value function v 2 by v 2 (s) = max f (a; s) + v 1 (g(a; s)) a (30) and so on. Eventually, this iteration converges to the unique solution of (30). Policy function iteration starts with a feasible policy h1 (s) and computes the value function assuming that policy is applied consistently: (1 ) X t v 1 (s) = max f (h1 (st ); st ) : st+1 = g(at ; st ); t = 0; 1; : : : t=0 Given this approximation to the value function, we compute a new policy function h2 which solves the Bellman equation assuming this value function, that is h2 (s) = arg max f (a; s) + v 1 (g(a; s)) a and then use this policy function to de…ne a new value function v 2 . 
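Policy function iteration can be sketched for the same kind of stationary problem. The primitives below (reward √(x − y), feasible set y ≤ x, a 21-point grid) are assumptions for illustration; the steps follow the description above: evaluate the current policy, then improve it greedily against the evaluated value function.

```python
import math

# Sketch of policy function iteration for a problem in the form (29):
# v(x) = max_{y<=x} { sqrt(x - y) + beta*v(y) }, with x the resource stock
# and y the amount carried forward (illustrative primitives).
beta, N = 0.9, 20
states = range(N + 1)

def reward(x, y):
    return math.sqrt(x - y)

h = {x: 0 for x in states}             # start from a feasible policy h^1
while True:
    # policy evaluation: iterate v(x) = reward(x, h(x)) + beta*v(h(x))
    v = {x: 0.0 for x in states}
    for _ in range(500):
        v = {x: reward(x, h[x]) + beta * v[h[x]] for x in states}
    # policy improvement: greedy policy h^2 against the evaluated v
    h_new = {x: max(range(x + 1), key=lambda y: reward(x, y) + beta * v[y])
             for x in states}
    if h_new == h:                     # stop when the policy no longer changes
        break
    h = h_new

print(h[N] > 0)  # with beta = 0.9, the converged policy saves part of the stock
```

Each outer loop performs one evaluation/improvement cycle; on a finite grid the policy can only change finitely many times, so the loop terminates.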
Under appropriate conditions, this iteration will converge to the optimal policy function and corresponding value function. In many cases, convergence is faster than under value function iteration (Ljungqvist and Sargent 2000: 33).

12 DP Application - Economic Growth

A finite horizon version of the optimal economic growth model solves the following problem:

max_{c_t} Σ_{t=0}^{T−1} β^t u(c_t) + β^T v(k_T)  s.t.  k_{t+1} = F(k_t) − c_t

where c denotes consumption, k denotes capital, F(k_t) is the total supply of goods available at the end of period t, comprising current output plus undepreciated capital, and v(k_T) is the value of the remaining capital at the end of the planning horizon. Setting a_t = c_t, s_t = k_t we obtain

f(a_t, s_t) = u(c_t),  g(a_t, s_t) = F(k_t) − c_t

It is economically reasonable to assume that u is concave, and that F and v are concave and increasing. Then, using the optimality principle, an optimal plan satisfies the equations

u'(c_t) = β λ_{t+1},  t = 0, 1, …, T−1   (31)
λ_t = β λ_{t+1} F'(k_t),  t = 1, …, T−1   (32)
k_{t+1} = F(k_t) − c_t,  t = 0, …, T−1   (33)
λ_T = v'(k_T)   (34)

From (33), in any period t output can be either consumed or saved. The marginal benefit of additional consumption in period t is u'(c_t). The marginal cost of additional consumption is a reduction in capital available for the subsequent period, valued at β λ_{t+1}. Condition (31) requires that consumption in each period t be chosen so that the marginal benefit of additional consumption equals its marginal cost. In period t + 1, (31) and (32) require

u'(c_{t+1}) = β λ_{t+2}   (35)
λ_{t+1} = β λ_{t+2} F'(k_{t+1})   (36)

The impact of additional capital in period t + 1 is increased production F'(k_{t+1}). This additional production could be saved for the subsequent period, in which case it would be worth β λ_{t+2} F'(k_{t+1}). Alternatively, the additional production could be consumed, in which case it would be worth u'(c_{t+1}) F'(k_{t+1}). Together, (35) and (36) imply that

λ_{t+1} = u'(c_{t+1}) F'(k_{t+1})
DP APPLICATION - ECONOMIC GROWTH 12.1 ECO2010, Summer 2022 Euler Equation (31) implies that this is equal to the marginal bene…t of consumption in period t, that is u0 (ct ) = u0 (ct+1 )F 0 (kt+1 ) (37) which is the Euler equation for this problem, determining relative consumption between successive periods. The left-hand side is the marginal bene…t of consumption in period t, while the right-hand side is the marginal cost, where the marginal cost is measured by marginal utility of potential consumption foregone, u0 (ct+1 )F 0 (kt+1 ); discounted one period. The actual level of the optimal consumption path c0 ; c1 ; : : : ; cT 1 is determined by the initial capital k0 and by the requirement (34) that the shadow price of capital in the …nal period 0 T be equal to the marginal value of the terminal stock v (kT ): The Euler equation (37) can be rearranged to yield u0 (ct ) = F 0 (kt+1 ) u0 (ct+1 ) The left-hand side of this equation is the intertemporal marginal rate of substitution in consumption, while the right-hand side is the marginal rate of transformation in production, the rate of which additional capital can be transformed into additional output. Subtracting u0 (ct+1 ) from both sides, the Euler equation (37) can be expressed as u0 (ct ) u0 (ct+1 ) = F 0 (kt+1 ) 1 u0 (ct+1 ) Assuming that c is concave, this implies > > 0 ct+1 = ct () F (kt+1 ) = 1 < < Whether consumption is increasing or decreasing under the optimal plan depends on the balance between technology and the rate of time preference. 12.2 Hamiltonian The Hamiltonian for this problem is H(ct ; kt ; t+1 ) = u(ct ) + t+1 (F (kt ) ct ) which immediately yields the optimality conditions @ H = u0 (ct ) t = 0; 1; : : : ; T t+1 = 0, @c @ H = t+1 F 0 (kt ), t = 1; 2; : : : ; T 1 @k @ @ H = F (kt ) ct = kt+1 , t+1 T = v 0 (kT ) 67 t = 0; 1; : : : ; T 1 1 12. 
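The Euler equation (37) can be verified numerically in a well-known special case. Assuming log utility u(c) = ln c and Cobb-Douglas production F(k) = k^α (assumptions made here for illustration, not primitives of this section), the optimal policy is known to be k_{t+1} = αβ k_t^α with c_t = (1 − αβ) k_t^α, and the Euler residual should vanish along the implied path:

```python
# Sketch: verify the Euler equation u'(c_t) = beta*u'(c_{t+1})*F'(k_{t+1})
# for the special case u(c) = ln(c), F(k) = k^alpha, whose known optimal
# policy is k_{t+1} = alpha*beta*k_t^alpha and c_t = (1 - alpha*beta)*k_t^alpha.
alpha, beta, k = 0.3, 0.95, 1.5   # illustrative parameters and initial capital

max_resid = 0.0
for _ in range(20):               # walk along the optimal capital path
    k_next = alpha * beta * k**alpha
    c = (1 - alpha * beta) * k**alpha
    c_next = (1 - alpha * beta) * k_next**alpha
    # Euler residual: 1/c_t - beta*(1/c_{t+1})*alpha*k_{t+1}^(alpha-1)
    resid = 1/c - beta * (1/c_next) * alpha * k_next**(alpha - 1)
    max_resid = max(max_resid, abs(resid))
    k = k_next

print(max_resid < 1e-9)  # the closed-form policy satisfies (37) in every period
```

The residual is zero up to floating-point error at every point on the path, confirming that the conjectured policy satisfies the intertemporal optimality condition.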
DP APPLICATION - ECONOMIC GROWTH 12.3 ECO2010, Summer 2022 Dynamic Programming Recall the economic growth dynamic optimization problem max ct T X1 t T u(ct ) + v(kT ) s.t. kt+1 = F (kt ) ct t=0 The Bellman equation is vt (kt ) = max fu(ct ) + vt+1 (kt+1 )g ct = max fu(ct ) + vt+1 (F (kt ) ct ct )g The F.O.C. for this problem is u0 (ct ) 0 vt+1 (F (kt ) ct ) = 0 (38) Note that the value function shifted by one time period becomes vt+1 (kt+1 ) = max fu(ct+1 ) + vt+2 (F (kt+1 ) ct+1 )g ct+1 (39) By the Envelope Theorem, 0 0 vt+1 (kt+1 ) = vt+2 (F (kt+1 ) ct+1 )F 0 (kt+1 ) (40) The F.O.C. for (39) is u0 (ct+1 ) 0 vt+2 (F (kt+1 ) ct+1 ) = 0 Substituting in (40) yields 0 vt+1 (kt+1 ) = u0 (ct+1 )F 0 (kt+1 ) (41) Substituting in (38) gives the Euler equation (37). 12.4 In…nite Horizon The in…nite horizon version of the economic growth problem is max ct 1 X t u(ct ) s.t. kt+1 = F (kt ) ct t=0 Substituting for ct using the transition equation, the problem becomes max ct 1 X t u(F (kt ) kt+1 ) t=0 The Bellman equation is v(kt ) = max fu(F (kt ) kt+1 68 kt+1 ) + v(kt+1 )g 12. DP APPLICATION - ECONOMIC GROWTH ECO2010, Summer 2022 The F.O.C. for a maximum is u0 (ct ) + v 0 (kt+1 ) = 0 where ct = F (kt ) kt+1 Applying the Envelope Theorem, v 0 (kt ) = u0 (ct )F 0 (kt ) and therefore v 0 (kt+1 ) = u0 (ct+1 )F 0 (kt+1 ) Substituting into (42) we derive the Euler equation u0 (ct ) = u0 (ct+1 )F 0 (kt+1 ) 69 (42) 13. DP APPLICATION - LABOR SUPPLY 13 ECO2010, Summer 2022 DP Application - Labor Supply 13.1 Intertemporal Labor Supply Consider the intertemporal labor supply problem that we have used to motivate the need for dynamic optimization. Here, C; l; h are consumption, leisure and hours of work, respectively, w is the wage rate, Y is non-labor income, and H is the time endowment (e.g. 24 hours per day). 1. 
Utility depends on the whole sequence of labor supply and consumption over the life-cycle: T X t U (C0 ; C1 ; : : : ; CT ; l0 ; l1 ; : : : ; lT ) = U (Ct ; lt ) t=0 where 2 (0; 1) is a discount factor. 2. Individuals need to form expectations about the future: hence, they maximize expected rather than deterministic utility. 3. The budget constraint is dynamic: individuals can shift resources between periods by using savings. Consequently, today’s decisions in‡uence future choice sets. The model assumes that the worker can freely borrow or save at a constant risk-free interest rate of r: Denote the savings (or debt) carried into period t by At : The intertemporal budget constraint is then given by At+1 + Ct = wt ht + (1 + r)At The budget constraint equates period t expenditures (savings carried into period t + 1 and contemporaneous consumption) with period t income (labor earnings plus principal and interest on the current asset holdings). The model further assumes that wages wt follow some stochastic process described by the conditional probability distribution (wt ); speci…ed as wt+1 t (wt ) Individuals have to solve the following optimization problem: max fCt ;lt gT t=0 subject to E T X t U (Ct ; lt ) t=0 At+1 + Ct = wt ht + (1 + r)At A0 = 0 lt = H wt+1 Ct ; lt ; ht ht t (wt ) 0 70 (43) 13. DP APPLICATION - LABOR SUPPLY ECO2010, Summer 2022 We will solve this problem using Dynamic Programming (DP). 13.2 Finite Horizon with No Uncertainty T=2 Case for the Consumer Problem Assume for simplicity for now that: 1. T = 2 2. there is no uncertainty 3. utility comes only from consumption and is logarithmic 4. leisure is constant and equal to one 5. 
period t income is given deterministically by w_t.

Then the decision problem is described by:

max_{C_0, C_1} ln(C_0) + β ln(C_1)

subject to

A_1 + C_0 = w_0 + (1 + r)A_0
A_2 + C_1 = w_1 + (1 + r)A_1
A_0 = 0

The problem is dynamic since consumption in period t = 0 determines the savings brought into the next period, A_1, which in turn determines how much can be consumed in the second period. DP solves the maximization problem by backward induction. The problem in the last period is:

max_{C_1} ln(C_1)  subject to  A_2 + C_1 = w_1 + (1 + r)A_1

The optimal decision is to consume all resources left:

A_2 = 0
C_1(A_1) = w_1 + (1 + r)A_1

Given savings A_1, the highest utility that can be reached in period t = 1 is given by the value function

v_1(A_1) = ln[C_1(A_1)] = ln[w_1 + (1 + r)A_1]

Going backward in time to t = 0, the problem is still given by (43):

max_{C_0, C_1} ln(C_0) + β ln(C_1)

but the DP solution at t = 1 replaces ln(C_1) by ln(C_1(A_1)) = v_1(A_1), which is the optimal value that can be reached next period. This substitution is based on the "Principle of Optimality" proven by Richard Bellman. The maximization problem in period t = 0 is thus given by

max_{C_0} ln(C_0) + β v_1(A_1)  subject to  A_1 + C_0 = w_0

or simply by

max_{C_0} ln(C_0) + β v_1(w_0 − C_0)   (44)

Note that the intertemporal trade-off between consumption at t = 0 and t = 1 is entirely captured by the value function v_1(·). We can now solve (44) with a Lagrangean.

In summary, DP solves the finite-horizon optimization problem by backward induction, yielding the functions:

C_1(A_1), the period t = 1 policy function;
v_1(A_1), the period t = 1 value function.

The argument A_1 is the state variable. Without uncertainty there is no gain from waiting before deciding how to spend savings in period 1. With uncertainty, there will be gains from the DP approach since it becomes valuable to wait.
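The t = 0 problem (44) can be solved explicitly. As a sketch, further assuming r = 0 and w_1 = 0 so that v_1(A_1) = ln A_1 (these simplifications are mine, not the notes'), the first-order condition 1/C_0 = β/(w_0 − C_0) gives C_0 = w_0/(1 + β), which a crude grid search reproduces:

```python
import math

# Sketch of problem (44): max_{C0} ln(C0) + beta*v1(w0 - C0), with
# v1(A1) = ln(w1 + (1+r)*A1). Assuming r = 0 and w1 = 0 for illustration,
# v1(A1) = ln(A1) and the FOC 1/C0 = beta/(w0 - C0) yields C0 = w0/(1 + beta).
beta, w0 = 0.9, 10.0   # illustrative values

def objective(c0):
    return math.log(c0) + beta * math.log(w0 - c0)   # v1(w0 - c0) = ln(w0 - c0)

closed_form = w0 / (1 + beta)
# crude grid search over interior consumption levels
grid = [w0 * i / 10000 for i in range(1, 10000)]
numeric = max(grid, key=objective)

print(abs(numeric - closed_form) < w0 / 10000)  # grid optimum matches the FOC
```

With β < 1 the consumer front-loads consumption slightly (C_0 > w_0/2), exactly as the closed form implies.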
General Case for the Consumer Problem The general …nite horizon maximization problem without uncertainty to be solved is described by: 1. An objective function T X t U (Xt ) t=0 that depends on a sequence of vectors of control variables Xt 2 Xt = Ct ); 72 t (in our example 13. DP APPLICATION - LABOR SUPPLY ECO2010, Summer 2022 2. A law of motion St+1 = Ft (St ; Xt ) that describes the endogenous evolution of the state variable St 2 St = At ; and Ft is the period t budget constraint); (in our example 3. An initial value of the state, S0 , 4. A continuation value beyond period T , vT +1 (ST +1 ) which we assume to be equal to zero. General Case with Generic Notation In general, the DP algorithm works as follows: 1. Solve the maximization problem in the last period T for any ST 2 T : max U (XT ) XT subject to St+1 = FT (ST ; XT ) The solution is a policy function XT (ST ); often setting ST +1 = 0: 2. Obtain the value function vT (ST ) = U (XT (ST )) 3. Going backward, solve the following maximization problem at each value of the state variables St , replacing the discounted sum of future utilities by vt+1 (St+1 ) as obtained in the previous induction step: max U (Xt ) + vt+1 (Ft (St ; Xt )) Xt 2 (45) t 4. Repeat step 3 until we reach t = 0. The equation (45) is the Bellman Equation. The outcome of the DP algorithm is: A sequence of policy functions fXt (St )gTt=0 A sequence of value functions fvt (St )gTt=0 The policy functions are rules that give the optimal value of the controls at any value of the state variables. In the simple consumption-savings model considered above, the policy functions are consumption functions that depend on the current value of available resources. 73 13. DP APPLICATION - LABOR SUPPLY 13.3 ECO2010, Summer 2022 Finite Horizon with Uncertainty In most applications the uncertainty originates in the stochastic evolution of factor prices which enter via the dynamic resource constraint. 
Introduce a stochastic variable "t and assume it exhibits the Markov property P ("t+1 j"t ; "t 1 ; " t 2 ; : : : ; "0 ) = P ("t+1 j"t ) This amounts to assuming an AR(1) process for "t : More speci…cally, assume "t+1 The decision problem is then given by: max E fXt gT t=0 ("t ) " subject to T X t # U (Xt ) t=0 (46) St+1 = Ft (St ; Xt ; "t ) "t+1 ("t ) S0 ; "0 are given Xt 2 t; St 2 t; "t 2 t Note that in period t expectations are formed about "t+1 while "t has been observed (i.e. is treated as a …xed state variable). The Bellman equation is now given by vt (St ; "t ) = max fU (Xt ) + E [vt+1 (St+1 ; "t+1 )j"t ]g Xt 2 t subject to St+1 = Ft (St ; Xt ; "t ) "t+1 which can be re-written as vt (St ; "t ) = max Xt 2 U (Xt ) + t ("t ) Z vt+1 (St+1 ; "t+1 )f ("t+1 j"t ) d"t+1 Solution: Since T is …nite, the solution proceeds again by backward induction. Complications: 1. An additional state-variable "t (so-called state-space augmentation) 2. Need to evaluate an integral A popular approach: assume that "t is discrete, or discretize (approximate) a continuous stochastic process: X E [vt+1 (St+1 ; "t+1 )j"t ] = vt+1 (St+1 ; "t+1 )P ("t+1 j"t ) t 74 13. DP APPLICATION - LABOR SUPPLY 13.4 ECO2010, Summer 2022 In…nite Horizon without Uncertainty In many dynamic models in economics, T ! 1: This can happen as the discrete time periods of decision become very small. In this case we cannot apply backward induction to solve the DP problem. The starting point is to focus on "stationary economies" For this purpose de…ne stationarity as follows: Conditional on the state variables, the value and policy functions do not depend on time. This implies that conditional on the state, the problem always looks identical irrespective of calendar time. If in fact the data is nonstationary, we need to perform a suitable normalization of the problem. 
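The discretized expectation introduced above is simply a probability-weighted sum over grid points of the shock. A minimal sketch with an assumed two-state Markov chain (the shock grid and transition matrix are illustrative, not from the notes):

```python
# Sketch of the discretization step: with the shock on a finite grid, the
# conditional expectation E[v_{t+1}(S', e')|e] is a weighted sum over
# next-period shock values, with weights from the transition matrix.
eps_grid = [-0.1, 0.1]            # discretized shock values (illustrative)
P = [[0.8, 0.2],                  # P[i][j] = P(e' = eps_grid[j] | e = eps_grid[i])
     [0.3, 0.7]]

def expected_v(v_next, i):
    """E[v_{t+1}(S', e')|e_i] for a fixed S': sum_j v_next[j]*P[i][j]."""
    return sum(v_next[j] * P[i][j] for j in range(len(eps_grid)))

v_next = [1.0, 2.0]               # v_{t+1}(S', e') evaluated on the shock grid
print(abs(expected_v(v_next, 0) - 1.2) < 1e-12)   # 0.8*1.0 + 0.2*2.0
```

In a full backward-induction solution, this sum replaces the integral over f(ε_{t+1}|ε_t) inside every Bellman maximization.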
As an example, recall the neo-classical growth model in Macroeconomics where each endogenous variable is normalized by the population size and the current value of total factor productivity. In the following we assume stationarity and hence drop the time indices. The policy and value function are now written as X (S) and v(S): The "next period" state and control variables are denoted by S 0 and X 0 :The law of motion becomes S 0 = F (S; X) The Bellman equation (BE) becomes: v(S) = max U (X) + v(S 0 ) X2 0 s.t. S = F (S; X) This implies v(S) = max fU (X) + v(F (S; X))g X2 Note that the same value function enters both sides of the BE, albeit evaluated at di¤erent points in the state space. In contrast, in the …nite horizon problem we had vt vs. vt+1 with the latter obtained from backward induction. Hence, when T ! 1, the BE is a functional equation because the unknown it solves for is a function (v( )) rather than a number. We need to solve for v( ) along with the policy function X (S): Does the BE have a unique solution in v( ) and X ( ) ? De…ne the mapping T (v) by T (v) = max fU (X) + v(F (S; X))g X2 A solution of the BE is a function v( ) such that v = T (v) which is a …xed point of T (v): At the …xed point, performing the maximization of U (X) + v(F (S; X)) for any point in the state space gives back v evaluated at any point in the state space. Theorem 70 (Existence). Let the state space and control space be compact, 2 (0; 1) , the law of motion F (S; X) be continuous, the utility function U (X) be continuous and bounded. Then the Bellman equation has a unique …xed point. 75 13. DP APPLICATION - LABOR SUPPLY ECO2010, Summer 2022 The Existence Theorem is an application of the Contraction Mapping Theorem (in mathematics also called the Banach Fixed Point Theorem) which suggests an algorithm for solution: Theorem 71 (Fixed Point Iteration). 
Let v 0 (S) be some continuous and bounded function de…ned on the compact state space : Let v 1 = T (v 0 ) = max U (X) + v 0 (F (S; X)) X2 Given this initialization, perform the following "value function iteration" v n = T (v n Then this iteration converges to the unique …xed point v = T (v): 1 ): As a measure of convergence of the Fixed Point Iteration algorithm, de…ne the metric = supS2 v n (S) v n 1 (S) and declare the algorithm as converged if falls below a certain tolerance level. 13.5 In…nite Horizon with Uncertainty Additional condition: "0 (") The stochastic process in " is not allowed to diverge. One possible restriction for compliance would be assuming AR(1) with j j < 1 for the evolution of ": The Bellman equation now becomes: v(S; ") = max U (X) + E v(S 0 ; "0 )j" X2 0 s.t. S = F (S; X; ") "0 (") Note that we need to perform the integration over "0 conditional on " to obtain E [v(S 0 ; "0 )j"] in each value function iteration. 13.6 Example: Intertemporal Labor Supply Problem with Finite T Recall the intertemporal labor supply problem (43): max fCt ;lt gT t=0 E T X t U (Ct ; lt ) t=0 subject to At+1 + Ct = wt ht + (1 + r)At A0 = 0 lt = H wt+1 Ct ; lt ; ht ht t (wt ) 0 76 13. DP APPLICATION - LABOR SUPPLY ECO2010, Summer 2022 Here the state variables in period t are (At ; wt ), and the controls are (Ct ; lt ; ht ). The period t Bellman equation is given by vt (At ; wt ) = max fU (Ct ; lt ) + E [vt+1 (At+1 ; wt+1 )jwt ]g Ct ;lt ;ht s.t. At+1 + Ct = wt ht + (1 + r)At A0 = 0 lt = H wt+1 Ct ; lt ; ht ht t (wt ) 0 First assume that at the optimum the constraints Ct ; lt ; ht 0 are not binding. 
Then the …rst-order condition for the maximization problem on the RHS of the Bellman equation with respect to Ct is given by @U (Ct ; lt ) @vt+1 (At+1 ; wt+1 ) = E wt @Ct @At+1 (47) and the Envelope Theorem yields @vt (At ; wt ) @U (Ct ; lt ) = (1 + r) @At @Ct (48) Combining (47) and (48) gives us @U (Ct+1 ; lt+1 ) @U (Ct ; lt ) = (1 + r)E wt @Ct @Ct+1 (49) The equation (49) is called the Consumption Euler Equation. It equates the expected marginal rate of substitution between consumption in two subsequent periods with the discounted gross return on savings (1 + r). The Euler Equation describes the optimal evolution of consumption over time and imposes a restriction on the growth rate of expected marginal utilities, re‡ecting the consumption-smoothing motive. Individuals use savings to smooth life-cycle trajectories of consumption. Next consider the optimal choice of leisure in period t: The …rst-order condition is given by @U (Ct ; lt ) @vt+1 (At+1 ; wt+1 ) = wt E wt (50) @lt @At+1 and combining (50) with (47) we obtain @U (Ct ; lt ) @U (Ct ; lt ) = wt @lt @Ct (51) Note that (51) is the same condition for the optimal choice of leisure and consumption as in the static model. Within each period the static marginal rate of substitution between consumption and leisure is always equal to the static wage. Individuals adjust their labor supply period-by-period to smooth consumption. 77 14. DYNAMIC OPTIMIZATION IN CONTINUOUS TIME 14 ECO2010, Summer 2022 Dynamic Optimization in Continuous Time In a continuous time dynamic optimization problem the agent optimizes over an integral function of the control variable and the evolution of the state variable is given by a differential equation. In contrast, in the discrete time case the agent optimizes over a sum of functions of the control variable, and the evolution of the state variable is given by a di¤erence equation. 
14.1 Discounting in Continuous Time

The discount factor δ in the discrete time model can be thought of as the present value of $1 invested at the interest rate ρ. That is, to produce a future return of $1 when the interest rate is ρ,

δ = 1/(1 + ρ)

needs to be invested, since this amount will accrue to δ(1 + ρ) = 1 after one period. However, suppose interest is accumulated n times during the period, with ρ/n earned each sub-period and the balance compounded. Then the present value is

δ = 1/(1 + ρ/n)^n

since after the full period this amount will accrue to δ(1 + ρ/n)^n = 1. Since

lim_{n→∞} (1 + ρ/n)^n = exp(ρ)

the present value of $1 with continuous compounding over one period is

δ = exp(−ρ)

and similarly over t periods

δ^t = exp(−ρt)

14.2 Continuous Time Optimization

A continuous time maximization problem typically takes the form

max_{a_t} ∫_0^T exp(−ρt) f(a_t, s_t) dt   (52)

subject to

ṡ_t = g(a_t, s_t)   (53)

with the notational convention ṡ_t = ∂s_t/∂t. Here:

a_t is the control (or choice) variable
s_t is the state variable
f and g are functions that describe the objective function and the evolution of the state variable, respectively
ρ is the discount rate.

In order to find the solution to the continuous time optimization problem, we first define the Hamiltonian function

H_t = f(a_t, s_t) + λ_t ṡ_t

where λ_t is a Lagrange multiplier term (known as a co-state variable). The Hamiltonian function has an economic interpretation: H_t describes the total impact on the intertemporal objective function of choosing a particular level of the control variable a_t at time t. The direct impact is captured by the f(a_t, s_t) term. In addition, a_t also has an indirect effect by changing the level of the state variable in the amount ṡ_t. Since λ_t tells us how much the objective function can be increased by one more unit of the state variable at the margin, it follows that λ_t ṡ_t summarizes the total indirect impact on the objective function.
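The limit underlying continuous discounting in Section 14.1, lim_{n→∞}(1 + ρ/n)^n = exp(ρ), is easy to confirm numerically; a minimal sketch with an illustrative rate:

```python
import math

# Sketch: as compounding becomes continuous, (1 + rho/n)^n approaches exp(rho),
# so the present value of $1 tends to exp(-rho). The rate is illustrative.
rho = 0.05
for n in (1, 12, 365, 10**6):
    pv = 1 / (1 + rho / n)**n        # present value with n compounding periods

print(abs(pv - math.exp(-rho)) < 1e-7)  # at n = 10^6 the limit is nearly reached
```

The gap shrinks roughly like ρ²/(2n), so even monthly compounding (n = 12) is already close to the continuous-time discount factor exp(−ρ).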
Thus, Ht summarizes the total impact of picking at on the inter-temporal objective function. The conditions for an optimum come next. Since Ht summarizes the total impact of choosing a given level of at on the inter-temporal utility function, it follows that the optimal solution is characterized by a …rst-order condition of the form @Ht =0 @at As in the Bellman equation, there is another, less intuitive condition involving the state variable. In the continuous time case this equation is that @Ht = @st t t called the co-state equation. These two conditions form a part of the so-called Pontryagin’s Maximum Principle characterizing the solution to a dynamic optimization problem. The …nal two constraints are the initial condition: s0 = s (54) sT = s (55) and the terminal condition, If the time horizon is in…nite, the terminal condition becomes the so-called transversality condition lim sT T = 0 (56) T !1 79 14. DYNAMIC OPTIMIZATION IN CONTINUOUS TIME ECO2010, Summer 2022 This states that in the limit either there has to be no units of the state variable left over (sT = 0) or if there are units left over, then their value has to be zero in terms of maximizing utility ( T = 0). To summarize, given the optimization problem (52) and (53) max at Z T [exp ( t) f (at ; st )] dt s.t. s_t = g(at ; st ) 0 we de…ne the Hamiltonian Ht = f (at ; st ) + t s_t and the conditions for the solutions @Ht =0 @at @Ht = @st =) t t fa (at ; st ) + t ga (at ; st ) =) fs (at ; st ) + =0 t gs (at ; st ) = (57) t t (58) This is combined with the initial condition (54) and either the terminal condition (55) or the transversality condition (56) to form a system of di¤erential equations that characterize the solution to the problem. 
14.3 Parallels with Discrete Time

We can set up an approximate version of the above problem in discrete time as
$$\max_{a_t} \sum_{t=0}^{T} \left(\frac{1}{1+\rho}\right)^t f(a_t, s_t)$$
subject to
$$s_{t+1} = s_t + g(a_t, s_t)$$
The transition equation is slightly different here than what we worked with previously, simply so that we can draw a parallel between the evolution of the state variable in discrete time,
$$s_{t+1} - s_t = g(a_t, s_t)$$
and the evolution of the state variable in continuous time,
$$\dot{s}_t = g(a_t, s_t)$$
The value function in discrete time is
$$v_t(s_t) = \max_{a_t} \left\{ f(a_t, s_t) + \frac{1}{1+\rho}\, v_{t+1}(s_{t+1}) \right\} \quad \text{s.t.} \quad s_{t+1} = s_t + g(a_t, s_t)$$
The first-order condition and the envelope condition are
$$0 = f_a(a_t, s_t) + \frac{1}{1+\rho}\, v'_{t+1}(s_{t+1})\, g_a(a_t, s_t) \tag{59}$$
$$v'_t(s_t) = f_s(a_t, s_t) + \frac{1}{1+\rho}\, v'_{t+1}(s_{t+1}) \left[1 + g_s(a_t, s_t)\right] \tag{60}$$
Rearranging the envelope condition (60),
$$v'_t(s_t) - \frac{1}{1+\rho}\, v'_{t+1}(s_{t+1}) = f_s(a_t, s_t) + \frac{1}{1+\rho}\, v'_{t+1}(s_{t+1})\, g_s(a_t, s_t)$$
and splitting the left-hand side,
$$\frac{v'_t(s_t) - v'_{t+1}(s_{t+1})}{1+\rho} + \frac{\rho\, v'_t(s_t)}{1+\rho} = f_s(a_t, s_t) + \frac{1}{1+\rho}\, v'_{t+1}(s_{t+1})\, g_s(a_t, s_t) \tag{61}$$
The above rearrangement of the envelope condition is unnecessary for the solution in the discrete-time case. However, the solution technique for the continuous-time case generates two equations that are identical to the first-order condition (59) and the rearranged envelope condition (61). Comparing the first-order condition (FOC) for the Hamiltonian case (57) given above with the FOC for the Bellman case (59), we can see that they are equivalent if
$$\lambda_t = \frac{1}{1+\rho}\, v'_{t+1}(s_{t+1}) \tag{62}$$
This makes sense since we defined $v_t(s_t)$ as the optimized value of the lifetime objective function, and hence $\frac{1}{1+\rho} v'_{t+1}(s_{t+1})$ is the change in the (discounted) maximized value of the objective function when we add one more unit of $s_t$ for next period, i.e. the equivalent of $\lambda_t$.
Using the equivalence (62), we can rewrite the discrete-time envelope condition (61) as
$$\lambda_{t-1} - \lambda_t + \rho \lambda_{t-1} = f_s(a_t, s_t) + \lambda_t g_s(a_t, s_t) \tag{63}$$
If we move from discrete time to continuous time, where the difference between two time periods becomes infinitesimally small, the envelope condition (63) can be thought of as being analogous to (58), i.e.
$$-\dot{\lambda}_t + \rho \lambda_t = f_s(a_t, s_t) + \lambda_t g_s(a_t, s_t)$$
which is Pontryagin's co-state equation, showing that it has parallels in the Bellman equation.

14.4 Application - Optimal Growth

One of the most famous macroeconomic models is the Ramsey/Cass/Koopmans model of an economy. The following is a simplified version of that model describing the behavior of a consumer/producer in the economy. The individual faces the following maximization decision
$$\max_{C_t} \int_0^T \exp(-\rho t)\, U(C_t)\, dt \quad \text{s.t.} \quad \dot{K}_t = F(K_t) - C_t - \delta K_t$$
where $C_t$ is consumption, $K_t$ is the capital stock, and $0 < \delta < 1$ is the rate of capital depreciation. The utility function is
$$U(C_t) = \frac{C_t^{1-1/\sigma}}{1-1/\sigma}$$
where $\sigma > 0$. The production function is
$$F(K_t) = K_t^\alpha$$
where $0 < \alpha < 1$. We can define the Hamiltonian as
$$H_t = U(C_t) + \lambda_t \dot{K}_t = U(C_t) + \lambda_t \left[F(K_t) - C_t - \delta K_t\right]$$
The initial level of capital $K_0$ is given, and we also assume that the transversality condition holds, so that
$$\lim_{t\to\infty} K_t \lambda_t = 0$$
The first-order condition and the co-state equation are
$$\frac{\partial H_t}{\partial C_t} = 0 \;\Longrightarrow\; U'(C_t) - \lambda_t = 0$$
$$\frac{\partial H_t}{\partial K_t} = \rho\lambda_t - \dot{\lambda}_t \;\Longrightarrow\; \lambda_t \left[F'(K_t) - \delta\right] = \rho\lambda_t - \dot{\lambda}_t$$
Combining these (using $\lambda_t = U'(C_t)$ and hence $\dot{\lambda}_t = U''(C_t)\dot{C}_t$) gives us the Euler equation for consumption in continuous time, which simplifies to
$$-\frac{U''(C_t)\dot{C}_t}{U'(C_t)} = F'(K_t) - \delta - \rho \tag{64}$$
Given the utility function (for the specific choice $\sigma = 2$)
$$U(C_t) = 2\sqrt{C_t}$$
and the production function $F(K_t) = K_t^\alpha$,
we have
$$U'(C_t) = C_t^{-1/2}, \qquad U''(C_t) = -\frac{1}{2} C_t^{-3/2}, \qquad F'(K_t) = \alpha K_t^{\alpha-1}$$
Using these in the Euler equation (64) yields
$$\frac{1}{2}\frac{\dot{C}_t}{C_t} = \alpha K_t^{\alpha-1} - \delta - \rho \tag{65}$$
Rearranging (65) yields
$$\frac{\dot{C}_t}{C_t} = 2\left(\alpha K_t^{\alpha-1} - \delta - \rho\right) \tag{66}$$
The growth rate of consumption is related to the difference between the net rate of return on capital ($F'(K_t) - \delta$) and the discount rate $\rho$. If the rate of return on capital and the rate at which we discount the future are identical, i.e. $F'(K_t) - \delta = \rho$, then consumption will remain constant over time. If $F'(K_t) - \delta > \rho$ then consumption will be rising over time. If $F'(K_t) - \delta < \rho$ then consumption will be falling over time.

The Euler equation (66), along with the flow budget constraint $\dot{K}_t = F(K_t) - C_t - \delta K_t$, the initial condition $K_0 = \bar{K}$, and the transversality condition $\lim_{t\to\infty} K_t U'(C_t) = 0$, constitute a system of non-linear differential equations that make up the solution to the problem. It can be shown that the system is saddle-path stable and that the steady state of the model (setting $\dot{C}_t = 0$ and $\dot{K}_t = 0$) is at
$$K^* = \left(\frac{\alpha}{\delta+\rho}\right)^{\frac{1}{1-\alpha}}, \qquad C^* = \left(\frac{\alpha}{\delta+\rho}\right)^{\frac{\alpha}{1-\alpha}} - \delta\left(\frac{\alpha}{\delta+\rho}\right)^{\frac{1}{1-\alpha}}$$
The differential equation system with the solution is depicted in the phase diagram below.

[Phase diagram: the $\dot{C}_t = 0$ and $\dot{K}_t = 0$ loci and the saddle path]

15 Numerical Optimization

There are two broad classes of optimization algorithms:

1. Derivative-free methods - very useful if the objective function is not smooth, or if its derivatives are expensive to compute

2. Derivative-based methods (Newton-type methods) - very useful if objective function derivatives or derivative estimates are readily available

15.1 Golden Search

A widely used derivative-free method is the Golden search method. Its principle is similar to the Bisection method for root-finding. Consider the problem
$$\max f(x) \quad \text{over } [a, b]$$
where $f$ is continuous, concave and unimodal. Pick $x_1, x_2 \in (a, b)$ with $x_1 < x_2$.
Evaluate $f(x_1), f(x_2)$ and replace $[a, b]$ with $[a, x_2]$ if $f(x_1) > f(x_2)$, or with $[x_1, b]$ if $f(x_1) < f(x_2)$. A local maximum must be contained in the new interval. Repeat this procedure until the length of the interval is shorter than some desired tolerance level.

A key issue is how to pick the interior evaluation points. Golden search uses an optimal reduction factor for the search interval, minimizing the number of function evaluations by re-using interior points from previous iterations. The choice of $x_1, x_2$ makes use of the properties of the golden ratio $\varphi = (\sqrt{5}+1)/2$, named by the ancient Greeks. Note that $1/\varphi = \varphi - 1$ and $1/\varphi^2 = 1 - 1/\varphi$. The points $x_1, x_2$ are optimally selected as
$$x_1 = b - \frac{b-a}{\varphi}, \qquad x_2 = a + \frac{b-a}{\varphi}$$
After one iteration of the search, it is possible that we will discard $a$ and replace it with $a' = x_1$. Then the new value to use as $x_1$ will be
$$x_1' = b - \frac{b-a'}{\varphi} = b - \frac{b-x_1}{\varphi} = b - \frac{b-a}{\varphi^2} = b - (b-a)\left(1 - \frac{1}{\varphi}\right) = a + \frac{b-a}{\varphi} = x_2$$
This implies that we can re-use a point that we already have: we don't need a new evaluation $f(x_1')$ and can use $f(x_2)$ instead. Similarly, if we update to $b' = x_2$, then $x_2' = x_1$ and we can re-use that point.

15.2 Nelder-Mead

For multivariate functions, a widely-used derivative-free optimization method is the Nelder-Mead algorithm. The algorithm begins by evaluating the objective function at $n+1$ points, forming a so-called simplex in the $n$-dimensional Euclidean space. At each iteration, the algorithm:

- Determines the point on the simplex with the lowest function value and alters that point by reflecting it through the opposite face of the simplex (Reflection).

- If the reflection finds a new point that is higher than all the others on the simplex, the algorithm tries expanding the simplex further in this direction (Expansion).
- If the reflection fails to produce a point that is at least as good as the second-worst point, the algorithm contracts the simplex by halving the distance between the original point and its opposite face (Contraction).

- If even the contracted point fails to produce an improvement, the algorithm shrinks the entire simplex toward the best point (Shrinkage).

2-dimensional space illustration: the original triangle is in light grey, the new triangle in dark grey.

Animated demonstration:
https://www.youtube.com/watch?v=HUqLxHfxWqU

15.3 Newton-Raphson Method

The Newton-Raphson method is a popular derivative-based optimization method, identical to applying Newton's method to finding the root of the gradient of the objective function. The Newton-Raphson method uses successive quadratic approximations to the objective function. Given $x^{(k)}$, the subsequent iterate is computed by maximizing the second-order Taylor approximation to $f$ about $x^{(k)}$:
$$f(x) \approx f(x^{(k)}) + f'(x^{(k)})(x - x^{(k)}) + \frac{1}{2}(x - x^{(k)})^T f''(x^{(k)})(x - x^{(k)})$$
Solving the first-order condition
$$f'(x^{(k)}) + f''(x^{(k)})(x - x^{(k)}) = 0$$
yields the iteration rule
$$x^{(k+1)} = x^{(k)} - \left[f''(x^{(k)})\right]^{-1} f'(x^{(k)})$$
The Newton-Raphson method can be very sensitive to starting values.

Animated demonstration:
http://bl.ocks.org/dannyko/raw/ffe9653768cb80dfc0da/

15.4 Quasi-Newton Methods

There are a number of modifications to the Newton-Raphson method, called quasi-Newton methods. Some employ an approximation $B(x^{(k)})$ to the inverse Hessian $\left[f''(x^{(k)})\right]^{-1}$, using a search direction
$$d^{(k+1)} = -B(x^{(k)})\, f'(x^{(k)})$$
called the Newton step. Robust quasi-Newton methods shorten or lengthen the Newton step in order to obtain improvement in the objective function. This class includes the Davidon-Fletcher-Powell (DFP), Broyden-Fletcher-Goldfarb-Shanno (BFGS), and Berndt-Hall-Hall-Hausman (BHHH) updating methods.
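To make the golden search procedure of Section 15.1 concrete, here is a minimal Python sketch (the objective function and bracket are illustrative choices, not from the notes). Note how each iteration re-uses one interior point and its function value, so only one new evaluation of $f$ is needed per step:

```python
# Golden-section search for max f(x) on [a, b], with f concave and unimodal.
# Illustrative sketch: each iteration re-uses one interior point, so only
# one new function evaluation is required per step.

PHI = (5 ** 0.5 + 1) / 2  # golden ratio

def golden_search(f, a, b, tol=1e-8):
    x1 = b - (b - a) / PHI
    x2 = a + (b - a) / PHI
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 > f2:               # maximum lies in [a, x2]
            b, x2, f2 = x2, x1, f1    # old x1 becomes the new x2 (re-used)
            x1 = b - (b - a) / PHI
            f1 = f(x1)
        else:                     # maximum lies in [x1, b]
            a, x1, f1 = x1, x2, f2    # old x2 becomes the new x1 (re-used)
            x2 = a + (b - a) / PHI
            f2 = f(x2)
    return (a + b) / 2

# Hypothetical objective: f(x) = -(x - 2)^2, maximized at x = 2.
xmax = golden_search(lambda x: -(x - 2.0) ** 2, 0.0, 5.0)
print(round(xmax, 6))  # 2.0
```

Since the interval shrinks by the factor $1/\varphi \approx 0.618$ per iteration, convergence to machine-level tolerance takes only a few dozen function evaluations.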
Mathematica demonstration: Minimizing the Rosenbrock function
http://demonstrations.wolfram.com/MinimizingTheRosenbrockFunction/

Part III Statistical Analysis

16 Introduction to Probability

Statistical analysis involves the fundamental concept of "probability". There are several different interpretations of probability. Two key categories are:

1. Frequentist probability: the limit of the relative frequency of an event's occurrence in a large number of trials.

2. Bayesian probability: a quantity that is assigned to represent a state of knowledge (also called belief).

For the purpose of illustrating the concepts involved, in this section we will use the well-known simple regression model
$$y = \beta_0 + \beta_1 x + \varepsilon$$
We will consider the hypothesis $H_0: \beta_1 = 0$, $H_1: \beta_1 \neq 0$. The concepts discussed generalize to more elaborate models.

16.1 Randomness and Probability

Population is defined as the entire collection of elements about which information is desired.

Random process (or experiment) is defined as a procedure, involving a given population, that can conceptually be repeated many times, leading to certain outcomes (qualitative or abstract).

Sample space is defined as the set of all possible outcomes of the random process. An event is defined as a subset of the sample space. For any given event, only one of two possibilities may hold: it occurs or it does not.

The relative frequency of occurrence of an event, observed in a number of repetitions of the experiment, is a measure of the frequentist probability of that event. Denote by $n_t$ the total number of random experiments conducted and by $n_x$ the number of these trials in which the event $x$ occurred. The frequentist probability $P(x)$ of the event $x$ occurring is
$$P(x) = \lim_{n_t \to \infty} \frac{n_x}{n_t}$$
In the Bayesian approach, "randomness" = uncertainty. The reason something is random is not that it is generated by a "random process" but that it is unknown. Hence, unknown parameters are treated as random variables.
Typically, Bayesians call random variables "observables" and parameters "unobservables".

In frequentist statistics: data (events) are a repeatable random sample, the underlying parameters remain constant during this repeatable process, and parameters are fixed. In Bayesian statistics: data are observed from the realized sample, parameters are unknown and therefore described probabilistically, and data are fixed.

16.1.1 Frequentist Framework

Frequentist Random Process: (Note: 1 arrow = 1 piece of "evidence", i.e. in general 1 data set consisting of many data points.)

What does it mean to have 95% confidence? The frequentist confidence interval traps the truth in 95% of experiments. To define anything frequentist, you have to imagine repeated experiments.

Frequentist Hypothesis Testing: for frequentist inference and hypothesis testing, we need to imagine running the random experiment again and again. We always need to think about many other "potential" datasets, not just the one we actually have to analyze. How does Bayesian inference differ?

16.1.2 Bayesian Framework

The Bayes Theorem tells us:
$$\underbrace{p(\theta|Y)}_{\text{posterior}} \;\propto\; \underbrace{f(Y|\theta)}_{\text{likelihood}} \;\underbrace{\pi(\theta)}_{\text{prior}}$$
Prior distribution $\pi(\theta)$: a distributional model of what we know about the parameter $\theta$ excluding the information in the data.

Likelihood $f(Y|\theta)$: based on modeling assumptions, how (relatively) likely the data $Y$ are if the truth is $\theta$.

Keep adding data, and updating knowledge as data become available, and knowledge will (in general) concentrate around the true $\theta$.

[Figure: likelihood function (left) and posterior probability (right)]

Example: here is exactly the same idea, in practice. During the search for Air France 447, from 2009-2011, knowledge about the black box location was described via probability, using Bayesian inference. Eventually, the black box was found in the red area.
No replications of data are involved (no replicate plane crashes).

16.2 Inference

The frequentist approach is based on pre-data considerations: "If the hypothesis being tested is in fact true, what is the probability that we shall get data indicating that it is true?"

Bayesians frame their argument in terms of post-data considerations: "What is the probability, conditional on the data, that the hypothesis is true?"

For frequentists the probability of a hypothesis is not meaningfully defined (it is a priori fixed to equal either 0 or 1).

A frequentist 95% Confidence Interval (CI):

- Contains $\theta_0$ with probability 0.95 over hypothetically many random experiments, only before one has seen the data.
- After the actual data has been observed, the probability is either 0 or 1 ($\theta_0$ is either inside the CI, or outside of the CI).
- With a large number of repeated samples, 95% of the calculated confidence intervals would include the true value of the parameter.
- (In practice, CIs are sometimes misinterpreted as guides to post-sample uncertainty.)

A Bayesian 95% Credible Set (CS):

- Has 95% coverage probability after one has seen the data.
- $\theta$ is treated as a random variable with a proper distribution quantifying the uncertainty about $\theta_0$.
- Expresses the posterior probability that $\theta_0$ is inside the CS.

Where do priors come from? Recall from above: prior = model based on previous experience of the analyst. Priors come from all data external to the current study. Priors give the analyst the option of incorporating additional information beyond what is contained in the current data set. Priors can be informative (e.g. $N(\mu_0, \sigma_0^2)$) or diffuse (e.g. $\pi(\theta) \propto c$ for some constant $c \in \mathbb{R}$).

Incorrect inference resulting from a misleading restrictive prior is equivalent to a frequentist model misspecification. Remedies: qualitative priors (e.g. monotonicity or smoothness restrictions), or non-parametric priors, providing a flexible modeling framework that is quickly dominated by the data evidence.
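As a small sketch of how an informative versus a diffuse prior plays out (this example and its numbers are illustrative, not from the notes): for a normal likelihood with known variance and a normal prior $N(\mu_0, \sigma_0^2)$, the posterior mean is a precision-weighted average of the prior mean and the sample mean, so a diffuse prior ($\sigma_0^2 \to \infty$) lets the data dominate:

```python
# Conjugate normal-normal updating: a sketch illustrating informative vs.
# diffuse priors. Assumes y_i ~ N(theta, s2) with known s2 and prior
# theta ~ N(mu0, tau0sq); the specific numbers below are hypothetical.

def normal_posterior(ybar, n, s2, mu0, tau0sq):
    """Posterior mean and variance of theta given a sample mean ybar of size n."""
    prec = n / s2 + 1.0 / tau0sq              # posterior precision
    post_var = 1.0 / prec
    post_mean = post_var * (n * ybar / s2 + mu0 / tau0sq)
    return post_mean, post_var

ybar, n, s2 = 1.5, 20, 4.0                    # hypothetical data summary

# Informative prior: pulls the posterior mean toward mu0 = 0.
m_inf, v_inf = normal_posterior(ybar, n, s2, mu0=0.0, tau0sq=0.1)

# Nearly diffuse prior: the posterior mean stays close to the sample mean.
m_dif, v_dif = normal_posterior(ybar, n, s2, mu0=0.0, tau0sq=1e6)

print(m_inf, m_dif)  # informative prior shrinks toward 0; diffuse stays near 1.5
```

With the numbers above, the informative prior gives a posterior mean of 0.5, while the diffuse prior gives essentially the sample mean 1.5, consistent with the "quickly dominated by data evidence" remark.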
Reference: Greenberg (2012), Geweke et al. (2011)

17 Measure-Theoretic Probability

17.1 Elements of Measure Theory

A rigorous measure-theoretic definition of probability is based on the concept of a probability measure, which is a special case of a measure of sets. Intuitively, we can think of a measure of a set $A$ as an integral over the set's "region":
$$\mu(A) = \int_A dx \qquad \text{or} \qquad \mu(A) = \int_A p(x)\, dx$$
for some function $p(x)$. Examples:

- If $A$ is a geometric object, such as a cone, then $\mu(A)$ can be the volume of $A$.
- If $A$ is a physical object, such as a body, then $\mu(A)$ can be the physical mass of $A$.
- If $A$ is a random event, then $\mu(A)$ can be the probability mass of $A$ (the probability of $A$ occurring).

Let $\mu(A) = \int_A dx$. Integrals have a number of decomposition properties, including:

- $\mu(\emptyset) = 0$ (the integral over an empty set is zero).
- For pairwise disjoint sets $A_i$, $\mu(A_1 \cup A_2) = \mu(A_1) + \mu(A_2)$.
- If $B \subseteq A$, then $\mu(B) \leq \mu(A)$ and $\mu(A \setminus B) = \mu(A) - \mu(B)$.

For a non-empty set $S$, it is not always possible to integrate on $\mathcal{P}(S)$, i.e. the power set of $S$. Non-measurable sets are typically difficult to construct and play no role in applications, but they do exist (e.g. a Vitali set). We need to restrict attention to a subset $\mathcal{A}$ of $\mathcal{P}(S)$ (the "measurable" sets).

17.1.1 Measure Space

Definition 72. Let $\mathcal{A}$ be a non-empty class of subsets of $S$. $\mathcal{A}$ is an algebra if
1) $A^c \in \mathcal{A}$ whenever $A \in \mathcal{A}$
2) $A_1 \cup A_2 \in \mathcal{A}$ whenever $A_1, A_2 \in \mathcal{A}$

Definition 73. $\mathcal{A}$ is a $\sigma$-algebra if $\mathcal{A}$ is an algebra and
2') $\bigcup_{n=1}^{\infty} A_n \in \mathcal{A}$ whenever $A_n \in \mathcal{A}$ for $n = 1, 2, \ldots$

Since $\mathcal{A}$ is non-empty, (1) and (2) imply $\emptyset \in \mathcal{A}$ and $S \in \mathcal{A}$, because
$$A \in \mathcal{A} \;\Rightarrow\; A^c \in \mathcal{A} \;\Rightarrow\; A \cup A^c = S \in \mathcal{A} \;\Rightarrow\; S^c = \emptyset \in \mathcal{A}$$
We can generate an algebra or $\sigma$-algebra from any collection of subsets by adding to the collection the complements and unions of all its elements. The smallest $\sigma$-algebra is $\{S, \emptyset\}$. The Borel $\sigma$-algebra $\mathcal{B}(S)$ is "generated" as follows:

1. Start with $\mathcal{T}$, the collection of all open sets in $S$.
2. Keep adding sets according to the definition of a $\sigma$-algebra.

3. The resulting $\sigma$-algebra with the fewest sets is $\mathcal{B}(S)$.

As a result, $\mathcal{B}(S)$ contains all open and closed sets. This is the usual $\sigma$-algebra we work with when $S = \mathbb{R}$.

Let $S$ be a non-empty set and $\mathcal{A}$ a $\sigma$-algebra of subsets of $S$. The pair $(S, \mathcal{A})$ is called a measurable space. A mapping $\mu: \mathcal{A} \to [0, \infty]$ is called a measure if:

1. $\mu(\emptyset) = 0$

2. $\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i)$ if $A_i \in \mathcal{A}$ are pairwise disjoint.

The tuple $(S, \mathcal{A}, \mu)$ forms a measure space.

17.1.2 Density

Intuitively, a density $f$ is a function that transforms a measure $\mu_1$ into a measure $\mu_2$ by pointwise re-weighting on $S$. Let
$$\mu_2(A) = \int_A d\mu_2(x) = \int_A f(x)\, d\mu_1(x)$$
Derivative notation:
$$d\mu_2(x) = f(x)\, d\mu_1(x) \qquad \text{or} \qquad \frac{d\mu_2}{d\mu_1}(x) = f(x)$$
Here $f$ is a function that is integrated to obtain $\mu_2$ and is hence a "derivative" of $\mu_2$. For given $\mu_1, \mu_2$ there may not always exist a corresponding density $f$: "re-weighting" by $f$ cannot work if $\mu_1(A) = 0$ and $\mu_2(A) \neq 0$. If such a case is ruled out, i.e. if $\mu_1(A) = 0$ implies $\mu_2(A) = 0$ for all $A \in \mathcal{A}$, then $\mu_2$ is absolutely continuous with respect to $\mu_1$, denoted $\mu_2 \ll \mu_1$. The Radon-Nikodym theorem tells us that $\mu_2$ has a density with respect to $\mu_1$ if and only if $\mu_2 \ll \mu_1$.

17.1.3 Important Measures

Let $A \subseteq S$ and $x \in S$.

Dirac measure $\delta_x$

- Defined as the set function
$$\delta_x(A) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases}$$
- $\delta_x$ is also called a point mass at $x$, or an atom on $x$.

Counting measure

- Defined as the set function
$$\mu(A) = \begin{cases} \#A & \text{if } A \text{ is finite} \\ \infty & \text{if } A \text{ is infinite} \end{cases}$$
where $\#A$ is the number of elements in $A$.

Lebesgue measure $\lambda$

- More difficult to define, so we provide an intuitive geometric interpretation in $\mathbb{R}^n$:
- $\lambda([a, b]) = \lambda((a, b)) = b - a$, the length of an interval
- $\lambda$ of a surface segment in $\mathbb{R}^2$ is its area
- $\lambda$ of a body in $\mathbb{R}^3$ is its volume.
Probability measure $P$

- Mathematically, $P(A) = \mu(A)$ in a measurable space $(S, \mathcal{A})$ with the property that $\mu(S) = 1$.
- $P$ includes $\delta_x$ and the counting measure as special cases following the normalization $\mu(S) = 1$.
- The probability measure $P$ is thus a large family of measures.

17.1.4 Probability Space

The tuple $(S, \mathcal{A}, P)$ forms a probability space. In this context, $S$ is interpreted as the set of all possible outcomes of a random process, and $P(A)$ is interpreted as the probability of event $A$ occurring. When $S$ is discrete, we usually take $\mathcal{A} = \mathcal{P}(S)$. When $S = \mathbb{R}$, or some subinterval thereof, we usually take $\mathcal{A} = \mathcal{B}$. In cases when $S$ is discrete with a finite number of elements, $P$ is often given by a normalized counting measure. In continuous cases, $P$ is often given by a normalized Lebesgue measure.

17.2 Random Variable

The environment of $(S, \mathcal{A}, P)$ is useful for proofs, but not easy to work with for modeling. It is more convenient to work with a transformation of $S$ using the notion of a random variable. Formally, a random variable $X$ is a mapping
$$X: S \to \mathbb{R}$$
with the required property that $X$ is measurable, i.e.
$$\mathcal{A}_X = \left\{X^{-1}(B) : B \in \mathcal{B}\right\} \subseteq \mathcal{A}$$
The distribution of the r.v. $X$ is the probability measure $P_X$ defined by
$$P_X(B) = P\left(X^{-1}(B)\right), \qquad B \in \mathcal{B}$$
Using the random variable $X$ we have transferred $(S, \mathcal{A}, P) \to (\mathbb{R}, \mathcal{B}, P_X)$.

Example: We toss two dice, in which case the sample space is $S = \{(1,1), (1,2), \ldots, (6,6)\}$. We can define two random variables, the Sum and the Product:
$$X_1(S) = \{2, 3, 4, 5, \ldots, 12\}$$
$$X_2(S) = \{1, 2, 3, 4, 5, 6, 8, 9, \ldots, 36\}$$

17.3 Conditional Probability and Independence

In many cases we have some information available (call this event $B$) and wish to predict some other event $A$ given that we have observed $B$.

Definition 74. The probability of an event $A$ given an event $B$, denoted $P(A|B)$, is obtained as
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
when $P(B) > 0$. Note that $P(\cdot|B)$ is a probability measure. In particular:

1. $P(A|B) \geq 0$
2. $P(S|B) = 1$

3. $P\left(\bigcup_{i=1}^{\infty} A_i \,\Big|\, B\right) = \sum_{i=1}^{\infty} P(A_i|B)$ for any pairwise disjoint events $\{A_i\}_{i=1}^{\infty}$.

When $A$ and $B$ are mutually exclusive events, $P(A|B) = 0$. When $A \subseteq B$ then $P(A|B) = P(A)/P(B) \geq P(A)$, with strict inequality unless $P(B) = 1$. When $B \subseteq A$ then $P(A|B) = 1$.

Definition 75. The law of total probability is stated as
$$P(A) = P(A \cap B) + P(A \cap B^c)$$
Example: say we have information on 100 stocks, ordered into the following table:

                    Today
                    up    down
  Yesterday  up     53    25     78
             down   15     7     22
                    68    32    100

Letting $A = \{\text{up Yesterday}\}$ and $B = \{\text{up Today}\}$, the information here is effectively $P(A \cap B)$, $P(A \cap B^c)$, $P(A^c \cap B)$, $P(A^c \cap B^c)$. To convert the information into marginal probabilities, we can use the law of total probability. Conditional probabilities can be calculated using the previous definition. E.g.
$$P(\text{up Yesterday}) = \frac{78}{100}, \qquad P(\text{down Yesterday}) = \frac{22}{100}$$
$$P(\text{up Today}\,|\,\text{up Yesterday}) = \frac{53/100}{78/100} = \frac{53}{78}$$

Definition 76 (Independence). Suppose $P(A), P(B) > 0$. Then $A$ and $B$ are independent events if

1. $P(A \cap B) = P(A)P(B)$ [symmetric in $A$ and $B$; does not require $P(A), P(B) > 0$]
2. $P(A|B) = P(A)$
3. $P(B|A) = P(B)$

17.4 Bayes Rule

The Bayes Rule is a formula for updating conditional probabilities, and as such is used in both frequentist and Bayesian inference.

Definition 77 (Bayes Rule). For two events $A$ and $B$,
$$P(B|A) = \frac{P(A|B)P(B)}{P(A|B)P(B) + P(A|B^c)P(B^c)}$$
This follows from the definition of conditional probability,
$$P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)$$
and the law of total probability,
$$P(A) = P(A \cap B) + P(A \cap B^c) = P(A|B)P(B) + P(A|B^c)P(B^c)$$
Equivalently, the Bayes Rule is often stated as
$$P(B|A) = \frac{P(A|B)P(B)}{P(A)} \propto P(A|B)P(B)$$
$P(B)$ is often called the "prior" probability, $P(A|B)$ the "likelihood", and $P(B|A)$ the "posterior" probability.
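As a numerical illustration of the Bayes Rule in the disease-testing setting of the demonstration linked below, here is a minimal Python sketch; the prevalence and test-accuracy numbers are hypothetical:

```python
# Bayes Rule: posterior probability of event B given evidence A.
# Here B = "sick", A = "tested positive"; the prevalence and test
# accuracy numbers below are hypothetical, for illustration only.

def bayes_posterior(p_b, p_a_given_b, p_a_given_bc):
    """P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B^c)P(B^c)]."""
    numer = p_a_given_b * p_b
    denom = numer + p_a_given_bc * (1.0 - p_b)
    return numer / denom

prior = 0.01          # P(sick): 1% prevalence (hypothetical)
sensitivity = 0.99    # P(positive | sick)
false_pos = 0.05      # P(positive | healthy)

posterior = bayes_posterior(prior, sensitivity, false_pos)
print(round(posterior, 4))  # 0.1667: most positives are false positives
```

Even with a quite accurate test, the low prior $P(B)$ keeps the posterior modest, which is the point of the demonstration.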
Mathematica demonstration: Probability of being sick after having tested positive for a disease
http://demonstrations.wolfram.com/ProbabilityOfBeingSickAfterHavingTestedPositiveForADiseaseBa/

18 Random Variables and Distributions

Associated with each random variable is the (cumulative) distribution function
$$F_X(x) = P_X(X \leq x)$$
defined for all $x \in \mathbb{R}$. This function effectively replaces $P_X$.

A discrete random variable can take on a discrete set of values (e.g. gender, age, number of years of schooling). The cdf of $X$ is a step function. The difference between the cdfs of two adjacent elements in its domain is the probability mass function (pmf):
$$f_X(x_j) = P_X(X = x_j)$$
Example: $x \in \{0, 1, 2, 3, 4, 5\}$

  x_j        0     1     2     3     4     5
  f_X(x_j)   0.70  0.10  0.08  0.06  0.04  0.02
  F_X(x_j)   0.70  0.80  0.88  0.94  0.98  1.00

A continuous random variable can take on a continuum of values (e.g. GDP growth rate, stock returns). The cdf of $X$ is a continuous function. The (Radon-Nikodym) derivative of the cdf is the probability density function (pdf). Note: $P_X(X = x) = 0$; the set $\{X = x\}$ is an example of a set of measure zero.

[Figure: example - distribution of commuting time]

Consider two continuous random variables $X, Y$ with a joint distribution $F(X, Y)$ and joint density $f(X, Y)$. The marginal distribution of $Y$ is just another name for its probability distribution $F(Y)$, irrespective of $X$. The term is used to distinguish $F(Y)$ alone from the joint distribution of $Y$ and another r.v. Similarly for the marginal density:
$$P(Y = y_i) = \sum_{j=1}^{k} P(X = x_j, Y = y_i)$$
$$f_Y(y) = \int f(x, y)\, dx$$
The distribution of a random variable $Y$ conditional on another random variable $X$ taking on a specified value is called the conditional distribution of $Y$ given $X$.
In general, for the continuous case
$$f_{Y|X}(y|x) = \frac{f_{Y,X}(y, x)}{f_X(x)}$$
and for the discrete case
$$P(Y = y\,|\,X = x) = \frac{P(Y = y, X = x)}{P(X = x)}$$

Mathematica demonstration: bivariate joint and conditional Gaussian
http://demonstrations.wolfram.com/TheBivariateNormalAndConditionalDistributions/

18.1 Moments

The $k$-th moment (or non-central moment) is the quantity
$$\mu'_k = E\left[X^k\right]$$
The $k$-th moment about the mean (or central moment) is the quantity
$$\mu_k = E\left[(X - E[X])^k\right]$$
Orders:

- the 0th moment is one
- the 1st (non-central) moment is the mean
- the 2nd central moment is the variance
- the 3rd (standardized central) moment is skewness
- the 4th (standardized central) moment is (excess) kurtosis

Moments of order $k > 2$ are typically called "higher-order moments". Moments of order higher than a certain $k$ may not exist for certain distributions.

18.1.1 Moment Generating Function

For a random variable $X$ with distribution function $F$, its moment generating function (MGF) is
$$m_X(t) = E[\exp(tX)] = \int \exp(tx)\, dF(x) \tag{67}$$
The MGF is also known as the Laplace transformation of the density of $X$. The $r$-th derivative evaluated at zero is the $r$-th uncentered moment of $X$:
$$m_X^{(r)}(t) = E\left[\frac{d^r}{dt^r}\exp(tX)\right] = E\left[X^r \exp(tX)\right]$$
and thus
$$m_X^{(r)}(0) = E[X^r]$$
A limitation of the MGF is that it does not exist (i.e. is not finite) for many random variables. Finiteness of the integral (67) requires the tail of the density of $X$ to decline exponentially. This excludes thick-tailed distributions such as the Pareto.

18.1.2 Characteristic Function

This limitation is removed if we consider the characteristic function (CF) of $X$, which is defined as
$$\varphi_X(t) = E[\exp(itX)] = \int \exp(itx)\, dF(x)$$
where $i = \sqrt{-1}$. The CF is also known as the Fourier transformation of the density of $X$. The CF exists for all random variables and all values of $t$ since $\exp(itX) = \cos(tX) + i\sin(tX)$ is bounded. Similarly to the MGF, the $r$-th derivative of the characteristic function evaluated at zero takes the simple form
$$\varphi_X^{(r)}(0) = i^r E[X^r]$$
Similarly to the MGF, the rth derivative of the characteristic function evaluated at zero takes the simple form (r) 'X (0) = ir E[X r ]: 'X (t) is named the characteristic function since it completely and uniquely characterizes the distribution F (X) of X. Theorem 78 (Power Series Expansion of CF). If 'X (t) converges on some open interval containing t = 0; then X has moments of any order and 'X (t) = 1 X (it)k k=0 k! E[X k ] The Theorem implies that 'X (t); and hence F (X); are uniquely characterized by the collection of their moments. 2 t2 =2): Example 1: if X N ( ; 2 ) then 'X (t) = exp(i t Example 2: if X P oisson( X ) then 'X (t) = exp( (exp(it) 1)): Theorem 79 (Convolution Theorem). Let X1 ; X2 ; :::; Xn be independent random variables P and let S = ni=1 Xi : Then n Y 'S (t) = 'Xi (t): i=1 102 18. RANDOM VARIABLES AND DISTRIBUTIONS Example: Let X P oisson( X ), Y P oisson( ECO2010, Summer 2022 Y ); and Z = X + Y: Then 'Z (t) = 'X (t)'Y (t) = exp( X (exp(it) = exp(( and hence Z 18.2 P oisson( X + X + 1)) exp( Y ) (exp(it) Y (exp(it) 1)) 1)) Y ): Transformations of Random Variables Goal: given a distribution FX (X) of a random variable X; …nd a distribution FY (Y ) of the r.v. Y = g(X) where g is a deterministic function. Denote by fX and fY the respective density functions. 18.2.1 Monotonic Transformations Let g(X) be a monotonic (i.e. increasing or decreasing) transformation. 1. Case g 10 (y) >0: FY (y) = P (Y y) = P (g 1 (Y ) 1 g (y)) = P (X 1 g (y)) = FX (g 1 (y)) Then, fY (y) = FY0 (y) = FX0 (g 2. Case g 10 (y) 1 10 (y))g (y) = fX (g 1 (y))g 10 (y) <0: FY (y) = P (Y 1 y) = P (g (Y ) g FX0 (g 1 1 (y)) = P (X g 1 (y)) = 1 FX (g 1 (y)) Then, fY (y) = FY0 (y) = Note that here g 10 (y) 10 (y))g (y) = fX (g 1 (y))g 10 (y) > 0. 
Hence, in both cases,
$$f_Y(y) = f_X(g^{-1}(y)) \left|g^{-1\prime}(y)\right|$$
Example: When $X$ has a standard normal density function, $X \sim N(0, 1)$, obtain the density of $Y = \mu + \sigma X$. Since
$$X = g^{-1}(Y) = \frac{Y - \mu}{\sigma}, \qquad g^{-1\prime}(Y) = \frac{1}{\sigma}$$
therefore
$$f_Y(y) = f_X(g^{-1}(y)) \left|g^{-1\prime}(y)\right| = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$$

18.2.2 Non-Monotonic Transformation

For illustration we will consider a special case where $g(X)$ is a quadratic transformation. More general cases follow by direct extension. Let
$$Y = g(X) = X^2$$
Then,
$$F_Y(y) = P(Y \leq y) = P(X^2 \leq y) = P\left(-y^{1/2} < X \leq y^{1/2}\right) = F_X\left(y^{1/2}\right) - F_X\left(-y^{1/2}\right)$$
and hence
$$f_Y(y) = F_Y'(y) = \frac{1}{2} y^{-1/2} \left[f_X\left(y^{1/2}\right) + f_X\left(-y^{1/2}\right)\right]$$
Example: Suppose $X \sim N(0, 1)$, and obtain the density of $Y = X^2$. Then,
$$f_X(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right)$$
$$f_Y(y) = \frac{1}{2} y^{-1/2} \cdot \frac{2}{\sqrt{2\pi}} \exp\left(-\frac{y}{2}\right) = \frac{1}{\sqrt{2\pi y}} \exp\left(-\frac{y}{2}\right)$$
Note that we used $f_X(x) = f_X(-x)$, which holds for $N(0, 1)$ by symmetry around zero. The resulting $f_Y(y)$ is the density of the $\chi^2$ distribution with 1 degree of freedom.

18.3 Parametric Distributions

Families of distributions (or densities) are often parametrized by $\theta \in \Theta \subseteq \mathbb{R}^n$ and expressed in the form
$$\{F_X(x|\theta) : \theta \in \Theta\}$$
We will take a closer look at several continuous distributions.

Chi-square distribution: if $Z_i \sim N(0, 1)$ and $X = \sum_{i=1}^{k} Z_i^2$, then $X \sim \chi^2_k$. This result follows from a combination of the Convolution Theorem and the transformation of random variables.
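The quadratic transformation and the chi-square construction above can be checked by simulation; a minimal sketch using only the Python standard library (sample size and seed are arbitrary choices). Since $E[\chi^2_k] = k$, the simulated mean of $\sum_{i=1}^k Z_i^2$ should be close to $k$:

```python
import random

# Monte Carlo check of the chi-square construction: if Z_i ~ N(0,1), then
# the sum of k squared draws is chi-square with k degrees of freedom, and
# its mean is k. (Simulation sketch; sample sizes are illustrative.)

rng = random.Random(0)  # fixed seed for reproducibility

def chi2_draw(k):
    """One draw of sum_{i=1}^k Z_i^2 with Z_i standard normal."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))

n, k = 20000, 3
sample = [chi2_draw(k) for _ in range(n)]
mean = sum(sample) / n
print(mean)  # close to k = 3 (standard error about sqrt(2k/n), roughly 0.017)
```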
[Figure: $\chi^2$ pdf and cdf]

t-distribution: if $Z \sim N(0, 1)$ and $X \sim \chi^2_n$, then
$$T = \frac{Z}{\sqrt{X/n}} \sim t_n$$

[Figure: t pdf and cdf]

F distribution: if $X \sim \chi^2_{d_1}$ and $Y \sim \chi^2_{d_2}$, then
$$Q = \frac{X/d_1}{Y/d_2} \sim F_{d_1, d_2}$$

[Figure: F pdf and cdf]

Other distributions: Benford, Beta, Boltzmann, categorical, Conway-Maxwell-Poisson, compound Poisson, discrete phase-type, extended negative binomial, Gauss-Kuzmin, geometric, hypergeometric, Irwin-Hall, Kumaraswamy, logarithmic, negative binomial, parabolic fractal, Rademacher, raised cosine, Skellam, triangular, U-quadratic, Wigner semicircle, Yule-Simon, Zipf, Zipf-Mandelbrot, zeta, Beta prime, Bose-Einstein, Burr, chi, Coxian, Erlang, exponential, Fermi-Dirac, folded normal, Fréchet, Gamma, generalized extreme value, generalized inverse Gaussian, half-logistic, half-normal, Hotelling's T-square, hyper-exponential, hypoexponential, inverse chi-square (scaled inverse chi-square), inverse Gaussian, inverse gamma, Lévy, log-normal, log-logistic, Maxwell-Boltzmann, Maxwell speed, Nakagami, noncentral chi-square, Pareto, phase-type, Rayleigh, relativistic Breit-Wigner, Rice, Rosin-Rammler, shifted Gompertz, truncated normal, type-2 Gumbel, Weibull, Wilks' lambda, Cauchy, extreme value, exponential power, Fisher's z, generalized normal, generalized hyperbolic, Gumbel, hyperbolic secant, Landau, Laplace, logistic, normal inverse Gaussian, skew normal, slash, type-1 Gumbel, Variance-Gamma, Voigt, ...

Reference: Balakrishnan and Nevzorov (2003)

19 Statistical Properties of Estimators

A generic econometric model can be written in the form
$$\{f(X; \theta) : \theta \in \Theta\}$$
where $X$ is a matrix of random variables, $\theta$ is a vector of parameters, and $\Theta$ is a parameter space. Consider a data sample $x = (x_1, \ldots, x_n)$ treated as realizations of the random variables $X = (X_1, \ldots, X_n)$. An estimator is a function $\hat{\theta}(X)$ of $(X_1, \ldots, X_n)$ intended as a basis for learning about the unknown $\theta$.
An estimate is a realized value $\hat{\theta}(x)$ of the estimator for a particular data sample $x$. Desirable properties of estimators: unbiasedness, consistency, efficiency, and a tractable asymptotic distribution (e.g. Normality).

Examples of parameters and their estimators:

1. Analysis without an econometric model. Population quantities and their estimators:

- Mean: $\mu_X = E[X]$; sample mean: $\bar{X} = n^{-1}\sum_{i=1}^{n} X_i$
- Variance: $\sigma^2_X = E[(X - \mu_X)^2]$; sample variance: $S^2_X = (n-1)^{-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
- Covariance: $\sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)]$; sample covariance: $S_{XY} = (n-1)^{-1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$

2. Analysis with an econometric model. Population quantity: the simple regression slope $\beta_1$. Its estimator:
$$\hat{\beta} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}$$

19.1 Sampling Distribution

Recall that an estimator $\hat{\theta}(X)$ is a random variable. For example, the sample mean $\bar{X}$ is a combination (i.e. function) of other random variables $X_i$, which makes $\bar{X}$ a (composite) random variable. The probability density function (pdf) of the estimator $\hat{\theta}(X)$ is called the sampling distribution. In most cases the sampling distribution of $\hat{\theta}(X)$ is unknown, but there are ways to estimate or approximate it (using asymptotics or bootstrap procedures). The sampling distribution is used for hypothesis testing and evaluation of the properties of estimators. We can broadly distinguish between two types of properties:

1. Small sample or finite sample properties (for fixed $n < \infty$):

(a) Bias
Example: $\hat\theta$ is a biased estimator of $\theta$, while $\tilde\theta$ is an unbiased estimator of $\theta$.

If $\hat\theta$ and $\tilde\theta$ are two unbiased estimators of $\theta$, then $\tilde\theta$ is efficient relative to $\hat\theta$ if $Var(\tilde\theta) \le Var(\hat\theta)$.

The mean squared error (MSE) allows us to compare estimators that are not necessarily unbiased:
$$MSE(\hat\theta) = E\left[(\hat\theta - \theta)^2\right] = Var(\hat\theta) + \mathrm{bias}^2(\hat\theta)$$
We prefer estimators with smaller MSE.

19.3 Convergence in Probability and Consistency

A sequence of deterministic (i.e. nonstochastic) real numbers $\{a_n\}$ converges to a limit $a$ if, for any $\varepsilon > 0$, there exists $n^* = n^*(\varepsilon)$ such that, for all $n > n^*$,
$$|a_n - a| < \varepsilon \qquad (68)$$
which implies $P(|a_n - a| < \varepsilon) = 1$ for all $n > n^*$. For example, if $a_n = 2 + 3/n$, then the limit is $a = 2$, since $|a_n - a| = |2 + 3/n - 2| = |3/n| < \varepsilon$ for all $n > n^* = 3/\varepsilon$.

For a sequence of random (i.e. stochastic) variables $\{X_n\}$, we can never be completely certain that the condition (68) will always be satisfied, even for large $n$. Instead, we require that the probability that (68) is satisfied approach 1 as $n$ approaches infinity:
$$P(|X_n - c| < \varepsilon) \to 1 \text{ as } n \to \infty$$
A formal definition of this requirement is as follows:

Definition 80 (Convergence in Probability). A sequence of random variables $\{X_n\}$ converges in probability to a limit $c$ if, for any $\varepsilon > 0$ and $\delta > 0$, there exists $n^*(\varepsilon, \delta)$ such that, for all $n > n^*$,
$$P(|X_n - c| < \varepsilon) > 1 - \delta.$$
We write $\mathrm{plim}(X_n) = c$ or, equivalently, $X_n \xrightarrow{p} c$.

Recall that an estimator $\hat\theta_n$ is a random variable. Note that $\hat\theta_n$ can be viewed as a sequence indexed by the sample size $n$.

Definition 81. If $\hat\theta_n$ converges in probability to $\theta_0$, then $\hat\theta_n$ is a consistent estimator of $\theta_0$.

Thus, for a consistent estimator $\hat\theta_n$ of $\theta_0$, it holds by definition that
$$\hat\theta_n \xrightarrow{p} \theta_0 \quad \text{or} \quad \mathrm{plim}(\hat\theta_n) = \theta_0.$$
Note that it is possible for an estimator to be biased in small samples, but consistent (i.e., it is biased in small samples, but this bias goes away as $n$ gets large).
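Both points above, the MSE decomposition and the biased-but-consistent possibility, can be verified by Monte Carlo for the variance estimator that divides by $n$ rather than $n-1$; a hedged sketch with illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, sigma2 = 20, 50_000, 4.0

# Estimator: sigma2_hat = n^{-1} sum (x_i - xbar)^2, biased downward by sigma2/n,
# but the bias vanishes as n grows, so the estimator is consistent
draws = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
est = draws.var(axis=1, ddof=0)          # one estimate per replication

bias = est.mean() - sigma2               # E[est] - theta, about -sigma2/n = -0.2
mse = ((est - sigma2) ** 2).mean()       # E[(est - theta)^2]
var_plus_bias2 = est.var() + bias ** 2   # Var(est) + bias^2
```

The decomposition $MSE = Var + \mathrm{bias}^2$ holds as an algebraic identity over the replications, so the two quantities agree to floating-point precision.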
For example, for $b \in \mathbb{R}$ with $b \ne 0$, we could have
$$E[\hat\theta_n] = \theta_0 + \frac{b}{n}, \qquad \lim_{n \to \infty} E[\hat\theta_n] = \theta_0.$$
In general, as long as the variance of an estimator collapses to zero as the sample size gets larger, while any small-sample bias also disappears, the estimator will be consistent.

Convergence in probability is preserved under deterministic transformations. Formally, this is expressed by the following theorem:

Theorem 82 (Slutsky's Theorem). Let $X_n$ be a finite-dimensional vector of random variables, and $g(\cdot)$ a real-valued function continuous at a vector of constants $c$. Then $X_n \xrightarrow{p} c$ implies $g(X_n) \xrightarrow{p} g(c)$.

Laws of large numbers (LLN) are theorems for convergence in probability in the special case of sample averages.

Theorem 83 (Weak Law of Large Numbers). Let $\{X_1, \ldots, X_n\}$ be a sequence of $n$ independent and identically distributed (iid) random variables, with $E[X_i] = \mu < \infty$. Then
$$\bar X_n \equiv \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{p} \mu.$$
There are a number of different versions of LLNs. These include the Kolmogorov and Markov LLNs, which imply convergence in probability.

Mathematica demonstration: Law of Large Numbers
http://demonstrations.wolfram.com/IllustratingTheLawOfLargeNumbers/

19.4 Convergence in Distribution and Asymptotic Normality

Definition 84 (Convergence in Distribution). Let $\{X_n\}$ be a sequence of random variables and assume that $X_n$ has cdf $F_n$. $\{X_n\}$ is said to converge in distribution to a random variable $X$ with cdf $F$ if
$$\lim_{n \to \infty} F_n(x) = F(x)$$
at every point $x$ where $F$ is continuous. We write $X_n \xrightarrow{d} X$ and we call $F$ the limit distribution of $\{X_n\}$.

Convergence in probability implies convergence in distribution; that is, $X_n \xrightarrow{p} c$ implies $X_n \xrightarrow{d} c$, where the constant $c$ can be thought of as a degenerate random variable with a probability mass of 1 placed at $c$. In general the converse does not hold.

Theorem 85 (Continuous Mapping Theorem). Let $X_n$ be a finite-dimensional vector of random variables, and $g(\cdot)$ a continuous real-valued function.
Then $X_n \xrightarrow{d} X$ implies $g(X_n) \xrightarrow{d} g(X)$.

Theorem 86 (Transformation Theorem). If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$, where $X$ is a random variable and $c$ is a constant, then
(i) $X_n + Y_n \xrightarrow{d} X + c$
(ii) $X_n Y_n \xrightarrow{d} cX$
(iii) $X_n / Y_n \xrightarrow{d} c^{-1}X$ if $c \ne 0$.

Central limit theorems (CLT) are theorems on convergence in distribution of sample averages. Intuition: take a (small) sample of independent measurements of a random variable and compute its average (examples: commuting time, website clicks). Take another such sample and compute its average, another sample and average, etc. Keep repeating. The distribution (histogram) of such averages becomes Normal (Gaussian) as $n \to \infty$.

Theorem 87 (Classical Central Limit Theorem). Let $\{X_1, \ldots, X_n\}$ be a sequence of $n$ independent and identically distributed (iid) random variables, with $E[X_i] = \mu < \infty$ and $0 < Var(X_i) = \sigma^2 < \infty$. Then, as $n \to \infty$, the distribution of $\bar X_n \equiv n^{-1}\sum_{i=1}^{n} X_i$ converges to the Normal distribution with mean $\mu$ and variance $\sigma^2/n$, i.e.
$$\bar X_n \xrightarrow{d} N(\mu, \sigma^2/n)$$
irrespective of the shape of the original distribution of the $X_i$.

There are a number of different versions of CLTs. These include the Lindeberg-Levy and Liapounov CLTs.

From a LLN, the sample average has a degenerate distribution as it converges to the constant $\mu$. We scale $(\bar X_n - \mu)$ by $n^{1/2}/\sigma = (\sigma^2/n)^{-1/2}$, the inverse of the standard deviation of $\bar X_n$, to construct a random variable with unit variance. Hence, the CLT implies that
$$n^{1/2}(\bar X_n - \mu)/\sigma = \sqrt{n}\,(\bar X_n - \mu)/\sigma \xrightarrow{d} N(0, 1).$$
The CLT is used to derive the asymptotic distribution of a number of commonly used estimators (e.g. GMM, ML).

Demonstration: CLT
http://onlinestatbook.com/stat_sim/sampling_dist/index.html

20 Stochastic Orders and Delta Method

20.1 A-notation

Consider a so-called "A-notation" which means "absolutely at most." For example, A(2) stands for a quantity whose absolute value is less than or equal to 2.
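Theorem 87 above is easy to visualize numerically: averages of heavily skewed draws become approximately Normal once standardized. A minimal sketch (exponential parent distribution; sample sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 50, 50_000

# Exponential(1) has mu = sigma = 1 and is strongly right-skewed
draws = rng.exponential(scale=1.0, size=(reps, n))

# Standardized averages: sqrt(n) * (xbar_n - mu) / sigma
z = np.sqrt(n) * (draws.mean(axis=1) - 1.0)

# z should be close to N(0, 1) despite the skewed parent distribution
z_mean, z_std = z.mean(), z.std()
```

With $n = 50$ the mean and standard deviation of the standardized averages already match $N(0,1)$ closely; only a small residual skew remains.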
This notation has a natural connection with decimal numbers: saying that $\pi$ is approximately 3.14 is equivalent to saying that $\pi = 3.14 + A(0.005)$. Another example: $10^{A(2)} = A(100)$.

We can use the A-notation in general operations. For example:
$$(3.14 + A(0.005))(1 + A(0.01)) = 3.14 + A(0.005) + A(0.0314) + A(0.00005).$$
The A-notation applies to variable quantities as well as to constant ones. Examples:
$$\sin(x) = A(1)$$
$$A(x) = x\,A(1)$$
$$A(x) + A(y) = A(x + y) \quad \text{if } x \ge 0 \text{ and } y \ge 0$$
$$(1 + A(t))^2 = 1 + 3A(t) \quad \text{if } t = A(1)$$
The equality sign "=" is not symmetric with respect to such notation. We have $3 = A(5)$ and $4 = A(5)$, but not $A(5) = 4$. In this context, the equality sign is used as the word "is" in English: Aristotle is a man, but a man isn't necessarily Aristotle.

20.2 O-notation

The O-notation ("big oh notation") is conceptually similar to the A-notation, except that the former is less specific. In its simplest form, $O(x)$ stands for something that is $cA(x)$ for some constant $c$, but we do not specify what $c$ is. For example,
$$O(n) + O(n) = O(n).$$
Similarly, for $n > 0$,
$$\left(n + O(\sqrt{n})\right)\left(\ln n + \gamma + O(1/n)\right) = n \ln n + \gamma n + O(1) + O(\sqrt{n}\,\ln n) + O(\sqrt{n}) + O(1/\sqrt{n})$$
where $\gamma$ is Euler's constant.

The O-notation is most useful for comparing the relative values of two functions, $f(x)$ and $g(x)$, as $x$ approaches $\infty$ or $0$, depending on which of these two cases is being considered. We will suppose that $g$ is positive-valued.

1. The case $x \to \infty$: $f(x) = O(g(x))$ if there exist constants $A$ and $x_0 > 0$ such that
$$\frac{|f(x)|}{g(x)} < A \quad \text{for all } x > x_0.$$

2. The case $x \to 0$: $f(x) = O(g(x))$ if there exist constants $A$ and $x_0 > 0$ such that
$$\frac{|f(x)|}{g(x)} < A \quad \text{for all } x < x_0.$$

Example 1: The function $f(x) = 3x^3 + 4x^2$ is $O(x^3)$ as $x \to \infty$. Here, $g(x) = x^3$. We have
$$\frac{|f(x)|}{g(x)} = \frac{3x^3 + 4x^2}{x^3} = 3 + \frac{4}{x}.$$
There exist infinitely many pairs of $A$ and $x_0$ to show that $f$ is $O(x^3)$; for example, $|f(x)|/g(x) < 4.1$ for all $x > 40$.

Example 2: The function $f(x) = 3x^3 + 4x^2$ is $O(x^2)$ as $x \to 0$. Here
$$\frac{|f(x)|}{g(x)} = \frac{3x^3 + 4x^2}{x^2} = 3x + 4$$
and, for example, $|f(x)|/g(x) < 4.3$ for all $x < 0.1$.

20.3 o-notation

The o-notation ("little oh") is equivalent to the taking of limits, as opposed to specifying bounds.

1. The case $x \to \infty$: $f(x) = o(g(x))$ if $\lim_{x \to \infty} |f(x)|/g(x) = 0$.

2. The case $x \to 0$: $f(x) = o(g(x))$ if $\lim_{x \to 0} |f(x)|/g(x) = 0$.

Example 3: The function $f(x) = 3x^3 + 4x^2$ is $o(x^4)$ as $x \to \infty$, because
$$\lim_{x \to \infty} \frac{|f(x)|}{g(x)} = \lim_{x \to \infty} \frac{3x^3 + 4x^2}{x^4} = \lim_{x \to \infty}\left(\frac{3}{x} + \frac{4}{x^2}\right) = 0.$$

Example 4: The function $f(x) = 3x^3 + 4x^2$ is $o(x)$ as $x \to 0$, because
$$\lim_{x \to 0} \frac{|f(x)|}{g(x)} = \lim_{x \to 0} \frac{3x^3 + 4x^2}{x} = \lim_{x \to 0}\left(3x^2 + 4x\right) = 0.$$

Example 5: The function $f(x) = 3x^3 + 4x^2$ is $o(1)$ as $x \to 0$, because
$$\lim_{x \to 0} \frac{|f(x)|}{g(x)} = \lim_{x \to 0}\left(3x^3 + 4x^2\right) = 0.$$

20.4 Op-notation

Definition 88. Let $\{Y_n\}$ be a sequence of random variables and $g(n)$ a real-valued function of the positive integer argument $n$. Then $Y_n$ is $O_p(g(n))$ if for all $\delta > 0$ there exist a constant $K < \infty$ and a positive integer $N$ such that
$$P\left(\left|\frac{Y_n}{g(n)}\right| > K\right) < \delta \quad \text{for all } n > N.$$

Definition 89. Let $\{Y_n\}$ be a sequence of random variables and $g(n)$ a real-valued function of the positive integer argument $n$. Then $Y_n$ is $o_p(g(n))$ if for all $\varepsilon, \delta > 0$ there exists a positive integer $N$ such that
$$P\left(\left|\frac{Y_n}{g(n)}\right| > \varepsilon\right) < \delta \quad \text{for all } n > N$$
or equivalently, for all $\varepsilon > 0$,
$$\lim_{n \to \infty} P\left(\left|\frac{Y_n}{g(n)}\right| > \varepsilon\right) = 0,$$
or equivalently $\mathrm{plim}\, (Y_n / g(n)) = 0$.

Example: Let $X_1, X_2, \ldots, X_n$ be iid random variables distributed according to $F$, with mean $\mu$ and variance $\sigma^2 < \infty$. Then we know that for the average $\bar X_n$ it holds that
$$\bar X_n - \mu = o_p(1) \quad \text{(LLN)}$$
$$\bar X_n - \mu = O_p(1/\sqrt{n}) \quad \text{(CLT)}$$
$$\sqrt{n}\,(\bar X_n - \mu) = O_p(1).$$
There are many simple rules for manipulating $o_p(1)$ and $O_p(1)$ sequences which can be deduced from the Continuous Mapping Theorem.
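The stochastic-order claims in the example above show up directly in simulation: the spread of $\bar X_n - \mu$ shrinks like $1/\sqrt{n}$, while the spread of $\sqrt{n}(\bar X_n - \mu)$ stays put. A sketch (exponential draws; sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, reps = 1.0, 5_000
spread_raw, spread_scaled = [], []

for n in (100, 400, 1600):
    xbar = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
    spread_raw.append((xbar - mu).std())                     # shrinks: o_p(1)
    spread_scaled.append((np.sqrt(n) * (xbar - mu)).std())   # stable: O_p(1)
```

The raw deviations halve each time $n$ quadruples, while the scaled deviations hover near $\sigma = 1$.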
For example:
$$o_p(1) + o_p(1) = o_p(1)$$
$$o_p(1) + O_p(1) = O_p(1)$$
$$O_p(1) + O_p(1) = O_p(1)$$
$$o_p(1)\,o_p(1) = o_p(1)$$
$$o_p(1)\,O_p(1) = o_p(1)$$
$$O_p(1)\,O_p(1) = O_p(1)$$

20.5 Delta Method

20.5.1 Probabilistic Taylor Expansion

Theorem 90. Let $\{X_n\}$ be a sequence of random variables such that $X_n = a + O_p(r_n)$, where $a \in \mathbb{R}$ and $0 < r_n \to 0$ as $n \to \infty$. If $g$ is a function with $s$ continuous derivatives at $a$, then
$$g(X_n) = \sum_{j=0}^{s} \frac{g^{(j)}(a)}{j!}(X_n - a)^j + o_p(r_n^s).$$
(Reference: Proposition 6.1.5 in P. Brockwell and R. Davis, Time Series: Theory and Methods, Springer, 1991.)

Consequently, for a sequence of $k$-vectors $\{X_n\}$ such that $X_n = a + O_p(r_n)$, where $a \in \mathbb{R}^k$ and $0 < r_n \to 0$ as $n \to \infty$,
$$g(X_n) = g(a) + \nabla g(a)'(X_n - a) + o_p(r_n). \qquad (69)$$

Theorem 91 (Multivariate Delta Method). Suppose that $\{X_n\}$ is a sequence of random $k$-vectors such that
$$\sqrt{n}\,(X_n - \theta) \xrightarrow{d} N(0, \Sigma).$$
Let $g : \mathbb{R}^k \to \mathbb{R}$ be a differentiable function. Then
$$\sqrt{n}\,(g(X_n) - g(\theta)) \xrightarrow{d} N\left(0, \nabla g(\theta)'\,\Sigma\,\nabla g(\theta)\right).$$
To show that this result holds, note that, using (69),
$$Var(g(X_n)) = Var\left(g(\theta) + \nabla g(\theta)'(X_n - \theta)\right) + o_p(r_n) = Var\left(\nabla g(\theta)'X_n\right) + o_p(r_n) = \nabla g(\theta)'\,Var(X_n)\,\nabla g(\theta) + o_p(r_n).$$

21 Regression with Matrix Algebra

21.1 Multiple Regression in Matrix Notation

Recall the linear multiple regression model
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + u_i. \qquad (70)$$
For each $i$, define the $1 \times (k+1)$ vector $x_i = (1, x_{i1}, x_{i2}, \ldots, x_{ik})$ and the $(k+1) \times 1$ vector of parameters $\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$. Then we can write equation (70) as
$$y_i = x_i \beta + u_i.$$
We can now write the linear multiple regression model in matrix notation for all $i = 1, \ldots, n$:
$$\underset{n \times (k+1)}{X} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & & & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}$$
Define also $u$ as the $n \times 1$ vector of unobservable errors.
Then the multiple regression for all $i = 1, \ldots, n$ can be written compactly as
$$y = X\beta + u$$
where $y$ is $n \times 1$, $X$ is $n \times (k+1)$, $\beta$ is $(k+1) \times 1$, and $u$ is $n \times 1$.

Ordinary Least Squares (OLS) principle: minimize the sum of squared residuals (SSR). Define the SSR as a function of the $(k+1) \times 1$ parameter vector $b$ as
$$SSR(b) \equiv \sum_{i=1}^{n} (y_i - x_i b)^2.$$
The $(k+1) \times 1$ vector of OLS estimates $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k)'$ minimizes $SSR(b)$ over all possible vectors $b$.

Mathematica demonstration: Least Squares
http://demonstrations.wolfram.com/LeastSquaresCriteriaForTheLeastSquaresRegressionLine/

$\hat\beta$ solves the first-order condition
$$\frac{\partial SSR(\hat\beta)}{\partial b} = 0$$
with the solution
$$\sum_{i=1}^{n} x_i'(y_i - x_i\hat\beta) = 0 \quad \text{or equivalently} \quad X'(y - X\hat\beta) = 0,$$
which are called the "normal equations". Solving the FOC $X'(y - X\hat\beta) = 0$ for $\hat\beta$ yields
$$X'y = (X'X)\hat\beta$$
$$\hat\beta = (X'X)^{-1}X'y.$$
We need $(X'X)$ to be invertible, which is guaranteed by assumption MLR.3 of no perfect multicollinearity (i.e. no exact linear relationship among the independent variables, giving us full rank of $X'X$).

The $n \times 1$ vectors of fitted values and residuals are given by
$$\hat y = X\hat\beta$$
and
$$\hat u = y - \hat y = y - X\hat\beta.$$
The sum of squared residuals can be written as
$$SSR(\hat\beta) = \sum_{i=1}^{n} \hat u_i^2 = \hat u'\hat u = (y - X\hat\beta)'(y - X\hat\beta).$$

21.2 MLR Assumptions in Matrix Notation

MLR.1 The model is linear in parameters: $y = X\beta + u$. (71)
MLR.2 We have a random sample $X, y$ following (71).
MLR.3 There is no perfect multicollinearity, i.e. $\mathrm{rank}(X) = k + 1$.
MLR.4 The error $u$ has an expected value of zero conditional on $X$: $E[u\,|\,X] = 0$.
MLR.5 The error $u$ is homoskedastic: $Var[u\,|\,X] = \sigma^2_u I_n$, where $I_n$ is the $n \times n$ identity matrix.

21.3 Projection Matrices

Define the matrices
$$P = X(X'X)^{-1}X', \qquad M = I_n - P,$$
where $I_n$ is the $n \times n$ identity matrix. $P$ and $M$ are called projection matrices due to the property that for any matrix $Z$ which can be written as $Z = X\Gamma$ for some matrix $\Gamma$,
$$PZ = PX\Gamma = X(X'X)^{-1}X'X\Gamma = X\Gamma = Z$$
and
$$MZ = (I_n - P)Z = Z - PZ = Z - Z = 0.$$
The matrices $P$ and $M$ are symmetric and idempotent. By the definitions of $P$ and $M$,
$$\hat y = X\hat\beta = X(X'X)^{-1}X'y = Py$$
and
$$\hat u = y - \hat y = y - Py = My.$$

21.4 Unbiasedness of OLS

Theorem 92 (Unbiasedness of OLS). Under MLR.1-MLR.4, $E[\hat\beta\,|\,X] = \beta$.

Proof. Use MLR.1-MLR.3 to write
$$\hat\beta = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + u) = (X'X)^{-1}(X'X)\beta + (X'X)^{-1}X'u = \beta + (X'X)^{-1}X'u.$$
Taking the expectation conditional on $X$ and using MLR.4 yields
$$E[\hat\beta\,|\,X] = \beta + (X'X)^{-1}X'E[u\,|\,X] = \beta.$$

21.5 Variance-Covariance Matrix of the OLS Estimator

Theorem 93. Under MLR.1-MLR.5, $Var(\hat\beta\,|\,X) = \sigma^2_u (X'X)^{-1}$.

Proof. Using the expression for $\hat\beta$ from the unbiasedness proof and MLR.5 we obtain
$$Var(\hat\beta\,|\,X) = Var\left(\beta + (X'X)^{-1}X'u\,\middle|\,X\right) = Var\left((X'X)^{-1}X'u\,\middle|\,X\right)$$
$$= (X'X)^{-1}X'\,Var(u\,|\,X)\,X(X'X)^{-1} = (X'X)^{-1}X'\,\sigma^2_u I_n\,X(X'X)^{-1} = \sigma^2_u (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2_u (X'X)^{-1}.$$

Mathematica demonstration: Regression with Transformations
http://demonstrations.wolfram.com/RegressionModelWithTransformations/

21.6 Gauss-Markov in Matrix Notation

Theorem 94 (Gauss-Markov). Under MLR.1-MLR.5, $\hat\beta$ is the best linear unbiased estimator.

Proof. Any other linear estimator of $\beta$ can be written as
$$\tilde\beta = A'y = A'(X\beta + u) = A'X\beta + A'u \qquad (72)$$
where $A$ is an $n \times (k+1)$ matrix such that (for unbiasedness)
$$E[\tilde\beta\,|\,X] = A'X\beta + A'E[u\,|\,X] = A'X\beta = \beta,$$
implying $A'X = I_{k+1}$. From (72) we have
$$Var(\tilde\beta\,|\,X) = Var(A'u\,|\,X) = A'\,Var(u\,|\,X)\,A = \sigma^2_u A'A.$$
Using $A'X = I_{k+1}$, we have
$$Var(\tilde\beta\,|\,X) - Var(\hat\beta\,|\,X) = \sigma^2_u\left[A'A - (X'X)^{-1}\right] = \sigma^2_u\left[A'A - A'X(X'X)^{-1}X'A\right] = \sigma^2_u A'\left[I_n - X(X'X)^{-1}X'\right]A = \sigma^2_u A'MA,$$
where $M \equiv I_n - X(X'X)^{-1}X'$ is symmetric and idempotent (i.e. can be written as $MM$), implying that $\sigma^2_u A'MA$ is positive semi-definite. Thus,
$$Var(\tilde\beta\,|\,X) \ge Var(\hat\beta\,|\,X)$$
for any other linear unbiased estimator $\tilde\beta$.
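The closed-form estimator, the normal equations, and the projection identities above can all be exercised in a few lines; a sketch with simulated data (dimensions and coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 400, 2
beta = np.array([1.0, 2.0, -0.5])                  # intercept plus k slopes

X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ beta + rng.normal(scale=0.5, size=n)

# b = (X'X)^{-1} X'y; solve() avoids forming the inverse explicitly
b = np.linalg.solve(X.T @ X, X.T @ y)
score = X.T @ (y - X @ b)                          # normal equations: X'(y - Xb) = 0

# Projection matrices: P projects onto the column space of X, M annihilates it
P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P
yhat, uhat = P @ y, M @ y                          # fitted values and residuals
```

The checks below confirm the normal equations hold at the solution, that $P$ and $M$ are idempotent with $MX = 0$, and that $\hat\beta$ lands near the truth.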
22 Maximum Likelihood

22.1 Method of Maximum Likelihood

If the distribution of $y_i$ is $F(y; \theta)$, where $F$ is a known distribution function and $\theta \in \Theta$ is an unknown $m \times 1$ vector, we say that the distribution is parametric and that $\theta$ is the parameter of the distribution $F$. The space $\Theta$ is the set of permissible values for $\theta$. In this setting the method of maximum likelihood (ML) is the appropriate technique for estimation and inference on $\theta$. ML can be used for models that are linear or nonlinear in parameters. The trade-off: an assumption has to be made about the form of $F(y; \theta)$.

If the distribution $F$ is continuous, then the density of $y_i$ can be written as $f(y; \theta)$ and the joint density of a random sample $(y_1, \ldots, y_n)$ is
$$f_n(y_1, \ldots, y_n; \theta) = \prod_{i=1}^{n} f(y_i; \theta).$$
The likelihood of the sample is this joint density evaluated at the observed sample values, viewed as a function of $\theta$,
$$L(\theta) = f_n(y_1, \ldots, y_n; \theta).$$
The log-likelihood function is its natural log
$$\ln L(\theta) = \sum_{i=1}^{n} \ln f(y_i; \theta).$$
If the distribution $F$ is discrete, the likelihood and log-likelihood are constructed by setting $f(y; \theta) = P(y_i = y; \theta)$.

The maximum likelihood estimator $\hat\theta_{MLE}$ is the parameter value which maximizes the likelihood (equivalently, which maximizes the log-likelihood). We can write this as
$$\hat\theta_{MLE} = \arg\max_{\theta \in \Theta} \ln L(\theta).$$
In some simple cases, we can find an explicit expression for $\hat\theta_{MLE}$ as a function of the data (OLS is a special case under the Normality assumption on the residuals!), but these cases are relatively rare. More typically, $\hat\theta_{MLE}$ must be found by numerical methods.

22.2 Examples

22.2.1 Example 1

Suppose we have $n$ independent observations $y_1, \ldots, y_n$ from a $N(\mu, \sigma^2)$ distribution. What is the ML estimate of $\mu$ and $\sigma^2$?

Write the likelihood function as
$$L(\mu, \sigma^2 \,|\, y_1, \ldots, y_n) = \prod_{i=1}^{n} f(y_i) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right).$$
Its logarithm is more convenient to maximize:
$$\ell = \ln L(\mu, \sigma^2 \,|\, y_1, \ldots, y_n) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2.$$
To compute the maximum we need the partial derivatives:
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu)$$
$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i - \mu)^2.$$
The maximum likelihood estimators are those values $\hat\mu$ and $\hat\sigma^2$ which set these two partials equal to zero. The first equation determines $\hat\mu$:
$$\sum_{i=1}^{n} y_i = n\hat\mu \quad \Rightarrow \quad \hat\mu = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar y.$$
Now plug this $\hat\mu$ into the second equation to get
$$\frac{n}{2\hat\sigma^2} = \frac{1}{2\hat\sigma^4}\sum_{i=1}^{n}(y_i - \bar y)^2 \quad \Rightarrow \quad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar y)^2.$$

22.2.2 Example 2

Consider the random variables $t_1, \ldots, t_n$ that are independent and follow an exponential distribution with parameter $\lambda$, i.e.,
$$f(t; \lambda) = \lambda \exp(-\lambda t) \quad \text{for } t > 0.$$
Then
$$L(\lambda \,|\, t_1, \ldots, t_n) = \lambda^n \exp\left(-\lambda\sum_{i=1}^{n} t_i\right)$$
$$\ell = \ln L(\lambda \,|\, t_1, \ldots, t_n) = n\ln\lambda - \lambda\sum_{i=1}^{n} t_i$$
$$\frac{\partial \ell}{\partial \lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} t_i \quad \Rightarrow \quad \hat\lambda = \frac{n}{\sum_{i=1}^{n} t_i} = \frac{1}{\bar t}.$$

22.2.3 Example 3

We will now show that LS estimates can be obtained as a special case of ML in the multiple regression model
$$y = X\beta + u, \qquad u \sim N(0, \sigma^2_u I_N).$$
Conditional on $X$, $y \sim N(X\beta, \sigma^2 I_N)$ and hence
$$f(y_i \,|\, x_i; \beta, \sigma^2) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(y_i - x_i\beta)^2}{2\sigma^2}\right)$$
$$f(y \,|\, X; \beta, \sigma^2) = \prod_{i=1}^{N} (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(y_i - x_i\beta)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-N/2}\exp\left(-\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}\right).$$
Rewrite as a log-likelihood:
$$\ell(y \,|\, X; \beta, \sigma^2) = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}.$$
Note that maximizing $\ell(y \,|\, X; \beta, \sigma^2)$ w.r.t. $\beta$ is equivalent to minimizing $(y - X\beta)'(y - X\beta)$ w.r.t. $\beta$ (both yield the same argmax). Also note that
$$(y - X\beta)'(y - X\beta) = u'u = \sum_{i=1}^{N} u_i^2$$
which is the negative of the sum of squared residuals criterion function of OLS. Hence maximizing $\ell(y \,|\, X; \beta, \sigma^2)$ to obtain $\hat\beta_{ML}$ yields the same result as minimizing $\sum_{i=1}^{N} u_i^2$ to obtain $\hat\beta_{OLS}$.
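Example 2's closed form $\hat\lambda = 1/\bar t$ can be checked against a brute-force maximization of the log-likelihood; a sketch (the true $\lambda$, grid, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
lam_true = 2.0
t = rng.exponential(scale=1.0 / lam_true, size=5_000)

# Closed-form MLE from the first-order condition: lambda_hat = n / sum(t_i) = 1 / tbar
lam_hat = 1.0 / t.mean()

# Brute force: log L(lambda) = n*log(lambda) - lambda*sum(t_i), maximized on a grid
grid = np.linspace(0.5, 4.0, 3_501)
loglik = t.size * np.log(grid) - grid * t.sum()
lam_grid = grid[np.argmax(loglik)]
```

The grid maximizer agrees with the analytic solution up to the grid spacing, and both sit near the true parameter.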
Mathematica demonstration: MLE
http://demonstrations.wolfram.com/MaximumLikelihoodEstimatorsWithNormallyDistributedError/

22.3 Information Matrix Equality

Define the negative of the expected Hessian
$$H = -E\left[\frac{\partial^2}{\partial\theta\,\partial\theta'}\ell(y \,|\, \theta_0)\right] \qquad (73)$$
Define the outer-product matrix
$$\Omega = E\left[\frac{\partial}{\partial\theta}\ell(y \,|\, \theta_0)\,\frac{\partial}{\partial\theta}\ell(y \,|\, \theta_0)'\right]$$

Theorem 95. For the Maximum Likelihood Estimator,
$$I_0 \equiv H = \Omega. \qquad (74)$$
The equality (74) is called the information matrix equality. The matrix $I_0$ is called the Fisher information matrix. We will further see that $I_0^{-1}$ is the variance-covariance matrix of the MLE. Before deriving the information matrix equality, we will first introduce several useful concepts.

The score of the log-likelihood is defined as
$$s_i(\theta) \equiv \nabla_\theta\,\ell_i(\theta)' = \left(\frac{\partial\ell_i}{\partial\theta_1}(\theta), \frac{\partial\ell_i}{\partial\theta_2}(\theta), \ldots, \frac{\partial\ell_i}{\partial\theta_m}(\theta)\right)'$$
An important property of the score is:
$$E[s_i(\theta_0) \,|\, x_i] = \int s_i(y, x_i; \theta_0)\,f(y, x_i; \theta_0)\,dy \qquad (75)$$
$$= \int \nabla_\theta \log f(y, x_i; \theta_0)\,f(y, x_i; \theta_0)\,dy = \int \frac{\nabla_\theta f(y, x_i; \theta_0)}{f(y, x_i; \theta_0)}\,f(y, x_i; \theta_0)\,dy \qquad (76)$$
$$= \int \nabla_\theta f(y, x_i; \theta_0)\,dy = \nabla_\theta \int f(y, x_i; \theta_0)\,dy = \nabla_\theta\,1 = 0. \qquad (77)$$
From (75) and (76) it follows that
$$s_i(y, x_i; \theta_0) = \frac{\nabla_\theta f(y, x_i; \theta_0)}{f(y, x_i; \theta_0)}. \qquad (78)$$
Assuming $\ell_i(\theta)$ is twice continuously differentiable over $\Theta$, define
$$H_i(\theta) \equiv \nabla_\theta s_i(\theta) = \nabla^2_\theta\,\ell_i(\theta). \qquad (79)$$
Differentiate (75) and (77) w.r.t. $\theta$, and use the product rule to obtain
$$\nabla_\theta E[s_i(\theta_0) \,|\, x_i] = \nabla_\theta \int s_i(y, x_i; \theta_0)\,f(y, x_i; \theta_0)\,dy = 0$$
$$\int \nabla_\theta\left[s_i(y, x_i; \theta_0)\,f(y, x_i; \theta_0)\right]dy = 0$$
$$\int H_i(\theta_0)\,f(y, x_i; \theta_0)\,dy + \int s_i(y, x_i; \theta_0)\,\nabla_\theta f(y, x_i; \theta_0)'\,dy = 0.$$
Use (78) and (79) to obtain
$$\int H_i(\theta_0)\,f(y, x_i; \theta_0)\,dy + \int s_i(y, x_i; \theta_0)\,s_i(y, x_i; \theta_0)'\,f(y, x_i; \theta_0)\,dy = 0$$
$$E[H_i(\theta_0) \,|\, x_i] + E\left[s_i(y, x_i; \theta_0)\,s_i(y, x_i; \theta_0)' \,\middle|\, x_i\right] = 0.$$
Using the Law of Iterated Expectations (LIE) yields
$$-H + \Omega = 0 \quad \Rightarrow \quad H = \Omega.$$
Note: in general, the LIE is stated as $E[E[Y \,|\, X]] = E[Y]$.
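For the exponential model of Example 2, the zero-mean score property and the information matrix equality can be checked directly, since both sides are available in closed form per observation ($-E[\partial^2 \ell_i / \partial\lambda^2] = 1/\lambda^2$ should equal $E[s_i^2]$); a simulation sketch with illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(7)
lam = 2.0
t = rng.exponential(scale=1.0 / lam, size=200_000)

# Per-observation log-density: log f = log(lam) - lam * t
score = 1.0 / lam - t            # d/d(lam) of log f, evaluated at the true lam
neg_hessian = 1.0 / lam ** 2     # -d^2/d(lam)^2 of log f (constant in t)

mean_score = score.mean()        # should be near 0, as in (75)-(77)
outer = (score ** 2).mean()      # should be near neg_hessian = 0.25
```

Both sample averages match their population counterparts to within simulation noise.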
22.4 Asymptotic Properties of MLE

When standardized, the log-likelihood is a sample average:
$$\frac{1}{n}\ell(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell(y_i; \theta) \xrightarrow{p} E[\ell(y; \theta)].$$
As $\hat\theta_{MLE}$ maximizes the left-hand side, it is an estimator of the maximizer of the right-hand side. Under regularity conditions (such as boundedness, smoothness, and continuity of $\ell(\cdot)$), convergence in probability is preserved under the argmax operation and its inverse.

Theorem 96 (MLE Consistency). Under regularity conditions, $\hat\theta_{MLE}$ is consistent: $\hat\theta_{MLE} \xrightarrow{p} \theta_0$.

Theorem 97 (MLE Asymptotic Normality). Under regularity conditions, $\hat\theta_{MLE}$ is asymptotically Normally distributed:
$$\sqrt{n}\left(\hat\theta_{MLE} - \theta_0\right) \xrightarrow{d} N\left(0, nI_0^{-1}\right).$$
Therefore, the asymptotic variance of $\hat\theta_{MLE}$ is
$$AVar(\hat\theta_{MLE}) = I_0^{-1}.$$
Typically, to estimate the asymptotic variance of $\hat\theta_{MLE}$ we use an estimate based on the Hessian formula (73),
$$\hat H = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2}{\partial\theta\,\partial\theta'}\ell(y_i; \hat\theta_{MLE}). \qquad (80)$$
We then set $\hat I_0 = -\hat H$. Asymptotic standard errors for $\hat\theta_{MLE}$ are then the square roots of the diagonal elements of $\hat I_0^{-1}$.

Theorem 98 (Cramer-Rao Lower Bound). If $\tilde\theta$ is an unbiased estimator of $\theta \in \mathbb{R}$, then $Var(\tilde\theta) \ge I_0^{-1}$.

The ML estimator achieves the Cramer-Rao lower bound and is therefore efficient in the class of unbiased estimators.

23 Generalized Method of Moments

23.1 Motivation

Recall the generic form of an econometric model
$$Y = f(X, U; \theta).$$
The multiple regression model estimated by OLS assumes that $f(\cdot)$ is a linear function in parameters. If $f(\cdot)$ is nonlinear, $\theta$ can be estimated by MLE as long as we impose a functional form assumption on the joint distribution of $Y, X, U$. In many cases, economic theory does not provide any basis for the full functional form of the joint distribution, but only one (or a few) of its moments. In such cases, the model can be estimated using the Method of Moments (MM) or the Generalized Method of Moments (GMM). MM is based on the so-called analogy principle.
If economic theory provides population moments of a model, by analogy we can specify corresponding sample moments and use these to estimate $\theta$.

We can show that the OLS estimator is a special case of the MM estimator. Consider the regression model
$$y = x'\beta + u \qquad (81)$$
where $x$ and $\beta$ are $K \times 1$ vectors. Assume that
$$E[u \,|\, x] = 0. \qquad (82)$$
The single conditional moment restriction (82) implies $K$ unconditional moment restrictions
$$E[xu] = 0 \qquad (83)$$
since, using the Law of Iterated Expectations and (82),
$$E[xu] = E_x\left[E[xu \,|\, x]\right] = E_x\left[x\,E[u \,|\, x]\right] = E_x[x \cdot 0] = 0.$$
From (81) and (83) we obtain the population moment condition
$$E[xu] = E\left[x(y - x'\beta)\right] = 0.$$
The MM estimator solves the corresponding sample moment condition
$$\sum_{i=1}^{n} x_i\left(y_i - x_i'\hat\beta\right) = 0. \qquad (84)$$
Rearranging (84) to solve for $\hat\beta$ yields
$$\hat\beta_{MM} = \left(\sum_{i=1}^{n} x_i x_i'\right)^{-1}\sum_{i=1}^{n} x_i y_i = (X'X)^{-1}X'y,$$
which coincides with the OLS estimator.

Economic theory can generate moment conditions that can be used as a basis for estimation. Consider the model
$$y_i = E[y_i \,|\, x_i; \theta] + u_i \qquad (85)$$
where the first right-hand-side term measures the "anticipated" component of $y$, conditional on $x_i$, and the second term measures the "unanticipated" component. Here $y_i$ can be, e.g., a measure of demand. Under the assumptions of rational expectations and market clearing, we can conclude that
$$E\left[y_i - E[y_i \,|\, x_i; \theta] \,\middle|\, I_i\right] = 0,$$
where $I_i$ denotes all information available. By the Law of Iterated Expectations,
$$E\left[z_i\left(y_i - E[y_i \,|\, x_i; \theta]\right)\right] = 0 \qquad (86)$$
where $z_i$ is formed from any subset of $I_i$. This provides many moment conditions (86) that can be used for estimation. If $\dim(z_i) > \dim(\theta)$, then not all conditions (86) may be satisfied with exact equality in the sample. In such cases, instead of a direct analogy of (86) expressed as an MM estimator, we minimize the weighted average of (86).
This leads to the Generalized Method of Moments estimator, which minimizes the quadratic form
$$Q_n(\theta) = \left[\frac{1}{n}\sum_{i=1}^{n} z_i u_i\right]' W_n \left[\frac{1}{n}\sum_{i=1}^{n} z_i u_i\right]$$
where $u_i = y_i - E[y_i \,|\, x_i; \theta]$.

23.2 GMM Principle

Assume the existence of $r$ moment conditions for $q$ parameters,
$$E[h(w_i, \theta_0)] = 0,$$
where $\theta$ is $q \times 1$, $h(\cdot)$ is $r \times 1$ with $r \ge q$, and $\theta_0$ denotes the true population value of $\theta$. The vector $w_i$ includes all exogenous and endogenous observable variables.

If $r = q$, the model is said to be just identified, and $\theta_0$ can be estimated by the MM estimator defined as the solution to
$$\frac{1}{n}\sum_{i=1}^{n} h(w_i, \theta) = 0.$$
Equivalently, the MM estimator minimizes
$$Q_n(\theta) = \left[\frac{1}{n}\sum_{i=1}^{n} h(w_i, \theta)\right]'\left[\frac{1}{n}\sum_{i=1}^{n} h(w_i, \theta)\right]. \qquad (87)$$
If $r > q$, the model is said to be overidentified, and (87) has no solution. Instead, the $\hat\theta_{GMM}$ estimator is chosen so that a quadratic form in (87) is as close to zero as possible. Specifically, $\hat\theta_{GMM}$ minimizes the quadratic form
$$Q_n(\theta) = \left[\frac{1}{n}\sum_{i=1}^{n} h(w_i, \theta)\right]' W_n \left[\frac{1}{n}\sum_{i=1}^{n} h(w_i, \theta)\right] \qquad (88)$$
where $W_n$ is an $r \times r$ weight matrix. Differentiating (88) w.r.t. $\theta$ yields the FOC
$$\left[\frac{1}{n}\sum_{i=1}^{n}\frac{\partial h(w_i, \hat\theta)}{\partial\theta'}\right]' W_n \left[\frac{1}{n}\sum_{i=1}^{n} h(w_i, \hat\theta)\right] = 0.$$
These equations will generally be nonlinear in $\hat\theta$ and will be solved using numerical methods. Let
$$G_0 = \mathrm{plim}\;\frac{1}{n}\sum_{i=1}^{n}\left.\frac{\partial h_i}{\partial\theta'}\right|_{\theta = \theta_0} \qquad \text{and} \qquad S_0 = \mathrm{plim}\;\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}\left.h_i h_j'\right|_{\theta = \theta_0}$$
where $h_i = h(w_i, \theta)$. The corresponding consistent estimators are
$$\hat G = \frac{1}{n}\sum_{i=1}^{n}\left.\frac{\partial h_i}{\partial\theta'}\right|_{\theta = \hat\theta}, \qquad \hat S = \frac{1}{n}\sum_{i=1}^{n}\left.h_i h_i'\right|_{\theta = \hat\theta} \qquad (89)$$
for iid data. Different choices for the weighting matrix $W_n$ yield different GMM estimators. When $S_0$ is known, the most efficient GMM results for the choice of
$$W_n = S_0^{-1}.$$
Then,
$$\sqrt{n}\left(\hat\theta_{GMM} - \theta_0\right) \xrightarrow{d} N\left(0, \left(G_0' S_0^{-1} G_0\right)^{-1}\right). \qquad (90)$$
The optimal GMM estimator weights by the inverse of the variance matrix of the sample moment conditions. The proof of the result (90) follows from the Multivariate Delta Method and the Continuous Mapping Theorem.
In practice, $S_0$ is unknown and needs to be estimated by $\hat S$. The optimal GMM estimator can be obtained using a two-step procedure:

1. Set $W_n = I_r$ and obtain a suboptimal GMM estimator. Obtain $\hat S$ using (89).

2. Run the optimal GMM estimator with $W_n = \hat S^{-1}$.

Although replacing $S_0$ with $\hat S$ in (90) makes no difference asymptotically, since $\hat S$ is consistent, it will make a difference in finite samples. Depending on the application, using $\hat S$ may result in a small-sample bias. In general, adding moment restrictions improves asymptotic efficiency, as it reduces the limit variance $(G_0' S_0^{-1} G_0)^{-1}$ of the optimal GMM estimator, or at least leaves it unchanged. Nonetheless, the number of moment restrictions cannot exceed the number of observations.

GMM defines a class of estimators, with different choices of $h(\cdot)$ corresponding to different members of the class. For example:

$h_i = x_i u_i$ yields the OLS estimator.
$h_i = x_i u_i / V[u_i \,|\, x_i]$ yields the GLS estimator when errors are heteroskedastic.

If complete distributional assumptions are made on the model, the most efficient estimator is the MLE. Thus the optimal choice of $h_i$ is
$$h_i = \frac{\partial \ln f(w_i; \theta)}{\partial\theta}$$
where $f(w_i; \theta)$ is the joint density of $w_i$.

Hypothesis tests on $\theta$ can be performed using the Wald test. There is also a general model specification test that can be used for overidentified models. Let
$$H_0 : E[h(w, \theta_0)] = 0,$$
which tests the initial population moment condition.
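The two-step procedure above can be sketched for an overidentified linear model, $E[z_i(y_i - x_i\beta)] = 0$ with two instruments and one parameter, where the GMM minimizer has a closed form. The data-generating process and seed below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
n, beta_true = 2_000, 1.5

z = rng.normal(size=(n, 2))                        # two instruments
x = z @ np.array([1.0, 0.5]) + rng.normal(size=n)  # regressor correlated with z
y = beta_true * x + rng.normal(size=n)

def gmm(W):
    # Minimize gbar(b)' W gbar(b) with gbar(b) = Z'(y - x*b)/n; closed form:
    # b = (x'Z W Z'x)^{-1} x'Z W Z'y
    zx, zy = z.T @ x, z.T @ y
    return (zx @ W @ zy) / (zx @ W @ zx)

b1 = gmm(np.eye(2))                    # step 1: identity weight matrix
g = z * (y - x * b1)[:, None]          # moment contributions at b1
S_hat = g.T @ g / n                    # estimated moment variance, as in (89)
b2 = gmm(np.linalg.inv(S_hat))         # step 2: optimal weight W = S_hat^{-1}
```

Both steps recover the true slope; the second step reweights the two moments by the inverse of their estimated variance.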
The test can be implemented by measuring the closeness of $n^{-1}\sum_{i=1}^{n}\hat h_i$ to $0$. For the special case of the optimal GMM, under the null, the test statistic
$$OIR = n\left[\frac{1}{n}\sum_{i=1}^{n}\hat h_i\right]'\hat S^{-1}\left[\frac{1}{n}\sum_{i=1}^{n}\hat h_i\right] \xrightarrow{d} \chi^2_{r-q}.$$

23.3 Example: Euler Equation Asset Pricing Model

Following Hansen and Singleton (1982), a representative agent is assumed to choose an optimal consumption path by maximizing the present discounted value of lifetime utility from consumption,
$$\max\; E\left[\sum_{t=1}^{\infty}\beta_0^t\,U(C_t)\,\middle|\, I_t\right]$$
subject to the budget constraint
$$C_t + P_t Q_t \le V_t Q_{t-1} + W_t$$
where $I_t$ denotes the information available at time $t$, $C_t$ denotes real consumption at $t$, $W_t$ denotes real labor income at $t$, $P_t$ denotes the price of a pure discount bond maturing at time $t+1$ that pays $V_{t+1}$, $Q_t$ represents the quantity of bonds held at $t$, and $\beta_0$ represents a time discount factor.

The first-order condition for the maximization problem can be represented as the Euler equation
$$E\left[\beta_0(1 + R_{t+1})\frac{U'(C_{t+1})}{U'(C_t)} - 1\,\middle|\, I_t\right] = 0 \qquad (91)$$
where $1 + R_{t+1} = V_{t+1}/P_t$ is the gross return on the bond at time $t+1$. Assuming utility has the power form
$$U(C) = \frac{C^{1-\gamma_0}}{1 - \gamma_0}, \qquad \gamma_0 \ge 0,$$
where $\gamma_0$ represents the intertemporal rate of substitution (risk aversion), then
$$\frac{U'(C_{t+1})}{U'(C_t)} = \left(\frac{C_{t+1}}{C_t}\right)^{-\gamma_0}. \qquad (92)$$
Using (92) in (91), the conditional moment equation becomes
$$E\left[\beta_0(1 + R_{t+1})\left(\frac{C_{t+1}}{C_t}\right)^{-\gamma_0} - 1\,\middle|\, I_t\right] = 0. \qquad (93)$$
Define the nonlinear error term as
$$\varepsilon_{t+1} = \beta_0(1 + R_{t+1})\left(\frac{C_{t+1}}{C_t}\right)^{-\gamma_0} - 1 = a(z_{t+1}, \theta_0)$$
with $z_{t+1} = (R_{t+1}, C_{t+1}/C_t)'$ and $\theta_0 = (\beta_0, \gamma_0)'$. Then the conditional moment equation (93) may be represented as
$$E[\varepsilon_{t+1} \,|\, I_t] = E[a(z_{t+1}, \theta_0) \,|\, I_t] = 0. \qquad (94)$$
Let
$$x_t = (1,\; C_t/C_{t-1},\; C_{t-1}/C_{t-2},\; R_t,\; R_{t-1})'.$$
Since $x_t \in I_t$, (94) implies that
$$E[x_t\varepsilon_{t+1} \,|\, I_t] = E[x_t\,a(z_{t+1}, \theta_0) \,|\, I_t] = 0$$
and by the Law of Iterated Expectations
$$E[x_t\varepsilon_{t+1}] = 0.$$
For GMM estimation, define the nonlinear residual as
$$e_{t+1} = \beta(1 + R_{t+1})\left(\frac{C_{t+1}}{C_t}\right)^{-\gamma} - 1.$$
Form the vector of moments
$$h(w_{t+1}, \theta) = x_t\,e_{t+1} = \begin{pmatrix} \beta(1 + R_{t+1})\left(\frac{C_{t+1}}{C_t}\right)^{-\gamma} - 1 \\[4pt] \left(\frac{C_t}{C_{t-1}}\right)\left[\beta(1 + R_{t+1})\left(\frac{C_{t+1}}{C_t}\right)^{-\gamma} - 1\right] \\[4pt] \left(\frac{C_{t-1}}{C_{t-2}}\right)\left[\beta(1 + R_{t+1})\left(\frac{C_{t+1}}{C_t}\right)^{-\gamma} - 1\right] \\[4pt] R_t\left[\beta(1 + R_{t+1})\left(\frac{C_{t+1}}{C_t}\right)^{-\gamma} - 1\right] \\[4pt] R_{t-1}\left[\beta(1 + R_{t+1})\left(\frac{C_{t+1}}{C_t}\right)^{-\gamma} - 1\right] \end{pmatrix}$$
There are $r = 5$ moment conditions to identify $q = 2$ model parameters, giving $r - q = 3$ overidentifying restrictions.

24 Testing of Nonlinear Hypotheses

A generic version of a hypothesis for a test of linear restrictions can be stated as:
$$H_0 : R\theta_0 - r = 0$$
$$H_A : R\theta_0 - r \ne 0$$
for $h$ restrictions on $k$ parameters with $h \le k$, where $R$ is an $h \times k$ matrix of full rank $h$, and $r$ is an $h \times 1$ vector of constants.

Example: the $H_0$ that $\theta_1 = 1$ and $\theta_2 = 2 + \theta_3$ when $k = 4$ can be expressed with
$$R = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & -1 & 0 \end{pmatrix}, \qquad r = \begin{pmatrix} 1 \\ 2 \end{pmatrix}.$$

A generic version of a hypothesis for a test of nonlinear restrictions can be stated as:
$$H_0 : h(\theta_0) = 0$$
$$H_A : h(\theta_0) \ne 0$$
where $h(\cdot)$ is an $h \times 1$ vector-valued function of $\theta$. Example: $h(\theta) = \theta_1/\theta_2 - 1 = 0$.

We assume that $h(\cdot)$ is such that the $h \times q$ matrix
$$R(\theta) = \frac{\partial h(\theta_0)}{\partial\theta'}$$
is of full rank $h$. This assumption is the analogue of the linear independence of the linear restrictions $R\theta$, with $R = R(\theta)$.

We will present three different procedures for testing a nonlinear $H_0$:

1. Wald Test
2. Likelihood Ratio Test
3. Lagrange Multiplier Test

All three tests are asymptotically equivalent, in the sense that all three test statistics converge in distribution to the same random variable under $H_0$. With $h$ restrictions, the limiting distribution is $\chi^2_h$. In practice we choose among these tests based on ease of computation and finite-sample performance.

Let $\ell(\tilde\theta_{MLE})$ denote the log-likelihood function evaluated at the ML estimate of the restricted model. Let $\ell(\hat\theta_{MLE})$ denote the log-likelihood function evaluated at the ML estimate of the unrestricted model.
24.1 Wald Test

Intuition: to test $h(\theta_0) = 0$, obtain $\hat\theta$ (without imposing the restrictions $h$), and assess whether $h(\hat\theta)$ is close to $0$. The closeness is assessed using the Wald test statistic $W$, which is a quadratic form in the vector $h(\hat\theta)$ and the inverse of the estimate of its covariance matrix. When $h(\hat\theta) \stackrel{a}{\sim} N(0, V[h(\hat\theta)])$ under $H_0$, then the Wald test statistic is
$$W = h(\hat\theta)'\left[\hat V[h(\hat\theta)]\right]^{-1}h(\hat\theta) \stackrel{a}{\sim} \chi^2_h. \qquad (95)$$
Note that the covariance matrix inverse acts as a set of weights on the quadratic distance of $h(\hat\theta)$ from zero. The Wald test is generic in that it does not require an underlying MLE procedure.

Using a first-order Taylor series expansion under $H_0$, $h(\hat\theta)$ has the same limit distribution as $R(\theta_0)(\hat\theta - \theta_0)$. Then $h(\hat\theta)$ is asymptotically Normal under $H_0$ with mean zero and variance matrix
$$V[h(\hat\theta)] = R(\theta_0)\,V(\hat\theta)\,R(\theta_0)'.$$
A consistent estimate is
$$\hat V[h(\hat\theta)] = \frac{1}{n}\hat R\,\hat C\,\hat R' \qquad (96)$$
where $\hat R = R(\hat\theta)$, when $\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, C_0)$ and $\hat C$ is any consistent estimate of $C_0$. Using (96) in (95), the Wald test statistic can be expressed as
$$W = n\,h(\hat\theta)'\left[\hat R\,\hat C\,\hat R'\right]^{-1}h(\hat\theta).$$
An equivalent expression is
$$W = h(\hat\theta)'\left[\hat R\,\hat V(\hat\theta)\,\hat R'\right]^{-1}h(\hat\theta)$$
where $\hat V(\hat\theta) = n^{-1}\hat C$ is the estimated asymptotic variance of $\hat\theta$.

The Wald test can be implemented as a nonlinear F test:
$$F = \frac{W}{h} \stackrel{a}{\sim} F[h, n - k].$$
This yields the regular F test in linear multiple regression as a special case. Moreover, in the linear model, for a test of one restriction, $\sqrt{W}$ is equivalent to the t statistic. One limitation of the Wald statistic is that $W$ is not invariant to reformulations of the restrictions. Finite-sample performance of $W$ may be quite poor.

24.2 Likelihood Ratio Test

Define
$$L(\theta) \equiv \sum_{i=1}^{n}\ell_i(\theta).$$
The Likelihood Ratio test statistic is
$$LR = 2\left[L(\hat\theta) - L(\tilde\theta)\right] \xrightarrow{d} \chi^2_h$$
where $\tilde\theta$ is the MLE in the model restricted under $H_0$ and $\hat\theta$ is the MLE of the unrestricted model.
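As a concrete instance of the Wald statistic (95), a single restriction on a scalar mean, $H_0: \theta = \theta_0$, gives $W = n(\bar X - \theta_0)^2/s^2$, asymptotically $\chi^2_1$. A size simulation under the null (sample sizes and seed are illustrative; 3.841 is the $\chi^2_1$ 5% critical value):

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps, theta0 = 100, 50_000, 0.0

x = rng.normal(loc=theta0, scale=1.0, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

# Wald statistic for H0: theta = theta0, one value per replication;
# here h(theta) = theta - theta0 and V_hat[h] = s^2 / n
W = n * (xbar - theta0) ** 2 / s2

rejection_rate = np.mean(W > 3.841)   # should be close to the nominal 5%
```

The empirical rejection rate sits near 5%, and the mean of $W$ is near 1, the mean of a $\chi^2_1$ variable.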
In the linear model, the F statistic results as a special case of LR.

24.3 Lagrange Multiplier Test

The Lagrange Multiplier (LM) test is based on a vector of Lagrange multipliers from a constrained maximization problem. In practice, LM tests are computed based on the gradient (score) vector of the unrestricted $\ln L$ function evaluated at the restricted estimates.

Consider the special case of $H_0: \theta_2 = 0$, where $\theta$ is partitioned as $\theta = (\theta_1, \theta_2)$. The vector of restricted estimates can then be expressed as $\tilde\theta = (\tilde\theta_1, 0)$. The vector $\tilde\theta_1$ maximizes the restricted function $\ln L(\theta_1, 0)$ and so it satisfies the restricted likelihood equations
$$s_1(\tilde\theta_1, 0) = 0 \tag{97}$$
where $s_1$ is the vector whose components are the $k - h$ partial derivatives of $\ln L$ w.r.t. the elements of $\theta_1$. If we partition $s(\tilde\theta) = (s_1(\tilde\theta), s_2(\tilde\theta))$, the first $k - h$ elements, which form the vector $s_1(\tilde\theta)$, are equal to zero by (97). However, the $h$-vector $s_2(\tilde\theta)$ is in general non-zero. The LM test is a statistic based on a quadratic form in $s_2(\tilde\theta)$:
$$LM = s_2(\tilde\theta)'\, \tilde{\mathcal{I}}_{22}^{-1}\, s_2(\tilde\theta)$$
where $\tilde{\mathcal{I}}_{22}$ is the sub-block of the Fisher information matrix corresponding to $s_2(\tilde\theta)$. The conditions (97) imply that this is equivalent to
$$LM = s(\tilde\theta)'\, \tilde{\mathcal{I}}^{-1}\, s(\tilde\theta) \tag{98}$$

As (98) does not depend on the partitioning $(\theta_1, \theta_2)$, the LM statistic is invariant to model reparametrization. Since (98) is based on the score function $s$, this LM statistic is also called the score test.

24.4 LM Test in Auxiliary Regressions

The LM test can be expressed as $nR^2$ in an auxiliary regression. Define the $(n \times k)$ matrix of scores
$$S = \begin{pmatrix} s_1(\tilde\theta)' \\ \vdots \\ s_n(\tilde\theta)' \end{pmatrix}$$
and let $\iota$ denote an $(n \times 1)$ column of ones. Then
$$\iota'\iota = n, \qquad S'\iota = \sum_{i=1}^n s_i(\tilde\theta), \qquad S'S = \sum_{i=1}^n s_i(\tilde\theta)\, s_i(\tilde\theta)'$$
Then, using the information matrix equality, we can rewrite the LM statistic as
$$LM = \left(\sum_{i=1}^n s_i(\tilde\theta)\right)' \left(\sum_{i=1}^n s_i(\tilde\theta)\, s_i(\tilde\theta)'\right)^{-1} \left(\sum_{i=1}^n s_i(\tilde\theta)\right) = \iota'S\,(S'S)^{-1}S'\iota \tag{99}$$

Now consider the auxiliary regression
$$\iota = S\gamma + \text{residual} \tag{100}$$
The OLS estimator is given by
$$\hat\gamma = (S'S)^{-1}S'\iota$$
and the predicted values by
$$\hat\iota = S\hat\gamma = S(S'S)^{-1}S'\iota \tag{101}$$

Using (99) and (101), the LM test can then be written as
$$
LM = \iota'S(S'S)^{-1}S'\iota
= n\,\frac{\iota'S(S'S)^{-1}(S'S)(S'S)^{-1}S'\iota}{\iota'\iota}
= n\,\frac{\hat\iota'\hat\iota}{\iota'\iota}
= n\,\frac{ESS}{TSS} = nR^2
$$
where $ESS$ is the explained variation, $TSS$ is the total variation, and $R^2$ is the regression R-squared (uncentered, since the dependent variable is a column of ones).

The auxiliary regression (100) was used above to make the link between the LM statistic and its $nR^2$ form simple and transparent. Various auxiliary regressions are used in place of (100) depending on the context. Examples:

Breusch-Pagan test for no heteroskedasticity: $\hat u_i^2 = x_i'\gamma + \text{residual}$

Breusch-Godfrey test for no first-order autocorrelation: $\hat u_t = \rho\, \hat u_{t-1} + x_t'\gamma + \text{residual}$

Using the $nR^2$ test statistic, these are viewed as special cases of the LM test.

24.5 Test Comparison

All three tests, Wald, LR, and LM, are asymptotically distributed $\chi^2_h$ under $H_0$. The finite-sample distributions of these test statistics differ. For testing of linear restrictions in the linear model under residual Normality,
$$W \ge LR \ge LM$$
In practice, in this case the Wald test will reject $H_0$ more often than the LR test, which in turn will reject more often than the LM test.

Ease of estimation: the Wald test only requires estimation of the model under $H_A$ and is best to use when the restricted model is difficult to estimate. The LM test only requires estimation under $H_0$ and is best to use when the unrestricted model is difficult to estimate.

The lack of invariance to reparametrization is a major weakness of the Wald test.

25 Bootstrap Approximation

Exact finite-sample results are unavailable for most estimators and related test statistics.
The statistical methods presented above rely on asymptotic approximations, which may or may not be reliable in finite samples. An alternative approximation is provided by the bootstrap (Efron 1979, 1982), which approximates the distribution of a statistic (an estimator or a test) by numerical simulation. We can calculate so-called bootstrap standard errors, confidence intervals, and critical values for hypothesis testing in place of the asymptotic ones.

Bootstrap approximation methods are used in place of asymptotic approximation for a number of reasons, including:

- To avoid the analytical calculation of the estimated asymptotic variance in cases when the asymptotic sampling distribution is very difficult or infeasible to obtain, such as for multi-step estimators;
- To gain better approximation accuracy than the asymptotic distribution (so-called "higher-order" refinement);
- To bias-correct an estimator.

The justification for the bootstrap relies on asymptotic theory. In situations where the latter fails, the former can also fail. As with asymptotics, bootstrap inference is only exact in infinitely large samples.

25.1 The Empirical Distribution Function

The Empirical Distribution Function (EDF) for a sample $\{y_i\}_{i=1}^n$ is defined as
$$F_n(y) = \frac{1}{n}\sum_{i=1}^n 1\{y_i \le y\}$$
where $1\{\omega\}$ is the indicator function
$$1\{\omega\} = \begin{cases} 1 & \text{if } \omega \text{ is true} \\ 0 & \text{if } \omega \text{ is false} \end{cases}$$
$F_n$ is a nonparametric estimate of the population distribution function $F$. $F_n$ is by construction a step function.

The EDF is a consistent estimator of the CDF. To see this, note that for any $y$, $1\{y_i \le y\}$ is an iid random variable with expectation $F(y)$. Thus by the LLN,
$$F_n(y) \overset{p}{\to} F(y)$$
Furthermore, by the CLT,
$$\sqrt{n}\left(F_n(y) - F(y)\right) \overset{d}{\to} N\big(0,\; F(y)(1 - F(y))\big)$$

To see the effect of sample size on the EDF, the Figure below shows the EDF and true CDF for four random samples of size $n = 25, 50, 100$, and $500$. The random draws are from the $N(0,1)$ distribution.
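The EDF defined above can be computed directly; a minimal numpy sketch (sample size, grid, and seed are illustrative):

```python
# Hedged sketch: the EDF F_n for a sample, evaluated on a grid,
# compared with the true N(0,1) CDF it estimates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(size=100)          # sample of n = 100 draws from N(0,1)
grid = np.linspace(-3, 3, 61)

# F_n(t) = (1/n) * sum of indicators 1{y_i <= t}
edf = np.array([(y <= t).mean() for t in grid])

# sup-distance between the EDF and the true CDF on the grid
max_gap = np.abs(edf - stats.norm.cdf(grid)).max()
```

By construction `edf` is a nondecreasing step function with values in $[0,1]$, and `max_gap` shrinks as $n$ grows, matching the figure's message.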
For $n = 25$ the EDF is only a crude approximation to the CDF, but the approximation improves for larger $n$. In general, as the sample size gets larger, the EDF step function gets uniformly close to the true CDF.

[Figure: Effect of sample size on the EDF]

25.2 Bootstrap

The bootstrap assumes that the sample is in fact the population. Instead of drawing from a specified population distribution by a random number generator (as a Monte Carlo simulation would), the bootstrap draws with replacement from the sample, i.e. from the Empirical Distribution Function (EDF). In other words, the bootstrap assumes that the EDF is the population distribution function and the sample is thus "truly" representative of the population. This assumption may or may not be satisfied in practice.

Let $F$ denote a distribution function for the population of observations $(y_i, x_i)$. Let
$$T_n = T_n((y_1, x_1), \ldots, (y_n, x_n), F)$$
be a statistic of interest, for example an estimator $\hat\theta$ or a t-statistic $(\hat\theta - \theta)/s(\hat\theta)$. Note that we write $T_n$ as a function of $F$. For example, the t-statistic depends on $\theta$, which itself is a function of $F$.

The exact (unknown) CDF of $T_n$ when the data are sampled from the distribution $F$ is
$$G_n(u, F) = P(T_n \le u \mid F)$$
Ideally, inference would be based on $G_n(u, F)$. This is generally impossible since $F$ is unknown. Asymptotic inference is based on approximating $G_n(u, F)$ with
$$G(u, F) = \lim_{n\to\infty} G_n(u, F)$$
When $G(u, F) = G(u)$ does not depend on $F$, we say that $T_n$ is asymptotically pivotal and use the distribution function $G(u)$ for inferential purposes.

The bootstrap constructs an alternative approximation. The unknown $F$ is replaced by the EDF estimate $F_n$. Plugged into $G_n(u, F)$ we obtain
$$G_n^*(u) = G_n(u, F_n)$$
We call $G_n^*$ the bootstrap distribution. Bootstrap inference is then based on $G_n^*(u)$. A random sample from $F_n$ is called the bootstrap data: denote these by $(y_i^*, x_i^*)$.
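Drawing from $F_n$ amounts to resampling the observed data with replacement. A minimal sketch of bootstrap standard errors for the sample mean (the data-generating values and seed are illustrative assumptions):

```python
# Hedged sketch: nonparametric bootstrap standard error for the sample mean,
# drawing n observations with replacement from the sample (i.e. from F_n).
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=1.0, scale=2.0, size=200)   # observed sample
B = 1000                                       # bootstrap replications

theta_star = np.empty(B)
for b in range(B):
    y_star = rng.choice(y, size=y.size, replace=True)  # bootstrap sample
    theta_star[b] = y_star.mean()                      # bootstrap statistic

se_boot = theta_star.std(ddof=1)                  # bootstrap standard error
se_asym = y.std(ddof=1) / np.sqrt(y.size)         # asymptotic s.e., for comparison
```

For the sample mean the bootstrap and asymptotic standard errors should be close; the bootstrap becomes more useful for statistics whose asymptotic variance is hard to derive.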
The statistic
$$T_n^* = T_n((y_1^*, x_1^*), \ldots, (y_n^*, x_n^*), F_n)$$
is a random variable with distribution $G_n^*$. We call $T_n^*$ the bootstrap statistic.

The bootstrap algorithm is similar to our discussion of Monte Carlo simulation, but drawing from $F_n$ instead of some assumed $F$. The bootstrap sample size for each replication is the same as the original sample size $n$. Hence, a bootstrap sample $\{(y_1^*, x_1^*), \ldots, (y_n^*, x_n^*)\}$ will typically contain multiple replications of the same observation.

The bootstrap statistic $T_n^* = T_n((y_1^*, x_1^*), \ldots, (y_n^*, x_n^*), F_n)$ is calculated for each bootstrap sample. This is repeated $B$ times, where $B$ is known as the number of bootstrap replications. It is desirable for $B$ to be large, so long as the computational costs are reasonable. $B = 1000$ typically suffices.

26 Elements of Bayesian Analysis

Bayesian methods provide a set of tools for statistical analysis that are an alternative to frequentist methods. Many researchers treat frequentist and Bayesian methods as complementary and use both in areas of their relative strength. We have examined the foundations of each approach earlier in the course. We will now first briefly characterize their similarities and differences, and then describe Bayesian methods in closer detail.
Frequentist Methods:

Before observing the data:
- Start with the formulation of a model that seeks to adequately describe the situation of interest
- Utilize a rule to estimate the unobservables (parameters, latent variables)
- Determine the asymptotic (or bootstrap) properties of the rule

After observing the data:
- Update the system by conditioning on the data
- Learn about the unobservables by running a max/min routine or via numerical methods
- Perform inference based on the asymptotic distribution of the estimation rule (form "confidence intervals")
- Perform prediction based on the model and the asymptotic distribution of the estimation rule

Bayesian Methods:

Before observing the data:
- Start with the formulation of a model that seeks to adequately describe the situation of interest
- Utilize a rule to estimate the unobservables (parameters, latent variables)
- Formulate a "prior" probability density over the unobservable quantities

After observing the data:
- Update the system by conditioning on the data using Bayes' rule, yielding a "posterior" probability density over the unobservables
- Learn about the unobservables by taking their "draws" from the posterior density (direct sampling, MCMC)
- Perform inference based on these draws (form "credible sets")
- Perform prediction based on the model and these draws by computing predictive distributions

Let $\pi(\theta)$ represent the prior state of knowledge about $\theta$. Let $Y_1, \ldots, Y_n \sim f(Y|\theta)$. Using Bayes' Rule, the posterior is
$$\pi(\theta|Y_1,\ldots,Y_n) \propto f(Y_1,\ldots,Y_n|\theta)\,\pi(\theta) = L(\theta|Y_1,\ldots,Y_n)\,\pi(\theta)$$
where $L(\theta|Y_1,\ldots,Y_n)$ is the likelihood function.

The posterior distribution $\pi(\theta|\mathbf{Y})$ summarizes our current state of knowledge about $\theta$. $\pi(\theta|\mathbf{Y})$ is a distribution over the whole parameter space of $\theta$. In contrast, a frequentist estimator is a function of the data, $\hat\theta = \hat\theta(\mathbf{Y})$, which yields a single value of $\theta$.
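The posterior update $\pi(\theta|\mathbf{Y}) \propto L(\theta|\mathbf{Y})\,\pi(\theta)$ can be illustrated with the conjugate Normal-mean case; a minimal sketch (the prior values, known variance, and data are illustrative assumptions):

```python
# Hedged sketch: conjugate posterior for theta in Y_i ~ N(theta, sigma2)
# with known sigma2 and prior theta ~ N(mu0, tau0_sq).
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 1.0                  # known data variance (assumption)
mu0, tau0_sq = 0.0, 10.0      # prior mean and variance (assumption)
y = rng.normal(loc=0.5, scale=np.sqrt(sigma2), size=50)

n = y.size
# Precision-weighted combination of prior and sample information
post_prec = 1 / tau0_sq + n / sigma2
post_var = 1 / post_prec
post_mean = post_var * (mu0 / tau0_sq + n * y.mean() / sigma2)
```

With a diffuse prior ($\tau_0^2$ large) the posterior mean is close to the sample mean, while the posterior variance is always smaller than the prior variance: the data dominate the prior, in line with the Bernstein-Von Mises result below.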
26.1 The Normal Linear Regression Model

Consider the multiple linear regression model
$$y = X\beta + \varepsilon$$
where $X$ is $n \times k$ with rank $k$. As in Maximum Likelihood analysis, assume
$$\varepsilon \sim N(0, h^{-1} I_n)$$
Then, conditional on $X$,
$$f(y|X,\theta) = N(y \mid X\beta,\; h^{-1} I_n)$$
where $\theta = (\beta, h)$. Let the prior be given by
$$\pi(\beta) = N(\beta \mid \beta_0, H_0^{-1}), \qquad \pi(h) = \text{Gamma}(h \mid a, b)$$
Then
$$\pi(\beta \mid h, X, y) \propto \exp\left(-\tfrac{1}{2}(\beta - \bar\beta)'\,\bar H\,(\beta - \bar\beta)\right) \propto N(\beta \mid \bar\beta, \bar H^{-1})$$
where
$$\bar H = h\,X'X + H_0, \qquad \bar\beta = \bar H^{-1}\left(h\,X'y + H_0\beta_0\right)$$
and
$$\pi(h \mid \beta, X, y) = \text{Gamma}\left(h \,\Big|\; n/2 + a,\; \left[(y - X\beta)'(y - X\beta)/2 + b^{-1}\right]^{-1}\right)$$

26.2 Bernstein-Von Mises Theorem

Theorem 99 (Bernstein-Von Mises). The posterior distribution converges to a Normal distribution with covariance matrix equal to $1/n$ times the inverse of the information matrix.

This result says that under general conditions the posterior distribution for the unknown quantities in any problem is effectively independent of the prior distribution once the amount of information supplied by a sample of data is large enough. That is, in large samples, the choice of a prior distribution is not important, in the sense that the information in the prior distribution gets dominated by the sample information. The Theorem implies that the asymptotic distribution of the posterior mean is the same as that of the MLE.

26.3 Posterior Sampling

In place of frequentist maximization of the likelihood, Bayesians obtain a sequence of realizations (or "draws") $\{\theta^r\}_{r=1}^R$ from the posterior $\pi(\theta|X)$. The analyst chooses the number of draws $R$. This sequence characterizes many posterior properties of interest, such as moments (mean, variance) or quantiles. Under weak conditions,
$$\frac{1}{R}\sum_{r=1}^R h(\theta^r) \;\overset{R\to\infty}{\longrightarrow}\; \int h(\theta)\,\pi(\theta|X)\,d\theta$$
for any integrable real-valued function $h$. For example, the posterior mean is obtained as
$$\bar\theta = \frac{1}{R}\sum_{r=1}^R \theta^r$$
Inference is performed by constructing credible sets using quantiles $q$ of the draws:
$$95\%\ \text{CS} = [q_{2.5},\, q_{97.5}]$$

Direct sampling: used if a posterior density $\pi(\theta|X)$ is easy to draw from (e.g.
$N(\mu, 1)$).

Indirect sampling (for complicated posteriors):
- Acceptance and Importance Sampling
- Markov chain Monte Carlo (MCMC): Gibbs sampling, Metropolis-Hastings algorithm

26.3.1 Gibbs Sampling

Gibbs sampling is used when a multivariate joint posterior density is difficult to sample, but can be split into a sequence of conditional posterior densities that are easy to sample. Obtain a sample from $p(\theta_1, \theta_2)$ by drawing in turn from $p(\theta_2|\theta_1)$ and $p(\theta_1|\theta_2)$.

In our $Y_1, \ldots, Y_n \sim N(\mu, \sigma^2)$ example above, we found two conditional posteriors:
$$\pi(\mu \mid Y_1,\ldots,Y_n, \sigma^2) = N\left(\frac{n\bar Y/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2},\;\; \left(n/\sigma^2 + 1/\sigma_0^2\right)^{-1}\right)$$
$$\pi(\sigma^{-2} \mid Y_1,\ldots,Y_n, \mu) = \text{Gamma}\left(n/2 + a,\; \left(ns^2/2 + b^{-1}\right)^{-1}\right)$$
These results provide two Gibbs blocks for posterior sampling of $\mu, \sigma^{-2} \mid Y_1, \ldots, Y_n$.

Reference: Greenberg (2012), Geweke et al. (2011)

27 Markov Chain Monte Carlo

27.1 Acceptance Sampling

Consider a posterior $p(\theta|I)$ that is difficult to sample from, while a density $p(\theta|S)$ is easy to sample from. Let the ratio $p(\theta|I)/p(\theta|S)$ be bounded above by a constant $a$. Then draw $\theta^*$ from $p(\theta|S)$ and accept the draw with probability
$$\alpha = \frac{p(\theta^*|I)}{a\, p(\theta^*|S)}$$

Mathematica demonstration: Acceptance Sampling
http://demonstrations.wolfram.com/AcceptanceRejectionSampling/

27.2 Metropolis-Hastings Algorithm

The purpose of the Metropolis-Hastings (MH) algorithm is to draw from $p(\theta)$, where $p(\theta)$ is a posterior distribution. MH uses a Markov process that is uniquely defined by its transition probabilities. Denote by $p(\theta'|\theta)$ the probability of transitioning from any given state $\theta$ to any other given state $\theta'$. The Markov process has a unique stationary distribution $p(\theta)$ when the following two conditions are met:
1. Existence of a stationary distribution, for which a sufficient condition is detailed balance:
$$p(\theta)\,p(\theta'|\theta) = p(\theta')\,p(\theta|\theta')$$
which requires that each transition $\theta \to \theta'$ is "reversible".

2. Uniqueness of the stationary distribution, which is guaranteed by ergodicity of the Markov process, for which sufficient conditions are:
(a) Aperiodicity: the chain does not return to the same state at fixed intervals;
(b) Positive recurrence: the expected number of steps for returning to the same state is finite.

Detailed balance implies
$$\frac{p(\theta'|\theta)}{p(\theta|\theta')} = \frac{p(\theta')}{p(\theta)} \tag{102}$$

Now separate the transition into two sub-steps:
1. A proposal step, using the proposal distribution $g(\theta'|\theta)$
2. An acceptance-rejection step, using the acceptance distribution $A(\theta'|\theta)$

Thus:
$$p(\theta'|\theta) = g(\theta'|\theta)\,A(\theta'|\theta) \tag{103}$$
Using (103) in (102) yields
$$\frac{g(\theta'|\theta)\,A(\theta'|\theta)}{g(\theta|\theta')\,A(\theta|\theta')} = \frac{p(\theta')}{p(\theta)} \tag{104}$$
The acceptance probability that fulfills condition (104) is given by
$$A(\theta'|\theta) = \min\left(1,\; \frac{p(\theta')\,g(\theta|\theta')}{p(\theta)\,g(\theta'|\theta)}\right)$$
with the normalization $A(\theta|\theta') = 1$.

The Metropolis-Hastings Algorithm:

Step 1: Initialize $\theta^{(0)}$.

Step 2: In each iteration $g = 1, \ldots, n_0 + G$:
(a) Propose: draw a proposal value $\theta^*$ from $g(\theta|\theta^{(g)}, y)$ and calculate the quantity (the acceptance probability, or probability of move)
$$\alpha(\theta^*|\theta^{(g)}, y) = \min\left(1,\; \frac{p(\theta^*|y)\,g(\theta^{(g)}|\theta^*, y)}{p(\theta^{(g)}|y)\,g(\theta^*|\theta^{(g)}, y)}\right)$$
(b) Move: set
$$\theta^{(g+1)} = \begin{cases} \theta^* & \text{with probability } \alpha(\theta^*|\theta^{(g)}, y) \\ \theta^{(g)} & \text{with probability } 1 - \alpha(\theta^*|\theta^{(g)}, y) \end{cases}$$

Step 3: Discard the draws from the first $n_0$ iterations and save the subsequent $G$ draws $\theta^{(n_0+1)}, \ldots, \theta^{(n_0+G)}$.

Mathematica demonstration 1: Metropolis-Hastings algorithm
http://demonstrations.wolfram.com/MarkovChainMonteCarloSimulationUsingTheMetropolisAlgorithm/

Mathematica demonstration 2: Metropolis-Hastings algorithm 2
http://demonstrations.wolfram.com/TargetMotionWithTheMetropolisHastingsAlgorithm/
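The MH steps above can be sketched for a toy target; a minimal random-walk Metropolis example with a symmetric proposal, so that $g$ cancels in the acceptance probability (the target, tuning values, and seed are illustrative assumptions):

```python
# Hedged sketch: random-walk Metropolis-Hastings targeting p(theta) = N(0, 1).
import numpy as np

rng = np.random.default_rng(4)

def log_p(theta):
    # log target density, up to an additive constant
    return -0.5 * theta**2

n0, G = 500, 5000          # burn-in n0 and retained draws G
theta = 0.0                # initialize theta^(0)
draws = []
for g in range(n0 + G):
    proposal = theta + rng.normal(scale=1.0)            # (a) propose
    alpha = min(1.0, np.exp(log_p(proposal) - log_p(theta)))
    if rng.uniform() < alpha:                           # (b) move
        theta = proposal
    if g >= n0:                                         # discard burn-in
        draws.append(theta)

draws = np.array(draws)
```

The retained draws approximate the posterior: their mean and standard deviation should be close to 0 and 1, up to Monte Carlo error from the correlated chain.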
28 Neural Networks and Machine Learning

28.1 Artificial Neural Networks

Artificial Neural Networks (ANNs) are models that allow complex nonlinear relationships between the response variable and its predictors. A neural network is composed of observed and unobserved random variables, called neurons (also called nodes), organized in layers. The observed predictor variables form the "Input" layer, and the predictions form the "Output" layer. Intermediate layers contain unobserved random variables (so-called "hidden neurons").

The simplest ANN, with no hidden layers, is equivalent to a linear regression. In the ANN notation, the formula for the fitted regression model is
$$\hat y = a_1 + w_{1,1}x_1 + w_{2,1}x_2 + w_{3,1}x_3 + w_{4,1}x_4$$
The coefficients $w_{k,j}$ attached to the predictors $x_k$ are called weights. Once we add intermediate layer(s) with hidden neurons and activation functions, the ANN becomes nonlinear. An example shown in the following figure is known as a feedforward network (FFN).

The weights $w_{k,j}$ are selected in the ANN framework using a machine learning algorithm that minimizes a loss function, such as the Mean Squared Error (MSE) or the Sum of Squared Residuals (SSR). In the special case of linear regression, OLS provides an analytical solution to the learning algorithm that minimizes SSR. In general ANNs the response variable is a nonlinear function of the predictors, and hence OLS is not applicable as a method of model fit. A neural network with many hidden layers is called a deep neural network (DNN) and its training algorithm is called deep learning.

28.2 Feed-Forward Network

In an FFN, each layer of nodes receives inputs from the previous layers. The outputs of the nodes in one layer are inputs to the next layer. The inputs to each node are merged using a weighted linear combination.
For example, the inputs into hidden neuron $j$ (blue dots) in the figure above are combined linearly to give
$$z_j = a_j + \sum_{k=1}^4 w_{k,j}\, x_k \tag{105}$$
In the hidden layer, the linear combination $z_j$ is transformed using a nonlinear activation function $s(z_j)$, which adds flexibility and complexity to the model. Without the activation function, the model would be limited to a linear combination of predictors, i.e. multiple regression. Popular activation functions are the logistic (or sigmoid) function
$$\text{logistic}(z_j) = \frac{1}{1 + e^{-z_j}} \tag{106}$$
and the tanh function
$$\tanh(z_j) = \frac{e^{z_j} - e^{-z_j}}{e^{z_j} + e^{-z_j}}$$
shown in the following Figure:

[Figure: logistic and tanh activation functions]

For building an ANN model, we need to pre-specify in advance: the number of hidden layers, the number of nodes in each hidden layer, and the functional form of the activation function. The parameters $a_1, a_2, a_3$ and $w_{1,1}, \ldots, w_{4,3}$ are "learned" from the data.

28.3 Machine Learning for NNs

Machine learning for neural networks, also known as "training" or "optimization", involves searching for the set of weights that best enable the network to model the patterns in the data. The goal of a learning algorithm is typically to minimize a loss function that quantifies the lack of fit of the network to the data. We broadly distinguish several types of learning:

Supervised learning
- We supply the ANN with inputs and outputs, as in the examples above
- The weights are modified to reduce the difference between the predicted and actual outputs using a loss function
- Examples: NN (auto)regression; face, speech, or handwriting recognition; spam detection

Unsupervised learning
- We supply the ANN with inputs only
- The ANN works only on the input values so that similar inputs create similar outputs
- Examples: K-means clustering, dimensionality reduction

We will focus here on supervised learning only, as it is the method of choice in typical economic applications.
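The computations in (105) and (106) can be sketched as a single forward pass; the weights below are random placeholders rather than trained values:

```python
# Hedged sketch: forward pass of the FFN in (105)-(106): 4 inputs,
# 3 hidden neurons with logistic activation, one linear output.
import numpy as np

rng = np.random.default_rng(5)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=4)        # one input vector (x_1, ..., x_4)
a = rng.normal(size=3)        # hidden biases a_j
W = rng.normal(size=(4, 3))   # input-to-hidden weights w_{k,j}
b0 = 0.1                      # output intercept (placeholder)
w_out = rng.normal(size=3)    # hidden-to-output weights

z = a + x @ W                 # z_j = a_j + sum_k w_{k,j} x_k   (105)
h = logistic(z)               # s(z_j)                          (106)
y_hat = b0 + h @ w_out        # predicted output
```

The logistic activation squashes each $z_j$ into $(0,1)$ before the output layer takes its linear combination.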
Supervised learning typically involves many cycles of so-called forward propagation and backpropagation.

The process of forward propagation in FFNs involves:
- computing $z_j$, as in (105), at every hidden neuron $j$;
- applying the activation function $s(z_j)$ at each $j$, as in (106);
- constructing a linear combination of the $s(z_j)$'s to obtain the predicted output.

Once the predicted output is obtained at the output layer, we compute the loss or "error" (predicted output minus the actual output).

The goal of backpropagation is to adjust the weights in each layer to minimize the overall error (loss) at the output layer. One iteration of forward and backpropagation is called an epoch. Typically, many epochs (often tens of thousands) are required to train a neural network well.

The following steps are typically taken in FFN supervised learning for one vector of inputs $x_i = (x_{i,1}, \ldots, x_{i,K})$ and the corresponding output $y_i$:

1. (Initialization): Assign random weights and "biases" (parameters $a_j$) to each of the neurons in the hidden layer.
2. (Forward propagation start): Obtain $z_j = a_j + \sum_{k=1}^K w_{k,j}\, x_{i,k}$ at each hidden neuron.
3. Apply the activation function at each hidden neuron, $s(z_j)$.
4. Take this output and pass it on to the next layer of neurons, until the output layer (forward propagation end).
5. Calculate the error term at the output neuron(s) by $e_i = \hat y_i - y_i$, where $\hat y_i$ is the predicted output value and $y_i$ is the actual data value.
6. (Backpropagation start): Obtain the loss function, typically using the square loss
$$L(e_i) = e_i^2 = (\hat y_i - y_i)^2 \tag{107}$$
In the FFN example (105) and (106) above,
$$L(e_i) = \left(\sum_{j=1}^3 w_j\, s(z_j) - y_i\right)^2 = \left(\sum_{j=1}^3 w_j\, s\!\left(a_j + \sum_{k=1}^4 w_{k,j}\, x_{i,k}\right) - y_i\right)^2$$
7. Using the chain rule, obtain $\partial L(e_i)/\partial w$. In the FFN example (105) and (106) above,
$$\frac{\partial L(e_i)}{\partial w_{k,j}} = 2\, e_i\, w_j\, s'\!\left(a_j + \sum_{k=1}^4 w_{k,j}\, x_{i,k}\right) x_{i,k}$$
and similarly for $\partial L(e_i)/\partial w_j$. Update each weight by moving against its partial derivative (gradient descent method) (backpropagation end).

Backpropagation works well with relatively shallow networks with only a few hidden layers. As the network gets deeper, with more hidden layers, the training time can increase exponentially and the learning algorithm may not converge. The chain-rule multiplication of terms in the last step above often results in numerical values of less than 1 that compound during backpropagation. Weight adjustment slows down, and the deepest layers, closest to the inputs, either train very slowly or do not move away from their starting positions. This phenomenon is known as the vanishing gradient problem. A typical remedy involves changes in the network structure, such as in the case of Recurrent Neural Networks (RNNs) covered further below.

28.4 Cross-Validation

Machine learning methods typically use three different subsets of the complete data set:

1. A training set is used by the machine learning algorithm described above to train the model. The outputs are the fitted weights (model parameters) for any one given model structure (e.g. 2 hidden layers, 5 neurons each). Training the network minimizes the training loss, typically a mean squared error called the training MSE. An analyst often trains several or many networks with different structures. For the loss function $L(e_i)$ in (107) and a training set of size $M$, we obtain the training MSE as
$$L_M = \frac{1}{M}\sum_{i=1}^M L(e_i)$$

2. Each trained (optimized) model is then evaluated on a validation set, not used in training the model. Evaluating the loss function then yields the validation MSE for each model. We can compare the validation MSE for the different models (e.g. with different numbers of hidden layers and neurons) and select the model with the smallest validation MSE.
The model structural and learning elements (such as the number of hidden layers, the number of neurons, the optimizer, and the number of epochs) are referred to as hyperparameters. Their selection using the validation MSE is referred to as hyperparameter tuning.

3. The selected model is then applied to a test set, not used in training or selecting the model. The test set is used only once, at the end of the selection process. The aim is to get a measure of how the selected model will perform on future data that have not yet been observed. For example, if we train a neural net using stock market data for the past 6 months, we would like to see how well the model can predict the next month. Evaluating the model on the test set gives us an estimate of the model prediction error. The test MSE is
$$MSE(\text{test}) = \frac{1}{R}\sum_{i=1}^R (\hat y_i - y_i)^2$$
for the selected model predictions $\hat y_i$ based on the test set inputs $x_i$ and observed test set data on $y_i$, where $R$ is the test set size.

The terms "validation" and "test" are commonly referred to as the hold-out (i.e. any set not used in training). If we only train one given model (e.g. 3 hidden layers, 10 neurons each) without further model selection, then we do not need a validation set, and the complete data set is split only into the training set and the test set. Similarly, if we do not plan to estimate the model prediction error, then we do not need to set aside a test set, and only split the complete data set into the training set and the validation set.

28.4.1 Random-split Cross-Validation

A basic way of obtaining a validation set is to split the complete data set randomly into the training set (blue) and the validation set (beige), typically of about equal size, as shown in the figure below. This approach is referred to as random-split cross-validation (CV). A drawback of this approach is that the model prediction error can be highly variable, depending on which observations are included in the training set and which observations are included in the validation set.
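A random-split CV run can be sketched with OLS standing in for a trained network (a simplification; the data-generating process and seed are illustrative assumptions):

```python
# Hedged sketch: random-split cross-validation with a simple linear model.
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)   # illustrative DGP

idx = rng.permutation(n)                  # random split into two halves
train, valid = idx[: n // 2], idx[n // 2 :]

# "Train" on the training half only
beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]

# Evaluate the loss on the held-out validation half
resid = y[valid] - X[valid] @ beta
validation_mse = np.mean(resid**2)
```

Repeating this with a different random split changes `validation_mse`, which is exactly the variability drawback noted above.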
28.4.2 Leave-one-out Cross-Validation

Leave-one-out cross-validation places only one observation in the validation set, while the remainder of the data forms the training set. This is repeated with each observation chosen for the validation set in turn. Each time, a prediction error is obtained for the validation observation $i$:
$$e_i = y_i - \hat y_i$$
For a complete data set of size $n$, the leave-one-out cross-validation MSE is then obtained as
$$MSE(n) = \frac{1}{n}\sum_{i=1}^n e_i^2$$
The whole procedure requires training the given ANN model $n$ times. This approach is more stable than random-split CV, but can be very time-consuming for large samples and complex models. In the Figure below, the leave-one-out cross-validation training set is displayed in blue and the validation set in beige.

28.4.3 k-fold Cross-Validation

k-fold cross-validation involves randomly dividing the set of observations into $k$ groups, or "folds", of approximately equal size. In turn, each fold $j$ forms the validation set while the remaining $k - 1$ folds form the training set, and $MSE(j)$ is obtained for the validation observations for each $j = 1, \ldots, k$. The k-fold cross-validation MSE is then obtained as
$$CV(k) = \frac{1}{k}\sum_{j=1}^k MSE(j)$$
Leave-one-out cross-validation is a special case with $k = n$. For $k < n$, fitting the model $k$ times can result in substantial savings in computation time. In the Figure below, the k-fold cross-validation training set is displayed in blue and the validation set in beige.

28.5 Model Fit

We can broadly distinguish three different categories of ANN models:
1. Overfit model: a model that is very complex and flexible, with many hidden layers and neurons, which adapts very well to the training data set (low training MSE) but performs poorly in the validation set (high validation MSE).

2. Underfit model: a model with minimal flexibility, with a small number of hidden layers and neurons, which adapts poorly to the training sample (high training MSE) and also performs poorly in the validation set (high validation MSE).

3. Good-fit model: a model that suitably learns the training data set (relatively low training MSE) and performs suitably well in the validation data (relatively low validation MSE).

Model fit can be considered in the context of the bias-variance trade-off. An underfit model has high bias and low variance: regardless of the specific samples in the training data, it cannot learn the problem. An overfit model has low bias and high variance: the model learns the training data too well, and performance varies widely with new unseen examples or even with statistical noise added to examples in the training data set. A good-fit model balances the bias and the variance.

28.6 Recurrent Neural Networks

Recall the structure of Feedforward Neural Networks (FFNs). In the following Figure, the earlier FFN graph is turned counter-clockwise to facilitate the transition to recurrent NNs.

In general, FFNs do not account for sequential dependence in the hidden layer structure. As such, FFNs are typically used for cross-sectional data with independent inputs.

Recurrent Neural Networks (RNNs) are designed specifically for sequential data. In RNNs, connections between nodes form a graph along a temporal sequence, which allows RNNs to exhibit dynamic behavior.
Typical uses of RNNs include:
- Speech processing (topic classification, sentiment analysis, language translation)
- Handwriting recognition
- Time series modelling and forecasting

In RNNs, the output at the current time step depends on the current input as well as on the previous state via recurrent edges (feedback loops). We can "unroll" the RNN scheme in time to obtain a directed graph representation.

The input is a sequence of vectors $\{X_t\}_{t=1}^L$. Each $X_t$ feeds into the hidden layer, which also takes as input the hidden state vector $A_{t-1}$ from the previous element in the sequence, and produces the current hidden state vector $A_t$. The output layer produces a sequence of predictions $O_t$. As the network proceeds from $t = 1$ to $t = L$, the hidden nodes (or states) accumulate a history of information used for prediction.

Suppose each vector $X_t$ of the input sequence has $p$ components, $X_t = (X_{t1}, X_{t2}, \ldots, X_{tp})'$. The hidden layer consists of $K$ hidden neurons, $A_t = (A_{t1}, A_{t2}, \ldots, A_{tK})'$. The weights of the input layer $w_{kj}$ are collected into a matrix $\mathbf{W}$ of dimension $K \times (p+1)$, which includes the intercepts (or "biases"). The weights of the hidden-to-hidden layer $u_{ks}$ are collected into a matrix $\mathbf{U}$ of dimension $K \times K$. The weights for the output layer $\beta_k$ are collected into a vector $\mathbf{B}$ of dimension $K + 1$, which includes the intercept.

The RNN model for the hidden layer is then given by
$$A_{tk} = g\left(w_{k0} + \sum_{j=1}^p w_{kj}\, X_{tj} + \sum_{s=1}^K u_{ks}\, A_{t-1,s}\right)$$
where $g(\cdot)$ is an activation function. The RNN model for the output $O_t$ is given by
$$O_t = \beta_0 + \sum_{k=1}^K \beta_k\, A_{tk}$$
The same sets of weights $\mathbf{W}$, $\mathbf{U}$ and $\mathbf{B}$ are used as each element of the sequence is processed, i.e. they are not functions of $t$.

There are many variations and enhancements of the base-case RNN design introduced above.
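The hidden-state recursion and output equation above can be sketched as an unrolled forward pass; the dimensions and weights below are illustrative placeholders, not trained values:

```python
# Hedged sketch: unrolled forward pass of the base-case RNN, with p = 2
# inputs, K = 3 hidden states, tanh activation g, and random weights.
import numpy as np

rng = np.random.default_rng(7)
L, p, K = 5, 2, 3
X = rng.normal(size=(L, p))             # input sequence {X_t}

W = rng.normal(size=(K, p + 1)) * 0.5   # input weights, incl. biases w_{k0}
U = rng.normal(size=(K, K)) * 0.5       # hidden-to-hidden weights u_{ks}
B = rng.normal(size=K + 1) * 0.5        # output weights, incl. intercept

A = np.zeros(K)                         # initial hidden state A_0
O = np.empty(L)
for t in range(L):
    # A_tk = g(w_k0 + sum_j w_kj X_tj + sum_s u_ks A_{t-1,s})
    A = np.tanh(W[:, 0] + W[:, 1:] @ X[t] + U @ A)
    O[t] = B[0] + B[1:] @ A             # O_t = beta_0 + sum_k beta_k A_tk
```

Note that the same $\mathbf{W}$, $\mathbf{U}$, $\mathbf{B}$ are reused at every $t$: only the state vector `A` carries history forward.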
Examples include:
- the GRU and LSTM node designs that we will briefly cover below;
- bidirectional RNNs, which obtain information from past (backward) and future (forward) states simultaneously, used in language translation;
- the recurrent multilayer perceptron (RMLP) network, consisting of cascaded subnetworks, each of which contains multiple layers of nodes, used in image recognition.

28.6.1 Training RNNs

RNNs can in principle be trained like FFNs using backpropagation. However, when forecasting output at time $T + 1$, in the chain rule for the derivative of the loss function at time $T$ the activation function $g$ needs to be applied $t + 1$ times for weight adjustment at time $T - t$. This leads to the problem of a vanishing gradient for $|g'| < 1$, or an exploding gradient for $|g'| > 1$. The problem can make base-case RNN backpropagation infeasible for large $T$.

Long Short-Term Memory networks (LSTMs) and Gated Recurrent Unit networks (GRUs) were created to address the training problem. They have internal mechanisms called gates that can regulate the flow of information. The gates can learn which data in a sequence are important to keep or delete. In this way, only the relevant information is passed down the long chain of sequences. Almost all state-of-the-art results based on RNNs are achieved with the LSTM or GRU node structure. An excellent animated guide is available under this link.

Reference: sections 2.2, 5.1, and 10.5 of James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021)

Part IV: References

Balakrishnan, N. and V. B. Nevzorov (2003) A Primer on Statistical Distributions, Wiley. Available online via U of T libraries.

Cameron, C. A. and Trivedi, P. K. (2005) Microeconometrics: Methods and Applications, Cambridge University Press. Available online via U of T libraries.

Aliprantis, C. D. and K. C. Border (2006) Infinite Dimensional Analysis: A Hitchhiker's Guide, 3rd ed., Springer. Available online via U of T libraries.

Geweke, J., G.
Koop, and H. van Dijk (2011) The Oxford Handbook of Bayesian Econometrics, Oxford University Press. Available online.

Greenberg, E. (2012) Introduction to Bayesian Econometrics, Cambridge University Press. Available online via U of T libraries.

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021) An Introduction to Statistical Learning, 2nd ed., Springer Texts in Statistics. Available online.

Judd, K. L. (1998) Numerical Methods in Economics, MIT Press. Available online via U of T libraries.

Miranda, M. and Fackler, P. (2002) Applied Computational Economics and Finance, MIT Press.

Ok, E. A. (2007) Real Analysis with Economic Applications, Princeton University Press. Available online via U of T libraries. Errata are available at the book's website.

Ruppert, D. and Matteson, D. S. (2015) Statistics and Data Analysis for Financial Engineering: with R Examples, chapter 20: Bayesian Data Analysis and MCMC, Springer. Available online via U of T libraries.

Simon, C. P. and L. Blume (1994) Mathematics for Economists, W. W. Norton.