Investments Lecture 0: Mathematical Review

1. (Set and Function.) A well-defined collection of objects (called elements) is a set. A set of sets is usually referred to as a collection, and a set of collections is usually referred to as a family. A set that contains nothing is called the empty set and is denoted ∅. A set that contains exactly one element is a singleton. If two sets A and B are such that B contains each and every element of A, then we say that A is a subset of B and B a superset of A, and we write A ⊂ B; if furthermore B contains an element y which is not contained in A, then A is a proper subset of B. Two sets A and B are equal if A ⊂ B and B ⊂ A. If one has a universe set X in mind (so that all sets under consideration are subsets of X), then the complement of A ⊂ X is the set containing exactly those x contained by X but not by A. The complement of A is denoted by A^c. Denote by 2^X the collection containing all subsets of X, which will be referred to as the power set of X.

2. Given two sets A and B, an assigning rule f is a (single-valued) function from A into B if ∀a ∈ A, there is exactly one b ∈ B such that f(a) = b. In this case, we write f : A → B, and A and B are called the domain and range of f respectively. If A is itself a collection of sets, then we say that f is a set function. If instead B is a collection of sets, then f is a multi-valued function (or a correspondence) [1]. The function f : A → B is surjective if B ⊂ f(A), injective if x, y ∈ A, x ≠ y ⇒ f(x) ≠ f(y), and bijective if it is both surjective and injective. A bijective function is also called a one-to-one correspondence. Given f : A → B, and given any C ⊂ A and D ⊂ B, the sets f(C) = {f(a) : a ∈ C} and f^{-1}(D) = {a ∈ A : f(a) ∈ D} are called respectively the image of C and the pre-image of D under f [2].

[Footnote 1: In general we would write f : A → ⋃_{b∈B} b for the correspondence instead.]

[Footnote 2: Let A = B be the set of real numbers. Let f(x) = x^2 + 1, C = (−1, 0], and D = [0, 1/2]. What is f^{-1}(D)? What is f^{-1}(f(C))? Is the latter a subset or a superset of C? (A numerical check appears after item 6.)]

3. A bijective function f : A → B has an inverse function f^{-1} : B → A which assigns to each b ∈ B "the" a ∈ A that satisfies f(a) = b. (Depending on the context, the notation f^{-1} should not create confusion with the pre-image operator.)

4. Two sets A and B are of the same cardinality if we can find a one-to-one correspondence between them. A set A is finite if it is empty or if there is a one-to-one correspondence between A and the set {1, 2, ..., n} for some n ∈ Z_+ (in the latter case we say that the cardinality of A, denoted |A| or #(A), is n), where Z_+ stands for the set of strictly positive integers [3]. A set A is countably infinite (denumerable) if there is a one-to-one correspondence between A and Z_+. A set which is neither finite nor countably infinite is uncountable. A set is countable if it is not uncountable [4].

[Footnote 3: The notation Z_+ is taken from James R. Munkres, 1975, Topology: A First Course, New Jersey: Prentice-Hall. It represents the set of natural numbers. We shall interchangeably use the notation N to stand for the same set, as in Kai Lai Chung, 1974, A Course in Probability Theory.]

[Footnote 4: The following three statements are apparently equivalent: (i) A is countable; (ii) there is a surjective function f : Z_+ → A; (iii) there is an injective function g : A → Z_+. Because of (ii), a subset of a countable set is apparently countable. Because of (iii), A = Z_+^2 is countable: simply define g(n, m) = 2^n 3^m. Now since the set of positive rationals can be put in one-to-one correspondence with a subset of Z_+^2 (map p/q in lowest terms to (p, q)), the former is countable. The following three statements are also equivalent, but less apparent: (a) A is infinite; (b) there exists an injective function f : Z_+ → A; (c) there exists a bijective function of A with a proper subset of itself.]

5. Given an indexed family of sets {A_b ; b ∈ β}, where β is an arbitrary non-empty index set, the Cartesian product of these sets, denoted by Π_{b∈β} A_b, is the set containing each and every function f : β → ⋃_{b∈β} A_b satisfying, for all b ∈ β, f(b) ∈ A_b [5]. The union of the sets {A_b ; b ∈ β}, denoted ⋃_{b∈β} A_b, contains exactly those x contained by some A_b. The intersection of these sets, denoted ⋂_{b∈β} A_b, contains exactly those x contained by all the A_b's. Two sets are disjoint if they have an empty intersection.

[Footnote 5: For example, suppose that β = {1, 2}, A_1 = {x, y}, A_2 = {a, b, c}. Then Π_{b∈β} A_b = A_1 × A_2 = {(x, a), (x, b), (x, c), (y, a), (y, b), (y, c)}, where, for example, (x, a) is shorthand notation for the function f with f(1) = x and f(2) = a. Similarly, if instead β = {2, 1}, then Π_{b∈β} A_b = A_2 × A_1 = {(a, x), (a, y), (b, x), (b, y), (c, x), (c, y)}.]

6. A collection {A_b ; b ∈ β} of subsets of X is called a partition of X if ⋃_{b∈β} A_b = X and ∀a, b ∈ β, a ≠ b ⇒ A_a ∩ A_b = ∅.
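The image/pre-image exercise in Footnote 2 can be spot-checked numerically. The following minimal Python sketch discretizes the real line to a finite grid (an illustration, not a proof; the grid bounds and step are arbitrary choices):

    # Footnote 2's exercise on a finite grid: f(x) = x^2 + 1, C = (-1, 0], D = [0, 1/2].
    def f(x):
        return x * x + 1

    grid = [k / 1000 for k in range(-2000, 2001)]          # grid points in [-2, 2]
    C = [x for x in grid if -1 < x <= 0]

    pre_D = [x for x in grid if 0 <= f(x) <= 0.5]          # f^{-1}(D) on the grid
    img_C = {f(x) for x in C}                              # f(C) on the grid
    pre_img_C = [x for x in grid if f(x) in img_C]         # f^{-1}(f(C)) on the grid

    print(pre_D)                            # []: f(x) = x^2 + 1 >= 1, so f^{-1}([0, 1/2]) is empty
    print(min(pre_img_C), max(pre_img_C))   # roughly -1 and 1: a strict superset of C = (-1, 0]

The float membership test works here because f(−x) and f(x) are computed identically; the grid only illustrates the set-theoretic answers f^{-1}(D) = ∅ and f^{-1}(f(C)) = (−1, 1) ⊃ C.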
7. De Morgan's laws say that (a numerical check appears after item 13)

    [⋃_{b∈β} A_b]^c = ⋂_{b∈β} (A_b)^c,    [⋂_{b∈β} A_b]^c = ⋃_{b∈β} (A_b)^c.    (1)

8. The following facts are well known. A countable union or a finite product of countable sets is countable. An infinite product of countable sets need not be countable; in particular, the power set of Z_+ is uncountable.

9. (Linear Space or Vector Space.)

10. A set V is a linear space over ℝ, where ℝ is the set of real numbers, if it is equipped with and closed under the operations of vector addition and scalar multiplication, which satisfy a number of well-defined operational (such as commutative and associative) laws. Elements of V are called vectors. Given a finite number of elements x_1, x_2, ..., x_n in V, an element y ∈ V is a linear combination of these elements if there exist real numbers a_1, a_2, ..., a_n such that y = Σ_{i=1}^n a_i x_i. A subset B of V is said to span V if every vector in V is a linear combination of elements of B. If no proper subset of B spans the set of all linear combinations of elements of B, then B is linearly independent, and in this case B is a basis for V if B spans V. The cardinality of (the number of elements in) a basis (any basis) is called the dimension of the linear space V. V may be finite-dimensional or infinite-dimensional, depending on whether the cardinality of B is finite or infinite.

11. Let V be the set of continuous functions f : [0, 1] → ℝ. For any f, g ∈ V, define f + g as the function h : [0, 1] → ℝ such that h(x) = f(x) + g(x) for all x ∈ [0, 1]. For any f ∈ V and any r ∈ ℝ, define rf as the function g : [0, 1] → ℝ such that g(x) = r·f(x) for all x ∈ [0, 1]. Then, clearly, for all f, g ∈ V and for all r ∈ ℝ, f + g ∈ V and rf ∈ V. Thus V is an infinite-dimensional real linear space.

12. A subset A of the linear space V is a convex set if for all x, y ∈ A and for all real numbers λ ∈ [0, 1], λx + (1 − λ)y ∈ A. The intersection of a collection of convex sets is itself convex.

13. (Metric Space and Topology.) Given any non-empty set X, a function d : X × X → ℝ_+ (where ℝ_+ denotes the set of non-negative real numbers) is a metric if
(i) d(x, y) = 0 ⇔ x = y, ∀x, y ∈ X;
(ii) d(x, y) = d(y, x) ≥ 0, ∀x, y ∈ X;
(iii) d(x, y) + d(y, z) ≥ d(x, z), ∀x, y, z ∈ X.
The pair (X, d) is then a metric space.
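Here is a minimal Python check of item 7's laws on randomly drawn finite sets; the universe size and family sizes are arbitrary choices:

    import random
    # De Morgan's laws (1) on a random finite family of subsets of X.
    random.seed(0)
    X = set(range(50))                                               # the universe
    family = [set(random.sample(sorted(X), 20)) for _ in range(5)]   # indexed family {A_b}

    union = set().union(*family)
    intersection = set.intersection(*family)
    comp = lambda S: X - S                                           # complement within X

    assert comp(union) == set.intersection(*[comp(A) for A in family])
    assert comp(intersection) == set().union(*[comp(A) for A in family])
    print("De Morgan's laws hold on this family")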
For example, let X = ℝ, the set of real numbers, and define for all x, y ∈ X, d_0(x, y) = 1 if x ≠ y and d_0(x, y) = 0 otherwise. Also define d(x, y) = |x − y| for all x, y ∈ X. Then (X, d) and (X, d_0) are both metric spaces, but they are different metric spaces.

14. A function x : Z_+ → X is called a sequence in (X, d). It is said to converge to z ∈ X if for all real ε > 0 there exists n(ε) ∈ Z_+ such that n ≥ n(ε) ⇒ d(x(n), z) < ε. A sequence is Cauchy if for all ε > 0 there exists n(ε) ∈ Z_+ such that m, n ≥ n(ε) ⇒ d(x(m), x(n)) < ε. (A numerical illustration contrasting d and d_0 appears after item 19.)

15. Given two metric spaces (X, d) and (Y, d'), a function f : X → Y is continuous if {f(x(n))} is a convergent sequence in (Y, d') whenever {x(n)} is a convergent sequence in (X, d).

16. Given a metric space (X, d), for all z ∈ X and ε ∈ ℝ_++ (where ℝ_++ denotes the set of strictly positive real numbers), the subsets B(z, ε) = {x ∈ X : d(x, z) < ε} are called open balls in X. A subset G of X is open if it can be represented as a (perhaps empty) union of open balls. A subset F of X is closed if F^c is open. The closure of G ⊂ X is the smallest (with respect to the order relation "inclusion" on 2^X) closed set containing G, and the interior of F is the largest open set contained in F. A point x ∈ X is a limit point of A ⊂ X if for all ε > 0, B(x, ε) ∩ A ∩ {x}^c ≠ ∅. The set of limit points of A, denoted A', is the derived set of A. A point x ∈ A is an isolated point if for some ε > 0, B(x, ε) ∩ A ∩ {x}^c = ∅. A closed subset A of X is perfect if it contains no isolated points. The boundary of A is the intersection of the closure of A and the closure of A^c in (X, d). The collection τ of all the open sets on X is called a topology on X. The pair (X, τ) is called a topological space.

17. A set A contained in the metric space (X, d) is bounded if for some x ∈ X and ε ∈ ℝ_++, A ⊂ B(x, ε).

18. Take the normed linear space (ℝ, ‖·‖) for example, where for all x ∈ ℝ the norm (the length) of the vector x, denoted ‖x‖, is ‖x‖ = |x|. We can define a metric from the norm by d(x, y) = ‖x − y‖ = |x − y|, which generates the so-called standard topology on ℝ. In this metric space (ℝ, d), an open ball is an open interval, and an open set is a countable union of pair-wise disjoint open intervals. (Why countable, and why pair-wise disjoint?)

• Let A = {x} for some x ∈ ℝ. Is x a limit point of A? Is x an isolated point of A? Is A a closed set?

• Suppose that {x_n ; n ∈ Z_+} is a sequence in ℝ which converges to x_0 ∈ ℝ. Let X contain exactly the x_n's in the convergent sequence. Is x_0 a limit point of X? If this is not always true, then offer a condition that ensures this conclusion.

19. Let (X, τ) be a topological space. A subset γ of τ is an open covering of X if X ⊂ ⋃_{U∈γ} U. X is compact if every open covering of X contains a finite sub-covering of X; that is, if for each open covering γ of X, there exists a finite number of member sets U_1, U_2, ..., U_n ∈ γ such that X ⊂ ⋃_{i=1}^n U_i. Note that ℝ with the standard topology is not compact, for the open covering {(n, n + 2) : n ∈ Z} does not have a finite sub-covering of ℝ. (We shall see that this is because ℝ is not bounded.) Similarly, the interval (0, 1) is not a compact subset of ℝ under the standard topology, for the collection {(1/(n + 2), 1/n) : n ∈ Z_+} is an open covering of (0, 1) which does not have a finite sub-covering. (We shall see that this is because (0, 1) is not closed.) It can be shown that the compact subsets of ℝ under the standard topology are exactly those subsets which are both bounded and closed.
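The following Python sketch illustrates item 14 with the two metrics defined above: x_n = 1/n is Cauchy under d(x, y) = |x − y| but not under the discrete metric d_0, since d_0(x_m, x_n) = 1 whenever x_m ≠ x_n. The cutoff 2000 and the tail length are arbitrary choices:

    def d(x, y):  return abs(x - y)
    def d0(x, y): return 0.0 if x == y else 1.0

    x = [1.0 / n for n in range(1, 2001)]
    tail = x[-100:]                                    # terms with large index n
    print(max(d(a, b) for a in tail for b in tail))    # tiny: the tail bunches up under d
    print(max(d0(a, b) for a in tail for b in tail))   # 1.0: the tail never settles under d0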
20. The following statements are true.

• A closed subset of a compact space is compact.

• A (finite or infinite) Cartesian product of compact spaces is compact.

• Consider the metric space (ℝ^n, d), where the Euclidean metric d is such that for x, y ∈ ℝ^n, with x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n), we have d(x, y) = √(Σ_{i=1}^n (x_i − y_i)^2). Then under the topology generated by d, A ⊂ ℝ^n is compact if and only if it is both closed and bounded.

• Is the first orthant in ℝ^2 closed? Is it bounded?

• Given two topological spaces (X, τ_x) and (Y, τ_y) and a continuous function f : (X, τ_x) → (Y, τ_y), f(A) is compact in (Y, τ_y) whenever A is compact in (X, τ_x). That is, a continuous image of a compact space is compact.

21. (Real Sequence and Real Function.) If A = Z_+ and B = ℝ, then (the image of) f : A → B is called a real (real number) sequence, and we write f(n) as f_n for all n ∈ Z_+. If h : Z_+ → Z_+ is increasing and x : Z_+ → ℝ, then {x(h(n))} is called a subsequence of {x(n)}.

22. We say that l ∈ ℝ is a limit of the sequence {x_n} if for all ε > 0 there is an N(ε) ∈ Z_+ such that n ∈ Z_+, n ≥ N(ε) ⇒ |x_n − l| < ε. In this case we say that {x_n} converges to l and call {x_n} a convergent sequence. The limit of a sequence so defined is unique.

23. A set A ⊂ ℝ is bounded above by u ∈ ℝ if for all x ∈ A, x ≤ u. In this case, u is called an upper bound of the set A. The least upper bound of A, if it exists, is denoted by sup A. Lower bounds are similarly defined. The greatest lower bound of A, if it exists, is denoted by inf A. The set A is bounded if it has both an upper and a lower bound in ℝ.

24. Given a sequence {x_n} in ℝ, define its limit inferior as lim inf x_n = sup_n inf_{k≥n} x_k and its limit superior as lim sup x_n = inf_n sup_{k≥n} x_k. These will be shown to always exist in ℝ ∪ {−∞, +∞}, and we can show that they are equal if and only if the real sequence {x_n} converges.

25. A set I ⊂ ℝ is an interval if it is not a singleton (a one-point set) and if x, y ∈ I and x < z < y together imply z ∈ I.

26. A function g : I → ℝ, where I is some interval, is continuous if for all x ∈ I and all {x_n} converging to x in I, {g(x_n)} is a sequence converging to g(x). The derivative of g(·) at x ∈ I, if it exists (i.e., if it is a finite number), is the (common) limit of (g(x_n) − g(x))/(x_n − x) for all {x_n} converging to x with x_n ≠ x. If the derivative of g is defined for all x ∈ I, written g'(x), we say that g is differentiable on I. In this case, g'(·) is itself a mapping from I into ℝ. If g' is a continuous function, we say that g is continuously differentiable on I.

27. (Axiom of Continuity.) If the real sequence {x_n} is increasing (i.e., x_{n+1} ≥ x_n for all n ∈ Z_+), and if x_n ≤ M ∈ ℝ for all n ∈ Z_+, then {x_n} converges to some l ≤ M, and x_n ≤ l for all n ∈ Z_+.

28. (Nested Interval Theorem.) Suppose that {[a_n, b_n]; n ∈ Z_+} is a sequence of closed intervals such that for all n, [a_{n+1}, b_{n+1}] ⊂ [a_n, b_n], and lim_{n→+∞}(b_n − a_n) = 0. Then ⋂_{n∈Z_+} [a_n, b_n] is a singleton. (A numerical illustration appears after item 30.)

29. (Least Upper Bound Property of ℝ.) Suppose that A is a non-empty subset of ℝ, and A is bounded above by some u ∈ ℝ. Then sup A exists. Similarly, if A is non-empty and bounded below, then inf A exists.

30. Define the extended real line ℝ̄ ≡ ℝ ∪ {+∞, −∞}. Define sup ∅ = −∞ and inf ∅ = +∞. If A ⊂ ℝ is unbounded above (there exists no upper bound for A in ℝ), then define sup A = +∞.
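A short Python sketch of the Nested Interval Theorem (item 28): bisecting [1, 2] on the sign of x^2 − 2 produces nested closed intervals whose lengths go to 0, and the single point in the intersection is √2. The interval and iteration count are arbitrary choices:

    a, b = 1.0, 2.0
    for _ in range(50):
        m = (a + b) / 2
        if m * m - 2 > 0:
            b = m          # the root lies in [a, m]; keep that half
        else:
            a = m          # the root lies in [m, b]; keep that half
    print(a, b)            # both approximately 1.41421356..., and b - a is negligible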
If A ⊂ ℝ is unbounded below (there exists no lower bound for A in ℝ), then define inf A = −∞. In this way, all subsets of ℝ have well-defined suprema and infima in ℝ̄.

31. Suppose that {x_n ; n ∈ Z_+} is a real sequence. Then its limit inferior

    lim inf x_n ≡ sup_{n∈Z_+} inf_{k≥n} x_k

and its limit superior

    lim sup x_n ≡ inf_{n∈Z_+} sup_{k≥n} x_k

both exist in the extended real line ℝ̄.

32. For any two real sequences {x_n} and {y_n}, we have

    lim inf x_n + lim inf y_n ≤ lim inf(x_n + y_n) ≤ lim inf x_n + lim sup y_n ≤ lim sup(x_n + y_n) ≤ lim sup x_n + lim sup y_n.

33. (Bolzano–Weierstrass Theorem.) Every bounded sequence in ℝ contains a convergent subsequence.

34. A function f : A ⊂ ℝ → ℝ is continuous at x ∈ A if for all ε > 0 there exists δ(x, ε) > 0 such that ∀y ∈ A, |x − y| < δ(x, ε) ⇒ |f(x) − f(y)| < ε. We say f is a continuous function on A if it is continuous at x for all x ∈ A; it is uniformly continuous on A if, in addition, δ can be chosen independently of x. We can show that if f : [a, b] → ℝ is continuous on [a, b], then it is uniformly continuous on [a, b].

35. (Intermediate Value Theorem.) If f : I → ℝ is continuous, where I is an interval, then f(I) is either a singleton or an interval.

36. (Extreme Value Theorem.) If f : [a, b] → ℝ is continuous, then f([a, b]) is either a singleton or a bounded closed interval. In particular, there exist m, M ∈ [a, b] such that for all x ∈ [a, b], f(m) ≤ f(x) ≤ f(M).

37. (Concave Maximization.)

38. A (real-valued) symmetric matrix A_{n×n} is positive definite (PD) if for all x_{n×1} ≠ 0_{n×1} we have x'Ax > 0, where x' denotes the transpose of x. A symmetric matrix A_{n×n} is negative definite (ND) if −A is positive definite. A symmetric matrix A_{n×n} is positive semi-definite (PSD) if for all x_{n×1} ∈ ℝ^n we have x'Ax ≥ 0. A symmetric matrix A_{n×n} is negative semi-definite (NSD) if −A is positive semi-definite.

39. Consider a twice differentiable function f : ℝ^n → ℝ. Let Df : ℝ^n → ℝ^n be the vector function

    Df = (∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_n)',

which will be referred to as the gradient of f. Let D^2 f : ℝ^n → ℝ^{n×n} be the matrix function whose (i, j)-th entry is ∂^2 f/∂x_i ∂x_j:

    D^2 f = [ ∂^2 f/∂x_1∂x_1   ∂^2 f/∂x_1∂x_2   ...   ∂^2 f/∂x_1∂x_n
              ∂^2 f/∂x_2∂x_1   ∂^2 f/∂x_2∂x_2   ...   ∂^2 f/∂x_2∂x_n
              ...
              ∂^2 f/∂x_n∂x_1   ∂^2 f/∂x_n∂x_2   ...   ∂^2 f/∂x_n∂x_n ],

which will be referred to as the Hessian of f.

40. A function f : ℝ^n → ℝ is concave if for all x, y ∈ ℝ^n and all λ ∈ [0, 1], f(λx + (1 − λ)y) ≥ λf(x) + (1 − λ)f(y). A concave function is said to be strictly concave if the above defining inequality is strict whenever x ≠ y and λ ∈ (0, 1). A function f is (strictly) convex if −f is (strictly) concave. A function is affine if it is both concave and convex. An affine function is linear if it passes through the origin; that is, if f(0_{n×1}) = 0. Note that by definition, f is concave if f is strictly concave.

Theorem 1. A twice differentiable function f : U ⊂ ℝ^n → ℝ, where U is an open convex subset of ℝ^n, is concave if and only if D^2 f is a negative semi-definite matrix at each and every x ∈ U; and f is strictly concave if D^2 f is a negative definite matrix at each and every x ∈ U [7].

Suppose that n = 1 in the above theorem. Then f'' ≤ 0 if f is concave, and if x̃ is a random variable with finite expectation, then we have f(E[x̃]) ≥ E[f(x̃)]. (This is the so-called Jensen's inequality.)

[Footnote 7: If f(x) = −x^4, then f''(0) = 0 (so that the Hessian of f at zero is negative semi-definite but not negative definite), but f is strictly concave. Hence the "only if" part for strict concavity in general does not hold. However, if f : U ⊂ ℝ → ℝ is twice continuously differentiable and concave (so that U becomes an open interval), and if f'' is not constantly zero over any subinterval of U, then f is strictly concave; see the following link: people.exeter.ac.uk/dgbalken/BME05/LectTwo.pdf.]
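The Jensen's inequality remark under Theorem 1 is easy to illustrate by simulation. A minimal Python sketch, assuming an arbitrary positive random variable (a lognormal draw) and the concave function f(x) = √x:

    import random, math
    # Monte Carlo check of Jensen's inequality: f(E[x]) >= E[f(x)] for concave f.
    random.seed(3)
    draws = [math.exp(random.gauss(0.0, 1.0)) for _ in range(200_000)]
    mean = sum(draws) / len(draws)                        # sample E[x]
    mean_f = sum(math.sqrt(d) for d in draws) / len(draws)  # sample E[f(x)]
    print(math.sqrt(mean), mean_f, math.sqrt(mean) >= mean_f)   # prints ... True

For a strictly concave f such as the square root, the gap f(E[x̃]) − E[f(x̃)] is strictly positive unless x̃ is essentially constant.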
41. Exercise 1. Define the function f(x, y) = x^a y^{1−a}, ∀x, y ∈ (0, +∞), where a is a constant with 0 < a < 1. Clearly f : (0, +∞) × (0, +∞) → ℝ is twice continuously differentiable. Let us verify that f is concave. Note that

    D^2 f = a(1 − a) x^{a−2} y^{−1−a} [ −y^2   xy
                                         xy   −x^2 ].

Hence for any (h, k)' ∈ ℝ^2, we have

    [h k] D^2 f [h k]' = a(1 − a) x^{a−2} y^{−1−a} (−h^2 y^2 + 2hkxy − k^2 x^2) = −a(1 − a) x^{a−2} y^{−1−a} (hy − kx)^2 ≤ 0.

This proves that D^2 f is negative semi-definite for all x, y ∈ (0, +∞), and hence f(·,·) is concave. (Note that the quadratic form vanishes whenever hy = kx, so D^2 f is not negative definite; indeed f is homogeneous of degree one, hence linear along rays from the origin, and therefore concave but not strictly concave.)

Exercise 2. Define the function f(x) = √x, ∀x > 0. Clearly f : (0, +∞) → ℝ is twice continuously differentiable. Note that

    D^2 f = d^2 f/dx^2 = −1/(4x^{3/2}),

so that for all k ∈ ℝ we have k × D^2 f × k = −k^2/(4x^{3/2}) ≤ 0, and the last inequality is strict unless k = 0. This proves that D^2 f is negative definite for all x > 0, and hence f(·) is strictly concave.

Exercise 3. Define the function f(x, y) = g(x) + h(y), where g, h : ℝ → ℝ are twice-differentiable functions with g'' < 0 and h'' < 0 everywhere. (By Footnote 7, strict concavity of g and h alone would not guarantee g'' < 0 and h'' < 0 everywhere.) Then f(·,·) is a strictly concave function also. Indeed, we have in this case

    D^2 f = [ g''(x)   0
              0        h''(y) ],

so that given any (a, b)' ∈ ℝ^2, we have

    [a b] D^2 f [a b]' = a^2 g''(x) + b^2 h''(y) ≤ 0,

and the last inequality holds strictly except in the case where a = b = 0. This proves that D^2 f is negative definite, and hence f(·,·) is strictly concave. The preceding example shows that a twice-differentiable additively separable function f(x_1, x_2, ..., x_n) = f_1(x_1) + f_2(x_2) + ... + f_n(x_n) is strictly concave if for all j, f_j : ℝ → ℝ is twice differentiable with f_j'' < 0 everywhere.

Theorem 2. Suppose that f : ℝ^n → ℝ is twice differentiable and concave. Then for all x, a ∈ ℝ^n, we have [8]

    f(x) − f(a) ≤ Df(a)'(x − a).

The above inequality becomes strict if the Hessian of f is everywhere negative definite and x ≠ a.

[Footnote 8: The reader should draw a graph of these inequalities for the case n = 1.]

Proof. Recall the Theorem of Taylor Expansion with a Remainder: if f : ℝ^n → ℝ is twice continuously differentiable, then for each x, a ∈ ℝ^n, there exists some y lying on the line segment connecting x and a such that

    f(x) = f(a) + Df(a)'(x − a) + (1/2)(x − a)'D^2 f(y)(x − a).

Now suppose that f is concave, so that D^2 f(y) is negative semi-definite. Then we have

    (1/2)(x − a)'D^2 f(y)(x − a) ≤ 0,

so that f(x) − f(a) ≤ Df(a)'(x − a). If D^2 f(y) is negative definite, then given x ≠ a, we have

    (1/2)(x − a)'D^2 f(y)(x − a) < 0,

implying that f(x) − f(a) < Df(a)'(x − a). Q.E.D.
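Theorem 2's gradient inequality can be spot-checked numerically. A minimal Python sketch, assuming the concave test function f(x_1, x_2) = log x_1 + log x_2 on the positive orthant (an arbitrary choice) and random draws:

    import math, random
    # Random spot-check of f(x) - f(a) <= Df(a)'(x - a) for a concave f.
    random.seed(1)
    f  = lambda p: math.log(p[0]) + math.log(p[1])
    Df = lambda p: (1 / p[0], 1 / p[1])          # gradient of f

    for _ in range(1000):
        a = (random.uniform(0.1, 5), random.uniform(0.1, 5))
        x = (random.uniform(0.1, 5), random.uniform(0.1, 5))
        lhs = f(x) - f(a)
        rhs = Df(a)[0] * (x[0] - a[0]) + Df(a)[1] * (x[1] - a[1])
        assert lhs <= rhs + 1e-12                # small tolerance for rounding
    print("gradient inequality held in all 1000 draws")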
42. According to the Theorem of Taylor Expansion with a Remainder, when moving from a to x ∈ B(a, r) in f's domain of definition, the value of f increases if

    Df(a)'(x − a) + (1/2)(x − a)'D^2 f(y)(x − a) > 0.

Note that when r is small enough, the preceding inequality holds if and only if [9]

    Df(a)'(x − a) = ‖Df(a)‖ ‖x − a‖ cos θ > 0,

where θ is the angle between the two vectors Df(a) and x − a, which must be acute to ensure the above inequality [10].

[Footnote 9: Imagine that we replace (x − a) by ε(x − a), and observe that when ε > 0 is a tiny number, ε|Df(a)'(x − a)| is much greater than ε^2 |(1/2)(x − a)'D^2 f(y)(x − a)|. Thus εDf(a)'(x − a) + ε^2 [(1/2)(x − a)'D^2 f(y)(x − a)] is positive (respectively, negative) if and only if Df(a)'(x − a) is positive (respectively, negative).]

[Footnote 10: Recall the Law of Cosines: let ΔABC be a triangle whose sides a, b, c are such that a is opposite A, b is opposite B, and c is opposite C. Let D be the point on the line passing through A and C such that the two line segments BD and CD are orthogonal to each other. Then, since the length of BD is a sin C and the length of AD is |b − a cos C|, we have c^2 = a^2 + b^2 − 2ab cos C. Now, given two vectors v, w and the angle θ between them, if we let a = ‖v‖, b = ‖w‖ and c = ‖v − w‖, then it follows from the Law of Cosines that

    ‖v‖^2 + ‖w‖^2 − 2(v·w) = (v − w)·(v − w) = c^2 = a^2 + b^2 − 2ab cos θ = ‖v‖^2 + ‖w‖^2 − 2‖v‖‖w‖ cos θ,

implying that the following Cosine Formula for the Dot Product must hold: v·w = ‖v‖‖w‖ cos θ.]

43. We shall from now on confine attention to twice differentiable functions. A necessary condition for x* ∈ ℝ^n to solve the maximization program

    (P1)  max_{x∈ℝ^n} f(x)

is that Df(x*) = 0, which will be referred to as the first-order condition for the optimal solution x*. This necessary condition is also sufficient if f is concave.

44. Consider the following maximization program:

    (P2)  max_{x∈ℝ^n} f(x)  subject to  g_i(x) = 0, ∀i = 1, 2, ..., m,

where m < n.

Theorem 3 (Lagrange Theorem). Suppose that x* solves (P2) and {Dg_j(x*); j ∈ {1, 2, ..., m}} is a set of linearly independent gradient vectors. (This is known as a constraint qualification condition.) Then there must exist m constants (called the Lagrange multipliers) π_1, π_2, ..., π_m such that
(i) ∀i = 1, 2, ..., m, g_i(x*) = 0; and
(ii) Σ_{i=1}^m π_i Dg_i(x*) = Df(x*).
Conversely, if f is concave and all the g_i's are affine, then any x* satisfying (i) and (ii) solves (P2). If f is strictly concave, such an x* is unique.

To see the role of the constraint qualification condition in the above necessary condition, consider the following example with n = m = 2:

    f(x_1, x_2) = x_1 + x_2,  g_1(x_1, x_2) = x_1^2 + x_2^2 − 1,  g_2(x_1, x_2) = x_1^2 + (x_2 − 2)^2 − 1.

There is only one feasible solution in this case, namely (x_1, x_2) = (0, 1), which is therefore the optimal solution. Thus x* = (0, 1). However, there do not exist π_1, π_2 ∈ ℝ such that Df(x*) = π_1 Dg_1(x*) + π_2 Dg_2(x*). This happens because Dg_1(x*) and Dg_2(x*) are linearly dependent! (A numerical check appears after item 45.)

45. Now let us prove the sufficiency part of the Lagrange Theorem. Suppose that there exists x* satisfying (i) ∀i = 1, 2, ..., m, g_i(x*) = 0; and (ii) Σ_{i=1}^m π_i Dg_i(x*) = Df(x*), for some π_1, π_2, ..., π_m ∈ ℝ. Consider any other x satisfying g_i(x) = 0 for all i = 1, 2, ..., m. We must show that f(x*) ≥ f(x). To see that this is true, first note that for all i = 1, 2, ..., m, by the fact that g_i is affine, we have

    a_i'x + b_i = g_i(x) = 0 = g_i(x*) = a_i'x* + b_i,

where a_i is the gradient of g_i, so that

    0 = g_i(x) − g_i(x*) = a_i'(x − x*) = Dg_i(x*)'(x − x*).

Now observe that

    f(x) − f(x*) ≤ Df(x*)'(x − x*) = Σ_{i=1}^m π_i Dg_i(x*)'(x − x*) = 0,

where the inequality follows from Theorem 2 and the concavity of f; hence x* is indeed an optimal solution to (P2). Finally, if both x* and x** are distinct optimal solutions to (P2), then x^0 ≡ (1/2)(x* + x**) must be feasible also (i.e., g_i(x^0) = 0 for all i, since each g_i is affine), and yet by the strict concavity of f we have

    f(x^0) > (1/2)[f(x*) + f(x**)] = f(x*),

which is a contradiction. Hence the optimal solution must be unique when f is strictly concave.
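The tangent-circles example of item 44, in numbers. A minimal Python sketch using numpy (assumed available): at the unique feasible point x* = (0, 1), the constraint gradients are linearly dependent and Df(x*) lies outside their span, so no multipliers can exist:

    import numpy as np
    x = np.array([0.0, 1.0])
    Df  = np.array([1.0, 1.0])
    Dg1 = np.array([2 * x[0], 2 * x[1]])          # gradient of x1^2 + x2^2 - 1
    Dg2 = np.array([2 * x[0], 2 * (x[1] - 2)])    # gradient of x1^2 + (x2 - 2)^2 - 1

    G = np.column_stack([Dg1, Dg2])
    print(np.linalg.matrix_rank(G))               # 1: the two gradients are dependent
    # Best least-squares multipliers still leave a nonzero residual:
    pi, *_ = np.linalg.lstsq(G, Df, rcond=None)
    print(np.linalg.norm(G @ pi - Df))            # about 1.0, not 0: Df is not in the span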
Theorem 4 (Kuhn-Tucker Theorem) Suppose that there exists some x̂ such that gi (x̂) < 0 for all i = 1, 2, · · · , m. (This is called the Slater Condition.)11 Then if x∗ is a solution to (P3), there must exist m non-negative constants (called the Lagrange multipliers for the m constraints) π1 , π2 , · · · , πm such that (i) m X πi Dgi (x∗ ) = Df (x∗ ); i=1 and (ii) (complementary slackness) for all i = 1, 2, · · · , m, πi gi (x∗ ) = 0. Conversely, if f is concave and for all i = 1, 2, · · · , m, gi : Rn → R is convex, and if x∗ satisfies the above (i) and (ii), then x∗ is a solution to the above program (P3). If f is strictly concave, such x∗ is unique. 11 Slater condition is a constraint qualification condition, and it applies mainly to the case where f is concave and gi ’s are convex. There are several other constraint qualification conditions. For example, one of them requires that there exist x̂ such that, for all i = 1, 2, · · · , m, Dgi (x∗ )T x̂ < 0 whenever gi (x∗ ) = 0. Let us call this condition A. 15 To see the role of the Slater Condition consider the following example with m = n = 2: f (x1 , x2 ) = x1 + x2 , g1 (x1 , x2 ) = x21 + x22 − 1, g2 (x1 , x2 ) = x21 + (x2 − 2)2 − 1. There is only one feasible solution in this case, namely, (x1 , x2 ) = (0, 1), which is apparently the optimal solution. Thus x∗ = (0, 1). However, there does not exist π1 , π2 ∈ <+ such that Df (x∗ ) = π1 Dg1 (x∗ ) + π2 Dg2 (x∗ ). This happens because we cannot find some x̂ such that g1 (x̂) < 0 and g2 (x̂) < 0! To see why Df (x∗ ) must be contained in the polyhedral cone generated by {Dgi (x∗ ); i = 1, 2, · · · , m}, note that in the opposite case we would be able to find some tiny vector a ≡ x − x∗ such that a and each Dgi (x∗ ) together create an obtuse angle but it and Df (x∗ ) together create an acute angle. This implies that we can find another feasible solution x close to x∗ such that f (x) > f (x∗ ), which is a contradiction. I shall demonstrate this fact in class using a graph.12 47. Let me prove the sufficiency of Kuhn-Tucker Theorem. Suppose that f and all −gi are concave. Suppose that there exist non-negative π1 , π2 , · · · , πm such that at x∗ , (i) m X πi Dgi (x∗ ) = Df (x∗ ); i=1 and (ii) (complementary slackness) for all i = 1, 2, · · · , m, πi gi (x∗ ) = 0. We must show that given any x such that gi (x) ≤ 0 for all i = 1, 2, · · · , m, we have f (x) − f (x∗ ) ≤ 0. 12 The necessity that Df (x∗ ) must be contained in the polyhedral cone generated by {Dgi (x∗ ); i = 1, 2, · · · , m} is actually a consequence of the following Farkas Lamma: if a1 , a2 , · · · , am , b ∈ <n are such that b · x ≤ 0 whenever Pm ai · x ≤ 0 for all i = 1, 2, · · · , m, where x ∈ <n , then it must be that b ∈ P ≡ { i=1 ti ai : ti ∈ <+ , ∀i = 1, 2, · · · , m}. We say that P is a polyhedral cone generated by a1 , a2 , · · · , am . The idea is that P is non-empty, closed and convex, and if P does not contain b, then there exists a hyperplane that separates strictly P and b; see section 11 of the lecture note Separating Hyperplane Theorem. That is, there exists some y ∈ <n and α ∈ < such that p · y < α for all p ∈ P and yet b · y > α. Note that if p ∈ P is such that p · y > 0, then α > 0, but since tp ∈ P for any t ∈ <+ , we must have tp · y > α for sufficiently large t, a contradiction. Since 0 ∈ P, we actually have α > 0 = 0 · y. Thus we must have p · y ≤ 0 for all p ∈ P and yet b · y > 0. 
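A hand-checked instance of Theorem 4, as a Python sketch with numpy (the problem and the candidate point are my own illustrative choices): max x_1 + x_2 subject to g(x) = x_1^2 + x_2^2 − 1 ≤ 0. The Slater condition holds (take x̂ = 0), and the candidate x* = (1/√2, 1/√2) with multiplier π = 1/√2 satisfies (i) and (ii); since f is concave and g convex, sufficiency makes it optimal:

    import numpy as np
    x  = np.array([1.0, 1.0]) / np.sqrt(2)
    pi = 1 / np.sqrt(2)
    Df = np.array([1.0, 1.0])
    Dg = 2 * x                       # gradient of the constraint at x*
    g  = x @ x - 1                   # constraint value at x*

    print(np.allclose(pi * Dg, Df))  # (i)  stationarity: pi * Dg(x*) = Df(x*)
    print(np.isclose(pi * g, 0.0))   # (ii) complementary slackness
    print(pi >= 0, g <= 1e-12)       # multiplier nonnegative, x* feasible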
47. Let us prove the sufficiency part of the Kuhn–Tucker Theorem. Suppose that f and all the −g_i are concave. Suppose that there exist non-negative π_1, π_2, ..., π_m such that at x*, (i) Σ_{i=1}^m π_i Dg_i(x*) = Df(x*); and (ii) (complementary slackness) for all i = 1, 2, ..., m, π_i g_i(x*) = 0. We must show that given any x such that g_i(x) ≤ 0 for all i = 1, 2, ..., m, we have f(x) − f(x*) ≤ 0. To this end, note that

    f(x) − f(x*) ≤ Df(x*)'(x − x*) = Σ_{i=1}^m π_i Dg_i(x*)'(x − x*) ≤ Σ_{i=1}^m π_i [g_i(x) − g_i(x*)] = Σ_{i=1}^m π_i g_i(x) ≤ 0,

where the first inequality follows from Theorem 2 and the fact that f is concave, the first equality follows from the assumption that the Kuhn–Tucker condition (i) holds at x*, the second inequality follows from the fact that π_i ≥ 0 and g_i is convex for all i = 1, 2, ..., m (apply Theorem 2 to the concave −g_i), the second equality follows from the complementary slackness condition at x*, and the last inequality follows from π_i ≥ 0 and g_i(x) ≤ 0 for all i = 1, 2, ..., m. Thus x* is indeed an optimal solution to (P3). Finally, if both x* and x** are distinct optimal solutions to (P3), then x^0 ≡ (1/2)(x* + x**) must be feasible also (i.e., by the convexity of each g_i, g_i(x^0) ≤ 0 for all i), and yet by the strict concavity of f we have

    f(x^0) > (1/2)[f(x*) + f(x**)] = f(x*),

which is a contradiction. Hence the optimal solution must be unique when f is strictly concave.

48. Theorem 5 (Kuhn–Tucker Theorem with Both Inequality and Equality Constraints). Consider the following maximization problem, which generalizes the problems stated in Theorem 3 and Theorem 4:

    max_{x∈ℝ^n} f(x)  subject to  g_i(x) ≤ 0, ∀i = 1, 2, ..., m;  h_j(x) = 0, ∀j = 1, 2, ..., l.

This maximization problem involves m inequality constraints and l equality constraints. The associated Slater condition is as follows: there exists x̂ ∈ ℝ^n such that

    g_i(x̂) < 0, ∀i = 1, 2, ..., m;  h_j(x̂) = 0, ∀j = 1, 2, ..., l.

When the Slater condition holds, the following Kuhn–Tucker necessary conditions must hold at an optimal solution x* at which {Dh_j(x*); j = 1, 2, ..., l} is a set of linearly independent vectors: there must exist (μ_1, μ_2, ..., μ_m)' ∈ ℝ_+^m and (λ_1, λ_2, ..., λ_l)' ∈ ℝ^l such that

    (Stationarity) Df(x*) = Σ_{i=1}^m μ_i Dg_i(x*) + Σ_{j=1}^l λ_j Dh_j(x*);
    h_j(x*) = 0, ∀j = 1, 2, ..., l;
    (Complementary Slackness) μ_i g_i(x*) = 0, ∀i = 1, 2, ..., m.

Conversely, when the Slater condition holds, f is concave, the g_i's are convex, and the h_j's are affine, if x*, (μ_1, μ_2, ..., μ_m)' ∈ ℝ_+^m, and (λ_1, λ_2, ..., λ_l)' ∈ ℝ^l satisfy the above Kuhn–Tucker conditions, then x* must be an optimal solution. [13]

[Footnote 13: Visit for example https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions to see a list of regularity conditions (or constraint qualification conditions) including the Slater condition. After Harold W. Kuhn and Albert W. Tucker published the Kuhn–Tucker conditions in 1951, scholars discovered that these necessary conditions had been stated by William Karush in his master's thesis in 1939. Hence nowadays people also refer to the Kuhn–Tucker conditions as the Karush–Kuhn–Tucker conditions.]

49. Let us give some exercises.

Exercise 4. Consider the differentiable function f : ℝ^n → ℝ defined by f(x_{n×1}) = a'x, ∀x ∈ ℝ^n, where a_{n×1} is a given vector. Show that Df(x) = a for all x ∈ ℝ^n. [14]

[Footnote 14: By definition, f(x_{n×1}) = a'x = Σ_{j=1}^n a_j x_j, so that ∂f/∂x_j = a_j, which by definition is the j-th element of Df.]
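Exercise 4's formula, and the formula Df(x) = (A + A')x derived in Exercise 5 below, can both be spot-checked with central finite differences. A minimal Python sketch with numpy; the dimension, seed, and step size are arbitrary choices:

    import numpy as np
    rng = np.random.default_rng(0)
    n = 4
    a = rng.standard_normal(n)
    A = rng.standard_normal((n, n))
    x = rng.standard_normal(n)
    h = 1e-6

    def num_grad(f, x):
        # Central-difference approximation of the gradient of f at x.
        g = np.zeros_like(x)
        for j in range(len(x)):
            e = np.zeros_like(x); e[j] = h
            g[j] = (f(x + e) - f(x - e)) / (2 * h)
        return g

    print(np.allclose(num_grad(lambda z: a @ z, x), a, atol=1e-6))               # Df = a
    print(np.allclose(num_grad(lambda z: z @ A @ z, x), (A + A.T) @ x, atol=1e-5))  # Df = (A + A')x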
Exercise 5. Consider the differentiable function f : ℝ^n → ℝ defined by f(x_{n×1}) = x'Ax, ∀x ∈ ℝ^n, where A_{n×n} is a given square matrix. Show that [15] for all x ∈ ℝ^n, Df(x) = Ax + A'x and D^2 f(x) = A + A'.

[Footnote 15: (Step 1.) First recall that the j-th element of A_{m×n} x_{n×1} is simply the inner product of x and the j-th row of A: writing α_j' for the j-th row of A, we have Ax = (α_1'x, α_2'x, ..., α_m'x)'. (Step 2.) Now, by definition, we have f(x_{n×1}) = x'Ax = Σ_{i=1}^n Σ_{k=1}^n x_i a_{ik} x_k, so that

    ∂f/∂x_j = Σ_{k=1}^n a_{jk} x_k + Σ_{i=1}^n a_{ij} x_i,

which is the inner product of x and the j-th row of A + A'. Hence we know from Step 1 that Df(x) = (A + A')x. Now observe that the j-th row of D^2 f is simply the transpose of the gradient of ∂f/∂x_j, or equivalently, the transpose of the gradient of the j-th element of Df. Since the j-th element of Df is the inner product of x and the j-th row of A + A', its gradient, by the preceding exercise, is the transpose of the j-th row of A + A'. Hence the j-th row of D^2 f is the j-th row of A + A'; that is, D^2 f = A + A'.]

Exercise 6. Let 1 be the n-vector of ones. Let e and V be respectively a given n-vector and a given n × n symmetric positive definite matrix. Solve the following maximization problem: given a constant k > 0,

    max_{x∈ℝ^n} x'e − (k/2) x'Vx  subject to the constraint x'1 = 1.

(Hint: First check that the objective function f(x) ≡ x'e − (k/2)x'Vx is concave and that the constraint function g(x) ≡ x'1 − 1 is affine. Then apply the Lagrange Theorem. A numerical sketch appears after Exercise 7.)

Exercise 7. Consider the following utility maximization problem:

    max_{x,y∈ℝ_+} f(x, y) = x^{1/3} y^{2/3}  subject to the budget constraint P_x x + P_y y ≤ I,

where P_x, P_y, I > 0 are given constants. The idea is that the decision-maker is a consumer endowed with I dollars, who wishes to maximize the utility f(x, y) from consuming x units of good X and y units of good Y, where the prices of the two goods are respectively P_x and P_y.

(i) Show that f(x, y) = 0 if xy = 0, and yet there exists (x^0, y^0) satisfying x^0 > 0, y^0 > 0, f(x^0, y^0) > 0, and P_x x^0 + P_y y^0 < I. (Try for example x^0 = I/(3P_x) and y^0 = I/(3P_y).) Thus we conclude that (1) the Slater condition holds; and (2) the decision-maker can focus on those (x, y) satisfying x > 0, y > 0, and P_x x + P_y y ≤ I.

(ii) Denote by ℝ^2_++ the set {(x, y) ∈ ℝ^2 : x > 0, y > 0}. Show that on the restricted domain ℝ^2_++,

    D^2 f = [ −(2/9) x^{−5/3} y^{2/3}     (2/9) x^{−2/3} y^{−1/3}
               (2/9) x^{−2/3} y^{−1/3}   −(2/9) x^{1/3} y^{−4/3} ],

and that for any (h, k)' ∈ ℝ^2,

    [h k] D^2 f [h k]' = −(2/9) x^{−5/3} y^{−4/3} (hy − kx)^2 ≤ 0, ∀(x, y) ∈ ℝ^2_++.

Conclude that f : ℝ^2_++ → ℝ is a concave function.

(iii) Define g(x, y) = P_x x + P_y y − I. Show that D^2 g is a positive semi-definite matrix at all (x, y) ∈ ℝ^2_++. (Indeed D^2 g is the zero matrix, since g is affine.)

(iv) Use the Kuhn–Tucker theorem to show that the optimal solution is

    (x*, y*) = (I/(3P_x), 2I/(3P_y)).
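A numerical sketch of Exercise 6 via the Lagrange Theorem, in Python with numpy. The first-order condition is e − kVx = λ1, so x = (1/k)V^{-1}(e − λ1), with λ pinned down by 1'x = 1, giving λ = (1'V^{-1}e − k)/(1'V^{-1}1). The data below are made up purely for the check:

    import numpy as np
    rng = np.random.default_rng(42)
    n, k = 5, 2.0
    e = rng.uniform(0.01, 0.10, n)               # hypothetical expected returns
    B = rng.standard_normal((n, n))
    V = B @ B.T + n * np.eye(n)                  # a symmetric positive definite matrix
    ones = np.ones(n)

    Vi_e = np.linalg.solve(V, e)                 # V^{-1} e
    Vi_1 = np.linalg.solve(V, ones)              # V^{-1} 1
    lam = (ones @ Vi_e - k) / (ones @ Vi_1)      # Lagrange multiplier
    x = (Vi_e - lam * Vi_1) / k                  # candidate optimal portfolio

    print(np.isclose(ones @ x, 1.0))                  # the constraint x'1 = 1 holds
    print(np.allclose(e - k * (V @ x), lam * ones))   # stationarity e - kVx = lam*1 holds

Since the objective is strictly concave (V is positive definite) and the constraint is affine, the sufficiency part of Theorem 3 makes this x the unique optimum.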
50. If Ã_{m×n} is a two-dimensional array of mn random variables (i.e., each element ã_{ij} is a random variable), then à is called a random matrix, and E(Ã) is the non-random matrix of the same size as Ã, with its (i, j)-th element being E(ã_{ij}), ∀i = 1, 2, ..., m; ∀j = 1, 2, ..., n. We shall call E(Ã) the expected value of the random matrix Ã. If n = 1, Ã_{m×1} = r̃_{m×1} is called a random vector. The variance-covariance matrix of the random vector r̃ is the m × m non-random matrix V with v_{ij} = cov(r̃_i, r̃_j). Since for any two random variables x̃ and ỹ, cov(x̃, ỹ) = cov(ỹ, x̃) (verify!), the matrix V is symmetric. Moreover, since for any random variable x̃, cov(x̃, x̃) = var(x̃) (verify!), the main diagonal elements of V are, respectively, the variances of the random variables in the random vector r̃.

51. Theorem 6. Let V be the covariance matrix of the random vector r̃_{m×1}, and denote E[r̃] by e. The following statements are true.
(i) V = E[(r̃ − e)(r̃ − e)'].
(ii) V is positive semi-definite. V is positive definite if and only if there exists no non-zero m-vector x such that the variance of x'r̃ equals zero. In the case m = 2, such a non-zero x exists if and only if r̃_1 and r̃_2 are perfectly correlated.
(iii) If ỹ has zero variance, then the event {ỹ ≠ E[ỹ]} may occur only with zero probability; that is, ỹ is essentially a (non-random) constant.
(iv) Given any m-vectors w, w_1 and w_2, the expected value of w'r̃ is w'e, the variance of w'r̃ is w'Vw, and the covariance of w_1'r̃ and w_2'r̃ is w_1'Vw_2.

Proof. Consider part (i). For all k, j ∈ {1, 2, ..., m}, the (k, j)-th element of the matrix E[(r̃ − e)(r̃ − e)'] is E[(r̃_k − E[r̃_k])(r̃_j − E[r̃_j])] = cov(r̃_k, r̃_j), and hence V and E[(r̃ − e)(r̃ − e)'] are the same matrix.

Next, consider part (ii). Pick any x_{m×1}, and observe that

    x'Vx = x'E[(r̃ − e)(r̃ − e)']x = E[x'(r̃ − e)(r̃ − e)'x] = E[(x'r̃ − x'e)(r̃'x − e'x)] = E[(x'r̃ − E[x'r̃])^2] = var[x'r̃] ≥ 0.

Thus by definition, V is positive semi-definite. Moreover, we see that V is positive definite if and only if var[x'r̃] = x'Vx > 0 for all x_{m×1} ≠ 0_{m×1}. Now, if m = 2, we have

    V_{2×2} = [ var[r̃_1]         cov(r̃_1, r̃_2)
                cov(r̃_1, r̃_2)   var[r̃_2]       ].

Recall that a positive semi-definite matrix V is singular if and only if V is not positive definite, and the latter is true if and only if there exists x_{m×1} ≠ 0_{m×1} such that x'Vx = 0. Thus such a non-zero vector exists if and only if the determinant of V is zero, or

    cov(r̃_1, r̃_2)^2 / (var[r̃_1] var[r̃_2]) = 1,

implying that the coefficient of correlation for (r̃_1, r̃_2) equals either 1 or −1.

For part (iii), suppose that var[ỹ] = 0. (A mathematical fact: if var[ỹ] is a non-negative real number, then E[ỹ] is a real number.) By definition, 0 = var[ỹ] = E[(ỹ − E[ỹ])^2], and the right-hand side is the Lebesgue integral of the non-negative function (ỹ − E[ỹ])^2. Since the integral equals zero, the event that (ỹ − E[ỹ])^2 > 0 may occur only with zero probability. Equivalently, we conclude that {ỹ ≠ E[ỹ]} is a zero-probability event; that is, ỹ is essentially a constant. An example is as follows. Let z̃ ∼ N(0, 1) be a standard normal random variable, and define a new random variable ỹ = f(z̃), where f(x) = 1 if x is a rational number and f(x) = 0 otherwise. Then we have E[ỹ] = 0 = var[ỹ], and although the realization of ỹ is not always equal to zero, those non-zero realizations together may occur only with zero probability.

Finally, consider part (iv). By definition, we have

    E[w'r̃] = E[Σ_{j=1}^m w_j r̃_j] = Σ_{j=1}^m w_j E[r̃_j] = w'e.

Also, note that

    w_1'Vw_2 = E[(w_1'r̃ − w_1'e)(w_2'r̃ − w_2'e)] = cov(w_1'r̃, w_2'r̃).

In particular, if w_1 = w_2 = w, then we have w'Vw = cov(w'r̃, w'r̃) = var[w'r̃]. This finishes the proof. ∎
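Theorem 6 is easy to illustrate by simulation. A minimal Python sketch with numpy (dimension, sample size, and the vector w are arbitrary choices): the sample covariance matrix is positive semi-definite, and the sample variance of w'r̃ is close to w'Vw:

    import numpy as np
    rng = np.random.default_rng(7)
    m, T = 3, 200_000
    A = rng.standard_normal((m, m))
    V = A @ A.T                                        # the true covariance matrix
    r = rng.multivariate_normal(np.zeros(m), V, size=T)

    V_hat = np.cov(r, rowvar=False)                    # sample covariance matrix
    print(np.linalg.eigvalsh(V_hat).min() >= -1e-12)   # PSD: all eigenvalues >= 0

    w = np.array([0.5, -1.0, 2.0])
    print(np.var(r @ w), w @ V @ w)                    # the two numbers nearly agree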
52. (Multi-variate Gaussian Random Variables.) Suppose that V is positive definite. We say that the m ≥ 1 random variables in r̃ are multi-variate normal if the joint density function of (r̃_1, r̃_2, ..., r̃_m) is

    f(r) = (2π)^{−m/2} |V|^{−1/2} exp[−(1/2)(r − e)'V^{−1}(r − e)], ∀r ∈ ℝ^m.

In this case, we simply write [16] r̃ ∼ N(e, V). More generally, given a vector

    z̃_{(m+n)×1} = [ x̃_{n×1}
                    ỹ_{m×1} ]

of multi-variate normal random variables, where

    [ x̃_{n×1} ]        [ a_{n×1} ]   [ V_{n×n}    C_{n×m} ]
    [ ỹ_{m×1} ]  ∼ N(  [ b_{m×1} ] , [ C'_{m×n}   U_{m×m} ] ),

the conditional expectation of x̃ given ỹ = y is

    E[x̃|y] = a + CU^{−1}(y − b),

and the conditional covariance matrix of x̃ given ỹ = y is

    var(x̃|y) = V − CU^{−1}C'.

[Footnote 16: Some related results are as follows. Given any non-random matrix A_{n×m}, the random vector Ar̃ is multivariate normal if r̃ is multivariate normal. In fact, an equivalent condition for r̃ to be multivariate normal is that b'r̃ is a normal random variable for all b_{m×1} ∈ ℝ^m. Note that if r̃ is multivariate normal, then each random variable r̃_i is itself a normal random variable. Conversely, suppose that for all i ∈ {1, 2, ..., m}, r̃_i is a normal random variable. Then r̃ is multivariate normal if the m random variables (r̃_1, r̃_2, ..., r̃_m) are totally independent.]

53. Now we review conditional moments and the law of iterated expectations. Suppose that x̃ has two possible outcomes, 0 and 30, and that ỹ has two possible outcomes, 20 and −20. The joint probabilities for the 4 possible pairs of outcomes of x̃ and ỹ are summarized in the following table.

    x̃ \ ỹ    −20    20
    0         0.1    0.2
    30        0.3    0.4

The probabilities of the following 4 events define the joint distribution of x̃ and ỹ:

    prob.(x̃ = 0, ỹ = −20) = 0.1;   prob.(x̃ = 0, ỹ = 20) = 0.2;
    prob.(x̃ = 30, ỹ = −20) = 0.3;  prob.(x̃ = 30, ỹ = 20) = 0.4.

From the joint distribution of the two random variables, we can derive the marginal distribution of each random variable. More precisely, we immediately have the marginal distribution of x̃, defined by the probabilities of the following two events:

    prob.(x̃ = 0) = prob.(x̃ = 0, ỹ = −20) + prob.(x̃ = 0, ỹ = 20) = 0.1 + 0.2 = 0.3;
    prob.(x̃ = 30) = prob.(x̃ = 30, ỹ = −20) + prob.(x̃ = 30, ỹ = 20) = 0.3 + 0.4 = 0.7.

We can similarly obtain the marginal distribution of ỹ, defined by the probabilities of the following two events:

    prob.(ỹ = −20) = prob.(x̃ = 0, ỹ = −20) + prob.(x̃ = 30, ỹ = −20) = 0.1 + 0.3 = 0.4;
    prob.(ỹ = 20) = prob.(x̃ = 0, ỹ = 20) + prob.(x̃ = 30, ỹ = 20) = 0.2 + 0.4 = 0.6.

Now we can obtain the expected value and variance of each random variable, using that random variable's marginal distribution. Verify that

    E[x̃] = 0 × 0.3 + 30 × 0.7 = 21,  E[ỹ] = −20 × 0.4 + 20 × 0.6 = 4,
    var(x̃) = (0 − 21)^2 × 0.3 + (30 − 21)^2 × 0.7 = 189,
    var(ỹ) = (−20 − 4)^2 × 0.4 + (20 − 4)^2 × 0.6 = 384.

Finally, we can use the joint distribution of the two random variables to compute their covariance:

    cov(x̃, ỹ) = (0 − 21)(−20 − 4) × 0.1 + (0 − 21)(20 − 4) × 0.2 + (30 − 21)(−20 − 4) × 0.3 + (30 − 21)(20 − 4) × 0.4 = −24.
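All of the numbers above can be recomputed mechanically from the joint distribution. A minimal Python sketch:

    # Marginals, means, variances, and the covariance from the joint table in item 53.
    joint = {(0, -20): 0.1, (0, 20): 0.2, (30, -20): 0.3, (30, 20): 0.4}

    Ex  = sum(p * x for (x, y), p in joint.items())
    Ey  = sum(p * y for (x, y), p in joint.items())
    Vx  = sum(p * (x - Ex) ** 2 for (x, y), p in joint.items())
    Vy  = sum(p * (y - Ey) ** 2 for (x, y), p in joint.items())
    cov = sum(p * (x - Ex) * (y - Ey) for (x, y), p in joint.items())

    print(Ex, Ey, Vx, Vy, cov)   # approximately 21, 4, 189, 384, -24, matching the text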
We shall also need the concepts of the conditional distribution and the conditional expectation of a random variable; the former is the probability distribution of one random variable conditional on knowing the other random variable's outcome. For example, the conditional distribution of x̃, after we know that the outcome of ỹ is −20, is defined by the probabilities of the following two events:

    prob.(x̃ = 0|ỹ = −20) = prob.(x̃ = 0, ỹ = −20)/prob.(ỹ = −20) = 0.1/(0.1 + 0.3) = 1/4;
    prob.(x̃ = 30|ỹ = −20) = prob.(x̃ = 30, ỹ = −20)/prob.(ỹ = −20) = 0.3/(0.1 + 0.3) = 3/4.

Now, after we know that the outcome of ỹ is −20, with the conditional distribution of x̃ we can compute the conditional expectation of x̃:

    E[x̃|ỹ = −20] = 0 × 1/4 + 30 × 3/4 = 90/4.

Similarly, we can find the conditional distribution of x̃ after we know that the outcome of ỹ is 20. Again, this is defined by the probabilities of two events:

    prob.(x̃ = 0|ỹ = 20) = prob.(x̃ = 0, ỹ = 20)/prob.(ỹ = 20) = 0.2/(0.2 + 0.4) = 1/3;
    prob.(x̃ = 30|ỹ = 20) = prob.(x̃ = 30, ỹ = 20)/prob.(ỹ = 20) = 0.4/(0.2 + 0.4) = 2/3.

Hence we have

    E[x̃|ỹ = 20] = 0 × 1/3 + 30 × 2/3 = 20.

These computations tell us that, if a high outcome of x̃ would make us better off, then a lower realization of ỹ is good news.

54. Theorem 7 (Law of Iterated Expectations; LIE). The formula E[E[x̃|ỹ]] = E[x̃], which says that the expected value of the conditional expectations of x̃ is equal to the original (unconditional) expected value of x̃, holds whenever E[x̃] is a finite real number, and is referred to as the law of iterated expectations. Note that E[x̃|ỹ] is the expected value of x̃ after seeing the realization of ỹ, and hence it naturally varies with the realization of ỹ. In the above example, we know from the marginal distribution of ỹ that ỹ equals −20 with probability 0.4 and equals 20 with probability 0.6, and E[x̃|ỹ] is the random variable that equals 90/4 when ỹ = −20 and equals 20 when ỹ = 20. Thus the expected value of these conditional expectations is

    0.4 × 90/4 + 0.6 × 20 = 9 + 12 = 21 = E[x̃],

confirming that, indeed, E[E[x̃|ỹ]] = E[x̃].

55. Suppose that a : ℝ → ℝ, b : ℝ → ℝ, and f : ℝ^2 → ℝ are all continuously differentiable. Define

    F(x) ≡ ∫_{b(x)}^{a(x)} f(x, t) dt, ∀x ∈ ℝ.

Then we have (Leibniz Rule)

    F'(x) = −b'(x) f(x, b(x)) + a'(x) f(x, a(x)) + ∫_{b(x)}^{a(x)} (∂f/∂x)(x, t) dt.

(A numerical sketch of this rule appears at the end of this note.)

56. A function f : U ⊂ ℝ^n → ℝ is quasi-concave if for all r ∈ ℝ, the set {x ∈ U : f(x) ≥ r} is a convex subset of ℝ^n, where we assume that U is itself open and convex. When n = 1, any monotone function f : ℝ → ℝ is quasi-concave. The function f : ℝ^2_++ → ℝ defined by f(x, y) = xy is not concave, but it is still quasi-concave.

57. A function f : ℝ^n → ℝ is strictly increasing if for all ε > 0, all j ∈ {1, 2, ..., n}, and all x ∈ ℝ^n, f(x + εu_j) > f(x), where u_j has its j-th element equal to one and all other elements equal to zero.

58. Consider the following maximization problem (P4), where f : ℝ^n → ℝ is strictly increasing and quasi-concave, g : ℝ^n → ℝ is strictly increasing and convex, and both f and g are twice continuously differentiable:

    max_{x∈ℝ^n} f(x)  subject to  g(x) ≤ 0.

Suppose that there exist x^0, x̂ ∈ ℝ^n such that g(x̂) < 0 < g(x^0). Then (P4) has a solution x* such that g(x*) = 0 and, for some μ > 0, Df(x*) = μDg(x*).
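Finally, the promised numerical sketch of the Leibniz rule in item 55, in Python assuming scipy is available. The integrand f(x, t) = sin(xt) and the limits a(x) = x^2, b(x) = x are made-up choices; the rule's right-hand side is compared against a central finite difference of F:

    import numpy as np
    from scipy.integrate import quad

    f  = lambda x, t: np.sin(x * t)
    fx = lambda x, t: t * np.cos(x * t)              # partial of f with respect to x
    a, da = lambda x: x ** 2, lambda x: 2 * x
    b, db = lambda x: x, lambda x: 1.0

    F = lambda x: quad(lambda t: f(x, t), b(x), a(x))[0]

    x, h = 1.3, 1e-5
    numeric = (F(x + h) - F(x - h)) / (2 * h)        # finite-difference F'(x)
    leibniz = (-db(x) * f(x, b(x)) + da(x) * f(x, a(x))
               + quad(lambda t: fx(x, t), b(x), a(x))[0])
    print(numeric, leibniz)                          # the two values nearly coincide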