MIT 6.S087: Math Methods for Multidimensional Statistics
IAP 2022

Problem Set: Lecture 4

Solve problems totaling 30 points for full credit. You may submit solutions for at most 45 points.

1: Power Mean Inequality (5 points)

Prove the power mean inequality: for any positive real numbers $a_1, \dots, a_n$ and any two nonnegative real numbers $p > q \ge 0$, we have
\[
\left( \frac{a_1^p + \cdots + a_n^p}{n} \right)^{1/p} \ge \left( \frac{a_1^q + \cdots + a_n^q}{n} \right)^{1/q}.
\]
Hint: apply Jensen's inequality to $x^{p/q}$ when $p > q > 0$. You may prefer the following formulation: for any positive random variable $X$, prove that $\mathbb{E}[X^p]^{1/p} \ge \mathbb{E}[X^q]^{1/q}$. When $q = 0$, the corresponding quantity is the geometric mean; make sure to handle this case.

2: Epigraph of a Convex Function (5 points)

We relate convex functions to convex sets. Consider a function $f : \mathbb{R}^n \to \mathbb{R}$. The "epigraph" of $f$ is defined as the set
\[
E(f) = \{ (\vec{x}, y) : \vec{x} \in \mathbb{R}^n,\ y \in \mathbb{R},\ f(\vec{x}) \le y \}.
\]
Show that $f$ is a convex function if and only if $E(f)$ is a convex set. (Note: consider addition and constant multiplication on tuples $(\vec{x}, y)$ as though they were $(n+1)$-dimensional vectors.)

3: Kullback-Leibler Divergence (5 points)

Let $p_X(\cdot)$ and $q_X(\cdot)$ be two probability distributions supported on the same finite set $\mathcal{X}$. Assume that $p_X(x) > 0$ and $q_X(x) > 0$ for all $x \in \mathcal{X}$. Define
\[
D(p_X, q_X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{p_X(x)}{q_X(x)}.
\]
Using Jensen's inequality on an appropriate function, show that $D(p_X, q_X) \ge 0$, with equality if and only if $p_X(x) = q_X(x)$ for all $x \in \mathcal{X}$. You may use the fact that for a strictly convex function, equality holds in Jensen's inequality if and only if all vectors $\vec{v}_i$ with nonzero weight $a_i$ are equal.

4: Total Variation Distance (10 points)

Let $p_X(x)$ be a pmf on $\mathcal{X} = \{1, 2, \dots, K\}$. One convenient representation of $p_X(x)$ is as a vector in $\mathbb{R}^K$,
\[
\vec{p} = \begin{pmatrix} p_1 \\ \vdots \\ p_K \end{pmatrix}, \tag{4.1}
\]
where $p_i := p_X(i)$. The conditions that $p_X(x)$ is nonnegative and sums to 1 specify constraints on the possible values of $p_X(i)$ for $i \in \mathcal{X}$.
(a) (5 points) Show that if the vector representation in (4.1) is used, these constraints define a convex polytope $P$ in $\mathbb{R}^K$. Plot this polytope for $K = 3$.

(b) (5 points) Now, let $\vec{p}$ and $\vec{q}$ be two vectors in the polytope $P$. A commonly used "measure of similarity" between two distributions is the total variation distance, defined as
\[
D_{TV}(\vec{p}, \vec{q}) := \frac{1}{2} \sum_{i=1}^{K} |p_i - q_i|.
\]
$D_{TV}$ defines a distance on $P$. Hence, for a given distribution $\vec{q}$, it is often desirable to minimize the function $f(\vec{p}) := D_{TV}(\vec{p}, \vec{q})$ to find the distribution $\vec{p}$ that is most similar to $\vec{q}$. Show that the function $f(\vec{p})$ is convex over $P$.

5: Aggregating Predictions (10 points)

Suppose you wish to predict the temperature tomorrow, $Y$, by aggregating the predictions $\vec{X} = (X_1, X_2, \dots, X_n)^T$ from $n$ weather experts in some linear combination.

From previous observation, we know that the $n$ predictions might not all be of the same accuracy, so a simple mean would not be ideal. Even ignoring this, some predictions might be highly correlated with each other: if we have $n = 3$ expert predictions of the same quality, but the first two experts nearly always give the same prediction, then we effectively have only 2 predictions, so a simple mean over the 3 experts would give undue weight to the common factor shared by the first two experts.

Formally, we model the expert predictions $X_1, \dots, X_n$ and the temperature tomorrow $Y$ as random variables, and define the quality of a prediction $X$ as $\mathrm{Corr}[X, Y]$. Given the qualities of the $n$ predictions as well as the correlations between the $n$ predictions, we seek a linear combination of the predictions ($\vec{v}^T \vec{X}$ for some fixed weights $\vec{v}$) such that our resulting prediction has the highest possible quality. We approach this problem in several steps:

(a) (3 points) We first establish the following useful result. Let $Q$ be an $n \times n$ positive definite symmetric matrix, and let $\vec{u}$ be a nonzero $n$-dimensional vector.
Show that, subject to the constraint $\vec{u}^T \vec{v} = 1$, the value of $\vec{v}^T Q \vec{v}$ is minimized exactly when
\[
\vec{v} = \frac{Q^{-1} \vec{u}}{\vec{u}^T Q^{-1} \vec{u}}.
\]
Take care to show that this vector is indeed the unique minimizer.

(b) (2 points) For convenience, assume the values of the temperature and the predictions are normalized, i.e. $X_1, X_2, \dots, X_n, Y$ all have mean 0 and variance 1. Let $\Sigma$ be the covariance matrix of $\vec{X}$, assuming it is invertible, and let $\vec{q} = (q_1, q_2, \dots, q_n)^T$ be the qualities of the $n$ predictions. Show that the quality of our prediction $\vec{v}^T \vec{X}$ satisfies
\[
\mathrm{Corr}[\vec{v}^T \vec{X}, Y] = \frac{\vec{q}^T \vec{v}}{\sqrt{\vec{v}^T \Sigma \vec{v}}}.
\]

(c) (2 points) Use the result from part (a) to find a weight vector $\vec{v}$ that maximizes $\mathrm{Corr}[\vec{v}^T \vec{X}, Y]$, the quality of our prediction. In terms of $\vec{q}$ and $\Sigma$, what is this maximal quality?

Hint: You may need to add an artificial constraint in order to use the result from part (a). Be sure to explain why adding this constraint does not affect the maximal quality.

(d) For the following parts, suppose that each expert's prediction has the same quality.

(i) (1 point) Suppose $\Sigma = I_n$, i.e. all of the predictions are independent from each other. Intuitively, the maximal quality is attained by taking the mean of all $n$ predictions. Show that this agrees with your answer in part (c).

(ii) (2 points) Now consider the case mentioned in the preamble, where $n = 3$ and where $X_1$ and $X_2$ are highly correlated with correlation $1 - \epsilon$, while $X_3$ is independent from both $X_1$ and $X_2$. (Note that the correlation cannot be 1, as we need $\Sigma$ to be positive definite.) Show that for small $\epsilon > 0$, the optimal value of $\vec{v}$ is close to a positive multiple of $(1, 1, 2)^T$. This implies that the optimal ratio between the weights of $X_1$, $X_2$, and $X_3$ is approximately $1 : 1 : 2$.

6: Letter Grading and Entropy (15 points)

Suppose you are a teacher assigning the final letter grade for a class. The letter grades A, B, C, D, and F correspond to 5, 4, 3, 2, and 0 points on the GPA scale.
As such, we view this as assigning a score from the set $\mathcal{X} = \{0, 2, 3, 4, 5\}$ to each student. Suppose you also want the mean score among your students to be exactly $\mu$ for some fixed constant $0 < \mu < 5$.

(a) To make the grades more meaningful, you wish to choose a distribution of grades that minimizes the chance of two students getting the same grade; formally, you wish to choose a probability mass function (pmf) $p$ defined on $\mathcal{X}$ such that if $X$ and $Y$ are independent random variables with pmf $p$, then $\mathbb{P}[X = Y]$ is minimized.

(i) (4 points) Formulate this problem as a convex optimization problem and show that any solution $p$ of this optimization problem is unique.

(ii) (4 points) If $p$ is a solution of this optimization problem, show that there exist constant values $a, b$ such that for any $x \in \mathcal{X}$, $p(x) = \max(ax + b, 0)$.

Hint: a common mistake is to show only that $p(x) \in \{ax + b, 0\}$ for each $x$ instead of $p(x) = \max(ax + b, 0)$. Make sure you prove the latter.

(b) Consider the following alternative approach. The "amount of information" contained in a distribution with pmf $p$ can be characterized by its entropy, which is defined as
\[
H(p) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x).
\]
Instead of minimizing the probability of students getting the same score, you now wish to maximize the entropy of the score distribution, subject to the constraint on the mean score.

(i) (5 points) Show that any solution $p$ of this optimization problem is unique, and that given $p$ there exist constant values $k, r$ such that for any $x \in \mathcal{X}$, $p(x) = k r^x$. If $\mu = 256/61$, what is the proportion of students in your class that should be awarded each letter grade? (You may assume the above formula if you did not prove it.)

(ii) (2 points) Now suppose you may assign scores as any non-negative integer, so $\mathcal{X} = \{0, 1, 2, \dots\}$, but you still wish to have a mean score of $\mu$ and to maximize the entropy.
You may note without proof that the derivation in part (i) does not involve the contents of the set $\mathcal{X}$, so the results of part (i) still hold. Express the optimal $p(x)$ in terms of $x$ and $\mu$.

7: Normal Distribution and Entropy (20 points)

Let $\mathcal{X}$ be a finite subset of $\mathbb{R}$. Consider the probability mass function (pmf) $p$ of an $n$-dimensional random vector $\vec{X}$ taking values in $\mathcal{X}^n$. As above, define its entropy as
\[
H(p) = -\sum_{\vec{x} \in \mathcal{X}^n} p(\vec{x}) \log_2 p(\vec{x}).
\]
Moreover, define the first moment (mean) of a random vector $\vec{X}$ to be $\mathbb{E}[\vec{X}]$, and define the second moment of a random vector $\vec{X}$ to be $\mathbb{E}[\vec{X}\vec{X}^T]$. We wish to find the pmf $p$ of a random vector $\vec{X}$ with the maximum entropy, where the first and second moments of $\vec{X}$ are constrained to be exactly $\vec{\mu}$ and $M$, for some fixed $n$-dimensional vector $\vec{\mu}$ and fixed symmetric positive definite $n \times n$ matrix $M$.

(a) (10 points) Express the Lagrangian $L$ in terms of $p$, $\vec{\mu}$, $M$, $\mathcal{X}^n$, and any Lagrange multipliers. For full credit (and in order to do part (b)), stack the Lagrange multipliers in vector and matrix form to avoid summations over individual multipliers. How many Lagrange multipliers are required?

Hint: recall that we assign a Lagrange multiplier to each individual constraint. How many constraints are we in effect stating with $\mathbb{E}[\vec{X}] = \vec{\mu}$ or $\mathbb{E}[\vec{X}\vec{X}^T] = M$?

(b) (5 points) Show that any solution $p$ of this optimization problem is unique, and that given $p$ there exist some constant scalar $k$, constant vector $\vec{m}$, and constant matrix $\Lambda$ such that for all $\vec{x} \in \mathcal{X}^n$,
\[
p(\vec{x}) = k \exp(\vec{m}^T \vec{x} + \vec{x}^T \Lambda \vec{x}).
\]

(c) (5 points) Suppose we extend the optimization problem to random vectors over $\mathbb{R}^n$, i.e. taking $\mathcal{X} = \mathbb{R}$. Let $p$ be a solution to this optimization problem. You may assume the formula in (b) still holds for this $p$; note that it resembles a multivariate Gaussian distribution (how?).
This shows that a multivariate Gaussian distribution is in fact the unique maximizer of entropy when the first and second moments are fixed. For this $p$, express $k$, $\vec{m}$, and $\Lambda$ from the formula in part (b) in terms of $\vec{\mu}$, $M$, and $n$.
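As an optional numerical sanity check (not part of the graded problems), the closed-form minimizer claimed in Problem 5(a) can be compared against randomly sampled feasible vectors. The sketch below assumes NumPy is available; the matrix $Q$ and vector $\vec{u}$ are arbitrary illustrative choices, not part of the problem data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Build an arbitrary symmetric positive definite Q and a nonzero u.
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)
u = rng.standard_normal(n)

# Claimed minimizer from Problem 5(a): v* = Q^{-1} u / (u^T Q^{-1} u).
Qinv_u = np.linalg.solve(Q, u)
v_star = Qinv_u / (u @ Qinv_u)
assert np.isclose(u @ v_star, 1.0)  # v* satisfies the constraint u^T v = 1

best = v_star @ Q @ v_star

# Compare against many random feasible vectors (rescaled so u^T v = 1).
for _ in range(1000):
    w = rng.standard_normal(n)
    if abs(u @ w) < 1e-8:
        continue
    v = w / (u @ w)
    assert v @ Q @ v >= best - 1e-9  # v* is never beaten
```

Such a check cannot replace the proof, but it is a quick way to catch an algebra slip (for instance, forgetting the normalizing factor $\vec{u}^T Q^{-1} \vec{u}$) before writing up part (a).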