MIT 6.S087: Math Methods for Multidimensional Statistics
IAP 2022
Problem Set: Lecture 4
Solve problems worth 30 points for full credit. You may submit solutions worth at most
45 points in total.
1: Power Mean Inequality (5 points)
Prove the power mean inequality: for any positive real numbers $a_1, \ldots, a_n$ and any two
nonnegative real numbers $p > q \ge 0$, we have that
\[
\left( \frac{a_1^p + \cdots + a_n^p}{n} \right)^{1/p} \ge \left( \frac{a_1^q + \cdots + a_n^q}{n} \right)^{1/q}.
\]
Hint: apply Jensen's inequality to $x^{p/q}$ when $p > q > 0$. You may prefer the following
formulation: for any positive random variable $X$, prove that $E[X^p]^{1/p} \ge E[X^q]^{1/q}$. When
$q = 0$, the corresponding quantity is the geometric mean; make sure to handle this case.
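Remark (optional numerical check): the sketch below compares the two power means on random positive data for a few exponent pairs, including the $q = 0$ (geometric mean) case. It is only a sanity check, not a proof; the helper `power_mean` and the sampled data are our own choices.

```python
import numpy as np

def power_mean(a, p):
    """Power mean ((a_1^p + ... + a_n^p) / n)^(1/p); geometric mean when p == 0."""
    a = np.asarray(a, dtype=float)
    if p == 0:
        return float(np.exp(np.mean(np.log(a))))  # the q = 0 case: geometric mean
    return float(np.mean(a ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
a = rng.uniform(0.1, 10.0, size=1000)
for p, q in [(2, 1), (3, 0.5), (1, 0), (0.5, 0)]:
    assert power_mean(a, p) >= power_mean(a, q) - 1e-12  # higher exponent wins
print("power mean inequality holds on the sampled data")
```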
2: Epigraph of a Convex Function (5 points)
We relate convex functions to convex sets. Consider a function $f : \mathbb{R}^n \to \mathbb{R}$. The
"epigraph" of $f$ is defined as the set
\[
E(f) = \{ (\vec{x}, y) : \vec{x} \in \mathbb{R}^n,\ y \in \mathbb{R},\ f(\vec{x}) \le y \}.
\]
Show that $f$ is a convex function if and only if $E(f)$ is a convex set. (Note: treat addition
and scalar multiplication on tuples $(\vec{x}, y)$ as though they were $(n+1)$-dimensional vectors.)
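Remark (optional numerical check): for one concrete convex $f$ (here $f(\vec{x}) = \|\vec{x}\|^2$, our own choice), the sketch below samples points of $E(f)$ and verifies that their convex combinations stay in $E(f)$. This illustrates only one direction of the claim and proves nothing.

```python
import numpy as np

def f(x):
    return float(np.dot(x, x))               # a convex function on R^n: f(x) = ||x||^2

rng = np.random.default_rng(1)
for _ in range(1000):
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    y1 = f(x1) + rng.uniform(0, 5)            # (x1, y1) lies in E(f) since y1 >= f(x1)
    y2 = f(x2) + rng.uniform(0, 5)
    t = rng.uniform()
    xm, ym = t * x1 + (1 - t) * x2, t * y1 + (1 - t) * y2
    assert f(xm) <= ym + 1e-9                 # the convex combination stays in E(f)
print("sampled convex combinations of epigraph points remain in the epigraph")
```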
3: Kullback-Leibler Divergence (5 points)
Let $p_X(\cdot)$ and $q_X(\cdot)$ be two probability distributions supported on the same finite set $\mathcal{X}$.
Assume that $p_X(x) > 0$ and $q_X(x) > 0$ for all $x \in \mathcal{X}$. Define
\[
D(p_X, q_X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{p_X(x)}{q_X(x)}.
\]
Using Jensen's inequality on an appropriate function, show that $D(p_X, q_X) \ge 0$, with
equality if and only if $p_X(x) = q_X(x)$ for all $x \in \mathcal{X}$. You may use the fact that for a strictly
convex function, equality holds in Jensen's inequality if and only if all vectors $\vec{v}_i$ with nonzero
weight $a_i$ are equal.
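Remark (optional numerical check): the non-negativity of $D(p_X, q_X)$ can be sanity-checked on random strictly positive pmfs, as sketched below. The helper name `kl_divergence` and the sampling scheme are our own choices; this is not a proof.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p, q) = sum_x p(x) * log2(p(x) / q(x)) for strictly positive pmfs."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))

rng = np.random.default_rng(2)
for _ in range(1000):
    p = rng.uniform(0.01, 1.0, size=5); p /= p.sum()
    q = rng.uniform(0.01, 1.0, size=5); q /= q.sum()
    assert kl_divergence(p, q) >= -1e-12   # non-negativity on every sampled pair
assert abs(kl_divergence(p, p)) < 1e-12    # and D(p, p) = 0
print("D(p, q) >= 0 for all sampled pairs, with equality when the pmfs coincide")
```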
4: Total Variation Distance (10 points)
Let $p_X(x)$ be a pmf on $\mathcal{X} = \{1, 2, \ldots, K\}$. One convenient representation of $p_X(x)$ is as a
vector in $\mathbb{R}^K$,
\[
\vec{p} = \begin{pmatrix} p_1 \\ \vdots \\ p_K \end{pmatrix}, \tag{4.1}
\]
where $p_i := p_X(i)$. The conditions that $p_X(x)$ is nonnegative and sums to 1 specify constraints
on the possible values of $p_X(i)$ for $i \in \mathcal{X}$.
(a) (5 points) Show that if the vector representation in (4.1) is used, these constraints
define a convex polytope $P$ in $\mathbb{R}^K$. Plot this polytope for $K = 3$.
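Remark (optional numerical check): the sketch below samples pairs of vectors satisfying the pmf constraints and checks that their convex combinations still satisfy them, illustrating (but not proving) that the constraint set is convex. The choice $K = 3$ and the sampling scheme are our own.

```python
import numpy as np

rng = np.random.default_rng(3)
K = 3
for _ in range(1000):
    # two random points satisfying the constraints: nonnegative entries summing to 1
    p = rng.uniform(size=K); p /= p.sum()
    q = rng.uniform(size=K); q /= q.sum()
    t = rng.uniform()
    r = t * p + (1 - t) * q
    assert np.all(r >= 0) and abs(r.sum() - 1) < 1e-12   # still satisfies the constraints
print("convex combinations of valid pmf vectors remain valid pmf vectors")
```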
(b) (5 points) Now, let $\vec{p}$ and $\vec{q}$ be two vectors in the polytope $P$. A commonly used
"measure of similarity" between two distributions is the total variation distance, defined
as
\[
D_{TV}(\vec{p}, \vec{q}) := \frac{1}{2} \sum_{i=1}^{K} |p_i - q_i|.
\]
$D_{TV}$ defines a distance on $P$. Hence, for a given distribution $\vec{q}$, it is often desirable to
minimize the function
\[
f(\vec{p}) := D_{TV}(\vec{p}, \vec{q})
\]
to find the distribution $\vec{p}$ that is most similar to $\vec{q}$. Show that the function $f(\vec{p})$ is
convex over $P$.
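Remark (optional numerical check): the convexity inequality for $f(\vec{p}) = D_{TV}(\vec{p}, \vec{q})$ can be spot-checked on random pmfs, as sketched below. The dimension $K = 5$, the reference pmf, and the helper `tv` are our own choices; this illustrates the claim without proving it.

```python
import numpy as np

def tv(p, q):
    """Total variation distance: 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

rng = np.random.default_rng(4)
K = 5
q_ref = rng.uniform(size=K); q_ref /= q_ref.sum()     # the fixed reference pmf
for _ in range(1000):
    p1 = rng.uniform(size=K); p1 /= p1.sum()
    p2 = rng.uniform(size=K); p2 /= p2.sum()
    t = rng.uniform()
    lhs = tv(t * p1 + (1 - t) * p2, q_ref)
    rhs = t * tv(p1, q_ref) + (1 - t) * tv(p2, q_ref)
    assert lhs <= rhs + 1e-12                          # the convexity inequality
print("f(p) = D_TV(p, q) satisfies the convexity inequality at all sampled points")
```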
5: Aggregating Predictions (10 points)
Suppose you wish to predict the temperature tomorrow, $Y$, by aggregating the predictions
$\vec{X} = (X_1, X_2, \ldots, X_n)^T$ from $n$ weather experts in some linear combination.
From previous observation, we know that the n predictions might not be of the same
accuracy, so a simple mean would not be ideal. Even ignoring this, some predictions might
be highly correlated with each other – if we have n = 3 expert predictions of the same
quality, but the first two experts nearly always give the same prediction, then we effectively
have only 2 predictions, so a simple mean over the 3 experts would give undue weight to the
common factor shared by the first two experts.
Formally, we model the expert predictions $X_1, \ldots, X_n$ and the temperature tomorrow
$Y$ as random variables, and define the quality of a prediction $X$ as $\mathrm{Corr}[X, Y]$. Given the
qualities of the $n$ predictions as well as the correlations between the $n$ predictions, we seek a
linear combination of the predictions ($\vec{v}^T \vec{X}$ for some fixed weights $\vec{v}$) such that our resulting
prediction has the highest possible quality. We approach this problem in several steps:
(a) (3 points) We first establish the following useful result. Let $Q$ be an $n \times n$ positive
definite symmetric matrix, and let $\vec{u}$ be a nonzero $n$-dimensional vector. Show that,
subject to the constraint $\vec{u}^T \vec{v} = 1$, the value of $\vec{v}^T Q \vec{v}$ is minimized exactly when
\[
\vec{v} = \frac{Q^{-1} \vec{u}}{\vec{u}^T Q^{-1} \vec{u}}.
\]
Take care to show that this value of $\vec{v}$ is indeed the unique minimizer.
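Remark (optional numerical check): the sketch below builds a random positive definite $Q$ and a random $\vec{u}$, forms the candidate $\vec{v} = Q^{-1}\vec{u} / (\vec{u}^T Q^{-1} \vec{u})$, and checks that no sampled feasible perturbation achieves a smaller value of $\vec{v}^T Q \vec{v}$. The matrix construction and sample sizes are arbitrary; this is an illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.normal(size=(n, n))
Q = A.T @ A + n * np.eye(n)              # a random symmetric positive definite matrix
u = rng.normal(size=n)                   # a (generically) nonzero vector

v_star = np.linalg.solve(Q, u)
v_star /= u @ v_star                     # candidate minimizer Q^{-1}u / (u^T Q^{-1} u)
assert abs(u @ v_star - 1) < 1e-10       # it satisfies the constraint u^T v = 1

best = v_star @ Q @ v_star
for _ in range(2000):
    w = rng.normal(size=n)
    w -= (u @ w) / (u @ u) * u           # make w orthogonal to u ...
    v = v_star + w                       # ... so v still satisfies u^T v = 1
    assert v @ Q @ v >= best - 1e-10     # no sampled feasible point does better
print("no sampled feasible v beats the candidate minimizer")
```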
(b) (2 points) For convenience, assume the values of the temperature and the predictions
are normalized, i.e. $X_1, X_2, \ldots, X_n, Y$ all have mean 0 and variance 1. Let $\Sigma$ be the
covariance matrix of $\vec{X}$, assuming it is invertible, and let $\vec{q} = (q_1, q_2, \ldots, q_n)^T$ be the
vector of qualities of the $n$ predictions. Show that the quality of our prediction $\vec{v}^T \vec{X}$ satisfies
\[
\mathrm{Corr}[\vec{v}^T \vec{X}, Y] = \frac{\vec{q}^{\,T} \vec{v}}{\sqrt{\vec{v}^{\,T} \Sigma \vec{v}}}.
\]
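Remark (optional numerical check): the identity above can be sanity-checked by simulation. The joint covariance below (a specific $\Sigma$ and $\vec{q}$) is invented purely for illustration; any positive definite choice with unit variances would do.

```python
import numpy as np

rng = np.random.default_rng(6)
# invented joint covariance of (X1, X2, X3, Y), all standardized to variance 1
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.1],
                  [0.0, 0.1, 1.0]])             # Cov[X]
q = np.array([0.6, 0.3, 0.2])                   # q_i = Corr[X_i, Y]
C = np.block([[Sigma, q[:, None]], [q[None, :], np.ones((1, 1))]])
assert np.all(np.linalg.eigvalsh(C) > 0)        # the joint covariance is valid

samples = rng.multivariate_normal(np.zeros(4), C, size=1_000_000)
X, Y = samples[:, :3], samples[:, 3]
v = np.array([0.5, 0.2, 0.3])                   # an arbitrary weight vector
print(np.corrcoef(X @ v, Y)[0, 1])              # empirical Corr[v^T X, Y]
print((q @ v) / np.sqrt(v @ Sigma @ v))         # the claimed formula; should match closely
```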
(c) (2 points) Use the result from part (a) to find a weight vector $\vec{v}$ that maximizes
$\mathrm{Corr}[\vec{v}^T \vec{X}, Y]$, the quality of our prediction. In terms of $\vec{q}$ and $\Sigma$, what is this maximal
quality?
Hint: You may need to add an artificial constraint in order to use the result from part
(a). Be sure to explain why adding this constraint does not affect the maximal quality.
(d) For the following parts, suppose that each expert's prediction has the same quality.
(i) (1 point) Suppose $\Sigma = I_n$, i.e. all of the predictions are independent of each
other. Intuitively, the maximal quality is attained by taking the mean of all $n$
predictions. Show that this agrees with your answer in part (c).
(ii) (2 points) Now consider the case mentioned in the preamble, where $n = 3$ and
where $X_1$ and $X_2$ are highly correlated with correlation $1 - \epsilon$, while $X_3$ is independent
of both $X_1$ and $X_2$. (Note that the correlation cannot be 1, as we need $\Sigma$
to be positive definite.) Show that for small $\epsilon > 0$, the optimal value of $\vec{v}$ is close
to a positive multiple of $(1, 1, 2)^T$, so the optimal ratio between the weights of
$X_1$, $X_2$, and $X_3$ is approximately $1 : 1 : 2$.
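Remark (optional numerical check): the claim in (d)(ii) can be cross-checked by maximizing the quality from part (b) numerically, without using the closed form you are asked to derive. The values $\epsilon = 0.01$ and the common quality $0.5$ below are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

eps = 0.01
Sigma = np.array([[1.0, 1 - eps, 0.0],
                  [1 - eps, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])             # X1, X2 nearly identical, X3 independent
q = np.full(3, 0.5)                             # equal qualities (value chosen arbitrarily)

def neg_quality(v):
    return -(q @ v) / np.sqrt(v @ Sigma @ v)    # negative of Corr[v^T X, Y] from part (b)

res = minimize(neg_quality, x0=np.ones(3), method="Nelder-Mead")
print(np.round(res.x / res.x[0], 3))            # weight ratios; close to (1, 1, 2)
```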
6: Letter Grading and Entropy (15 points)
Suppose you are a teacher assigning the final letter grades for a class. The letter grades A,
B, C, D, and F correspond to 5, 4, 3, 2, and 0 points on the GPA scale, respectively. As such, we view this
as assigning a score from the set $\mathcal{X} = \{0, 2, 3, 4, 5\}$ to each student. Suppose you also want
the mean score among your students to be exactly $\mu$ for some fixed constant $0 < \mu < 5$.
(a) To make the grades more meaningful, you wish to choose a distribution of grades that
minimizes the chance of two students getting the same grade; formally, you wish to
choose a probability mass function (pmf) $p$ defined on $\mathcal{X}$ such that if $X$ and $Y$
are independent random variables with pmf $p$, then $P[X = Y]$ is minimized.
(i) (4 points) Formulate this problem as a convex optimization problem and show
that any solution $p$ of this optimization problem is unique.
(ii) (4 points) If $p$ is a solution of this optimization problem, show that there exist
constant values $a, b$ such that for any $x \in \mathcal{X}$,
\[
p(x) = \max(ax + b, 0).
\]
Hint: a common mistake is to show only that $p(x) \in \{ax + b, 0\}$ for each $x$ instead of
$p(x) = \max(ax + b, 0)$. Make sure you prove the latter.
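Remark (optional numerical check): the optimization in (a) can be solved numerically for a concrete mean and the output compared with the claimed $\max(ax + b, 0)$ shape. The target mean $\mu = 4.2$ and the solver settings below are our own choices, and this in no way replaces the proof.

```python
import numpy as np
from scipy.optimize import minimize

scores = np.array([0, 2, 3, 4, 5], dtype=float)
mu = 4.2                                        # an arbitrary target mean in (0, 5)

# minimize P[X = Y] = sum_x p(x)^2 over pmfs on `scores` with mean mu
cons = [{"type": "eq", "fun": lambda p: p.sum() - 1},
        {"type": "eq", "fun": lambda p: p @ scores - mu}]
res = minimize(lambda p: float(np.sum(p ** 2)), x0=np.full(5, 0.2),
               bounds=[(0, 1)] * 5, constraints=cons, method="SLSQP")
print(np.round(res.x, 4))
# the positive entries should lie on a single affine line a*x + b, and the
# remaining entries should be clipped to 0, matching p(x) = max(ax + b, 0)
```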
(b) Consider the following alternative approach. The "amount of information" contained
in a distribution with pmf $p$ can be characterized by its entropy, which is defined as
\[
H(p) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x).
\]
Instead of minimizing the probability of students getting the same score, you now wish
to maximize the entropy of the score distribution, subject to the constraint on the
mean score.
(i) (5 points) Show that any solution $p$ of this optimization problem is unique, and
that given $p$ there exist constant values $k, r$ such that for any $x \in \mathcal{X}$,
\[
p(x) = k r^x.
\]
If $\mu = \frac{256}{61}$, what is the proportion of students in your class that should be awarded
each letter grade? (You may assume the above formula if you did not prove it.)
(ii) (2 points) Now suppose you may assign scores that are any non-negative integer, so
$\mathcal{X} = \{0, 1, 2, \ldots\}$, but you still wish to have a mean score of $\mu$ and to maximize
the entropy. You may note without proof that the derivation in part (i) does not
involve the contents of the set $\mathcal{X}$, so the results of part (i) still hold.
Express the optimal $p(x)$ in terms of $x$ and $\mu$.
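Remark (optional numerical check): the constrained entropy maximization in (b)(i) can also be solved numerically and compared with your hand-derived answer; note that $p(x) = k r^x$ is equivalent to $\log p(x)$ being affine in $x$, which the last lines below measure. The solver settings are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

scores = np.array([0, 2, 3, 4, 5], dtype=float)
mu = 256 / 61                                   # the mean from part (b)(i)

cons = [{"type": "eq", "fun": lambda p: p.sum() - 1},
        {"type": "eq", "fun": lambda p: p @ scores - mu}]
res = minimize(lambda p: float(np.sum(p * np.log2(p))),   # this equals -H(p)
               x0=np.full(5, 0.2), bounds=[(1e-9, 1)] * 5,
               constraints=cons, method="SLSQP")
p = res.x
print(np.round(p, 4))                           # compare with your derived pmf

# p(x) = k * r^x means log p(x) is affine in x; measure the deviation from a line
line = np.polyfit(scores, np.log(p), 1)
print("deviation from affine fit:", np.max(np.abs(np.polyval(line, scores) - np.log(p))))
```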
7: Normal Distribution and Entropy (20 points)
Let $\mathcal{X}$ be a finite subset of $\mathbb{R}$. Consider the probability mass function (pmf) $p$ of an
$n$-dimensional random vector $\vec{X}$ taking values in $\mathcal{X}^n$. As above, define its entropy as
\[
H(p) = -\sum_{\vec{x} \in \mathcal{X}^n} p(\vec{x}) \log_2 p(\vec{x}).
\]
Moreover, define the first moment (mean) of a random vector $\vec{X}$ to be $E[\vec{X}]$, and define
the second moment of a random vector $\vec{X}$ to be $E[\vec{X}\vec{X}^T]$. We wish to find the pmf $p$ of
a random vector $\vec{X}$ with the maximum entropy, where the first and second moments of $\vec{X}$
are constrained to be exactly $\vec{\mu}$ and $M$, for some fixed $n$-dimensional vector $\vec{\mu}$ and fixed
symmetric positive definite $n \times n$ matrix $M$.
(a) (10 points) Express the Lagrangian $L$ in terms of $p$, $\vec{\mu}$, $M$, $\mathcal{X}^n$, and any Lagrange
multipliers. For full credit (and in order to do part (b)), stack the Lagrange multipliers
in vector and matrix form to avoid summations over individual multipliers. How many
Lagrange multipliers are required?
Hint: recall that we assign a Lagrange multiplier to each individual constraint. How
many constraints are we in effect stating with $E[\vec{X}] = \vec{\mu}$ or $E[\vec{X}\vec{X}^T] = M$?
(b) (5 points) Show that any solution $p$ of this optimization problem is unique, and that
given $p$ there exist some constant scalar $k$, constant vector $\vec{m}$, and constant matrix $\Lambda$
such that for all $\vec{x}$ in $\mathcal{X}^n$,
\[
p(\vec{x}) = k \exp(\vec{m}^T \vec{x} + \vec{x}^T \Lambda \vec{x}).
\]
(c) (5 points) Suppose we extend the optimization problem to random vectors over $\mathbb{R}^n$,
i.e. taking $\mathcal{X} = \mathbb{R}$. Let $p$ be a solution to this optimization problem. You may assume
the formula in (b) still holds for this $p$; note that it resembles a multivariate Gaussian
distribution (how?). This shows that a multivariate Gaussian distribution is in fact
the unique maximizer of entropy when the first and second moments are fixed.
For this $p$, express $k$, $\vec{m}$, and $\Lambda$ from the formula in part (b) in terms of $\vec{\mu}$, $M$, and $n$.
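Remark (optional numerical check): the one-dimensional case ($n = 1$) of this problem can be explored numerically on a finite grid: maximize the entropy subject to fixed first and second moments and check that $\log p(x)$ is approximately quadratic in $x$, as the formula in (b) predicts. The grid and moment values below are our own choices; this is an illustration, not a derivation.

```python
import numpy as np
from scipy.optimize import minimize

# one-dimensional illustration (n = 1) on a finite grid
xs = np.linspace(-3, 3, 31)
mu, M = 0.3, 1.2                                # target E[X] and E[X^2] (with M > mu^2)

cons = [{"type": "eq", "fun": lambda p: p.sum() - 1},
        {"type": "eq", "fun": lambda p: p @ xs - mu},
        {"type": "eq", "fun": lambda p: p @ xs ** 2 - M}]
res = minimize(lambda p: float(np.sum(p * np.log2(p))),   # this equals -H(p)
               x0=np.full(xs.size, 1 / xs.size),
               bounds=[(1e-12, 1)] * xs.size, constraints=cons, method="SLSQP")
p = res.x

# the formula in (b) predicts log p(x) = log k + m*x + L*x^2, a quadratic in x
quad = np.polyfit(xs, np.log(p), 2)
print("deviation from quadratic fit:", np.max(np.abs(np.polyval(quad, xs) - np.log(p))))
```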