Tsinghua-Berkeley Shenzhen Institute
Learning from Data, Fall 2019
Written Assignment 1
Issued: Sunday 29th September, 2019
Due: Sunday 13th October, 2019

1.1. A data set consists of $m$ data pairs $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$, where $x \in \mathbb{R}^n$ is the independent variable and $y \in \{1, \ldots, k\}$ is the dependent variable. The conditional probability $P_{y|x}(y|x)$ (the notation $P_{y|x}(y|x)$ stands for $P(y = y \mid x = x)$, i.e., the conditional probability of $y = y$ given $x = x$) estimated by softmax regression is
\[
P_{y|x}(l|x) = \frac{\exp(\theta_l^T x + b_l)}{\sum_{j=1}^{k} \exp(\theta_j^T x + b_j)}, \qquad l = 1, \ldots, k,
\]
where $\theta_1, \ldots, \theta_k \in \mathbb{R}^n$ and $b_1, \ldots, b_k \in \mathbb{R}$ are the parameters of softmax regression. The term $b_i$ is called a bias term. The log-likelihood of the softmax regression model is
\[
\ell = \sum_{i=1}^{m} \log P_{y|x}(y^{(i)}|x^{(i)}).
\]

(a) Evaluate $\dfrac{\partial \ell}{\partial b_l}$.

The data set can be described by its empirical distribution $\hat{P}_{x,y}(x, y)$, defined as
\[
\hat{P}_{x,y}(x, y) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{x^{(i)} = x,\, y^{(i)} = y\},
\]
where $\mathbf{1}\{\cdot\}$ is the indicator function. Similarly, the empirical marginal distributions of this data set are
\[
\hat{P}_x(x) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{x^{(i)} = x\}, \qquad
\hat{P}_y(y) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = y\}.
\]

(b) Suppose we have set the biases $(b_1, \ldots, b_k)$ to their optimal values. Prove that
\[
\hat{P}_y(l) = \sum_{x \in \mathcal{X}} P_{y|x}(l|x)\, \hat{P}_x(x),
\]
where $\mathcal{X} = \{x^{(i)} : i = 1, \ldots, m\}$ is the set of all samples of $x$.
Hint: Optimality implies $\dfrac{\partial \ell}{\partial b_1} = \dfrac{\partial \ell}{\partial b_2} = \cdots = \dfrac{\partial \ell}{\partial b_k} = 0$.

1.2. We have discussed maximum likelihood estimation (MLE) in class.

(a) Let us apply it to a very simple case. Suppose we have $n$ samples $(x_1, x_2, \ldots, x_n)$ drawn independently from a normal distribution with known variance $\sigma^2$ and unknown mean $\mu$,
\[
P(x_i \mid \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right).
\]
Find the MLE for $\mu$.

(b) Finding a satisfying loss function is always a challenge, and MLE is not always the best choice. Put simply, MLE maximizes the probability of the observed events under the parameter we want to estimate. But think about it: the $n$ samples are already known. Why should we treat them as events to be explained rather than as conditions we have observed? Based on Bayes' rule, we can derive another method, called maximum a posteriori (MAP) estimation:
\[
P(\mu \mid x_1, x_2, \ldots, x_n) = \frac{P(x_1, x_2, \ldots, x_n \mid \mu)\, P(\mu)}{P(x_1, x_2, \ldots, x_n)}.
\]
MAP requires an assumption on the prior $P(\mu)$. If the prior on $\mu$ is flat (i.e., $\mu$ is "purely at random"), there is no difference between MLE and MAP. Here, suppose
\[
P(\mu) = \frac{1}{\sqrt{2\pi\theta^2}} \exp\left(-\frac{(\mu - \nu)^2}{2\theta^2}\right);
\]
carry out the calculation to find the MAP estimator for $\mu$. Also, compare the MLE and MAP estimators when $n$ is very large.

1.3. In this problem we consider multivariate least squares. A data set consists of $m$ data pairs $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$, where $x \in \mathbb{R}^n$ is the independent variable and $y \in \mathbb{R}^l$ is the dependent variable. Denote the design matrix by $X \triangleq [x^{(1)}, \ldots, x^{(m)}]^T$ and let $Y \triangleq [y^{(1)}, \ldots, y^{(m)}]^T$. Write down the square loss $J(\Theta)$, where $\Theta \in \mathbb{R}^{n \times l}$ is the parameter matrix you want to find, and then compute the solution.
Hint: Hopefully you can write down the square loss without confusion. Just in case, we write it as
\[
J(\Theta) = \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{l} \left( (\Theta^T x^{(i)})_j - y_j^{(i)} \right)^2.
\]
You can take it from there.

1.4. The multivariate normal distribution can be written as
\[
P_y(y; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu)\right),
\]
where $\mu$ and $\Sigma$ are the parameters. Show that the family of multivariate normal distributions is an exponential family, and find the corresponding $\eta$, $b(y)$, $T(y)$, and $a(\eta)$.
Hints: The parameter $\eta$ and the statistic $T(y)$ are not limited to being vectors; they can also be matrices. In that case, the Frobenius inner product can be used as the inner product between two matrices, $\langle A, B \rangle = \operatorname{tr}(A^T B)$. The properties of the matrix trace might be useful.
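For reference in Problem 1.4, one standard parametrization of an exponential family (the convention these symbols usually refer to; confirm it matches the one used in class) is
\[
P_y(y; \eta) = b(y)\, \exp\bigl(\eta^T T(y) - a(\eta)\bigr),
\]
where $\eta$ is the natural parameter, $T(y)$ the sufficient statistic, $b(y)$ the base measure, and $a(\eta)$ the log-partition function. When $\eta$ and $T(y)$ are matrices, the product $\eta^T T(y)$ is replaced by the Frobenius inner product mentioned in the hint above.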
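If you want to sanity-check your derivations for Problem 1.1 numerically, the following is a minimal sketch, assuming NumPy and a made-up synthetic data set (the names and sizes m, n, k, X, y, theta, and b are illustrative choices, not part of the assignment), that evaluates the softmax conditional probabilities and the log-likelihood $\ell$ exactly as defined above. A finite-difference check of $\partial \ell / \partial b_l$ against this function is one quick way to verify your answer to part (a).

```python
import numpy as np

# Minimal sketch (not part of the assignment): evaluate the softmax
# conditional probabilities P_{y|x}(l|x) and the log-likelihood from
# Problem 1.1 on a small synthetic data set.  Classes are indexed
# 0, ..., k-1 here instead of 1, ..., k for convenience.
rng = np.random.default_rng(0)
m, n, k = 50, 3, 4                      # samples, features, classes (illustrative)
X = rng.normal(size=(m, n))             # rows are the x^(i)
y = rng.integers(0, k, size=m)          # labels y^(i)
theta = rng.normal(size=(n, k))         # columns are theta_1, ..., theta_k
b = rng.normal(size=k)                  # bias terms b_1, ..., b_k

def softmax_probs(X, theta, b):
    """Row i, column l holds P_{y|x}(l | x^(i)) as defined in Problem 1.1."""
    scores = X @ theta + b                        # shape (m, k)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability; cancels in the ratio
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

P = softmax_probs(X, theta, b)
log_likelihood = np.log(P[np.arange(m), y]).sum()  # ell = sum_i log P(y^(i) | x^(i))
print(log_likelihood)
```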