
Co902 problem sheet, solutions

1. (a) (Markov inequality) Let $X$ be a continuous, non-negative random variable (RV) and $a$ a positive constant. Show that:
\[ P(X \geq a) \leq \frac{E[X]}{a} \]

Solution: Let $p(x)$ denote the pdf of the RV $X$. Then:
\[ E[X] = \int_0^\infty x\,p(x)\,dx = \int_0^a x\,p(x)\,dx + \int_a^\infty x\,p(x)\,dx \]
Since $p(x)$ is a pdf, it is everywhere non-negative, so the first term on the RHS must be non-negative. This means:
\[ E[X] \geq \int_a^\infty x\,p(x)\,dx \]
Since $a$ is the lower limit of the integral above, $x \geq a$ throughout the range of integration, so we can write
\[ \int_a^\infty x\,p(x)\,dx \geq \int_a^\infty a\,p(x)\,dx \]
which gives
\[ E[X] \geq \int_a^\infty a\,p(x)\,dx = a \int_a^\infty p(x)\,dx = a\,P(X \geq a) \]
from which the required result follows.

(b) (Chebyshev inequality) Let $X$ be any RV with mean $\mu_X$ and whose variance $\sigma_X^2$ exists. Show that for any positive constant $a$:
\[ P(|X - \mu_X| \geq a) \leq \frac{\sigma_X^2}{a^2} \]

Solution: First, note that
\[ P(|X - \mu_X| \geq a) = P((X - \mu_X)^2 \geq a^2) \]
Here, $(X - \mu_X)^2$ is a non-negative RV. Using the Markov inequality, we get:
\[ P((X - \mu_X)^2 \geq a^2) \leq \frac{E[(X - \mu_X)^2]}{a^2} = \frac{\sigma_X^2}{a^2} \]
as required.

(c) An estimator $\hat\theta_n$ of a parameter is said to be consistent if it converges in probability to the true parameter value $\theta$, that is, if:
\[ \forall \epsilon > 0, \quad \lim_{n\to\infty} P(|\hat\theta_n - \theta| \geq \epsilon) = 0 \]
Using the Chebyshev inequality, show (informally) that the following two conditions are sufficient to establish consistency:
\[ E[\hat\theta_n] = \theta \quad \text{(unbiased)} \]
\[ \lim_{n\to\infty} \mathrm{VAR}(\hat\theta_n) = 0 \]

Solution: If $\hat\theta_n$ is unbiased, we can write
\[ P(|\hat\theta_n - \theta| \geq \epsilon) = P(|\hat\theta_n - E[\hat\theta_n]| \geq \epsilon) \]
Applying the Chebyshev inequality to the RHS, we get:
\[ P(|\hat\theta_n - E[\hat\theta_n]| \geq \epsilon) \leq \frac{\mathrm{VAR}(\hat\theta_n)}{\epsilon^2} \]
From the RHS above we can see that if $\lim_{n\to\infty} \mathrm{VAR}(\hat\theta_n) = 0$, the estimator converges in probability to $\theta$, that is, it is consistent.

(d) (Weak Law of Large Numbers) Let $X_1, X_2, \ldots, X_n$ be a set of independently and identically distributed RVs (a random sample) with $E[X_i] = \mu_X$ and $\mathrm{VAR}(X_i) = \sigma_X^2 < \infty$. Use the Chebyshev inequality to show that the sample mean
\[ \bar X = \frac{1}{n} \sum_{i=1}^n X_i \]
converges in probability to the true mean $\mu_X$.
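Both bounds from (a) and (b) are easy to check empirically. A minimal numerical sketch (not part of the required solution), using NumPy and an exponential sample chosen arbitrarily as a non-negative test distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw a large sample from an Exponential(1) distribution: it is
# non-negative (so the Markov inequality applies) with mean 1, variance 1.
x = rng.exponential(scale=1.0, size=1_000_000)
a = 3.0

# Markov: P(X >= a) <= E[X] / a
markov_bound = x.mean() / a
p_tail = (x >= a).mean()
assert p_tail <= markov_bound

# Chebyshev: P(|X - mu| >= a) <= Var(X) / a^2
cheb_bound = x.var() / a**2
p_dev = (np.abs(x - x.mean()) >= a).mean()
assert p_dev <= cheb_bound

print(f"Markov:    empirical {p_tail:.4f} <= bound {markov_bound:.4f}")
print(f"Chebyshev: empirical {p_dev:.4f} <= bound {cheb_bound:.4f}")
```

Note that both bounds hold but are loose: they use only the mean (or mean and variance), not the full shape of the distribution.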
Solution: Let $\bar X_n$ denote the sample mean derived from $n$ observations. This is easily shown to be unbiased. Using the Chebyshev inequality:
\[ P(|\bar X_n - \mu_X| \geq \epsilon) \leq \frac{\mathrm{VAR}(\bar X_n)}{\epsilon^2} \]
But:
\[ \mathrm{VAR}(\bar X_n) = \mathrm{VAR}\left( \frac{1}{n}(X_1 + \ldots + X_n) \right) = \frac{\sigma_X^2}{n} \]
Therefore
\[ P(|\bar X_n - \mu_X| \geq \epsilon) \leq \frac{\sigma_X^2}{n\epsilon^2} \]
and
\[ \lim_{n\to\infty} P(|\bar X_n - \mu_X| \geq \epsilon) = 0 \]
which means $\bar X_n$ converges in probability to the true mean $\mu_X$, as required.

2. $X_1 \ldots X_n$, $X_i \in \mathbb{R}^d$, are independently and identically distributed multivariate Normal random vectors, each having pdf:
\[ p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)} \]
(a) Write down the log-likelihood function for this model. (2 marks)
(b) Derive maximum likelihood estimators for the parameters $\mu$ and $\Sigma$. (4 marks)

Solution:

2(a). Log-likelihood:
\[ L(\mu, \Sigma) = -\frac{dn}{2} \log(2\pi) - \frac{n}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^n (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) \]

2(b). We proceed in two stages. We first treat $\Sigma$ as fixed, and maximize $L$ to get a value $\hat\mu(\Sigma)$ which maximizes $L$ for a given matrix parameter $\Sigma$. Taking the derivative of $L$ wrt the vector $\mu$, we get:
\[ \frac{d}{d\mu} L = \sum_{i=1}^n \left( \Sigma^{-1} (X_i - \mu) \right)^T \]
Setting the derivative to zero, taking the transpose of both sides and pre-multiplying by $\Sigma$, we get:
\[ 0 = \sum_{i=1}^n (X_i - \mu) \]
Solving for $\mu$:
\[ \hat\mu(\Sigma) = \frac{1}{n} \sum_{i=1}^n X_i = \bar X \]
Since this solution does not depend on $\Sigma$, $\bar X$ is the maximum likelihood estimator of $\mu$ for any $\Sigma$. To obtain $\hat\Sigma$ we plug $\hat\mu(\Sigma) = \bar X$ into the log-likelihood to obtain
\[ -\frac{dn}{2} \log(2\pi) - \frac{n}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^n (X_i - \bar X)^T \Sigma^{-1} (X_i - \bar X) \tag{1} \]
and maximize this function wrt $\Sigma$. We first introduce a sample covariance matrix $S$ defined as follows:
\[ S = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)^T \]
This allows us to re-write the quadratic form in (1) as a matrix trace:
\[ \sum_{i=1}^n (X_i - \bar X)^T \Sigma^{-1} (X_i - \bar X) = n \, \mathrm{Tr}(\Sigma^{-1} S) \]
where $\mathrm{Tr}(\cdot)$ denotes the trace of its matrix argument.
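The trace rewriting above can be verified numerically; a minimal sketch, using simulated data and an arbitrary positive-definite $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 3

X = rng.standard_normal((n, d))
Xbar = X.mean(axis=0)
centered = X - Xbar

# Sample covariance S with the 1/n normalisation used in the derivation.
S = centered.T @ centered / n

# An arbitrary symmetric positive-definite Sigma for the check.
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)
Sigma_inv = np.linalg.inv(Sigma)

# Left-hand side: the sum of quadratic forms.
quad = sum(c @ Sigma_inv @ c for c in centered)

# Right-hand side: n * Tr(Sigma^{-1} S).
trace_form = n * np.trace(Sigma_inv @ S)

assert np.isclose(quad, trace_form)
```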
This in turn allows us to write the derivative of (1) wrt $\Sigma$ as follows:
\[ -\frac{n}{2} \frac{d}{d\Sigma} \log |\Sigma| - \frac{n}{2} \frac{d}{d\Sigma} \mathrm{Tr}(\Sigma^{-1} S) \tag{2} \]
At this point we make use of two useful matrix derivatives (these can be found in Appendix C of Bishop and the note "Matrix Identities" by Roweis, available on the course website):
\[ \frac{\partial}{\partial A} \log |A| = (A^{-1})^T \]
\[ \frac{\partial}{\partial X} \mathrm{Tr}(X^{-1} A) = -X^{-1} A^T X^{-1} \]
This gives the derivative (2) in the following form (where we make use of the fact that both $\Sigma^{-1}$ and $S$ are symmetric):
\[ -\frac{n}{2} \Sigma^{-1} + \frac{n}{2} \Sigma^{-1} S \Sigma^{-1} \]
Setting to zero and solving, we get:
\[ \hat\Sigma = S = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)^T \]

3. $X_1 \ldots X_n$, $X_i \in \mathbb{R}$, are independently and identically distributed Normal RVs, each having pdf:
\[ p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]
(a) Write down the log-likelihood function for this model. (2 marks)
(b) Derive maximum likelihood estimators for the parameters $\mu$ and $\sigma^2$. (4 marks)

Solution:

3(a). The log-likelihood function for this model is:
\[ L(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2 \]

3(b). We first treat $\sigma^2$ as fixed, and maximize $L$ to get a value $\hat\mu(\sigma^2)$ which maximizes $L$ for a given value $\sigma^2$. Taking the derivative of $L$ wrt $\mu$, setting to zero and solving, we get:
\[ \hat\mu(\sigma^2) = \frac{1}{n} \sum_{i=1}^n X_i = \bar X \]
Since this solution does not depend on $\sigma^2$, $\bar X$ is the maximum likelihood estimator of $\mu$ for any $\sigma^2$. We now plug this estimate into the log-likelihood and maximize the resulting function wrt $\sigma^2$ to get $\hat\sigma^2$. Taking the derivative wrt $\sigma^2$ and setting to zero:
\[ 0 = -\frac{n}{2} \frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (X_i - \bar X)^2 \]
Solving for $\sigma^2$, we get:
\[ \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2 \]
Thus, the desired pair of MLEs is
\[ (\hat\mu, \hat\sigma^2) = \left( \bar X, \; \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2 \right), \qquad \bar X = \frac{1}{n} \sum_{i=1}^n X_i \]

4. Consider a classifier with Bernoulli class-conditional distributions, in which input vectors $X_i \in \{0,1\}^d$ are taken to be i.i.d. given class $Y \in \{0,1\}$ and the inputs are further taken to be mutually independent (Naive Bayes assumption).
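The closed-form estimators from problem 3 can be sanity-checked numerically: evaluating the log-likelihood from 3(a) at $(\hat\mu, \hat\sigma^2)$ and at nearby perturbed parameter values should confirm that the closed-form pair is the maximizer. A minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)
n = x.size

def log_lik(mu, var):
    # Gaussian log-likelihood from 3(a), parameterised by (mu, sigma^2).
    return (
        -0.5 * n * np.log(2 * np.pi)
        - 0.5 * n * np.log(var)
        - ((x - mu) ** 2).sum() / (2 * var)
    )

# Closed-form MLEs: sample mean and the 1/n (biased) sample variance.
mu_hat = x.mean()
var_hat = ((x - mu_hat) ** 2).mean()

best = log_lik(mu_hat, var_hat)
# The MLE should score at least as high as any perturbed parameter pair.
for dmu in (-0.1, 0.0, 0.1):
    for dvar in (-0.1, 0.0, 0.1):
        assert log_lik(mu_hat + dmu, var_hat + dvar) <= best
```

Note that the MLE uses the $1/n$ normalisation, not the unbiased $1/(n-1)$ estimator: maximum likelihood and unbiasedness are distinct criteria.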
That is, if $\theta_{jk}$ is a Bernoulli parameter giving the probability that the $j$th input is 1, given output $Y = k$, the class-conditional distribution is:
\[ P(X_i \mid Y_i = k) = \prod_{j=1}^d \theta_{jk}^{X_{ij}} (1 - \theta_{jk})^{(1 - X_{ij})} \]
(a) Write down the log-likelihood function $L(\theta) = \log P(X_1 \ldots X_n \mid Y_1 \ldots Y_n, \theta)$ for this model. Here, $\theta$ denotes the full set of model parameters. (2 marks)
(b) Derive maximum likelihood estimates (MLEs) for the model parameters. (4 marks)

Solution:

4(a). We have $n$ input vectors $X_i$, each of dimensionality $d$, conditionally independent given the class labels. Let $X_{ij}$ denote the $j$th component of the $i$th input vector. Considering just one input $j$:
\[ P(X_{1j} \ldots X_{nj} \mid Y_1 \ldots Y_n, \theta) = \prod_{i : Y_i = 1} \theta_{j1}^{X_{ij}} (1 - \theta_{j1})^{(1 - X_{ij})} \times \prod_{i : Y_i = 0} \theta_{j0}^{X_{ij}} (1 - \theta_{j0})^{(1 - X_{ij})} \]
\[ = \theta_{j1}^{n_{j1}} (1 - \theta_{j1})^{(n_1 - n_{j1})} \times \theta_{j0}^{n_{j0}} (1 - \theta_{j0})^{(n_0 - n_{j0})} \]
where $n_{jk}$ represents the number of observations in which the $j$th input is 1 when the class label is $k$, that is:
\[ n_{jk} = |\{ i : X_{ij} = 1 \wedge Y_i = k \}| \]
and $n_k$ is the total number of observations with class label $k$. The Naive Bayes assumption means that the probability of the complete input vector is just the product of the probabilities of the individual inputs. This means the overall conditional likelihood is just
\[ P(X_1 \ldots X_n \mid Y_1 \ldots Y_n, \theta) = \prod_{j=1}^d \theta_{j1}^{n_{j1}} (1 - \theta_{j1})^{(n_1 - n_{j1})} \times \theta_{j0}^{n_{j0}} (1 - \theta_{j0})^{(n_0 - n_{j0})} \]
and the corresponding conditional log-likelihood is
\[ L(\theta) = \sum_{j=1}^d \Big[ n_{j1} \log \theta_{j1} + (n_1 - n_{j1}) \log(1 - \theta_{j1}) + n_{j0} \log \theta_{j0} + (n_0 - n_{j0}) \log(1 - \theta_{j0}) \Big] \]

4(b). Taking the derivative of the log-likelihood function wrt $\theta_{j1}$ and setting to zero, we get:
\[ 0 = \frac{n_{j1}}{\theta_{j1}} - \frac{n_1 - n_{j1}}{1 - \theta_{j1}} \]
Solving, we get the MLE $\hat\theta_{j1}$:
\[ \hat\theta_{j1} = \frac{n_{j1}}{n_1} \]
Similarly:
\[ \hat\theta_{j0} = \frac{n_{j0}}{n_0} \]
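The counting estimators $\hat\theta_{jk} = n_{jk}/n_k$ can be illustrated on simulated data. In the sketch below, `theta_true` is a hypothetical set of ground-truth parameters (not from the problem sheet); with enough examples the counts recover it approximately:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 2000, 4

# Hypothetical ground truth: theta_true[j, k] = P(X_j = 1 | Y = k).
theta_true = np.array([[0.2, 0.8],
                       [0.5, 0.1],
                       [0.9, 0.3],
                       [0.4, 0.6]])

# Sample class labels, then inputs from the class-conditional Bernoullis.
y = rng.integers(0, 2, size=n)
X = (rng.random((n, d)) < theta_true[:, y].T).astype(int)

# MLEs from 4(b): theta_hat[j, k] = n_jk / n_k, where n_jk counts examples
# with X_ij = 1 and Y_i = k, and n_k counts examples with Y_i = k.
theta_hat = np.empty((d, 2))
for k in (0, 1):
    mask = (y == k)
    theta_hat[:, k] = X[mask].sum(axis=0) / mask.sum()

# With ~1000 examples per class, each estimate should land within a few
# percent of the corresponding true parameter.
assert np.allclose(theta_hat, theta_true, atol=0.08)
```

The estimator is simply the empirical frequency of each feature within each class, which matches the intuition that the Bernoulli MLE is a relative count.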