18.657 PS 2 SOLUTIONS 1. Problem 1 (1) Let Y1 , . . . , Yn be ghost copies of X1 , . . . , Xn . Then we have # n n # " " 1 X 1 X σi [f (Xi ) − E[f (X)]] = E sup E sup σi [f (Xi ) − E[f (Yi )]] f ∈F n i=1 f ∈F n i=1 # " n 1 X σi [f (Xi ) − f (Yi )] , ≤ E sup n f ∈F i=1 by Jensen’s inequality, n # 1 X [f (Xi ) − f (Yi )] , = E sup f ∈F n i=1 " as σi [f (Xi ) − f (Yi )] and f (Xi ) − f (Yi ) have the same distribution, # " n 1 X = E sup [(f (Xi ) − E[f (Xi )]) − (f (Yi ) − E[f (Yi )])] f ∈F n i=1 # n " 1 X [f (Xi ) − E[f (Xi )]] , ≤ 2E sup n f ∈F i=1 by the triangle inequality. (2) The distribution of gi is the same as that of |gi |σi , so we can write " # " " ## n n X X E sup gi f (Xi ) = EXi ,σi Egi sup |gi |σi f (Xi ) | Xi , σi f ∈F i=1 " ≥ EXi ,σi sup f ∈F i=1 n X f ∈F i=1 E[|gi |]σi f (Xi ) , by Jensen’s inequality, r = " # n X 2 E sup σi f (Xi ) , π f ∈F i=1 using the first absolute moment of the standard Gaussian. Date: Fall 2015. 1 # 2 18.657 PS 2 SOLUTIONS (3) We compute: # n 1 X Rn (F + h) = E sup σi (f (Xi ) + h(Xi )) n f ∈F # n i=1 " # " n 1 X 1 X h(Xi ) , ≤ E sup σi f (Xi ) + E n f ∈F n " i=1 i=1 by the triangle inequality, # n 1 X h(Xi ) , = Rn (F) + E n i=1 v u !2 n u 1u X σi h(Xi ) , ≤ Rn (F) + tE n i=1 " by Jensen’s inequality, v uX n X 1u = Rn (F) + t E[h(Xi )2 ] + 2 E[σi σj h(Xi )h(Xj )] n i=1 i<j v u n u 1 X = Rn (F) + t E[h(Xi )2 ], n i=1 by symmetry, 1p nkhk2∞ n khk∞ = Rn (F) + . n ≤ Rn (F) + (4) By the triangle inequality, we have n k X X 1 Rn (F1 + . . . + Fk ) = E sup σi fj (Xi ) fj ∈Fj n i=1 j=1 # " k n X 1 X ≤ E sup σi fj (Xi ) fj ∈Fj n j=1 = k X j=1 i=1 Rn (Fj ). 18.657 PS 2 SOLUTIONS 3 (5) The supremum over different choices of fj is at least the supremum over a single repeated choice: X k X n 1 Rn (F + . . . + F ) = E sup σi fj (Xi ) | {z } fj ∈F n i=1 j=1 k n k 1 X X σi f (Xi ) ≥ E sup f ∈F n i=1 j=1 n # " 1 X = kE sup σi f (Xi ) f ∈F n i=1 = kRn (F), which provides the reverse bound to that of (4), establishing equality in this case. 4 18.657 PS 2 SOLUTIONS 2. Problem 2 (1) Let A be a maximum 2ε-packing of (T, d); let B be a minimum ε-covering. As B is an εcovering, for each a ∈ A there exists some choice of b(a) ∈ B such that d(a, b(a)) ≤ ε. This map b : A → B is an injection: by the triangle inequality, no point in B can be within ε of two points of A, as these are at least 2ε apart. Thus |A| ≤ |B|, so that D(T, d, 2ε) ≤ N (T, d, ε). Let C be maximum ε-packing. Note that C is in fact an ε-covering: if some point of T had distance greater than ε to each point of C, then we could add it to C to produce a larger ε-packing, contradicting maximality. It follows that N (T, d, ε) ≤ D(T, d, ε). (2) Let B n denote the Euclidean ball in Rn . Let u ∈ B n and v ∈ B m . Then we can write u = u∗ + εu , v = v∗ + εv , where u∗ ∈ Nn , v∗ ∈ Nm , εu ∈ 14 B n , and εv ∈ 14 B m . We compute: kM k = u> M v sup u∈B n ,v∈B m = sup u∗ ,v∗ ,εu ,εv > u∗> M v∗ + ε> u M v∗ + u M εv ≤ sup u∗ ∈Nn ,v∗ ∈Nm ≤ max u∗ ∈Nn ,v∗ ∈Nm u∗> M v∗ u> ∗ M v∗ ! + sup εu ∈ 14 B n ,v∗ ∈Nm ε> u M v∗ ! + sup > u M εv u∈B n ,εv ∈ 14 B m 1 1 + kM k + kM k. 4 4 Rearranging, we have kM k ≤ 2 max u∗ ∈Nn ,v∗ ∈Nm u> ∗ M v∗ as desired. (3) By independence and Hoeffding’s lemma, we have, for all s > 0, Y E[exp(s u> M v)] = E[exp(s ui Mij vj )] i,j Y 1 2 2 2 ≤ E exp s ui vj 2 i,j X 1 = E exp s2 u2 v 2 2 i,j i j ≤ exp(s2 /2), whenever u ∈ B n and v ∈ B m . d Recall from the notes that B d has covering number at most 3ε , so that we are maximizing over |Nn × Nm | ≤ 12n+m points. It follows from the standard maximal inequality for subgaussian random variables, or else by explicitly using Jensen with log and exp and replacing a maximum by a sum, that p √ √ EkM k ≤ 2 2 log(12m+n ) ≤ C( m + n). 18.657 PS 2 SOLUTIONS 5 3. Problem 3 1 (1) Let A be an ε-net for [0, 1]; we can construct one with size at most 2ε + 1. Let Xpre denote the set of non-decreasing functions from {x1 , . . . , xn , } to A, and let X be the set of functions [0, 1] → [0, 1] defined by piecewise linear extension of functions in Xpre (and constant extension x on [0, x1 ] and [xn , 1]). It is straightforward to see that X is an ε-net for (F, d∞ ): given f ∈ F , we define g ∈ Xpre by taking g(xi ) to be the least point of A lying within ε of f (xi ), and then the function in X defined by extension of g is within ε of f . 1 It remains to count X. Let a1 < . . . < ak be the elements of A, where k ≤ 2ε +1. Functions f ∈ X are uniquely defined by the count for each 1 ≤ i ≤ k of how many xj satisfy f (xj ) = ai . A naive count of the possibilities for these values yields N (F, dx∞ , ε) ≤ |X| ≤ (n + 1)1+1/2ε ≤ n2ε , valid for n ≥ 3. (2) As N (F, dx2 , ε) ≤ N (F, dx∞ , ε), and |f | ≤ 1 for all f ∈ F , the chaining bound yields Z 1q 12 Rn ≤ inf 4ε + √ log N (F, dx2 , t) dt ε>0 n ε Z 1p 12 ≤ inf 4ε + √ t−1 log n dt ε>0 n ε r √ log n = inf 4ε + 24 (1 − ε) ε>0 n r √ log n ≤ lim 4ε + 24 (1 − ε) ε→0 n r log n = 24 . n (3) We bound N (F, dx1 , ε) ≤ N (F, dx∞ , ε) ≤ n2/ε . The theorem in Section 5.2.1 yields r 2 log(2N (F, dx1 , ε)) Rn ≤ inf ε + ε n r −1 2ε log n + 2 log 2 ≤ inf ε + . ε n Setting the two terms equal, to optimize over ε (in asymptotics), we have r r 2 log n + 2ε log 2 2 log n 3/2 ε = ≈ , n n for a bound of 1/3 log n Rn . , n which is strictly weaker than the chaining bound. 6 18.657 PS 2 SOLUTIONS 4. Problem 4 (1) We adapt the proof from class to the case of a penalized, rather than constrained, norm. ¯ n be the span of the functions k(xi , −). We can decompose any function g uniquely Let W ¯ n and g ⊥ ⊥ W ¯ n . As in class, g ⊥ (xi ) = hg ⊥ , k(xi , −)i = 0, so as g = gn + g ⊥ , where gn ∈ W that g(xi ) = gn (xi ). Plugging in to the objective function: ψ(Y − g) + µkgk2W = ψ(Y − gn ) + µkgn k2W + µkg ⊥ k2W . For any fixed gn , this is a constant plus µkg ⊥ k2W , which is minimized uniquely at g ⊥ = 0. ¯ n , as desired. Thus (unfixing gn ) any minimizer fˆ must lie in W Pn > ˆ ˆ ˆ2 ¯ ^ (2) As f ∈ Wn , write f = i=1 θi k(xi , −), so that f = Kθ and kf kW = θ Kθ. Then the first-order optimality conditions read 0 = ∇ φ(Y − ^ f) + µkfˆk2W = ∇ Y> Σ−1/2 Y − 2Y> Σ−1/2 Kθ + θ> KΣ−1/2 Kθ + µθ> Kθ = −2KΣ−1/2 Y + 2KΣ−1/2 Kθ + 2µKθ. Rearranging, and recalling that Kθ = ^ f, we have (KΣ−1/2 + µIn )^ f = KΣ−1/2 Y, as desired. (3) As above, it suffices to consider f , g ∈ W̄n , so that we can write f = Kθf , g = Kθg . From the definition of fˆ, we have ψ(f + ξ − ^ f) + µkfˆk2 ≤ ψ(f + ξ − g) + µkgk2 . W W Separating out the contribution of ξ, we obtain ψ(f − ^ f) + φ(ξ) + ξ > Σ−1/2 (f − ^ f) + µkfˆk2W ≤ ψ(f − g) + ψ(ξ) + ξ > Σ−1/2 (f − g) + µkgk2W . Rearranging, ψ(f − ^ f) ≤ −ξ > Σ−1/2 (f − ^ f) − µkfˆk2W + ψ(f − g) + ξ > Σ−1/2 (f − g) + µkgk2W ^ − g) − µkfˆk2W − µkgk2W = ψ(f − g) + 2µkgk2W + ξ > Σ−1/2 (f ^ − g) − µ kfˆ − gk2W , ≤ ψ(f − g) + 2µkgk2W + ξ > Σ−1/2 (f 2 2 2 2 by the inequality ka − bkW ≤ 2kakW + 2kbkW . Let Z = Σ−1/2 ξ, which is distributed as N (0, I). Continuing the line of algebra, µ ψ(f − ^ f) ≤ ψ(f − g) + 2µkgk2W + Z > (^ f − g) − kfˆ − gk2W 2 µ 2 > = ψ(f − g) + 2µkgkW + Z K(θfˆ − θg ) − (θfˆ − θg )> K(θfˆ − θg ) 2 1 > 2 ≤ ψ(f − g) + 2µkgkW + Z KZ, µ by the inequality µ 1 a> P b ≤ a> P a + b> P b 2 µ for any PSD matrix P . Now since n n X X Z > KZ = Zi hk(xi , −), k(xj , −)iW = k Zi k(xi , −)k2W , i,j =1 i=1 ¯ n , which equals the infimum over g ∈ W . we conclude by taking the infimum over g ∈ W 18.657 PS 2 SOLUTIONS 7 (4) This follows from the previous part together with the observation 2 n n X X X E[Zi2 ]Kii = Tr(K). Zi k(xi , −) = E Zi Zj Kij = E i=1 W i,j i=1 ¯ n , since only its evaluations at the design points (5) It is sufficient to prove the bound for f ∈ W matter. For the Gaussian kernel we have k(x, x) = 1, so that Tr(K) = n. Applying the previous part, and taking the case g = f as an upper bound on the minimizer, we have ^)] ≤ ψ(f − f) + 2µkf k2W + n = 2µkf k2W + n . E[ψ(f − f µ µ p Taking µ = kf kW n/2 gives the result. MIT OpenCourseWare http://ocw.mit.edu 18.657 Mathematics of Machine Learning Fall 2015 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.