# 18.657 PS 2 SOLUTIONS 1. Problem 1 (1) Let Y

advertisement ```18.657 PS 2 SOLUTIONS
1. Problem 1
(1) Let Y1 , . . . , Yn be ghost copies of X1 , . . . , Xn . Then we have
#
n
n
#
&quot;
&quot;
1 X
1 X
σi [f (Xi ) − E[f (X)]] = E sup E sup σi [f (Xi ) − E[f (Yi )]]
f ∈F n i=1
f ∈F n i=1
#
&quot;
n
1 X
σi [f (Xi ) − f (Yi )] ,
≤ E sup n
f ∈F
i=1
by Jensen’s inequality,
n
#
1 X
[f (Xi ) − f (Yi )] ,
= E sup f ∈F n i=1
&quot;
as σi [f (Xi ) − f (Yi )] and f (Xi ) − f (Yi ) have the same distribution,
#
&quot;
n
1 X
= E sup [(f (Xi ) − E[f (Xi )]) − (f (Yi ) − E[f (Yi )])]
f ∈F n i=1
#
n
&quot;
1 X
[f (Xi ) − E[f (Xi )]] ,
≤ 2E sup n
f ∈F
i=1
by the triangle inequality.
(2) The distribution of gi is the same as that of |gi |σi , so we can write
&quot;
#
&quot;
&quot;
##
n
n
X
X
E sup
gi f (Xi ) = EXi ,σi Egi sup
|gi |σi f (Xi ) | Xi , σi
f ∈F i=1
&quot;
≥ EXi ,σi sup
f ∈F i=1
n
X
f ∈F i=1
E[|gi |]σi f (Xi ) ,
by Jensen’s inequality,
r
=
&quot;
#
n
X
2
E sup
σi f (Xi ) ,
π
f ∈F i=1
using the first absolute moment of the standard Gaussian.
Date: Fall 2015.
1
#
2
18.657 PS 2 SOLUTIONS
(3) We compute:
#
n
1 X
Rn (F + h) = E sup σi (f (Xi ) + h(Xi ))
n
f ∈F
#
n
i=1
&quot;
#
&quot;
n
1 X
1 X
h(Xi ) ,
≤ E sup σi f (Xi ) + E
n
f ∈F n &quot;
i=1
i=1
by the triangle inequality,
#
n
1 X
h(Xi ) ,
= Rn (F) + E
n i=1
v 
u
!2 
n
u
1u  X
σi h(Xi ) ,
≤ Rn (F) + tE
n
i=1
&quot;
by Jensen’s inequality,
v
uX
n
X
1u
= Rn (F) + t
E[h(Xi )2 ] + 2
E[σi σj h(Xi )h(Xj )]
n i=1
i&lt;j
v
u n
u
1 X
= Rn (F) + t
E[h(Xi )2 ],
n i=1
by symmetry,
1p
nkhk2∞
n
khk∞
= Rn (F) +
.
n
≤ Rn (F) +
(4) By the triangle inequality, we have

n
k
X
X
1
Rn (F1 + . . . + Fk ) = E  sup
σi
fj (Xi )
fj ∈Fj n i=1
j=1
#
&quot;
k
n
X
1 X
≤
E sup
σi fj (Xi )
fj ∈Fj n 
j=1
=
k
X
j=1
i=1
Rn (Fj ).
18.657 PS 2 SOLUTIONS
3
(5) The supremum over different choices of fj is at least the supremum over a single repeated
choice:


X
k
X
n
1

Rn (F + . . . + F ) = E sup σi
fj (Xi )
|
{z
}
fj ∈F n i=1
j=1
k


n
k
1 X X

σi
f (Xi )
≥ E sup f ∈F n i=1
j=1
n
#
&quot;
1 X
= kE sup σi f (Xi )
f ∈F n i=1
= kRn (F),
which provides the reverse bound to that of (4), establishing equality in this case.
4
18.657 PS 2 SOLUTIONS
2. Problem 2
(1) Let A be a maximum 2ε-packing of (T, d); let B be a minimum ε-covering. As B is an εcovering, for each a ∈ A there exists some choice of b(a) ∈ B such that d(a, b(a)) ≤ ε. This
map b : A → B is an injection: by the triangle inequality, no point in B can be within ε of two
points of A, as these are at least 2ε apart. Thus |A| ≤ |B|, so that D(T, d, 2ε) ≤ N (T, d, ε).
Let C be maximum ε-packing. Note that C is in fact an ε-covering: if some point of T
had distance greater than ε to each point of C, then we could add it to C to produce a larger
ε-packing, contradicting maximality. It follows that N (T, d, ε) ≤ D(T, d, ε).
(2) Let B n denote the Euclidean ball in Rn . Let u ∈ B n and v ∈ B m . Then we can write
u = u∗ + εu , v = v∗ + εv , where u∗ ∈ Nn , v∗ ∈ Nm , εu ∈ 14 B n , and εv ∈ 14 B m . We compute:
kM k =
u&gt; M v
sup
u∈B n ,v∈B m
=
sup
u∗ ,v∗ ,εu ,εv
&gt;
u∗&gt; M v∗ + ε&gt;
u M v∗ + u M εv
≤
sup
u∗ ∈Nn ,v∗ ∈Nm
≤
max
u∗ ∈Nn ,v∗ ∈Nm
u∗&gt; M v∗
u&gt;
∗ M v∗
!
+
sup
εu ∈ 14 B n ,v∗ ∈Nm
ε&gt;
u M v∗
!
+
sup
&gt;
u M εv
u∈B n ,εv ∈ 14 B m
1
1
+ kM k + kM k.
4
4
Rearranging, we have
kM k ≤ 2
max
u∗ ∈Nn ,v∗ ∈Nm
u&gt;
∗ M v∗
as desired.
(3) By independence and Hoeffding’s lemma, we have, for all s &gt; 0,
Y
E[exp(s u&gt; M v)] =
E[exp(s ui Mij vj )]
i,j
Y 1 2 2 2
≤
E exp
s ui vj
2
i,j



X
1
= E exp  s2
u2 v 2 
2 i,j i j
≤ exp(s2 /2),
whenever u ∈ B n and v ∈ B m .
d
Recall from the notes that B d has covering number at most 3ε , so that we are maximizing over |Nn &times; Nm | ≤ 12n+m points. It follows from the standard maximal inequality
for subgaussian random variables, or else by explicitly using Jensen with log and exp and
replacing a maximum by a sum, that
p
√
√
EkM k ≤ 2 2 log(12m+n ) ≤ C( m + n).
18.657 PS 2 SOLUTIONS
5
3. Problem 3
1
(1) Let A be an ε-net for [0, 1]; we can construct one with size at most 2ε
+ 1. Let Xpre denote
the set of non-decreasing functions from {x1 , . . . , xn , } to A, and let X be the set of functions
[0, 1] → [0, 1] defined by piecewise linear extension of functions in Xpre (and constant extension
x
on [0, x1 ] and [xn , 1]). It is straightforward to see that X is an ε-net for (F, d∞
): given f ∈ F ,
we define g ∈ Xpre by taking g(xi ) to be the least point of A lying within ε of f (xi ), and then
the function in X defined by extension of g is within ε of f .
1
It remains to count X. Let a1 &lt; . . . &lt; ak be the elements of A, where k ≤ 2ε
+1. Functions
f ∈ X are uniquely defined by the count for each 1 ≤ i ≤ k of how many xj satisfy f (xj ) = ai .
A naive count of the possibilities for these values yields
N (F, dx∞ , ε) ≤ |X| ≤ (n + 1)1+1/2ε ≤ n2ε ,
valid for n ≥ 3.
(2) As N (F, dx2 , ε) ≤ N (F, dx∞ , ε), and |f | ≤ 1 for all f ∈ F , the chaining bound yields
Z 1q
12
Rn ≤ inf 4ε + √
log N (F, dx2 , t) dt
ε&gt;0
n ε
Z 1p
12
≤ inf 4ε + √
t−1 log n dt
ε&gt;0
n ε
r
√
log n
= inf 4ε + 24
(1 − ε)
ε&gt;0
n
r
√
log n
≤ lim 4ε + 24
(1 − ε)
ε→0
n
r
log n
= 24
.
n
(3) We bound N (F, dx1 , ε) ≤ N (F, dx∞ , ε) ≤ n2/ε . The theorem in Section 5.2.1 yields
r
2 log(2N (F, dx1 , ε))
Rn ≤ inf ε +
ε
n
r
−1
2ε log n + 2 log 2
≤ inf ε +
.
ε
n
Setting the two terms equal, to optimize over ε (in asymptotics), we have
r
r
2 log n + 2ε log 2
2 log n
3/2
ε
=
≈
,
n
n
for a bound of
1/3
log n
Rn .
,
n
which is strictly weaker than the chaining bound.
6
18.657 PS 2 SOLUTIONS
4. Problem 4
(1) We adapt the proof from class to the case of a penalized, rather than constrained, norm.
&macr; n be the span of the functions k(xi , −). We can decompose any function g uniquely
Let W
&macr; n and g ⊥ ⊥ W
&macr; n . As in class, g ⊥ (xi ) = hg ⊥ , k(xi , −)i = 0, so
as g = gn + g ⊥ , where gn ∈ W
that g(xi ) = gn (xi ).
Plugging in to the objective function:
ψ(Y − g) + &micro;kgk2W = ψ(Y − gn ) + &micro;kgn k2W + &micro;kg ⊥ k2W .
For any fixed gn , this is a constant plus &micro;kg ⊥ k2W , which is minimized uniquely at g ⊥ = 0.
&macr; n , as desired.
Thus (unfixing gn ) any minimizer fˆ must lie in W
Pn
&gt;
ˆ
ˆ
ˆ2
&macr;
^
(2) As f ∈ Wn , write f =
i=1 θi k(xi , −), so that f = Kθ and kf kW = θ Kθ. Then the
first-order optimality conditions read
0 = ∇ φ(Y − ^
f) + &micro;kfˆk2W
= ∇ Y&gt; Σ−1/2 Y − 2Y&gt; Σ−1/2 Kθ + θ&gt; KΣ−1/2 Kθ + &micro;θ&gt; Kθ
= −2KΣ−1/2 Y + 2KΣ−1/2 Kθ + 2&micro;Kθ.
Rearranging, and recalling that Kθ = ^
f, we have
(KΣ−1/2 + &micro;In )^
f = KΣ−1/2 Y,
as desired.
(3) As above, it suffices to consider f , g ∈ W̄n , so that we can write f = Kθf , g = Kθg . From
the definition of fˆ, we have
ψ(f + ξ − ^
f) + &micro;kfˆk2 ≤ ψ(f + ξ − g) + &micro;kgk2 .
W
W
Separating out the contribution of ξ, we obtain
ψ(f − ^
f) + φ(ξ) + ξ &gt; Σ−1/2 (f − ^
f) + &micro;kfˆk2W ≤ ψ(f − g) + ψ(ξ) + ξ &gt; Σ−1/2 (f − g) + &micro;kgk2W .
Rearranging,
ψ(f − ^
f) ≤ −ξ &gt; Σ−1/2 (f − ^
f) − &micro;kfˆk2W + ψ(f − g) + ξ &gt; Σ−1/2 (f − g) + &micro;kgk2W
^ − g) − &micro;kfˆk2W − &micro;kgk2W
= ψ(f − g) + 2&micro;kgk2W + ξ &gt; Σ−1/2 (f
^ − g) − &micro; kfˆ − gk2W ,
≤ ψ(f − g) + 2&micro;kgk2W + ξ &gt; Σ−1/2 (f
2
2
2
2
by the inequality ka − bkW ≤ 2kakW + 2kbkW . Let Z = Σ−1/2 ξ, which is distributed as
N (0, I). Continuing the line of algebra,
&micro;
ψ(f − ^
f) ≤ ψ(f − g) + 2&micro;kgk2W + Z &gt; (^
f − g) − kfˆ − gk2W
2
&micro;
2
&gt;
= ψ(f − g) + 2&micro;kgkW + Z K(θfˆ − θg ) − (θfˆ − θg )&gt; K(θfˆ − θg )
2
1 &gt;
2
≤ ψ(f − g) + 2&micro;kgkW + Z KZ,
&micro;
by the inequality
&micro;
1
a&gt; P b ≤ a&gt; P a + b&gt; P b
2
&micro;
for any PSD matrix P . Now since
n
n
X
X
Z &gt; KZ =
Zi hk(xi , −), k(xj , −)iW = k
Zi k(xi , −)k2W ,
i,j =1
i=1
&macr; n , which equals the infimum over g ∈ W .
we conclude by taking the infimum over g ∈ W
18.657 PS 2 SOLUTIONS
7
(4) This follows from the previous part together with the observation


2
n
n
X
X
X
E[Zi2 ]Kii = Tr(K).
Zi k(xi , −) = E 
Zi Zj Kij  =
E
i=1
W
i,j
i=1
&macr; n , since only its evaluations at the design points
(5) It is sufficient to prove the bound for f ∈ W
matter. For the Gaussian kernel we have k(x, x) = 1, so that Tr(K) = n. Applying the
previous part, and taking the case g = f as an upper bound on the minimizer, we have
^)] ≤ ψ(f − f) + 2&micro;kf k2W + n = 2&micro;kf k2W + n .
E[ψ(f − f
&micro;
&micro;
p
Taking &micro; = kf kW n/2 gives the result.
MIT OpenCourseWare
http://ocw.mit.edu
18.657 Mathematics of Machine Learning
Fall 2015
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
```