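18.657. Fall 2015
Rigollet
October 18, 2015

Problem set #2 (due Wed., October 21)
Should be typed in LaTeX

Problem 1. Rademacher Complexities and beyond

Let $\mathcal{F}$ be a class of functions from $\mathcal{X}$ to $\mathbb{R}$ and let $X_1, \ldots, X_n$ be i.i.d. copies of a random variable $X \in \mathcal{X}$. Moreover, let $\sigma_1, \ldots, \sigma_n$ be $n$ i.i.d. $\mathrm{Rad}(1/2)$ random variables and let $g_1, \ldots, g_n$ be $n$ i.i.d. $\mathcal{N}(0, 1)$ random variables. Assume that all these random variables are mutually independent.

1. Prove the desymmetrization inequality:
\[
\mathbb{E}\Big[\sup_{f \in \mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \sigma_i \big(f(X_i) - \mathbb{E}[f(X)]\big)\Big|\Big]
\le 2\,\mathbb{E}\Big[\sup_{f \in \mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}[f(X)]\Big|\Big]
\]

2. Prove the Rademacher/Gaussian process comparison inequality:
\[
\mathbb{E}\Big[\sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^n \sigma_i f(X_i)\Big|\Big]
\le \sqrt{\frac{\pi}{2}}\;\mathbb{E}\Big[\sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^n g_i f(X_i)\Big|\Big]
\]

Define
\[
R_n(\mathcal{F}) = \mathbb{E}\Big[\sup_{f \in \mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i)\Big|\Big].
\]
Let $\mathcal{F}$ and $\mathcal{G}$ be two sets of functions from $\mathcal{X}$ to $\mathbb{R}$ and recall that $\mathcal{F} + \mathcal{G} = \{f + g : f \in \mathcal{F}, g \in \mathcal{G}\}$.

3. Let $h \in \mathbb{R}^{\mathcal{X}}$ be a given function and define $\mathcal{F} + h = \{f + h : f \in \mathcal{F}\}$. Show that
\[
R_n(\mathcal{F} + \{h\}) \le R_n(\mathcal{F}) + \frac{\|h\|_\infty}{\sqrt{n}},
\]
where $\|h\|_\infty = \sup_{x \in \mathcal{X}} |h(x)|$.

4. Let $\mathcal{F}_1, \ldots, \mathcal{F}_k$ be $k$ sets of functions from $\mathcal{X}$ to $\mathbb{R}$. Show that
\[
R_n(\mathcal{F}_1 + \cdots + \mathcal{F}_k) \le \sum_{j=1}^k R_n(\mathcal{F}_j).
\]

5. Show that the inequality in 4. is in fact an equality when the $\mathcal{F}_j$'s are all the same.

Problem 2. Covering and packing

Definition: A set $P \subset T$ is called an $\varepsilon$-packing of the metric space $(T, d)$ if $d(f, g) > \varepsilon$ for every $f, g \in P$, $f \neq g$. The largest cardinality of an $\varepsilon$-packing of $(T, d)$ is called the packing number of $(T, d)$:
\[
D(T, d, \varepsilon) = \sup\big\{\mathrm{card}(P) : P \text{ is an } \varepsilon\text{-packing of } (T, d)\big\}.
\]
Recall that $N(T, d, \varepsilon)$ denotes the $\varepsilon$-covering number of $(T, d)$.

1. Show that
\[
D(T, d, 2\varepsilon) \le N(T, d, \varepsilon) \le D(T, d, \varepsilon).
\]

Let $M$ be an $n \times m$ random matrix with i.i.d. $\mathrm{Rad}(1/2)$ entries. We are interested in its operator norm
\[
\|M\| = \sup_{\substack{u \in \mathbb{R}^n :\, |u|_2 \le 1 \\ v \in \mathbb{R}^m :\, |v|_2 \le 1}} u^\top M v.
\]

2. Show that
\[
\|M\| \le 2 \max_{\substack{u \in \mathcal{N}_n \\ v \in \mathcal{N}_m}} u^\top M v,
\]
where $\mathcal{N}_n$ and $\mathcal{N}_m$ are $\frac{1}{4}$-nets of the unit Euclidean balls of $\mathbb{R}^n$ and $\mathbb{R}^m$ respectively.

3. Conclude that
\[
\mathbb{E}\|M\| \le C\big(\sqrt{m} + \sqrt{n}\big).
\]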
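A numerical sanity check can help build intuition for Problem 2, part 3, before attempting the proof. The sketch below is only an illustration, not part of the expected solution: it draws Rademacher matrices, estimates the operator norm by power iteration (a technique chosen here for convenience, not one prescribed by the problem set), and compares the average to $\sqrt{m} + \sqrt{n}$. The function names `rademacher_matrix`, `operator_norm` and `mean_operator_norm` are hypothetical, and the iteration and trial counts are arbitrary assumptions.

```python
import math
import random


def rademacher_matrix(n, m, rng):
    """Return an n x m matrix (list of rows) with i.i.d. entries uniform on {-1, +1}."""
    return [[rng.choice((-1.0, 1.0)) for _ in range(m)] for _ in range(n)]


def operator_norm(M, iters=100, rng=None):
    """Estimate ||M|| = sup u^T M v over unit vectors by power iteration on M^T M.

    The returned value is |M v|_2 for a unit vector v, so it never exceeds ||M||;
    it approaches ||M|| as the number of iterations grows.
    """
    rng = rng or random.Random(0)
    n, m = len(M), len(M[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm_v = math.sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    sigma = 0.0
    for _ in range(iters):
        # u = M v
        u = [sum(M[i][j] * v[j] for j in range(m)) for i in range(n)]
        sigma = math.sqrt(sum(x * x for x in u))
        # w = M^T u = M^T M v
        w = [sum(M[i][j] * u[i] for i in range(n)) for j in range(m)]
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            break
        v = [x / norm_w for x in w]
    return sigma


def mean_operator_norm(n, m, trials=10, seed=0):
    """Monte Carlo average of the estimated operator norm over independent draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += operator_norm(rademacher_matrix(n, m, rng), rng=rng)
    return total / trials


if __name__ == "__main__":
    n, m = 20, 30
    ratio = mean_operator_norm(n, m) / (math.sqrt(n) + math.sqrt(m))
    print(f"estimated E||M|| / (sqrt(n) + sqrt(m)) = {ratio:.3f}")
```

If the ratio stays bounded as $n$ and $m$ vary, that is consistent with the bound to be proved; it says nothing about the value of the constant $C$ obtained from the net argument.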
Problem 3. Chaining

Let $\mathcal{F}$ be the class of all nondecreasing functions from $[0, 1]$ to $[0, 1]$.

1. Show that for any $x = (x_1, \ldots, x_n) \in [0, 1]^n$, the covering number of $(\mathcal{F}, d_\infty^x)$ satisfies
\[
N(\mathcal{F}, d_\infty^x, \varepsilon) \le n^{2/\varepsilon}.
\]

2. Using the chaining bound, show that
\[
R_n(\mathcal{F}) \le C\sqrt{\frac{\log n}{n}}.
\]

3. Show that there is indeed a strict improvement over the bound obtained using the theorem in section 5.2.1.

Problem 4. Kernel ridge regression

Consider the regression model:
\[
Y_i = f(x_i) + \xi_i, \quad i = 1, \ldots, n,
\]
where $x_1, \ldots, x_n$ are fixed design points in $\mathbb{R}^d$, $\xi = (\xi_1, \ldots, \xi_n) \sim \mathcal{N}(0, \Sigma) \in \mathbb{R}^n$ with known covariance matrix $\Sigma \succ 0$, and $f : \mathbb{R}^d \to \mathbb{R}$ is an unknown regression function. Let $W$ be an RKHS on $\mathbb{R}^d$ with reproducing kernel $k$. Define $\mathbf{Y} = (Y_1, \ldots, Y_n)^\top$ and $\mathbf{g} = [g(x_1), \ldots, g(x_n)]^\top$ for any function $g$. Define the estimator $\hat{f}$ of $f$ by
\[
\hat{f} = \operatorname*{argmin}_{g \in W} \big\{ \psi(\mathbf{Y} - \mathbf{g}) + \mu \|g\|_W^2 \big\},
\]
where $\|\cdot\|_W$ denotes the Hilbert norm on $W$, $\psi(x) = x^\top \Sigma^{-1/2} x$, and $\mu > 0$ is a tuning parameter to be chosen later.

1. Prove the representer theorem, i.e., that there exists a vector $\theta \in \mathbb{R}^n$ such that
\[
\hat{f}(x) = \sum_{i=1}^n \theta_i k(x_i, x)
\]
for any $x \in \mathbb{R}^d$.

2. Prove that the vector $\hat{\mathbf{f}} = [\hat{f}(x_1), \ldots, \hat{f}(x_n)]^\top$ satisfies
\[
(K\Sigma^{-1/2} + \mu I_n)\,\hat{\mathbf{f}} = K\Sigma^{-1/2}\,\mathbf{Y},
\]
where $I_n$ is the identity matrix of $\mathbb{R}^n$ and $K$ denotes the symmetric $n \times n$ matrix with elements $K_{i,j} = k(x_i, x_j)$.

3. Prove that the following inequality holds:
\[
\psi(\mathbf{f} - \hat{\mathbf{f}}) \le \inf_{g \in W} \big\{ \psi(\mathbf{f} - \mathbf{g}) + 2\mu \|g\|_W^2 \big\} + \frac{1}{\mu} \Big\| \sum_{i=1}^n Z_i k(x_i, \cdot) \Big\|_W^2,
\]
where $Z_1, \ldots, Z_n$ are i.i.d. $\mathcal{N}(0, 1)$.

4. Conclude that
\[
\mathbb{E}\,\psi(\mathbf{f} - \hat{\mathbf{f}}) \le \inf_{g \in W} \big\{ \psi(\mathbf{f} - \mathbf{g}) + 2\mu \|g\|_W^2 \big\} + \frac{1}{\mu} \mathrm{Tr}(K),
\]
where $\mathrm{Tr}(K)$ denotes the trace of $K$.

5. Assume now that $k$ is the Gaussian kernel:
\[
k(x, x') = e^{-|x - x'|_2^2}.
\]
Show that there exists a choice of $\mu$ for which
\[
\mathbb{E}\,\psi(\mathbf{f} - \hat{\mathbf{f}}) \le 2 \|f\|_W \sqrt{2n}.
\]

MIT OpenCourseWare
http://ocw.mit.edu

18.657 Mathematics of Machine Learning
Fall 2015

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.