Learnability of Gaussians with flexible variances

Ding-Xuan Zhou
City University of Hong Kong
E-mail: mazhou@cityu.edu.hk
Supported in part by the Research Grants Council of Hong Kong
October 20, 2007

Least-Square Regularized Regression

Learn $f: X \to Y$ from random samples $z = \{(x_i, y_i)\}_{i=1}^m$.

Take $X$ to be a compact subset of $\mathbb{R}^n$ and $Y = \mathbb{R}$, so that $y \approx f(x)$.

Due to noise or other uncertainty, we assume an (unknown) probability measure $\rho$ on $Z = X \times Y$ governs the sampling:
marginal distribution $\rho_X$ on $X$: $\{x_i\}_{i=1}^m$ drawn according to $\rho_X$;
conditional distribution $\rho(\cdot|x)$ at $x \in X$.

Learning the regression function: $f_\rho(x) = \int_Y y \, d\rho(y|x)$, so $y_i \approx f_\rho(x_i)$.

Learning with a Fixed Gaussian

$$f_{z,\lambda,\sigma} := \arg\min_{f \in \mathcal{H}_{K_\sigma}} \left\{ \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_{K_\sigma}^2 \right\}, \qquad (1)$$

where $\lambda = \lambda(m) > 0$ and $K_\sigma(x,y) = e^{-|x-y|^2/(2\sigma^2)}$ is a Gaussian kernel on $X$.

The Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_{K_\sigma}$ is the completion of $\mathrm{span}\{(K_\sigma)_t := K_\sigma(t,\cdot) : t \in X\}$ with respect to the inner product $\langle\cdot,\cdot\rangle_{K_\sigma}$ satisfying $\langle (K_\sigma)_x, (K_\sigma)_y \rangle_{K_\sigma} = K_\sigma(x,y)$.

Theorem 1 (Smale-Zhou, Constr. Approx. 2007). Assume $|y| \le M$ and that $f_\rho = \int_X K_\sigma(x,y)\, g(y)\, d\rho_X(y)$ for some $g \in L^2_{\rho_X}$. For any $0 < \delta < 1$, with confidence $1-\delta$,
$$\|f_{z,\lambda,\sigma} - f_\rho\|_{L^2_{\rho_X}} \le 2\log(4/\delta)\, (12M)^{2/3}\, \|g\|_{L^2_{\rho_X}}^{1/3} \left(\frac{1}{m}\right)^{1/3},$$
where $\lambda = \lambda(m) = \left(12M/\|g\|_{L^2_{\rho_X}}\right)^{2/3} (1/m)^{1/3}$.

Note that the assumption of Theorem 1 forces $f_\rho \in C^\infty$.

The RKHS $\mathcal{H}_{K_\sigma}$ generated by a Gaussian kernel on $X$

$\mathcal{H}_{K_\sigma} = \mathcal{H}_{K_\sigma}(\mathbb{R}^n)|_X$, where $\mathcal{H}_{K_\sigma}(\mathbb{R}^n)$ is the RKHS generated by $K_\sigma$ as a Mercer kernel on $\mathbb{R}^n$:
$$\mathcal{H}_{K_\sigma}(\mathbb{R}^n) = \left\{ f \in L^2(\mathbb{R}^n) : \|f\|_{\mathcal{H}_{K_\sigma}(\mathbb{R}^n)} < \infty \right\},
\quad\text{where}\quad
\|f\|_{\mathcal{H}_{K_\sigma}(\mathbb{R}^n)} = \left( (\sqrt{2\pi}\,\sigma)^{-n} \int_{\mathbb{R}^n} |\hat f(\xi)|^2\, e^{\sigma^2|\xi|^2/2}\, d\xi \right)^{1/2}.$$
Thus $\mathcal{H}_{K_\sigma}(\mathbb{R}^n) \subset C^\infty(\mathbb{R}^n)$ (Steinwart).

If $X$ is a domain with piecewise smooth boundary and $d\rho_X(x) \ge c_0\, dx$ for some $c_0 > 0$, then for any $\beta > 0$,
$$\mathcal{D}_\sigma(\lambda) := \inf_{f \in \mathcal{H}_{K_\sigma}} \left\{ \|f - f_\rho\|_{L^2_{\rho_X}}^2 + \lambda \|f\|_{K_\sigma}^2 \right\} = O(\lambda^\beta)$$
implies $f_\rho \in C^\infty(X)$.

Note that $\|f - f_\rho\|_{L^2_{\rho_X}}^2 = \mathcal{E}(f) - \mathcal{E}(f_\rho)$, where $\mathcal{E}(f) := \int_Z (f(x)-y)^2\, d\rho$. Denote $\mathcal{E}_z(f) = \frac{1}{m}\sum_{i=1}^m (f(x_i)-y_i)^2 \approx \mathcal{E}(f)$. Then
$$f_{z,\lambda,\sigma} = \arg\min_{f \in \mathcal{H}_{K_\sigma}} \left\{ \mathcal{E}_z(f) + \lambda\|f\|_{K_\sigma}^2 \right\}.$$

If we define
$$f_{\lambda,\sigma} = \arg\min_{f \in \mathcal{H}_{K_\sigma}} \left\{ \mathcal{E}(f) + \lambda\|f\|_{K_\sigma}^2 \right\},$$
then $f_{z,\lambda,\sigma} \approx f_{\lambda,\sigma}$, and the error can be estimated in terms of $\lambda$ by the theory of uniform convergence over the compact function set $B_{M/\sqrt\lambda} := \{f \in \mathcal{H}_{K_\sigma} : \|f\|_{K_\sigma} \le M/\sqrt\lambda\}$, since $f_{z,\lambda,\sigma} \in B_{M/\sqrt\lambda}$. But $\|f_{\lambda,\sigma} - f_\rho\|_{L^2_{\rho_X}}^2 = O(\lambda^\beta)$ for every $\beta > 0$ implies $f_\rho \in C^\infty(X)$. So the learning ability of a single Gaussian is weak.

One may choose a less smooth kernel, but we would like radial basis kernels for manifold learning.

One way to increase the learning ability of Gaussian kernels: let $\sigma$ depend on $m$, with $\sigma = \sigma(m) \to 0$ as $m \to \infty$ (Steinwart-Scovel, Xiang-Zhou, ...).

Another way: allow all possible variances $\sigma \in (0,\infty)$.

Regularization Schemes with Flexible Gaussians (Zhou, Wu-Ying-Zhou, Ying-Zhou, Micchelli-Pontil-Wu-Zhou, ...):
$$f_{z,\lambda} := \arg\min_{0<\sigma<\infty}\ \min_{f \in \mathcal{H}_{K_\sigma}} \left\{ \frac{1}{m}\sum_{i=1}^m (f(x_i)-y_i)^2 + \lambda\|f\|_{K_\sigma}^2 \right\}.$$
(A small numerical sketch of this scheme is given below, just before the uniform-convergence discussion.)

Theorem 2 (Ying-Zhou, J. Mach. Learning Res. 2007). Let $\rho_X$ be the Lebesgue measure on a domain $X$ in $\mathbb{R}^n$ with minimally smooth boundary. If $f_\rho \in H^s(X)$ for some $0 < s \le 2$ and $\lambda = m^{-(2s+n)/(4(4s+n))}$, then
$$E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_{L^2_{\rho_X}}^2 = O\left( m^{-s/(2(4s+n))} \sqrt{\log m} \right).$$

Major difficulty: is the function set
$$\mathcal{H} = \bigcup_{0<\sigma<\infty} \left\{ f \in \mathcal{H}_{K_\sigma} : \|f\|_{K_\sigma} \le R \right\}$$
with $R > 0$ learnable? That is, is this function set a uniform Glivenko-Cantelli class? Its closure is not a compact subset of $C(X)$.
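Before taking up this question, here is a minimal numpy sketch of the flexible-variance scheme above (an illustration under stated assumptions, not code from the cited papers). For each $\sigma$ on a finite grid standing in for $(0,\infty)$, the inner problem is the fixed-$\sigma$ scheme (1), solved in closed form through the representer theorem; the $\sigma$ with the smallest regularized empirical objective is kept. The synthetic data, the $\sigma$ grid, and the choice $\lambda = m^{-1/2}$ are assumptions made only for the demonstration.

```python
import numpy as np

# Minimal sketch of the flexible-variance regularized least-squares scheme:
# grid over sigma, closed-form inner solution, pick the smallest objective.

def gaussian_gram(X, sigma):
    """Gram matrix K_sigma(x_i, x_j) = exp(-|x_i - x_j|^2 / (2 sigma^2))."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_fixed_sigma(X, y, sigma, lam):
    """Minimize (1/m) sum_i (f(x_i) - y_i)^2 + lam * ||f||_{K_sigma}^2.
    The minimizer is f = sum_j alpha_j K_sigma(x_j, .), with
    alpha = (K + m*lam*I)^{-1} y and ||f||_{K_sigma}^2 = alpha^T K alpha."""
    m = len(y)
    K = gaussian_gram(X, sigma)
    alpha = np.linalg.solve(K + m * lam * np.eye(m), y)
    residual = K @ alpha - y
    objective = np.mean(residual ** 2) + lam * (alpha @ K @ alpha)
    return alpha, objective

rng = np.random.default_rng(0)
m = 200
X = rng.uniform(-1.0, 1.0, size=(m, 1))                 # samples x_i in [-1, 1]
# a non-smooth (Hoelder) regression function plus noise, as a toy example
y = np.sign(X[:, 0]) * np.sqrt(np.abs(X[:, 0])) + 0.1 * rng.standard_normal(m)

lam = m ** (-0.5)                                        # illustrative lambda(m)
sigmas = np.logspace(-2, 1, 20)                          # finite grid over sigma
objectives = [fit_fixed_sigma(X, y, s, lam)[1] for s in sigmas]
best = int(np.argmin(objectives))
print(f"selected sigma = {sigmas[best]:.3f}, objective = {objectives[best]:.4f}")
```

The closed form used in the inner step is the standard kernel ridge regression solution; minimizing over the $\sigma$ grid mimics the outer minimization over variances in the flexible scheme.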
Theory of Uniform Convergence

Consider $\sup_{f \in \mathcal{H}} |\mathcal{E}_z(f) - \mathcal{E}(f)|$. Given a bounded set $\mathcal{H}$ of functions on $X$, when do we have
$$\lim_{\ell \to \infty}\ \sup_\rho\ \mathrm{Prob}\left\{ \sup_{m \ge \ell}\ \sup_{f \in \mathcal{H}} \left| \frac{1}{m}\sum_{i=1}^m f(x_i) - \int_X f(x)\, d\rho \right| > \epsilon \right\} = 0, \qquad \forall \epsilon > 0?$$
Such a set is called a uniform Glivenko-Cantelli (UGC) class.

Characterizations: Vapnik-Chervonenkis, and Alon, Ben-David, Cesa-Bianchi, Haussler (1997).

Our quantitative estimates: if $V : Y \times \mathbb{R} \to \mathbb{R}_+$ is convex with respect to the second variable, $M = \|V(y,0)\|_{L^\infty(Z)} < \infty$, and $C_R = \sup\{\max\{|V'_-(y,t)|, |V'_+(y,t)|\} : y \in Y,\ |t| \le R\} < \infty$, then
$$E_{z \in Z^m} \left\{ \sup_{f \in \mathcal{H}} \left| \frac{1}{m}\sum_{i=1}^m V(y_i, f(x_i)) - \int_Z V(y, f(x))\, d\rho \right| \right\} \le C' C_R R\, \frac{\log^{1/4} m}{\sqrt m} + \frac{2M}{\sqrt m},$$
where $C'$ is a constant depending on $n$.

Ideas: reduce the estimates for $\mathcal{H}$ to a much smaller subset $\mathcal{F} = \{(K_\sigma)_x : x \in X,\ 0 < \sigma < \infty\}$, then bound empirical covering numbers. The UGC property follows from the characterization of Dudley-Giné-Zinn.

Goal: improve the learning rates when $X$ is a manifold of dimension $d$ with $d$ much smaller than the dimension $n$ of the underlying Euclidean space.

Approximation by Gaussians on Riemannian Manifolds

Let $X$ be a $d$-dimensional connected compact $C^\infty$ submanifold of $\mathbb{R}^n$ without boundary. The approximation scheme is given by a family of linear operators $\{I_\sigma : C(X) \to C(X)\}_{\sigma>0}$,
$$I_\sigma(f)(x) = \frac{1}{(\sqrt{2\pi}\,\sigma)^d} \int_X K_\sigma(x,y)\, f(y)\, dV(y) = \frac{1}{(\sqrt{2\pi}\,\sigma)^d} \int_X \exp\left\{ -\frac{|x-y|^2}{2\sigma^2} \right\} f(y)\, dV(y), \qquad x \in X,$$
where $V$ is the Riemannian volume measure of $X$. (A numerical sketch of $I_\sigma$ on the circle is given at the end of these slides.)

Theorem 3 (Ye-Zhou, Adv. Comput. Math. 2007). If $f_\rho \in \mathrm{Lip}(s)$ with $0 < s \le 1$, then
$$\|I_\sigma(f_\rho) - f_\rho\|_{C(X)} \le C_X \|f_\rho\|_{\mathrm{Lip}(s)}\, \sigma^s \qquad \forall \sigma > 0, \qquad (2)$$
where $C_X$ is a positive constant independent of $f_\rho$ or $\sigma$. By taking $\lambda = \left(\frac{\log^2 m}{m}\right)^{(s+d)/(8s+4d)}$, we have
$$E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_{L^2_{\rho_X}}^2 = O\left( \left(\frac{\log^2 m}{m}\right)^{s/(8s+4d)} \right).$$
The index $s/(2(4s+n))$ in Theorem 2 is smaller than the index $s/(8s+4d)$ in Theorem 3 when the manifold dimension $d$ is much smaller than $n$, so the rate on the manifold is faster.

Classification by Gaussians on Riemannian Manifolds

Let $\phi(t) = \max\{1-t, 0\}$ be the hinge loss for support vector machine classification. Define
$$f_{z,\lambda} = \arg\min_{\sigma \in (0,+\infty)}\ \min_{f \in \mathcal{H}_{K_\sigma}} \left\{ \frac{1}{m}\sum_{i=1}^m \phi\big(y_i f(x_i)\big) + \lambda\|f\|_{K_\sigma}^2 \right\}.$$
By using $I_\sigma : L^p(X) \to L^p(X)$, we obtain learning rates for binary classification to learn the Bayes rule
$$f_c(x) = \begin{cases} 1, & \text{if } \rho(y=1|x) \ge \rho(y=-1|x), \\ -1, & \text{if } \rho(y=1|x) < \rho(y=-1|x). \end{cases}$$
Here $Y = \{1, -1\}$ represents the two classes. The misclassification error is defined as $\mathcal{R}(f) := \mathrm{Prob}\{y \ne f(x)\}$; it satisfies $\mathcal{R}(f) \ge \mathcal{R}(f_c)$ for any $f : X \to Y$.

The Sobolev space $H^k_p(X)$ is the completion of $C^\infty(X)$ with respect to the norm
$$\|f\|_{H^k_p(X)} = \left( \sum_{j=0}^k \int_X |\nabla^j f|^p\, dV \right)^{1/p},$$
where $\nabla^j f$ denotes the $j$th covariant derivative of $f$.

Theorem 4. If $f_c$ lies in the interpolation space $(L^1(X), H^2_1(X))_\theta$ for some $0 < \theta \le 1$, then by taking $\lambda = \left(\frac{\log^2 m}{m}\right)^{(2\theta+d)/(12\theta+2d)}$, we have
$$E_{z \in Z^m} \left\{ \mathcal{R}(\mathrm{sgn}(f_{z,\lambda})) - \mathcal{R}(f_c) \right\} \le \widetilde{C} \left(\frac{\log^2 m}{m}\right)^{\theta/(6\theta+d)},$$
where $\widetilde{C}$ is a constant independent of $m$.

Ongoing topics: variable selection, dimensionality reduction, graph Laplacian, diffusion map.
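As a numerical illustration of the operator $I_\sigma$ and the bound (2), here is a small numpy sketch (my own example, not from the slides). The manifold is taken to be the unit circle $S^1 \subset \mathbb{R}^2$, so $d = 1$ and $dV = d\theta$, and $f(x) = |x_1|$ is Lipschitz with $s = 1$; $I_\sigma(f)$ is computed by a Riemann sum, and the sup-norm error decreases roughly like $\sigma$, as Theorem 3 predicts.

```python
import numpy as np

# Sketch: the Gaussian approximation operator I_sigma on the unit circle,
# applied to the Lipschitz function f(x) = |x_1|; error ~ O(sigma).

def I_sigma_on_circle(f_vals, pts, dtheta, sigma, x):
    """Riemann-sum approximation of I_sigma(f)(x) on the circle (d = 1)."""
    d2 = np.sum((pts - x) ** 2, axis=1)          # squared Euclidean distances |x - y|^2
    weights = np.exp(-d2 / (2.0 * sigma ** 2))
    # normalization (sqrt(2 pi) sigma)^d with d = 1
    return np.sum(weights * f_vals) * dtheta / (np.sqrt(2.0 * np.pi) * sigma)

n_quad = 20000
theta = np.linspace(0.0, 2.0 * np.pi, n_quad, endpoint=False)
dtheta = 2.0 * np.pi / n_quad
pts = np.column_stack([np.cos(theta), np.sin(theta)])    # quadrature points on S^1
f_vals = np.abs(pts[:, 0])                               # f(x) = |x_1|, Lip(1)

test_theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
test_pts = np.column_stack([np.cos(test_theta), np.sin(test_theta)])

for sigma in [0.4, 0.2, 0.1, 0.05]:
    approx = np.array([I_sigma_on_circle(f_vals, pts, dtheta, sigma, x) for x in test_pts])
    err = np.max(np.abs(approx - np.abs(test_pts[:, 0])))
    print(f"sigma = {sigma:5.2f}   sup-norm error of I_sigma(f) - f = {err:.4f}")
```

Halving $\sigma$ should roughly halve the reported sup-norm error, consistent with the $O(\sigma^s)$ bound for $s = 1$; the largest error occurs near the kinks of $f$ at $x_1 = 0$.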