Consistent Vector-valued Distribution Regression

Zoltán Szabó
Joint work with Arthur Gretton (UCL), Barnabás Póczos (CMU), Bharath K. Sriperumbudur (PSU)

UCL Workshop on the Theory of Big Data, January 8, 2015

The task

Samples: $\{(x_i, y_i)\}_{i=1}^{l}$. Goal: find $f \in \mathcal{H}$ such that $f(x_i) \approx y_i$.
Distribution regression: the $x_i$-s are distributions, each available only through samples $\{x_{i,n}\}_{n=1}^{N}$.
$\Rightarrow$ Training examples: labelled bags.

Example: aerosol prediction from satellite images

Bag := points of a multispectral satellite image over an area.
Label of a bag := aerosol value.
Engineered methods [Wang et al., 2012]: $100 \times \mathrm{RMSE} = 7.5$–$8.5$.
Using distribution regression, without domain knowledge: $100 \times \mathrm{RMSE} = 7.81$.

Wider context

Context:
- machine learning: multi-instance learning,
- statistics: point estimation tasks (without analytical formula).

Applications:
- computer vision: image = collection of patch vectors,
- network analysis: group of people = bag of friendship graphs,
- natural language processing: corpus = bag of documents,
- time-series modelling: user = set of trial time-series.

Several algorithmic approaches

1. Parametric fit: Gaussian, mixture of Gaussians, exponential family [Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].
2. Kernelized Gaussian measures [Jebara et al., 2004, Zhou and Chellappa, 2006].
3. (Positive definite) kernels [Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].
4. Divergence measures (KL, Rényi, Tsallis) [Póczos et al., 2011].
5. Set metrics: Hausdorff metric [Edgar, 1995]; variants [Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].

Theoretical guarantee?

MIL dates back to [Haussler, 1999, Gärtner et al., 2002].
Sensible methods in regression require density estimation [Póczos et al., 2013, Oliva et al., 2014], plus assumptions:
1. compact Euclidean domain,
2. output $= \mathbb{R}$.

Problem formulation

Given: labelled bags $\hat{z} = \{(\hat{x}_i, y_i)\}_{i=1}^{l}$, where the $i$-th bag $\hat{x}_i = \{x_{i,1}, \ldots, x_{i,N}\} \overset{\mathrm{i.i.d.}}{\sim} x_i \in \mathcal{M}_1^+(D)$, and $y_i \in Y$.
Task: find an $\mathcal{M}_1^+(D) \to Y$ mapping based on $\hat{z}$.
Construction: distribution embedding ($\mu_x$) + ridge regression:
$$\mathcal{M}_1^+(D) \xrightarrow{\mu = \mu(k)} X \subseteq H = H(k) \xrightarrow{f \in \mathcal{H} = \mathcal{H}(K)} Y.$$
Our goal: risk bound compared to the regression function
$$f_\rho(\mu_x) = \int_Y y \, d\rho(y \mid \mu_x).$$

Goal in details

Contribution: analysis of the excess risk
$$\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) = \mathcal{R}[f_{\hat{z}}^{\lambda}] - \mathcal{R}[f_\rho] \le g(l, N, \lambda) \to 0$$
and rates, where
$$\mathcal{R}[f] = \mathbb{E}_{(x,y)} \|f(\mu_x) - y\|_Y^2 \quad \text{(expected risk)},$$
$$f_{\hat{z}}^{\lambda} = \arg\min_{f \in \mathcal{H}} \frac{1}{l} \sum_{i=1}^{l} \|f(\mu_{\hat{x}_i}) - y_i\|_Y^2 + \lambda \|f\|_{\mathcal{H}}^2 \quad (\lambda > 0).$$
We consider two settings:
1. well-specified case: $f_\rho \in \mathcal{H}$,
2. misspecified case: $f_\rho \in L^2_{\rho_X} \setminus \mathcal{H}$.

Kernel ($k$, $K$), RKHS

$k : D \times D \to \mathbb{R}$ is a kernel on $D$ if $\exists \varphi : D \to H$(ilbert) such that $k(a,b) = \langle \varphi(a), \varphi(b) \rangle_H$.
$\exists!$ RKHS: $H(k) = \{D \to \mathbb{R} \text{ functions}\}$, $\varphi(u) = k(\cdot, u)$.
Kernel examples for $D = \mathbb{R}^d$ ($p > 0$, $\theta > 0$):
- $k(a,b) = (\langle a, b \rangle + \theta)^p$: polynomial,
- $k(a,b) = e^{-\|a-b\|_2^2/(2\theta^2)}$: Gaussian,
- kernels on graphs, texts, time series, distributions.
Note: $\mathcal{H}(K) = \{X \to Y \text{ functions}\}$, $K(\mu_x, \mu_{x'}) \in L(Y)$.

Step-1 (distribution embedding): $\mathcal{M}_1^+(D) \xrightarrow{\mu} X \subseteq H(k)$

Given: kernel $k : D \times D \to \mathbb{R}$.
Mean embedding of distributions $x, \hat{x}_i \in \mathcal{M}_1^+(D)$:
$$\mu_x = \int_D k(\cdot, u) \, dx(u) \in H(k), \qquad \mu_{\hat{x}_i} = \int_D k(\cdot, u) \, d\hat{x}_i(u) = \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_{i,n}).$$
$Y = \mathbb{R}$, linear $K$ $\Rightarrow$ set kernel:
$$K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = \langle \mu_{\hat{x}_i}, \mu_{\hat{x}_j} \rangle_H = \frac{1}{N^2} \sum_{n,m=1}^{N} k(x_{i,n}, x_{j,m}).$$

Step-2 (ridge regression): analytical solution

Given: training sample $\hat{z}$, test distribution $t$. Prediction:
$$(f_{\hat{z}}^{\lambda} \circ \mu)(t) = \mathbf{k}(\mathbf{K} + l\lambda I_l)^{-1}[y_1; \ldots; y_l], \quad (1)$$
$$\mathbf{K} = [K_{ij}] = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in L(Y)^{l \times l}, \quad (2)$$
$$\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t), \ldots, K(\mu_{\hat{x}_l}, \mu_t)] \in L(Y)^{1 \times l}. \quad (3)$$
Specially: $Y = \mathbb{R}$ $\Rightarrow$ $L(Y) = \mathbb{R}$; $Y = \mathbb{R}^d$ $\Rightarrow$ $L(Y) = \mathbb{R}^{d \times d}$.

Blanket assumptions

- $D$: separable, topological domain.
- $k$: bounded, continuous.
- $K$: bounded, Hölder continuous ($h \in (0, 1]$: exponent).
- $X = \mu\!\left(\mathcal{M}_1^+(D)\right) \in \mathcal{B}(H)$.
- $Y$: separable Hilbert space.

Performance guarantees (in human-readable format)

If in addition
1. well-specified case: $f_\rho$ is 'c-smooth' with 'b-decaying covariance operator' and $l \ge \lambda^{-\frac{1}{b}-1}$, then
$$\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \le \frac{\log^h(l)}{N^{h/2}\lambda^3} + \lambda^c + \frac{1}{l^2\lambda} + \frac{1}{l\lambda^{1/b}}; \quad (4)$$
2. misspecified case: $f_\rho$ is 's-smooth', $L^2_{\rho_X}$ is separable, and $\frac{1}{\lambda^2} \le l$, then
$$\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \le \frac{\log^{h/2}(l)}{N^{h/2}\lambda^{3/2}} + \frac{1}{\sqrt{l}\,\lambda} + \frac{\lambda^{\min(1,s)}}{\sqrt{\lambda}\,\sqrt{l}} + \lambda^{\min(1,s)}. \quad (5)$$

Performance guarantee: example

Misspecified case: assume $s \ge 1$, $h = 1$ ($K$: Lipschitz); setting the 1st and 3rd terms of (5) equal gives $\lambda$; $l = N^a$ ($a > 0$); $t = lN$: total number of samples processed. Then
1. $s = 1$ ('most difficult' task): $\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \approx t^{-0.25}$,
2. $s \to \infty$ ('simplest' problem): $\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \approx t^{-0.5}$.
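For $Y = \mathbb{R}$ with the linear (set) kernel, Step-1 and Step-2 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the ITE implementation: the helper names (`base_kernel`, `set_kernel`, `fit_predict`) and the Gaussian choice of the base kernel $k$ are assumptions; the ridge step implements prediction formula (1).

```python
import numpy as np

def base_kernel(A, B, theta=1.0):
    """Gaussian base kernel k(a, b) = exp(-||a - b||^2 / (2 theta^2)),
    evaluated between all rows of A (n x d) and B (m x d)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * theta**2))

def set_kernel(bag_i, bag_j, theta=1.0):
    """Set kernel K(mu_i, mu_j) = (1 / (N_i N_j)) sum_{n,m} k(x_{i,n}, x_{j,m}):
    the inner product of the two empirical mean embeddings in H(k)."""
    return base_kernel(bag_i, bag_j, theta).mean()

def fit_predict(train_bags, y, test_bags, lam=1e-3, theta=1.0):
    """Two-stage distribution regression with Y = R and linear K.
    Prediction (1): f(t) = k_t (K + l*lam*I)^{-1} [y_1; ...; y_l]."""
    l = len(train_bags)
    K = np.array([[set_kernel(bi, bj, theta) for bj in train_bags]
                  for bi in train_bags])                 # (2): l x l Gram matrix
    alpha = np.linalg.solve(K + l * lam * np.eye(l), y)  # ridge coefficients
    k_t = np.array([[set_kernel(t, bi, theta) for bi in train_bags]
                    for t in test_bags])                 # (3): test-train kernels
    return k_t @ alpha
```

Bags of different sizes are handled naturally, since the set kernel is simply the mean of all pairwise base-kernel evaluations; only the Gram-matrix entries change if a nonlinear $K$ is used, while the ridge step (1) stays identical.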
Nonlinear K examples

$Y = \mathbb{R}$; $D$: compact, metric; $k$: universal $\Rightarrow$ Hölder continuous $K$-s (with exponent $h$):
- $K_G(\mu_a, \mu_b) = e^{-\|\mu_a - \mu_b\|_H^2 / (2\theta^2)}$ (Gaussian): $h = 1$,
- $K_e(\mu_a, \mu_b) = e^{-\|\mu_a - \mu_b\|_H / (2\theta^2)}$ (exponential): $h = \frac{1}{2}$,
- $K_C(\mu_a, \mu_b) = \left(1 + \|\mu_a - \mu_b\|_H^2 / \theta^2\right)^{-1}$ (Cauchy): $h = 1$,
- $K_t(\mu_a, \mu_b) = \left(1 + \|\mu_a - \mu_b\|_H^{\theta}\right)^{-1}$ ($\theta \le 2$): $h = \frac{\theta}{2}$,
- $K_i(\mu_a, \mu_b) = \left(\|\mu_a - \mu_b\|_H^2 + \theta^2\right)^{-1/2}$ (inverse multiquadric): $h = 1$.
They are functions of $\|\mu_a - \mu_b\|_H$ $\Rightarrow$ computation: similar to the set kernel.

Summary

- Problem: distribution regression.
- Literature: large number of heuristics.
- Contribution: a simple ridge solution is consistent; specially, the set kernel is so (a 15-year-old open question).
- Code $\in$ ITE toolbox: https://bitbucket.org/szzoli/ite/
- Details (submitted to JMLR): http://arxiv.org/pdf/1411.2066

Thank you for the attention!

Acknowledgments: This work was supported by the Gatsby Charitable Foundation, and by NSF grants IIS1247658 and IIS1250350. The work was carried out while Bharath K. Sriperumbudur was a research fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge, UK.

Appendix: contents

- Well-/misspecified assumptions.
- Topological definitions, separability.
- Vector-valued RKHS.
- Weak topology on $\mathcal{M}_1^+(D)$.
- Measurability of $\mu$.
- Universal kernel examples.

Well-specified case: $\rho \in \mathcal{P}(b, c)$

Let the covariance operator $T : H \to H$ be
$$T = \int_X K(\cdot, \mu_a) K^*(\cdot, \mu_a) \, d\rho_X(\mu_a)$$
with eigenvalues $t_n$ ($n = 1, 2, \ldots$).
Assumption: $\rho \in \mathcal{P}(b, c)$ = set of distributions on $X \times Y$ such that
1. $\alpha \le n^b t_n \le \beta$ ($\forall n \ge 1$; $\alpha > 0$, $\beta > 0$),
2. $\exists g \in H$ such that $f_\rho = T^{\frac{c-1}{2}} g$ with $\|g\|_H^2 \le R$ ($R > 0$),
where $b \in (1, \infty)$, $c \in [1, 2]$.
Intuition: $b$ – effective input dimension, $c$ – smoothness of $f_\rho$.
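Since the nonlinear kernels $K_G, K_e, K_C, K_t, K_i$ above depend on the embeddings only through $\|\mu_a - \mu_b\|_H$, they can be computed from the same pairwise inner products the set kernel already provides, via $\|\mu_a - \mu_b\|_H^2 = \langle \mu_a, \mu_a \rangle_H + \langle \mu_b, \mu_b \rangle_H - 2\langle \mu_a, \mu_b \rangle_H$. A minimal sketch (the helper names are hypothetical; `G` is assumed to be the matrix of linear set-kernel values $\langle \mu_a, \mu_b \rangle_H$):

```python
import numpy as np

def embedding_sq_distances(G):
    """Pairwise ||mu_a - mu_b||_H^2 from a Gram matrix of inner products
    G[a, b] = <mu_a, mu_b>_H (e.g., the linear set-kernel values)."""
    d = np.diag(G)
    sq = d[:, None] + d[None, :] - 2 * G
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by round-off

def gaussian_K(G, theta=1.0):
    """K_G(mu_a, mu_b) = exp(-||mu_a - mu_b||_H^2 / (2 theta^2)); h = 1."""
    return np.exp(-embedding_sq_distances(G) / (2 * theta**2))

def cauchy_K(G, theta=1.0):
    """K_C(mu_a, mu_b) = (1 + ||mu_a - mu_b||_H^2 / theta^2)^{-1}; h = 1."""
    return 1.0 / (1.0 + embedding_sq_distances(G) / theta**2)
```

The ridge step is unchanged: one simply feeds the transformed Gram matrix into the analytical solution (1) in place of the linear set-kernel matrix.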
Misspecified case: assumption

Let $\tilde{T}$ be the extension of $T$ from $H$ to $L^2_{\rho_X}$:
$$S_K^* : H \hookrightarrow L^2_{\rho_X}, \qquad S_K : L^2_{\rho_X} \to H, \quad (S_K g)(\mu_u) = \int_X K(\mu_u, \mu_t) g(\mu_t) \, d\rho_X(\mu_t),$$
$$\tilde{T} = S_K^* S_K : L^2_{\rho_X} \to L^2_{\rho_X}.$$
Our range space assumption on $\rho$: $f_\rho \in \operatorname{Im} \tilde{T}^s$ for some $s \ge 0$.

Misspecified case: note on the separability of $L^2_{\rho_X}$

$L^2_{\rho_X}$ is separable $\Leftrightarrow$ the measure space with $d(A, B) = \rho_X(A \,\triangle\, B)$ is so [Thomson et al., 2008].

Topological space, open sets

Given: a set $D \ne \emptyset$. $\tau \subseteq 2^D$ is called a topology on $D$ if:
1. $\emptyset \in \tau$, $D \in \tau$.
2. Finite intersection: $O_1 \in \tau$, $O_2 \in \tau$ $\Rightarrow$ $O_1 \cap O_2 \in \tau$.
3. Arbitrary union: $O_i \in \tau$ ($i \in I$) $\Rightarrow$ $\cup_{i \in I} O_i \in \tau$.
Then $(D, \tau)$ is called a topological space; the $O \in \tau$ are the open sets.

Closed-, compact set, closure, dense subset, separability

Given: $(D, \tau)$. $A \subseteq D$ is
- closed if $D \setminus A \in \tau$ (i.e., its complement is open),
- compact if for any family $(O_i)_{i \in I}$ of open sets with $A \subseteq \cup_{i \in I} O_i$, $\exists i_1, \ldots, i_n \in I$ with $A \subseteq \cup_{j=1}^{n} O_{i_j}$.
Closure of $A \subseteq D$:
$$\bar{A} := \bigcap_{A \subseteq C \text{ closed in } D} C. \quad (6)$$
$A \subseteq D$ is dense if $\bar{A} = D$. $(D, \tau)$ is separable if $\exists$ a countable, dense subset of $D$. Counterexample: $l^\infty / L^\infty$.

Vector-valued RKHS

Definition: a Hilbert space $\mathcal{H} \subseteq Y^X$ of functions is an RKHS if
$$A_{\mu_x, y} : f \mapsto \langle y, f(\mu_x) \rangle_Y \quad (7)$$
is continuous for all $\mu_x \in X$, $y \in Y$, i.e., the evaluation functional is continuous in every direction.
Riesz representation theorem $\Rightarrow$ $\exists K_{\mu_t} \in L(Y, \mathcal{H})$:
$$K(\mu_x, \mu_t)(y) = (K_{\mu_t} y)(\mu_x), \quad (8) \qquad K(\cdot, \mu_t)(y) = K_{\mu_t} y \quad (\forall \mu_x, \mu_t \in X), \quad (9)$$
or shortly $\mathcal{H}(K) = \overline{\operatorname{span}}\{K_{\mu_t} y : \mu_t \in X, y \in Y\}$.

Vector-valued RKHS – continued

Examples ($Y = \mathbb{R}^d$):
1. $K_i : X \times X \to \mathbb{R}$ kernels ($i = 1, \ldots, d$). Diagonal kernel:
$$K(\mu_a, \mu_b) = \operatorname{diag}\left(K_1(\mu_a, \mu_b), \ldots, K_d(\mu_a, \mu_b)\right). \quad (10)$$
2. Combination of diagonal kernels $D_j$ [$D_j(\mu_a, \mu_b) \in \mathbb{R}^{r \times r}$, $A_j \in \mathbb{R}^{r \times d}$]:
$$K(\mu_a, \mu_b) = \sum_{j=1}^{m} A_j^* D_j(\mu_a, \mu_b) A_j. \quad (11)$$

Weak topology on $\mathcal{M}_1^+(D)$

Def.: it is the weakest topology such that the mapping
$$L_h : (\mathcal{M}_1^+(D), \tau_w) \to \mathbb{R}, \qquad L_h(x) = \int_D h(u) \, dx(u)$$
is continuous for all $h \in C_b(D)$, where $C_b(D) = \{(D, \tau) \to \mathbb{R} \text{ bounded, continuous functions}\}$.

Measurability of $\mu$

- $k$: bounded, continuous $\Rightarrow$ $\mu : (\mathcal{M}_1^+(D), \mathcal{B}(\tau_w)) \to (H, \mathcal{B}(H))$ is measurable.
- $\mu$ measurable, $X \in \mathcal{B}(H)$ $\Rightarrow$ $\rho$ on $X \times Y$ is well-defined.
- If $D$ is compact metric and $k$ is universal, then $\mu$ is continuous and $X \in \mathcal{B}(H)$.

Universal kernel examples

On every compact subset of $\mathbb{R}^d$:
- $k(a, b) = e^{-\|a - b\|_2^2 / (2\sigma^2)}$ ($\sigma > 0$),
- $k(a, b) = e^{\beta \langle a, b \rangle}$ ($\beta > 0$), or more generally
- $k(a, b) = f(\langle a, b \rangle)$, $f(x) = \sum_{n=0}^{\infty} a_n x^n$ ($\forall a_n > 0$),
- $k(a, b) = (1 - \langle a, b \rangle)^{-\alpha}$ ($\alpha > 0$).

References

Chen, Y. and Wu, O. (2012). Contextual Hausdorff dissimilarity for multi-instance clustering. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 870–873.

Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005). Semigroup kernels on measures. Journal of Machine Learning Research, 6:1169–1198.

Edgar, G. (1995). Measure, Topology and Fractal Geometry. Springer-Verlag.

Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002). Multi-instance kernels. In International Conference on Machine Learning (ICML), pages 179–186.

Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. (http://cbse.soe.ucsc.edu/sites/default/files/convolutions.pdf)

Hein, M. and Bousquet, O. (2005). Hilbertian metrics and positive definite kernels on probability measures. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 136–143.

Jebara, T., Kondor, R., and Howard, A. (2004). Probability product kernels. Journal of Machine Learning Research, 5:819–844.

Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009). Nonextensive information theoretical kernels on measures. Journal of Machine Learning Research, 10:935–975.

Nielsen, F. and Nock, R. (2012). A closed-form expression for the Sharma-Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45:032003.

Oliva, J. B., Neiswanger, W., Póczos, B., Schneider, J., and Xing, E. (2014). Fast distribution to real regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.

Póczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013). Distribution-free distribution regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 31:507–515.

Póczos, B., Xiong, L., and Schneider, J. (2011). Nonparametric divergence estimation with applications to machine learning on distributions. In Uncertainty in Artificial Intelligence (UAI), pages 599–608.

Thomson, B. S., Bruckner, J. B., and Bruckner, A. M. (2008). Real Analysis. Prentice-Hall.

Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009). Closed-form Jensen-Rényi divergence for mixture of Gaussians and applications to group-wise shape registration. Medical Image Computing and Computer-Assisted Intervention, 12:648–655.

Wang, J. and Zucker, J.-D. (2000). Solving the multiple-instance problem: A lazy learning approach. In International Conference on Machine Learning (ICML), pages 1119–1126.

Wang, Z., Lan, L., and Vucetic, S. (2012). Mixture model for multiple instance regression and applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 50:2226–2237.

Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010). Identifying multi-instance outliers. In SIAM International Conference on Data Mining (SDM), pages 430–441.

Zhang, M.-L. and Zhou, Z.-H. (2009). Multi-instance clustering with applications to multi-instance prediction. Applied Intelligence, 31:47–68.

Zhou, S. K. and Chellappa, R. (2006). From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.