Consistent Vector-valued Distribution Regression

Zoltán Szabó
Joint work with Arthur Gretton (UCL), Barnabás Póczos (CMU), Bharath K. Sriperumbudur (PSU)

UCL Workshop on the Theory of Big Data, January 8, 2015

The task

Samples: $\{(x_i, y_i)\}_{i=1}^{l}$. Goal: find $f \in \mathcal{H}$ such that $f(x_i) \approx y_i$.
Distribution regression: the $x_i$-s are distributions, each available only through samples $\{x_{i,n}\}_{n=1}^{N}$.
$\Rightarrow$ Training examples: labelled bags.

Example: aerosol prediction from satellite images

Bag := points of a multispectral satellite image over an area.
Label of a bag := aerosol value.
Engineered methods [Wang et al., 2012]: $100 \times \mathrm{RMSE} = 7.5$–$8.5$.
Using distribution regression, without domain knowledge: $100 \times \mathrm{RMSE} = 7.81$.

Wider context

Context:
- machine learning: multi-instance learning,
- statistics: point estimation tasks (without analytical formula).

Applications:
- computer vision: image = collection of patch vectors,
- network analysis: group of people = bag of friendship graphs,
- natural language processing: corpus = bag of documents,
- time-series modelling: user = set of trial time-series.

Several algorithmic approaches

1. Parametric fit: Gaussian, mixture of Gaussians, exponential family [Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].
2. Kernelized Gaussian measures [Jebara et al., 2004, Zhou and Chellappa, 2006].
3. (Positive definite) kernels [Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].
4. Divergence measures (KL, Rényi, Tsallis) [Póczos et al., 2011].
5. Set metrics: Hausdorff metric [Edgar, 1995]; variants [Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].

Theoretical guarantee?

MIL dates back to [Haussler, 1999, Gärtner et al., 2002].
Sensible methods in regression require density estimation [Póczos et al., 2013, Oliva et al., 2014], plus assumptions:
1. compact Euclidean domain,
2. output $= \mathbb{R}$.

Problem formulation

Given: labelled bags $\hat{z} = \{(\hat{x}_i, y_i)\}_{i=1}^{l}$, where the $i$-th bag $\hat{x}_i = \{x_{i,1}, \ldots, x_{i,N}\} \overset{\mathrm{i.i.d.}}{\sim} x_i \in \mathcal{M}_1^+(D)$, and $y_i \in Y$.
Task: find an $\mathcal{M}_1^+(D) \to Y$ mapping based on $\hat{z}$.
Construction: distribution embedding ($\mu_x$) + ridge regression:
$$\mathcal{M}_1^+(D) \xrightarrow{\mu = \mu(k)} X \subseteq H = H(k) \xrightarrow{f \in \mathcal{H} = \mathcal{H}(K)} Y.$$
Our goal: risk bound compared to the regression function
$$f_\rho(\mu_x) = \int_Y y \, d\rho(y \mid \mu_x).$$

Goal in details

Contribution: analysis of the excess risk
$$\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) = \mathcal{R}[f_{\hat{z}}^{\lambda}] - \mathcal{R}[f_\rho] \le g(l, N, \lambda) \to 0$$
and rates, where
$$\mathcal{R}[f] = \mathbb{E}_{(x,y)} \|f(\mu_x) - y\|_Y^2 \quad \text{(expected risk)},$$
$$f_{\hat{z}}^{\lambda} = \arg\min_{f \in \mathcal{H}} \frac{1}{l} \sum_{i=1}^{l} \|f(\mu_{\hat{x}_i}) - y_i\|_Y^2 + \lambda \|f\|_{\mathcal{H}}^2 \quad (\lambda > 0).$$
We consider two settings:
1. well-specified case: $f_\rho \in \mathcal{H}$,
2. misspecified case: $f_\rho \in L^2_{\rho_X} \setminus \mathcal{H}$.

Kernel ($k$, $K$), RKHS

$k : D \times D \to \mathbb{R}$ is a kernel on $D$ if $\exists \varphi : D \to H$(ilbert) such that $k(a,b) = \langle \varphi(a), \varphi(b) \rangle_H$.
$\exists!$ RKHS: $H(k) = \{D \to \mathbb{R} \text{ functions}\}$, $\varphi(u) = k(\cdot, u)$.
Kernel examples for $D = \mathbb{R}^d$ ($p > 0$, $\theta > 0$):
- $k(a,b) = (\langle a, b \rangle + \theta)^p$: polynomial,
- $k(a,b) = e^{-\|a-b\|_2^2/(2\theta^2)}$: Gaussian,
- kernels on graphs, texts, time series, distributions.
Note: $\mathcal{H}(K) = \{X \to Y \text{ functions}\}$, $K(\mu_x, \mu_{x'}) \in L(Y)$.

Step-1 (distribution embedding): $\mathcal{M}_1^+(D) \xrightarrow{\mu} X \subseteq H(k)$

Given: kernel $k : D \times D \to \mathbb{R}$.
Mean embedding of distributions $x, \hat{x}_i \in \mathcal{M}_1^+(D)$:
$$\mu_x = \int_D k(\cdot, u) \, dx(u) \in H(k), \qquad \mu_{\hat{x}_i} = \int_D k(\cdot, u) \, d\hat{x}_i(u) = \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_{i,n}).$$
$Y = \mathbb{R}$, linear $K$ $\Rightarrow$ set kernel:
$$K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = \langle \mu_{\hat{x}_i}, \mu_{\hat{x}_j} \rangle_H = \frac{1}{N^2} \sum_{n,m=1}^{N} k(x_{i,n}, x_{j,m}).$$

Step-2 (ridge regression): analytical solution

Given: training sample $\hat{z}$, test distribution $t$. Prediction:
$$(f_{\hat{z}}^{\lambda} \circ \mu)(t) = \mathbf{k}(\mathbf{K} + l\lambda I_l)^{-1}[y_1; \ldots; y_l], \quad (1)$$
$$\mathbf{K} = [K_{ij}] = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in L(Y)^{l \times l}, \quad (2)$$
$$\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t), \ldots, K(\mu_{\hat{x}_l}, \mu_t)] \in L(Y)^{1 \times l}. \quad (3)$$
Specially: $Y = \mathbb{R}$ $\Rightarrow$ $L(Y) = \mathbb{R}$; $Y = \mathbb{R}^d$ $\Rightarrow$ $L(Y) = \mathbb{R}^{d \times d}$.

Blanket assumptions

- $D$: separable, topological domain.
- $k$: bounded, continuous.
- $K$: bounded, Hölder continuous ($h \in (0, 1]$: exponent).
- $X = \mu\!\left(\mathcal{M}_1^+(D)\right) \in \mathcal{B}(H)$.
- $Y$: separable Hilbert space.

Performance guarantees (in human-readable format)

If in addition
1. well-specified case: $f_\rho$ is 'c-smooth' with 'b-decaying covariance operator' and $l \ge \lambda^{-\frac{1}{b}-1}$, then
$$\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \le \frac{\log^h(l)}{N^{h/2}\lambda^3} + \lambda^c + \frac{1}{l^2\lambda} + \frac{1}{l\lambda^{1/b}}; \quad (4)$$
2. misspecified case: $f_\rho$ is 's-smooth', $L^2_{\rho_X}$ is separable, and $\frac{1}{\lambda^2} \le l$, then
$$\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \le \frac{\log^{h/2}(l)}{N^{h/2}\lambda^{3/2}} + \frac{1}{\sqrt{l}\,\lambda} + \frac{\lambda^{\min(1,s)}}{\sqrt{\lambda}\,\sqrt{l}} + \lambda^{\min(1,s)}. \quad (5)$$

Performance guarantee: example

Misspecified case: assume $s \ge 1$, $h = 1$ ($K$: Lipschitz); setting the 1st and 3rd terms of (5) equal gives $\lambda$; $l = N^a$ ($a > 0$); $t = lN$: total number of samples processed. Then
1. $s = 1$ ('most difficult' task): $\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \approx t^{-0.25}$,
2. $s \to \infty$ ('simplest' problem): $\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \approx t^{-0.5}$.
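For $Y = \mathbb{R}$ with the linear (set) kernel, Step-1 and Step-2 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the ITE implementation: the helper names (`base_kernel`, `set_kernel`, `fit_predict`) and the Gaussian choice of the base kernel $k$ are assumptions; the ridge step implements prediction formula (1).

```python
import numpy as np

def base_kernel(A, B, theta=1.0):
    """Gaussian base kernel k(a, b) = exp(-||a - b||^2 / (2 theta^2)),
    evaluated between all rows of A (n x d) and B (m x d)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * theta**2))

def set_kernel(bag_i, bag_j, theta=1.0):
    """Set kernel K(mu_i, mu_j) = (1 / (N_i N_j)) sum_{n,m} k(x_{i,n}, x_{j,m}):
    the inner product of the two empirical mean embeddings in H(k)."""
    return base_kernel(bag_i, bag_j, theta).mean()

def fit_predict(train_bags, y, test_bags, lam=1e-3, theta=1.0):
    """Two-stage distribution regression with Y = R and linear K.
    Prediction (1): f(t) = k_t (K + l*lam*I)^{-1} [y_1; ...; y_l]."""
    l = len(train_bags)
    K = np.array([[set_kernel(bi, bj, theta) for bj in train_bags]
                  for bi in train_bags])                 # (2): l x l Gram matrix
    alpha = np.linalg.solve(K + l * lam * np.eye(l), y)  # ridge coefficients
    k_t = np.array([[set_kernel(t, bi, theta) for bi in train_bags]
                    for t in test_bags])                 # (3): test-train kernels
    return k_t @ alpha
```

Bags of different sizes are handled naturally, since the set kernel is simply the mean of all pairwise base-kernel evaluations; only the Gram-matrix entries change if a nonlinear $K$ is used, while the ridge step (1) stays identical.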
Nonlinear K examples

$Y = \mathbb{R}$; $D$: compact, metric; $k$: universal $\Rightarrow$ Hölder continuous $K$-s (with exponent $h$):
- $K_G(\mu_a, \mu_b) = e^{-\|\mu_a - \mu_b\|_H^2 / (2\theta^2)}$ (Gaussian): $h = 1$,
- $K_e(\mu_a, \mu_b) = e^{-\|\mu_a - \mu_b\|_H / (2\theta^2)}$ (exponential): $h = \frac{1}{2}$,
- $K_C(\mu_a, \mu_b) = \left(1 + \|\mu_a - \mu_b\|_H^2 / \theta^2\right)^{-1}$ (Cauchy): $h = 1$,
- $K_t(\mu_a, \mu_b) = \left(1 + \|\mu_a - \mu_b\|_H^{\theta}\right)^{-1}$ ($\theta \le 2$): $h = \frac{\theta}{2}$,
- $K_i(\mu_a, \mu_b) = \left(\|\mu_a - \mu_b\|_H^2 + \theta^2\right)^{-1/2}$ (inverse multiquadric): $h = 1$.
They are functions of $\|\mu_a - \mu_b\|_H$ $\Rightarrow$ computation: similar to the set kernel.

Summary

- Problem: distribution regression.
- Literature: large number of heuristics.
- Contribution: a simple ridge solution is consistent; specially, the set kernel is so (a 15-year-old open question).
- Code $\in$ ITE toolbox: https://bitbucket.org/szzoli/ite/
- Details (submitted to JMLR): http://arxiv.org/pdf/1411.2066

Thank you for the attention!

Acknowledgments: This work was supported by the Gatsby Charitable Foundation, and by NSF grants IIS1247658 and IIS1250350. The work was carried out while Bharath K. Sriperumbudur was a research fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge, UK.

Appendix: contents

- Well-/misspecified assumptions.
- Topological definitions, separability.
- Vector-valued RKHS.
- Weak topology on $\mathcal{M}_1^+(D)$.
- Measurability of $\mu$.
- Universal kernel examples.

Well-specified case: $\rho \in \mathcal{P}(b, c)$

Let the covariance operator $T : H \to H$ be
$$T = \int_X K(\cdot, \mu_a) K^*(\cdot, \mu_a) \, d\rho_X(\mu_a)$$
with eigenvalues $t_n$ ($n = 1, 2, \ldots$).
Assumption: $\rho \in \mathcal{P}(b, c)$ = set of distributions on $X \times Y$ such that
1. $\alpha \le n^b t_n \le \beta$ ($\forall n \ge 1$; $\alpha > 0$, $\beta > 0$),
2. $\exists g \in H$ such that $f_\rho = T^{\frac{c-1}{2}} g$ with $\|g\|_H^2 \le R$ ($R > 0$),
where $b \in (1, \infty)$, $c \in [1, 2]$.
Intuition: $b$ – effective input dimension, $c$ – smoothness of $f_\rho$.
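Since the nonlinear kernels $K_G, K_e, K_C, K_t, K_i$ above depend on the embeddings only through $\|\mu_a - \mu_b\|_H$, they can be computed from the same pairwise inner products the set kernel already provides, via $\|\mu_a - \mu_b\|_H^2 = \langle \mu_a, \mu_a \rangle_H + \langle \mu_b, \mu_b \rangle_H - 2\langle \mu_a, \mu_b \rangle_H$. A minimal sketch (the helper names are hypothetical; `G` is assumed to be the matrix of linear set-kernel values $\langle \mu_a, \mu_b \rangle_H$):

```python
import numpy as np

def embedding_sq_distances(G):
    """Pairwise ||mu_a - mu_b||_H^2 from a Gram matrix of inner products
    G[a, b] = <mu_a, mu_b>_H (e.g., the linear set-kernel values)."""
    d = np.diag(G)
    sq = d[:, None] + d[None, :] - 2 * G
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by round-off

def gaussian_K(G, theta=1.0):
    """K_G(mu_a, mu_b) = exp(-||mu_a - mu_b||_H^2 / (2 theta^2)); h = 1."""
    return np.exp(-embedding_sq_distances(G) / (2 * theta**2))

def cauchy_K(G, theta=1.0):
    """K_C(mu_a, mu_b) = (1 + ||mu_a - mu_b||_H^2 / theta^2)^{-1}; h = 1."""
    return 1.0 / (1.0 + embedding_sq_distances(G) / theta**2)
```

The ridge step is unchanged: one simply feeds the transformed Gram matrix into the analytical solution (1) in place of the linear set-kernel matrix.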
Misspecified case: assumption

Let $\tilde{T}$ be the extension of $T$ from $H$ to $L^2_{\rho_X}$:
$$S_K^* : H \hookrightarrow L^2_{\rho_X}, \qquad S_K : L^2_{\rho_X} \to H, \quad (S_K g)(\mu_u) = \int_X K(\mu_u, \mu_t) g(\mu_t) \, d\rho_X(\mu_t),$$
$$\tilde{T} = S_K^* S_K : L^2_{\rho_X} \to L^2_{\rho_X}.$$
Our range space assumption on $\rho$: $f_\rho \in \operatorname{Im} \tilde{T}^s$ for some $s \ge 0$.

Misspecified case: note on the separability of $L^2_{\rho_X}$

$L^2_{\rho_X}$ is separable $\Leftrightarrow$ the measure space with $d(A, B) = \rho_X(A \,\triangle\, B)$ is so [Thomson et al., 2008].

Topological space, open sets

Given: a set $D \ne \emptyset$. $\tau \subseteq 2^D$ is called a topology on $D$ if:
1. $\emptyset \in \tau$, $D \in \tau$.
2. Finite intersection: $O_1 \in \tau$, $O_2 \in \tau$ $\Rightarrow$ $O_1 \cap O_2 \in \tau$.
3. Arbitrary union: $O_i \in \tau$ ($i \in I$) $\Rightarrow$ $\cup_{i \in I} O_i \in \tau$.
Then $(D, \tau)$ is called a topological space; the $O \in \tau$ are the open sets.

Closed-, compact set, closure, dense subset, separability

Given: $(D, \tau)$. $A \subseteq D$ is
- closed if $D \setminus A \in \tau$ (i.e., its complement is open),
- compact if for any family $(O_i)_{i \in I}$ of open sets with $A \subseteq \cup_{i \in I} O_i$, $\exists i_1, \ldots, i_n \in I$ with $A \subseteq \cup_{j=1}^{n} O_{i_j}$.
Closure of $A \subseteq D$:
$$\bar{A} := \bigcap_{A \subseteq C \text{ closed in } D} C. \quad (6)$$
$A \subseteq D$ is dense if $\bar{A} = D$. $(D, \tau)$ is separable if $\exists$ a countable, dense subset of $D$. Counterexample: $l^\infty / L^\infty$.

Vector-valued RKHS

Definition: a Hilbert space $\mathcal{H} \subseteq Y^X$ of functions is an RKHS if
$$A_{\mu_x, y} : f \mapsto \langle y, f(\mu_x) \rangle_Y \quad (7)$$
is continuous for all $\mu_x \in X$, $y \in Y$, i.e., the evaluation functional is continuous in every direction.
Riesz representation theorem $\Rightarrow$ $\exists K_{\mu_t} \in L(Y, \mathcal{H})$:
$$K(\mu_x, \mu_t)(y) = (K_{\mu_t} y)(\mu_x), \quad (8) \qquad K(\cdot, \mu_t)(y) = K_{\mu_t} y \quad (\forall \mu_x, \mu_t \in X), \quad (9)$$
or shortly $\mathcal{H}(K) = \overline{\operatorname{span}}\{K_{\mu_t} y : \mu_t \in X, y \in Y\}$.

Vector-valued RKHS – continued

Examples ($Y = \mathbb{R}^d$):
1. $K_i : X \times X \to \mathbb{R}$ kernels ($i = 1, \ldots, d$). Diagonal kernel:
$$K(\mu_a, \mu_b) = \operatorname{diag}\left(K_1(\mu_a, \mu_b), \ldots, K_d(\mu_a, \mu_b)\right). \quad (10)$$
2. Combination of diagonal kernels $D_j$ [$D_j(\mu_a, \mu_b) \in \mathbb{R}^{r \times r}$, $A_j \in \mathbb{R}^{r \times d}$]:
$$K(\mu_a, \mu_b) = \sum_{j=1}^{m} A_j^* D_j(\mu_a, \mu_b) A_j. \quad (11)$$

Weak topology on $\mathcal{M}_1^+(D)$

Def.: it is the weakest topology such that the mapping
$$L_h : (\mathcal{M}_1^+(D), \tau_w) \to \mathbb{R}, \qquad L_h(x) = \int_D h(u) \, dx(u)$$
is continuous for all $h \in C_b(D)$, where $C_b(D) = \{(D, \tau) \to \mathbb{R} \text{ bounded, continuous functions}\}$.

Measurability of $\mu$

- $k$: bounded, continuous $\Rightarrow$ $\mu : (\mathcal{M}_1^+(D), \mathcal{B}(\tau_w)) \to (H, \mathcal{B}(H))$ is measurable.
- $\mu$ measurable, $X \in \mathcal{B}(H)$ $\Rightarrow$ $\rho$ on $X \times Y$ is well-defined.
- If $D$ is compact metric and $k$ is universal, then $\mu$ is continuous and $X \in \mathcal{B}(H)$.

Universal kernel examples

On every compact subset of $\mathbb{R}^d$:
- $k(a, b) = e^{-\|a - b\|_2^2 / (2\sigma^2)}$ ($\sigma > 0$),
- $k(a, b) = e^{\beta \langle a, b \rangle}$ ($\beta > 0$), or more generally
- $k(a, b) = f(\langle a, b \rangle)$, $f(x) = \sum_{n=0}^{\infty} a_n x^n$ ($\forall a_n > 0$),
- $k(a, b) = (1 - \langle a, b \rangle)^{-\alpha}$ ($\alpha > 0$).

References

Chen, Y. and Wu, O. (2012). Contextual Hausdorff dissimilarity for multi-instance clustering. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 870–873.

Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005). Semigroup kernels on measures. Journal of Machine Learning Research, 6:1169–1198.

Edgar, G. (1995). Measure, Topology and Fractal Geometry. Springer-Verlag.

Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002). Multi-instance kernels. In International Conference on Machine Learning (ICML), pages 179–186.

Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. (http://cbse.soe.ucsc.edu/sites/default/files/convolutions.pdf)

Hein, M. and Bousquet, O. (2005). Hilbertian metrics and positive definite kernels on probability measures. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 136–143.

Jebara, T., Kondor, R., and Howard, A. (2004). Probability product kernels. Journal of Machine Learning Research, 5:819–844.

Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009). Nonextensive information theoretical kernels on measures. Journal of Machine Learning Research, 10:935–975.

Nielsen, F. and Nock, R. (2012). A closed-form expression for the Sharma-Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45:032003.

Oliva, J. B., Neiswanger, W., Póczos, B., Schneider, J., and Xing, E. (2014). Fast distribution to real regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.

Póczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013). Distribution-free distribution regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 31:507–515.

Póczos, B., Xiong, L., and Schneider, J. (2011). Nonparametric divergence estimation with applications to machine learning on distributions. In Uncertainty in Artificial Intelligence (UAI), pages 599–608.

Thomson, B. S., Bruckner, J. B., and Bruckner, A. M. (2008). Real Analysis. Prentice-Hall.

Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009). Closed-form Jensen-Rényi divergence for mixture of Gaussians and applications to group-wise shape registration. Medical Image Computing and Computer-Assisted Intervention, 12:648–655.

Wang, J. and Zucker, J.-D. (2000). Solving the multiple-instance problem: A lazy learning approach. In International Conference on Machine Learning (ICML), pages 1119–1126.

Wang, Z., Lan, L., and Vucetic, S. (2012). Mixture model for multiple instance regression and applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 50:2226–2237.

Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010). Identifying multi-instance outliers. In SIAM International Conference on Data Mining (SDM), pages 430–441.

Zhang, M.-L. and Zhou, Z.-H. (2009). Multi-instance clustering with applications to multi-instance prediction. Applied Intelligence, 31:47–68.

Zhou, S. K. and Chellappa, R. (2006). From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.