Distribution Regression: Computational & Statistical Tradeoffs

Zoltán Szabó (Gatsby Unit, UCL)

Joint work with
- Bharath K. Sriperumbudur (Department of Statistics, PSU),
- Barnabás Póczos (ML Department, CMU),
- Arthur Gretton (Gatsby Unit, UCL).

Princeton University, November 26, 2015.

Outline

- Motivation: application + theory.
- Problem formulation.
- Results: computational & statistical tradeoffs.
- Numerical examples.

The task

- Samples: $\{(x_i, y_i)\}_{i=1}^{\ell}$. Find $f \in \mathcal{H}$ such that $f(x_i) \approx y_i$.
- Distribution regression: the $x_i$ are distributions, each available only through samples $\{x_{i,n}\}_{n=1}^{N}$: labelled bags.
- Goal: the computational & statistical tradeoffs implied by the bag size $N := N_i$ ($\forall i$). (A minimal synthetic sketch of this data format follows below.)
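To make the two-stage sampled data format concrete, here is a minimal sketch (the function name `make_bags` and the Gaussian-bag setup are my own illustrative assumptions, not from the talk) generating $\ell$ labelled bags, each holding $N$ samples from an unobserved distribution $x_i$; the label is a function of $x_i$ (here its entropy, echoing the demo later in the talk):

```python
import numpy as np

def make_bags(num_bags=100, bag_size=50, rng=None):
    """Generate labelled bags: bag i holds N samples from an unseen
    1D Gaussian x_i; the label is the entropy of x_i (illustrative choice)."""
    rng = np.random.default_rng(rng)
    bags, labels = [], []
    for _ in range(num_bags):
        sigma = rng.uniform(0.5, 2.0)  # unobserved distribution parameter
        bags.append(rng.normal(0.0, sigma, size=(bag_size, 1)))
        labels.append(0.5 * np.log(2 * np.pi * np.e * sigma**2))  # Gaussian entropy
    return bags, np.array(labels)

bags, y = make_bags()
print(len(bags), bags[0].shape, y[:3])  # 100 bags, each of shape (50, 1); real labels
```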
Motivation (application): aerosol prediction

- Bag := pixels of a multispectral satellite image over an area.
- Label of a bag := aerosol value.
- Relevance: climate research, sustainability.
- Engineered methods [Wang et al., 2012]: 100 × RMSE = 7.5–8.5.
- Using distribution regression?

Wider context

- Context:
  - machine learning: multi-instance learning,
  - statistics: point estimation tasks (without an analytical formula).
- Applications:
  - computer vision: image = collection of patch vectors,
  - network analysis: group of people = bag of friendship graphs,
  - natural language processing: corpus = bag of documents,
  - time-series modelling: user = set of trial time series.

Several algorithmic approaches

1. Parametric fit: Gaussian, MOG, exponential family [Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].
2. Kernelized Gaussian measures [Jebara et al., 2004, Zhou and Chellappa, 2006].
3. (Positive definite) kernels [Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].
4. Divergence measures (KL, Rényi, Tsallis, ...) [Póczos et al., 2011, Kandasamy et al., 2015].
5. Set metrics: Hausdorff metric [Edgar, 1995]; variants [Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].

Motivation: theory

- MIL dates back to [Haussler, 1999, Gärtner et al., 2002].
- Sensible existing regression methods require density estimation [Póczos et al., 2013, Oliva et al., 2014, Reddi and Póczos, 2014, Sutherland et al., 2015] plus assumptions:
  1. compact Euclidean domain,
  2. output = $\mathbb{R}$ ([Oliva et al., 2013, Oliva et al., 2015]: distribution/function output).

Input-output requirements

- Input: distributions on 'structured' domains $D$ (kernels).
- Output: simplest case $Y = \mathbb{R}$, but dependencies might matter: $Y = \mathbb{R}^d$ (or a separable Hilbert space).

Kernel, RKHS

- $k : D \times D \to \mathbb{R}$ is a kernel on $D$ if there exists a feature map $\varphi : D \to H$ (a Hilbert space) with $k(a, b) = \langle \varphi(a), \varphi(b) \rangle_H$ ($\forall a, b \in D$).
- Kernel examples on $D = \mathbb{R}^d$ ($p > 0$, $\theta > 0$); see the code sketch below:
  - $k(a, b) = (\langle a, b \rangle + \theta)^p$: polynomial,
  - $k(a, b) = e^{-\|a - b\|_2^2 / (2\theta^2)}$: Gaussian,
  - $k(a, b) = e^{-\theta \|a - b\|_1}$: Laplacian.
- In the RKHS $H = H(k)$ (which exists and is unique): $\varphi(u) = k(\cdot, u)$.

Kernel: example domains ($D$)

- Euclidean space ($D = \mathbb{R}^d$),
- graphs, texts, time series, dynamical systems,
- distributions!
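As a quick illustration of the $\mathbb{R}^d$ kernel examples above, here is a minimal sketch (the function names are mine, not from the talk) that evaluates the polynomial, Gaussian, and Laplacian kernels:

```python
import numpy as np

def polynomial_kernel(a, b, theta=1.0, p=2):
    # k(a, b) = (<a, b> + theta)^p
    return (a @ b + theta) ** p

def gaussian_kernel(a, b, theta=1.0):
    # k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * theta ** 2))

def laplacian_kernel(a, b, theta=1.0):
    # k(a, b) = exp(-theta * ||a - b||_1)
    return np.exp(-theta * np.sum(np.abs(a - b)))

a, b = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(polynomial_kernel(a, b), gaussian_kernel(a, b), laplacian_kernel(a, b))
```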
Problem formulation ($Y = \mathbb{R}$)

- Given: $\ell$ labelled bags $\hat{z} = \{(\hat{x}_i, y_i)\}_{i=1}^{\ell}$, i.i.d.
- $i$-th bag: $\hat{x}_i = \{x_{i,1}, \ldots, x_{i,N}\} \sim x_i \in \mathcal{P}(D)$, $y_i \in \mathbb{R}$.
- Task: find a $\mathcal{P}(D) \to \mathbb{R}$ mapping based on $\hat{z}$.
- Construction: mean embedding ($\mu_x$) + ridge regression:
$$\mathcal{P}(D) \underbrace{\xrightarrow{\ \mu = \mu(k)\ }}_{\text{②: two-stage sampling}} X \subseteq H = H(k) \underbrace{\xrightarrow{\ f \in \mathcal{H} = \mathcal{H}(K)\ }}_{\text{①: Hilbert} \to \mathbb{R} \text{ regression}} \mathbb{R}.$$

①: Hilbert → $\mathbb{R}$ regression, well-specified case

- Regression function and expected risk (assume for a moment that $f_\rho \in \mathcal{H}$):
$$f_\rho(\mu_x) = \int_{\mathbb{R}} y \, d\rho(y \mid \mu_x), \qquad \mathcal{R}[f] = \mathbb{E}_{(x,y)} |f(\mu_x) - y|^2.$$
- Ridge estimator ($\lambda > 0$):
$$f_z^\lambda = \arg\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} |f(\mu_{x_i}) - y_i|^2 + \lambda \|f\|_{\mathcal{H}}^2.$$
- Excess risk: $\mathcal{E}(f_z^\lambda, f_\rho) = \mathcal{R}[f_z^\lambda] - \mathcal{R}[f_\rho]$.

①: Hilbert → $\mathbb{R}$ regression, known rate

- Known [Caponnetto and De Vito, 2007]: if $\rho(\mu_x, y) \in \mathcal{P}(b, c)$, then the best rate is achieved:
$$\mathcal{E}(f_z^\lambda, f_\rho) = O_p\big(\ell^{-\frac{bc}{bc+1}}\big) \qquad (b > 1,\ c \in (1, 2]).$$
- $\rho \in \mathcal{P}(b, c)$ means:
  - $T = \int_X K(\cdot, \mu_a) K^*(\cdot, \mu_a) \, d\rho_X(\mu_a) : \mathcal{H} \to \mathcal{H}$,
  - the eigenvalues of $T$ decay as $\lambda_n = O(n^{-b})$,
  - $f_\rho \in \mathrm{Im}\big(T^{\frac{c-1}{2}}\big)$.
- Intuition: $1/b$ = effective input dimension, $c$ = smoothness of $f_\rho$.

Our question

Can we reach this $O_p\big(\ell^{-\frac{bc}{bc+1}}\big)$ minimax rate, and with what bag size $N$?
$$f_{\hat{z}}^\lambda = \arg\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} |f(\mu_{\hat{x}_i}) - y_i|^2 + \lambda \|f\|_{\mathcal{H}}^2,$$
$$f_z^\lambda = \arg\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} |f(\mu_{x_i}) - y_i|^2 + \lambda \|f\|_{\mathcal{H}}^2.$$

②: mean embedding, $\mu_{x_i} \to \mu_{\hat{x}_i}$

- $k : D \times D \to \mathbb{R}$ kernel; canonical feature map: $\varphi(u) = k(\cdot, u)$.
- Mean embedding of the distributions $x, \hat{x}_i \in \mathcal{P}(D)$:
$$\mu_x = \int_D k(\cdot, u) \, dx(u) \in H(k), \qquad \mu_{\hat{x}_i} = \int_D k(\cdot, u) \, d\hat{x}_i(u) = \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_{i,n}).$$
- Linear $K$ ⇒ set kernel:
$$K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = \langle \mu_{\hat{x}_i}, \mu_{\hat{x}_j} \rangle_H = \frac{1}{N^2} \sum_{n,m=1}^{N} k(x_{i,n}, x_{j,m}).$$
- Nonlinear $K$ example: $K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = e^{-\|\mu_{\hat{x}_i} - \mu_{\hat{x}_j}\|_H^2 / (2\sigma^2)}$.

②: ridge regression ⇒ analytical solution

- Given: training sample $\hat{z}$, test distribution $t$.
- Prediction on $t$:
$$(f_{\hat{z}}^\lambda \circ \mu)(t) = \mathbf{k} (\mathbf{K} + \ell \lambda I_\ell)^{-1} [y_1; \ldots; y_\ell], \tag{1}$$
$$\mathbf{K} = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in \mathbb{R}^{\ell \times \ell}, \tag{2}$$
$$\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t), \ldots, K(\mu_{\hat{x}_\ell}, \mu_t)] \in \mathbb{R}^{1 \times \ell}. \tag{3}$$
- ⇒ only the values $K(\mu_x, \mu_{x'}) = \langle K(\cdot, \mu_x), K(\cdot, \mu_{x'}) \rangle_{\mathcal{H}(K)}$ matter. A minimal implementation sketch follows.
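Here is a minimal sketch of the analytical solution above with the linear $K$ (set kernel). The names (`set_kernel`, `merr_fit`, `merr_predict`) are mine, the Gaussian base kernel is one choice among the examples given earlier, and model selection for $\lambda$ and Gram-matrix caching are omitted for brevity:

```python
import numpy as np

def set_kernel(bag_a, bag_b, theta=1.0):
    """K(mu_a, mu_b) = (1/(Na*Nb)) * sum_{n,m} k(a_n, b_m), Gaussian base kernel k."""
    sq = np.sum((bag_a[:, None, :] - bag_b[None, :, :]) ** 2, axis=-1)
    return np.mean(np.exp(-sq / (2 * theta ** 2)))

def merr_fit(bags, y, lam=1e-3, theta=1.0):
    """Solve alpha = (K + l*lam*I)^{-1} y on the training bags."""
    l = len(bags)
    K = np.array([[set_kernel(bi, bj, theta) for bj in bags] for bi in bags])
    return np.linalg.solve(K + l * lam * np.eye(l), y)

def merr_predict(bags, alpha, test_bag, theta=1.0):
    """Prediction on a test bag: k (K + l*lam*I)^{-1} y = k @ alpha."""
    k = np.array([set_kernel(test_bag, b, theta) for b in bags])
    return k @ alpha

# usage, with the make_bags sketch from earlier:
# alpha = merr_fit(bags, y); y_hat = merr_predict(bags, alpha, bags[0])
```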
①-②: Why can we get consistency? (intuition)

- Convergence of the mean embedding:
$$\|\mu_x - \mu_{\hat{x}}\|_{H(k)} = O_p\big(1/\sqrt{N}\big).$$
- Hölder property of $K$ ($L > 0$, $0 < h \le 1$):
$$\|K(\cdot, \mu_x) - K(\cdot, \mu_{\hat{x}})\|_{\mathcal{H}(K)} \le L \, \|\mu_x - \mu_{\hat{x}}\|_{H(k)}^h.$$
- ⇒ $f_{\hat{z}}^\lambda$ depends 'nicely' on $\mu_{\hat{x}}$.

①-②: Proof idea

- Decomposing the excess risk and applying concentration, on $\mathcal{P}(b, c)$ we get a bound of the form (up to constants):
$$\mathcal{E}(f_{\hat{z}}^\lambda, f_\rho) \le \underbrace{\frac{\log^h(\ell)}{N^h \lambda} \left( \frac{1}{\lambda} + 1 \right)}_{\text{②: two-stage sampling}} + \underbrace{\lambda^c + \frac{1}{\ell^2 \lambda^2} + \frac{1}{\ell \lambda^{1 + \frac{1}{b}}} + \frac{1}{\ell \lambda^{\frac{2}{b}}}}_{\text{①: } \mathcal{H} \to \mathbb{R} \text{ regression}} \to 0,$$
subject to $\ell \lambda^{\frac{b+1}{b}} \ge 1$ and $\log(\ell)/\lambda^h \le N$.
- Let $N = \ell^{a/h} \log(\ell)$ ⇒ the first term and the constraints simplify. $a > 0$ is needed, i.e. $N > \log(\ell)$.
- A bias-variance trick with constraint checking then yields the tradeoffs below.

Computational & statistical tradeoff (W = well-specified)

Let $N = \ell^{a/h} \log(\ell)$ ($a > 0$). If
- $a \le \frac{b(c+1)}{bc+1}$, then $\mathcal{E}(f_{\hat{z}}^\lambda, f_\rho) = O_p\big(\ell^{-\frac{ac}{c+1}}\big)$ with $\lambda = \ell^{-\frac{a}{c+1}}$;
- $a \ge \frac{b(c+1)}{bc+1}$, then $\mathcal{E}(f_{\hat{z}}^\lambda, f_\rho) = O_p\big(\ell^{-\frac{bc}{bc+1}}\big)$ with $\lambda = \ell^{-\frac{b}{bc+1}}$.

Meaning:
- $a$-dependence ($N = \ell^{a/h} \log(\ell)$): smaller $a$ = computational saving, but reduced statistical efficiency.
- Sensible choice: $a \le \frac{b(c+1)}{bc+1} < 2$: with equality ('='), a bag size $N$ sub-quadratic in $\ell$ achieves the one-stage sampled minimax rate!
- $h$-dependence: a smoother kernel $K$ is rewarding (bag-size reduction); cf. the smoothness of $f_\rho$.
- $c$-dependence: $c \mapsto \frac{b(c+1)}{bc+1}$ is decreasing: smaller bags suffice for easier problems.

Misspecified setting

- Relevant case: $f_\rho \in L^2_{\rho_X} \setminus \mathcal{H}$.
- Difficulty parameter of $f_\rho$: $s \in (0, 1]$; larger $s$ = easier problem.
- Proof idea: $\infty$-dimensional exponential family fitting [Sriperumbudur et al., 2014], ridge solution.

Computational & statistical tradeoff (M = misspecified)

Let $N = \ell^{2a/h} \log(\ell)$ ($a > 0$). If
- $a \le \frac{s+1}{s+2}$, then $\mathcal{E}(f_{\hat{z}}^\lambda, f_\rho) = O_p\big(\ell^{-\frac{2sa}{s+1}}\big)$ with $\lambda = \ell^{-\frac{a}{s+1}}$;
- $a \ge \frac{s+1}{s+2}$, then $\mathcal{E}(f_{\hat{z}}^\lambda, f_\rho) = O_p\big(\ell^{-\frac{2s}{s+2}}\big)$ with $\lambda = \ell^{-\frac{1}{s+2}}$.

Meaning:
- $a$-dependence: smaller $a$ = computational saving, but reduced statistical efficiency. Sensible choice: $a \le \frac{s+1}{s+2} \le \frac{2}{3}$ ⇒ $2a \le \frac{4}{3} < 2$: $N$ can be sub-quadratic in $\ell$ again (with '=')!
- $h$-dependence: $h \mapsto \frac{2a}{h}$ is decreasing: a smoother kernel $K$ is rewarding.
- $s$-dependence: $s \mapsto \frac{2s}{s+2}$ is increasing, i.e. an easier task gives a better rate; $s \to 0$: arbitrarily slow rate; $s = 1$: $\ell^{-2/3}$ rate. (The short sketch below tabulates such rate exponents.)
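To make the tradeoff concrete, here is a small numerical sketch (my own illustration, not from the talk; the parameter values are arbitrary) evaluating the guaranteed rate exponent in the well-specified case as a function of $a$ for fixed $(b, c)$:

```python
# Well-specified case: the excess risk is O_p(l^{-r(a)}) with
# r(a) = a*c/(c+1) below the threshold a* = b(c+1)/(bc+1), saturating
# at the minimax exponent b*c/(bc+1) above it.
def rate_exponent(a, b=2.0, c=1.5):
    a_star = b * (c + 1) / (b * c + 1)  # threshold on a
    return a * c / (c + 1) if a <= a_star else b * c / (b * c + 1)

b, c = 2.0, 1.5
print("threshold a* =", b * (c + 1) / (b * c + 1))  # 1.25 < 2: sub-quadratic N suffices
for a in (0.5, 1.0, 1.25, 2.0):
    print(a, rate_exponent(a, b, c))  # the exponent grows with a, then saturates
```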
Optimality of the rate (M)

- Our rate: $r(\ell) = \ell^{-\frac{2s}{s+2}}$, under the range space assumption ($s$).
- One-stage sampled optimal rate: $r_o(\ell) = \ell^{-\frac{2s}{2s+1}}$ [Steinwart et al., 2009], under a range space assumption + an eigendecay constraint; $D$: compact metric, $Y = \mathbb{R}$.

[Figure: rate exponents vs. smoothness $s$; $-\log_\ell[r_o(\ell)] = \frac{2s}{2s+1}$ and $-\log_\ell[r(\ell)] = \frac{2s}{s+2}$ over $s \in (0, 1]$.]

Blanket assumptions: both settings

- $D$: separable, topological domain.
- $k$: bounded ($\sup_{u \in D} k(u, u) \le B_k \in (0, \infty)$), continuous.
- $K$: bounded; Hölder continuous: $\exists L > 0$, $h \in (0, 1]$ such that $\|K(\cdot, \mu_a) - K(\cdot, \mu_b)\|_{\mathcal{H}} \le L \|\mu_a - \mu_b\|_H^h$.
- $y$: bounded.

Hölder $K$ examples (other than the linear $K$, for which $h = 1$)

In case of a compact metric $D$ and universal $k$:
- $K_G$ (Gaussian): $e^{-\|\mu_a - \mu_b\|_H^2 / (2\theta^2)}$, $h = 1$;
- $K_e$ (exponential): $e^{-\|\mu_a - \mu_b\|_H / (2\theta^2)}$, $h = \frac{1}{2}$;
- $K_C$ (Cauchy): $\big(1 + \|\mu_a - \mu_b\|_H^2 / \theta^2\big)^{-1}$, $h = 1$;
- $K_t$ (generalized t-student): $\big(1 + \|\mu_a - \mu_b\|_H^\theta\big)^{-1}$ ($\theta \le 2$), $h = \frac{\theta}{2}$;
- $K_i$ (inverse multiquadric): $\big(\|\mu_a - \mu_b\|_H^2 + \theta^2\big)^{-\frac{1}{2}}$, $h = 1$.

All are functions of $\|\mu_a - \mu_b\|_H$ ⇒ their computation is similar to the set kernel (see the sketch below).
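Since each of these $K$'s depends only on $\|\mu_a - \mu_b\|_H$, it can be evaluated from the same pairwise base-kernel sums as the set kernel. A minimal sketch (my naming; it reuses the hypothetical `set_kernel` routine from the earlier MERR sketch, or any equivalent set-kernel implementation):

```python
import numpy as np

def embedding_distance_sq(bag_a, bag_b, theta=1.0):
    """||mu_a - mu_b||_H^2 = K(a,a) + K(b,b) - 2 K(a,b), K being the set kernel."""
    return (set_kernel(bag_a, bag_a, theta) + set_kernel(bag_b, bag_b, theta)
            - 2 * set_kernel(bag_a, bag_b, theta))

def gaussian_K(bag_a, bag_b, theta=1.0, sigma=1.0):
    """Nonlinear K_G on embeddings: exp(-||mu_a - mu_b||_H^2 / (2 sigma^2))."""
    return np.exp(-embedding_distance_sq(bag_a, bag_b, theta) / (2 * sigma ** 2))
```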
Vector-valued output: similarly

- Now $K(\mu_a, \mu_b) \in L(Y)$ (operator-valued kernel).
- Prediction on a new test distribution $t$:
$$(f_{\hat{z}}^\lambda \circ \mu)(t) = \mathbf{k} (\mathbf{K} + \ell \lambda I_\ell)^{-1} [y_1; \ldots; y_\ell], \tag{4}$$
$$\mathbf{K} = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in L(Y)^{\ell \times \ell}, \tag{5}$$
$$\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t), \ldots, K(\mu_{\hat{x}_\ell}, \mu_t)] \in L(Y)^{1 \times \ell}. \tag{6}$$
- Specifically: $Y = \mathbb{R}$ ⇒ $L(Y) = \mathbb{R}$; $Y = \mathbb{R}^d$ ⇒ $L(Y) = \mathbb{R}^{d \times d}$.

Demo

- Supervised entropy learning:
[Figure: true entropy vs. MERR and DFDR estimates as a function of the rotation angle $\beta$; RMSE: MERR = 0.75, DFDR = 2.02.]
- Aerosol prediction from satellite images:
  - State-of-the-art baseline: 7.5–8.5 (±0.1–0.6).
  - MERR: 7.81 (±1.64).

Summary

- Problem: distribution regression ($k$ on bags).
- Contribution: computational & statistical tradeoff analysis; specifically:
  - the set kernel is consistent: a 16-year-old open question,
  - the minimax optimal rate is achievable with a sub-quadratic bag size.
- Code in ITE; analysis submitted to JMLR:
  - https://bitbucket.org/szzoli/ite/
  - http://arxiv.org/abs/1411.2066

Thank you for the attention!

Acknowledgments: This work was supported by the Gatsby Charitable Foundation, and by NSF grants IIS1247658 and IIS1250350. A part of the work was carried out while Bharath K. Sriperumbudur was a research fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge, UK.

References

Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331–368.

Chen, Y. and Wu, O. (2012). Contextual Hausdorff dissimilarity for multi-instance clustering. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 870–873.

Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005). Semigroup kernels on measures. Journal of Machine Learning Research, 6:1169–1198.

Edgar, G. (1995). Measure, Topology and Fractal Geometry. Springer-Verlag.

Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002). Multi-instance kernels. In International Conference on Machine Learning (ICML), pages 179–186.

Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. (http://cbse.soe.ucsc.edu/sites/default/files/convolutions.pdf)

Hein, M. and Bousquet, O. (2005). Hilbertian metrics and positive definite kernels on probability measures. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 136–143.

Jebara, T., Kondor, R., and Howard, A. (2004). Probability product kernels. Journal of Machine Learning Research, 5:819–844.

Kandasamy, K., Krishnamurthy, A., Póczos, B., Wasserman, L., and Robins, J. M. (2015). Influence functions for machine learning: Nonparametric estimators for entropies, divergences and mutual informations. Technical report, Carnegie Mellon University. http://arxiv.org/abs/1411.4342

Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009). Nonextensive information theoretical kernels on measures. Journal of Machine Learning Research, 10:935–975.

Nielsen, F. and Nock, R. (2012). A closed-form expression for the Sharma-Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45:032003.

Oliva, J., Neiswanger, W., Póczos, B., Xing, E., and Schneider, J. (2015). Fast function to function regression. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 717–725.

Oliva, J., Póczos, B., and Schneider, J. (2013). Distribution to distribution regression. International Conference on Machine Learning (ICML; JMLR W&CP), 28:1049–1057.

Oliva, J. B., Neiswanger, W., Póczos, B., Schneider, J., and Xing, E. (2014). Fast distribution to real regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.

Póczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013). Distribution-free distribution regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 31:507–515.

Póczos, B., Xiong, L., and Schneider, J. (2011). Nonparametric divergence estimation with applications to machine learning on distributions. In Uncertainty in Artificial Intelligence (UAI), pages 599–608.

Reddi, S. J. and Póczos, B. (2014). k-NN regression on functional data with incomplete observations. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 692–701.

Sriperumbudur, B. K., Fukumizu, K., Kumar, R., Gretton, A., and Hyvärinen, A. (2014). Density estimation in infinite dimensional exponential families. Technical report. http://arxiv.org/pdf/1312.3516

Steinwart, I., Hush, D. R., and Scovel, C. (2009). Optimal rates for regularized least squares regression.
In Conference on Learning Theory (COLT).

Sutherland, D. J., Oliva, J. B., Póczos, B., and Schneider, J. (2015). Linear-time learning on distributions with approximate kernel embeddings. Technical report, Carnegie Mellon University. http://arxiv.org/abs/1509.07553

Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009). Closed-form Jensen-Rényi divergence for mixture of Gaussians and applications to group-wise shape registration. Medical Image Computing and Computer-Assisted Intervention, 12:648–655.

Wang, J. and Zucker, J.-D. (2000). Solving the multiple-instance problem: A lazy learning approach. In International Conference on Machine Learning (ICML), pages 1119–1126.

Wang, Z., Lan, L., and Vucetic, S. (2012). Mixture model for multiple instance regression and applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 50:2226–2237.

Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010). Identifying multi-instance outliers. In SIAM International Conference on Data Mining (SDM), pages 430–441.

Zhang, M.-L. and Zhou, Z.-H. (2009). Multi-instance clustering with applications to multi-instance prediction. Applied Intelligence, 31:47–68.

Zhou, S. K. and Chellappa, R. (2006). From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.