Consistent Vector-valued Distribution Regression Zolt´an Szab´ o

advertisement
Consistent Vector-valued Distribution Regression
Zoltán Szabó
Joint work with Arthur Gretton (UCL), Barnabás Póczos (CMU),
Bharath K. Sriperumbudur (PSU)
UCL Workshop on the Theory of Big Data
January 8, 2015
Zoltán Szabó
Consistent Vector-valued Distribution Regression
The task
Samples: {(xi , yi )}li =1 . Goal: f (xi ) ≈ yi , find f ∈ H.
Distribution regression:
xi -s are distributions,
i
available only through samples: {xi ,n }N
n=1 .
⇒ Training examples: labelled bags.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Example: aerosol prediction from satellite images
Bag:= points of a multispectral satellite image over an area.
Label of a bag:= aerosol value.
Engineered methods [Wang et al., 2012]: 100 × RMSE = 7.5 − 8.5.
Using distribution regression:
without domain knowledge,
100 × RMSE = 7.81.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Wider context
Context:
machine learning: multi-instance learning,
statistics: point estimation tasks (without analytical formula).
Applications:
computer vision: image = collection of patch vectors,
network analysis: group of people = bag of friendship graphs,
natural language processing: corpus = bag of documents,
time-series modelling: user = set of trial time-series.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Several algorithmic approaches
1
Parametric fit: Gaussian, MOG, exp. family
[Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].
2
Kernelized Gaussian measures:
[Jebara et al., 2004, Zhou and Chellappa, 2006].
3
(Positive definite) kernels:
[Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].
4
Divergence measures (KL, Rényi, Tsallis): [Póczos et al., 2011].
5
Set metrics: Hausdorff metric [Edgar, 1995]; variants
[Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009,
Chen and Wu, 2012].
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Theoretical guarantee?
MIL dates back to [Haussler, 1999, Gärtner et al., 2002].
Sensible methods in regression: require density estimation
[Póczos et al., 2013, Oliva et al., 2014] + assumptions:
1
2
compact Euclidean domain.
output = R.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Problem formulation
Given: labelled bags
ẑ = {(x̂i , yi )}li =1 , where
i .i .d.
i th bag: x̂i = {xi ,1 , . . . , xi ,N } ∼ xi ∈ M+
1 (D), yi ∈ Y .
Task: find a M+
1 (D) → Y mapping based on ẑ.
Construction: distribution embedding (µx ) + ridge regression
f ∈H=H(K )
µ=µ(k)
M+
1 (D) −−−−→ X ⊆ H = H(k) −−−−−−−→ Y .
Our goal: risk bound compared to the regression function
Z
y dρ(y |µx ).
fρ (µx ) =
Y
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Goal in details
Contribution: analysis of the excess risk
E(fẑλ , fρ ) = R[fẑλ ] − R[fρ ] ≤ g (l , N, λ) → 0 and rates,
R [f ] = E(x,y ) kf (µx ) − y k2Y (expected risk),
l
fẑλ = arg min
f ∈H
1X
kf (µx̂i ) − yi k2Y + λ kf k2H ,
l
(λ > 0).
i =1
We consider two settings:
1
2
well-specified case: fρ ∈ H,
misspecified case: fρ ∈ L2ρX \H.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Kernel (k, K ), RKHS
k : D × D → R kernel on D, if ∃ϕ : D → H(ilbert)
k(a, b) = hϕ(a), ϕ(b)iH .
∃! RKHS: H(k) = {D → R functions}, ϕ(u) = k(·, u).
Kernel examples:
D = Rd (p > 0, θ > 0):
k(a, b) = (ha, bi + θ)p : polynomial,
2
2
k(a, b) = e −ka−bk2 /(2θ ) : Gaussian,
Graphs, texts, time series, distributions.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Kernel (k, K ), RKHS
k : D × D → R kernel on D, if ∃ϕ : D → H(ilbert)
k(a, b) = hϕ(a), ϕ(b)iH .
∃! RKHS: H(k) = {D → R functions}, ϕ(u) = k(·, u).
Kernel examples:
D = Rd (p > 0, θ > 0):
k(a, b) = (ha, bi + θ)p : polynomial,
2
2
k(a, b) = e −ka−bk2 /(2θ ) : Gaussian,
Graphs, texts, time series, distributions.
Note: H(K ) = {X → Y functions}, K (µx , µx ′ ) ∈ L(Y ).
Zoltán Szabó
Consistent Vector-valued Distribution Regression
µ
Step-1 (distribution embedding): M+
→ X ⊆ H(k)
1 (D) −
Given: kernel k : D × D → R.
Mean embedding of a distribution x, x̂i ∈ M+
1 (D):
Z
k(·, u)dx(u) ∈ H(k),
µx =
D
N
1 X
k(·, u)dx̂i (u) =
µx̂i =
k(·, xi ,n ).
N
D
Z
n=1
Y = R, linear K ⇒ set kernel:
K (µx̂i , µx̂j ) = µx̂i , µx̂j
Zoltán Szabó
H
N
1 X
k(xi ,n , xj,m ).
= 2
N
n,m=1
Consistent Vector-valued Distribution Regression
Step-2 (ridge regression): analytical solution
Given:
training sample: ẑ,
test distribution: t.
Prediction:
(fẑλ ◦ µ)(t) = k(K + l λIl )−1 [y1 ; . . . ; yl ],
(1)
l×l
K = [Kij ] = [K (µx̂i , µx̂j )] ∈ L(Y )
,
k = [K (µx̂1 , µt ), . . . , K (µx̂l , µt )] ∈ L(Y )1×l .
(2)
(3)
Specially: Y = R ⇒ L(Y ) = R; Y = Rd ⇒ L(Y ) = Rd×d .
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Blanket assumptions
D: separable, topological domain.
k: bounded, continuous.
K : bounded, Hölder continuous (h ∈ (0, 1]: exponent).
X = µ M+
1 (D) ∈ B(H).
Y : separable Hilbert.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Performance guarantees (in human-readable format)
If in addition
1
well-specified case: fρ is ’c-smooth’ with ’b-decaying
1
covariance operator’ and l ≥ λ− b −1 , then
E(fẑλ , fρ ) ≤
2
1
1
logh (l )
+ λc + 2 + 1 .
N h λ3
l λ lλb
(4)
misspecified case: fρ is ’s-smooth’, L2ρX is separable, and
1
≤ l , then
λ2
h
E(fẑλ , fρ ) ≤
log 2 (l )
h
2
N λ
3
2
1
+√ +
lλ
Zoltán Szabó
√
λmin(1,s)
√
+ λmin(1,s) .
λ l
Consistent Vector-valued Distribution Regression
(5)
Performance guarantee: example
Misspecified case: assume
s ≥ 1, h = 1 (K : Lipschitz),
✄
✄
a
1
=
✂ ✁ ✂3 ✁in (5) ⇒ λ; l = N (a > 0)
t = lN a : total number of samples processed.
Then
1
2
s = 1 (’most difficult’ task): E(fẑλ , fρ ) ≈ t −0.25 ,
s → ∞ (’simplest’ problem): E(fẑλ , fρ ) ≈ t −0.5 .
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Nonlinear K examples
Y = R; D: compact, metric; k: universal ⇒ Hölder K -s:
KG
e
Ke
kµa −µb k2H
−
2θ 2
e−
h=1
KC
kµa −µb kH
2θ 2
1
2
h=
Kt
θ
2
h=1
−1
Ki
1 + kµa − µb kθH
h=
1 + kµa − µb k2H /θ 2
−1
(θ ≤ 2)
kµa − µb k2H + θ 2
h=1
− 1
2
They are functions of kµa − µb kH ⇒ computation: similar to set
kernel.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Summary
Problem: distribution regression.
Literature: large number of heuristics.
Contribution:
a simple ridge solution is consistent,
specially, the set kernel is so (15-year-old open question).
Code ∈ ITE toolbox:
https://bitbucket.org/szzoli/ite/
Details (submitted to JMLR):
http://arxiv.org/pdf/1411.2066.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Thank you for the attention!
Acknowledgments: This work was supported by the Gatsby Charitable
Foundation, and by NSF grants IIS1247658 and IIS1250350. The work
was carried out while Bharath K. Sriperumbudur was a research fellow in
the Statistical Laboratory, Department of Pure Mathematics and
Mathematical Statistics at the University of Cambridge, UK.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Appendix: contents
Well/misspecified assumptions.
Topological definitions, separability.
Vector-valued RKHS.
Weak topology on M+
1 (D).
Measurability of µ.
Universal kernel examples.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Well-specified case: ρ ∈ P(b, c)
Let the T : H → H covariance operator be
Z
K (·, µa )K ∗ (·, µa )dρX (µa )
T =
X
with eigenvalues tn (n = 1, 2, . . .).
Assumption: ρ ∈ P(b, c) = set of distributions on X × Y
α ≤ nb tn ≤ β (∀n ≥ 1; α > 0, β > 0),
c−1
2
∃g ∈ H such that fρ = T 2 g with kg kH ≤ R (R > 0),
where b ∈ (1, ∞), c ∈ [1, 2].
Intuition: b – effective input dimension, c – smoothness of fρ .
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Misspecified case: assumption
Let T̃ be the extension of T from H to L2ρX :
SK∗ : H ֒→ L2ρX ,
SK : L2ρX → H,
(SK g )(µu ) =
Z
K (µu , µt )g (µt )dρX (µt ),
X
T̃ = SK∗ SK : L2ρX → L2ρX .
Our range space assumption on ρ: fρ ∈ Im T̃ s for some s ≥ 0.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Misspecified case: note on the separability of L2ρX
L2ρX : separable ⇔ measure space with d(A, B) = ρX (A △ B) is so
[Thomson et al., 2008].
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Topological space, open sets
Given: D 6= ∅ set.
τ ⊆ 2D is called a topology on D if:
1
2
3
∅ ∈ τ, D ∈ τ.
Finite intersection: O1 ∈ τ , O2 ∈ τ ⇒ O1 ∩ O2 ∈ τ .
Arbitrary union: Oi ∈ τ (i ∈ I ) ⇒ ∪i ∈I Oi ∈ τ .
Then, (D, τ ) is called a topological space; O ∈ τ : open sets.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Closed-, compact set, closure, dense subset, separability
Given: (D, τ ). A ⊆ D is
closed if D\A ∈ τ (i.e., its complement is open),
compact if for any family (Oi )i ∈I of open sets with
A ⊆ ∪i ∈I Oi , ∃i1 , . . . , in ∈ I with A ⊆ ∪nj=1 Oij .
Closure of A ⊆ D:
Ā :=
\
C.
A⊆C closed in D
A ⊆ D is dense if Ā = D.
(D, τ ) is separable if ∃ countable, dense subset of D.
Counterexample: l ∞ /L∞ .
Zoltán Szabó
Consistent Vector-valued Distribution Regression
(6)
Vector-valued RKHS
Definition:
A H ⊆ Y X Hilbert space of functions is RKHS if
Aµx ,y : f 7→ hy , f (µx )iY
(7)
is continuous for ∀µx ∈ X , y ∈ Y .
= The evaluation functional is continuous in every direction.
Riesz representation theorem ⇒
∃Kµt ∈ L(Y , H):
K (µx , µt )(y ) = (Kµt y )(µx ),
K (·, µt )(y ) = Kµt y ,
(∀µx , µt ∈ X ), or shortly
H(K ) = span{Kµt y : µt ∈ X , y ∈ Y }.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
(8)
(9)
Vector-valued RKHS – continued
Examples (Y = Rd ):
1
Ki : X × X → R kernels (i = 1, . . . , d). Diagonal kernel:
K (µa , µb ) = diag (K1 (µa , µb ), . . . , Kd (µa , µb )).
2
(10)
Combination of Dj diagonal kernels [Dj (µa , µb ) ∈ Rr ×r ,
Aj ∈ Rr ×d ]:
K (µa , µb ) =
m
X
A∗j Dj (µa , µb )Aj .
(11)
j=1
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Weak topology on M+
1 (D)
Def.: It is the weakest topology such that the
Lh : (M+
(D), τw ) → R,
Z1
h(u)dx(u)
Lh (x) =
D
mapping is continuous for all h ∈ Cb (D), where
Cb (D) = {(D, τ ) → R bounded, continuous functions}.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Measurability of µ
k: bounded, continuous ⇒
µ : (M+
1 (D), B(τw )) → (H, B(H)) measurable.
µ measurable, X ∈ B(H) ⇒ ρ on X × Y : well-defined.
If D is compact metric, k is universal, then µ is continuous
and X ∈ B(H).
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Universal kernel examples
On every compact subset of Rd :
k(a, b) = e −
ka−bk2
2
2σ 2
,
(σ > 0)
k(a, b) = e βha,bi , (β > 0), or more generally
∞
X
an x n (∀an > 0)
k(a, b) = f (ha, bi), f (x) =
n=0
α
k(a, b) = (1 − ha, bi) ,
Zoltán Szabó
(α > 0).
Consistent Vector-valued Distribution Regression
Chen, Y. and Wu, O. (2012).
Contextual Hausdorff dissimilarity for multi-instance clustering.
In International Conference on Fuzzy Systems and Knowledge
Discovery (FSKD), pages 870–873.
Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005).
Semigroup kernels on measures.
Journal of Machine Learning Research, 6:11691198.
Edgar, G. (1995).
Measure, Topology and Fractal Geometry.
Springer-Verlag.
Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A.
(2002).
Multi-instance kernels.
In International Conference on Machine Learning (ICML),
pages 179–186.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Haussler, D. (1999).
Convolution kernels on discrete structures.
Technical report, Department of Computer Science, University
of California at Santa Cruz.
(http://cbse.soe.ucsc.edu/sites/default/files/
convolutions.pdf).
Hein, M. and Bousquet, O. (2005).
Hilbertian metrics and positive definite kernels on probability
measures.
In International Conference on Artificial Intelligence and
Statistics (AISTATS), pages 136–143.
Jebara, T., Kondor, R., and Howard, A. (2004).
Probability product kernels.
Journal of Machine Learning Research, 5:819–844.
Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q.,
and Figueiredo, M. A. T. (2009).
Nonextensive information theoretical kernels on measures.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Journal of Machine Learning Research, 10:935–975.
Nielsen, F. and Nock, R. (2012).
A closed-form expression for the Sharma-Mittal entropy of
exponential families.
Journal of Physics A: Mathematical and Theoretical,
45:032003.
Oliva, J. B., Neiswanger, W., Póczos, B., Schneider, J., and
Xing, E. (2014).
Fast distribution to real regression.
International Conference on Artificial Intelligence and
Statistics (AISTATS; JMLR W&CP), 33:706–714.
Póczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013).
Distribution-free distribution regression.
International Conference on Artificial Intelligence and
Statistics (AISTATS; JMLR W&CP), 31:507–515.
Póczos, B., Xiong, L., and Schneider, J. (2011).
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Nonparametric divergence estimation with applications to
machine learning on distributions.
In Uncertainty in Artificial Intelligence (UAI), pages 599–608.
Thomson, B. S., Bruckner, J. B., and Bruckner, A. M. (2008).
Real Analysis.
Prentice-Hall.
Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D.,
and Rangarajan, A. (2009).
Closed-form Jensen-Rényi divergence for mixture of Gaussians
and applications to group-wise shape registration.
Medical Image Computing and Computer-Assisted
Intervention, 12:648–655.
Wang, J. and Zucker, J.-D. (2000).
Solving the multiple-instance problem: A lazy learning
approach.
In International Conference on Machine Learning (ICML),
pages 1119–1126.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Wang, Z., Lan, L., and Vucetic, S. (2012).
Mixture model for multiple instance regression and
applications in remote sensing.
IEEE Transactions on Geoscience and Remote Sensing,
50:2226–2237.
Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010).
Identifying multi-instance outliers.
In SIAM International Conference on Data Mining (SDM),
pages 430–441.
Zhang, M.-L. and Zhou, Z.-H. (2009).
Multi-instance clustering with applications to multi-instance
prediction.
Applied Intelligence, 31:47–68.
Zhou, S. K. and Chellappa, R. (2006).
From sample similarity to ensemble similarity: Probabilistic
distance measures in reproducing kernel Hilbert space.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 28:917–929.
Zoltán Szabó
Consistent Vector-valued Distribution Regression
Download