Reproducing Kernels and New Approaches in
Compositional Data Analysis
Binglin Li¹ and Jeongyoun Ahn²
¹University of Georgia
²Korea Advanced Institute of Science and Technology
May 4, 2022
Abstract
Compositional data, such as human gut microbiomes, consist of non-negative variables for which only the values relative to other variables are available. Analyzing compositional data requires a careful treatment of the geometry
of the data. A common geometrical understanding of compositional data is via a regular
simplex. The majority of existing approaches rely on log-ratio or power transformations
to overcome the innate simplicial geometry. In this work, based on the key observation
that compositional data are projective in nature, and on the intrinsic connection between projective and spherical geometry, we re-interpret the compositional domain as
the quotient topology of a sphere modded out by a group action. This re-interpretation
allows us to understand the function space on compositional domains in terms of that
on spheres and to use spherical harmonics theory along with reflection group actions
for constructing a compositional Reproducing Kernel Hilbert Space (RKHS). This construction of RKHS for compositional data will widely open research avenues for future
methodology developments. In particular, well-developed kernel embedding methods
can be now introduced to compositional data analysis. The polynomial nature of
compositional RKHS has both theoretical and computational benefits. The wide applicability of the proposed theoretical framework is exemplified with nonparametric
density estimation and kernel exponential family for compositional data.
1 Introduction
The recent popularity of human gut microbiome research has presented many data-analytic and
statistical challenges (Calle, 2019). Among the many features of microbiomes, or meta-genomics,
data, we address their compositional nature in this work. Compositional data consist of n
observations of (d + 1) non-negative variables whose values represent the relative proportions
to other variables in the data. Compositional data have been commonly observed in many
scientific fields, such as biochemistry, ecology, finance, and economics, to name just a few. The
most notable aspect of compositional data is the restriction on their domain, specifically
that the sum of the variables is fixed. The compositional domain is not a classical vector
space, but instead a (regular) simplex, which can be modeled as
$$\Delta^d = \left\{ (x_1, \ldots, x_{d+1}) \in \mathbb{R}^{d+1} \;\Big|\; \sum_{i=1}^{d+1} x_i = 1, \; x_i \ge 0, \; \forall i \right\}, \tag{1}$$
which is topologically compact. The inclusion of zeros in (1) is crucial as most microbiomes
data have a substantial number of zeros.
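For readers who want to work with (1) numerically, here is a minimal NumPy sketch of our own (the function name `closure`, echoing the standard compositional-data term, is our choice and not from the paper) that maps raw non-negative counts onto ∆d while keeping exact zeros:

```python
import numpy as np

def closure(x, atol=1e-12):
    """Normalize a non-negative vector (or the rows of a matrix) to sum to 1,
    i.e., map raw counts/proportions onto the simplex Delta^d in (1)."""
    x = np.asarray(x, dtype=float)
    if np.any(x < -atol):
        raise ValueError("compositional parts must be non-negative")
    s = x.sum(axis=-1, keepdims=True)
    if np.any(s <= 0):
        raise ValueError("each observation must have a positive total")
    return x / s

# Example: raw microbiome-like counts; zeros are kept as exact zeros.
counts = np.array([[10, 0, 30], [5, 5, 0]])
print(closure(counts))   # each row sums to 1
```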
Arguably the most prominent approach to handle the data on a simplex is to take a log-ratio transformation (Aitchison, 1982), for which one has to consider only the open interior
of ∆d , denoted by S d . Zeros are usually taken care of by adding a small number, however,
it has been noted that the results of analysis can be quite dependent on how the zeros
are handled (Lubbe et al., 2021). Gloor et al. (2017) pointed out “the dangers inherent in
ignoring the compositional nature of the data” and argued that microbiome datasets must
be treated as compositions at all stages of analysis. Recently some approaches that analyze
compositional data without any transformation have been gaining popularity (Li et al., 2020;
Rasmussen et al., 2020). The approach proposed in this paper is to construct reproducing
kernels of compositional data by interpreting compositional domains via projective spherical
geometries.
1.1 Methodological Motivation
Besides the motivation from microbiome studies, another source of inspiration for this work
is the current exciting development in statistics and machine learning, in particular the
rising popularity of applying higher tensors and kernel techniques, which allow multivariate
methods to be extended to exotic structures beyond traditional vector spaces, e.g., graphs
(Jorgensen and Tian, 2016), manifolds (Minh and Sindhwani, 2011) or images (Zhou et al.,
2013). This work serves as an attempt to construct reproducing kernel structures for compositional data, so that recent developments of (reproducing) kernel techniques from machine
learning theory can be introduced to this classical field in statistics.
The approach in this work is to model the compositional data as a group quotient of
a sphere Sd /Γ (see (7)), which gives a new connection of compositional data analysis with
directional statistics. The idea of representing data by using tensors and frames is not new
in directional statistics (Arnold et al., 2018), but the authors find it more convenient to
construct reproducing kernels for Sd/Γ (the reason is given in Section 1.4).
We do want to mention that the construction of reproducing kernels for compositional
data indicates a new potential paradigm for compositional data analysis: traditional approaches aim to find direct analogues of multivariate concepts, like the mean, variance-covariance
matrices, and suitable regression analysis frameworks based on those concepts. However,
finding the mean point over non-linear spaces, e.g., on manifolds, is not an easy task, and in
worst-case scenarios, mean points might not even exist on the underlying space (e.g., the
mean point of the uniform distribution on a unit circle does not live on the circle).
In this work we take the perspective of kernel mean embedding (Muandet et al., 2017).
Roughly speaking, instead of finding the “physical ” point for the mean of a distribution, one
could do statistics distributionally. In other words, the mean or expectation is considered as
a linear functional on the RKHS, and this functional is represented by an actual function
in the Hilbert space, which is referred to as “kernel mean E[k(X, ·)]”. Instead of trying to
find another compositional point as the empirical mean of a compositional data set, one
can construct the "kernel mean" as a replacement of the traditional empirical mean, which is
just $\sum_{i=1}^{n} k(X_i, \cdot)/n$. Moreover, one can also construct the analogue of the variance-covariance
matrix purely from kernels; in fact, Fukumizu et al. (2009) considered Gram matrices
constructed out of reproducing kernels as consistent estimators of cross-variance operators
(these operators play the role of covariance and cross-variance matrices in classical Euclidean
spaces).
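As a schematic illustration of the kernel mean and Gram-matrix constructions just described, here is a sketch of our own; the Gaussian kernel below is only a placeholder (the compositional kernels themselves are constructed in Section 3), and all function names are ours:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # placeholder kernel; Section 3 constructs compositional kernels instead
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def kernel_mean(data, kernel):
    """Return the function t -> (1/n) * sum_i k(X_i, t), the empirical kernel mean."""
    return lambda t: np.mean([kernel(xi, t) for xi in data])

def gram_matrix(data, kernel):
    """Gram matrix K_ij = k(X_i, X_j), the building block for cross-variance operators."""
    n = len(data)
    return np.array([[kernel(data[i], data[j]) for j in range(n)] for i in range(n)])

X = np.random.default_rng(0).dirichlet(np.ones(3), size=5)   # toy compositions
mu_hat = kernel_mean(X, gaussian_kernel)
print(mu_hat(X[0]), gram_matrix(X, gaussian_kernel).shape)
```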
Since we remodel the compositional domain using projective/spherical geometry, the compositional domain is not treated as a vector space, but as a quotient topological space Sd/Γ.
Instead of “putting a linear structure on an Aitchison simplex” (Aitchison, 1982), or square
root transformation (which is still transformed from an Aitchison simplex), we choose to
“linearize” compositional data points by using kernel techniques (and possibly higher-tensor
constructions) and one can still do "multivariate analysis". Our construction in this work initiates such an attempt to introduce these recent developments of kernel and tensor techniques
from statistical learning theory into compositional data analysis.
1.2 Contributions of the Present Work
Our contribution in this paper is threefold. First, we propose a new geometric foundation
for compositional data analysis, Pd≥0 , a subspace of a full projective space Pd . Based on the
close connection of spheres with projective spaces, we will also describe Pd≥0 in terms of Sd /Γ,
a sphere modded out by a reflection group action, whose fundamental domain is the first
orthant Sd≥0 (the appearance of "Sd≥0" here is for a totally different reason than in the traditional approach).
Secondly, based on the new geometric foundations of compositional domains, we propose
a new nonparametric compositional density estimation method by making use of the well-developed
spherical density estimation theory. Furthermore, we provide a central limit theorem for
integral squared errors, which leads to a goodness-of-fit test.
Thirdly, also through this new geometric foundation, function spaces on compositional
domains can be related with those on spheres. The space of square integrable functions L2(Sd) on
the sphere is the focus of an ancient subject in mathematics and physics, called "spherical
harmonics”. Moreover, spherical harmonics theory also tells that each Laplacian eigenspace
of L2 (Sd ) is a reproducing kernel Hilbert space, and this allows us to construct reproducing
kernels for compositional data points via “orbital integrals”, which opens a door for machine
learning techniques to be applied to compositional data. We also propose a compositional
exponential family as a general distributional family for compositional data modeling.
1.3 Why Projective and Spherical Geometries?
According to Aitchison (1994), “any meaningful function of a composition must satisfy the
requirement f(ax) = f(x) for any a ≠ 0.” In geometry and topology, a space consisting
of such functions is called a projective space, denoted by Pd , therefore, projective geometry
should be the natural candidate to model compositional data, rather than a simplex. Since
a point in compositional domains cannot have opposite signs, a compositional domain is in
fact a “positive cone” Pd≥0 inside a full projective space.
A key property of projective spaces is that stretching or shrinking the length of a vector
in Pd does not alter the point. Thus one can stretch a point in ∆d to a point in the first
orthant sphere by dividing it by its ℓ2 norm. Figure 1 illustrates this stretching in action
(note that "stretching" is not a transformation from the projective geometry point of view).

Figure 1: Illustration of the stretching action from ∆1 to S1. The stretching keeps the
relative compositions, whereas the square root transformation fails to do so.

In short, projective geometry is a more natural way to model compositional data, following the original
philosophy in Aitchison (1994).
However, spheres are easier to work with because mathematically speaking, the function
space on spheres is a well-treated subject in spherical harmonics theory, and statistically
speaking, we can connect with directional statistics in a more natural way. Our compositional
domain Pd≥0 can be naturally identified with Sd /Γ, a sphere modded out by a reflection group
action. This reflection group Γ acts on the sphere Sd by reflection, and the fundamental
domain of this action is Sd≥0 (notions of group actions, fundamental domains and reflection
groups are all discussed in Section 2). Thus our connection with the first orthant sphere Sd≥0
is a natural consequence of projective geometry and its connection with spheres with group
actions, having nothing to do with square root transformations.
1.4 Why Reproducing Kernels?
As explained in Section 1.1, we strive to use new ideas of tensors and kernel techniques in
machine learning to propose another framework for compositional data analysis, and Section
1.3 explains the new connection with spherical geometry and directional statistics. The idea
of using tensors to represent data points is not new in directional statistics (Arnold et al., 2018),
so a naive idea would be to mimic the approach of directional statistics to
ambiguous rotations: Arnold et al. (2018) studied how to do statistics over the coset
space SO(3)/K where K is a finite subgroup of SO(3). In their case, the subgroup K
has to be a special class of subgroups of special orthogonal groups, and within this class,
they manage to study the corresponding tensors and frames, which gives the inner product
structures of different data points.
However, in our case, a compositional domain is Sd /Γ = O(d) \ O(d + 1)/Γ, a double
coset space. Unlike Arnold et al. (2018), which only considered the d = 3 case, our dimension
d is completely general; moreover, our reflection group Γ is not a subgroup of any special
orthogonal group, so the constructions of tensors and frames in Arnold et al. (2018) do not
apply to our situation directly.
Part of the novelty of this work is to get around this issue by making use of the reproducing
kernel Hilbert space (RKHS) structures on spheres, and "averaging out" the group action
at the reproducing kernel level, which in return gives us a reproducing kernel structure
on compositional domains. Once we have RKHS in hand, we can “add” and take the
“inner product” of two data points, so our linearization strategy can also be regarded as
a combination of “the averaging approach” and the “embedding approach” as in Arnold
et al. (2018). In fact, an abstract function space together with reproducing kernels plays
an increasingly important role. Below we provide some philosophical motivation for the
importance of the function space over the underlying data set:
(a) Hilbert spaces of functions are naturally linear with an inner product structure. With
the existence of (reproducing) kernels, data points are naturally incorporated into
the function space, which leads to interesting interactions between the data set and
functions defined over them. There has been a large body of literature on embedding
distributions into RKHS, e.g. Smola et al. (2007), and using reproducing kernels to
recover exponential families, e.g. Dai et al. (2019). RKHS has also been used to recover
classical statistical tests, e.g. goodness-of-fit test in Chwialkowski et al. (2016), and
regression in de los Campos et al. (2009). Those works are not primarily concerned with the
analysis of the function space, but rather with data analysis on the underlying data set;
nevertheless, all of them are carried out by passing to an RKHS. This reflects the increasing
recognition of the importance of an abstract function space with a (reproducing) kernel structure.
(b) Mathematically speaking, given a geometric space M , the function space on M can
recover the underlying geometric space M itself, and this principle has been playing
a big role in different areas of geometry; in particular, modern algebraic geometry,
following the philosophy of Grothendieck, is based upon this insight. Function spaces
can be generalized to matrix valued function spaces, and this generalization gives rise
to non-commutative RKHS, which is used in shape analysis in Micheli and Glaunés
(2014); moreover, non-commutative RKHS is connected with free probability theory
(Ball et al., 2016), which has been used in random effects and linear mixed effects
models (Zhou and Johnstone, 2019; Fan et al., 2021).
1.5 Structure of the Paper
We describe briefly the content of the main sections of this article:
• In Section 2, we will rebuild the geometric foundation of compositional domains by
using projective geometry and spherical geometry. We will also point out that the old
model using the closed simplex ∆d is topologically the same as the new foundation. In a
diagrammatic way, we establish the following topological equivalence:
$$\Delta^d \cong \mathbb{P}^d_{\ge 0} \cong S^d/\Gamma \cong S^d_{\ge 0}, \tag{2}$$
where Sd≥0 is the first orthant sphere, which is also the fundamental domain of the
group action Γ ↷ Sd. All of the four spaces in (2) will be referred to as "compositional
domains”.
As a direct application, we propose a compositional density estimation method by
using the spherical density estimation theory via a spread-out construction through
the quotient map π : Sd → Sd/Γ, and prove that the integrated squared error of our
compositional density estimator satisfies a central limit theorem (Theorem 2.6),
which can be used for goodness-of-fit tests.
• Section 3 will be devoted to constructing compositional reproducing kernel Hilbert
spaces. Our construction relies on the reproducing kernel structures on spheres, which
is given by spherical harmonics theory. Wahba (1981) constructed splines using reproducing kernel structures on S2 (the 2-dimensional sphere), in which she also used the spherical
harmonics theory in Sansone (1959), which only treats the 2-dimensional case. Our theory
deals with general d-dimensional case, so we need the full power of spherical harmonics
theory, which will be reviewed at the beginning of Section 3, and then we will use
spherical harmonics theory to construct compositional reproducing kernels using an
“orbital integral” type of idea.
• Section 4 will give a couple of applications of our construction of compositional reproducing kernels. (i) The first example is the representer theorem, but with one
caveat: our RKHS is finite dimensional, consisting of degree 2m homogeneous polynomials with no transcendental functions, so linear independence for distinct data points is
not directly available; however, we show that when the degree m is high enough, linear
independence still holds. Our statement of the representer theorem is not new purely from
the RKHS theory point of view. Our point is to demonstrate that intuitions from traditional statistical learning can still be used in compositional data analysis, with some
extra care. (ii) Secondly, we construct the compositional exponential family, which can
be used to model the underlying distribution of compositional data. The flexible construction will enable us to utilize the distribution family in many statistical problems
such as mean tests.
2 New Geometric Foundations of Compositional Domains
In this section, we give a new interpretation of compositional domains as a cone Pd≥0 in
a projective space, based on which compositional domains can be interpreted as spherical
quotients by reflection groups. This connection will yield a “spread-out” construction on
spheres and we demonstrate an immediate application of this new approach to compositional
density estimation.
2.1 Projective and Spherical Geometries and a Spread-out Construction
Compositional data consist of relative proportions of d + 1 variables, which implies that each
observation belongs to a projective space. A d-dimensional projective space Pd is the set of
one-dimensional linear subspaces of Rd+1. A one-dimensional subspace of a vector space is
just a line through the origin, and in projective geometry, all points in a line through the
origin will be regarded as the same point in a projective space. Contrary to the classical
linear coordinates (x1 , · · · , xd+1 ), a point in Pd can be represented by a projective coordinate
(x1 : · · · : xd+1 ), with the following property
$$(x_1 : x_2 : \cdots : x_{d+1}) = (\lambda x_1 : \lambda x_2 : \cdots : \lambda x_{d+1}), \quad \text{for any } \lambda \neq 0.$$
It is natural that an appropriate ambient space for compositional data is the non-negative projective space, which is defined as
$$\mathbb{P}^d_{\ge 0} = \left\{ (x_1 : x_2 : \cdots : x_{d+1}) \in \mathbb{P}^d \;\middle|\; (x_1 : x_2 : \cdots : x_{d+1}) = (|x_1| : |x_2| : \cdots : |x_{d+1}|) \right\}. \tag{3}$$
It is clear that the common representation of compositional data with a (closed) simplex ∆d
in (1) is in fact equivalent to (3), thus we have the first equivalence:
$$\mathbb{P}^d_{\ge 0} \cong \Delta^d. \tag{4}$$
Let Sd denote a d-dimensional unit sphere, defined as
$$S^d = \left\{ (x_1, x_2, \ldots, x_{d+1}) \in \mathbb{R}^{d+1} \;\middle|\; \sum_{i=1}^{d+1} x_i^2 = 1 \right\},$$
and let Sd≥0 denote the first orthant of Sd , a subset in which all coordinates are non-negative.
The following lemma states that Sd≥0 can be a new domain for compositional data as there
exists a bijective map between ∆d and Sd≥0 .
Lemma 2.1. There is a canonical identification of ∆d with Sd≥0 , namely,
$$\Delta^d \;\underset{g}{\overset{f}{\rightleftarrows}}\; S^d_{\ge 0},$$
where f is the inflation map and g is the contraction map, with both f and g being continuous
and inverse to each other.
Proof It is straightforward to construct the inflation map f. For v ∈ ∆d, it is easy to see
that f(v) ∈ Sd≥0 when f(v) = v/‖v‖2, where ‖v‖2 is the ℓ2 norm of v. Note that the inflation
map makes sure that f(v) is in the same projective space as v. To construct the contraction
map g, for s ∈ Sd≥0 we define g(s) = s/‖s‖1, where ‖s‖1 is the ℓ1 norm of s, and see that
g(s) ∈ ∆d. One can easily check that both f and g are continuous and inverse to each other.
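Lemma 2.1 can be rendered directly in code; the following is a sketch of ours (function names are our own choice):

```python
import numpy as np

def inflate(v):
    """f: Delta^d -> S^d_{>=0}, divide by the l2 norm (Lemma 2.1)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, ord=2)

def contract(s):
    """g: S^d_{>=0} -> Delta^d, divide by the l1 norm (Lemma 2.1)."""
    s = np.asarray(s, dtype=float)
    return s / np.abs(s).sum()

z = np.array([0.2, 0.3, 0.5])                    # a point on Delta^2
assert np.allclose(contract(inflate(z)), z)       # f and g are inverse to each other
```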
Based on Lemma 2.1, we now identify ∆d alternatively with the quotient topological
space Sd /Γ for some group action Γ. In order to do so, we first show that the cone Sd≥0
is a strict fundamental domain of Γ, i.e., Sd≥0 ≅ Sd/Γ. We start by defining a coordinate
hyperplane for a group. The i-th coordinate hyperplane Hi ⊂ Rd+1 with respect to a choice
of a standard basis {e1 , e2 , . . . , ed+1 } is a codimension one linear subspace which is defined
as
Hi = {(x1 , . . . , xi , . . . , xd+1 ) ∈ Rd+1 : xi = 0}, i = 1, . . . , d + 1.
We define the reflection group Γ with respect to coordinate hyperplanes as follows:
Definition 2.2. The reflection group Γ is a subgroup of general linear group GL(d + 1) and
it is generated by {γi , i = 1, . . . , d + 1}. Given the same basis {e1 , . . . , ed+1 } for Rd+1 , the
reflection γi is a linear map specified via:
γi : (x1, . . . , xi−1, xi, xi+1, . . . , xd+1) ↦ (x1, . . . , xi−1, −xi, xi+1, . . . , xd+1).
Note that when restricted to Sd, γi is an isometry from the unit sphere Sd to itself;
we denote this action by Γ ↷ Sd. Thus, one can treat the group Γ as a discrete subgroup of the
isometry group of Sd. In what follows we establish that Sd≥0 is a fundamental domain of the
group action Γ ↷ Sd in the topological sense. In general, there is no uniform treatment of
a fundamental domain, but we will follow the approach by Beardon (2012). To introduce a
fundamental domain, let us define an orbit first. For a point z ∈ Sd , an orbit of the group Γ
is the following set:
$$\mathrm{Orbit}^{\Gamma}_z = \{\gamma(z) : \gamma \in \Gamma\}. \tag{5}$$
Note that one can decompose Sd into a disjoint union of orbits. The size of an orbit is not
necessarily the same as the size of the group |Γ|, because of the existence of a stabilizer
subgroup, which is defined as
$$\Gamma_z = \{\gamma \in \Gamma : \gamma(z) = z\}. \tag{6}$$
The set Γz forms a group itself, and we call this group Γz the stabilizer subgroup of Γ. Every
element in OrbitΓz has isomorphic stabilizer subgroups, thus the size of OrbitΓz is the quotient
|Γ|/|Γz|, where | · | denotes the cardinality of a set. There are only finitely many possibilities for
the size of a stabilizer subgroup for the action Γ ↷ Sd, and the size of a stabilizer subgroup
depends on the codimensions of the coordinate hyperplanes.
Definition 2.3. Let G act properly discontinuously on a d-dimensional sphere, with
d > 1. A fundamental domain for the group action G is a closed subset F of the sphere such
that every orbit of G intersects F in at least one point and if an orbit intersects with the
interior of F , then it only intersects F at one point.
A fundamental domain is strict if every orbit of G intersects F at exactly one point. The
following proposition identifies Sd≥0 as the quotient topological space Sd /Γ, i.e., Sd≥0 = Sd /Γ.
Proposition 2.4. Let Γ ↷ Sd be the group action described in Definition 2.2. Then Sd≥0 is a
strict fundamental domain.
In topology, there is a natural quotient map Sd → Sd /Γ. With the identification Sd≥0 =
Sd/Γ, there should be a natural map Sd → Sd≥0. Now define a contraction map c : Sd → Sd≥0
via (x1, . . . , xd+1) ↦ (|x1|, . . . , |xd+1|), by taking component-wise absolute values. Then it is
straightforward to see that c is indeed the topological quotient map Sd → Sd/Γ, under
the identification Sd≥0 = Sd /Γ.
So far, via (4), Lemma 2.1 and Proposition 2.4, we have established the following equivalence:
$$\mathbb{P}^d_{\ge 0} = \Delta^d = S^d_{\ge 0} = S^d/\Gamma. \tag{7}$$
For the rest of the paper we will use the four characterizations of a compositional domain
interchangeably.
2.1.1 Spread-Out Construction
Based on (7), one can turn a compositional data analysis problem into one on a sphere via
spread-out construction. The key idea is to associate one compositional data point z ∈ ∆d =
Sd≥0 with a Γ-orbit of data points OrbitΓz ⊂ Sd in (5). Formally, given a point z ∈ ∆d , we
construct the following data set (not necessarily a set because of possible repetitions):
$$c^{-1}(z) = \big\{\, |\Gamma_{z'}| \text{ copies of } z', \ \text{ for } z' \in \mathrm{Orbit}^{\Gamma}_z \,\big\}, \tag{8}$$
where Γz′ is the stabilizer subgroup of Γ with respect to z′ in (6). In general, if there
are n observations in ∆d, the spread-out construction will create a data set with n · 2^{d+1}
observations on Sd , in which observations with zero coordinates are repeated. Figure 2 (a)
and (b) illustrate this idea with a toy data set with d = 2.
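In code, the spread-out construction (8) amounts to enumerating all sign flips. The following is our own sketch; note that for points with zero coordinates the repeated copies required by (8) appear automatically, since flipping a zero coordinate reproduces the same point:

```python
import numpy as np
from itertools import product

def spread_out(X):
    """Spread-out construction (8): inflate each composition to S^d_{>=0},
    then apply all 2^(d+1) sign flips (the elements of Gamma)."""
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)          # inflate to the sphere
    d1 = X.shape[1]
    signs = np.array(list(product([1.0, -1.0], repeat=d1)))   # the 2^(d+1) reflections
    return np.concatenate([X * s for s in signs], axis=0)

X = np.random.default_rng(1).dirichlet(np.ones(3), size=4)    # n = 4, d = 2
Y = spread_out(X)
print(Y.shape)   # (4 * 2**3, 3) = (32, 3) points on S^2
```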
2.2 Illustration: Compositional Density Estimation
The spread-out construction in (8) provides a new intimate relation between directional
statistics and compositional data analysis. Indeed, this construction produces a directional
data set out of a compositional data set, then we can literally transform a compositional data
problem into a directional statistics problem via this spread-out construction. For example,
we can perform compositional independence/uniform tests by doing directional independence/uniform tests (Jupp and Spurr, 1985; Jupp, 2008) through spread-out constructions.
In this section, we will give a new compositional density estimation framework by using
spread-out constructions. In directional statistics, density estimation for spherical data has
a long history dating back to the late 70s in Beran (1979). In the 80s, Hall et al. (1987) and
Bai et al. (1989) established systematic framework for spherical density estimation theory.
Spherical density estimation theory became popular later partly because its integrated squared
error (ISE) is closely related to goodness-of-fit testing, as in Zhao and Wu (2001) and the
more recent work of García-Portugués et al. (2015).
The rich development in spherical density estimation theory will yield a compositional
density framework via spread-out constructions. In the following we apply this idea to
nonparametric density estimation for compositional data. Instead of directly estimating the
density on ∆d , one can perform the estimation with the spread-out data on Sd , from which
a density estimate for compositional data can be obtained.
Figure 2: Toy compositional data on the simplex ∆2 in (a) are spread out to the sphere S2 in
(b). The density estimate on S2 in (c) is pulled back to ∆2 in (d).
Let p(·) denote a probability density function of a random vector Z on Sd≥0 , or equivalently
on ∆d . The following proposition gives a form of the density of the spread-out random vector
Γ(Z) on the whole sphere Sd .
Proposition 2.5. Let Z be a random variable on Sd≥0 with probability density p(·). Then the
induced random variable Γ(Z) = {γ(Z)}γ∈Γ has the following density p̃(·) on Sd:
$$\tilde{p}(z) = \frac{|\Gamma_z|}{|\Gamma|}\, p(c(z)), \quad z \in S^d, \tag{9}$$
where |Γz | is the cardinality of the stabilizer subgroup Γz of z.
Let c∗ denote the analogous operation for functions to the contraction map c that applies
to data points. It is clear that given a probability density p̃ on Sd , we can obtain the original
density on the compositional domain via the "pull back" operation c∗:
$$p(z) = c_*(\tilde{p})(z) = \sum_{x \in c^{-1}(z)} \tilde{p}(x), \quad z \in S^d_{\ge 0}.$$
Now consider estimating density on Sd with the spread-out data. Density estimation for
data on a sphere has been well studied in directional statistics (Hall et al., 1987; Bai et al.,
1989). For x1, . . . , xn ∈ Sd, a density estimate for the underlying density is
$$\hat{f}_n(z) = \frac{c_h}{n} \sum_{i=1}^{n} K\!\left( \frac{1 - z^T x_i}{h_n} \right), \quad z \in S^d,$$
where K is a kernel function that satisfies common assumptions in Assumption A.1, and ch
is a normalizing constant. Applying this to the spread-out data c−1 (xi ), i = 1, . . . , n, we
have a density estimate of p̃(·) defined on Sd :
$$\hat{f}^{\Gamma}_n(z) = \frac{c_h}{n|\Gamma|} \sum_{1 \le i \le n,\; \gamma \in \Gamma} K\!\left( \frac{1 - z^T \gamma(x_i)}{h_n} \right), \quad z \in S^d, \tag{10}$$
from which a density estimate on the compositional domain is obtained by applying c∗ . That
is,
$$\hat{p}_n(z) = c_*\hat{f}^{\Gamma}_n(z) = \sum_{x \in c^{-1}(z)} \hat{f}^{\Gamma}_n(x), \quad z \in S^d_{\ge 0}.$$
Figure 2 (c) and (d) illustrate this density estimation process with a toy example.
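The whole pipeline (spread out, estimate on the sphere, pull back) can be sketched as follows. This is our own illustration: the kernel K and bandwidth are placeholders, the normalizing constant c_h in (10) is omitted, and for simplicity the evaluation point is assumed to have no zero coordinates, so its fiber consists of all 2^(d+1) distinct sign flips.

```python
import numpy as np
from itertools import product

def kde_sphere(z, data, h, K=lambda t: np.exp(-t)):
    """Spherical kernel density estimate at z, up to the constant c_h (cf. (10))."""
    return np.mean(K((1.0 - data @ z) / h))

def kde_compositional(z, spread_data, h):
    """Pull-back estimate: sum the spherical estimate over the fiber c^{-1}(z)."""
    z = np.asarray(z, dtype=float)
    z = z / np.linalg.norm(z)                                   # inflate to S^d
    flips = (np.array(s) * z for s in product([1.0, -1.0], repeat=z.shape[0]))
    return sum(kde_sphere(x, spread_data, h) for x in flips)

# toy data: n = 50 compositions on Delta^2, spread out to 50 * 2**3 points on S^2
rng = np.random.default_rng(2)
X = rng.dirichlet(np.ones(3), size=50)
S = X / np.linalg.norm(X, axis=1, keepdims=True)
spread = np.concatenate([S * np.array(s) for s in product([1.0, -1.0], repeat=3)], axis=0)
print(kde_compositional([0.2, 0.3, 0.5], spread, h=0.2))
```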
The consistency of the spherical density estimate $\hat{f}_n$ is established by Zhao and Wu (2001) and
García-Portugués et al. (2015), where it is shown that the integrated squared error (ISE) of
$\hat{f}_n$, $\int_{S^d} (\hat{f}_n - f)^2\, dz$, follows a central limit theorem. It is straightforward to show that the
ISE of the proposed compositional density estimator $\hat{p}_n$ on the compositional domain is also
asymptotically normally distributed by a CLT. However, the CLT of the ISE for spherical densities
in Zhao and Wu (2001) contains an unnecessary finite support assumption on the density
kernel function K (which is very different from a reproducing kernel); although in García-Portugués
et al. (2015) such a finite support condition is dropped, their result is for directional-linear
data, and their proof does not directly apply to the purely directional context. For the
readers' convenience, we provide the proof of the CLT of the ISE for both compositional and
spherical data, without the finite support condition of Zhao and Wu (2001).
Theorem 2.6. The CLT for the ISE holds for both directional and compositional data under the mild
conditions (H1, H2 and H3) in Section A.1, without the finite support condition on the density
kernel function K.
The details of the proof of Theorem 2.6, together with the statements of the technical conditions,
can be found in Section A.1.
3 Reproducing Kernels of Compositional Data
This section is devoted to constructing reproducing kernel structures on compositional domains,
based on the topological re-interpretation of ∆d in Section 2. The key idea is that based on
the quotient map π : Sd → Sd /Γ = ∆d , we can use function spaces on spheres to understand
function spaces on compositional domains. Moreover, we can construct reproducing kernel
structures of a compositional domain ∆d based on those on Sd .
The reproducing kernel was first introduced in 1907 by Zaremba when he studied boundary value problems for harmonic and biharmonic functions, but the systematic development
of the subject was finally done in the early 1950s by Aronszajn (1950). Reproducing kernels
on Sd were essentially discovered by Laplace and Legendre in the 19th century, although
the reproducing kernels on spheres were called zonal spherical functions at that time. Both
spherical harmonics theory and RKHS have found applications in theoretical subjects like
functional analysis, representation theory of Lie groups and quantum mechanics. In statistics, the successful application of RKHS in spline models by Wahba (1981) popularized
RKHS theory for Sd . In particular, they used spherical harmonics theory to construct an
RKHS on S2 . Generally speaking, for a fixed topological space X, there exists (and one can
construct) multiple reproducing kernel Hilbert spaces on X; in their work, an RKHS on S2
was constructed by considering a subspace of L2 (S2 ) under a finiteness condition, and the
reproducing kernels were also built out of zonal spherical functions. Their work is motivated
by studying spline models on the sphere, while our motivation has nothing to do with spline
models of any kind. In this work we consider reproducing structures on spheres which are
different from the one in Wahba (1981), but we share the same building block, namely
spherical harmonics theory.
Building on the re-interpretation of a compositional domain ∆d as Sd/Γ, we will construct reproducing kernels of compositional domains by using reproducing kernel structures on spheres.
Spherical harmonics theory gives reproducing kernel structures on Sd, and a compositional domain ∆d is topologically covered by a sphere with deck transformation group
Γ. Thus we naturally wonder (i) whether function spaces on ∆d can be identified with the subspaces of Γ-invariant functions on Sd, and (ii) whether one might "build" Γ-invariant kernels
out of spherical reproducing kernels, and hope that the Γ-invariant kernels can play the
role of “reproducing kernels” on ∆d . It turns out that the answers for both (i) and (ii) are
positive (see Remark 3.8 and Theorem 3.12). The discovery of reproducing kernel structures
on ∆d is crucially based on the reinterpretation of compositional domains via projective and
spherical geometries in Section 2.
By considering Γ-invariant objects in spherical function spaces we managed to construct
reproducing kernel structures for compositional domains, and compositional reproducing
Hilbert spaces. Although the compositional RKHS was first considered as a candidate "inner
product space" for data points to be mapped into, the benefit of working with an RKHS goes
far beyond this, due to the exciting development of kernel techniques in machine learning
theory that can be applied to compositional data analysis, as mentioned in Section 1.1.
This gives a chance to construct a new framework for compositional data analysis, in
which we "upgrade" compositional data points to functions (via reproducing kernels), and the
classical statistical notions, like means and variance-covariances, are "upgraded" to linear
functionals and linear operators over the function space. Traditionally important statistical
topics such as dimension reduction, regression analysis, and many inference problems can be
re-addressed in the light of these new "kernel techniques".
3.1 Recollection of Basic Facts from Spherical Harmonics Theory
We give a brief review of the theory of spherical harmonics in the following. See Atkinson
and Han (2012) for a general introduction to the topic. In classical linear algebra, a finite
dimensional linear space with a linear map to itself can be decomposed into direct sums
of eigenspaces. Such a phenomenon still holds for L2 (Sd ) with Laplacians being the linear
operator to itself. Recall that the Laplacian operator on a function f with d + 1 variables is
$$\Delta f = \sum_{i=1}^{d+1} \frac{\partial^2 f}{\partial x_i^2}.$$
Let Hi be the i-th eigenspace of the Laplacian operator. It is known that L2 (Sd ) can be
orthogonally decomposed as
$$L^2(S^d) = \bigoplus_{i=0}^{\infty} H_i, \tag{11}$$
where the orthogonality is with respect to the inner product in $L^2(S^d)$: $\langle f, g \rangle = \int_{S^d} f \bar{g}$.
Let Pi (d + 1) be the space of homogeneous polynomials of degree i in d + 1 coordinates on
Sd . A homogeneous polynomial is a polynomial whose terms are all monomials of the same
degree, e.g., P4 (3) includes xy 3 + x2 yz. Further, let Hi (d + 1) be the space of homogeneous
harmonic polynomials of degree i on Sd , i.e.,
$$H_i(d+1) = \{P \in P_i(d+1) \mid \Delta P = 0\}. \tag{12}$$
For example, $x^3y + xy^3 - 6xyz^2$ and $x^4 - 6x^2y^2 + y^4$ are members of H4(3).
Importantly, the spherical harmonic theory has established that each eigenspace Hi in
(11) is indeed the same space as Hi (d + 1). This implies that any function in L2 (Sd ) can be
approximated by an accumulated direct sum of orthogonal homogeneous harmonic polynomials. The following well-known proposition further reveals that the Laplacian constraint in
(12) is not necessary to characterize the function space on the sphere.
Proposition 3.1. Let Pm(d + 1) be the space of degree m homogeneous polynomials in d + 1
variables restricted to the unit sphere, and let Hi be the i-th eigenspace of L2(Sd). Then
$$P_m(d+1) = \bigoplus_{i=0}^{\lfloor m/2 \rfloor} H_{m-2i},$$
where $\lfloor \cdot \rfloor$ stands for the round-down integer; that is, $P_m(d+1)$ is the direct sum of the
eigenspaces whose degrees have the same parity as m.
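As a concrete illustration (a worked example we add here, not from the original text): on S², the degree-2 homogeneous polynomial x² splits into its H₀ and H₂ components, using x² + y² + z² = 1 on the sphere:
$$x^2 \;=\; \underbrace{\tfrac{1}{3}\,(x^2 + y^2 + z^2)}_{\in\, H_0} \;+\; \underbrace{\tfrac{1}{3}\,(2x^2 - y^2 - z^2)}_{\in\, H_2},$$
where the second term is harmonic since $\Delta(2x^2 - y^2 - z^2) = 4 - 2 - 2 = 0$.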
From Proposition 3.1, one can see that any L2 function on Sd can be approximated by
homogeneous polynomials. An important feature of spherical harmonics theory is that it
gives reproducing structures on spheres, and now we will recall this fact. For the following
discussion, we will fix a Laplacian eigenspace Hi inside L2 (Sd ), so Hi is a finite dimensional
Hilbert space on Sd ; such a restriction on a single piece Hi is necessary because the entire
Hilbert space L2 (Sd ) does not have a reproducing kernel given that the Delta functional on
L2(Sd) is not a bounded functional.¹
3.2 Zonal Spherical Functions as Reproducing Kernels in Hi
On each Laplacian eigenspace Hi inside L2 (Sd ) on general d-dimensional spheres, we define
a linear functional Lx on Hi , such that for each Y ∈ Hi , Lx (Y ) = Y (x) for a fixed point
x ∈ Sd . General spherical harmonics theory tells us that there exists ki (x, t) such that:
$$L_x(Y) = Y(x) = \int_{S^d} Y(t)\, k_i(x, t)\, dt, \quad x \in S^d;$$
this function ki (x, t) is the representing function of the functional Lx (Y ), and classical spherical harmonics theory refers to the function ki (x, t) as the zonal spherical function, and
furthermore, they are actually “reproducing kernels” inside Hi ⊂ L2 (Sd ) in the sense of
Aronszajn (1950). Another way to appreciate spherical harmonics theory is that it tells that
each Laplacian eigenspace Hi ⊂ L2 (Sd ) is actually a reproducing kernel Hilbert space on Sd ,
and the special case d = 2 was used in Wahba (1981).
Let us recollect some basic facts of zonal spherical functions for readers’ convenience in
the next Proposition. One can find their proofs in almost any modern spherical harmonics
references, in particular in Stein and Weiss (1971, Chapter IV):
Proposition 3.2. The following properties hold for the zonal spherical function ki (x, t),
which is also the reproducing kernel inside Hi ⊂ L2 (Sd ) with dimension ai .
(a) For a choice of orthonormal basis {Y1, . . . , Yai} in Hi, we can express the kernel as
$k_i(x, t) = \sum_{j=1}^{a_i} Y_j(x) Y_j(t)$, but $k_i(x, t)$ does not depend on the choice of basis.

(b) $k_i(x, t)$ is a real-valued function and symmetric, i.e., $k_i(x, t) = k_i(t, x)$.

(c) For any orthogonal matrix $R \in O(d+1)$, we have $k_i(x, t) = k_i(Rx, Rt)$.

(d) $k_i(x, x) = \dfrac{a_i}{\mathrm{vol}(S^d)}$ for any point $x \in S^d$.

(e) $k_i(x, t) \le \dfrac{a_i}{\mathrm{vol}(S^d)}$ for any $x, t \in S^d$.
Remark 3.3. The above proposition “seems” obvious from traditional perspectives, as if it
could be found in any textbook, so readers with rich experience with RKHS theory might
think that we are stating something trivial. However, we want to point out two facts.
¹ At first sight, this might seem to contradict the discussion on splines on 2-dimensional spheres in Wahba
(1981), but a careful reader can find that a finiteness constraint was imposed there, and it was never claimed
that L2(S2) is an RKHS. That is, their RKHS on S2 is a subspace of L2(S2).
(1) Function spaces over underlying spaces with different topological structures behave
very differently. Spheres are compact with no boundary, and their function spaces have
Laplacian operators whose eigenspaces are finite dimensional and possess reproducing kernel structures inside those finite dimensional eigenspaces. These coincidences are
not expected to happen over other general topological spaces.
(2) Relative to classical topological spaces whose RKHS were used more often, e.g. unit
intervals or vector spaces, spheres are more “exotic” topological structures (simply
connected space, but with nontrivial higher homotopy groups), while intervals or vector
spaces are contractible with trivial homotopy groups. One way to appreciate spherical
harmonics theory is that classical “naive” expectations can still happen on spheres.
In the next subsection we discuss the corresponding function space in the compositional
domain ∆d .
3.3 Function Spaces on Compositional Domains
With the identification ∆d = Sd/Γ, the function space L2(∆d) can be identified with
L2(Sd/Γ), i.e., L2(∆d) = L2(Sd/Γ). The function space L2(Sd) is well understood by spherical harmonics theory as above, so we want to relate L2(Sd/Γ) with L2(Sd) as follows. Notice
that a function h ∈ L2(Sd/Γ) is a map from Sd/Γ to (real or complex) numbers. Thus a
natural associated function π^*(h) ∈ L2(Sd) is given by the following composition of maps:
$$h \circ \pi : \; S^d \xrightarrow{\;\pi\;} S^d/\Gamma \xrightarrow{\;h\;} \mathbb{C}.$$
Therefore, the composition h ∘ π = π^*(h) ∈ L2(Sd) gives rise to a natural embedding of the
function space of compositional domains into that of a sphere, π^* : L2(Sd/Γ) → L2(Sd).
The embedding π ∗ identifies the Hilbert space of compositional domains as a subspace of
the Hilbert space of spheres. A natural question is how to characterize the subspace in L2 (Sd )
that corresponds to functions on compositional domains. The following proposition states
that f ∈ im(π ∗ ) if and only if f is constant on fibers of the projection map π : Sd → Sd /Γ,
almost everywhere. In other words, f takes the same values on all Γ orbits, i.e., on the set
of points which are connected to each other by “sign flippings”.
Proposition 3.4. The image of the embedding π^* : L2(Sd/Γ) → L2(Sd) consists of functions
f ∈ L2(Sd) such that f, up to a measure zero set, is constant on π−1(x) for every x ∈ Sd/Γ,
where π is the natural projection Sd → Sd/Γ.
We call a function f ∈ L2 (Sd ) that lies in the image of the embedding π ∗ a Γ-invariant
function. Now we construct the contraction map π_* : L2(Sd) → L2(Sd/Γ); this map
descends every function on the sphere to a function on the compositional domain. To construct π_*,
it suffices to associate a Γ-invariant function to every function in L2 (Sd ). For a point z ∈ Sd
and a reflection γ ∈ Γ, a point γ(z) lies in the set OrbitΓz which is defined in (5). Starting
with a function f ∈ L2 (Sd ), we will define the associated Γ-invariant function f Γ as follows:
Proposition 3.5. Let f be a function in L2(Sd). Then the following f^Γ,
$$f^{\Gamma}(z) = \frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} f(\gamma(z)), \quad z \in S^d, \tag{13}$$
is a Γ-invariant function.
Proof Each fiber of the projection map π : Sd → Sd /Γ is OrbitΓz for some z in the fiber.
For any other point y on the same fiber with z for the projection π, there exists a reflection
γ ∈ Γ such that y = γ(z). Then this proposition follows from the identity f Γ (z) = f Γ (γ(z)),
which can be easily checked.
The contraction f ↦ f^Γ on spheres naturally gives the following map:
$$\pi_* : L^2(S^d) \to L^2(S^d/\Gamma), \quad f \mapsto f^{\Gamma}. \tag{14}$$
Remark 3.6. Some readers might argue that each element in an L2 space is a function class
rather than a function, so in that sense π_*(f) = f^Γ is not well-defined; but note that each
element in L2(Sd) can be approximated by polynomials, and π_*, which is well defined on
individual polynomials, induces a well defined map on function classes.
Theorem 3.7. This contraction map π∗ : L2 (Sd ) → L2 (Sd /Γ), as defined in (14), has a
section given by π ∗ , namely the composition π∗ ◦ π ∗ induces the identity map from L2 (Sd /Γ)
to itself. In particular, the contraction map π∗ is a surjection.
Proof One way to look at the relation between the two maps π_* and π^* is through the diagram
$$L^2(S^d/\Gamma) \;\underset{\pi_*}{\overset{\pi^*}{\rightleftarrows}}\; L^2(S^d).$$
The image of π^* consists of Γ-invariant functions in L2(Sd).
Conversely, given a Γ-invariant function g ∈ L2(Sd), the map g ↦ g^Γ is the identity map,
i.e., g = g Γ , thus the theorem follows.
Remark 3.8. Theorem 3.7 identifies functions on compositional domains as Γ-invariant functions in L2 (Sd ). For any function f ∈ L2 (Sd ), we can produce the corresponding Γ-invariant
function f Γ by (13). More importantly, we can “recover” L2 (∆d ) from L2 (Sd ), without losing any information. This allows us to construct reproducing kernels of ∆d from L2 (Sd ) in
Section 3.5.
3.4 Further Reduction to Homogeneous Polynomials of Even Degrees
In this section we provide a further simplification of the homogeneous polynomials in the
finite direct sum space $\bigoplus_{i=0}^{m} H_i$. Proposition 3.1 tells us that if m is even, then
$P_m(d+1) = \bigoplus_{i=0}^{m/2} H_{2i}$, and that if m is odd then $P_m(d+1) = \bigoplus_{i=0}^{(m-1)/2} H_{2i+1}$, where $P_m(d+1)$ is
the space of degree m homogeneous polynomials in d + 1 variables. In either case (m even
or odd), the highest-degree eigenspace appearing in $P_m(d+1)$ is $H_m$. Therefore we can decompose the finite direct sum
space $\bigoplus_{i=0}^{m} H_i$ into the direct sum of two homogeneous polynomial spaces:
$$\bigoplus_{i=0}^{m} H_i = P_m(d+1) \oplus P_{m-1}(d+1).$$
However, we will show that any monomial containing a variable of odd degree collapses to zero upon taking
its Γ-invariant, so only one piece of the above homogeneous polynomial space "survives"
under the contraction map π_*. This will further simplify the function space, which in turn
facilitates easy computation.
Specifically, when working with the accumulated direct sums $\bigoplus_{i=0}^{m} H_i$ on spheres, not every
function is a meaningful function on ∆d = Sd/Γ; e.g., we can find a nonzero function
$f \in \bigoplus_{i=0}^{m} H_i$ with $f^{\Gamma} = 0$. In fact, the odd pieces, i.e., the eigenspaces $H_m$ with m
odd, contribute nothing to L2(∆d) = L2(Sd/Γ). In other words, the accumulated
direct sum $\bigoplus_{i=0}^{m} H_{2i+1}$ is "killed" to zero under π_*, as shown by the following lemma:
Lemma 3.9. For every monomial $\prod_{i=1}^{d+1} x_i^{\alpha_i}$ (each $\alpha_i \ge 0$), if there exists k with $\alpha_k$
odd, then the monomial $\prod_{i=1}^{d+1} x_i^{\alpha_i}$ is a shadow function, that is, $\left(\prod_{i=1}^{d+1} x_i^{\alpha_i}\right)^{\Gamma} = 0$.
An important implication of this lemma is that since each homogeneous polynomial in
$\bigoplus_{i=0}^{k} H_{2i+1}$ is a linear combination of monomials with at least one variable of odd degree, it is killed under
π_*. This implies that all "odd" pieces in $L^2(S^d) = \bigoplus_{i=0}^{\infty} H_i$ contribute nothing to
L2(∆d) = L2(Sd/Γ). Therefore, whenever using spherical harmonics theory to understand
function spaces of compositional domains, it suffices to consider only even i for Hi in L2(Sd).
In summary, the function space on the compositional domain ∆d = Sd/Γ has the following
eigenspace decomposition:
$$L^2(\Delta^d) = L^2(S^d/\Gamma) = \bigoplus_{i=0}^{\infty} H_{2i}^{\Gamma}, \tag{15}$$
where $H_{2i}^{\Gamma} := \{h \in H_{2i} : h = h^{\Gamma}\}$.
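To make (15) concrete, consider a small worked example of our own (not from the original text) with d = 1: the eigenspace H₂ on S¹ is spanned by x² − y² and xy. Averaging over Γ = {(±1, ±1)} gives
$$(x^2 - y^2)^{\Gamma} = x^2 - y^2, \qquad (xy)^{\Gamma} = \tfrac{1}{4}\big(xy - xy - xy + xy\big) = 0,$$
so xy is a shadow function in the sense of Lemma 3.9, and $H_2^{\Gamma}$ is spanned by $x^2 - y^2$ alone.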
3.5 Reproducing Kernels for Compositional Domains
With the understanding of function spaces on compositional domains as invariant functions
on spheres, we are ready to use spherical harmonic theory to construct reproducing kernel
structures on compositional domains.
3.5.1 Γ-invariant Functionals on Hi
The main goal of this section is to establish reproducing kernels for compositional data.
Inside each Laplacian eigenspace Hi in L2 (Sd ), recall that the Γ-invariant subspace HiΓ can
be regarded as a function space on ∆d = Sd/Γ, based on (15). To find a candidate
reproducing kernel inside HiΓ, we first identify the representing function for the following
linear functional LΓz on Hi, which is defined as follows: for any function Y ∈ Hi,
$$L^{\Gamma}_z(Y) = Y^{\Gamma}(z) = \frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} Y(\gamma z),$$
for a given z ∈ Sd . One can easily see that LΓz and Lz agree on the subspace HiΓ inside Hi
and also that LΓz can be seen as a composed map LΓz = Lz π∗ : Hi → HiΓ → C. Note that
although LΓz is defined on Hi , it can actually be seen as a “Delta functional” on Sd /Γ = ∆d .
To find the representing function for LΓz , we will use zonal spherical functions: Let ki (·, ·)
be the reproducing kernel in the eigenspace Hi . Define the “compositional” kernel kiΓ (·, ·)
for Hi as
$$k_i^{\Gamma}(x, y) = \frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} k_i(\gamma x, y), \quad \forall x, y \in S^d, \tag{16}$$
from which it is straightforward to check that kiΓ (z, ·) represents linear functionals of the
form LΓz , simply by following the definitions.
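The compositional kernel in (16) can be computed directly once a zonal spherical function is available. The sketch below is our own illustration (all function names are ours), assuming the standard addition-theorem formula $k_i(x,t) = \frac{a_i}{\mathrm{vol}(S^d)}\, C_i^{(d-1)/2}(\langle x, t\rangle)/C_i^{(d-1)/2}(1)$ for $d \ge 2$, where $C_i^{\lambda}$ is the Gegenbauer polynomial:

```python
import numpy as np
from itertools import product
from math import comb, gamma, pi
from scipy.special import gegenbauer

def dim_harmonics(i, d):
    """a_i = dim H_i(d+1): dimension of degree-i spherical harmonics on S^d."""
    if i == 0:
        return 1
    if i == 1:
        return d + 1
    return comb(d + i, i) - comb(d + i - 2, i - 2)

def zonal_kernel(i, d):
    """Zonal spherical function k_i(x, t) on S^d (assumes d >= 2), via the
    addition theorem with the Gegenbauer polynomial C_i^{(d-1)/2}."""
    C = gegenbauer(i, (d - 1) / 2.0)
    a_i = dim_harmonics(i, d)
    vol_sd = 2 * pi ** ((d + 1) / 2) / gamma((d + 1) / 2)   # surface area of S^d
    return lambda x, t: a_i / vol_sd * C(np.dot(x, t)) / C(1.0)

def compositional_kernel(i, d):
    """k_i^Gamma(x, y) = (1/|Gamma|) * sum_gamma k_i(gamma x, y), as in (16)."""
    k = zonal_kernel(i, d)
    signs = [np.array(s) for s in product([1.0, -1.0], repeat=d + 1)]
    return lambda x, y: np.mean([k(s * x, y) for s in signs])

# toy check of the symmetry stated in Proposition 3.11
x = np.array([0.2, 0.3, 0.5]); x /= np.linalg.norm(x)
y = np.array([0.1, 0.6, 0.3]); y /= np.linalg.norm(y)
k2 = compositional_kernel(2, d=2)
print(k2(x, y), k2(y, x))
```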
Remark 3.10. The above definition of “compositional kernels” in (16) is not just a trick only
to get rid of the “redundant points” on spheres. This definition is inspired by the notion of
“orbital integrals” in analysis and geometry. In our case, the “integral” is a discrete version,
because the “compact subgroup” in our situation is replaced by a finite discrete reflection
group Γ. In fact, such kind of “discrete orbital integral” construction is not new in statistical
learning theory, e.g., Reisert and Burkhardt (2007) also used the “orbital integral” type of
construction to study equivariant matrix valued kernels.
At first sight, a compositional kernel is not symmetric on the nose, because we are only
"averaging" over the group orbit in the first variable of the function ki(x, y). However,
since ki(x, y) is both symmetric and orthogonally invariant by Proposition 3.2, somewhat
counter-intuitively, compositional kernels are actually symmetric:
Proposition 3.11. Compositional kernels are symmetric, namely kiΓ (x, y) = kiΓ (y, x).
Proof Recall that ki (x, y) = ki (y, x) and that ki (Gx, Gy) = ki (x, y) for any orthogonal
matrix G. Notice that every reflection γ ∈ Γ can be realized as an orthogonal matrix, then
we have
\begin{align*}
k_i^{\Gamma}(x, y) &= \frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} k_i(\gamma x, y) \\
&= \frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} k_i(y, \gamma x) = \frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} k_i(\gamma^{-1} y, \gamma^{-1}(\gamma x)) \\
&= \frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} k_i(\gamma^{-1} y, x) \\
&= \frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} k_i(\gamma y, x) \\
&= k_i^{\Gamma}(y, x).
\end{align*}
Recall that HiΓ is the space of Γ-invariant functions inside Hi, and by (15), HiΓ is the i-th subspace
of the compositional function space L2(∆d). A naïve candidate for the reproducing kernel inside
HiΓ, denoted as wi(x, y), might be the spherical reproducing kernel ki(x, y), but ki(x, y) is
not Γ-invariant. It turns out that the compositional kernels are actually reproducing with
respect to all Γ-invariant functions in Hi , while being Γ-invariant on both arguments.
Theorem 3.12. Inside Hi , the compositional kernel kiΓ (x, y) is Γ-invariant on both arguments x and y, and moreover kiΓ (x, y) = wi (x, y), i.e., the compositional kernel is the
reproducing kernel for HiΓ .
Proof Firstly by the definition, kiΓ (x, y) is Γ-invariant on the first argument x; by the
symmetry of kiΓ (x, y) in Proposition 3.11, it is then also Γ-invariant on the second argument
y, hence the compositional kernel kiΓ (x, y) is a kernel inside HiΓ .
Secondly, let us prove the reproducing property of kiΓ (x, y). For any Γ-invariant function
f ∈ HiΓ ⊂ Hi ,
\begin{align*}
\langle f(t), k_i^{\Gamma}(x, t) \rangle &= \Big\langle f(t), \frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} k_i(\gamma x, t) \Big\rangle \\
&= \frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} \langle f(t), k_i(\gamma x, t) \rangle \\
&= \frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} f(\gamma x) = \frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} f(x) \quad (f \text{ is } \Gamma\text{-invariant}) \\
&= f(x).
\end{align*}
Remark 3.13. Theorem 3.12 justifies that a compositional kernel is actually the reproducing
kernel for functions inside HiΓ . Although the compositional kernel kiΓ (x, y) is symmetric as
is proved in Proposition 3.11, we will still use wi (x, y) to denote kiΓ (x, y) because wi (x, y) is,
notationally speaking, more visually symmetric than the notation for compositional kernels.
3.5.2 Compositional RKHS and Spaces of Homogeneous Polynomials
Recall that based on Proposition 3.1, the direct sum of the even-degree (resp. odd-degree)
eigenspaces can be expressed as the space of homogeneous polynomials of a fixed degree. Further
recall that the direct sum decomposition $L^2(S^d) = \bigoplus_{i=0}^{\infty} H_i$ is an orthogonal one, and so
is the direct sum $L^2(\Delta^d) = \bigoplus_{i=0}^{\infty} H_i^{\Gamma}$. By the orthogonality between eigenspaces, the reproducing kernel for the finite direct sum $\bigoplus_{i=0}^{m} H_i^{\Gamma}$ is naturally the summation $\sum_{i=0}^{m} w_i$. Note
that by Lemma 3.9, it suffices to consider only the even pieces of eigenspaces $H_{2i}$. Finally, we
give a formal definition of "the degree m reproducing kernel Hilbert space" on ∆d, consisting of
degree 2m homogeneous polynomials:
Definition 3.14. Let $w_i$ be the reproducing kernel for Γ-invariant functions in the i-th
eigenspace $H_i \subset L^2(S^d)$. The degree m compositional reproducing kernel Hilbert space is
defined to be the finite direct sum $\bigoplus_{i=0}^{m} H_{2i}^{\Gamma}$, and the reproducing kernel for the degree m
compositional reproducing kernel Hilbert space is
$$\omega_m(\cdot, \cdot) = \sum_{i=0}^{m} w_{2i}(\cdot, \cdot). \tag{17}$$
Thus the degree m RKHS for the compositional domain is the pair $\left( \bigoplus_{i=0}^{m} H_{2i}^{\Gamma},\ \omega_m \right)$.
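Definition 3.14 can be implemented by summing the even-degree compositional kernels; the short sketch below is ours and reuses compositional_kernel() from the illustration following (16):

```python
def omega(m, d):
    """Degree-m compositional reproducing kernel (17):
    omega_m(x, y) = sum_{i=0}^{m} w_{2i}(x, y), with w_{2i} = k_{2i}^Gamma."""
    kernels = [compositional_kernel(2 * i, d) for i in range(m + 1)]
    return lambda x, y: sum(k(x, y) for k in kernels)

omega_3 = omega(3, d=2)   # reproducing kernel of the degree-3 compositional RKHS on Delta^2
```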
Recall that the direct sum $\bigoplus_{i=0}^{m} H_{2i}^{\Gamma}$ can be identified as a subspace of $\bigoplus_{i=0}^{m} H_{2i}$, which is
isomorphic to the space of degree 2m homogeneous polynomials, so each function in $\bigoplus_{i=0}^{m} H_{2i}^{\Gamma}$
can be written as a degree 2m homogeneous polynomial, including the reproducing kernel
$\omega_m(x, \cdot)$, although it is not obvious from (17). Notice that for a point $(x_1, x_2, \ldots, x_{d+1}) \in S^d$,
the sum $\sum_{i=1}^{d+1} x_i^2 = 1$, so one can always use this sum to turn each element in $\bigoplus_{i=0}^{m} H_{2i}^{\Gamma}$ into a
homogeneous polynomial. For example, $x^2 + 1$ is not a homogeneous polynomial, but each
point $(x, y, z) \in S^2$ satisfies $x^2 + y^2 + z^2 = 1$, so we have $x^2 + 1 = x^2 + x^2 + y^2 + z^2 = 2x^2 + y^2 + z^2$,
which is a homogeneous polynomial on the sphere $S^2$.
In fact, we can say something more about $\bigoplus_{i=0}^{m} H_{2i}^{\Gamma}$. Recall that Lemma 3.9 "killed"
the contributions from the "odd pieces" $H_{2k+1}$ under the contraction map $\pi_* : L^2(S^d) \to L^2(\Delta^d)$.
However, even inside $\bigoplus_{i=0}^{m} H_{2i}$, only a subspace can be identified with a compositional function space, namely, the Γ-invariant homogeneous polynomials. The following proposition
gives a characterization of which homogeneous polynomials inside $\bigoplus_{i=0}^{m} H_{2i}$ come from the
subspace $\bigoplus_{i=0}^{m} H_{2i}^{\Gamma}$:
Proposition 3.15. Given any element $\theta \in \bigoplus_{i=0}^{m} H_{2i}^{\Gamma} \subset L^2(S^d/\Gamma)$, there exists
a degree m homogeneous polynomial $p_m$ such that
$$\theta(x_1, x_2, \ldots, x_{d+1}) = p_m(x_1^2, x_2^2, \cdots, x_{d+1}^2). \tag{18}$$
Proof Note that θ is a degree 2m homogeneous Γ-invariant polynomial, so each monomial
in θ has the form $\prod_{i=1}^{d+1} x_i^{a_i}$ with $\sum_{i=1}^{d+1} a_i = 2m$.
Suppose θ contains a monomial $\prod_{i=1}^{d+1} x_i^{a_i}$ with nonzero coefficient such that $a_i$ is odd for some
$1 \le i \le d+1$. Since θ is Γ-invariant, i.e., $\theta = \theta^{\Gamma}$, such a monomial would have to satisfy
$\prod_{i=1}^{d+1} x_i^{a_i} = \left(\prod_{i=1}^{d+1} x_i^{a_i}\right)^{\Gamma}$, but the term $\left(\prod_{i=1}^{d+1} x_i^{a_i}\right)^{\Gamma}$ is zero by Lemma 3.9, a contradiction.
Thus θ is a linear combination of monomials of the form $\prod_{i=1}^{d+1} x_i^{a_i} = \prod_{i=1}^{d+1} (x_i^2)^{a_i/2}$ with each
$a_i$ even and $\sum_i a_i/2 = m$, and the proposition follows.
Recall that the degree m compositional RKHS is $\left( \bigoplus_{i=0}^{m} H_{2i}^{\Gamma},\ \omega_m \right)$ in Definition 3.14, and
$\bigoplus_{i=0}^{m} H_{2i}$ consists of degree 2m homogeneous polynomials while $\bigoplus_{i=0}^{m} H_{2i}^{\Gamma}$ is just a subspace
of it. Proposition 3.15 tells us that one can also have a concrete description of the subspace
$\bigoplus_{i=0}^{m} H_{2i}^{\Gamma}$ via those degree m homogeneous polynomials in squared variables.
4 Applications of Compositional Reproducing Kernels
The availability of compositional reproducing kernels will open a door to many statistical/machine learning techniques for compositional data analysis. However, we will only
present two application scenarios, as an initial demonstration of the influence of RKHS theory on compositional data analysis. The first application is the representer theorem, which is
motivated by newly developed kernel-based machine learning, especially by the rich theory
of vector-valued regression (Micchelli and Pontil, 2005; Minh and Sindhwani, 2011). The
second one is constructing exponential families on compositional domains. The parameters of
compositional exponential models are compositional reproducing kernels. To the best of the authors' knowledge, these will be the first class of nontrivial examples of explicit distributions
on compositional domains with non-vanishing densities on the boundary.
4.1 Compositional Representer Theorems
Beyond the successful applications to traditional spline models, representer theorems are
increasingly relevant due to the new kernel techniques in machine learning. We will consider
minimal norm interpolations and least squares regularizations in this paper. Regularizations
are especially important in many situations, like structured prediction, multi-task learning,
multi-label classification and related themes that attempt to exploit output structure.
A common theme in the above-mentioned contexts is non-parametric estimation of a
vector-valued function f : X → Y, between a structured input space X and a structured
output space Y. An important framework adopted in those analyses is that of "vector-valued reproducing kernel Hilbert spaces" in Micchelli and Pontil (2005). Unsurprisingly, representer
theorems not only are necessary, but also call for further generalizations in various contexts:
(i) In classical spline models, the most frequently used version of representer theorems is
about scalar-valued kernels; but besides the above-mentioned scenario f : X → Y
in the manifold regularization context, in which vector-valued representer theorems are
needed, higher tensor-valued kernels and their corresponding representer theorems
are also desirable. In Reisert and Burkhardt (2007), matrix valued kernels and their
representer theorems are studied, with applications in image processing.
(ii) Another related application lies in the popular kernel mean embedding theories, in
particular, conditional mean embedding. Conditional mean embedding theory essentially gives an operator from an RKHS to another (Grunewalder et al., 2012). In
order to learn such operators, vector-valued regressions plus corresponding representer
theorems are used.
In the vector-valued regression framework, an important assumption discussed in representer
theorems is the linear independence condition (Micchelli and Pontil, 2005). As our construction of the compositional RKHS is based on finite dimensional spaces of polynomials, the linear
independence condition is not freely satisfied on the nose, so we will address this problem
in this paper. Instead of dealing with vector-valued kernels, we will only focus on the special
case of scalar valued (reproducing) kernels, but the issue can be clearly seen in this special
case.
4.1.1 Linear Independence of Compositional Reproducing Kernels
The compositional RKHS that was constructed in Section 3 takes the form $\left( \bigoplus_{i=0}^{m} H_{2i}^{\Gamma},\ \omega_m \right)$,
indexed by m. Given the finite dimensional nature of the compositional RKHS, it is not even
clear whether different points yield different functions $\omega_m(x_i, \cdot)$ inside $\bigoplus_{i=0}^{m} H_{2i}^{\Gamma}$. We will
give a positive answer when m is high enough.

Given a set of distinct compositional data points $\{x_i\}_{i=1}^{n} \subset \Delta^d$, we will show that the
corresponding set of reproducing functions $\{\omega_m(x_i, \cdot)\}_{i=1}^{n}$ forms a linearly independent set
inside $\bigoplus_{i=0}^{m} H_{2i}^{\Gamma}$ if m is high enough.
Theorem 4.1. Let {x_i}_{i=1}^n be distinct data points on a compositional domain ∆^d. Then there exists a positive integer M such that, for any m > M, the set of functions {ω_m(x_i, ·)}_{i=1}^n is linearly independent in ⊕_{i=0}^m H_{2i}^Γ.
Proof  The quotient map c_∆ : S^d → ∆^d factors through a projective space, i.e., c_∆ : S^d → P^d → ∆^d. The main idea is to prove a stronger statement: distinct data points in P^d yield linearly independent projective kernels for large enough m, where projective kernels are reproducing kernels on P^d whose definition is given in Section A.3. We then construct two vector subspaces V_1^m and V_2^m and a linear map h_m from V_1^m to V_2^m. The key trick is that the matrix representing h_m becomes diagonally dominant when m is large enough, which forces the spanning sets of both V_1^m and V_2^m to be linearly independent. More details of the proof are given in Section A.3.
In the proof of Theorem 4.1 we make use of the homogeneous polynomials (y_i · t)^{2m}, which do not live inside a single piece H_{2i}; this is why we use the direct sum space ⊕_{i=0}^m H_{2i} for our construction of the RKHS. One might wonder whether the same argument works without projective kernels; the issue is that the matrix could then have ±1 off-diagonal entries, which prevents diagonal dominance no matter how large m is. We reduce the problem to the linear independence of projective kernels at distinct points because reproducing kernels of distinct compositional data points are linear combinations of distinct projective kernels; in this way, each off-diagonal term is a power of the inner product of two vectors that are neither antipodal nor identical, so the off-diagonal terms vanish as m increases.

Another consequence of Theorem 4.1 is that ω_m(x_i, ·) ≠ ω_m(x_j, ·) whenever i ≠ j, provided m is large enough. A large enough m not only separates points at the level of reproducing kernels, but also gives each data point its "own dimension."
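The mechanism behind Theorem 4.1 can be illustrated numerically. The following minimal sketch (our own illustration, not part of the original development; variable names are ours) forms the l × l matrix described in the proof sketch, with unit diagonal and off-diagonal entries (y_i · y_j)^{2m} for a few distinct unit vectors, and checks that it becomes diagonally dominant, hence full rank, as m grows.

import numpy as np

rng = np.random.default_rng(0)

# A few distinct unit vectors in R^{d+1}; generically no two are identical or antipodal
d = 2
Y = rng.normal(size=(5, d + 1))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

for m in (1, 5, 20, 80):
    # Matrix representing the linear map in the proof of Theorem 4.1:
    # unit diagonal, off-diagonal entries (y_i . y_j)^{2m}
    G = (Y @ Y.T) ** (2 * m)
    off_diag_sum = np.abs(G - np.diag(np.diag(G))).sum(axis=1) - 1.0
    print(m, "diagonally dominant:", bool(np.all(off_diag_sum < 1.0)),
          "rank:", np.linalg.matrix_rank(G))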
4.1.2 Minimal Norm Interpolation and Least Squares Regularization
Once the linear independence is established in Theorem 4.1, it is an easy corollary to obtain the representer theorems for minimal norm interpolation and least squares regularization. Nothing here is new from the point of view of general RKHS theory, but we include these theorems and proofs for completeness. Again, we focus on scalar-valued (reproducing) kernels and functions instead of vector-valued kernels and regressions. However, Theorem 4.1 sheds important light on the linear independence issue, and interested readers can generalize these compositional representer theorems to the vector-valued case by following Micchelli and Pontil (2005).
The first representer theorem we provide solves the minimal norm interpolation problem: for a fixed set of distinct points {x_i}_{i=1}^n in ∆^d and a set of numbers y = {y_i ∈ R}_{i=1}^n, let I_y^m be the set of functions that interpolate the data,

I_y^m = {f ∈ ⊕_{i=0}^m H_{2i}^Γ : f(x_i) = y_i, 1 ≤ i ≤ n},

and our goal is to find the f_0 with minimal norm, i.e.,

‖f_0‖ = inf{‖f‖ : f ∈ I_y^m}.
Theorem 4.2. Choose m large enough that the reproducing kernels {ω_m(x_i, ·)}_{i=1}^n are linearly independent. Then the unique solution of the minimal norm interpolation problem min{‖f‖ : f ∈ I_y^m} is given by the linear combination of the kernels

f_0(t) = Σ_{i=1}^n c_i ω_m(x_i, t),

where {c_i}_{i=1}^n is the unique solution of the following system of linear equations:

Σ_{j=1}^n ω_m(x_i, x_j) c_j = y_i,   1 ≤ i ≤ n.
Proof  For any other f in I_y^m, define g = f − f_0. By considering the decomposition ‖f‖^2 = ‖g + f_0‖^2 = ‖g‖^2 + 2⟨f_0, g⟩ + ‖f_0‖^2, one can argue that the cross term ⟨f_0, g⟩ = 0. The details can be found in Section A.4. We point out that the linear independence of the reproducing kernels guarantees the existence and uniqueness of f_0.
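In practice, once the Gram matrix of the compositional kernel is available, the coefficients in Theorem 4.2 come from a single linear solve. The following is a minimal sketch (our own illustration; it assumes a routine evaluating ω_m is available elsewhere, which we do not provide here):

import numpy as np

def min_norm_coefficients(K, y):
    # K is the n x n Gram matrix with entries K[i, j] = omega_m(x_i, x_j),
    # assumed invertible because m is chosen large enough (Theorem 4.1);
    # y is the vector of interpolated values.  The interpolant is then
    # f_0(t) = sum_i c[i] * omega_m(x_i, t).
    return np.linalg.solve(K, y)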
The second representer theorem is for a more realistic scenario with ℓ² regularization, which has the following objective:

Σ_{i=1}^n (f(x_i) − y_i)^2 + µ ‖f‖^2.     (19)

The goal is to find the Γ-invariant function f_µ ∈ ⊕_{i=0}^m H_{2i}^Γ that minimizes (19). The solution to this problem is provided by the following representer theorem:
Theorem 4.3. For a set of distinct compositional data points {x_i}_{i=1}^n, choose m large enough that the reproducing kernels {ω_m(x_i, ·)}_{i=1}^n are linearly independent. Then the solution to (19) is given by

f_µ(t) = Σ_{i=1}^n c_i ω_m(x_i, t),

where {c_i}_{i=1}^n is the solution of the following system of linear equations:

µ c_i + Σ_{j=1}^n ω_m(x_i, x_j) c_j = y_i,   1 ≤ i ≤ n.
Proof  The details of this proof can be found in Section A.4, but we point out how the linear independence condition plays a role here. In the middle of the proof we need to show that µ f_µ(t) = Σ_{i=1}^n (y_i − f_µ(x_i)) ω_m(x_i, t), where f_µ(t) = Σ_{i=1}^n c_i ω_m(x_i, t). We use the linear independence established in Theorem 4.1 to show the equivalence between this system of linear equations in {c_i}_{i=1}^n and the one given in the theorem.
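Computationally, Theorem 4.3 again reduces to a single linear solve; compared with the interpolation case, the penalty µ only shifts the diagonal of the Gram matrix. A minimal sketch under the same assumptions as before:

import numpy as np

def regularized_coefficients(K, y, mu):
    # Solve (mu * I + K) c = y, the linear system in Theorem 4.3,
    # where K[i, j] = omega_m(x_i, x_j) and mu > 0 is the regularization parameter.
    n = K.shape[0]
    return np.linalg.solve(mu * np.eye(n) + K, y)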
4.2 Compositional Exponential Family
With the construction of the RKHS in hand, one can produce exponential families using the technique developed in Canu and Smola (2006). Recall that for a function space H with inner product ⟨·, ·⟩ on a general topological space X, whose reproducing kernel is given by k(x, ·), the exponential family density p(x, θ) with parameter θ ∈ H is given by

p(x, θ) = exp{⟨θ(·), k(x, ·)⟩ − g(θ)},   where  g(θ) = log ∫_X exp(⟨θ(·), k(x, ·)⟩) dx.
For compositional data we define the density of the mth degree exponential family as

p_m(x, θ) = exp{⟨θ(·), ω_m(x, ·)⟩ − g(θ)},   x ∈ S^d/Γ,     (20)

where θ ∈ ⊕_{i=0}^m H_{2i}^Γ and g(θ) = log ∫_{S^d/Γ} exp(⟨θ(·), ω_m(x, ·)⟩) dx. Note that this density can be made more explicit by using homogeneous polynomials. Recall that any function in ⊕_{i=0}^m H_{2i}^Γ can be written as a degree m homogeneous polynomial in the squared variables by Lemma 3.9. Thus the density in (20) can be simplified to the following form: for x = (x_1, . . . , x_{d+1}) ∈ S^d_{≥0},

p_m(x, θ) = exp{s_m(x_1^2, x_2^2, . . . , x_{d+1}^2; θ) − g(θ)},     (21)

where s_m is a polynomial in the squared variables x_i^2 with θ as coefficients. Note that s_m is invariant under "sign flippings," and the normalizing constant can be computed via integration over the entire sphere as follows:

g(θ) = log ∫_{S^d/Γ} exp(s_m) dx = log( (1/|Γ|) ∫_{S^d} exp(s_m) dx ).
Figure 3 displays three examples of the compositional exponential distribution. The three densities have the following parameters, respectively:

θ_1 = −2x_1^4 − 2x_2^4 − 3x_3^4 + 9x_1^2 x_2^2 + 9x_1^2 x_3^2 − 2x_2^2 x_3^2,
θ_2 = −x_1^4 − x_2^4 − x_3^4 − x_1^2 x_2^2 − x_1^2 x_3^2 − x_2^2 x_3^2,
θ_3 = −3x_1^4 − 2x_2^4 − x_3^4 + 9x_1^2 x_2^2 − 5x_1^2 x_3^2 − 5x_2^2 x_3^2.

The variety of shapes of the densities in the figure implies that the compositional exponential family can be used to model data with a wide range of locations and correlation structures. Further investigation is needed on the estimation of the parameters of the compositional exponential model (21), which we suggest as a future direction of research. A natural starting point is maximum likelihood estimation, together with a regression-based method such as the one discussed by Beran (1979).
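To make (21) concrete, the following is a minimal Monte Carlo sketch (our own illustration, not the estimation procedure left for future work) of how the normalizing constant g(θ_1) could be approximated for the first example parameter above, under the assumption that Γ acts by coordinate sign flips so that |Γ| = 2^{d+1} = 8 on S^2; all function names are ours.

import numpy as np

rng = np.random.default_rng(1)

def s4_theta1(x):
    # s_4(x_1^2, x_2^2, x_3^2; theta_1) for the first example parameter above
    x1, x2, x3 = x[..., 0], x[..., 1], x[..., 2]
    return (-2 * x1**4 - 2 * x2**4 - 3 * x3**4
            + 9 * x1**2 * x2**2 + 9 * x1**2 * x3**2 - 2 * x2**2 * x3**2)

# Uniform sample on the sphere S^2 (normalized Gaussian vectors)
z = rng.normal(size=(200_000, 3))
z /= np.linalg.norm(z, axis=1, keepdims=True)

# g(theta_1) = log( (1/|Gamma|) * integral_{S^2} exp(s_4) dx ),
# approximated by surface_area * mean(exp(s_4)) / |Gamma|
g_hat = np.log(4.0 * np.pi * np.mean(np.exp(s4_theta1(z))) / 8.0)

# Log-density (21) at the spherical representative sqrt(x) of a compositional point x
x_comp = np.array([0.2, 0.3, 0.5])
log_p = s4_theta1(np.sqrt(x_comp)) - g_hat
print(g_hat, log_p)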
Figure 3: Three example densities from the compositional exponential family: (a) p_4(x, θ_1), (b) p_4(x, θ_2), (c) p_4(x, θ_3). See text for the specification of the parameters θ_1, θ_2, and θ_3.
5 Discussion
A main contribution of this work is the use of projective and spherical geometry to reinterpret compositional domains, which allows us to construct reproducing kernels for compositional data points using spherical harmonics under group actions. With the rapid development of kernel techniques in machine learning, especially the kernel mean embedding philosophy, this work makes it possible to introduce reproducing kernel theory to compositional data analysis.
Consider, for example, the mean estimation problem for compositional data. Under the kernel mean embedding framework surveyed by Muandet et al. (2017), one can focus on the kernel mean E[k(X, ·)] in the function space, rather than a physical mean in the compositional domain. The latter is known to be difficult even to define properly (Paine et al., 2020; Scealy and Welsh, 2011). The kernel mean, on the other hand, enjoys the flexibility and linear structure of the function space. Although inspired by kernel mean embedding techniques, we did not address the computation of the kernel mean E[k(X, ·)] and the cross-covariance operator as replacements for the means and variance-covariance matrices of traditional multivariate analysis. In forthcoming work, the authors will return to kernel means and cross-covariance operators for compositional exponential models by applying deeper functional-analytic techniques.
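As a pointer to what such an analysis could look like, the standard empirical estimate from the kernel mean embedding literature replaces E[ω_m(X, ·)] by the sample average of kernel sections; the sketch below is our own illustration and assumes a callable evaluating ω_m, which this paper does not provide as code.

import numpy as np

def empirical_kernel_mean(omega_m, X, t):
    # Empirical kernel mean evaluated at t: (1/n) * sum_i omega_m(x_i, t),
    # where omega_m(x, t) evaluates the compositional reproducing kernel
    # (assumed available from the construction in Section 3) and X holds the data.
    return np.mean([omega_m(x, t) for x in X])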
Although we construct only scalar-valued reproducing kernels for compositional data, this does not mean that "higher tensors" are outside our consideration. In fact, higher tensor-valued reproducing kernels appear in kernel techniques with applications to manifold regularization (Minh and Sindhwani, 2011) and shape analysis (Micheli and Glaunés, 2014). These approaches to higher tensor-valued reproducing kernels indicate further possibilities for regression frameworks between exotic spaces, f : X → Y, with both the source X and the target Y nonlinear in nature. This extends the intuition of multivariate analysis to nonlinear contexts, and compositional domains (traditionally modeled by an "Aitchison simplex") form an interesting class of examples that can be treated nonlinearly.
A Supplementary Proofs

A.1 Proof of Central Limit Theorems on Integral Squared Errors (ISE) in Section 2.2
Assumption A.1. For all kernel density estimators and bandwidth parameters in this paper, we assume the following:

H1 The kernel function K : [0, ∞) → [0, ∞) is continuous and such that both λ_d(K) and λ_d(K^2) are bounded for d ≥ 1, where λ_d(K) = 2^{d/2−1} vol(S^{d−1}) ∫_0^∞ K(r) r^{d/2−1} dr.

H2 If a function f on S^d ⊂ R^{d+1} is extended to all of R^{d+1} \ {0} via f(x) = f(x/‖x‖), then the extended function must have its first three derivatives bounded.

H3 The bandwidth parameter satisfies h_n → 0 and n h_n^d → ∞ as n → ∞.
Let f be the function extended from S^d to R^{d+1} \ {0} via f(x/‖x‖), and let

φ(f, x) = −x^T ∇f(x) + d^{−1} (∇^2 f(x) − x^T (H_x f) x),

where H_x f is the Hessian matrix of f at the point x and ∇^2 f(x) = tr[H_x f(x)] is the Laplacian. The term b_d(K) in the statement of Theorem 2.6 is defined as

b_d(K) = ∫_0^∞ K(r) r^{d/2} dr / ∫_0^∞ K(r) r^{d/2−1} dr,

and the term φ(h_n) in the statement of Theorem 2.6 is defined as

φ(h_n) = (4 b_d(K)^2 / d^2) σ_x^2 h_n^4.
Proof of Theorem 2.6:

Proof  The strategy of Zhao and Wu (2001) in the directional setup follows Hall (1984), whose key idea is to give asymptotic bounds for degenerate U-statistics so that martingale theory can be used to derive the central limit theorem. The step where the finite support condition was used in Zhao and Wu (2001) is in proving the asymptotic bound E(G_n^2(X_1, X_2)) = O(h^{7d}), where G_n(x, y) = E[H_n(X, x) H_n(X, y)] with

H_n(x, y) = ∫_{S^d} K_n(z, x) K_n(z, y) dz

and the centered kernel K_n(x, y) = K[(1 − x^T y)/h^2] − E{K[(1 − x^T X)/h^2]}. In that proof, they needed to show that the term
T_1 = ∫_{S^d} f(x) dx ∫_{S^d} f(y) dy [ ∫_{S^d} f(z) ( ∫_{S^d} K[(1 − u^T x)/h^2] K[(1 − u^T z)/h^2] du ) ( ∫_{S^d} K[(1 − u^T y)/h^2] K[(1 − u^T z)/h^2] du ) dz ]^2

satisfies T_1 = O(h^{7d}). In this step, in order to give an upper bound for T_1, the finite support condition was used substantially.
The idea for avoiding this assumption is based on an observation in García-Portugués et al. (2015), which concerns only the directional-linear CLT, so their result cannot be used directly in the purely directional case. Based on the method in Lemma 10 of García-Portugués et al. (2015), one can deduce the following asymptotic equivalence:

∫_{S^d} K^j((1 − x^T y)/h^2) φ_i(y) dy ∼ h^d λ_d(K^j) φ_i(x),

where λ_d(K^j) = 2^{d/2−1} vol(S^{d−1}) ∫_0^∞ K^j(r) r^{d/2−1} dr. As a special case we have

∫_{S^d} K^2((1 − x^T y)/h^2) dy ∼ h^d λ_d(K^2) C,  with C a positive constant.
Now we proceed with the proof without the finite support condition:

T_1 = ∫_{S^d} f(x) dx ∫_{S^d} f(y) dy [ ∫_{S^d} f(z) ( ∫_{S^d} K[(1 − u^T x)/h^2] K[(1 − u^T z)/h^2] du ) ( ∫_{S^d} K[(1 − u^T y)/h^2] K[(1 − u^T z)/h^2] du ) dz ]^2
    ∼ ∫_{S^d} f(x) dx ∫_{S^d} f(y) dy [ ∫_{S^d} f(z) λ_d(K) h^d K((1 − x^T z)/h^2) × λ_d(K) h^d K((1 − y^T z)/h^2) dz ]^2
    ∼ λ_d(K)^4 h^{4d} ∫_{S^d} f(x) dx ∫_{S^d} f(y) [ λ_d(K) h^d K((1 − x^T y)/h^2) f(y) ]^2 dy
    = λ_d(K)^6 h^{6d} ∫_{S^d} ∫_{S^d} K^2((1 − x^T y)/h^2) f^3(y) dy f(x) dx
    ∼ λ_d(K)^6 h^{6d} · λ_d(K^2) h^d C ∫_{S^d} f^3(x) f(x) dx
    = C λ_d(K)^6 λ_d(K^2) h^{7d} ∫_{S^d} f^4(x) dx = O(h^{7d}).

Thus we have proved T_1 = O(h^{7d}) without the finite support assumption, and the rest of the proof follows as in Zhao and Wu (2001).
Observe the identity

∫_{S^d_{≥0}} (p̂_n − p)^2 dx = |Γ| ∫_{S^d} (f̂_n − p̃)^2 dy;     (22)

the CLT of the compositional ISE then follows from identity (22) and our proof of the CLT for the spherical ISE without finite support conditions on the kernel.
A.2 Proofs of Shadow Monomials in Section 3
Proof of Proposition 3.9:

Proof  A direct computation, splitting the sum over the sign s_k of a coordinate k whose exponent α_k is odd, yields:

(Π_{i=1}^{d+1} x_i^{α_i})^Γ = (1/|Γ|) Σ_{s_i ∈ {±1}} Π_{i=1}^{d+1} (s_i x_i)^{α_i}
    = (1/|Γ|) [ Σ_{s_i ∈ {±1}, i ≠ k} Π_{i ≠ k} (s_i x_i)^{α_i} x_k^{α_k} + Σ_{s_i ∈ {±1}, i ≠ k} Π_{i ≠ k} (s_i x_i)^{α_i} (−x_k)^{α_k} ]
    = x_k^{α_k} (1/|Γ|) Σ_{s_i ∈ {±1}, i ≠ k} Π_{i ≠ k} (s_i x_i)^{α_i} − x_k^{α_k} (1/|Γ|) Σ_{s_i ∈ {±1}, i ≠ k} Π_{i ≠ k} (s_i x_i)^{α_i}
    = 0,

since (−x_k)^{α_k} = −x_k^{α_k} for odd α_k.
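As an illustrative check of this cancellation (a sketch we add here, not part of the original text), one can symbolically average a monomial over all sign flips and verify that it vanishes whenever some exponent is odd:

import itertools
import sympy as sp

x1, x2, x3 = sp.symbols("x1 x2 x3")

def gamma_average(exponents):
    # Average of (s1*x1)^a1 * (s2*x2)^a2 * (s3*x3)^a3 over all sign flips s_i in {+1, -1}
    variables = (x1, x2, x3)
    total = 0
    for signs in itertools.product((1, -1), repeat=3):
        term = 1
        for s, v, a in zip(signs, variables, exponents):
            term *= (s * v) ** a
        total += term
    return sp.expand(total / 2**3)

print(gamma_average((2, 1, 2)))  # one odd exponent: the average is 0
print(gamma_average((2, 4, 2)))  # all even: survives as x1**2*x2**4*x3**2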
A.3 Proof of Linear Independence of Reproducing Kernels in Theorem 4.1
We sketch a slightly more detailed (though not complete) proof:

Proof  This is the most technical lemma in this article. We sketch the philosophy of the proof here, which can be understood intuitively in topological terms.
Recall that the projective space P^d is obtained by identifying every pair of antipodal points of the sphere S^d (identifying x with −x); in other words, P^d = S^d/Z_2, where Z_2 = {0, 1} is the cyclic group of order 2. We can then define a projective kernel in H_i ⊂ L^2(S^d) by k_i^p(x, ·) = [k_i(x, ·) + k_i(−x, ·)]/2, and denote the projective kernel inside ⊕_{i=0}^m H_{2i} by k_m^p(x, ·) = Σ_{i=0}^m k_{2i}^p(x, ·).
Now we spread out the data set {x_i}_{i=1}^n by the "spread-out" construction in Section 2.1 and denote the spread-out data set by {Γ · x_i}_{i=1}^n = {c_∆^{−1}(x_i)}_{i=1}^n (a data set, not a set, because of repetitions). A compositional reproducing kernel is a sum of spherical reproducing kernels over c_∆^{−1}(x_i), divided by the number of elements in c_∆^{−1}(x_i). The data set c_∆^{−1}(x_i) has antipodal symmetry, so a compositional kernel is a linear combination of projective kernels. Notice that different fake kernels are linear combinations of different projective kernels. It therefore suffices to show the linear independence of projective kernels for distinct data points and large enough m, which implies the linear independence of the fake kernels {k_m^Γ(x_i, ·)}_{i=1}^n.
We now focus on the linear independence of projective kernels. A projective kernel can be seen as a reproducing kernel for a point in P^d. For a set of distinct points {y_i}_{i=1}^l ⊂ P^d, we show that the corresponding set of projective kernels {k_m^p(y_i, ·)}_{i=1}^l ⊂ ⊕_{i=0}^m H_{2i} is linearly independent for large enough m.

Consider the two vector subspaces V_1^m = span{(y_i · t)^{2m}}_{i=1}^l and V_2^m = span{k_m^p(y_i, t)}_{i=1}^l, both of which lie inside ⊕_{i=0}^m H_{2i} ⊂ L^2(S^d). Define a linear map h_m : V_1^m → V_2^m by setting

h_m((y_i · t)^{2m}) = Σ_{j=1}^l ⟨(y_i · t)^{2m}, k_m^p(y_j, t)⟩ k_m^p(y_j, t).

This linear map h_m is represented by an l × l symmetric matrix whose diagonal elements are 1's and whose off-diagonal elements are [(y_i · y_j)]^{2m}. Notice that y_i ≠ y_j in P^d, which means they are neither identical nor antipodal in S^d, so |y_i · y_j| < 1. When m is large enough, all off-diagonal elements go to zero while the diagonal elements stay constant, so the matrix representing h_m becomes diagonally dominant, hence full rank. When h_m has full rank, the spanning sets {(y_i · t)^{2m}}_{i=1}^l and {k_m^p(y_i, t)}_{i=1}^l must be bases of V_1^m and V_2^m, respectively, and therefore the set of projective kernels {k_m^p(y_i, t)}_{i=1}^l is linearly independent when m is large enough.
A.4 Proof of Representer Theorems in Section 4.1.2
Proof of Theorem 4.2, on minimal norm interpolation:

Proof  Note that the set I_y^m = {f ∈ ⊕_{i=0}^m H_{2i}^Γ : f(x_i) = y_i} is non-empty, because the f_0 defined by the linear system of equations lies in I_y^m. Let f be any other element of I_y^m and define g = f − f_0; then we have

‖f‖^2 = ‖g + f_0‖^2 = ‖g‖^2 + 2⟨f_0, g⟩ + ‖f_0‖^2.

Since g ∈ ⊕_{i=0}^m H_{2i}^Γ and g(x_i) = 0 for 1 ≤ i ≤ n, we have

⟨f_0, g⟩ = ⟨Σ_{i=1}^n c_i ω_m(x_i, ·), g(·)⟩ = Σ_{i=1}^n c_i ⟨ω_m(x_i, ·), g(·)⟩ = Σ_{i=1}^n c_i g(x_i) = 0.

Thus ‖f‖^2 = ‖g + f_0‖^2 = ‖g‖^2 + ‖f_0‖^2, which implies that f_0 is the solution to the minimal norm interpolation problem.
Proof of Theorem 4.3, on the regularization problem:

Proof  First define the loss functional E(f) = Σ_{i=1}^n |f(x_i) − y_i|^2 + µ ‖f‖^2. For any Γ-invariant function f = f^Γ ∈ ⊕_{i=0}^m H_{2i}, let g = f − f_µ; a simple computation then yields

E(f) = E(f_µ) + Σ_{i=1}^n |g(x_i)|^2 − 2 Σ_{i=1}^n (y_i − f_µ(x_i)) g(x_i) + 2µ⟨f_µ, g⟩ + µ‖g‖^2.

We want to show that Σ_{i=1}^n (y_i − f_µ(x_i)) g(x_i) = µ⟨f_µ, g⟩, and an equivalent way of writing this equality is

Σ_{i=1}^n ⟨(y_i − f_µ(x_i)) ω_m(x_i, t), g(t)⟩ = µ⟨f_µ, g⟩.

We now claim that µ f_µ(t) = Σ_{i=1}^n (y_i − f_µ(x_i)) ω_m(x_i, t), which implies the above equality. To prove this claim, plug the linear combination f_µ(t) = Σ_{i=1}^n c_i ω_m(x_i, t) into the claim; this gives a system of linear equations in {c_i}_{i=1}^n, so the proof of the claim reduces to checking that system.

Note that {ω_m(x_i, t)}_{i=1}^n is a linearly independent set, so one can check that the system of linear equations in {c_i}_{i=1}^n produced by the claim holds if and only if {c_i}_{i=1}^n satisfies µ c_k + Σ_{i=1}^n c_i ω_m(x_i, x_k) = y_k for every k with 1 ≤ k ≤ n, which is exactly the condition of the theorem. The equivalence of these two systems of linear equations is given by the linear independence of the set {ω_m(x_i, t)}_{i=1}^n. Therefore we conclude that the claim µ f_µ(t) = Σ_{i=1}^n (y_i − f_µ(x_i)) ω_m(x_i, t) is true.
To finish the proof of the theorem, notice that

E(f) = E(f_µ) + Σ_{i=1}^n |g(x_i)|^2 − 2 Σ_{i=1}^n (y_i − f_µ(x_i)) g(x_i) + 2µ⟨f_µ, g⟩ + µ‖g‖^2
     = E(f_µ) + Σ_{i=1}^n |g(x_i)|^2 + µ‖g‖^2 + 2 [ µ⟨f_µ, g⟩ − Σ_{i=1}^n (y_i − f_µ(x_i)) g(x_i) ]
     = E(f_µ) + Σ_{i=1}^n |g(x_i)|^2 + µ‖g‖^2 + 2 [ µ⟨f_µ, g⟩ − Σ_{i=1}^n ⟨(y_i − f_µ(x_i)) ω_m(x_i, t), g(t)⟩ ]
     = E(f_µ) + Σ_{i=1}^n |g(x_i)|^2 + µ‖g‖^2 + 2 ⟨ µ f_µ(t) − Σ_{i=1}^n (y_i − f_µ(x_i)) ω_m(x_i, t), g(t) ⟩   (the last inner product is 0 by the claim)
     = E(f_µ) + Σ_{i=1}^n |g(x_i)|^2 + µ‖g‖^2.

The term Σ_{i=1}^n |g(x_i)|^2 + µ‖g‖^2 in the above equality is always non-negative, so E(f_µ) ≤ E(f), and the theorem follows.
References
Aitchison, J. (1982), “The Statistical Analysis of Compositional Data,” Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 44, 139–177.
— (1994), “Principles of compositional data analysis,” Lecture Notes-Monograph Series, 73–
81.
Arnold, R., Jupp, P. E., and Schaeben, H. (2018), “Statistics of ambiguous rotations,”
Journal of Multivariate Analysis, 165, 73–85.
Aronszajn, N. (1950), “Theory of Reproducing Kernels,” Transactions of the American
Mathematical Society, 68, 337–404.
Atkinson, K. and Han, W. (2012), Spherical Harmonics and Approximations on the Unit
Sphere: An Introduction, Springer.
Bai, Z., Rao, C. R., and Zhao, L. (1989), “Kernel estimators of density function of directional
data,” Journal of Multivariate Analysis, 27, 24–39.
Ball, J. A., Marx, G., and Vinnikov, V. (2016), “Noncommutative reproducing kernel Hilbert
spaces,” Journal of Functional Analysis, 271, 1844–1920.
Beardon, A. (2012), The Geometry of Discrete Groups, vol. 91, Springer Science & Business
Media.
Beran, R. (1979), “Exponential models for directional data,” Annals of Statistics, 7, 1162–
1178.
Calle, M. L. (2019), “Statistical analysis of metagenomics data,” Genomics & Informatics,
17.
Canu, S. and Smola, A. (2006), “Kernel methods and the exponential family,” Neurocomputing, 69, 714–720.
Chwialkowski, K., Strathmann, H., and Gretton, A. (2016), “A kernel test of goodness of
fit,” International Conference on Machine Learning, 2606–2615.
Dai, B., Dai, H., Gretton, A., Song, L., Schuurmans, D., and He, N. (2019), “Kernel exponential family estimation via doubly dual embedding,” in Proceedings of Machine Learning
Research, pp. 2321–2330.
de los Campos, G., Gianola, D., and Rosa, G. (2009), “Reproducing kernel Hilbert spaces
regression: a general framework for genetic evaluation,” Journal of Animal Science, 87,
1883–1887.
Fan, Z., Sun, Y., and Wang, Z. (2021), “Principal components in linear mixed models with
general bulk,” Annals of Statistics, 49, 1489–1513.
Fukumizu, K., Bach, F., and Jordan, M. (2009), “Kernel Dimension Reduction in Regression,” Annals of Statistics, 37, 1871–1905.
García-Portugués, E., Crujeiras, R. M., and González-Manteiga, W. (2015), “Central limit
theorems for directional and linear random variables with applications,” Statistica Sinica,
25, 1207–1229.
Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V., and Egozcue, J. J. (2017), “Microbiome datasets are compositional: and this is not optional,” Frontiers in microbiology, 8,
2224.
Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., and Pontil, M.
(2012), “Conditional mean embeddings as regressors,” Proceedings of the 29th International Conference on International Conference on Machine Learning, 1803–1810.
Hall, P. (1984), “Central limit theorem for integrated square error of multivariate nonparametric density estimators,” Journal of Multivariate Analysis, 14, 1–16.
Hall, P., Watson, G. S., and Cabrera, J. (1987), “Kernel density estimation with spherical
data,” Biometrika, 74, 751–762.
Jorgensen, P. and Tian, F. (2016), “Graph Laplacians and discrete reproducing kernel Hilbert
spaces from restrictions,” Stochastic Analysis and Applications, 34, 722–747.
Jupp, P. E. (2008), “Data-driven Sobolev tests of uniformity on compact Riemannian manifolds,” Annals of Statistics, 36, 1246–1260.
Jupp, P. E. and Spurr, B. D. (1985), “Sobolev tests for independence of directions,” Annals
of Statistics, 13, 1140–1155.
Li, G., Li, Y., and Chen, K. (2020), “It’s All Relative: New Regression Paradigm for Microbiome Compositional Data,” arXiv preprint arXiv:2011.05951.
Lubbe, S., Filzmoser, P., and Templ, M. (2021), “Comparison of zero replacement strategies for compositional data with large numbers of zeros,” Chemometrics and Intelligent
Laboratory Systems, 210, 104248.
Micchelli, C. A. and Pontil, M. (2005), “On Learning Vector-Valued Functions,” Neural
Computation, 17, 177–204.
Micheli, M. and Glaunés, J. A. (2014), “Matrix-valued Kernels for Shape Deformation Analysis,” Geometry, Imaging and Computing, 1(1), 57–139.
Minh, H. Q. and Sindhwani, V. (2011), “Vector-valued Manifold Regularization,” in Proceedings of the 28th International Conference on International Conference on Machine
Learning, pp. 57–64.
Muandet, K., Fukumizu, K., Sriperumbudur, B., and Schölkopf, B. (2017), “Kernel Mean
Embedding of Distributions: A Review and Beyond,” Foundations and Trends in Machine
Learning, 10.
Paine, P. J., Preston, S., Tsagris, M., and Wood, A. (2020), “Spherical regression models
with general covariates and anisotropic errors,” Statistics and Computing, 30, 153–165.
Rasmussen, C. L., Palarea-Albaladejo, J., Johansson, M. S., Crowley, P., Stevens, M. L.,
Gupta, N., Karstad, K., and Holtermann, A. (2020), “Zero problems with compositional
data of physical behaviors: a comparison of three zero replacement methods,” International Journal of Behavioral Nutrition and Physical Activity, 17, 1–10.
Reisert, M. and Burkhardt, H. (2007), “Learning Equivariant Functions with Matrix Valued
Kernels,” Journal of Machine Learning Research, 8, 385–408.
Sansone, G. (1959), Orthogonal Functions, Dover Books on Mathematics Series, Dover.
Scealy, J. L. and Welsh, A. H. (2011), “Regression for compositional data by using distributions defined on the hypersphere,” Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 73, 351–375.
Smola, A., Gretton, A., Song, L., and Schölkopf, B. (2007), “A Hilbert space embedding for
distributions,” in International Conference on Algorithmic Learning Theory, Springer, pp.
13–31.
Stein, E. M. and Weiss, G. (1971), Introduction to Fourier Analysis on Euclidean Spaces,
Princeton Mathematical Series, Princeton University Press.
Wahba, G. (1981), “Spline interpolation and smoothing on the sphere,” SIAM Journal on
Scientific and Statistical Computing, 2, 5–16.
Zhao, L. and Wu, C. (2001), “Central limit theorem for integrated square error of kernel
estimators of spherical density,” Science in China Series A: Mathematics, 44, 474–483.
Zhou, F. and Johnstone, I. M. (2019), “Eigenvalue distributions of variance components
estimators in high-dimensional random effects models,” Annals of Statistics, 47, 2855–
2886.
Zhou, H., Li, L., and Zhu, H. (2013), “Tensor Regression with applications in neuroimaging
data analysis,” Journal of the American Statistical Association, 108, 540–552.