The Annals of Statistics 2009, Vol. 37, No. 4, 1871–1905. DOI: 10.1214/08-AOS637. © Institute of Mathematical Statistics, 2009. arXiv:0908.1854v1 [math.ST] 13 Aug 2009.

KERNEL DIMENSION REDUCTION IN REGRESSION¹

By Kenji Fukumizu, Francis R. Bach and Michael I. Jordan

Institute of Statistical Mathematics, INRIA-Ecole Normale Supérieure and University of California

We present a new methodology for sufficient dimension reduction (SDR). Our methodology derives directly from the formulation of SDR in terms of the conditional independence of the covariate $X$ from the response $Y$, given the projection of $X$ on the central subspace [cf. J. Amer. Statist. Assoc. 86 (1991) 316–342 and Regression Graphics (1998) Wiley]. We show that this conditional independence assertion can be characterized in terms of conditional covariance operators on reproducing kernel Hilbert spaces and we show how this characterization leads to an M-estimator for the central subspace. The resulting estimator is shown to be consistent under weak conditions; in particular, we do not have to impose linearity or ellipticity conditions of the kinds that are generally invoked for SDR methods. We also present empirical results showing that the new methodology is competitive in practice.

Received October 2006; revised July 2008.
¹ Supported by JSPS KAKENHI 15700241, a grant from the Inamori Foundation, a Scientific grant from the Mitsubishi Foundation and NSF Grant DMS-05-09559.
AMS 2000 subject classifications. Primary 62H99; secondary 62J02.
Key words and phrases. Dimension reduction, regression, positive definite kernel, reproducing kernel, consistency.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2009, Vol. 37, No. 4, 1871–1905. This reprint differs from the original in pagination and typographic detail.

1. Introduction. The problem of sufficient dimension reduction (SDR) for regression is that of finding a subspace $S$ such that the projection of the covariate vector $X$ onto $S$ captures the statistical dependency of the response $Y$ on $X$. More formally, let us characterize a dimension-reduction subspace $S$ in terms of the following conditional independence assertion:

(1) $Y \perp\!\!\!\perp X \mid \Pi_S X,$

where $\Pi_S X$ denotes the orthogonal projection of $X$ onto $S$. It is possible to show that under weak conditions the intersection of dimension-reduction subspaces is itself a dimension-reduction subspace, in which case the intersection is referred to as a central subspace [5, 6]. As suggested in a seminal paper by Li [23], it is of great interest to develop procedures for estimating this subspace, quite apart from any interest in the conditional distribution $P(Y|X)$ or the conditional mean $E(Y|X)$. Once the central subspace is identified, subsequent analysis can attempt to infer a conditional distribution or a regression function using the (low-dimensional) coordinates $\Pi_S X$.

The line of research on SDR initiated by Li is to be distinguished from the large and heterogeneous collection of methods for dimension reduction in regression in which specific modeling assumptions are imposed on the conditional distribution $P(Y|X)$ or the regression $E(Y|X)$. These methods include ordinary least squares, partial least squares, canonical correlation analysis, ACE [4], projection pursuit regression [12], neural networks and LASSO [29]. These methods can be effective if the modeling assumptions that they embody are met, but if these assumptions do not hold there is no guarantee of finding the central subspace.

Li's paper not only provided a formulation of SDR as a semiparametric inference problem (with subsequent contributions by Cook and others bringing it to its elegant expression in terms of conditional independence) but also suggested a specific inferential methodology that has had significant influence on the ensuing literature.
Specifically, Li suggested approaching the SDR problem as an inverse regression problem. Roughly speaking, the idea is that if the conditional distribution $P(Y|X)$ varies solely along a subspace of the covariate space, then the inverse regression $E(X|Y)$ should lie in that same subspace. Moreover, it should be easier to regress $X$ on $Y$ than vice versa, given that $Y$ is generally low-dimensional (indeed, one-dimensional in the majority of applications) while $X$ is high-dimensional. Li [23] proposed a particularly simple instantiation of this idea, known as sliced inverse regression (SIR), in which $E(X|Y)$ is estimated as a constant vector within each slice of the response variable $Y$, and principal component analysis is used to aggregate these constant vectors into an estimate of the central subspace.

The past decade has seen a number of further developments in this vein. Some focus on finding a central subspace, for example, [9, 10], while others aim at finding a central mean subspace, which is a subspace of the central subspace that is effective only for the regression $E[Y|X]$. The latter include principal Hessian directions (pHd, [24]) and contour regression [22]. A particular focus of these more recent developments has been the exploitation of second moments within an inverse regression framework.

While the inverse regression perspective has been quite useful, it is not without its drawbacks. In particular, performing a regression of $X$ on $Y$ generally requires making assumptions with respect to the probability distribution of $X$, assumptions that can be difficult to justify. Specifically, most of the inverse regression methods assume linearity of the conditional mean of the covariate along the central subspace (or make a related assumption for the conditional covariance). These assumptions hold in particular if the distribution of $X$ is elliptic.
In practice, however, we do not necessarily expect that the covariate vector will follow an elliptic distribution, nor is it easy to assess departures from ellipticity in a high-dimensional setting. In general, it seems unfortunate to have to impose probabilistic assumptions on $X$ in the setting of a regression methodology.

Many inverse regression methods also exhibit additional limitations depending on the specific nature of the response variable $Y$. In particular, pHd and contour regression are applicable only to a one-dimensional response. Also, if the response variable takes its values in a finite set of $p$ elements, SIR yields a subspace of dimension at most $p - 1$; thus, for the important problem of binary classification SIR yields only a one-dimensional subspace. Finally, in the binary classification setting, if the covariance matrices of the two classes are the same, SAVE and pHd also provide only a one-dimensional subspace [7]. The general problem in these cases is that the estimated subspace is smaller than the central subspace. One approach to tackling these limitations is to incorporate higher-order moments of $Y|X$ [34], but in practice the gains achievable by the use of higher-order moments are limited by robustness issues.

In this paper, we present a new methodology for SDR that is rather different from the approaches considered in the literature discussed above. Rather than focusing on a limited set of moments within an inverse regression framework, we focus instead on the criterion of conditional independence in terms of which the SDR problem is defined. We develop a contrast function for evaluating subspaces that is minimized precisely when the conditional independence assertion in (1) is realized. As befits a criterion that measures departure from conditional independence, our contrast function is not based solely on low-order moments.
Our approach involves the use of conditional covariance operators on reproducing kernel Hilbert spaces (RKHSs). Our use of RKHSs is related to their use in nonparametric regression and classification; in particular, the RKHSs given by some positive definite kernels are Hilbert spaces of smooth functions that are "small" enough to yield computationally tractable procedures, but are rich enough to capture nonparametric phenomena of interest [32], and this computational focus is an important aspect of our work. On the other hand, whereas in nonparametric regression and classification the role of RKHSs is to provide basis expansions of regression functions and discriminant functions, in our case the RKHS plays a different role. Our interest is not in the functions in the RKHS per se, but rather in conditional covariance operators defined on the RKHS. We show that these operators can be used to measure departures from conditional independence. We also show that these operators can be estimated from data and that these estimates are functions of Gram matrices. Thus, our approach, which we refer to as kernel dimension reduction (KDR), involves computing Gram matrices from data and optimizing a particular functional of these Gram matrices to yield an estimate of the central subspace. This approach makes no strong assumptions on either the conditional distribution $p_{Y|\Pi_S X}(y \mid \Pi_S x)$ or the marginal distribution $p_X(x)$. As we show, KDR is consistent as an estimator of the central subspace under weak conditions.

There are alternatives to the inverse regression approach in the literature that have some similarities to KDR. In particular, minimum average variance estimation (MAVE, [33]) is based on nonparametric estimation of the conditional covariance of $Y$ given $X$, an idea related to KDR. This method explicitly estimates the regressor, however, assuming an additive noise model $Y = f(X) + Z$, where $Z$ is independent of $X$.
While the purpose of MAVE is to find a central mean subspace, KDR tries to find a central subspace, and does not need to estimate the regressor explicitly. Other related approaches include methods that estimate the derivative of the regression function; these are based on the fact that the derivative of the conditional expectation $g(x) = E[y \mid B^T x]$ with respect to $x$ belongs to a dimension reduction subspace [18, 27]. The purpose of these methods is again to extract a central mean subspace; this differs from the central subspace which is the focus of KDR. The difference is clear, for example, if we consider the situation in which a direction $b$ in a central subspace satisfies $E[g'(b^T X)] = 0$, a condition that occurs if $g$ and the distribution of $X$ exhibit certain symmetries. Such a direction cannot be found by methods based on the derivative.

There has also been some recent work on nonparametric methods for estimation of central subspaces. One such method estimates the central subspace based on an expected log likelihood [35]. This requires, however, an estimate of the joint probability density, and is limited to single-index regression. Finally, Zhu and Zeng [36] have proposed a method for estimating the central subspace based on the Fourier transform. This method is similar to the KDR method in its use of Hilbert space methods and in its use of a contrast function that can characterize conditional independence under weak assumptions. It differs from KDR, however, in that it requires an estimate of the derivative of the marginal density of the covariate $X$; in practice this requires assuming a parametric model for the covariate $X$. In general, we are aware of no practical method that attacks SDR directly by using nonparametric methodology to assess departures from conditional independence.

We presented an earlier kernel dimension reduction method in [14].
The contrast function presented in that paper, however, was not derived as an estimator of a conditional covariance operator, and it was not possible to establish a consistency result for that approach. The contrast function that we present here is derived directly from the conditional covariance perspective; moreover, it is simpler than the earlier estimator and it is possible to establish consistency for the new formulation. We should note, however, that the empirical performance of the earlier KDR method was shown by Fukumizu, Bach and Jordan [14] to yield a significant improvement on SIR and pHd in the case of nonelliptic data, and these empirical results motivated us to pursue the general approach further.

While KDR has advantages over other SDR methods because of its generality and its directness in capturing the semiparametric nature of the SDR problem, it also rests on a more complex mathematical framework that presents new theoretical challenges. Thus, while consistency for SIR and related methods follows from a straightforward appeal to the central limit theorem (under ellipticity assumptions), more effort is required to study the statistical behavior of KDR theoretically. This effort is of some general value, however; in particular, to establish the consistency of KDR we prove the uniform $O(n^{-1/2})$ convergence of an empirical process that takes values in a reproducing kernel Hilbert space. This result, which accords with the order of uniform convergence of an ordinary real-valued empirical process, may be of independent theoretical interest.

It should be noted at the outset that we do not attempt to provide distribution theory for KDR in this paper, and in particular we do not address the problem of inferring the dimensionality of the central subspace.

The paper is organized as follows.
In Section 2 we show how conditional independence can be characterized by cross-covariance operators on an RKHS and use this characterization to derive the KDR method. Section 3 presents numerical examples of the KDR method. We present a consistency theorem and its proof in Section 4. Section 5 provides concluding remarks. Some of the details in the proof of consistency are provided in the Appendix.

2. Kernel dimension reduction for regression. The method of kernel dimension reduction is based on a characterization of conditional independence using operators on RKHSs. We present this characterization in Section 2.1 and show how it yields a population criterion for SDR in Section 2.2. This population criterion is then turned into a finite-sample estimation procedure in Section 2.3.

In this paper, a Hilbert space means a separable Hilbert space, and an operator always means a linear operator. The operator norm of a bounded operator $T$ is denoted by $\|T\|$. The null space and the range of an operator $T$ are denoted by $N(T)$ and $R(T)$, respectively.

2.1. Characterization of conditional independence. Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ and $(\mathcal{Y}, \mathcal{B}_{\mathcal{Y}})$ denote measurable spaces. When the base space is a topological space, the Borel σ-field is always assumed. Let $(\mathcal{H}_X, k_X)$ and $(\mathcal{H}_Y, k_Y)$ be RKHSs of functions on $\mathcal{X}$ and $\mathcal{Y}$, respectively, with measurable positive definite kernels $k_X$ and $k_Y$ [1]. We consider a random vector $(X, Y): \Omega \to \mathcal{X} \times \mathcal{Y}$ with the law $P_{XY}$. The marginal distributions of $X$ and $Y$ are denoted by $P_X$ and $P_Y$, respectively. It is always assumed that the positive definite kernels satisfy

(2) $E_X[k_X(X, X)] < \infty$ and $E_Y[k_Y(Y, Y)] < \infty.$

Note that any bounded kernel satisfies this assumption.
Also, under this assumption, $\mathcal{H}_X$ and $\mathcal{H}_Y$ are included in $L^2(P_X)$ and $L^2(P_Y)$, respectively, where $L^2(\mu)$ denotes the Hilbert space of square integrable functions with respect to the measure $\mu$, and the inclusions $J_X: \mathcal{H}_X \to L^2(P_X)$ and $J_Y: \mathcal{H}_Y \to L^2(P_Y)$ are continuous, because $E_X[f(X)^2] = E_X[\langle f, k_X(\cdot, X)\rangle^2_{\mathcal{H}_X}] \le \|f\|^2_{\mathcal{H}_X} E_X[k_X(X, X)]$ for $f \in \mathcal{H}_X$.

The cross-covariance operator of $(X, Y)$ is an operator $\Sigma_{YX}$ from $\mathcal{H}_X$ to $\mathcal{H}_Y$ so that

(3) $\langle g, \Sigma_{YX} f\rangle_{\mathcal{H}_Y} = E_{XY}[(f(X) - E_X[f(X)])(g(Y) - E_Y[g(Y)])]$

holds for all $f \in \mathcal{H}_X$ and $g \in \mathcal{H}_Y$ [3, 14]. Obviously, $\Sigma_{YX} = \Sigma_{XY}^*$, where $T^*$ denotes the adjoint of an operator $T$. If $Y$ is equal to $X$, the positive self-adjoint operator $\Sigma_{XX}$ is called the covariance operator.

For a random variable $X: \Omega \to \mathcal{X}$, the mean element $m_X \in \mathcal{H}_X$ is defined by the element that satisfies

(4) $\langle f, m_X\rangle_{\mathcal{H}_X} = E_X[f(X)]$

for all $f \in \mathcal{H}_X$; that is, $m_X = J_X^* 1$, where 1 is the constant function. The explicit function form of $m_X$ is given by $m_X(u) = \langle m_X, k(\cdot, u)\rangle_{\mathcal{H}_X} = E[k(X, u)]$. Using the mean elements, (3), which characterizes $\Sigma_{YX}$, can be written as

$\langle g, \Sigma_{YX} f\rangle_{\mathcal{H}_Y} = E_{XY}[\langle f, k_X(\cdot, X) - m_X\rangle_{\mathcal{H}_X} \langle k_Y(\cdot, Y) - m_Y, g\rangle_{\mathcal{H}_Y}].$

Let $Q_X$ and $Q_Y$ be the orthogonal projections which map $\mathcal{H}_X$ onto $\overline{R(\Sigma_{XX})}$ and $\mathcal{H}_Y$ onto $\overline{R(\Sigma_{YY})}$, respectively. It is known ([3], Theorem 1) that $\Sigma_{YX}$ has a representation of the form

(5) $\Sigma_{YX} = \Sigma_{YY}^{1/2} V_{YX} \Sigma_{XX}^{1/2},$

where $V_{YX}: \mathcal{H}_X \to \mathcal{H}_Y$ is a unique bounded operator such that $\|V_{YX}\| \le 1$ and $V_{YX} = Q_Y V_{YX} Q_X$.

A cross-covariance operator on an RKHS can be represented explicitly as an integral operator. For arbitrary $\varphi \in L^2(P_X)$ and $y \in \mathcal{Y}$, the integral

(6) $G_\varphi(y) = \int_{\mathcal{X} \times \mathcal{Y}} k_Y(y, \tilde{y})(\varphi(\tilde{x}) - E_X[\varphi(X)])\, dP_{XY}(\tilde{x}, \tilde{y})$

always exists and $G_\varphi$ is an element of $L^2(P_Y)$. It is not difficult to see that

$S_{YX}: L^2(P_X) \to L^2(P_Y), \quad \varphi \mapsto G_\varphi$

is a bounded linear operator with $\|S_{YX}\| \le E_Y[k_Y(Y, Y)]$.
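If $f$ is a function in $\mathcal{H}_X$, we have for any $y \in \mathcal{Y}$

$G_f(y) = \langle k_Y(\cdot, y), \Sigma_{YX} f\rangle_{\mathcal{H}_Y} = (\Sigma_{YX} f)(y),$

which implies the following proposition:

Proposition 1. The covariance operator $\Sigma_{YX}: \mathcal{H}_X \to \mathcal{H}_Y$ is the restriction of the integral operator $S_{YX}$ to $\mathcal{H}_X$. More precisely, $J_Y \Sigma_{YX} = S_{YX} J_X$.

Conditional variance can also be represented by covariance operators. Define the conditional covariance operator $\Sigma_{YY|X}$ by

$\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YY}^{1/2} V_{YX} V_{XY} \Sigma_{YY}^{1/2},$

where $V_{YX}$ is the bounded operator in (5). For convenience we sometimes write $\Sigma_{YY|X}$ as

$\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY},$

which is an abuse of notation, because $\Sigma_{XX}^{-1}$ may not exist.

The following two propositions provide insights into the meaning of a conditional covariance operator. The former proposition relates the operator to the residual error of regression, and the latter proposition expresses the residual error in terms of the conditional variance.

Proposition 2. For any $g \in \mathcal{H}_Y$,

$\langle g, \Sigma_{YY|X}\, g\rangle_{\mathcal{H}_Y} = \inf_{f \in \mathcal{H}_X} E_{XY}|(g(Y) - E_Y[g(Y)]) - (f(X) - E_X[f(X)])|^2.$

Proof. Let $\Sigma_{YX} = \Sigma_{YY}^{1/2} V_{YX} \Sigma_{XX}^{1/2}$ be the decomposition in (5), and define $\mathcal{E}_g(f) = E_{YX}|(g(Y) - E_Y[g(Y)]) - (f(X) - E_X[f(X)])|^2$. From the equality

$\mathcal{E}_g(f) = \|\Sigma_{XX}^{1/2} f\|^2_{\mathcal{H}_X} - 2\langle V_{XY} \Sigma_{YY}^{1/2} g, \Sigma_{XX}^{1/2} f\rangle_{\mathcal{H}_X} + \|\Sigma_{YY}^{1/2} g\|^2_{\mathcal{H}_Y},$

replacing $\Sigma_{XX}^{1/2} f$ with an arbitrary $\phi \in \mathcal{H}_X$ yields

$\inf_{f \in \mathcal{H}_X} \mathcal{E}_g(f) \ge \inf_{\phi \in \mathcal{H}_X} \{\|\phi\|^2_{\mathcal{H}_X} - 2\langle V_{XY} \Sigma_{YY}^{1/2} g, \phi\rangle_{\mathcal{H}_X} + \|\Sigma_{YY}^{1/2} g\|^2_{\mathcal{H}_Y}\}$
$= \inf_{\phi \in \mathcal{H}_X} \|\phi - V_{XY} \Sigma_{YY}^{1/2} g\|^2_{\mathcal{H}_X} + \langle g, \Sigma_{YY|X}\, g\rangle_{\mathcal{H}_Y}$
$= \langle g, \Sigma_{YY|X}\, g\rangle_{\mathcal{H}_Y}.$

For the opposite inequality, take an arbitrary $\varepsilon > 0$. From the fact that $V_{XY} \Sigma_{YY}^{1/2} g \in \overline{R(\Sigma_{XX})} = \overline{R(\Sigma_{XX}^{1/2})}$, there exists $f_* \in \mathcal{H}_X$ such that $\|\Sigma_{XX}^{1/2} f_* - V_{XY} \Sigma_{YY}^{1/2} g\|_{\mathcal{H}_X} \le \varepsilon$. For such $f_*$,

$\mathcal{E}_g(f_*) = \|\Sigma_{XX}^{1/2} f_*\|^2_{\mathcal{H}_X} - 2\langle V_{XY} \Sigma_{YY}^{1/2} g, \Sigma_{XX}^{1/2} f_*\rangle_{\mathcal{H}_X} + \|\Sigma_{YY}^{1/2} g\|^2_{\mathcal{H}_Y}$
$= \|\Sigma_{XX}^{1/2} f_* - V_{XY} \Sigma_{YY}^{1/2} g\|^2_{\mathcal{H}_X} + \|\Sigma_{YY}^{1/2} g\|^2_{\mathcal{H}_Y} - \|V_{XY} \Sigma_{YY}^{1/2} g\|^2_{\mathcal{H}_X}$
$\le \langle g, \Sigma_{YY|X}\, g\rangle_{\mathcal{H}_Y} + \varepsilon^2.$

Because $\varepsilon$ is arbitrary, we have $\inf_{f \in \mathcal{H}_X} \mathcal{E}_g(f) \le \langle g, \Sigma_{YY|X}\, g\rangle_{\mathcal{H}_Y}$.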
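With linear kernels on Euclidean spaces the operators above reduce to ordinary covariance matrices, and the identity in Proposition 2 can be checked numerically on a sample. The following is a minimal sketch under that assumption; the function name `residual_identity_check`, the synthetic data and all parameter values are illustrative and not taken from the paper.

```python
import numpy as np

def residual_identity_check(seed=0, n=500, p=3, q=2):
    """Compare b^T C_{YY|X} b with the least-squares residual of b^T Y on X.

    Both sides use empirical (1/n) covariances of centered data, so they
    should agree up to floating-point error.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    Y = X @ rng.standard_normal((p, q)) + 0.5 * rng.standard_normal((n, q))

    # Center the data so that second moments are covariances.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Cxx = Xc.T @ Xc / n
    Cxy = Xc.T @ Yc / n
    Cyy = Yc.T @ Yc / n

    # Conditional covariance matrix C_{YY|X} = C_YY - C_YX C_XX^{-1} C_XY.
    Cyy_x = Cyy - Cxy.T @ np.linalg.solve(Cxx, Cxy)

    b = rng.standard_normal(q)
    lhs = b @ Cyy_x @ b

    # Residual of the best linear prediction of b^T Y from X.
    a, *_ = np.linalg.lstsq(Xc, Yc @ b, rcond=None)
    rhs = np.mean((Yc @ b - Xc @ a) ** 2)
    return lhs, rhs

lhs, rhs = residual_identity_check()
print(lhs, rhs)
```

The two printed numbers should coincide up to rounding; the operator statement replaces the linear predictor $a^T X$ by an arbitrary $f \in \mathcal{H}_X$.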
Proposition 2 is an analog for operators of a well-known result on covariance matrices and linear regression: the conditional covariance matrix $C_{YY|X} = C_{YY} - C_{YX} C_{XX}^{-1} C_{XY}$ expresses the residual error of the least squares regression problem as $b^T C_{YY|X}\, b = \min_a E\|b^T Y - a^T X\|^2$.

To relate the residual error in Proposition 2 to the conditional variance of $g(Y)$ given $X$, we make the following mild assumption:

(AS) $\mathcal{H}_X + \mathbb{R}$ is dense in $L^2(P_X)$, where $\mathcal{H}_X + \mathbb{R}$ denotes the direct sum of the RKHS $\mathcal{H}_X$ and the RKHS $\mathbb{R}$ [1].

As seen later in Section 2.2, there are many positive definite kernels that satisfy the assumption (AS). Examples include the Gaussian radial basis function (RBF) kernel $k(x, y) = \exp(-\|x - y\|^2/\sigma^2)$ on $\mathbb{R}^m$ or on a compact subset of $\mathbb{R}^m$.

Proposition 3. Under the assumption (AS),

(7) $\langle g, \Sigma_{YY|X}\, g\rangle_{\mathcal{H}_Y} = E_X[\mathrm{Var}_{Y|X}[g(Y)|X]]$

for all $g \in \mathcal{H}_Y$.

Proof. From Proposition 2, we have

$\langle g, \Sigma_{YY|X}\, g\rangle_{\mathcal{H}_Y} = \inf_{f \in \mathcal{H}_X} \mathrm{Var}[g(Y) - f(X)]$
$= \inf_{f \in \mathcal{H}_X} \{\mathrm{Var}_X[E_{Y|X}[g(Y) - f(X)|X]] + E_X[\mathrm{Var}_{Y|X}[g(Y) - f(X)|X]]\}$
$= \inf_{f \in \mathcal{H}_X} \mathrm{Var}_X[E_{Y|X}[g(Y)|X] - f(X)] + E_X[\mathrm{Var}_{Y|X}[g(Y)|X]].$

Let $\varphi(x) = E_{Y|X}[g(Y)|X = x]$. Since $\varphi \in L^2(P_X)$ from $\mathrm{Var}[\varphi(X)] \le \mathrm{Var}[g(Y)] < \infty$, the assumption (AS) implies that for an arbitrary $\varepsilon > 0$ there exist $f \in \mathcal{H}_X$ and $c \in \mathbb{R}$ such that $h = f + c$ satisfies $\|\varphi - h\|_{L^2(P_X)} < \varepsilon$. Because $\mathrm{Var}[\varphi(X) - f(X)] \le \|\varphi - h\|^2_{L^2(P_X)} \le \varepsilon^2$ and $\varepsilon$ is arbitrary, we have $\inf_{f \in \mathcal{H}_X} \mathrm{Var}_X[E_{Y|X}[g(Y)|X] - f(X)] = 0$, which completes the proof.

Proposition 3 improves a result due to Fukumizu, Bach and Jordan ([14], Proposition 5), where the much stronger assumption $E[g(Y)|X = \cdot\,] \in \mathcal{H}_X$ was imposed.

Propositions 2 and 3 imply that the operator $\Sigma_{YY|X}$ can be interpreted as capturing the predictive ability for $Y$ of the explanatory variable $X$.

2.2. Criterion of kernel dimension reduction. Let $M(m \times n; \mathbb{R})$ be the set of real-valued $m \times n$ matrices.
For a natural number $d \le m$, the Stiefel manifold $\mathbb{S}_d^m(\mathbb{R})$ is defined by

$\mathbb{S}_d^m(\mathbb{R}) = \{B \in M(m \times d; \mathbb{R}) \mid B^T B = I_d\},$

which is the set of all $d$ orthonormal vectors in $\mathbb{R}^m$. It is well known that $\mathbb{S}_d^m(\mathbb{R})$ is a compact smooth manifold. For $B \in \mathbb{S}_d^m(\mathbb{R})$, the matrix $BB^T$ defines an orthogonal projection of $\mathbb{R}^m$ onto the $d$-dimensional subspace spanned by the column vectors of $B$. Although the Grassmann manifold is often used in the study of sets of subspaces in $\mathbb{R}^m$, we find the Stiefel manifold more convenient as it allows us to use matrix notation explicitly.

Hereafter, $\mathcal{X}$ is assumed to be either a closed ball $D_m(r) = \{x \in \mathbb{R}^m \mid \|x\| \le r\}$ or the entire Euclidean space $\mathbb{R}^m$; both assumptions satisfy the condition that the projection $BB^T \mathcal{X}$ is included in $\mathcal{X}$ for all $B \in \mathbb{S}_d^m(\mathbb{R})$.

Let $\mathbb{B}_d^m \subseteq \mathbb{S}_d^m(\mathbb{R})$ denote the subset of matrices whose columns span a dimension-reduction subspace; for each $B_0 \in \mathbb{B}_d^m$, we have

(8) $p_{Y|X}(y|x) = p_{Y|B_0^T X}(y|B_0^T x),$

where $p_{Y|X}(y|x)$ and $p_{Y|B^T X}(y|u)$ are the conditional probability densities of $Y$ given $X$, and $Y$ given $B^T X$, respectively. The existence and positivity of these conditional probability densities are always assumed hereafter. As we have discussed in the Introduction, under conditions given by [6], Section 6.4, this subset represents the central subspace (under the assumption that $d$ is the minimum dimensionality of the dimension reduction subspaces).

We now turn to the key problem of characterizing the subset $\mathbb{B}_d^m$ using conditional covariance operators on reproducing kernel Hilbert spaces. In the following, we assume that $k_d(z, \tilde{z})$ is a positive definite kernel on $\mathcal{Z} = D_d(r)$ or $\mathbb{R}^d$ such that $E_X[k_d(B^T X, B^T X)] < \infty$ for all $B \in \mathbb{S}_d^m(\mathbb{R})$, and we let $k_X^B$ denote a positive definite kernel on $\mathcal{X}$ given by

(9) $k_X^B(x, \tilde{x}) = k_d(B^T x, B^T \tilde{x})$

for each $B \in \mathbb{S}_d^m(\mathbb{R})$. The RKHS associated with $k_X^B$ is denoted by $\mathcal{H}_X^B$.
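The objects just defined are easy to build numerically. The sketch below draws a point on the Stiefel manifold via a reduced QR factorization (one common way to obtain orthonormal columns, not something prescribed by the text) and evaluates the projected kernel in (9) with a Gaussian RBF choice for $k_d$. The function names and the bandwidth `sigma2` are assumptions made for illustration.

```python
import numpy as np

def random_stiefel(m, d, rng):
    """Return an m x d matrix B with orthonormal columns (B^T B = I_d)."""
    A = rng.standard_normal((m, d))
    Q, _ = np.linalg.qr(A)  # reduced QR: Q has shape (m, d)
    return Q

def projected_kernel(x, x_tilde, B, sigma2=1.0):
    """k_X^B(x, x~) = k_d(B^T x, B^T x~) with a Gaussian RBF k_d."""
    z = B.T @ x
    z_tilde = B.T @ x_tilde
    return np.exp(-np.sum((z - z_tilde) ** 2) / sigma2)

rng = np.random.default_rng(0)
B = random_stiefel(5, 2, rng)
P = B @ B.T  # orthogonal projection onto the column space of B
x, x_tilde = rng.standard_normal(5), rng.standard_normal(5)
print(projected_kernel(x, x_tilde, B))
```

Because $B^T (BB^T x) = B^T x$, the kernel value depends on $x$ only through its projection onto the column space of $B$, which is why functions in the associated RKHS are functions of $B^T x$.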
Note that $\mathcal{H}_X^B = \{f: \mathcal{X} \to \mathbb{R} \mid \text{there exists } g \in \mathcal{H}_{k_d} \text{ such that } f(x) = g(B^T x)\}$, where $\mathcal{H}_{k_d}$ is the RKHS given by $k_d$.

As seen later in Theorem 4, if $\mathcal{X}$ and $\mathcal{Y}$ are subsets of Euclidean spaces and Gaussian RBF kernels are used for $k_X$ and $k_Y$, under some conditions the subset $\mathbb{B}_d^m$ is characterized by the set of solutions of an optimization problem

(10) $\mathbb{B}_d^m = \arg\min_{B \in \mathbb{S}_d^m(\mathbb{R})} \Sigma_{YY|X}^B,$

where $\Sigma_{YX}^B$ and $\Sigma_{XX}^B$ denote the (cross-)covariance operators with respect to the kernel $k^B$, and

$\Sigma_{YY|X}^B = \Sigma_{YY} - \Sigma_{YX}^B (\Sigma_{XX}^B)^{-1} \Sigma_{XY}^B.$

The minimization in (10) refers to the minimal operators in the partial order of self-adjoint operators.

We use the trace to evaluate the partial order of self-adjoint operators. While other possibilities exist (e.g., the determinant), the trace has the advantage of yielding a relatively simple theoretical analysis, which is conducted in Section 4. The operator $\Sigma_{YY|X}^B$ is trace class for all $B \in \mathbb{S}_d^m(\mathbb{R})$, since $\Sigma_{YY|X}^B \le \Sigma_{YY}$ and $\mathrm{Tr}[\Sigma_{YY}] < \infty$, which is shown in Section 4.2. Henceforth the minimization in (10) should thus be understood as that of minimizing $\mathrm{Tr}[\Sigma_{YY|X}^B]$.

From Propositions 2 and 3, minimization of $\mathrm{Tr}[\Sigma_{YY|X}^B]$ is equivalent to the minimization of the sum of the residual errors for the optimal prediction of functions of $Y$ using $B^T X$, where the sum is taken over a complete orthonormal system $\{\xi_a\}_{a=1}^\infty$ of $\mathcal{H}_Y$. Thus, the objective of dimension reduction is rewritten as

(11) $\min_{B \in \mathbb{S}_d^m(\mathbb{R})} \sum_{a=1}^{\infty} \min_{f \in \mathcal{H}_X^B} E|(\xi_a(Y) - E[\xi_a(Y)]) - (f(X) - E[f(X)])|^2.$

This is intuitively reasonable as a criterion of choosing $B$, and we will see that this is equivalent to finding the central subspace under some conditions.

We now introduce a class of kernels to characterize conditional independence. Let $(\Omega, \mathcal{B})$ be a measurable space, let $(\mathcal{H}, k)$ be an RKHS over $\Omega$ with the kernel $k$ measurable and bounded, and let $\mathcal{S}$ be the set of all probability measures on $(\Omega, \mathcal{B})$.
The RKHS $\mathcal{H}$ is called characteristic (with respect to $\mathcal{B}$) if the map

(12) $\mathcal{S} \ni P \mapsto m_P = E_{X \sim P}[k(\cdot, X)] \in \mathcal{H}$

is one-to-one, where $m_P$ is the mean element of the random variable with law $P$. It is easy to see that $\mathcal{H}$ is characteristic if and only if the equality $\int f\, dP = \int f\, dQ$ for all $f \in \mathcal{H}$ means $P = Q$. We also call a positive definite kernel $k$ characteristic if the associated RKHS is characteristic. It is known that the Gaussian RBF kernel $\exp(-\|x - y\|^2/\sigma^2)$ and the so-called Laplacian kernel $\exp(-\alpha \sum_{i=1}^m |x_i - y_i|)$ ($\alpha > 0$) are characteristic on $\mathbb{R}^m$ or on a compact subset of $\mathbb{R}^m$ with respect to the Borel σ-field [2, 15, 28].

The following theorem improves Theorem 7 in [14], and is the theoretical basis of kernel dimension reduction. In the following, let $P_B$ denote the probability on $\mathcal{X}$ induced from $P_X$ by the projection $BB^T: \mathcal{X} \to \mathcal{X}$.

Theorem 4. Suppose that the closure of $\mathcal{H}_X^B$ in $L^2(P_X)$ is included in the closure of $\mathcal{H}_X$ in $L^2(P_X)$ for any $B \in \mathbb{S}_d^m(\mathbb{R})$. Then,

(13) $\Sigma_{YY|X}^B \ge \Sigma_{YY|X},$

where the inequality refers to the order of self-adjoint operators. If further $(\mathcal{H}_X, P_X)$ and $(\mathcal{H}_X^B, P_B)$ satisfy (AS) for every $B \in \mathbb{S}_d^m(\mathbb{R})$ and $\mathcal{H}_Y$ is characteristic, the following equivalence holds:

(14) $\Sigma_{YY|X} = \Sigma_{YY|X}^B \iff Y \perp\!\!\!\perp X \mid B^T X.$

Proof. The first assertion is obvious from Proposition 2. For the second assertion, let $C$ be an $m \times (m - d)$ matrix whose columns span the orthogonal complement to the subspace spanned by the columns of $B$, and let $(U, V) = (B^T X, C^T X)$ for notational simplicity. By taking the expectation of the well-known relation

$\mathrm{Var}_{Y|U}[g(Y)|U] = E_{V|U}[\mathrm{Var}_{Y|U,V}[g(Y)|U, V]] + \mathrm{Var}_{V|U}[E_{Y|U,V}[g(Y)|U, V]]$

with respect to $U$, we have

$E_U[\mathrm{Var}_{Y|U}[g(Y)|U]] = E_X[\mathrm{Var}_{Y|X}[g(Y)|X]] + E_U[\mathrm{Var}_{V|U}[E_{Y|U,V}[g(Y)|U, V]]],$

from which Proposition 3 yields

$\langle g, (\Sigma_{YY|X}^B - \Sigma_{YY|X})\, g\rangle_{\mathcal{H}_Y} = E_U[\mathrm{Var}_{V|U}[E_{Y|U,V}[g(Y)|U, V]]].$
It follows that the left-hand side of the equivalence in (14) holds if and only if, for every $g \in \mathcal{H}_Y$, $E_{Y|U,V}[g(Y)|U, V]$ does not depend on $V$ almost surely. This is equivalent to $E_{Y|X}[g(Y)|X] = E_{Y|U}[g(Y)|U]$ almost surely. Since $\mathcal{H}_Y$ is characteristic, this means that the conditional probability of $Y$ given $X$ is reduced to that of $Y$ given $U$.

The assumption (AS) and the notion of characteristic kernel are closely related. In fact, from the following proposition, (AS) is satisfied if a characteristic kernel is used. Thus, if $\mathcal{Y}$ is Euclidean, the choice of Gaussian RBF kernels for $k_d$, $k_X$ and $k_Y$ is sufficient to guarantee the equivalence given by (14).

Proposition 5. Let $(\Omega, \mathcal{B})$ be a measurable space, and $(k, \mathcal{H})$ be a bounded measurable positive definite kernel on $\Omega$ and its RKHS. Then, $k$ is characteristic if and only if $\mathcal{H} + \mathbb{R}$ is dense in $L^2(P)$ for any probability measure $P$ on $(\Omega, \mathcal{B})$.

Proof. For the proof of the "if" part, suppose $m_P = m_Q$ for $P \ne Q$. Denote the total variation of $P - Q$ by $|P - Q|$. Since $\mathcal{H} + \mathbb{R}$ is dense in $L^2(|P - Q|)$, for arbitrary $\varepsilon > 0$ and $A \in \mathcal{B}$, there exists $f \in \mathcal{H} + \mathbb{R}$ such that $\int |f - I_A|\, d|P - Q| < \varepsilon$, where $I_A$ is the indicator function of $A$. It follows that $|(E_P[f(X)] - P(A)) - (E_Q[f(X)] - Q(A))| < \varepsilon$. Because $E_P[f(X)] = E_Q[f(X)]$ from $m_P = m_Q$, we have $|P(A) - Q(A)| < \varepsilon$ for any $\varepsilon > 0$, which contradicts $P \ne Q$.

For the opposite direction, suppose $\mathcal{H} + \mathbb{R}$ is not dense in $L^2(P)$. There is a nonzero $f \in L^2(P)$ such that $\int f\, dP = 0$ and $\int f\varphi\, dP = 0$ for any $\varphi \in \mathcal{H}$. Let $c = 1/\|f\|_{L^1(P)}$, and define two probability measures $Q_1$ and $Q_2$ by $Q_1(E) = c \int_E |f|\, dP$ and $Q_2(E) = c \int_E (|f| - f)\, dP$ for any measurable set $E$. By $f \ne 0$, we have $Q_1 \ne Q_2$, while $E_{Q_1}[k(\cdot, X)] - E_{Q_2}[k(\cdot, X)] = c \int f(x) k(\cdot, x)\, dP(x) = 0$, which means $k$ is not characteristic.

2.3. Kernel dimension reduction procedure.
We now use the characterization given in Theorem 4 to develop an optimization procedure for estimating the central subspace from an empirical sample $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$. We assume that $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ is sampled i.i.d. from $P_{XY}$ and we assume that there exists $B_0 \in \mathbb{S}_d^m(\mathbb{R})$ such that $p_{Y|X}(y|x) = p_{Y|B_0^T X}(y|B_0^T x)$.

We define the empirical cross-covariance operator $\widehat{\Sigma}_{YX}^{(n)}$ by evaluating the cross-covariance operator at the empirical distribution $\frac{1}{n}\sum_{i=1}^n \delta_{X_i} \delta_{Y_i}$. When acting on functions $f \in \mathcal{H}_X$ and $g \in \mathcal{H}_Y$, the operator $\widehat{\Sigma}_{YX}^{(n)}$ gives the empirical covariance

$\langle g, \widehat{\Sigma}_{YX}^{(n)} f\rangle_{\mathcal{H}_Y} = \frac{1}{n}\sum_{i=1}^n g(Y_i) f(X_i) - \left(\frac{1}{n}\sum_{i=1}^n g(Y_i)\right)\left(\frac{1}{n}\sum_{i=1}^n f(X_i)\right).$

Also, for $B \in \mathbb{S}_d^m(\mathbb{R})$, let $\widehat{\Sigma}_{YY|X}^{B(n)}$ denote the empirical conditional covariance operator:

(15) $\widehat{\Sigma}_{YY|X}^{B(n)} = \widehat{\Sigma}_{YY}^{(n)} - \widehat{\Sigma}_{YX}^{B(n)} (\widehat{\Sigma}_{XX}^{B(n)} + \varepsilon_n I)^{-1} \widehat{\Sigma}_{XY}^{B(n)}.$

The regularization term $\varepsilon_n I$ ($\varepsilon_n > 0$) is required to enable operator inversion and is thus analogous to Tikhonov regularization [17]. We will see that the regularization term is also needed for consistency.

We now define the KDR estimator $\widehat{B}^{(n)}$ as any minimizer of $\mathrm{Tr}[\widehat{\Sigma}_{YY|X}^{B(n)}]$ on the manifold $\mathbb{S}_d^m(\mathbb{R})$; that is, any matrix in $\mathbb{S}_d^m(\mathbb{R})$ that minimizes

(16) $\mathrm{Tr}[\widehat{\Sigma}_{YY}^{(n)} - \widehat{\Sigma}_{YX}^{B(n)} (\widehat{\Sigma}_{XX}^{B(n)} + \varepsilon_n I)^{-1} \widehat{\Sigma}_{XY}^{B(n)}].$

In view of (11), this is equivalent to minimizing

$\sum_{a=1}^{\infty} \min_{f \in \mathcal{H}_X^B} \left[ \frac{1}{n}\sum_{i=1}^n \left| \left\{\xi_a(Y_i) - \frac{1}{n}\sum_{j=1}^n \xi_a(Y_j)\right\} - \left\{f(X_i) - \frac{1}{n}\sum_{j=1}^n f(X_j)\right\} \right|^2 + \varepsilon_n \|f\|^2_{\mathcal{H}_X^B} \right]$

over $B \in \mathbb{S}_d^m(\mathbb{R})$, where $\{\xi_a\}_{a=1}^\infty$ is a complete orthonormal system for $\mathcal{H}_Y$.

The KDR contrast function in (16) can also be expressed in terms of Gram matrices (given a kernel $k$, the Gram matrix is the $n \times n$ matrix whose entries are the evaluations of the kernel on all pairs of $n$ data points). Let $\phi_i^B \in \mathcal{H}_X^B$ and $\psi_i \in \mathcal{H}_Y$ ($1 \le i \le n$) be functions defined by

$\phi_i^B = k^B(\cdot, X_i) - \frac{1}{n}\sum_{j=1}^n k^B(\cdot, X_j), \qquad \psi_i = k_Y(\cdot, Y_i) - \frac{1}{n}\sum_{j=1}^n k_Y(\cdot, Y_j).$

Because $R(\widehat{\Sigma}_{XX}^{B(n)}) = N(\widehat{\Sigma}_{XX}^{B(n)})^{\perp}$ and $R(\widehat{\Sigma}_{YY}^{(n)}) = N(\widehat{\Sigma}_{YY}^{(n)})^{\perp}$ are spanned by $(\phi_i^B)_{i=1}^n$ and $(\psi_i)_{i=1}^n$, respectively, the trace of $\widehat{\Sigma}_{YY|X}^{B(n)}$ is equal to that of the matrix representation of $\widehat{\Sigma}_{YY|X}^{B(n)}$ on the linear hull of $(\psi_i)_{i=1}^n$. Note that although the vectors $(\psi_i)_{i=1}^n$ are over-complete, the trace of the matrix representation with respect to these vectors is equal to the trace of the operator.

For $B \in \mathbb{S}_d^m(\mathbb{R})$, the centered Gram matrix $G_X^B$ with respect to the kernel $k^B$ is defined by

$(G_X^B)_{ij} = \langle \phi_i^B, \phi_j^B\rangle_{\mathcal{H}_X^B} = k_X^B(X_i, X_j) - \frac{1}{n}\sum_{b=1}^n k_X^B(X_i, X_b) - \frac{1}{n}\sum_{a=1}^n k_X^B(X_a, X_j) + \frac{1}{n^2}\sum_{a=1}^n \sum_{b=1}^n k_X^B(X_a, X_b),$

and $G_Y$ is defined similarly. By direct calculation, it is easy to obtain

$\widehat{\Sigma}_{YY|X}^{B(n)} \psi_i = \frac{1}{n}\sum_{j=1}^n \psi_j (G_Y)_{ji} - \frac{1}{n}\sum_{j=1}^n \psi_j (G_X^B (G_X^B + n\varepsilon_n I_n)^{-1} G_Y)_{ji}.$

It follows that the matrix representation of $\widehat{\Sigma}_{YY|X}^{B(n)}$ with respect to $(\psi_i)_{i=1}^n$ is $\frac{1}{n}\{G_Y - G_X^B (G_X^B + n\varepsilon_n I_n)^{-1} G_Y\}$ and its trace is

$\mathrm{Tr}[\widehat{\Sigma}_{YY|X}^{B(n)}] = \frac{1}{n}\mathrm{Tr}[G_Y - G_X^B (G_X^B + n\varepsilon_n I_n)^{-1} G_Y] = \varepsilon_n \mathrm{Tr}[G_Y (G_X^B + n\varepsilon_n I_n)^{-1}].$

Omitting the constant factor, the KDR contrast function in (16) thus reduces to

(17) $\mathrm{Tr}[G_Y (G_X^B + n\varepsilon_n I_n)^{-1}].$

The KDR method is defined as the optimization of this function over the manifold $\mathbb{S}_d^m(\mathbb{R})$.

Theorem 4 is the population justification of the KDR method. Note that this derivation imposes no strong assumptions either on the conditional probability of $Y$ given $X$, or on the marginal distributions of $X$ and $Y$. In particular, it does not require ellipticity of the marginal distribution of $X$, nor does it require an additive noise model. The response variable $Y$ may be either continuous or discrete. We confirm this general applicability of the KDR method by the numerical results presented in the next section.
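The contrast function (17) can be evaluated directly from data. The following is a minimal NumPy sketch, assuming Gaussian RBF kernels for both $k_d$ and $k_Y$; the helper names (`gaussian_gram`, `center`, `kdr_contrast`), the bandwidths and the toy data are illustrative choices rather than the authors' implementation, and no optimization over $B$ is performed here.

```python
import numpy as np

def gaussian_gram(Z, sigma2):
    """Gram matrix of exp(-||z_i - z_j||^2 / sigma2) for the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-np.maximum(d2, 0.0) / sigma2)

def center(K):
    """Centered Gram matrix H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kdr_contrast(B, X, Y, sigma2_x, sigma2_y, eps):
    """Tr[G_Y (G_X^B + n eps I_n)^{-1}] for an m x d matrix B."""
    n = X.shape[0]
    Gx = center(gaussian_gram(X @ B, sigma2_x))  # kernel on B^T x
    Gy = center(gaussian_gram(Y, sigma2_y))
    # Tr[G_Y A^{-1}] = Tr[A^{-1} G_Y]; solve avoids forming the inverse.
    return np.trace(np.linalg.solve(Gx + n * eps * np.eye(n), Gy))

# Toy example: Y depends on X only through its first coordinate.
rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 2))
Y = X[:, :1] + 0.1 * rng.standard_normal((n, 1))
B_true = np.array([[1.0], [0.0]])
B_wrong = np.array([[0.0], [1.0]])
print(kdr_contrast(B_true, X, Y, 2.0, 2.0, 0.1))
print(kdr_contrast(B_wrong, X, Y, 2.0, 2.0, 0.1))
```

In this toy setting the contrast is expected to be smaller at the informative direction than at the uninformative one. A full estimator would minimize `kdr_contrast` over matrices with orthonormal columns.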
Because the contrast function (17) is nonconvex, the minimization requires a nonlinear optimization technique; in our experiments we use the steepest descent method with line search. To alleviate potential problems with local optima, we use a continuation method in which the scale parameter $\sigma$ in the Gaussian RBF kernel $\exp(-\|x - y\|^2/\sigma^2)$ is gradually decreased during the iterative optimization process. In the numerical examples shown in the next section, we used a fixed number of iterations, and decreased $\sigma^2$ linearly from $\sigma^2 = 100$ to $\sigma^2 = 10$ for standardized data with standard deviation 5.0. The rationale is that, as $\sigma \to \infty$, the covariance operator approaches the covariance operator induced by a linear kernel, for which the problem is solvable as an eigenproblem.

In addition to $\sigma$, there is another tuning parameter $\varepsilon_n$, the regularization coefficient. As both of these tuning parameters have a similar smoothing effect, it is reasonable to fix one of them and select the other; in our experiments we fixed $\varepsilon_n = 0.1$ as an arbitrary choice and varied $\sigma^2$. While there is no theoretical guarantee for this choice, we observe that the results are generally stable if the optimization process is successful. There also exist heuristics for choosing kernel parameters in similar RKHS-based dependency analysis; an example is to use the median of pairwise distances of the data for the parameter $\sigma$ in the Gaussian RBF kernel [16]. Currently, however, we are not aware of theoretically justified methods of choosing these parameters; this is an important open problem.

The proposed estimator is shown to be consistent as the sample size goes to infinity. We defer the proof to Section 4.

3. Numerical results.

3.1. Simulation studies. In this section we compare the performance of the KDR method with that of several well-known dimension reduction methods. Specifically, we compare to SIR, pHd and SAVE on synthetic data sets generated by the regressions in Examples 6.2, 6.3 and 6.4 of [22].
The results are evaluated by computing the Frobenius distance between the projection matrix of the estimated subspace and that of the true subspace; this evaluation measure is invariant under change of basis and is equal to

\[
\|B_0 B_0^T - \hat B \hat B^T\|_F,
\]

where $B_0$ and $\hat B$ are matrices in the Stiefel manifold $\mathbb{S}_d^m(\mathbb{R})$ representing the true subspace and the estimated subspace, respectively. For the KDR method, a Gaussian RBF kernel $\exp(-\|z_1 - z_2\|^2/c)$ was used, with $c = 2.0$ for regression (A) and regression (C) and $c = 0.5$ for regression (B). The parameter estimate $\hat B$ was updated 100 times by the steepest descent method. The regularization parameter was fixed at $\varepsilon = 0.1$. For SIR and SAVE, we optimized the number of slices for each simulation so as to obtain the best average norm.

Regression (A) is given by

\[
Y = \frac{X_1}{0.5 + (X_2 + 1.5)^2} + (1 + X_2)^2 + \sigma E, \tag{A}
\]

where $X \sim N(0, I_4)$ is a four-dimensional explanatory variable, and $E \sim N(0,1)$ is independent of $X$. Thus, the central subspace is spanned by the vectors $(1,0,0,0)$ and $(0,1,0,0)$. For the noise level $\sigma$, three different values were used: $\sigma = 0.1$, $0.4$ and $0.8$. We used 100 random replications with 100 samples each. Note that the distribution of the explanatory variable X satisfies the ellipticity assumption, as required by the SIR, SAVE and pHd methods.

Table 1
Comparison of KDR and other methods for regression (A)

          KDR            SIR            SAVE           pHd
σ      NORM   SD      NORM   SD      NORM   SD      NORM   SD
0.1    0.11   0.07    0.55   0.28    0.77   0.35    1.04   0.34
0.4    0.17   0.09    0.60   0.27    0.82   0.34    1.03   0.33
0.8    0.34   0.22    0.69   0.25    0.94   0.35    1.06   0.33

Table 1 shows the mean and the standard deviation of the Frobenius norm over 100 samples. We see that the KDR method outperforms the other three methods in terms of estimation accuracy. It is also worth noting that in the results presented by Li, Zha and Chiaromonte [22] for their GCR method, the average norm was 0.28, 0.33, 0.45 for $\sigma = 0.1, 0.4, 0.8$, respectively; again, this is worse than the performance of KDR.

The second regression is given by

\[
Y = \sin^2(\pi X_2 + 1) + \sigma E, \tag{B}
\]

where $X \in \mathbb{R}^4$ is distributed uniformly on the set

\[
[0,1]^4 \setminus \{x \in \mathbb{R}^4 \mid x_i \le 0.7\ (i = 1,2,3,4)\},
\]

and $E \sim N(0,1)$ is independent noise. The standard deviation $\sigma$ is fixed at $\sigma = 0.1$, $0.2$ and $0.3$. Note that in this example the distribution of X does not satisfy the ellipticity assumption. Table 2 shows the results of the simulation experiments for this regression. We see that KDR again outperforms the other methods.

The third regression is given by

\[
Y = \tfrac{1}{2}(X_1 - a)^2 E, \tag{C}
\]

where $X \sim N(0, I_{10})$ is a ten-dimensional variable and $E \sim N(0,1)$ is independent noise. The parameter $a$ is fixed at $a = 0$, $0.5$ and $1$. Note that in this example the conditional probability $p(y|x)$ does not obey an additive noise assumption. The mean of Y is zero and the variance is a quadratic function of $X_1$. We generated 100 samples of 500 data. The results for KDR and the other methods are shown in Table 3, in which we again confirm that the KDR method yields significantly better performance than the other methods. In this case, pHd fails to find the true subspace; this is due to the fact that pHd is incapable of estimating a direction that only appears in the variance [8]. We note also that the results in [22] show that the contour regression methods SCR and GCR yield average norms larger than 1.3. Although the estimation of variance structure is generally more difficult than that of estimating mean structure, the KDR method nonetheless is effective at finding the central subspace in this case.
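The evaluation measure used in these simulations, and the data-generating mechanism of regression (C), can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `projection_distance` and `sample_regression_c` are hypothetical, and the inputs to `projection_distance` are assumed to have orthonormal columns.

```python
import numpy as np

def projection_distance(B0, B_hat):
    """Frobenius norm ||B0 B0^T - B_hat B_hat^T||_F.

    Both arguments are assumed to be m x d matrices with orthonormal
    columns, so that B B^T is the orthogonal projection onto span(B).
    """
    return np.linalg.norm(B0 @ B0.T - B_hat @ B_hat.T, ord="fro")

def sample_regression_c(n, a, rng, m=10):
    """Draw n points from regression (C): Y = 0.5 * (X1 - a)^2 * E,
    with X ~ N(0, I_m) and E ~ N(0, 1) independent of X."""
    X = rng.standard_normal((n, m))
    E = rng.standard_normal(n)
    Y = 0.5 * (X[:, 0] - a) ** 2 * E
    return X, Y
```

Because the measure depends on a matrix only through its projection, replacing `B_hat` by `B_hat @ Q` for any orthogonal $d \times d$ matrix `Q` leaves the distance unchanged, which is the basis invariance noted above.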
Table 2
Comparison of KDR and other methods for regression (B)

          KDR            SIR            SAVE           pHd
σ      NORM   SD      NORM   SD      NORM   SD      NORM   SD
0.1    0.05   0.02    0.24   0.10    0.23   0.13    0.43   0.19
0.2    0.11   0.06    0.32   0.15    0.29   0.16    0.51   0.23
0.3    0.13   0.07    0.41   0.19    0.41   0.21    0.63   0.29

Table 3
Comparison of KDR and other methods for regression (C)

          KDR            SIR            SAVE           pHd
a      NORM   SD      NORM   SD      NORM   SD      NORM   SD
0.0    0.17   0.05    1.83   0.22    0.30   0.07    1.48   0.27
0.5    0.17   0.04    0.58   0.19    0.35   0.08    1.52   0.28
1.0    0.18   0.05    0.30   0.08    0.57   0.20    1.58   0.28

3.2. Applications. We apply the KDR method to two data sets; one is a binary classification problem and the other is a regression with a continuous response variable. These data sets have been used previously in studies of dimension reduction methods.

The first data set that we studied is the Swiss bank notes data set, which has been previously studied in the dimension reduction context by Cook and Lee [7], with the data taken from [11]. The problem is that of classifying counterfeit and genuine Swiss bank notes. The data is a sample of 100 counterfeit and 100 genuine notes. There are six continuous explanatory variables that represent aspects of the size of a note: length, height on the left, height on the right, distance of inner frame to the lower border, distance of inner frame to the upper border and length of the diagonal. We standardize each of the explanatory variables so that its standard deviation is 5.0.

As we have discussed in the Introduction, many dimension reduction methods (including SIR) are not generally suitable for binary classification problems. Because among inverse regression methods the estimated subspace given by SAVE is necessarily larger than that given by pHd and SIR [7], we compared the KDR method only with SAVE for this data set.

Figure 1 shows two-dimensional plots of the data projected onto the subspaces estimated by the KDR method and by SAVE.
The figure shows that the results for KDR appear to be robust with respect to the values of the scale parameter $a$ in the Gaussian RBF kernel. (Note that if $a$ goes to infinity, the result approaches that obtained by a linear kernel, since the linear term in the Taylor expansion of the exponential function is dominant.) In the KDR case, using a Gaussian RBF kernel with scale parameter $a = 10$ and $100$ we obtain clear separation of genuine and counterfeit notes. Slightly less separation is obtained for the Gaussian RBF kernel with $a = 10{,}000$, for the linear kernel and for SAVE; in these cases there is an isolated genuine data point that lies close to the class boundary, which is similar to the results using linear discriminant analysis and specification analysis [11]. We see that KDR finds a more effective subspace to separate the two classes than SAVE and the existing analysis. Finally, note that there are two clusters of counterfeit notes in the result of SAVE, while KDR does not show multiple clusters in either class. Although clusters have also been reported in other analyses [11], Section 12, the KDR results suggest that the cluster structure may not be relevant to the classification.

Fig. 1. Two-dimensional plots of Swiss bank notes. The crosses and circles show genuine and counterfeit notes, respectively. For the KDR methods, the Gaussian RBF kernel $\exp(-\|z_1 - z_2\|^2/a)$ is used with $a = 10$, $100$ and $10{,}000$. For comparison, the plots given by KDR with a linear kernel and SAVE are shown.

We also analyzed the Evaporation data set, available in the Arc package (http://www.stat.umn.edu/arc/software.html). The data set is concerned with the effect on soil evaporation of various air and soil conditions.
The number of explanatory variables is 10: maximum daily soil temperature (Maxst), minimum daily soil temperature (Minst), area under the daily soil temperature curve (Avst), maximum daily air temperature (Maxat), minimum daily air temperature (Minat), average daily air temperature (Avat), maximum daily humidity (Maxh), minimum daily humidity (Minh), area under the daily humidity curve (Avh) and total wind speed in miles/hour (Wind). The response variable is daily soil evaporation (Evap). The data were collected daily during 46 days; thus, the number of data points is 46. This data set was studied in the context of contour regression methods for dimension reduction in [22]. We standardize each variable so that the sample variance is equal to 5.0, and use the Gaussian RBF kernel $\exp(-\|z_1 - z_2\|^2/10)$.

Our analysis yielded an estimated two-dimensional subspace which is spanned by the vectors:

KDR 1: −0.25 MAXST + 0.32 MINST + 0.00 AVST + (−0.28) MAXAT + (−0.23) MINAT + (−0.44) AVAT + 0.39 MAXH + 0.25 MINH + (−0.07) AVH + (−0.54) WIND.

KDR 2: 0.09 MAXST + (−0.02) MINST + 0.00 AVST + 0.10 MAXAT + (−0.45) MINAT + 0.23 AVAT + 0.21 MAXH + (−0.41) MINH + (−0.71) AVH + (−0.05) WIND.

In the first direction, Wind and Avat have large coefficients with the same sign, while both contribute only weakly to the second direction. In the second direction, Avh is dominant. Figure 2 presents the scatter plots representing the response Y plotted with respect to each of the first two directions given by the KDR method. Both of these directions show a clear relation with Y. Figure 3 presents the scatter plot of Y versus the two-dimensional subspace found by KDR. The obtained two-dimensional subspace is different from the one given by the existing analysis in [22]; the contour regression method gives a subspace in which the first direction shows a clear monotonic trend, but the second direction suggests a U-shaped pattern. In the result of KDR, we do not see a clear folded pattern.
Although without further analysis it is difficult to say which result expresses more clearly the statistical dependence, the plots suggest that the KDR method successfully captured the effective directions for regression.

Fig. 2. Two-dimensional representation of Evaporation data for each of the first two directions.

Fig. 3. Three-dimensional representation of Evaporation data.

4. Consistency of kernel dimension reduction. In this section we prove that the KDR estimator is consistent. Our proof of consistency requires tools from empirical process theory, suitably elaborated to handle the RKHS setting. We establish convergence of the empirical contrast function to the population contrast function under a condition on the regularization coefficient $\varepsilon_n$, and from this result infer the consistency of $\hat B^{(n)}$.

4.1. Main result. We assume hereafter that $\mathcal{Y}$ is a topological space. The Stiefel manifold $\mathbb{S}_d^m(\mathbb{R})$ is assumed to be equipped with a distance $D$ which is compatible with the topology of $\mathbb{S}_d^m(\mathbb{R})$. It is known that geodesics define such a distance (see, e.g., [19], Chapter IV).

The following technical assumptions are needed to guarantee the consistency of kernel dimension reduction:

(A-1) For any bounded continuous function $g$ on $\mathcal{Y}$, the function
\[
B \mapsto E_X\bigl[E_{Y|B^T X}[g(Y)\,|\,B^T X]^2\bigr]
\]
is continuous on $\mathbb{S}_d^m(\mathbb{R})$.

(A-2) For $B \in \mathbb{S}_d^m(\mathbb{R})$, let $P_B$ be the probability distribution of the random variable $BB^T X$ on $\mathcal{X}$. The Hilbert space $\mathcal{H}_{\mathcal X}^B + \mathbb{R}$ is dense in $L^2(P_B)$ for any $B \in \mathbb{S}_d^m(\mathbb{R})$.

(A-3) There exists a measurable function $\phi : \mathcal{X} \to \mathbb{R}$ such that $E|\phi(X)|^2 < \infty$ and the Lipschitz condition
\[
\|k_d(B^T x, \cdot) - k_d(\tilde B^T x, \cdot)\|_{\mathcal{H}_d} \le \phi(x) D(B, \tilde B)
\]
holds for all $B, \tilde B \in \mathbb{S}_d^m(\mathbb{R})$ and $x \in \mathcal{X}$.

Theorem 6. Suppose $k_d$ in (9) is continuous and bounded, and suppose the regularization parameter $\varepsilon_n$ in (15) satisfies

\[
\varepsilon_n \to 0, \qquad n^{1/2}\varepsilon_n \to \infty \qquad (n \to \infty). \tag{18}
\]

Define the set of the optimum parameters $\mathbb{B}_d^m$ by

\[
\mathbb{B}_d^m = \arg\min_{B \in \mathbb{S}_d^m(\mathbb{R})} \operatorname{Tr}\bigl[\Sigma_{YY|X}^B\bigr].
\]

Under the assumptions (A-1), (A-2) and (A-3), the set $\mathbb{B}_d^m$ is nonempty, and for an arbitrary open set $U$ in $\mathbb{S}_d^m(\mathbb{R})$ with $\mathbb{B}_d^m \subset U$ we have

\[
\lim_{n \to \infty} \Pr\bigl(\hat B^{(n)} \in U\bigr) = 1.
\]

Note that Theorem 6 holds independently of any requirement that the population contrast function characterizes conditional independence. If the additional conditions of Theorem 4 are satisfied, then the estimator converges in probability to the set of sufficient dimension-reduction subspaces.

The assumptions (A-1) and (A-2) are used to establish the continuity of $\operatorname{Tr}[\Sigma_{YY|X}^B]$ in Lemma 13, and (A-3) is needed to derive the order of uniform convergence of $\hat\Sigma_{YY|X}^{B(n)}$ in Lemma 9.

The assumption (A-1) is satisfied in various cases. Let $f(x) = E_{Y|X}[g(Y)\,|\,X = x]$, and assume $f(x)$ is continuous. This assumption holds, for example, if the conditional probability density $p_{Y|X}(y|x)$ is bounded and continuous with respect to $x$. Let $C$ be an element of $\mathbb{S}_{m-d}^m(\mathbb{R})$ such that the subspaces spanned by the column vectors of $B$ and $C$ are orthogonal; that is, the $m \times m$ matrix $(B, C)$ is an orthogonal matrix. Define random variables $U$ and $V$ by $U = B^T X$ and $V = C^T X$. If $X$ has the probability density function $p_X(x)$, the probability density function of $(U, V)$ is given by $p_{U,V}(u, v) = p_X(Bu + Cv)$. Consider the situation in which $u$ is given by $u = B^T \tilde x$ for $B \in \mathbb{S}_d^m(\mathbb{R})$ and $\tilde x \in \mathcal{X}$, and let $V_{B,\tilde x} = \{v \in \mathbb{R}^{m-d} \mid BB^T \tilde x + Cv \in \mathcal{X}\}$. We have

\[
E[g(Y)\,|\,B^T X = B^T \tilde x] = \frac{\int_{V_{B,\tilde x}} f(BB^T \tilde x + Cv)\, p_X(BB^T \tilde x + Cv)\, dv}{\int_{V_{B,\tilde x}} p_X(BB^T \tilde x + Cv)\, dv}.
\]

If there exists an integrable function $r(v)$ such that $\chi_{V_{B,\tilde x}}(v)\, p_X(BB^T \tilde x + Cv) \le r(v)$ for all $B \in \mathbb{S}_d^m(\mathbb{R})$ and $\tilde x \in \mathcal{X}$, the dominated convergence theorem ensures (A-1). Thus, it is easy to see that a sufficient condition for (A-1) is that $\mathcal{X}$ is bounded, $p_X(x)$ is bounded, and $p_{Y|X}(y|x)$ is bounded and continuous in $x$, which is satisfied by a wide class of distributions.
The assumption (A-2) holds if $\mathcal{X}$ is compact and $k_d + 1$ is a universal kernel on $\mathcal{Z}$. The assumption (A-3) is satisfied by many useful kernels; for example, kernels with the property

\[
\Bigl\|\frac{\partial^2}{\partial z_a\, \partial z_b} k_d(z_1, z_2)\Bigr\| \le L \|z_1 - z_2\| \qquad (a, b = 1, 2)
\]

for some $L > 0$. In particular Gaussian RBF kernels satisfy this property.

4.2. Proof of the consistency theorem. If the following proposition is shown, Theorem 6 follows straightforwardly by standard arguments establishing the consistency of M-estimators (see, e.g., [31], Section 5.2).

Proposition 7. Under the same assumptions as Theorem 6, the functions $\operatorname{Tr}[\hat\Sigma_{YY|X}^{B(n)}]$ and $\operatorname{Tr}[\Sigma_{YY|X}^B]$ are continuous on $\mathbb{S}_d^m(\mathbb{R})$, and

\[
\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \bigl|\operatorname{Tr}[\hat\Sigma_{YY|X}^{B(n)}] - \operatorname{Tr}[\Sigma_{YY|X}^B]\bigr| \to 0 \qquad (n \to \infty)
\]

in probability.

The proof of Proposition 7 is divided into several lemmas. We decompose $\sup_B |\operatorname{Tr}[\Sigma_{YY|X}^B] - \operatorname{Tr}[\hat\Sigma_{YY|X}^{B(n)}]|$ into two parts:

\[
\sup_B \bigl|\operatorname{Tr}[\Sigma_{YY|X}^B] - \operatorname{Tr}[\Sigma_{YY} - \Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon_n I)^{-1}\Sigma_{XY}^B]\bigr|
\]

and

\[
\sup_B \bigl|\operatorname{Tr}[\Sigma_{YY} - \Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon_n I)^{-1}\Sigma_{XY}^B] - \operatorname{Tr}[\hat\Sigma_{YY|X}^{B(n)}]\bigr|.
\]

Lemmas 8, 9 and 10 establish the convergence of the second part. The convergence of the first part is shown by Lemmas 11-14; in particular, Lemmas 12 and 13 establish the key result that the trace of the population conditional covariance operator is a continuous function of $B$.

The following lemmas make use of the trace norm and the Hilbert-Schmidt norm of operators. For a discussion of these norms, see [26], Section VI and [20], Section 30. Recall that the trace of a positive operator $A$ on a Hilbert space $\mathcal{H}$ is defined by

\[
\operatorname{Tr}[A] = \sum_{i=1}^{\infty} \langle \varphi_i, A\varphi_i\rangle_{\mathcal{H}},
\]

where $\{\varphi_i\}_{i=1}^{\infty}$ is a complete orthonormal system (CONS) of $\mathcal{H}$. A bounded operator $T$ on a Hilbert space $\mathcal{H}$ is called trace class if $\operatorname{Tr}[(T^*T)^{1/2}]$ is finite. The set of all trace class operators on a Hilbert space is a Banach space with the trace norm $\|T\|_{tr} = \operatorname{Tr}[(T^*T)^{1/2}]$.
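For intuition only, the two definitions just recalled can be checked numerically in the finite-dimensional case, where every operator is a matrix. The sketch below assumes real matrices; the helper names `trace_via_basis` and `trace_norm` are hypothetical and introduced only for this illustration.

```python
import numpy as np

def trace_via_basis(A, Q):
    """Sum of <q_i, A q_i> over the orthonormal columns q_i of Q."""
    return sum(Q[:, i] @ A @ Q[:, i] for i in range(Q.shape[1]))

def trace_norm(T):
    """Tr[(T^* T)^{1/2}] for a real matrix T, computed as the sum of the
    square roots of the eigenvalues of the symmetric matrix T^T T."""
    eigvals = np.linalg.eigvalsh(T.T @ T)
    # Clip tiny negative eigenvalues caused by rounding before the square root.
    return np.sum(np.sqrt(np.clip(eigvals, 0.0, None)))

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T                                   # a positive (semi-definite) operator
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # an orthonormal basis
T = rng.standard_normal((5, 5))

# The trace of A does not depend on the orthonormal basis used.
basis_trace = trace_via_basis(A, Q)
# In finite dimensions the trace norm equals the sum of singular values.
singular_sum = np.linalg.svd(T, compute_uv=False).sum()
```

Here `basis_trace` agrees with `np.trace(A)` and `trace_norm(T)` agrees with `singular_sum` up to rounding error; in infinite dimensions the same quantities are defined by the series above and need not be finite.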
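For a trace class operator $T$ on $\mathcal{H}$, the series $\sum_{i=1}^{\infty} \langle \varphi_i, T\varphi_i\rangle$ converges absolutely for any CONS $\{\varphi_i\}_{i=1}^{\infty}$, and the limit does not depend on the choice of CONS. The limit is called the trace of $T$, and denoted by $\operatorname{Tr}[T]$. It is known that $|\operatorname{Tr}[T]| \le \|T\|_{tr}$.

A bounded operator $T : \mathcal{H}_1 \to \mathcal{H}_2$, where $\mathcal{H}_1$ and $\mathcal{H}_2$ are Hilbert spaces, is called Hilbert-Schmidt if $\operatorname{Tr}[T^*T] < \infty$, or equivalently, $\sum_{i=1}^{\infty} \|T\varphi_i\|^2_{\mathcal{H}_2} < \infty$ for a CONS $\{\varphi_i\}_{i=1}^{\infty}$ of $\mathcal{H}_1$. The set of all Hilbert-Schmidt operators from $\mathcal{H}_1$ to $\mathcal{H}_2$ is a Hilbert space with Hilbert-Schmidt inner product

\[
\langle T_1, T_2\rangle_{HS} = \sum_{i=1}^{\infty} \langle T_1\varphi_i, T_2\varphi_i\rangle_{\mathcal{H}_2},
\]

where $\{\varphi_i\}_{i=1}^{\infty}$ is a CONS of $\mathcal{H}_1$. Thus, the Hilbert-Schmidt norm $\|T\|_{HS}$ satisfies $\|T\|^2_{HS} = \sum_{i=1}^{\infty} \|T\varphi_i\|^2_{\mathcal{H}_2}$.

Obviously, $\|T\| \le \|T\|_{HS} \le \|T\|_{tr}$ holds, if $T$ is trace class or Hilbert-Schmidt. Recall also that if $A$ is trace class (Hilbert-Schmidt) and $B$ is bounded, $AB$ and $BA$ are trace class (Hilbert-Schmidt, resp.), for which $\|BA\|_{tr} \le \|B\|\|A\|_{tr}$ and $\|AB\|_{tr} \le \|B\|\|A\|_{tr}$ ($\|AB\|_{HS} \le \|B\|\|A\|_{HS}$ and $\|BA\|_{HS} \le \|B\|\|A\|_{HS}$). If $A : \mathcal{H}_1 \to \mathcal{H}_2$ and $B : \mathcal{H}_2 \to \mathcal{H}_1$ are Hilbert-Schmidt, the product $AB$ is trace class with $\|AB\|_{tr} \le \|A\|_{HS}\|B\|_{HS}$.

It is known that cross-covariance operators and covariance operators are Hilbert-Schmidt and trace class, respectively, under the assumption (2) [13, 16]. The Hilbert-Schmidt norm of $\Sigma_{YX}$ is given by

\[
\|\Sigma_{YX}\|^2_{HS} = \bigl\|E_{YX}[(k_{\mathcal X}(\cdot, X) - m_X)(k_{\mathcal Y}(\cdot, Y) - m_Y)]\bigr\|^2_{\mathcal{H}_{\mathcal X} \otimes \mathcal{H}_{\mathcal Y}}, \tag{19}
\]

where $\mathcal{H}_{\mathcal X} \otimes \mathcal{H}_{\mathcal Y}$ is the direct product of $\mathcal{H}_{\mathcal X}$ and $\mathcal{H}_{\mathcal Y}$, and the trace norm of $\Sigma_{XX}$ is

\[
\operatorname{Tr}[\Sigma_{XX}] = E_X\bigl[\|k_{\mathcal X}(\cdot, X) - m_X\|^2_{\mathcal{H}_{\mathcal X}}\bigr]. \tag{20}
\]

Lemma 8.

\[
\begin{aligned}
&\bigl|\operatorname{Tr}[\hat\Sigma_{YY|X}^{(n)}] - \operatorname{Tr}[\Sigma_{YY} - \Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1}\Sigma_{XY}]\bigr| \\
&\qquad \le \frac{1}{\varepsilon_n}\Bigl\{\bigl(\|\hat\Sigma_{YX}^{(n)}\|_{HS} + \|\Sigma_{YX}\|_{HS}\bigr)\|\hat\Sigma_{YX}^{(n)} - \Sigma_{YX}\|_{HS} + \|\Sigma_{YY}\|_{tr}\|\hat\Sigma_{XX}^{(n)} - \Sigma_{XX}\|\Bigr\} \\
&\qquad\quad + \bigl|\operatorname{Tr}[\hat\Sigma_{YY}^{(n)} - \Sigma_{YY}]\bigr|.
\end{aligned}
\]

Proof.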
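Noting that the self-adjoint operator $\Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1}\Sigma_{XY}$ is trace class from $\Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1}\Sigma_{XY} \le \Sigma_{YY}$, the left-hand side of the assertion is bounded from above by

\[
\bigl|\operatorname{Tr}[\hat\Sigma_{YY}^{(n)} - \Sigma_{YY}]\bigr| + \bigl|\operatorname{Tr}[\hat\Sigma_{YX}^{(n)}(\hat\Sigma_{XX}^{(n)} + \varepsilon_n I)^{-1}\hat\Sigma_{XY}^{(n)} - \Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1}\Sigma_{XY}]\bigr|.
\]

The second term is upper-bounded by

\[
\begin{aligned}
&\bigl|\operatorname{Tr}[(\hat\Sigma_{YX}^{(n)} - \Sigma_{YX})(\hat\Sigma_{XX}^{(n)} + \varepsilon_n I)^{-1}\hat\Sigma_{XY}^{(n)}]\bigr|
+ \bigl|\operatorname{Tr}[\Sigma_{YX}(\hat\Sigma_{XX}^{(n)} + \varepsilon_n I)^{-1}(\hat\Sigma_{XY}^{(n)} - \Sigma_{XY})]\bigr| \\
&\quad + \bigl|\operatorname{Tr}[\Sigma_{YX}\{(\hat\Sigma_{XX}^{(n)} + \varepsilon_n I)^{-1} - (\Sigma_{XX} + \varepsilon_n I)^{-1}\}\Sigma_{XY}]\bigr| \\
&\le \|(\hat\Sigma_{YX}^{(n)} - \Sigma_{YX})(\hat\Sigma_{XX}^{(n)} + \varepsilon_n I)^{-1}\hat\Sigma_{XY}^{(n)}\|_{tr}
+ \|\Sigma_{YX}(\hat\Sigma_{XX}^{(n)} + \varepsilon_n I)^{-1}(\hat\Sigma_{XY}^{(n)} - \Sigma_{XY})\|_{tr} \\
&\quad + \bigl|\operatorname{Tr}[\Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1/2}\{(\Sigma_{XX} + \varepsilon_n I)^{1/2}(\hat\Sigma_{XX}^{(n)} + \varepsilon_n I)^{-1}(\Sigma_{XX} + \varepsilon_n I)^{1/2} - I\} \\
&\qquad\qquad \times (\Sigma_{XX} + \varepsilon_n I)^{-1/2}\Sigma_{XY}]\bigr| \\
&\le \frac{1}{\varepsilon_n}\|\hat\Sigma_{YX}^{(n)} - \Sigma_{YX}\|_{HS}\|\hat\Sigma_{XY}^{(n)}\|_{HS}
+ \frac{1}{\varepsilon_n}\|\Sigma_{YX}\|_{HS}\|\hat\Sigma_{XY}^{(n)} - \Sigma_{XY}\|_{HS} \\
&\quad + \|(\Sigma_{XX} + \varepsilon_n I)^{1/2}(\hat\Sigma_{XX}^{(n)} + \varepsilon_n I)^{-1}(\Sigma_{XX} + \varepsilon_n I)^{1/2} - I\| \\
&\qquad \times \|(\Sigma_{XX} + \varepsilon_n I)^{-1/2}\Sigma_{XY}\Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1/2}\|_{tr}.
\end{aligned}
\]

In the last line, we use $|\operatorname{Tr}[ABA^*]| \le \|B\|\|A^*A\|_{tr}$ for a Hilbert-Schmidt operator $A$ and a bounded operator $B$. This is confirmed easily by the singular decomposition of $A$.

Since the spectra of $A^*A$ and $AA^*$ are identical, we have

\[
\begin{aligned}
&\|(\Sigma_{XX} + \varepsilon_n I)^{1/2}(\hat\Sigma_{XX}^{(n)} + \varepsilon_n I)^{-1}(\Sigma_{XX} + \varepsilon_n I)^{1/2} - I\| \\
&\quad = \|(\hat\Sigma_{XX}^{(n)} + \varepsilon_n I)^{-1/2}(\Sigma_{XX} + \varepsilon_n I)(\hat\Sigma_{XX}^{(n)} + \varepsilon_n I)^{-1/2} - I\| \\
&\quad \le \|(\hat\Sigma_{XX}^{(n)} + \varepsilon_n I)^{-1/2}(\Sigma_{XX} - \hat\Sigma_{XX}^{(n)})(\hat\Sigma_{XX}^{(n)} + \varepsilon_n I)^{-1/2}\| \\
&\quad \le \frac{1}{\varepsilon_n}\|\hat\Sigma_{XX}^{(n)} - \Sigma_{XX}\|.
\end{aligned}
\]

The bound $\|(\Sigma_{XX} + \varepsilon_n I)^{-1/2}\Sigma_{XX}^{1/2}V_{XY}\| \le 1$ yields

\[
\|(\Sigma_{XX} + \varepsilon_n I)^{-1/2}\Sigma_{XY}\Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1/2}\|_{tr} \le \|\Sigma_{YY}\|_{tr},
\]

which concludes the proof.

Lemma 9. Under the assumption (A-3),

\[
\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \|\hat\Sigma_{XX}^{B(n)} - \Sigma_{XX}^B\|_{HS}, \qquad
\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \|\hat\Sigma_{XY}^{B(n)} - \Sigma_{XY}^B\|_{HS}
\]

and

\[
\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \bigl|\operatorname{Tr}[\hat\Sigma_{YY}^{B(n)} - \Sigma_{YY}^B]\bigr|
\]

are of order $O_p(1/\sqrt{n})$ as $n \to \infty$.

The proof of Lemma 9 is deferred to the Appendix. From Lemmas 8 and 9, the following lemma is obvious.

Lemma 10.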
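If the regularization parameter $(\varepsilon_n)_{n=1}^{\infty}$ satisfies (18), under the assumption (A-3) we have

\[
\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \bigl|\operatorname{Tr}[\hat\Sigma_{YY|X}^{B(n)}] - \operatorname{Tr}[\Sigma_{YY} - \Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon_n I)^{-1}\Sigma_{XY}^B]\bigr| = O_p(\varepsilon_n^{-1} n^{-1/2})
\]

as $n \to \infty$.

In the next four lemmas, we establish the uniform convergence of $L_\varepsilon$ to $L_0$ ($\varepsilon \downarrow 0$), where $L_\varepsilon(B)$ is a function on $\mathbb{S}_d^m(\mathbb{R})$ defined by

\[
L_\varepsilon(B) = \operatorname{Tr}[\Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon I)^{-1}\Sigma_{XY}^B]
\]

for $\varepsilon > 0$ and $L_0(B) = \operatorname{Tr}[\Sigma_{YY}^{1/2}V_{YX}^B V_{XY}^B \Sigma_{YY}^{1/2}]$. We begin by establishing pointwise convergence.

Lemma 11. For arbitrary kernels with (2),

\[
\operatorname{Tr}[\Sigma_{YX}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma_{XY}] \to \operatorname{Tr}[\Sigma_{YY}^{1/2}V_{YX}V_{XY}\Sigma_{YY}^{1/2}] \qquad (\varepsilon \downarrow 0).
\]

Proof. With a CONS $\{\psi_i\}_{i=1}^{\infty}$ for $\mathcal{H}_{\mathcal Y}$, the difference of the right-hand side and the left-hand side can be written as

\[
\sum_{i=1}^{\infty} \bigl\langle \psi_i, \Sigma_{YY}^{1/2}V_{YX}\{I - \Sigma_{XX}^{1/2}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma_{XX}^{1/2}\}V_{XY}\Sigma_{YY}^{1/2}\psi_i\bigr\rangle_{\mathcal{H}_{\mathcal Y}}.
\]

Since each summand is positive and upper bounded by $\langle \psi_i, \Sigma_{YY}^{1/2}V_{YX}V_{XY}\Sigma_{YY}^{1/2}\psi_i\rangle_{\mathcal{H}_{\mathcal Y}}$, and the sum over $i$ is finite, by the dominated convergence theorem it suffices to show

\[
\lim_{\varepsilon \downarrow 0} \bigl\langle \psi, \Sigma_{YY}^{1/2}V_{YX}\{I - \Sigma_{XX}^{1/2}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma_{XX}^{1/2}\}V_{XY}\Sigma_{YY}^{1/2}\psi\bigr\rangle_{\mathcal{H}_{\mathcal Y}} = 0
\]

for each $\psi \in \mathcal{H}_{\mathcal Y}$.

Fix arbitrary $\psi \in \mathcal{H}_{\mathcal Y}$ and $\delta > 0$. From the fact $\mathcal{R}(V_{XY}) \subset \overline{\mathcal{R}(\Sigma_{XX})}$, there exists $h \in \mathcal{H}_{\mathcal X}$ such that $\|V_{XY}\Sigma_{YY}^{1/2}\psi - \Sigma_{XX}h\|_{\mathcal{H}_{\mathcal X}} < \delta$. Using the fact $I - \Sigma_{XX}^{1/2}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma_{XX}^{1/2} = \varepsilon(\Sigma_{XX} + \varepsilon I)^{-1}$, we have

\[
\begin{aligned}
\|\{I - \Sigma_{XX}^{1/2}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma_{XX}^{1/2}\}V_{XY}\Sigma_{YY}^{1/2}\psi\|_{\mathcal{H}_{\mathcal X}}
&\le \|\varepsilon(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma_{XX}h\|_{\mathcal{H}_{\mathcal X}} \\
&\quad + \|\varepsilon(\Sigma_{XX} + \varepsilon I)^{-1}(V_{XY}\Sigma_{YY}^{1/2}\psi - \Sigma_{XX}h)\|_{\mathcal{H}_{\mathcal X}} \\
&\le \varepsilon\|h\|_{\mathcal{H}_{\mathcal X}} + \delta,
\end{aligned}
\]

which is arbitrarily small if $\varepsilon$ is sufficiently small. This completes the proof.

Lemma 12. Suppose $k_d$ is continuous and bounded. Then, for any $\varepsilon > 0$, the function $L_\varepsilon(B)$ is continuous on $\mathbb{S}_d^m(\mathbb{R})$.

Proof. By an argument similar to that in the proof of Lemma 11, it suffices to show the continuity of $B \mapsto \langle \psi, \Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon I)^{-1}\Sigma_{XY}^B\psi\rangle_{\mathcal{H}_{\mathcal Y}}$ for each $\psi \in \mathcal{H}_{\mathcal Y}$.

Let $J_X^B : \mathcal{H}_{\mathcal X}^B \to L^2(P_X)$ and $J_Y : \mathcal{H}_{\mathcal Y} \to L^2(P_Y)$ be inclusions. As seen in Proposition 1, the operators $\Sigma_{YX}^B$ and $\Sigma_{XX}^B$ can be extended to the integral operators $S_{YX}^B$ and $S_{XX}^B$ on $L^2(P_X)$, respectively, so that $J_Y\Sigma_{YX}^B = S_{YX}^B J_X^B$ and $J_X^B\Sigma_{XX}^B = S_{XX}^B J_X^B$. It is not difficult to see also $J_X^B(\Sigma_{XX}^B + \varepsilon I)^{-1} = (S_{XX}^B + \varepsilon I)^{-1}J_X^B$ for $\varepsilon > 0$. These relations yield

\[
\begin{aligned}
\langle \psi, \Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon I)^{-1}\Sigma_{XY}^B\psi\rangle_{\mathcal{H}_{\mathcal Y}}
&= E_{XY}\bigl[\psi(Y)\bigl((S_{XX}^B + \varepsilon I)^{-1}S_{XY}^B\psi\bigr)(X)\bigr] \\
&\quad - E_Y[\psi(Y)]\,E_X\bigl[\bigl((S_{XX}^B + \varepsilon I)^{-1}S_{XY}^B\psi\bigr)(X)\bigr],
\end{aligned}
\]

where $J_Y\psi$ is identified with $\psi$. The assertion is obtained if we prove that the operators $S_{XY}^B$ and $(S_{XX}^B + \varepsilon I)^{-1}$ are continuous with respect to $B$ in operator norm. To see this, let $\tilde X$ be identically and independently distributed with $X$. We have

\[
\begin{aligned}
\|(S_{XY}^B - S_{XY}^{B_0})\psi\|^2_{L^2(P_X)}
&= E_{\tilde X}\bigl[\operatorname{Cov}_{YX}[k_{\mathcal X}^B(X, \tilde X) - k_{\mathcal X}^{B_0}(X, \tilde X), \psi(Y)]^2\bigr] \\
&\le E_{\tilde X}\bigl[\operatorname{Var}_X[k_d(B^T X, B^T \tilde X) - k_d(B_0^T X, B_0^T \tilde X)]\operatorname{Var}_Y[\psi(Y)]\bigr] \\
&\le E_{\tilde X}E_X\bigl[(k_d(B^T X, B^T \tilde X) - k_d(B_0^T X, B_0^T \tilde X))^2\bigr]\|\psi\|^2_{L^2(P_Y)},
\end{aligned}
\]

from which the continuity of $B \mapsto S_{XY}^B$ is obtained by the continuity and boundedness of $k_d$. The continuity of $(S_{XX}^B + \varepsilon I)^{-1}$ is shown by

\[
\|(S_{XX}^B + \varepsilon I)^{-1} - (S_{XX}^{B_0} + \varepsilon I)^{-1}\| = \|(S_{XX}^B + \varepsilon I)^{-1}(S_{XX}^{B_0} - S_{XX}^B)(S_{XX}^{B_0} + \varepsilon I)^{-1}\| \le \frac{1}{\varepsilon^2}\|S_{XX}^{B_0} - S_{XX}^B\|.
\]

To establish the continuity of $L_0(B) = \operatorname{Tr}[\Sigma_{YX}^B(\Sigma_{XX}^B)^{-1}\Sigma_{XY}^B]$, the argument in the proof of Lemma 12 cannot be applied, because $(\Sigma_{XX}^B)^{-1}$ is not bounded in general. The assumptions (A-1) and (A-2) are used for the proof.

Lemma 13. Suppose $k_d$ is continuous and bounded. Under the assumptions (A-1) and (A-2), the function $L_0(B)$ is continuous on $\mathbb{S}_d^m(\mathbb{R})$.

Proof. By the same argument as in the proof of Lemma 11, it suffices to establish the continuity of $B \mapsto \langle \psi, \Sigma_{YY|X}^B\psi\rangle$ for $\psi \in \mathcal{H}_{\mathcal Y}$. From Proposition 2, the proof is completed if the continuity of the map

\[
B \mapsto \inf_{f \in \mathcal{H}_{\mathcal X}^B} \operatorname{Var}_{XY}[g(Y) - f(X)]
\]

is proved for any continuous and bounded function $g$.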
Since $f(x)$ depends only on $B^T x$ for any $f \in \mathcal{H}_{\mathcal X}^B$, under the assumption (A-2), we use the same argument as in the proof of Proposition 3 to obtain

\[
\begin{aligned}
\inf_{f \in \mathcal{H}_{\mathcal X}^B} \operatorname{Var}_{XY}[g(Y) - f(X)]
&= \inf_{f \in \mathcal{H}_{\mathcal X}^B} \operatorname{Var}_X\bigl[E_{Y|BB^T X}[g(Y)\,|\,BB^T X] - f(X)\bigr] + E_X\bigl[\operatorname{Var}_{Y|BB^T X}[g(Y)\,|\,BB^T X]\bigr] \\
&= E_Y[g(Y)^2] - E_X\bigl[E_{Y|B^T X}[g(Y)\,|\,B^T X]^2\bigr],
\end{aligned}
\]

which is a continuous function of $B \in \mathbb{S}_d^m(\mathbb{R})$ from assumption (A-1).

Lemma 14. Suppose that $k_d$ is continuous and bounded, and that $\varepsilon_n$ converges to zero as $n$ goes to infinity. Under the assumptions (A-1) and (A-2), we have

\[
\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \operatorname{Tr}\bigl[\Sigma_{YY|X}^B - \{\Sigma_{YY} - \Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon_n I)^{-1}\Sigma_{XY}^B\}\bigr] \to 0 \qquad (n \to \infty).
\]

Proof. From Lemmas 11, 12 and 13, the continuous function $\operatorname{Tr}[\Sigma_{YY} - \Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon_n I)^{-1}\Sigma_{XY}^B]$ converges to the continuous function $\operatorname{Tr}[\Sigma_{YY|X}^B]$ for every $B \in \mathbb{S}_d^m(\mathbb{R})$. Because this convergence is monotone and $\mathbb{S}_d^m(\mathbb{R})$ is compact, it is necessarily uniform.

The proof of Proposition 7 is now easily obtained.

Proof of Proposition 7. Lemmas 12 and 13 show the continuity of $\operatorname{Tr}[\hat\Sigma_{YY|X}^{B(n)}]$ and $\operatorname{Tr}[\Sigma_{YY|X}^B]$. Lemmas 10 and 14 prove the uniform convergence.

5. Conclusions. This paper has presented KDR, a new method for sufficient dimension reduction in regression. The method is based on a characterization of conditional independence using covariance operators on reproducing kernel Hilbert spaces. This characterization is not restricted to first or second-order conditional moments, but exploits high-order moments in the estimation of the central subspace. The KDR method is widely applicable; in distinction to most of the existing literature on SDR it does not impose strong assumptions on the probability distribution of the covariate vector X. It is also applicable to problems in which the response Y is discrete.

We have developed some asymptotic theory for the estimator, resulting in a proof of consistency of the estimator under weak conditions.
The proof of consistency rests on a result establishing the uniform convergence of the empirical process in a Hilbert space. In particular, we have established the rate $O_p(n^{-1/2})$ for uniform convergence, paralleling the results for ordinary real-valued empirical processes.

We have not yet developed distribution theory for the KDR method, and have left open the important problem of inferring the dimensionality of the central subspace. Our proof techniques do not straightforwardly extend to yield the asymptotic distribution of the KDR estimator, and new techniques may be required. It should be noted, however, that inference of the dimensionality of the central subspace is not necessary for many of the applications of SDR. In particular, SDR is often used in the context of graphical exploration of data, where a data analyst may wish to explore views of varying dimensionality. Also, in high-dimensional prediction problems of the kind studied in statistical machine learning, dimension reduction may be carried out in the context of predictive modeling, in which case cross-validation and related techniques may be used to choose the dimensionality.

Finally, while we have focused our discussion on the central subspace as the object of inference, it is also worth noting that KDR applies even to situations in which a central subspace does not exist. As we have shown, the KDR estimate converges to the subset of projection matrices that satisfy (1); this result holds regardless of the existence of a central subspace. That is, if the intersection of dimension-reduction subspaces is not a dimension-reduction subspace, but the dimensionality chosen for KDR is large enough that subspaces satisfying (1) exist, then KDR will converge to one of those subspaces.

APPENDIX: UNIFORM CONVERGENCE OF CROSS-COVARIANCE OPERATORS

In this Appendix we present a proof of Lemma 9.
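The proof involves the use of random elements in a Hilbert space [3, 30]. Let $\mathcal{H}$ be a Hilbert space equipped with a Borel $\sigma$-field. A random element in the Hilbert space $\mathcal{H}$ is a measurable map $F : \Omega \to \mathcal{H}$ from a measurable space $(\Omega, \mathfrak{S})$. If $\mathcal{H}$ is an RKHS on a measurable set $\mathcal{X}$ with a measurable positive definite kernel $k$, a random variable $X$ in $\mathcal{X}$ defines a random element in $\mathcal{H}$ by $k(\cdot, X)$.

A random element $F$ in a Hilbert space $\mathcal{H}$ is said to have strong order $p$ ($0 < p < \infty$) if $E\|F\|^p$ is finite. For a random element $F$ of strong order one, the expectation of $F$, which is defined as the element $m_F \in \mathcal{H}$ such that $\langle m_F, g\rangle_{\mathcal{H}} = E[\langle F, g\rangle_{\mathcal{H}}]$ for all $g \in \mathcal{H}$, is denoted by $E[F]$. With this notation, the interchange of the expectation and the inner product is justified: $\langle E[F], g\rangle_{\mathcal{H}} = E[\langle F, g\rangle_{\mathcal{H}}]$. Note also that for independent random elements $F$ and $G$ of strong order two, the relation

\[
E[\langle F, G\rangle_{\mathcal{H}}] = \langle E[F], E[G]\rangle_{\mathcal{H}}
\]

holds.

Let $(X, Y)$ be a random vector on $\mathcal{X} \times \mathcal{Y}$ with law $P_{XY}$, and let $\mathcal{H}_{\mathcal X}$ and $\mathcal{H}_{\mathcal Y}$ be the RKHS with positive definite kernels $k_{\mathcal X}$ and $k_{\mathcal Y}$, respectively, which satisfy (2). The random element $k_{\mathcal X}(\cdot, X)$ has strong order two, and $E[k_{\mathcal X}(\cdot, X)]$ equals $m_X$, where $m_X$ is given by (4). The random element $k_{\mathcal X}(\cdot, X)k_{\mathcal Y}(\cdot, Y)$ in the direct product $\mathcal{H}_{\mathcal X} \otimes \mathcal{H}_{\mathcal Y}$ has strong order one. Define the zero mean random elements $F = k_{\mathcal X}(\cdot, X) - E[k_{\mathcal X}(\cdot, X)]$ and $G = k_{\mathcal Y}(\cdot, Y) - E[k_{\mathcal Y}(\cdot, Y)]$.

For an i.i.d. sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ on $\mathcal{X} \times \mathcal{Y}$ with law $P_{XY}$, define random elements $F_i = k_{\mathcal X}(\cdot, X_i) - E[k_{\mathcal X}(\cdot, X)]$ and $G_i = k_{\mathcal Y}(\cdot, Y_i) - E[k_{\mathcal Y}(\cdot, Y)]$. Then, $F, F_1, \ldots, F_n$ and $G, G_1, \ldots, G_n$ are zero mean i.i.d. random elements in $\mathcal{H}_{\mathcal X}$ and $\mathcal{H}_{\mathcal Y}$, respectively. In the following, the notation $\mathcal{F} = \mathcal{H}_{\mathcal X} \otimes \mathcal{H}_{\mathcal Y}$ is used for simplicity.

As shown in the proof of Lemma 4 in [13], we have

\[
\|\hat\Sigma_{YX}^{(n)} - \Sigma_{YX}\|_{HS} = \Biggl\|\frac{1}{n}\sum_{i=1}^n \Biggl(F_i - \frac{1}{n}\sum_{j=1}^n F_j\Biggr)\Biggl(G_i - \frac{1}{n}\sum_{j=1}^n G_j\Biggr) - E[FG]\Biggr\|_{\mathcal{F}},
\]

which provides a bound

\[
\begin{aligned}
\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \|\hat\Sigma_{YX}^{B(n)} - \Sigma_{YX}^B\|_{HS}
&\le \sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \Biggl\|\frac{1}{n}\sum_{i=1}^n \bigl(F_i^B G_i - E[F^B G]\bigr)\Biggr\|_{\mathcal{F}^B} \\
&\quad + \sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \Biggl\|\frac{1}{n}\sum_{j=1}^n F_j^B\Biggr\|_{\mathcal{H}_{\mathcal X}^B} \Biggl\|\frac{1}{n}\sum_{j=1}^n G_j\Biggr\|_{\mathcal{H}_{\mathcal Y}},
\end{aligned} \tag{21}
\]

where $F_i^B$ are defined with the kernel $k_{\mathcal X}^B$, and $\mathcal{F}^B = \mathcal{H}_{\mathcal X}^B \otimes \mathcal{H}_{\mathcal Y}$. Also, (20) implies

\[
\begin{aligned}
\operatorname{Tr}[\hat\Sigma_{XX}^{(n)} - \Sigma_{XX}]
&= \frac{1}{n}\sum_{i=1}^n \Biggl\|F_i - \frac{1}{n}\sum_{j=1}^n F_j\Biggr\|^2_{\mathcal{H}_{\mathcal X}} - E\|F\|^2_{\mathcal{H}_{\mathcal X}} \\
&= \frac{1}{n}\sum_{i=1}^n \|F_i\|^2_{\mathcal{H}_{\mathcal X}} - E\|F\|^2_{\mathcal{H}_{\mathcal X}} - \Biggl\|\frac{1}{n}\sum_{i=1}^n F_i\Biggr\|^2_{\mathcal{H}_{\mathcal X}},
\end{aligned}
\]

from which we have

\[
\begin{aligned}
\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \bigl|\operatorname{Tr}[\hat\Sigma_{XX}^{B(n)} - \Sigma_{XX}^B]\bigr|
&\le \sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \Biggl|\frac{1}{n}\sum_{i=1}^n \|F_i^B\|^2_{\mathcal{H}_{\mathcal X}^B} - E\|F^B\|^2_{\mathcal{H}_{\mathcal X}^B}\Biggr| \\
&\quad + \sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \Biggl\|\frac{1}{n}\sum_{i=1}^n F_i^B\Biggr\|^2_{\mathcal{H}_{\mathcal X}^B}.
\end{aligned} \tag{22}
\]

It follows that Lemma 9 is proved if all the four terms on the right-hand side of (21) and (22) are of order $O_p(1/\sqrt{n})$.

Hereafter, the kernel $k_d$ is assumed to be bounded. We begin by considering the first term on the right-hand side of (22). This is the supremum of a process which consists of real-valued random variables $\|F_i^B\|^2_{\mathcal{H}_{\mathcal X}^B}$. Let $U^B$ be a random element in $\mathcal{H}_d$ defined by

\[
U^B = k_d(\cdot, B^T X) - E[k_d(\cdot, B^T X)]
\]

and let $C > 0$ be a constant such that $|k_d(z, z)| \le C^2$ for all $z \in \mathcal{Z}$. From $\|U^B\|_{\mathcal{H}_d} \le 2C$, we have for $B, \tilde B \in \mathbb{S}_d^m(\mathbb{R})$

\[
\begin{aligned}
\bigl|\|F^B\|^2_{\mathcal{H}_{\mathcal X}^B} - \|F^{\tilde B}\|^2_{\mathcal{H}_{\mathcal X}^{\tilde B}}\bigr|
&= \bigl|\langle U^B - U^{\tilde B}, U^B + U^{\tilde B}\rangle_{\mathcal{H}_d}\bigr| \\
&\le \|U^B - U^{\tilde B}\|_{\mathcal{H}_d}\|U^B + U^{\tilde B}\|_{\mathcal{H}_d} \\
&\le 4C\|U^B - U^{\tilde B}\|_{\mathcal{H}_d}.
\end{aligned}
\]

The above inequality, combined with the bound

\[
\|U^B - U^{\tilde B}\|_{\mathcal{H}_d} \le 2\phi(x)D(B, \tilde B) \tag{23}
\]

obtained from assumption (A-3), provides a Lipschitz condition $\bigl|\|F^B\|^2_{\mathcal{H}_{\mathcal X}^B} - \|F^{\tilde B}\|^2_{\mathcal{H}_{\mathcal X}^{\tilde B}}\bigr| \le 8C\phi(x)D(B, \tilde B)$, which works as a sufficient condition for the uniform central limit theorem [31], Example 19.7. This yields

\[
\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \Biggl|\frac{1}{n}\sum_{i=1}^n \|F_i^B\|^2_{\mathcal{H}_{\mathcal X}^B} - E\|F^B\|^2_{\mathcal{H}_{\mathcal X}^B}\Biggr| = O_p(1/\sqrt{n}).
\]

Our approach to the other three terms is based on a treatment of empirical processes in a Hilbert space. For $B \in \mathbb{S}_d^m(\mathbb{R})$, let $U_i^B = k_d(\cdot, B^T X_i) - E[k_d(\cdot, B^T X)]$ be a random element in $\mathcal{H}_d$.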
The relation
\[
\langle k^B(\cdot, x), k^B(\cdot, \tilde{x})\rangle_{\mathcal{H}_X^B} = k_d(B^T x, B^T \tilde{x}) = \langle k_d(\cdot, B^T x), k_d(\cdot, B^T \tilde{x})\rangle_{\mathcal{H}_d}
\]
implies
\[
\Biggl\|\frac{1}{n}\sum_{j=1}^{n}F_j^B\Biggr\|_{\mathcal{H}_X^B} = \Biggl\|\frac{1}{n}\sum_{j=1}^{n}U_j^B\Biggr\|_{\mathcal{H}_d},
\tag{24}
\]
\[
\Biggl\|\frac{1}{n}\sum_{j=1}^{n}F_j^B G_j - E[F^B G]\Biggr\|_{\mathcal{H}_X^B\otimes\mathcal{H}_Y} = \Biggl\|\frac{1}{n}\sum_{j=1}^{n}U_j^B G_j - E[U^B G]\Biggr\|_{\mathcal{H}_d\otimes\mathcal{H}_Y}.
\tag{25}
\]
Note also that the assumption (A-3) gives
\[
\|U^B G - U^{\tilde{B}} G\|_{\mathcal{H}_d\otimes\mathcal{H}_Y} \le 2\sqrt{k_Y(y, y)}\,\phi(x)D(B, \tilde{B}).
\tag{26}
\]
From (23)–(26), the proof of Lemma 9 is completed from the following proposition:

Proposition 15. Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ be a measurable space, let $\Theta$ be a compact metric space with distance $D$, and let $\mathcal{H}$ be a Hilbert space. Suppose that $X, X_1, \ldots, X_n$ are i.i.d. random variables on $\mathcal{X}$, and suppose $F : \mathcal{X} \times \Theta \to \mathcal{H}$ is a Borel measurable map. If $\sup_{\theta\in\Theta}\|F(x; \theta)\|_{\mathcal{H}} < \infty$ for all $x \in \mathcal{X}$ and there exists a measurable function $\phi : \mathcal{X} \to \mathbb{R}$ such that $E[\phi(X)^2] < \infty$ and
\[
\|F(x; \theta_1) - F(x; \theta_2)\|_{\mathcal{H}} \le \phi(x)D(\theta_1, \theta_2) \qquad (\forall\, \theta_1, \theta_2 \in \Theta),
\tag{27}
\]
then we have
\[
\sup_{\theta\in\Theta}\Biggl\|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bigl(F(X_i; \theta) - E[F(X; \theta)]\bigr)\Biggr\|_{\mathcal{H}} = O_p(1) \qquad (n \to \infty).
\]

The proof of Proposition 15 is similar to that for a real-valued random process, and is divided into several lemmas. I.i.d. random variables $\sigma_1, \ldots, \sigma_n$ taking values in $\{+1, -1\}$ with equal probability are called Rademacher variables. The following concentration inequality is known for a Rademacher average in a Banach space:

Proposition 16. Let $a_1, \ldots, a_n$ be elements in a Banach space, and let $\sigma_1, \ldots, \sigma_n$ be Rademacher variables. Then, for every $t > 0$
\[
\Pr\Biggl(\Biggl\|\sum_{i=1}^{n}\sigma_i a_i\Biggr\| > t\Biggr) \le 2\exp\Biggl(-\frac{t^2}{32\sum_{i=1}^{n}\|a_i\|^2}\Biggr).
\]

Proof. See [21], Theorem 4.7 and the remark thereafter.

With Proposition 16, the following exponential inequality is obtained with a slight modification of the standard symmetrization argument for empirical processes.

Lemma 17. Let $X, X_1, \ldots, X_n$ and $\mathcal{H}$ be as in Proposition 15, and denote $(X_1, \ldots, X_n)$ by $\mathbf{X}_n$. Let $F : \mathcal{X} \to \mathcal{H}$ be a Borel measurable map with $E\|F(X)\|^2_{\mathcal{H}} < \infty$. For a positive number $M$ such that $E\|F(X)\|^2_{\mathcal{H}} < M$, define an event $A_n$ by $\frac{1}{n}\sum_{i=1}^{n}\|F(X_i)\|^2 \le M$.
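Then, for every $t > 0$ and sufficiently large $n$,
\[
\Pr\Biggl(\Biggl\{\mathbf{X}_n \Biggm| \Biggl\|\frac{1}{n}\sum_{i=1}^{n}\bigl(F(X_i) - E[F(X)]\bigr)\Biggr\|_{\mathcal{H}} > t\Biggr\} \cap A_n\Biggr) \le 8\exp\Bigl(-\frac{nt^2}{1024M}\Bigr).
\]

Proof. First, note that for any sufficiently large $n$ we have $\Pr(A_n) \ge \frac{3}{4}$ and $\Pr\bigl(\bigl\|\frac{1}{n}\sum_{i=1}^{n}(F(X_i) - E[F(X)])\bigr\| \le \frac{t}{2}\bigr) \ge \frac{3}{4}$. We consider only such $n$ in the following. Let $\tilde{\mathbf{X}}_n$ be an independent copy of $\mathbf{X}_n$, and let $\tilde{A}_n = \{\tilde{\mathbf{X}}_n \mid \frac{1}{n}\sum_{i=1}^{n}\|F(\tilde{X}_i)\|^2 \le M\}$. The obvious inequality
\[
\begin{aligned}
&\Pr\Biggl(\Biggl\{\mathbf{X}_n \Biggm| \Biggl\|\frac{1}{n}\sum_{i=1}^{n}\bigl(F(X_i) - E[F(X)]\bigr)\Biggr\|_{\mathcal{H}} > t\Biggr\} \cap A_n\Biggr) \\
&\qquad\times \Pr\Biggl(\Biggl\{\tilde{\mathbf{X}}_n \Biggm| \Biggl\|\frac{1}{n}\sum_{i=1}^{n}\bigl(F(\tilde{X}_i) - E[F(X)]\bigr)\Biggr\|_{\mathcal{H}} \le \frac{t}{2}\Biggr\} \cap \tilde{A}_n\Biggr) \\
&\quad\le \Pr\Biggl(\Biggl\{(\mathbf{X}_n, \tilde{\mathbf{X}}_n) \Biggm| \Biggl\|\frac{1}{n}\sum_{i=1}^{n}\bigl(F(X_i) - F(\tilde{X}_i)\bigr)\Biggr\|_{\mathcal{H}} > \frac{t}{2}\Biggr\} \cap A_n \cap \tilde{A}_n\Biggr)
\end{aligned}
\]
and the fact that $B_n := \{(\mathbf{X}_n, \tilde{\mathbf{X}}_n) \mid \frac{1}{2n}\sum_{i=1}^{n}(\|F(X_i)\|^2 + \|F(\tilde{X}_i)\|^2) \le M\}$ includes $A_n \cap \tilde{A}_n$ give a symmetrized bound
\[
\begin{aligned}
&\Pr\Biggl(\Biggl\{\mathbf{X}_n \Biggm| \Biggl\|\frac{1}{n}\sum_{i=1}^{n}\bigl(F(X_i) - E[F(X)]\bigr)\Biggr\|_{\mathcal{H}} > t\Biggr\} \cap A_n\Biggr) \\
&\quad\le 2\Pr\Biggl(\Biggl\{(\mathbf{X}_n, \tilde{\mathbf{X}}_n) \Biggm| \Biggl\|\frac{1}{n}\sum_{i=1}^{n}\bigl(F(X_i) - F(\tilde{X}_i)\bigr)\Biggr\|_{\mathcal{H}} > \frac{t}{2}\Biggr\} \cap B_n\Biggr).
\end{aligned}
\]
Introducing Rademacher variables $\sigma_1, \ldots, \sigma_n$, the right-hand side is equal to
\[
2\Pr\Biggl(\Biggl\{(\mathbf{X}_n, \tilde{\mathbf{X}}_n, \{\sigma_i\}) \Biggm| \Biggl\|\frac{1}{n}\sum_{i=1}^{n}\sigma_i\bigl(F(X_i) - F(\tilde{X}_i)\bigr)\Biggr\|_{\mathcal{H}} > \frac{t}{2}\Biggr\} \cap B_n\Biggr),
\]
which is upper-bounded by
\[
\begin{aligned}
&4\Pr\Biggl(\Biggl\|\frac{1}{n}\sum_{i=1}^{n}\sigma_i F(X_i)\Biggr\|_{\mathcal{H}} > \frac{t}{4} \ \text{and}\ \frac{1}{2n}\sum_{i=1}^{n}\|F(X_i)\|^2_{\mathcal{H}} \le M\Biggr) \\
&\quad= 4E_{\mathbf{X}_n}\Biggl[\Pr\Biggl(\Biggl\|\frac{1}{n}\sum_{i=1}^{n}\sigma_i F(X_i)\Biggr\|_{\mathcal{H}} > \frac{t}{4} \Biggm| \mathbf{X}_n\Biggr)\mathbf{1}_{\{\mathbf{X}_n \in C_n\}}\Biggr],
\end{aligned}
\]
where $C_n = \{\mathbf{X}_n \mid \frac{1}{n}\sum_{i=1}^{n}\|F(X_i)\|^2_{\mathcal{H}} \le 2M\}$. From Proposition 16, the last line is upper-bounded by
\[
4\exp\Biggl(-\frac{(nt/4)^2}{32\sum_{i=1}^{n}\|F(X_i)\|^2}\Biggr) \le 4\exp\Bigl(-\frac{nt^2}{1024M}\Bigr).
\]

Let $\Theta$ be a set with semimetric $d$. For any $\delta > 0$, the covering number $N(\delta, d, \Theta)$ is the smallest $m \in \mathbb{N}$ for which there exist $m$ points $\theta_1, \ldots, \theta_m$ in $\Theta$ such that $\min_{1\le i\le m} d(\theta, \theta_i) \le \delta$ holds for any $\theta \in \Theta$. We write $N(\delta)$ for $N(\delta, d, \Theta)$ if there is no confusion. For $\delta > 0$, the covering integral $J(\delta)$ for $\Theta$ is defined by
\[
J(\delta) = \int_0^{\delta}\bigl(8\log(N(u)^2/u)\bigr)^{1/2}\,du.
\]
The chaining lemma [25], which plays a crucial role in the uniform central limit theorem, is readily extendable to a random process in a Banach space.

Lemma 18 (Chaining lemma).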
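Let $\Theta$ be a set with semimetric $d$, and let $\{Z(\theta) \mid \theta \in \Theta\}$ be a family of random elements in a Banach space. Suppose $\Theta$ has a finite covering integral $J(\delta)$ for $0 < \delta < 1$ and suppose there exists a positive constant $R > 0$ such that for all $\theta, \eta \in \Theta$ and $t > 0$ the inequality
\[
\Pr\bigl(\|Z(\theta) - Z(\eta)\| > t\,d(\theta, \eta)\bigr) \le 8\exp\Bigl(-\frac{1}{2R}t^2\Bigr)
\]
holds. Then, there exists a countable subset $\Theta^*$ of $\Theta$ such that for any $0 < \varepsilon < 1$
\[
\Pr\Bigl(\sup_{\theta,\eta\in\Theta^*,\, d(\theta,\eta)\le\varepsilon}\|Z(\theta) - Z(\eta)\| > 26RJ(d(\theta, \eta))\Bigr) \le 2\varepsilon
\]
holds. If $Z(\theta)$ has continuous sample paths, then $\Theta^*$ can be replaced by $\Theta$.

Proof. The proof of the chaining lemma for a real-valued random process uses no special property of the real numbers beyond the properties of the norm (absolute value) of $Z(\theta)$, so it applies directly to a process in a Banach space. See [25], Section VII.2.

Proof of Proposition 15. Note that (27) means
\[
\Biggl\|\frac{1}{n}\sum_{i=1}^{n}\bigl(F(X_i; \theta_1) - F(X_i; \theta_2)\bigr)\Biggr\|^2_{\mathcal{H}} \le D(\theta_1, \theta_2)^2\,\frac{1}{n}\sum_{i=1}^{n}\phi(X_i)^2.
\]
Let $M > 0$ be a constant such that $E[\phi(X)^2] < M$, and let
\[
A_n = \Biggl\{\mathbf{X}_n \Biggm| \Biggl\|\frac{1}{n}\sum_{i=1}^{n}\bigl(F(X_i; \theta_1) - F(X_i; \theta_2)\bigr)\Biggr\|^2_{\mathcal{H}} \le M D(\theta_1, \theta_2)^2 \ \text{for all } \theta_1, \theta_2 \in \Theta\Biggr\}.
\]
Since $A_n$ contains the event $\frac{1}{n}\sum_{i=1}^{n}\phi(X_i)^2 \le M$, the probability of the complement of $A_n$ converges to zero as $n \to \infty$ by the law of large numbers. It therefore suffices to show that there exists $\delta > 0$ such that the probability
\[
P_n = \Pr\Biggl(A_n \cap \Biggl\{\mathbf{X}_n \Biggm| \sup_{\theta\in\Theta}\Biggl\|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bigl(F(X_i; \theta) - E[F(X; \theta)]\bigr)\Biggr\|_{\mathcal{H}} > \delta\Biggr\}\Biggr)
\]
satisfies $\limsup_{n\to\infty} P_n = 0$.

With the notation $\tilde{F}_\theta(x) = F(x; \theta) - E[F(X; \theta)]$, from Lemma 17 we can derive
\[
\Pr\Biggl(A_n \cap \Biggl\{\mathbf{X}_n \Biggm| \Biggl\|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bigl(\tilde{F}_{\theta_1}(X_i) - \tilde{F}_{\theta_2}(X_i)\bigr)\Biggr\|_{\mathcal{H}} > t\Biggr\}\Biggr) \le 8\exp\Bigl(-\frac{t^2}{512\cdot 2M D(\theta_1, \theta_2)^2}\Bigr)
\]
for any $t > 0$ and sufficiently large $n$. Because the covering integral $J(\delta)$ with respect to $D$ is finite by the compactness of $\Theta$, and the sample path $\Theta \ni \theta \mapsto \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\tilde{F}_\theta(X_i) \in \mathcal{H}$ is continuous, the chaining lemma implies that for any $0 < \varepsilon < 1$
\[
\Pr\Biggl(A_n \cap \Biggl\{\mathbf{X}_n \Biggm| \sup_{\theta_1,\theta_2\in\Theta,\, D(\theta_1,\theta_2)\le\varepsilon}\Biggl\|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bigl(\tilde{F}_{\theta_1}(X_i) - \tilde{F}_{\theta_2}(X_i)\bigr)\Biggr\|_{\mathcal{H}} > 26\cdot 512M\cdot J(\varepsilon)\Biggr\}\Biggr) \le 2\varepsilon.
\]
Take an arbitrary $\varepsilon \in (0, 1)$.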
We can find a finite partition $\Theta = \bigcup_{a=1}^{\nu(\varepsilon)}\Theta_a$ ($\nu(\varepsilon) \in \mathbb{N}$) so that any two points in each $\Theta_a$ are within the distance $\varepsilon$. Let $\theta_a$ be an arbitrary point in $\Theta_a$. Then the probability $P_n$ is bounded by
\[
\begin{aligned}
P_n \le{}& \Pr\Biggl(\max_{1\le a\le\nu(\varepsilon)}\Biggl\|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\tilde{F}_{\theta_a}(X_i)\Biggr\|_{\mathcal{H}} > \frac{\delta}{2}\Biggr) \\
&+ \Pr\Biggl(A_n \cap \Biggl\{\mathbf{X}_n \Biggm| \sup_{\theta,\eta\in\Theta,\, D(\theta,\eta)\le\varepsilon}\Biggl\|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bigl(\tilde{F}_\theta(X_i) - \tilde{F}_\eta(X_i)\bigr)\Biggr\|_{\mathcal{H}} > \frac{\delta}{2}\Biggr\}\Biggr).
\end{aligned}
\tag{28}
\]
From Chebyshev's inequality the first term is upper-bounded by
\[
\nu(\varepsilon)\Pr\Biggl(\Biggl\|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\tilde{F}_{\theta_a}(X_i)\Biggr\|_{\mathcal{H}} > \frac{\delta}{2}\Biggr) \le \frac{4\nu(\varepsilon)E\|\tilde{F}_{\theta_a}(X)\|^2_{\mathcal{H}}}{\delta^2}.
\]
If we take sufficiently large $\delta$ so that $26\cdot 512MJ(\varepsilon) < \delta/2$ and $4\nu(\varepsilon)E\|\tilde{F}_{\theta_a}(X)\|^2_{\mathcal{H}}/\delta^2 < \varepsilon$, the second term of (28) is at most $2\varepsilon$ by the chaining bound above, so the right-hand side of (28) is bounded by $3\varepsilon$, which completes the proof.

Acknowledgments. The authors thank the Editor and anonymous referees for their helpful comments. The authors also thank Dr. Yoichi Nishiyama for his helpful comments on the uniform convergence of empirical processes.

REFERENCES

[1] Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 337–404. MR0051437
[2] Bach, F. R. and Jordan, M. I. (2002). Kernel independent component analysis. J. Mach. Learn. Res. 3 1–48. MR1966051
[3] Baker, C. R. (1973). Joint measures and cross-covariance operators. Trans. Amer. Math. Soc. 186 273–289. MR0336795
[4] Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. J. Amer. Statist. Assoc. 80 580–598. MR0803258
[5] Chiaromonte, F. and Cook, R. D. (2002). Sufficient dimension reduction and graphics in regression. Ann. Inst. Statist. Math. 54 768–795. MR1954046
[6] Cook, R. D. (1998). Regression Graphics. Wiley, New York. MR1645673
[7] Cook, R. D. and Lee, H. (1999). Dimension reduction in regression with a binary response. J. Amer. Statist. Assoc. 94 1187–1200. MR1731482
[8] Cook, R. D. and Li, B. (2002). Dimension reduction for conditional mean in regression. Ann. Statist. 30 455–474. MR1902895
[9] Cook, R. D. and Weisberg, S. (1991). Discussion of Li. J. Amer. Statist. Assoc.
86 328–332.
[10] Cook, R. D. and Yin, X. (2001). Dimension reduction and visualization in discriminant analysis (with discussion). Aust. N. Z. J. Stat. 43 147–199. MR1839361
[11] Flury, B. and Riedwyl, H. (1988). Multivariate Statistics: A Practical Approach. Chapman and Hall, London.
[12] Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76 817–823. MR0650892
[13] Fukumizu, K., Bach, F. R. and Gretton, A. (2007). Statistical consistency of kernel canonical correlation analysis. J. Mach. Learn. Res. 8 361–383. MR2320675
[14] Fukumizu, K., Bach, F. R. and Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res. 5 73–99. MR2247974
[15] Fukumizu, K., Gretton, A., Sun, X. and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20 (J. Platt, D. Koller, Y. Singer and S. Roweis, eds.) 489–496. MIT Press, Cambridge, MA.
[16] Gretton, A., Bousquet, O., Smola, A. J. and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert–Schmidt norms. In 16th International Conference on Algorithmic Learning Theory (S. Jain, H. U. Simon and E. Tomita, eds.) 63–77. Springer, Berlin. MR2255909
[17] Groetsch, C. W. (1984). The Theory of Tikhonov Regularization for Fredholm Equations of the First Kind. Pitman, Boston, MA.
[18] Hristache, M., Juditsky, A., Polzehl, J. and Spokoiny, V. (2001). Structure adaptive approach for dimension reduction. Ann. Statist. 29 1537–1566. MR1891738
[19] Kobayashi, S. and Nomizu, K. (1963). Foundations of Differential Geometry, Vol. 1. Wiley, New York.
[20] Lax, P. D. (2002). Functional Analysis. Wiley, New York. MR1892228
[21] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces. Springer, Berlin. MR1102015
[22] Li, B., Zha, H. and Chiaromonte, F. (2005). Contour regression: A general approach to dimension reduction. Ann. Statist. 33 1580–1616. MR2166556
[23] Li, K.-C. (1991). Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc. 86 316–342. MR1137117
[24] Li, K.-C. (1992). On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. J. Amer. Statist. Assoc. 87 1025–1039. MR1209564
[25] Pollard, D. (1984). Convergence of Stochastic Processes. Springer, New York. MR0762984
[26] Reed, M. and Simon, B. (1980). Functional Analysis. Academic Press, New York. MR0751959
[27] Samarov, A. M. (1993). Exploring regression structure using nonparametric functional estimation. J. Amer. Statist. Assoc. 88 836–847. MR1242934
[28] Sriperumbudur, B., Gretton, A., Fukumizu, K., Lanckriet, G. and Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008) (R. A. Servedio and T. Zhang, eds.) 111–122. Omnipress, Madison, WI.
[29] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
[30] Vakhania, N. N., Tarieladze, V. I. and Chobanyan, S. A. (1987). Probability Distributions on Banach Spaces. Reidel, Dordrecht. MR1435288
[31] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Univ. Press, Cambridge. MR1652247
[32] Wahba, G. (1990). Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics 59. SIAM, Philadelphia, PA. MR1045442
[33] Xia, Y., Tong, H., Li, W. and Zhu, L.-X. (2002). An adaptive estimation of dimension reduction space. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 363–410. MR1924297
[34] Yin, X. and Bura, E. (2006). Moment-based dimension reduction for multivariate response regression. J. Statist. Plann. Inference 136 3675–3688. MR2256281
[35] Yin, X. and Cook, R. D. (2005). Direction estimation in single-index regressions. Biometrika 92 371–384. MR2201365
[36] Zhu, Y. and Zeng, P. (2006). Fourier methods for estimating the central subspace and the central mean subspace in regression. J. Amer. Statist. Assoc. 101 1638–1651. MR2279485

K. Fukumizu
Institute of Statistical Mathematics
4-6-7 Minami-Azabu
Minato-ku, Tokyo 106-8569
Japan
E-mail: fukumizu@ism.ac.jp

F. R. Bach
INRIA—WILLOW Project-Team
Laboratoire d'Informatique de l'Ecole Normale Supérieure
CNRS/ENS/INRIA UMR 8548
45, rue d'Ulm
75230 Paris
France
E-mail: francis.bach@mines.org

M. I. Jordan
Department of Statistics
Department of Computer Science and Electrical Engineering
University of California
Berkeley, California 94720
USA
E-mail: jordan@stat.berkeley.edu