Streaming Kernel Principal Component Analysis

Mina Ghashami, School of Computing, University of Utah, Salt Lake City, UT, ghashami@cs.utah.edu
Daniel Perry, School of Computing, University of Utah, Salt Lake City, UT, dperry@cs.utah.edu
Jeff M. Phillips, School of Computing, University of Utah, Salt Lake City, UT, jeffp@cs.utah.edu

September 14, 2015

Abstract

Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture non-linear structures within large data sets, and is a central tool in data analysis and learning. To allow for non-linear relations, typically a full $n \times n$ kernel matrix is constructed over $n$ data points, but this requires too much space and time for large values of $n$. Techniques such as the Nyström method and random feature maps can help towards this goal, but they either do not explicitly maintain the basis vectors in a stream or still use space linear in $n$. We propose a new approach for streaming KPCA which maintains a small set of basis elements in a stream, requiring space only logarithmic in $n$, and also improves the dependence on the error parameter. Our technique combines random feature maps with recent advances in matrix sketching; it has guaranteed spectral norm error bounds with respect to the original kernel matrix, and it compares favorably in practice to state-of-the-art approaches.

1 Introduction

Principal component analysis (PCA) is a well-known technique for dimensionality reduction, and has many applications including visualization, pattern recognition, and data compression [9]. Given a set of centered $d$-dimensional (training) data points $A = [a_1; \ldots; a_n] \in \mathbb{R}^{n \times d}$, PCA diagonalizes the covariance matrix $C = \frac{1}{n} A^T A$ by solving the eigenvalue equation $Cv = \lambda v$. However, when the data points lie on a highly nonlinear space, PCA fails to concisely capture the structure of data.
To overcome this, several nonlinear extensions of PCA have been proposed, in particular Kernel Principal Component Analysis (KPCA) [19]. The basic idea of KPCA is to implicitly map the data into a nonlinear feature space of high (or often infinite) dimension and perform PCA in that space [19]. The nonlinear map is often denoted as $\phi : \mathbb{R}^d \to \mathcal{H}$ where $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS). While direct computation of PCA in RKHS is infeasible, we can invoke the so-called kernel trick, which exploits the fact that PCA interacts with data only through pair-wise inner products. That is, $\langle \phi(x), \phi(y) \rangle_{\mathcal{H}} = K(x, y)$ for all $x, y \in \mathbb{R}^d$ for a kernel function $K$; we represent this as the $n \times n$ gram matrix $G$. However, KPCA suffers from high space and computational complexity in storing the entire kernel (gram) matrix $G \in \mathbb{R}^{n \times n}$ and in computing its decomposition in the training phase. Then in the testing phase it spends $O(nd)$ time to evaluate the kernel function for any arbitrary test vector with respect to all training examples. Although one can use low rank decomposition approaches [4, 6, 13, 18] to reduce the computational cost to some extent, KPCA still needs to compute and store the kernel matrix.

There have been two main approaches towards resolving this space issue. First, the Nyström approach [22] uses a sample of the data points to construct a much smaller gram matrix. Second, feature maps [17] provide an approximate but explicit embedding of the RKHS into Euclidean space. As we will describe, both approaches can be made to operate in a stream, approximating the KPCA result in less than $n^2$ time and space. Once these approximations are formed they reveal a $D \ll n$ dimensional space, and typically a $k$-dimensional subspace found through linear PCA in $\mathbb{R}^D$, which captures most of the data (e.g., a low rank-$k$ approximation).
There are two main purposes of these $D$- and $k$-dimensional subspaces; they start with mapping a data point $x \in \mathbb{R}^d$ into the $D$-dimensional space, and then often onto the $k$-dimensional subspace. If $x$ is one of the training data points, then the $k$-dimensional representation can be used as a concise "loadings" vector. It can be used in various down-stream training and learning tasks which, in this $k$-dimensional space, can assume linear relations (e.g., linear separators, clustering under Euclidean distance) since the non-linearity will have already been represented through the mapping to this space. If $x$ is not in the training set, and the training set represents some underlying distribution, then we can assess the "fit" of $x$ to this distribution by considering the residual of its representation in the $D$-dimensional space when projected to the $k$-dimensional space. We refer to this procedure of mapping a single point $x$ to the $D$-dimensional space as the TEST TIME.

The value of $k$ needed to get a good fit depends on the choice of kernel and its fit to the data; but $D$ depends on the technique. For instance, in (regular) KPCA $D = n$, in Nyström $D = O(1/\varepsilon^2)$, and using random feature maps (with [17]) $D = O((1/\varepsilon^2) \log n)$, where $\varepsilon \in (0, 1)$ is the error parameter. We propose a new streaming approach, named SKPCA, that will only require $D = O(1/\varepsilon)$. When we evaluate our approach and alternatives, we will focus also on SPACE (the cost of storing this map, and space needed to construct it), and TRAIN TIME (the time needed to construct the map to the $D$-dimensional space).

Background and Notation. We indicate matrix $A$ is $n \times d$ dimensional as $A \in \mathbb{R}^{n \times d}$. Matrices $A$ and $Z$ will be indexed by their row vectors $A = [a_1; a_2; \ldots; a_n]$ while other matrices $V, U, W, \ldots$ will be indexed by column vectors $V = [v_1, v_2, \ldots, v_d]$. We use $I_n$ for the $n$-dimensional identity matrix and $0^{n \times d}$ as the full zero matrix of dimension $n \times d$. The Frobenius norm of a matrix $A$ is $\|A\|_F = \sqrt{\sum_{i=1}^{n} \|a_i\|^2}$ and the spectral norm is $\|A\|_2 = \sup_{x \in \mathbb{R}^d} \frac{\|Ax\|}{\|x\|}$. We denote the transpose of a matrix as $A^T$. The singular value decomposition of matrix $A \in \mathbb{R}^{n \times d}$ is denoted by $[U, S, V] = \mathrm{svd}(A)$. If $n \ge d$ it guarantees that $A = U S V^T$, $U^T U = I_n$, $V^T V = I_d$, $U \in \mathbb{R}^{n \times n}$, $V \in \mathbb{R}^{d \times d}$, and $S = \mathrm{diag}(s_1, s_2, \ldots, s_d) \in \mathbb{R}^{n \times d}$ is a diagonal matrix with $s_1 \ge s_2 \ge \ldots \ge s_d \ge 0$. Let $U_k$ and $V_k$ be matrices containing the first $k$ columns of $U$ and $V$, respectively, and $S_k = \mathrm{diag}(s_1, s_2, \ldots, s_k) \in \mathbb{R}^{k \times k}$. The matrix $A_k = U_k S_k V_k^T$ is the best rank $k$ approximation of $A$ in the sense that $A_k = \arg\min_{C : \mathrm{rank}(C) \le k} \|A - C\|_{2,F}$. We denote by $\pi_B(A)$ the projection of rows of $A$ on the span of the rows of $B$. In other words, $\pi_B(A) = A B^{\dagger} B$ where $(\cdot)^{\dagger}$ indicates taking the Moore-Penrose pseudoinverse. Finally, the expected value of a matrix is defined as the matrix of expected values.

1.1 Related Work

In this section, we review the works most related to streaming principal component analysis and kernel PCA.

Matrix Sketching. Among many recent advancements in matrix sketching [15, 23], we focus on those that compress an $n \times d$ matrix $A$ into an $\ell \times d$ matrix $B$. There are several classes of algorithms based on sampling rows (or columns) [2, 4] (very related to Nyström approaches [5]), randomly projecting [18] or hashing [3] which require that $\ell \approx c/\varepsilon^2$ to achieve $\varepsilon$ error. The constant $c$ depends on the algorithm, specific type of approximation, and whether it is a "for each" or "for all" approximation. A recent and different approach, Frequent Directions (FD) [13], uses only $\ell = 2/\varepsilon$, achieves the error bound $\|A^T A - B^T B\|_2 \le \varepsilon \|A\|_F^2$, and runs in time $O(nd/\varepsilon)$. The algorithm basically processes one row of $A$ at a time, and maintains in $B$ an approximate singular value decomposition of part of $A$. Each new row of $A$ is appended to $B$, the singular value decomposition is updated, but the size is reduced back to only $\ell - 1$ rows in $B$.
This rank reduction of $B$ is achieved by shrinking all squared singular values by the smallest squared singular value. To obtain the stated runtime, this needs to be run in batch, shrinking to rank $\ell/2$ (subtracting the $(\ell/2 + 1)$th squared singular value), once every $\ell/2$ rows of $A$. We use a modified version of this algorithm in our proposed approach.

Incremental Kernel PCA. We first mention some techniques that do not achieve space-error trade-off guarantees. This group of methods updates or augments the eigenspace of KPCA without storing all training data. [11] proposed an algorithm based on incremental PCA [7], in which they maintain a set INDP of linearly independent training data points in memory and compute the top $d$ eigenvectors such that they preserve a $\theta$-fraction (for a threshold $\theta \in (0, 1)$) of the total energy of the eigenspace. Two major drawbacks of this method are: 1) The set INDP can grow large and unpredictably, perhaps exceeding the capacity of the memory. 2) The update of the eigenspace is done for every single datapoint, which can lead to high computation cost. [20] and [21] partially address the second issue by allowing some batch processing.

Nyström-Based Methods for Kernel PCA. Another group of methods [5, 22], known as Nyström-based methods, approximate the kernel (gram) matrix $G$ with a low-rank matrix $\bar{G}$, by sampling columns of $G$. The original version [22] samples $c$ columns with replacement as $C$ and estimates $\bar{G} = C W^{-1} C^T$, where $W$ is the intersection of the sampled columns and rows; it took $O(nc^2)$ time and was not streaming. This was modified [5] to use sampling with replacement and approximating $G$ as $\bar{G}_k = C W_k^{\dagger} C^T$. They proved that if sampling probabilities are of the form $p_i = G_{ii}^2 / \sum_{i=1}^{n} G_{ii}^2$, then with constant probability they can obtain a Frobenius error bound $\|G - \bar{G}_k\|_F \le \|G - G_k\|_F + \varepsilon \sum_{i=1}^{n} G_{ii}^2$ with $c = \Omega(k/\varepsilon^4)$, and a spectral error bound $\|G - \bar{G}_k\|_2 \le \|G - G_k\|_2 + \varepsilon \sum_{i=1}^{n} G_{ii}^2$ with $c = \Omega(1/\varepsilon^2)$.
Note that by setting $k = n$ (so $G_k = G$) in the second bound, in the setting where $G_{ii} = 1$ (a scenario we will focus on), we obtain the bound $\|G - \bar{G}\|_2 \le \varepsilon n$. We observe that one can construct $W$ in a stream over the data matrix using $c$ independent reservoir samplers to select $c$ data points using $O(nd + nc + c^2) = O(nd + n/\varepsilon^2 + 1/\varepsilon^4)$ time and $O(cd + c^2) = O(d/\varepsilon^2 + 1/\varepsilon^4)$ space. The testing step on a point $x \in \mathbb{R}^d$ is done by evaluating $K(x, y)$ on each data point $y$ sampled, taking $O(cd)$ time, then projecting onto the $c$-dimensional basis in $O(c^2)$ time; requiring $O(cd + c^2) = O(d/\varepsilon^2 + 1/\varepsilon^4)$ time total.

Random Fourier Features for Kernel PCA. In this line of work, the kernel matrix is approximated via randomized feature maps. The seminal work of [17] showed how to construct randomized feature maps $Z : \mathbb{R}^d \to \mathbb{R}^m$ such that for $x, y \in \mathbb{R}^d$, $E[\langle Z(x), Z(y) \rangle] = K(x, y)$ for a shift-invariant kernel $K(x, y) = K(x - y)$. Using this mapping, instead of implicitly lifting data points to $\mathcal{H}$ by the kernel trick, they explicitly embed the data to a low-dimensional Euclidean inner product space. We explicitly describe their work in Section 2. Subsequent works generalized to other kernel functions such as group invariant kernels [12], min/intersection kernels [16], dot-product kernels [10], and polynomial kernels [1, 8]. Employing this body of methods, Lopez et al. [14] proposed randomized nonlinear PCA (RNCA), which is basically a linear PCA on the $n \times m$ matrix $Z$. This reduces the time complexity of KPCA to $O(nm^2)$ and the space complexity to $O(nm)$. To bound errors we can consider (without constructing) $\hat{G} = Z Z^T$. They showed, using [17], that the approximation error is bounded as $E[\|\hat{G} - G\|_2] \le \Theta(n \log n / m)$. Along the way to our main results, we will show below that using $m = O((1/\varepsilon^2) \log n)$, we obtain $\|G - \hat{G}\|_2 \le \varepsilon n$ with constant probability. This takes $O(ndm + nm^2) = O(n((d/\varepsilon^2) \log n + (1/\varepsilon^4) \log^2 n))$ time and $O(md + nm) = O((1/\varepsilon^2) n \log n)$ space, assuming $d < n$.
Testing can be done on a new point $x \in \mathbb{R}^d$ using the $m$ random features in $O(md) = O((d/\varepsilon^2) \log n)$ time.

Algorithm | TRAIN TIME | TEST TIME | SPACE
KPCA | $O(n^2(n + d))$ | $O(n(d + n^2))$ | $O(n^2 + nd)$
NYSTRÖM | $O(nd + n/\varepsilon^2 + 1/\varepsilon^4)$ | $O(d/\varepsilon^2 + 1/\varepsilon^4)$ | $O(d/\varepsilon^2 + 1/\varepsilon^4)$
RNCA | $O(n((d/\varepsilon^2) \log n + (1/\varepsilon^4) \log^2 n))$ | $O((1/\varepsilon^2)(d + n) \log n)$ | $O((d/\varepsilon^2)\, n \log n)$
SKPCA | $O(n \log n\,(d/\varepsilon^2 + 1/\varepsilon^3))$ | $O(((d + 1/\varepsilon)/\varepsilon^2) \log n)$ | $O(((d + 1/\varepsilon)/\varepsilon^2) \log n)$

Table 1: Asymptotic TRAIN TIME, TEST TIME and SPACE for four algorithms: SKPCA (our newly proposed algorithm), KPCA [19], RNCA [14], and Nyström [5]. For each, the resulting summary could be used to recover an approximate gram matrix $G'$ such that $\|G' - G\|_2 \le \varepsilon n$ (KPCA is exact).

1.2 Our Results

In this paper, we present a streaming variant of kernel PCA (SKPCA) that can produce an (implicit) representation of two matrices $W \in \mathbb{R}^{m \times \ell}$ and $Z \in \mathbb{R}^{n \times m}$, so that for $\tilde{G} = Z W W^T Z^T$, with constant probability $\|G - \tilde{G}\|_2 \le \varepsilon n$. It requires $O(n \log n\,(d/\varepsilon^2 + 1/\varepsilon^3))$ time, $O(((d + 1/\varepsilon)/\varepsilon^2) \log n)$ space, and permits $O(((d + 1/\varepsilon)/\varepsilon^2) \log n)$ test time. It is the first method to use space that is both sublinear (logarithmic) in $n$ and sub-quartic in $1/\varepsilon$, to have training time sub-quartic in $1/\varepsilon$, and to have test time both sublinear (logarithmic) in $n$ and with less than $1/\varepsilon^4$ dependence on $\varepsilon$.

2 Streaming Kernel Principal Component Analysis

In this section, we describe our algorithm Streaming Kernel Principal Component Analysis (SKPCA) for approximating the eigenspace of a streaming dataset existing on a nonlinear manifold. SKPCA consists of two implicit phases. In the first phase, a set of $m$ data oblivious random feature functions $(f_1, \cdots, f_m)$ are computed to map data points to a low dimensional Euclidean inner product space. There are several methods that compute these feature maps for different types of kernels, e.g. shift-invariant kernels [17], polynomial kernels [8], dot-product kernels [10]. These are used to map each data point $a_i \in \mathbb{R}^d$ to $z_i \in \mathbb{R}^m$.
In the second phase, each approximate feature vector $z_i$ is fed into the small space streaming algorithm FREQUENTDIRECTIONS [13] for computing an approximate set of singular vectors, as a matrix $W \in \mathbb{R}^{\ell \times m}$. However, in the actual algorithm these phases are not separated. The feature mapping functions are precomputed (oblivious to the data), so the approximate feature vectors are immediately fed into the matrix sketching algorithm, and we never need to fully materialize and store the full $n \times m$ matrix $Z$. Also, perhaps unintuitively, we do not sketch the $m$-dimensional column-space of $Z$, rather its $n$-dimensional row-space. Yet, since the resulting $\ell$-dimensional row-space of $W$ (with $\ell \ll m$) encodes a lower dimensional subspace within $\mathbb{R}^m$, it serves as our kernel principal components. Pseudocode is provided in Algorithm 2.1.

Approximate feature maps. To make the algorithm concrete, we consider the approximate feature maps described in the general framework of Rahimi and Recht [17]; label this instantiation of the FEATUREMAPS function as RANDOM-FOURIER FEATUREMAPS in Algorithm 2.2. This works for positive definite shift-invariant kernels $K(x, y) = K(x - y)$ (e.g. Gaussian kernel $K(x, y) = (1/2\pi)^{d/2} \exp(-\|x - y\|^2/2)$). It computes a randomized feature map $z : \mathbb{R}^d \to \mathbb{R}^m$ so that $E[z(x)^T z(y)] = K(x, y)$ for any $x, y \in \mathbb{R}^d$. To construct the mapping $z$, they define $m$ functions of the form $f_i(x) = \cos(r_i^T x + \gamma_i)$, where $r_i \in \mathbb{R}^d$ is a sample drawn at random from the distribution given by the Fourier transform of the kernel function, and $\gamma_i \sim \mathrm{Unif}(0, 2\pi]$ is drawn uniformly at random from the interval $(0, 2\pi]$. Applying each $f_i$ on a datapoint $x$ gives the $i$th coordinate of $z(x)$ in $\mathbb{R}^m$ as $z(x)_i = \sqrt{2/m}\, f_i(x)$. This implies each coordinate has squared value $(z(x)_i)^2 \le 2/m$. The algorithm is described in Algorithm 2.2. We consider Algorithm 2.1, using Algorithm 2.2, with $m = O((1/\varepsilon^2) \log n)$ and $\ell = O(1/\varepsilon)$.
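The random Fourier feature construction described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation (which the paper says is in Julia). It assumes the Gaussian kernel normalized so that $K(x, x) = 1$, that is $K(x, y) = \exp(-\|x - y\|^2/2)$, for which the sampling distribution of each $r_i$ is the standard normal on $\mathbb{R}^d$. The names `make_feature_map`, `gaussian_kernel` and `dot`, the seed, and the example points are all hypothetical.

```python
import math
import random


def make_feature_map(d, m, rng):
    """Draw data-oblivious parameters (r_i, gamma_i) and return the map z."""
    # Assumption: kernel exp(-||x - y||^2 / 2), whose spectral
    # distribution is the standard normal N(0, I_d).
    rs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    gammas = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(m)]
    scale = math.sqrt(2.0 / m)

    def z(x):
        # i-th coordinate: sqrt(2/m) * cos(r_i^T x + gamma_i)
        return [scale * math.cos(sum(rk * xk for rk, xk in zip(r, x)) + g)
                for r, g in zip(rs, gammas)]

    return z


def gaussian_kernel(x, y):
    """Exact kernel value exp(-||x - y||^2 / 2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)          # fixed seed, for reproducibility only
m = 4000
z = make_feature_map(d=3, m=m, rng=rng)
x = [0.3, -0.1, 0.5]
y = [0.0, 0.4, 0.2]
approx = dot(z(x), z(y))        # inner product of the explicit embeddings
exact = gaussian_kernel(x, y)   # value the embedding approximates in expectation
```

Because the parameters are drawn before any data is seen, the same map `z` can be applied to every streamed point and to later test points; only the $m$ vectors $r_i$ and offsets $\gamma_i$ need to be stored.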
Algorithm 2.1 SKPCA
Input: $A \in \mathbb{R}^{n \times d}$ as data points, a shift-invariant kernel function $K$, and $\ell, m \in \mathbb{Z}^+$
Output: Feature maps $[f_1, \cdots, f_m]$ and their approximate best $\ell$-dim subspace $W$
    $[f_1, \cdots, f_m]$ = FEATUREMAPS($K$, $m$)    (data oblivious features)
    $B \leftarrow 0^{\ell \times m}$
    for $i \in [n]$ do
        $z_i = \sqrt{2/m}\,[f_1(a_i), \cdots, f_m(a_i)]$ and insert $z_i$ as a row into $B$
        if $B$ has no zero valued rows then
            $[Y, \Sigma, W] \leftarrow \mathrm{svd}(B)$
            $B \leftarrow \sqrt{\max\{0, \Sigma^2 - \Sigma^2_{\ell/2, \ell/2} I_\ell\}} \cdot W^T$    (last $\ell/2$ rows of $B$ become 0)
    Return $[f_1, \cdots, f_m]$ and $W$.

SPACE: Storing $m$ functions $f_i$, for $i = 1, \ldots, m$, this algorithm is able to process the input $A = [a_1; \ldots; a_n]$ in a streaming fashion. Since for each function we need to store a $d$-dimensional vector $r_i$, it takes $O(dm)$ space to store all functions. Having all functions we can compute the feature map $z(x)$ and get an $m$-dimensional row vector $z(a_i)$ for each datapoint $a_i \in A$, which then is used to update the sketch $B \in \mathbb{R}^{\ell \times m}$ in FREQUENTDIRECTIONS. Since we need an additional $O(\ell m)$ for storing $B$ and $W$, the total space usage of Algorithm 2.1 is $O(dm + \ell m) = O(((d + 1/\varepsilon)/\varepsilon^2) \log n)$.

TRAIN TIME: Applying the feature map on each datapoint takes $O(dm)$ time and computing the FREQUENTDIRECTIONS sketch takes $O(n \ell m)$ time, so the training time is $O(n \cdot dm + n \ell m) = O(n \log n\,(d/\varepsilon^2 + 1/\varepsilon^3))$.

TEST TIME: Although $W$ approximates the eigenspace of $A$, to have the mapping of the data points to lower dimension $\ell$, we can store transformed data points in a matrix $Z \in \mathbb{R}^{n \times m}$ and project it onto the $\ell$-dimensional subspace spanned by $W$ to get $\tilde{Z} = ZW \in \mathbb{R}^{n \times \ell}$. Note that $\tilde{Z}$ needs an additional $O(n\ell)$ to store, and another pass over $A$ taking $O(n(dm + m\ell))$ time. However, for a test point $x_{test} \in \mathbb{R}^d$, we can project it onto $\mathbb{R}^m$ by $z(x_{test})$ in $O(dm)$ time and then use $W$ to project to $z(x_{test}) W \in \mathbb{R}^{\ell}$ in $O(\ell m)$ time, so in total it takes $O(dm + \ell m) = O(((d + 1/\varepsilon)/\varepsilon^2) \log n)$ time.

2.1 Spectral Error Analysis

In this section we prove our main result. Let $G = \Phi \Phi^T$ be the exact kernel matrix in RKHS.
Let $\hat{G} = Z Z^T$ be an approximate kernel matrix using $Z \in \mathbb{R}^{n \times m}$, which consists of mapping the $n$ points to $\mathbb{R}^m$ using $m$ RANDOM-FOURIER FEATUREMAPS. Then we consider $\tilde{G} = Z W W^T Z^T$ as the kernel matrix which could be constructed from outputs of Algorithm 2.2 and Algorithm 2.1. We show that if we set $m = O((1/\varepsilon^2) \log(n/\delta))$ and $\ell = 4/\varepsilon$ then $\|G - \tilde{G}\|_2 \le \varepsilon n$ with probability at least $1 - \delta$. In our proof we use the Bernstein inequality on sums of zero-mean random matrices.

Algorithm 2.2 RANDOM-FOURIER FEATUREMAPS($K$, $m$)
Input: A positive definite shift-invariant kernel $K(x, y) = g(x - y)$ with $K(x, x) = 1$, and $m$ as the output dimensionality.
Output: A feature map $z(x) : \mathbb{R}^d \to \mathbb{R}^m$ so that $z(x)^T z(y) \simeq K(x, y)$ $\forall x, y \in \mathbb{R}^d$
    Let $p$ be the Fourier transform of kernel $g$: $p(r) = \frac{1}{2\pi} \int e^{-j r^T \delta} g(\delta)\, d\delta$    ($j = \sqrt{-1}$)
    Draw $m$ iid samples $\{r_i\}_{i=1}^{m} \in \mathbb{R}^d$ from $p$ and $m$ iid samples $\{\gamma_i\} \in \mathbb{R}$ from $\mathrm{Unif}(0, 2\pi)$
    Define functions $f_i(x) = \cos(r_i^T x + \gamma_i)$ for $i = 1, \ldots, m$
    Return $[f_1, \ldots, f_m]$

Matrix Bernstein Inequality: Let $X_1, \cdots, X_d \in \mathbb{R}^{n \times n}$ be independent random matrices such that for all $1 \le i \le d$, $E[X_i] = 0$ and $\|X_i\|_2 \le R$ for a fixed constant $R$. Define the variance parameter as $\sigma^2 = \max\{\|\sum_{i=1}^{d} E[X_i^T X_i]\|, \|\sum_{i=1}^{d} E[X_i X_i^T]\|\}$. Then for all $t \ge 0$,
$$\Pr\left[\left\|\sum_{i=1}^{d} X_i\right\|_2 \ge t\right] \le 2n \cdot \exp\left(\frac{-t^2}{3\sigma^2 + 2Rt}\right).$$
Using this inequality, [14] bounded $E[\|G - \hat{G}\|_2]$. Here we employ similar ideas to improve this to a bound on $\|G - \hat{G}\|_2$ with high probability.

Lemma 2.1. For $n$ points, let $G = \Phi \Phi^T \in \mathbb{R}^{n \times n}$ be the exact gram matrix, and let $\hat{G} = Z Z^T \in \mathbb{R}^{n \times n}$ be the approximate kernel matrix using $m = O((1/\varepsilon^2) \log(n/\delta))$ random features from Algorithm 2.2. Then $\|G - \hat{G}\|_2 \le \varepsilon n$ with probability at least $1 - \delta$.

Proof. Consider $m$ independent random variables $E_i = \frac{1}{m} G - z_i z_i^T$. Note that $E[E_i] = \frac{1}{m} G - E[z_i z_i^T] = 0^{n \times n}$ [17]. Next we can rewrite
$$\|E_i\|_2 = \left\|\frac{1}{m} G - z_i z_i^T\right\|_2 = \left\|\frac{1}{m} E[Z Z^T] - z_i z_i^T\right\|_2$$
and thus bound
$$\|E_i\|_2 \le \frac{1}{m} \|E[Z Z^T]\|_2 + \|z_i z_i^T\|_2 \le \frac{1}{m} E[\|Z\|_2^2] + \|z_i\|^2 \le \frac{2n}{m} + \frac{2n}{m} = \frac{4n}{m}.$$
The first inequality follows from the triangle inequality, and the second inequality is obtained using Jensen's inequality on expected values, which states $\|E[X]\| \le E[\|X\|]$ for any random variable $X$. The last inequality uses the bound on the norm of $z_i$ as $\|z_i\|^2 \le \frac{2n}{m}$, and therefore $\|Z\|_2^2 \le \|Z\|_F^2 \le 2n$.

To bound $\sigma^2$, due to symmetry of the matrices $E_i$, simply $\sigma^2 = \|\sum_{i=1}^{m} E[E_i^2]\|_2$. Expanding
$$E[E_i^2] = E\left[\left(\frac{1}{m} G - z_i z_i^T\right)^2\right] = E\left[\frac{G^2}{m^2} + \|z_i\|^2 z_i z_i^T - \frac{1}{m}\left(z_i z_i^T G + G z_i z_i^T\right)\right]$$
it follows that
$$E[E_i^2] \le \frac{G^2}{m^2} + \frac{2n}{m} E[z_i z_i^T] - \frac{1}{m}\left(E[z_i z_i^T] G + G\, E[z_i z_i^T]\right) = \frac{1}{m^2}\left(G^2 + 2nG - 2G^2\right) = \frac{1}{m^2}\left(2nG - G^2\right).$$
The inequality holds by $\|z_i\|^2 \le 2n/m$, and the equalities that follow are due to $E[z_i z_i^T] = \frac{1}{m} G$. Therefore
$$\sigma^2 = \left\|\sum_{i=1}^{m} E[E_i^2]\right\|_2 \le \frac{1}{m} \left\|2nG - G^2\right\|_2 \le \frac{2n}{m} \|G\|_2 + \frac{1}{m} \|G^2\|_2 \le \frac{2n^2}{m} + \frac{1}{m} \|G\|_2^2 \le \frac{3n^2}{m},$$
where the second inequality is by the triangle inequality, and the last inequality by $\|G\|_2 \le \mathrm{Tr}(G) = n$. Setting $M = \sum_{i=1}^{m} E_i = \sum_{i=1}^{m} \left(\frac{1}{m} G - z_{:,i} z_{:,i}^T\right) = G - \hat{G}$ and using the Bernstein inequality with $t = \varepsilon n$ we obtain
$$\Pr\left[\|G - \hat{G}\|_2 \ge \varepsilon n\right] \le 2n \exp\left(\frac{-(\varepsilon n)^2}{3\left(\frac{3n^2}{m}\right) + 2\left(\frac{4n}{m}\right)\varepsilon n}\right) = 2n \exp\left(\frac{-\varepsilon^2 m}{9 + 8\varepsilon}\right) \le \delta.$$
Solving for $m$ we get $m \ge \frac{9 + 8\varepsilon}{\varepsilon^2} \log(2n/\delta)$, so with probability at least $1 - \delta$, for $m = O(\frac{1}{\varepsilon^2} \log(n/\delta))$, we have $\|G - \hat{G}\|_2 \le \varepsilon n$.

Next, we show that applying the Frequent Directions step to $Z$ does not asymptotically increase the error. To do so, we first show that the spectrum of $Z$ along directions that Frequent Directions fails to capture is small. We prove this for any $n \times m$ matrix $A$ that is approximated as $B \in \mathbb{R}^{\ell \times m}$ by Frequent Directions.

Lemma 2.2. Consider a matrix $A \in \mathbb{R}^{n \times m}$ with $m \le n$, and let $B$ be an $\ell \times m$ matrix resulting from running Frequent Directions on $A$ with $\ell$ rows. For any unit vector $y \in \mathbb{R}^n$ with $\|y^T A B^{\dagger} B\| = 0$, it holds that $\|y^T A\|^2 \le \|A - A_k\|_F^2 / (\ell - k)$, for all $k \le \ell$, including $k = 0$ where $A - A_k = A$.

Proof. Let $[U, S, V] = \mathrm{svd}(A)$ be the SVD of $A$.
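Consider any unit vector $y \in \mathbb{R}^n$ that lies in the column space of $A$ and the null space of $B$, that is $\|y^T A B^{\dagger} B\| = 0$ and $\|y^T A A^{\dagger}\| = \|y\| = 1$. Since $U = [u_1, u_2, \ldots, u_n]$ provides an orthonormal basis for $\mathbb{R}^n$, we can write
$$y = \sum_{i=1}^{n} \alpha_i u_i \quad \text{such that} \quad \alpha_i = \langle y, u_i \rangle, \quad \sum_{i=1}^{n} \alpha_i^2 = 1.$$
Since $1 = \sum_{i=1}^{n} \alpha_i^2 = \|y\|^2 = \|y^T A A^{\dagger}\|^2 = \|y^T U_m U_m^T\|^2 = \sum_{i=1}^{m} \alpha_i^2$, therefore $\alpha_i = 0$ for $i > m$. Moreover $\|y^T A\|^2 = \sum_{i=1}^{m} s_i^2 \langle y, u_i \rangle^2 = \sum_{i=1}^{m} s_i^2 \alpha_i^2$. This implies there exists a unit vector $x = \sum_{i=1}^{m} \alpha_i v_i \in \mathbb{R}^m$ with $\alpha_i = \langle x, v_i \rangle = \langle y, u_i \rangle$ for $i = 1, \cdots, m$ such that $\|y^T A\| = \|Ax\|$ and importantly $\|Bx\| = 0$, which we will prove shortly.

Then, due to the Frequent Directions bound [6], for any unit vector $\bar{x} \in \mathbb{R}^m$, $\|A\bar{x}\|^2 - \|B\bar{x}\|^2 \le \|A - A_k\|_F^2 / (\ell - k)$, and for our particular choice of $x$ with $\|Bx\| = 0$, we obtain $\|y^T A\|^2 = \|Ax\|^2 \le \|A - A_k\|_F^2 / (\ell - k)$, as desired.

Now to see that $\|Bx\| = 0$, we will assume that $\|Bx\| > 0$ and prove a contradiction. Since $\|Bx\| > 0$, then $x$ is not in the null space of $B$, and $\|\pi_B(x)\| > 0$ for any unit vector $x$. Let $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_\ell)$, assuming $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_\ell > 0$, be the singular values of $B$, and $W = [w_1, \ldots, w_\ell] \in \mathbb{R}^{m \times \ell}$ be its right singular vectors. Then $\|Bx\| = \|\Sigma W x\|$ and if $\|\pi_B(x)\| > 0$, then setting $\bar{\Sigma} = \mathrm{diag}(1, 1, \ldots, 1)$ and $\bar{B} = \bar{\Sigma} W = W_\ell$ to remove the scaling from $B$, we have $\|\pi_{\bar{B}}(x)\| > 0$. Similarly, if $\|y^T U S V^T B^{\dagger} B\| = \|y^T \pi_B(A)\| = 0$, then setting $\bar{S} = \mathrm{diag}(1, \ldots, 1)$ and $\bar{A} = U \bar{S} V^T$ to remove scale from $A$, we have $\|y^T \pi_B(\bar{A})\| = 0$. Hence
$$0 < \|\pi_{\bar{B}}(x)\| = \|x B^{\dagger} B\| = \|x W_\ell W_\ell^T\| = \left\|\sum_{j=1}^{\ell} \langle x, w_j \rangle\right\| = \left\|\sum_{j=1}^{\ell} \sum_{i=1}^{m} \alpha_i \langle v_i, w_j \rangle\right\|,$$
and
$$0 = \|y^T \pi_B(\bar{A})\| = \left\|\sum_{i=1}^{m} \langle y, u_i \rangle v_i^T W_\ell W_\ell^T\right\| = \left\|\sum_{i=1}^{m} \alpha_i v_i^T W_\ell\right\| = \left\|\sum_{j=1}^{\ell} \sum_{i=1}^{m} \alpha_i \langle v_i, w_j \rangle\right\|.$$
Since the last terms of each line match, we have a contradiction, and hence $\|Bx\| = 0$.

Lemma 2.3. Let $\tilde{Z} = ZW$, and $\tilde{G} = \tilde{Z} \tilde{Z}^T = Z W W^T Z^T$ be the corresponding gram matrix from $Z \in \mathbb{R}^{n \times m}$ and $W \in \mathbb{R}^{m \times \ell}$ constructed via Algorithm 2.1 with $\ell = 2/\varepsilon$.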
Comparing to $\hat{G} = Z Z^T$, we have $\|\tilde{G} - \hat{G}\|_2 \le \varepsilon n$.

Proof. Consider any unit vector $y \in \mathbb{R}^n$, and note that $y^T Z = [y^T Z]_W + [y^T Z]_{\perp W}$ where $[y^T Z]_W = y^T Z W W^T$ lies on the column space spanned by $W$, and $[y^T Z]_{\perp W} = y^T Z (I - W W^T)$ is in the null space of $W$. Then first off $\|y^T Z\|^2 = \|[y^T Z]_W\|^2 + \|[y^T Z]_{\perp W}\|^2$ since the two components are perpendicular to each other. Second, $[y^T Z]_W W = y^T Z W W^T W = y^T Z W$ and $[y^T Z]_{\perp W} W = y^T Z (I - W W^T) W = y^T Z (W - W) = 0$. Knowing these two we can say
$$\begin{aligned}
y^T (Z Z^T - \tilde{Z} \tilde{Z}^T) y &= (y^T Z)(y^T Z)^T - (y^T Z) W W^T (y^T Z)^T \\
&= \|y^T Z\|^2 - \left([y^T Z]_W + [y^T Z]_{\perp W}\right) W W^T \left([y^T Z]_W + [y^T Z]_{\perp W}\right)^T \\
&= \|y^T Z\|^2 - (y^T Z W)(y^T Z W)^T \\
&= \|[y^T Z]_W\|^2 + \|[y^T Z]_{\perp W}\|^2 - \|y^T Z W\|^2 \\
&= \|[y^T Z]_{\perp W}\|^2.
\end{aligned}$$
The last equality holds because $\|y^T Z W\| = \|y^T Z W W^T\| = \|[y^T Z]_W\|$ as $W$ is an orthonormal matrix.

To show $\|[y^T Z]_{\perp W}\|^2 \le \varepsilon \|Z\|_F^2$, consider the vector $v = y^T Z (I - W W^T) Z^{\dagger}$ and let $y^* = v / \|v\|$. Clearly $y^*$ satisfies the requirement of Lemma 2.2 as it is a unit vector in $\mathbb{R}^n$ and $\|y^* Z W W^T\| = 0$, since $\|y^* Z W W^T\| = \|y^T Z (I - W W^T) Z^{\dagger} Z W W^T\| / \|v\| = \|y^T Z (I - W W^T) W W^T\| / \|v\| = 0$. Therefore it satisfies $\|y^* Z\|^2 \le \|Z - Z_k\|_F^2 / (\ell - k)$. Since $\|Z\|_F^2 \le 2n$, for $k = 0$ and $\ell = 2/\varepsilon$ we obtain
$$\|[y^T Z]_{\perp W}\|^2 = \|y^T Z (I - W W^T)\|^2 = \|y^T Z (I - W W^T) Z^{\dagger} Z\|^2 = \|y^* Z\|^2 \|v\|^2 \le \|Z\|_F^2 \|v\|^2 / \ell \le \varepsilon n \|v\|^2.$$
It is left to show that $\|v\| \le 1$. For that, note $\pi_{ZW}(\pi_Z(y^T)) = \pi_{ZW}(y^T Z Z^{\dagger}) = y^T Z Z^{\dagger} (ZW)(ZW)^{\dagger} = y^T Z W (ZW)^{\dagger} = \pi_{ZW}(y^T)$. Then finally we obtain
$$\|v\|^2 = \|y^T Z (I - W W^T) Z^{\dagger}\|^2 = \|y^T Z Z^{\dagger} - y^T Z W W^T Z^{\dagger}\|^2 = \|\pi_Z(y^T) - \pi_{ZW}(y^T)\|^2 = \|\pi_Z(y^T) - \pi_{ZW}(\pi_Z(y^T))\|^2 \le \|\pi_Z(y^T)\|^2 \le \|y\|^2 = 1.$$

Theorem 2.1. Let $G = \Phi \Phi^T$ be the exact kernel matrix over $n$ points. Let $\tilde{G} = Z W W^T Z^T$ be the result of $Z$ from $m = O((1/\varepsilon^2) \log(n/\delta))$ random features from Algorithm 2.2 and $W$ from running Algorithm 2.1 with $\ell = 4/\varepsilon$. Then with probability at least $1 - \delta$, we have $\|G - \tilde{G}\|_2 \le \varepsilon n$.

Proof. Combining Lemmas 2.1 and 2.3 and using the triangle inequality, we get the bound.
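The triangle-inequality step in the proof of Theorem 2.1 can be spelled out as follows. This is a sketch of one consistent reading, under the assumption (ours, not stated explicitly in the text) that Lemma 2.1 and Lemma 2.3 are each invoked with error parameter $\varepsilon/2$:

```latex
\begin{align*}
\|G - \tilde{G}\|_2
  &\le \|G - \hat{G}\|_2 + \|\hat{G} - \tilde{G}\|_2 \\
  &\le \tfrac{\varepsilon}{2}\, n + \tfrac{\varepsilon}{2}\, n
   = \varepsilon n .
\end{align*}
```

Under this reading, Lemma 2.3 with error $\varepsilon/2$ needs $\ell = 2/(\varepsilon/2) = 4/\varepsilon$, which matches the value in the theorem, and Lemma 2.1 with error $\varepsilon/2$ needs $m = O((4/\varepsilon^2) \log(n/\delta)) = O((1/\varepsilon^2) \log(n/\delta))$. The bound in Lemma 2.3 holds for every $Z$, so the only failure probability is the $\delta$ from Lemma 2.1.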
Next, we extend the error analysis above to show that the spectral error bound has useful implications towards other bounds. First, we prove a Frobenius error bound for any $G' = Y Y^T$, including when $Y = ZW$. Second, we prove a smaller "for each" bound. Theorem 2.1 shows a stronger "for all" bound, specifically that for all unit vectors $x$, $\left|\|\Phi^T x\|^2 - \|W^T Z^T x\|^2\right| \le \varepsilon n$. The FD part was deterministic so implicitly "for all," and the RandomFeature part required $m = O((1/\varepsilon^2) \log(n/\delta))$ to hold with probability at least $1 - \delta$. We can remove the dependence on $n$ and obtain a "for each" bound. This is useful to bound the error on any one test point $x_{test}$ instead of all possible test points.

2.2 Frobenius Error Analysis

Let the true gram matrix be $G = \Phi \Phi^T$, and consider $G' = Y Y^T$, for any $Y$ including when $Y = ZW$. First we write the bound in terms of $\Phi$ and $Y$.

Lemma 2.4. $\|G - G'\|_2 = \max_{\|x\|=1} \left|\|\Phi^T x\|^2 - \|Y^T x\|^2\right|$.

Proof. Recall we can rewrite the spectral norm as
$$\|G - G'\|_2 = \max_{\|x\|=1} |x^T G x - x^T G' x| = \max_{\|x\|=1} |x^T \Phi \Phi^T x - x^T Y Y^T x| = \max_{\|x\|=1} \left|\|\Phi^T x\|^2 - \|Y^T x\|^2\right|.$$
The first equality follows by definition of the top eigenvalue of a symmetric matrix, and the last is true because $\|y\|^2 = y^T y$ for any vector $y$.

Thus if $\|G - G'\|_2 \le \varepsilon n$ where $G' = Y Y^T$ could be reconstructed by any of the algorithms we consider, then it implies $\max_{\|x\|=1} \left|\|\Phi^T x\|^2 - \|Y^T x\|^2\right| \le \varepsilon n$. We can now generalize the spectral norm bound to the Frobenius norm. Let $G - G' = U \Lambda U^T$ be the eigendecomposition of $G - G'$. Recall that one can write each eigenvalue as $\Lambda_{i,i} = u_i^T (G - G') u_i$, and the definition of the Frobenius norm implies $\|G - G'\|_F^2 = \sum_{i=1}^{n} \Lambda_{i,i}^2$. Hence
$$\|G - G'\|_F^2 = \sum_{i=1}^{n} \left(u_i^T (G - G') u_i\right)^2 = \sum_{i=1}^{n} \left(\|\Phi^T u_i\|^2 - \|Y^T u_i\|^2\right)^2 \le \sum_{i=1}^{n} (\varepsilon n)^2 \le \varepsilon^2 n^3.$$
Therefore $\|G - G'\|_F \le \varepsilon n^{1.5}$. We can also show a more interesting bound by considering $G_k$ and $G'_k$, the best rank $k$ approximations of $G$ and $G'$ respectively.

Lemma 2.5. Given that $\|G - G'\|_2 \le \varepsilon n$ we can bound $\|G - G'_k\|_F \le \|G - G_k\|_F + \varepsilon \sqrt{k}\, n$.

Proof. Let $[u_1, \ldots, u_n]$ and $[v_1, \ldots, v_n]$ be eigenvectors of $G$ and $G - G'$, respectively. Then
$$\begin{aligned}
\|G - G'_k\|_F^2 &= \sum_{i=1}^{k} \left(v_i^T (G - G'_k) v_i\right)^2 + \sum_{i=k+1}^{n} \left(v_i^T (G - G'_k) v_i\right)^2 \\
&\le \sum_{i=1}^{k} \left(v_i^T (G - G'_k) v_i\right)^2 + \sum_{i=k+1}^{n} \left(v_i^T G v_i\right)^2 \\
&\le \sum_{i=1}^{k} \left(v_i^T (\Phi \Phi^T - Y Y^T) v_i\right)^2 + \sum_{i=k+1}^{n} \left(u_i^T G u_i\right)^2 \\
&= \sum_{i=1}^{k} \left(\|\Phi^T v_i\|^2 - \|Y^T v_i\|^2\right)^2 + \|G - G_k\|_F^2 \\
&\le k (\varepsilon n)^2 + \|G - G_k\|_F^2.
\end{aligned}$$
The second transition is true because $G'$ is positive semidefinite, therefore $v_i^T (G - G'_k) v_i \le v_i^T G v_i$, and the third transition holds because if $u_i$ is the $i$th eigenvector of $G$ then $u_i^T G u_i \ge v_i^T G v_i$ where $v_i$ is the $i$th eigenvector of $G - G'$. Taking the square root yields
$$\|G - G'_k\|_F \le \sqrt{\|G - G_k\|_F^2 + (\varepsilon n)^2 k} \le \|G - G_k\|_F + \varepsilon n \sqrt{k}.$$

Thus we can get error bounds for the best rank-$k$ approximation of the data in RKHS that depend on the "tail" $\|G - G_k\|_F$, which is typically small. We can also make the second term $\varepsilon n \sqrt{k}$ equal to $\varepsilon' n$ by using a value of $\varepsilon = \varepsilon' / \sqrt{k}$ in the previously described algorithms.

2.3 For Each Spectral Error

In this section, we bound $\left|\|\Phi^T x\|^2 - \|Z^T x\|^2\right|$, where $\Phi$ and $Z$ are mappings of data to RKHS and RFF, respectively, and $x$ is a fixed unit vector in $\mathbb{R}^n$. Note that Lemma 2.1 essentially already gave a stronger proof, where using $m = O((1/\varepsilon^2) \log(n/\delta))$ the bound $\|G - \hat{G}\|_2 \le \varepsilon n$ holds along all directions (which makes progress towards addressing an open question of constructing oblivious subspace embeddings for Gaussian kernel feature spaces, in [1]). The advantage of this proof is that the bound on $m$ will be independent of $n$. Unfortunately, in this proof, going from the "for each" bound to the stronger "for all" bound would seem to require a net of size $2^{O(n)}$ and a union bound, resulting in a worse "for all" bound with $m = O(n/\varepsilon^2)$.
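To make the dependence on $n$ in the "for all" bound concrete, the following sketch evaluates the explicit sample size from the proof of Lemma 2.1, $m \ge \frac{9 + 8\varepsilon}{\varepsilon^2} \log(2n/\delta)$. The function names `m_for_all` and `bernstein_tail` and the example values of $\varepsilon$, $\delta$ and $n$ are illustrative assumptions; the logarithm is taken to be natural, since the bound comes from inverting an exponential tail.

```python
import math


def m_for_all(eps, n, delta):
    """Smallest integer m with 2n * exp(-eps^2 * m / (9 + 8*eps)) <= delta."""
    return math.ceil((9 + 8 * eps) / eps ** 2 * math.log(2 * n / delta))


def bernstein_tail(eps, n, m):
    """Right-hand side of the tail bound used in the proof of Lemma 2.1."""
    return 2 * n * math.exp(-eps ** 2 * m / (9 + 8 * eps))


eps, delta = 0.1, 0.05                       # illustrative values only
m_small = m_for_all(eps, 10 ** 3, delta)     # n = 1,000 points
m_large = m_for_all(eps, 10 ** 6, delta)     # n = 1,000,000 points
```

With these values, growing $n$ by a factor of 1000 less than doubles the required number of features, which is the logarithmic dependence that a "for each" bound removes entirely.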
On the other hand, the main objective of the TEST TIME procedure, which is mapping a single data point to the $D$-dimensional or $k$-dimensional kernel space, is already interesting for what the error is expected to be for a single vector $x$. This scenario corresponds to the "for each" setting that we will prove in this section.

In our proof, we use a variant of the Chernoff-Hoeffding inequality, stated next. Consider a set of $r$ independent random variables $\{X_1, \cdots, X_r\}$ where $0 \le X_i \le \Delta$. Let $M = \sum_{i=1}^{r} X_i$, then for any $\alpha \in (0, 1/2)$,
$$\Pr\left[|M - E[M]| > \alpha\right] \le 2 \exp\left(\frac{-2\alpha^2}{r \Delta^2}\right).$$
For this proof we are more careful with notation about row and column vectors. Now matrix $Z \in \mathbb{R}^{n \times m}$ can be written as a set of rows $[z_{1,:}; z_{2,:}; \ldots; z_{n,:}]$ where each $z_{i,:}$ is a vector of length $m$, or a set of columns $[z_{:,1}, z_{:,2}, \ldots, z_{:,m}]$, where each $z_{:,j}$ is a vector of length $n$. We denote the $(i, j)$-th entry of this matrix as $z_{i,j}$.

Theorem 2.2. For $n$ points in any arbitrary dimension and a shift-invariant kernel, let $G = \Phi \Phi^T \in \mathbb{R}^{n \times n}$ be the exact gram matrix, and $\hat{G} = Z Z^T \in \mathbb{R}^{n \times n}$ be the approximate kernel matrix using $m = O((1/\varepsilon^2) \log(1/\delta))$ random features from Algorithm 2.2. Then for any fixed unit vector $x \in \mathbb{R}^n$, it holds that $\left|\|\Phi^T x\|^2 - \|Z^T x\|^2\right| \le \varepsilon n$ with probability at least $1 - \delta$.

Proof. Note that $\mathbb{R}^n$ is not the dimension of the data. Consider any unit vector $x \in \mathbb{R}^n$. Define $m$ independent random variables $\{X_i = \langle z_{:,i}, x \rangle^2\}_{i=1}^{m}$. We can bound each $X_i$ as $0 \le X_i \le \|z_{:,i}\|^2 \le 2n/m$, therefore $\Delta = 2n/m$ for all $X_i$.
Setting M = i=1 Xi = kZ xk2 , we observe m m n m n n X n X X X X X X E[M ] = E hz:,i , xi2 = E ( zji xj )2 = E (zji xj )2 + 2 zji zki xj xk i=1 = n X x2j E i=1 "m X j=1 = n X # 2 zji +2 i=1 = x2j hφj,: , φj,: i + 2 n X n X i=1 " xj xk E m X n X x2j φ2ji j=1 +2 n X n X j=1 k>j j=1 # zji zki = i=1 n X j=1 xj xk φji φki = n X j=1 k>j x2j E [hzj,: , zj,: i] + 2 j=1 xj xk hφj,: , φk,: i = j=1 k>j i=1 n X n X j=1 k>j j=1 D X j=1 x2j D X i=1 n X n X xj xk E [hzj,: , zk,: i] j=1 k>j φ2ji + 2 n X n X j=1 k>j xj xk D X φji φki i=1 D X hφ:,i , xi2 = kΦT xk2 . i=1 Since x is a fixed unit vector, it is pulled out of all expectations. Using theChernoff-Hoeffding bound and −2(εn)2 = 2 exp −2ε2 m ≤ δ. Then setting α = εn yields Pr |kΦT xk2 − kZ T xk2 | > εn ≤ 2 exp m(2n/m) 2 we solve for m = (1/(2ε2 )) ln(2/δ) in the last inequality. 3 Experiments We measure the S PACE, T RAIN TIME, and T EST TIME of our SKPCA algorithms with ` taking values {2, 5, 10, 20, 30, 50}. We use spectral and Forbenious-based error measures and compare against the Nyström sampling approach and the RNCA approach using random Fourier features (RFF). All methods are implemented in Julia, run on an OpenSUSE 13.1 machine with 80 Intel(R) Xeon(R) 2.40GHz CPU and 750 GB of RAM. We run experiments on several real (CPU, F OREST, A DULT) and synthetic (R ANDOM N OISY) data sets. Each data set is an n × d matrix A (CPU is 7373 × 21, F OREST is 523910 × 54, A DULT is 33561 × 123, and R ANDOM N OISY is 20000 × 1000) with n datapoints and d attributes. For each a random subset is removed as the test set of size 1000, except CPU where the test set size is 800. We generate the R ANDOM N OISY synthetic dataset using the approach by [13]. We create A = SDU + F/ζ, where SDU is an s-dimensional signal matrix (for s < d) and F/ζ is (full) d-dimensional noise with ζ controlling the signal to noise ratio. Each entry Fi,j of F is generated i.i.d. from a normal distribution N (0, 1), and we use ζ = 10. 
For the signal matrix $S \in \mathbb{R}^{n \times s}$, we again generate each $S_{i,j} \sim N(0, 1)$ i.i.d.; $D$ is diagonal with linearly decreasing entries $D_{i,i} = 1 - (i-1)/d$; and $U \in \mathbb{R}^{s \times d}$ is just a random rotation. We set $s = 50$.

Error measures. We consider two error measures comparing the true gram matrix $G$ and an approximated gram matrix $G'$ (constructed in various ways). Kernel Spectral Error $= \|G - G'\|_2 / n$ represents the worst case error. Kernel Frobenius Error $= \|G - G'\|_F / n^2$ represents the global error. We normalize the error measures by $1/n$ and $1/n^2$, respectively, so they are comparable across data sets. These measures require another pass on the data to compute, but give a more holistic view of how accurate our approaches are.

Results. We measure the SPACE requirements of each algorithm as follows. The SKPCA sketch has size $md + m\ell$, Nyström has space $c^2 + cd$, and RNCA has space $m^2 + md$, where $m$ is the dimension of the RFF and $c$ is the number of data points Nyström samples. In our experiments, we set $m$ and $c$ similarly, calling these parameters SAMPLE SIZE. Note that SAMPLE SIZE and SPACE usage are different: both RNCA and Nyström have SPACE quadratic in SAMPLE SIZE, while for SKPCA it is linear.

Figures 1, 2, 3 and 4 show log-log plots of results for the RANDOM NOISY, CPU, FOREST and ADULT datasets, respectively. In each figure, row 1 shows, from left to right, Kernel Frobenius Error, Kernel Spectral Error and Train Time with respect to SAMPLE SIZE. Row 2 of each figure shows Kernel Frobenius Error and Kernel Spectral Error with respect to SPACE, and Test Time with respect to SAMPLE SIZE.

For small SAMPLE SIZE we observe that Nyström performs quite well under all error measures, corroborating results reported by Lopez-Paz et al. [14]. However, we note that the error range is quite small, typically less than 0.01.
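The two error measures and the SPACE accounting described above can be sketched as follows. This is an illustrative sketch in Python with NumPy, not the authors' Julia implementation; the function names `kernel_errors` and `space_usage` are hypothetical, and the sketch assumes both gram matrices fit in memory, which only holds for modest $n$.

```python
import numpy as np

def kernel_errors(G, G_approx):
    """Return (Kernel Spectral Error, Kernel Frobenius Error) for n x n gram matrices."""
    n = G.shape[0]
    diff = G - G_approx
    spectral = np.linalg.norm(diff, 2) / n            # ||G - G'||_2 / n
    frobenius = np.linalg.norm(diff, "fro") / n ** 2  # ||G - G'||_F / n^2
    return spectral, frobenius

def space_usage(sample_size, d, ell):
    """SPACE formulas from the text, taking m = c = sample_size."""
    m = c = sample_size
    return {
        "SKPCA": m * d + m * ell,    # linear in SAMPLE SIZE
        "Nystrom": c * c + c * d,    # quadratic in SAMPLE SIZE
        "RNCA": m * m + m * d,       # quadratic in SAMPLE SIZE
    }
```

For instance, with SAMPLE SIZE 100, $d = 21$ (the CPU data set) and $\ell = 10$, the formulas give 3100 stored numbers for SKPCA against 12100 for both Nyström and RNCA, which illustrates why the methods separate more clearly when plotted against SPACE than against SAMPLE SIZE.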
Also, for Kernel Frobenius Error we typically observe a cross-over point beyond which RNCA and most versions of SKPCA have better error for that size. Under Kernel Spectral Error we often see a cross-over point for SKPCA, but not for RNCA. We suspect that this is related to how FD only maintains the most dominant directions while ignoring other (potentially spurious) directions introduced by the RFF coordinate space. In general, SKPCA has as good or better error than RNCA for the same size, with smaller size being required with smaller $\ell$ values. This difference is more pronounced in SPACE than in SAMPLE SIZE, where our theoretical results expect a polynomial advantage.

Considering the timing experiments, we see that as a function of SAMPLE SIZE, RNCA is slowest for TRAIN TIME and Nyström is slowest for TEST TIME, by at least an order of magnitude. In both cases all versions of SKPCA are among the fastest algorithms. For the TRAIN TIME results, RNCA's slow time is dominated by summing $n$ outer products of dimensions $m \times m$. SKPCA avoids this by keeping only the top $\ell$ dimensions, so it requires similar computation only on the order of $\ell \times m$, where typically $\ell \ll m$. The Nyström approach only needs to compute a $c \times c$ gram matrix once. For the TEST TIME results, Nyström requires constructing a $c$-dimensional vector and projecting it onto a $c$-dimensional subspace, using $O(dc + c^2)$ time, which is likely the bottleneck. Both RNCA and SKPCA avoid this by directly computing an $m$-dimensional representation of a test data point in $O(dm)$ time (plus an $m$-dimensional to $\ell$-dimensional projection for SKPCA).

Overall, our proposed method SKPCA has superior timing and error results to RNCA, which it achieves by sketching in the kernel feature space. Its error is typically a bit worse than that of the Nyström approach, but the difference is small, and SKPCA is superior to Nyström in run time, especially in TEST TIME.

Figure 1: Results for the RANDOM NOISY dataset. Row 1: Kernel Frobenius Error (left), Kernel Spectral Error (middle) and Train Time (right) vs. Sample size. Row 2: Kernel Frobenius Error (left) and Kernel Spectral Error (middle) vs. Space, and Test Time vs. Sample size (right). Curves in all figures: RNCA, Nyström, and SKPCA with $\ell \in \{2, 5, 10, 20, 30, 50\}$.

Figure 2: Results for the CPU dataset. Row 1: Kernel Frobenius Error (left), Kernel Spectral Error (middle) and Train Time (right) vs. Sample size. Row 2: Kernel Frobenius Error (left) and Kernel Spectral Error (middle) vs. Space, and Test Time vs. Sample size (right).

Figure 3: Results for the FOREST dataset. Row 1: Kernel Frobenius Error (left), Kernel Spectral Error (middle) and Train Time (right) vs. Sample size. Row 2: Kernel Frobenius Error (left) and Kernel Spectral Error (middle) vs. Space, and Test Time vs. Sample size (right).

Figure 4: Results for the ADULT dataset. Row 1: Kernel Frobenius Error (left), Kernel Spectral Error (middle) and Train Time (right) vs. Sample size. Row 2: Kernel Frobenius Error (left) and Kernel Spectral Error (middle) vs. Space, and Test Time vs. Sample size (right).

References

[1] Haim Avron, Huy L. Nguyen, and David P. Woodruff. Subspace embeddings for the polynomial kernel. In NIPS, 2014.
[2] Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of 20th ACM-SIAM Symposium on Discrete Algorithms, 2009.
[3] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing, 2013.
[4] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158–183, 2006.
[5] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a gram matrix for improved kernel-based learning. The Journal of Machine Learning Research, 6:2153–2175, 2005.
[6] Mina Ghashami and Jeff M. Phillips. Relative errors for deterministic low-rank matrix approximations. In SODA, pages 707–717, 2014.
[7] Peter M. Hall, A. David Marshall, and Ralph R. Martin. Incremental eigenanalysis for classification. In BMVC, volume 98, pages 286–295, 1998.
[8] Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis DeCoste. Compact random feature maps. arXiv preprint arXiv:1312.4626, 2013.
[9] Ian Jolliffe. Principal component analysis. Wiley Online Library, 2005.
[10] Purushottam Kar and Harish Karnick. Random feature maps for dot product kernels. arXiv preprint arXiv:1201.6530, 2012.
[11] Shosuke Kimura, Seiichi Ozawa, and Shigeo Abe. Incremental kernel PCA for online learning of feature space.
In Computational Intelligence for Modelling, Control and Automation, 2005 and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, International Conference on, volume 1, pages 595–600. IEEE, 2005.
[12] Fuxin Li, Catalin Ionescu, and Cristian Sminchisescu. Random Fourier approximations for skewed multiplicative histogram kernels. In Pattern Recognition, pages 262–271. Springer, 2010.
[13] Edo Liberty. Simple and deterministic matrix sketching. In KDD, pages 581–588, 2013.
[14] David Lopez-Paz, Suvrit Sra, Alex Smola, Zoubin Ghahramani, and Bernhard Schölkopf. Randomized nonlinear component analysis. arXiv preprint arXiv:1402.0119, 2014.
[15] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3, 2011.
[16] Subhransu Maji and Alexander C. Berg. Max-margin additive classifiers for detection. In Computer Vision, 2009 IEEE 12th International Conference on, pages 40–47. IEEE, 2009.
[17] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2007.
[18] Tamás Sarlós. Improved approximation algorithms for large matrices via random projections. In FOCS, pages 143–152, 2006.
[19] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel principal component analysis. In Artificial Neural Networks - ICANN'97, pages 583–588. Springer, 1997.
[20] Yohei Takeuchi, Seiichi Ozawa, and Shigeo Abe. An efficient incremental kernel principal component analysis for online feature selection. In Neural Networks, 2007. IJCNN 2007. International Joint Conference on, pages 2346–2351. IEEE, 2007.
[21] Takaomi Tokumoto and Seiichi Ozawa. A fast incremental kernel principal component analysis for learning stream of data chunks. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 2881–2888. IEEE, 2011.
[22] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Proceedings of the 14th Annual Conference on Neural Information Processing Systems, number EPFL-CONF-161322, pages 682–688, 2001.
[23] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10:1–157, 2014.