Streaming Kernel Principal Component Analysis
Mina Ghashami
School of Computing
University of Utah
Salt Lake City, UT
ghashami@cs.utah.edu
Daniel Perry
School of Computing
University of Utah
Salt Lake City, UT
dperry@cs.utah.edu
Jeff M. Phillips
School of Computing
University of Utah
Salt Lake City, UT
jeffp@cs.utah.edu
September 14, 2015
Abstract
Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture
non-linear structures within large data sets, and is a central tool in data analysis and learning. To allow
for non-linear relations, typically a full n × n kernel matrix is constructed over n data points, but this
requires too much space and time for large values of n. Techniques such as the Nyström method and
random feature maps can help towards this goal, but either they do not explicitly maintain the basis
vectors in a stream or still use space linear in n. We propose a new approach for streaming KPCA which
maintains a small set of basis elements in a stream, requiring space only logarithmic in n, and also
improves the dependence on the error parameter. Our technique combines random feature maps with recent advances in matrix sketching; it has guaranteed spectral norm error bounds with respect to the original kernel matrix, and it compares favorably in practice to state-of-the-art approaches.
1 Introduction
Principal component analysis (PCA) is a well-known technique for dimensionality reduction, and has many
applications including visualization, pattern recognition, and data compression [9]. Given a set of centered
d-dimensional (training) data points A = [a_1; . . . ; a_n] ∈ R^{n×d}, PCA diagonalizes the covariance matrix C = (1/n) A^T A by solving the eigenvalue equation Cv = λv. However, when the data points lie on a highly nonlinear space, PCA fails to concisely capture the structure of the data. To overcome this, several nonlinear extensions of PCA have been proposed, in particular Kernel Principal Component Analysis (KPCA) [19].
The basic idea of KPCA is to implicitly map the data into a nonlinear feature space of high (or often infinite)
dimension and perform PCA in that space [19]. The nonlinear map is often denoted as φ : Rd → H where
H is a Reproducing Kernel Hilbert Space (RKHS). While direct computation of PCA in RKHS is infeasible,
we can invoke the so-called kernel trick, which exploits the fact that PCA interacts with the data only through pairwise inner products. That is, ⟨φ(x), φ(y)⟩_H = K(x, y) for all x, y ∈ R^d, for a kernel function K; we represent these inner products as the n × n gram matrix G. However, KPCA suffers from high space and computational
complexity in storing the entire kernel (gram) matrix G ∈ Rn×n and in computing the decomposition of it
in the training phase. Then in the testing phase it spends O(nd) time to evaluate the kernel function for any
arbitrary test vector with respect to all training examples. Although one can use low rank decomposition
approaches [4, 6, 13, 18] to reduce the computational cost to some extent, KPCA still needs to compute and
store the kernel matrix.
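For concreteness, the following is a minimal sketch (ours, not from the original paper) of the exact KPCA baseline just described, assuming a Gaussian kernel. It materializes the full n × n gram matrix G and eigendecomposes it, which is exactly the quadratic space and cubic time cost the rest of the paper works to avoid. All names are illustrative.

```python
import numpy as np

def gaussian_gram(A, sigma=1.0):
    # G[i, j] = exp(-||a_i - a_j||^2 / (2 sigma^2)); an n x n matrix
    sq = np.sum(A**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def exact_kpca(A, k, sigma=1.0):
    # Eigendecompose G = Phi Phi^T directly; O(n^2) space, O(n^3) time.
    G = gaussian_gram(A, sigma)
    evals, evecs = np.linalg.eigh(G)           # ascending eigenvalues
    order = np.argsort(evals)[::-1][:k]
    return evals[order], evecs[:, order]       # top-k kernel principal components

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(300, 5))
    print(exact_kpca(A, k=3)[0])               # top-3 eigenvalues of G
```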
There have been two main approaches towards resolving this space issue. First, the Nyström approach
[22] uses a sample of the data points to construct a much smaller gram matrix. Second, feature maps [17]
provide an approximate but explicit embedding of the RKHS into Euclidean space. As we will describe,
both approaches can be made to operate in a stream, approximating the KPCA result in less than n^2 time and space.
Once these approximations are formed, they reveal a D ≪ n dimensional space, and typically a k-dimensional subspace found through linear PCA in R^D, which captures most of the data (e.g., a low rank-k approximation). There are two main purposes of these D- and k-dimensional subspaces; they start with mapping a data point x ∈ R^d into the D-dimensional space, and then often onto the k-dimensional subspace. If x is one of the training data points, then the k-dimensional representation can be used as a concise "loadings" vector. It can be used in various downstream training and learning tasks which, in this k-dimensional space, can assume linear relations (e.g., linear separators, clustering under Euclidean distance), since the non-linearity will have already been represented through the mapping to this space. If x is not in the training set, and the training set represents some underlying distribution, then we can assess the "fit" of x to this distribution by considering the residual of its representation in the D-dimensional space when projected to the k-dimensional space.
We refer to this procedure of mapping a single point x to the D-dimensional space as the Test Time. The value of k needed to get a good fit depends on the choice of kernel and its fit to the data, but D depends on the technique. For instance, in (regular) KPCA D = n, in Nyström D = O(1/ε^2), and using random feature maps [17] D = O((1/ε^2) log n), where ε ∈ (0, 1) is the error parameter. We propose a new streaming approach, named SKPCA, that will only require D = O(1/ε).
When we evaluate our approach and alternatives, we will also focus on Space (the cost of storing this map, and the space needed to construct it) and Train Time (the time needed to construct the map to the D-dimensional space).
Background and Notation. We indicate that a matrix A is n × d dimensional by writing A ∈ R^{n×d}. Matrices A and Z will be indexed by their row vectors, A = [a_1; a_2; . . . ; a_n], while other matrices V, U, W, . . . will be indexed by column vectors, V = [v_1, v_2, . . . , v_d]. We use I_n for the n-dimensional identity matrix and 0^{n×d} for the full zero matrix of dimension n × d. The Frobenius norm of a matrix A is ‖A‖_F = sqrt(Σ_{i=1}^n ‖a_i‖^2) and the spectral norm is ‖A‖_2 = sup_{x ∈ R^d} ‖Ax‖/‖x‖. We denote the transpose of a matrix as A^T. The singular value decomposition of a matrix A ∈ R^{n×d} is denoted by [U, S, V] = svd(A). If n ≥ d it guarantees that A = USV^T, U^TU = I_n, V^TV = I_d, U ∈ R^{n×n}, V ∈ R^{d×d}, and S = diag(s_1, s_2, . . . , s_d) ∈ R^{n×d} is a diagonal matrix with s_1 ≥ s_2 ≥ . . . ≥ s_d ≥ 0. Let U_k and V_k be the matrices containing the first k columns of U and V, respectively, and S_k = diag(s_1, s_2, . . . , s_k) ∈ R^{k×k}. The matrix A_k = U_k S_k V_k^T is the best rank-k approximation of A in the sense that A_k = argmin_{C: rank(C) ≤ k} ‖A − C‖_{2,F}. We denote by π_B(A) the projection of the rows of A onto the span of the rows of B; in other words, π_B(A) = AB^†B, where (·)^† denotes the Moore-Penrose pseudoinverse. Finally, the expected value of a matrix is defined as the matrix of expected values.
1.1 Related Work
In this section, we review the work most closely related to streaming principal component analysis and kernel PCA.
Matrix Sketching. Among many recent advancements in matrix sketching [15, 23], we focus on those that compress an n×d matrix A into an ℓ×d matrix B. There are several classes of algorithms based on sampling rows (or columns) [2, 4] (very related to Nyström approaches [5]), randomly projecting [18], or hashing [3], which require ℓ ≈ c/ε^2 to achieve ε error. The constant c depends on the algorithm, the specific type of approximation, and whether it is a "for each" or "for all" approximation.
A recent and different approach, Frequent Directions (FD) [13], uses only ℓ = 2/ε, achieves the error bound ‖A^T A − B^T B‖_2 ≤ ε‖A‖_F^2, and runs in time O(nd/ε). The algorithm processes one row of A at a time, and maintains in B an approximate singular value decomposition of the part of A seen so far. Each new row of A is appended to B, the singular value decomposition is updated, and the size is reduced back to only ℓ − 1 rows in B. This rank reduction of B is achieved by shrinking all squared singular values by the smallest squared singular value. To obtain the stated runtime, this needs to be run in batch, shrinking to rank ℓ/2 (subtracting the (ℓ/2 + 1)th squared singular value) once every ℓ/2 rows of A. We use a modified version of this algorithm in our proposed approach.
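For illustration, here is a compact sketch (ours, not code from [13]) of the batched Frequent Directions variant described above: rows stream into an ℓ × m buffer, and whenever the buffer fills, all squared singular values are shrunk by the (ℓ/2 + 1)-th one. The printed comparison is the ε‖A‖_F^2-style guarantee quoted above, up to the constant of this batched variant.

```python
import numpy as np

def frequent_directions(rows, ell):
    # Sketch a stream of m-dimensional rows into an ell x m matrix B.
    rows = iter(rows)
    first = np.asarray(next(rows), dtype=float)
    B = np.zeros((ell, first.shape[0]))
    B[0] = first
    nxt = 1                                   # index of the next zero-valued row of B
    for r in rows:
        if nxt == ell:                        # buffer full: shrink squared singular values
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell // 2] ** 2          # (ell/2 + 1)-th squared singular value
            B = np.sqrt(np.maximum(s**2 - delta, 0.0))[:, None] * Vt
            nxt = ell // 2                    # the last half of the rows are now zero
        B[nxt] = r
        nxt += 1
    return B

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(1000, 40))
    B = frequent_directions(A, ell=20)
    err = np.linalg.norm(A.T @ A - B.T @ B, 2)
    print(err, 2.0 * np.linalg.norm(A, "fro")**2 / 20)   # error vs. (2/ell)*||A||_F^2
```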
Incremental Kernel PCA. We first mention some techniques that do not achieve space-error trade-off guarantees. This group of methods updates/augments the eigenspace of KPCA without storing all training data. [11] proposed an algorithm based on incremental PCA [7], which maintains a set IndP of linearly independent training data points in memory and computes the top d eigenvectors such that they preserve a θ-fraction (for a threshold θ ∈ (0, 1)) of the total energy of the eigenspace. Two major drawbacks of this method are: 1) the set IndP can grow large and unpredictably, perhaps exceeding the capacity of memory; and 2) the eigenspace is updated for every single data point, which can lead to high computation cost. [20] and [21] partially address the second issue by allowing some batch processing.
Nyström-Based Methods for Kernel PCA. Another group of methods [5, 22], known as Nyström-based methods, approximate the kernel (gram) matrix G with a low-rank matrix Ḡ by sampling columns of G. The original version [22] samples c columns, forming C, and estimates Ḡ = C W^{-1} C^T, where W is the intersection of the sampled columns and rows; it took O(nc^2) time and was not streaming. This was modified [5] to use sampling with replacement and to approximate G as Ḡ = C W_k^† C^T. They proved that if the sampling probabilities are of the form p_i = G_ii^2 / Σ_{i=1}^n G_ii^2, then with constant probability they obtain a Frobenius error bound ‖G − Ḡ_k‖_F ≤ ‖G − G_k‖_F + ε Σ_{i=1}^n G_ii^2 with c = Ω(k/ε^4), and a spectral error bound ‖G − Ḡ_k‖_2 ≤ ‖G − G_k‖_2 + ε Σ_{i=1}^n G_ii^2 with c = Ω(1/ε^2). Note that by setting k = n (so G_k = G) in the second bound, in the setting where G_ii = 1 (a scenario we will focus on), we obtain the bound ‖G − Ḡ‖_2 ≤ εn. We observe that one can construct W in a stream over the data matrix using c independent reservoir samplers to select c data points, using O(nd + nc + c^2) = O(nd + n/ε^2 + 1/ε^4) time and O(cd + c^2) = O(d/ε^2 + 1/ε^4) space. The testing step on a point x ∈ R^d is done by evaluating K(x, y) on each sampled data point y, taking O(cd) time, then projecting onto the c-dimensional basis in O(c^2) time, requiring O(cd + c^2) = O(d/ε^2 + 1/ε^4) time in total.
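As a point of reference, the following is a minimal batch Nyström sketch (ours) for a Gaussian kernel, in the spirit of the methods above. It uses uniform sampling rather than the diagonal-weighted probabilities of [5] and is not the reservoir-sampled streaming variant; all names are illustrative.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def nystrom(A, c, sigma=1.0, seed=None):
    # Sample c landmark points; G_bar = C W^+ C^T is represented implicitly by (C, W^+).
    rng = np.random.default_rng(seed)
    idx = rng.choice(A.shape[0], size=c, replace=False)
    C = gaussian_kernel(A, A[idx], sigma)      # n x c
    W = C[idx, :]                              # c x c block on the sampled points
    return C, np.linalg.pinv(W)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.normal(size=(500, 10))
    C, Wpinv = nystrom(A, c=50, seed=3)
    G = gaussian_kernel(A, A)
    print(np.linalg.norm(G - C @ Wpinv @ C.T, 2) / A.shape[0])   # spectral error / n
```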
Random Fourier Features for Kernel PCA. In this line of work, the kernel matrix is approximated via randomized feature maps. The seminal work of [17] showed how to construct randomized feature maps Z: R^d → R^m such that for x, y ∈ R^d, E[⟨Z(x), Z(y)⟩] = K(x, y) for a shift-invariant kernel K(x, y) = K(x − y). Using this mapping, instead of implicitly lifting data points to H by the kernel trick, they explicitly embed the data into a low-dimensional Euclidean inner product space. We describe their construction explicitly in Section 2. Subsequent works generalized this to other kernel functions such as group invariant kernels [12], min/intersection kernels [16], dot-product kernels [10], and polynomial kernels [1, 8].
Employing this body of methods, Lopez-Paz et al. [14] proposed randomized nonlinear PCA (RNCA), which is essentially linear PCA on the n × m matrix Z. This reduces the time complexity of KPCA to O(nm^2) and the space complexity to O(nm). To bound errors we can consider (without constructing) Ĝ = ZZ^T. They showed, using [17], that the approximation error is bounded as E[‖Ĝ − G‖_2] ≤ Θ(n log n / m). Along the way to our main results, we show below that using m = O((1/ε^2) log n), we obtain ‖G − Ĝ‖_2 ≤ εn with constant probability. This takes O(ndm + nm^2) = O(n((d/ε^2) log n + (1/ε^4) log^2 n)) time and O(md + nm) = O((1/ε^2) n log n) space, assuming d < n. Testing can be done on a new point x ∈ R^d using the m random features in O(md) = O((d/ε^2) log n) time.
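To make the RNCA recipe concrete, here is a small sketch (ours, not the implementation of [14]) for the Gaussian kernel, whose Fourier transform is again Gaussian: build the n × m random-feature matrix Z, run linear PCA on it, and empirically check ‖G − ZZ^T‖_2 / n.

```python
import numpy as np

def rff_gaussian(A, m, sigma=1.0, seed=None):
    # Random Fourier features z(x) = sqrt(2/m) cos(R^T x + gamma), R ~ N(0, sigma^-2 I).
    rng = np.random.default_rng(seed)
    R = rng.normal(scale=1.0 / sigma, size=(A.shape[1], m))
    gamma = rng.uniform(0.0, 2.0 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(A @ R + gamma)              # n x m matrix Z

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    A = rng.normal(size=(400, 8))
    Z = rff_gaussian(A, m=2000, seed=5)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)             # linear PCA on Z
    top_k = Vt[:5]                                               # top-5 directions in R^m
    sq = np.sum(A**2, 1)
    G = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * A @ A.T) / 2.0)   # exact Gaussian gram
    print(np.linalg.norm(G - Z @ Z.T, 2) / A.shape[0])           # should be a small fraction
```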
| Method | Train Time | Test Time | Space |
| --- | --- | --- | --- |
| KPCA | O(n^2 (n + d)) | O(n(d + n^2)) | O(n^2 + nd) |
| Nyström | O(nd + n/ε^2 + 1/ε^4) | O(d/ε^2 + 1/ε^4) | O(d/ε^2 + 1/ε^4) |
| RNCA | O(n((d/ε^2) log n + (1/ε^4) log^2 n)) | O((1/ε^2)(d + n) log n) | O((d/ε^2) n log n) |
| SKPCA | O(n log n (d/ε^2 + 1/ε^3)) | O(((d + 1/ε)/ε^2) log n) | O(((d + 1/ε)/ε^2) log n) |

Table 1: Asymptotic Train Time, Test Time, and Space for four algorithms: SKPCA (our newly proposed algorithm), KPCA [19], RNCA [14], and Nyström [5]. For each, the resulting summary could be used to recover an approximate gram matrix G' such that ‖G' − G‖_2 ≤ εn (KPCA is exact).
1.2 Our Results
In this paper, we present a streaming variant of kernel PCA (SKPCA) that produces an (implicit) representation of two matrices W ∈ R^{m×ℓ} and Z ∈ R^{n×m}, so that for G̃ = ZWW^TZ^T, with constant probability ‖G − G̃‖_2 ≤ εn. It requires O(n log n (d/ε^2 + 1/ε^3)) time, O(((d + 1/ε)/ε^2) log n) space, and permits O(((d + 1/ε)/ε^2) log n) test time. It is the first method to use space that is both sublinear (logarithmic) in n and sub-quartic in 1/ε, to have training time sub-quartic in 1/ε, and to have test time both sublinear (logarithmic) in n and with less than 1/ε^4 dependence on ε.
2 Streaming Kernel Principal Component Analysis
In this section, we describe our algorithm, Streaming Kernel Principal Component Analysis (SKPCA), for approximating the eigenspace of a streaming dataset that lies on a nonlinear manifold. SKPCA consists of two implicit phases. In the first phase, a set of m data-oblivious random feature functions (f_1, · · · , f_m) are computed to map data points to a low-dimensional Euclidean inner product space. There are several methods that compute these feature maps for different types of kernels, e.g., shift-invariant kernels [17], polynomial kernels [8], and dot-product kernels [10]. These are used to map each data point a_i ∈ R^d to z_i ∈ R^m. In the second phase, each approximate feature vector z_i is fed into the small-space streaming algorithm Frequent Directions [13] for computing an approximate set of singular vectors, as a matrix W ∈ R^{m×ℓ}.
However, in the actual algorithm these phases are not separated. The feature mapping functions are precomputed (oblivious to the data), so the approximate feature vectors are immediately fed into the matrix sketching algorithm, and we never need to fully materialize and store the full n × m matrix Z. Also, perhaps unintuitively, we do not sketch the m-dimensional column-space of Z, but rather its n-dimensional row-space. Yet, since the resulting ℓ-dimensional row-space of W (with ℓ ≪ m) encodes a lower dimensional subspace within R^m, it serves as our representation of the kernel principal components. Pseudocode is provided in Algorithm 2.1.
Approximate feature maps. To make the algorithm concrete, we consider the approximate feature maps described in the general framework of Rahimi and Recht [17]; we label this instantiation of the FeatureMaps function as Random-Fourier FeatureMaps in Algorithm 2.2. This works for positive definite shift-invariant kernels K(x, y) = K(x − y) (e.g., the Gaussian kernel K(x, y) = (1/2π)^{d/2} exp(−‖x − y‖^2/2)). It computes a randomized feature map z: R^d → R^m so that E[z(x)^T z(y)] = K(x, y) for any x, y ∈ R^d. To construct the mapping z, they define m functions of the form f_i(x) = cos(r_i^T x + γ_i), where r_i ∈ R^d is a sample drawn at random from the Fourier transform of the kernel function, and γ_i ∼ Unif(0, 2π] is drawn uniformly at random from the interval (0, 2π]. Applying each f_i to a data point x gives the ith coordinate of z(x) in R^m as z(x)_i = sqrt(2/m) · f_i(x). This implies each coordinate has squared value (z(x)_i)^2 ≤ 2/m. The procedure is described in Algorithm 2.2.
We consider Algorithm 2.1, using Algorithm 2.2, with m = O((1/ε^2) log n) and ℓ = O(1/ε).
Algorithm 2.1 SKPCA
Input: A ∈ R^{n×d} as data points, a shift-invariant kernel function K, and ℓ, m ∈ Z^+
Output: Feature maps [f_1, · · · , f_m] and their approximate best ℓ-dim subspace W
  [f_1, · · · , f_m] = FeatureMaps(K, m)    (data-oblivious features)
  B ← 0^{ℓ×m}
  for i ∈ [n] do
    z_i = sqrt(2/m) [f_1(a_i), · · · , f_m(a_i)] and insert z_i as a row into B
    if B has no zero valued rows then
      [Y, Σ, W] ← svd(B)
      B ← sqrt(max{0, Σ^2 − Σ^2_{ℓ/2,ℓ/2} I_ℓ}) · W^T    (last ℓ/2 rows of B become 0)
  Return [f_1, · · · , f_m] and W.
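Below is a runnable sketch (ours) of Algorithm 2.1 for the Gaussian kernel: random Fourier feature maps are drawn up front, each z_i is streamed into a Frequent Directions buffer of ℓ rows, and only the ℓ × m buffer is ever stored. The final SVD used to recover W, and all names, are our own choices, not the paper's exact implementation.

```python
import numpy as np

def skpca(data_stream, d, m, ell, sigma=1.0, seed=None):
    rng = np.random.default_rng(seed)
    R = rng.normal(scale=1.0 / sigma, size=(d, m))        # data-oblivious feature maps
    gamma = rng.uniform(0.0, 2.0 * np.pi, size=m)
    B = np.zeros((ell, m))
    nxt = 0                                               # next zero-valued row of B
    for a in data_stream:
        z = np.sqrt(2.0 / m) * np.cos(R.T @ a + gamma)    # z_i in R^m
        if nxt == ell:                                    # no zero rows left: FD shrink
            _, s, Wt = np.linalg.svd(B, full_matrices=False)
            B = np.sqrt(np.maximum(s**2 - s[ell // 2]**2, 0.0))[:, None] * Wt
            nxt = ell // 2
        B[nxt] = z
        nxt += 1
    _, _, Wt = np.linalg.svd(B, full_matrices=False)
    return (R, gamma), Wt.T                               # feature maps and W in R^{m x ell}

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    A = rng.normal(size=(2000, 10))
    (R, gamma), W = skpca(iter(A), d=10, m=500, ell=40, seed=7)
    x_test = rng.normal(size=10)                          # test-time mapping of a new point
    z_test = np.sqrt(2.0 / 500) * np.cos(R.T @ x_test + gamma)
    print((z_test @ W).shape)                             # ell-dimensional loading vector
```

The test-time projection at the end of the script mirrors the O(dm + ℓm) mapping analyzed below.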
Space: Storing the m functions f_i, for i = 1, . . . , m, this algorithm is able to process the input A = [a_1; . . . ; a_n] in a streaming fashion. Since for each function we need to store a d-dimensional vector r_i, it takes O(dm) space to store all functions. Having all functions, we can compute the feature map z(x) and get an m-dimensional row vector z(a_i) for each data point a_i ∈ A, which is then used to update the sketch B ∈ R^{ℓ×m} in Frequent Directions. Since we need an additional O(ℓm) for storing B and W, the total space usage of Algorithm 2.1 is O(dm + ℓm) = O(((d + 1/ε)/ε^2) log n).
Train Time: Applying the feature map to each data point takes O(dm) and computing the Frequent Directions sketch takes O(nℓm) time, so the training time is O(n·dm + nℓm) = O(n log n(d/ε^2 + 1/ε^3)).
Test Time: Although W approximates the eigenspace of A, to obtain the mapping of the data points to the lower dimension ℓ we could store the transformed data points in a matrix Z ∈ R^{n×m} and project it onto the ℓ-dimensional subspace spanned by W to get Z̃ = ZW ∈ R^{n×ℓ}. Note that Z̃ needs an additional O(nℓ) space to store, and another pass over A taking O(n(dm + mℓ)) time. However, for a test point x_test ∈ R^d, we can map it into R^m via z(x_test) in O(dm) time and then use W to project to z(x_test)W ∈ R^ℓ in O(ℓm) time, so in total it takes O(dm + ℓm) = O(((d + 1/ε)/ε^2) log n) time.
2.1 Spectral Error Analysis
In this section we prove our main result. Let G = ΦΦ^T be the exact kernel matrix in the RKHS. Let Ĝ = ZZ^T be the approximate kernel matrix, where Z ∈ R^{n×m} consists of the n points mapped to R^m using m Random-Fourier feature maps. Then we consider G̃ = ZWW^TZ^T, the kernel matrix which could be constructed from the outputs of Algorithm 2.2 and Algorithm 2.1. We show that if we set m = O((1/ε^2) log(n/δ)) and ℓ = 4/ε, then ‖G − G̃‖_2 ≤ εn with probability at least 1 − δ. In our proof we use the Bernstein inequality on sums of zero-mean random matrices.
Algorithm 2.2 Random-Fourier FeatureMaps(K, m)
Input: A positive definite shift-invariant kernel K(x, y) = g(x − y) with K(x, x) = 1, and m as the output dimensionality.
Output: A feature map z(x): R^d → R^m so that z(x)^T z(y) ≈ K(x, y) for all x, y ∈ R^d
  Let p be the Fourier transform of the kernel g: p(r) = (1/2π) ∫ e^{−j r^T δ} g(δ) dδ    (j = sqrt(−1))
  Draw m i.i.d. samples {r_i}_{i=1}^m ∈ R^d from p and m i.i.d. samples {γ_i} ∈ R from Unif(0, 2π)
  Define functions f_i(x) = cos(r_i^T x + γ_i) for i = 1, . . . , m
  Return [f_1, . . . , f_m]
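As a quick numeric sanity check of the property z(x)^T z(y) ≈ K(x, y) promised by Algorithm 2.2, the short script below (ours) instantiates it for the Gaussian kernel with K(x, x) = 1, whose Fourier transform p is itself Gaussian; m is deliberately large so the Monte Carlo estimate is visibly close, and y is chosen near x so K(x, y) is not negligible.

```python
import numpy as np

rng = np.random.default_rng(8)
d, m = 6, 100000
x = rng.normal(size=d)
y = x + 0.3 * rng.normal(size=d)                  # a nearby point

R = rng.normal(size=(d, m))                       # r_i drawn from p (standard normal here)
gamma = rng.uniform(0.0, 2.0 * np.pi, size=m)
zx = np.sqrt(2.0 / m) * np.cos(R.T @ x + gamma)
zy = np.sqrt(2.0 / m) * np.cos(R.T @ y + gamma)

print(zx @ zy)                                    # randomized estimate of K(x, y)
print(np.exp(-np.sum((x - y)**2) / 2.0))          # exact Gaussian kernel value
```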
Matrix Bernstein Inequality: Let X_1, · · · , X_d ∈ R^{n×n} be independent random matrices such that for all 1 ≤ i ≤ d, E[X_i] = 0 and ‖X_i‖_2 ≤ R for a fixed constant R. Define the variance parameter as σ^2 = max{ ‖Σ_{i=1}^d E[X_i X_i^T]‖, ‖Σ_{i=1}^d E[X_i^T X_i]‖ }. Then for all t ≥ 0,

  Pr[ ‖Σ_{i=1}^d X_i‖_2 ≥ t ] ≤ 2n · exp( −t^2 / (3σ^2 + 2Rt) ).

Using this inequality, [14] bounded E[‖G − Ĝ‖_2]. Here we employ similar ideas to improve this to a bound on ‖G − Ĝ‖_2 with high probability.
Lemma 2.1. For n points, let G = ΦΦ^T ∈ R^{n×n} be the exact gram matrix, and let Ĝ = ZZ^T ∈ R^{n×n} be the approximate kernel matrix using m = O((1/ε^2) log(n/δ)) random features from Algorithm 2.2. Then ‖G − Ĝ‖_2 ≤ εn with probability at least 1 − δ.
Proof. Consider the m independent random matrices E_i = (1/m)G − z_i z_i^T, where z_i denotes the ith column of Z. Note that E[E_i] = (1/m)G − E[z_i z_i^T] = 0^{n×n} [17]. Next we can rewrite

  ‖E_i‖_2 = ‖(1/m)G − z_i z_i^T‖_2 = ‖(1/m)E[ZZ^T] − z_i z_i^T‖_2

and thus bound

  ‖E_i‖_2 ≤ (1/m)‖E[ZZ^T]‖_2 + ‖z_i z_i^T‖_2 ≤ (1/m)E[‖Z‖_2^2] + ‖z_i‖^2 ≤ 2n/m + 2n/m = 4n/m.

The first inequality holds by the triangle inequality, and the second is obtained using Jensen's inequality on expected values, which states ‖E[X]‖ ≤ E[‖X‖] for any random variable X. The last inequality uses the bound on the norm of z_i, namely ‖z_i‖^2 ≤ 2n/m, and therefore ‖Z‖_2^2 ≤ ‖Z‖_F^2 ≤ 2n.
To bound σ^2, due to the symmetry of the matrices E_i, simply σ^2 = ‖Σ_{i=1}^m E[E_i^2]‖_2. Expanding

  E[E_i^2] = E[((1/m)G − z_i z_i^T)^2] = E[(1/m^2)G^2 + ‖z_i‖^2 z_i z_i^T − (1/m)(z_i z_i^T G + G z_i z_i^T)],

it follows that

  E[E_i^2] ≤ G^2/m^2 + (2n/m) E[z_i z_i^T] − (1/m)(E[z_i z_i^T] G + G E[z_i z_i^T]) = (1/m^2)(G^2 + 2nG − 2G^2) = (1/m^2)(2nG − G^2).

The inequality holds by ‖z_i‖^2 ≤ 2n/m, and the last two equalities are due to E[z_i z_i^T] = (1/m)G. Therefore

  σ^2 = ‖Σ_{i=1}^m E[E_i^2]‖_2 ≤ (1/m)‖2nG − G^2‖_2 ≤ (2n/m)‖G‖_2 + (1/m)‖G^2‖_2 ≤ 2n^2/m + (1/m)‖G‖_2^2 ≤ 3n^2/m,

where the second inequality is by the triangle inequality, and the last inequality uses ‖G‖_2 ≤ Tr(G) = n. Setting M = Σ_{i=1}^m E_i = Σ_{i=1}^m ((1/m)G − z_{:,i} z_{:,i}^T) = G − Ĝ and using the Bernstein inequality with t = εn, we obtain

  Pr[ ‖G − Ĝ‖_2 ≥ εn ] ≤ 2n exp( −(εn)^2 / (3(3n^2/m) + 2(4n/m)εn) ) = 2n exp( −ε^2 m / (9 + 8ε) ) ≤ δ.

Solving for m we get m ≥ ((9 + 8ε)/ε^2) log(2n/δ), so with probability at least 1 − δ, for m = O((1/ε^2) log(n/δ)), we have ‖G − Ĝ‖_2 ≤ εn.
Next, we show that applying the Frequent Directions step to Z does not asymptotically increase the error. To do so, we first show that the spectrum of Z along directions that Frequent Directions fails to capture is small. We prove this for any n × m matrix A that is approximated as B ∈ R^{ℓ×m} by Frequent Directions.
Lemma 2.2. Consider a matrix A ∈ R^{n×m} with m ≤ n, and let B be an ℓ × m matrix resulting from running Frequent Directions on A with ℓ rows. For any unit vector y ∈ R^n with ‖y^T A B^† B‖ = 0, it holds that ‖y^T A‖^2 ≤ ‖A − A_k‖_F^2/(ℓ − k) for all k < ℓ, including k = 0 where A − A_k = A.
Proof. Let [U, S, V] = svd(A) be the SVD of A. Consider any unit vector y ∈ R^n that lies in the column space of A and the null space of B, that is, ‖y^T A B^† B‖ = 0 and ‖y^T A A^†‖ = ‖y‖ = 1. Since U = [u_1, u_2, . . . , u_n] provides an orthonormal basis for R^n, we can write

  y = Σ_{i=1}^n α_i u_i   with   α_i = ⟨y, u_i⟩   and   Σ_{i=1}^n α_i^2 = 1.

Since 1 = Σ_{i=1}^n α_i^2 = ‖y‖^2 = ‖y^T A A^†‖^2 = ‖y^T U_m U_m^T‖^2 = Σ_{i=1}^m α_i^2, we have α_i = 0 for i > m. Moreover, ‖y^T A‖^2 = Σ_{i=1}^m s_i^2 ⟨y, u_i⟩^2 = Σ_{i=1}^m s_i^2 α_i^2. This implies there exists a unit vector x = Σ_{i=1}^m α_i v_i ∈ R^m with α_i = ⟨x, v_i⟩ = ⟨y, u_i⟩ for i = 1, · · · , m such that ‖y^T A‖ = ‖Ax‖ and, importantly, ‖Bx‖ = 0, which we will prove shortly.
Then, due to the Frequent Directions bound [6], for any unit vector x̄ ∈ R^m, ‖Ax̄‖^2 − ‖Bx̄‖^2 ≤ ‖A − A_k‖_F^2/(ℓ − k), and for our particular choice of x with ‖Bx‖ = 0, we obtain ‖y^T A‖^2 = ‖Ax‖^2 ≤ ‖A − A_k‖_F^2/(ℓ − k), as desired.
Now to see that ‖Bx‖ = 0, we will assume that ‖Bx‖ > 0 and derive a contradiction. Since ‖Bx‖ > 0, x is not in the null space of B, and so ‖π_B(x)‖ > 0. Let Σ = diag(σ_1, . . . , σ_ℓ), with σ_1 ≥ σ_2 ≥ . . . ≥ σ_ℓ > 0, contain the singular values of B, and let W = [w_1, . . . , w_ℓ] ∈ R^{m×ℓ} contain its right singular vectors. Then ‖Bx‖ = ‖ΣW^T x‖, and since ‖π_B(x)‖ > 0, setting Σ̄ = diag(1, 1, . . . , 1) and B̄ = Σ̄W^T = W_ℓ^T to remove the scaling from B, we also have ‖π_B̄(x)‖ > 0. Similarly, since ‖y^T U S V^T B^† B‖ = ‖y^T π_B(A)‖ = 0, setting S̄ = diag(1, . . . , 1) and Ā = U S̄ V^T to remove the scale from A, we have ‖y^T π_B(Ā)‖ = 0. Hence

  0 < ‖π_B̄(x)‖ = ‖x^T B̄^† B̄‖ = ‖x^T W_ℓ W_ℓ^T‖ = ‖ Σ_{j=1}^ℓ ⟨x, w_j⟩ w_j ‖ = ‖ Σ_{j=1}^ℓ Σ_{i=1}^m α_i ⟨v_i, w_j⟩ w_j ‖,

and

  0 = ‖y^T π_B(Ā)‖ = ‖ Σ_{i=1}^m ⟨y, u_i⟩ v_i^T W_ℓ W_ℓ^T ‖ = ‖ Σ_{i=1}^m α_i v_i^T W_ℓ ‖ = ‖ Σ_{j=1}^ℓ Σ_{i=1}^m α_i ⟨v_i, w_j⟩ w_j ‖.

Since the final expressions of the two lines match, we have a contradiction, and hence ‖Bx‖ = 0.
Lemma 2.3. Let Z̃ = ZW and G̃ = Z̃Z̃^T = ZWW^TZ^T be the corresponding gram matrix, where Z ∈ R^{n×m} and W ∈ R^{m×ℓ} are constructed via Algorithm 2.1 with ℓ = 2/ε. Then, comparing to Ĝ = ZZ^T, we have ‖G̃ − Ĝ‖_2 ≤ εn.
Proof. Consider any unit vector y ∈ R^n, and note that y^T Z = [y^T Z]_W + [y^T Z]_{⊥W}, where [y^T Z]_W = y^T ZWW^T lies in the space spanned by the columns of W and [y^T Z]_{⊥W} = y^T Z(I − WW^T) is in the null space of W. First, ‖y^T Z‖^2 = ‖[y^T Z]_W‖^2 + ‖[y^T Z]_{⊥W}‖^2, since the two components are perpendicular to each other. Second, [y^T Z]_W W = y^T ZWW^TW = y^T ZW and [y^T Z]_{⊥W} W = y^T Z(I − WW^T)W = y^T Z(W − W) = 0. Using these two facts,

  y^T (ZZ^T − Z̃Z̃^T) y = (y^T Z)(y^T Z)^T − (y^T Z)WW^T(y^T Z)^T
   = ‖y^T Z‖^2 − ([y^T Z]_W + [y^T Z]_{⊥W}) WW^T ([y^T Z]_W + [y^T Z]_{⊥W})^T
   = ‖y^T Z‖^2 − (y^T ZW)(y^T ZW)^T
   = ‖[y^T Z]_W‖^2 + ‖[y^T Z]_{⊥W}‖^2 − ‖y^T ZW‖^2
   = ‖[y^T Z]_{⊥W}‖^2.

The last equality holds because ‖y^T ZW‖ = ‖y^T ZWW^T‖ = ‖[y^T Z]_W‖, as W has orthonormal columns.
It remains to bound ‖[y^T Z]_{⊥W}‖^2. Consider the vector v = y^T Z(I − WW^T)Z^† and let y* = v/‖v‖. Clearly y* satisfies the requirement of Lemma 2.2, as it is a unit vector in R^n and ‖y* Z W W^T‖ = 0, since

  ‖y* Z W W^T‖ = ‖y^T Z(I − WW^T)Z^† Z W W^T‖ / ‖v‖ = ‖y^T Z(I − WW^T) W W^T‖ / ‖v‖ = 0.

Therefore it satisfies ‖y* Z‖^2 ≤ ‖Z − Z_k‖_F^2/(ℓ − k). Since ‖Z‖_F^2 ≤ 2n, for k = 0 and ℓ = 2/ε we obtain

  ‖[y^T Z]_{⊥W}‖^2 = ‖y^T Z(I − WW^T)‖^2 = ‖y^T Z(I − WW^T)Z^† Z‖^2 = ‖y* Z‖^2 ‖v‖^2 ≤ ‖Z‖_F^2 ‖v‖^2/ℓ ≤ εn ‖v‖^2.

It is left to show that ‖v‖ ≤ 1. For that, note that π_{ZW}(π_Z(y^T)) = π_{ZW}(y^T ZZ^†) = y^T ZZ^†(ZW)(ZW)^† = y^T ZW(ZW)^† = π_{ZW}(y^T). Finally we obtain

  ‖v‖^2 = ‖y^T Z(I − WW^T)Z^†‖^2 = ‖y^T ZZ^† − y^T ZWW^TZ^†‖^2 = ‖π_Z(y^T) − π_{ZW}(y^T)‖^2 = ‖π_Z(y^T) − π_{ZW}(π_Z(y^T))‖^2 ≤ ‖π_Z(y^T)‖^2 ≤ ‖y‖^2 = 1.
Theorem 2.1. Let G = ΦΦ^T be the exact kernel matrix over n points. Let G̃ = ZWW^TZ^T be the result of Z built from m = O((1/ε^2) log(n/δ)) random features from Algorithm 2.2 and of W from running Algorithm 2.1 with ℓ = 4/ε. Then with probability at least 1 − δ, we have ‖G − G̃‖_2 ≤ εn.

Proof. Combining Lemmas 2.1 and 2.3 and using the triangle inequality, we get the bound.
Next, we extend the error analysis above to show that the spectral error bound has useful implications towards other bounds. First, we prove a Frobenius error bound for any G' = YY^T, including when Y = ZW. Second, we prove a smaller "for each" bound. Theorem 2.1 shows a stronger "for all" bound, specifically that for all unit vectors x, |‖Φ^T x‖^2 − ‖W^T Z^T x‖^2| ≤ εn. The FD part was deterministic and hence implicitly "for all," while the random feature part required m = O((1/ε^2) log(n/δ)) to hold with probability at least 1 − δ. We can remove the dependence on n and obtain a "for each" bound. This is useful to bound the error on any one test point x_test instead of all possible test points.
2.2 Frobenius Error Analysis
Let the true gram matrix be G = ΦΦ^T, and consider G' = YY^T, for any Y including when Y = ZW. First we write the bound in terms of Φ and Y.

Lemma 2.4. ‖G − G'‖_2 = max_{‖x‖=1} |‖Φ^T x‖^2 − ‖Y^T x‖^2|.

Proof. Recall that we can rewrite the spectral norm as ‖G − G'‖_2 = max_{‖x‖=1} |x^T G x − x^T G' x| = max_{‖x‖=1} |x^T ΦΦ^T x − x^T YY^T x| = max_{‖x‖=1} |‖Φ^T x‖^2 − ‖Y^T x‖^2|. The first equality follows from the definition of the top eigenvalue of a symmetric matrix, and the last holds because ‖y‖^2 = y^T y for any vector y.
Thus if ‖G − G'‖_2 ≤ εn, where G' = YY^T could be reconstructed by any of the algorithms we consider, then it implies max_{‖x‖=1} |‖Φ^T x‖^2 − ‖Y^T x‖^2| ≤ εn. We can now generalize the spectral norm bound to the Frobenius norm. Let G − G' = UΛU^T be the eigendecomposition of G − G'. Recall that one can write each eigenvalue as Λ_{i,i} = u_i^T (G − G') u_i, and the definition of the Frobenius norm implies ‖G − G'‖_F^2 = Σ_{i=1}^n Λ_{i,i}^2. Hence

  ‖G − G'‖_F^2 = Σ_{i=1}^n (u_i^T (G − G') u_i)^2 = Σ_{i=1}^n (‖Φ^T u_i‖^2 − ‖Y^T u_i‖^2)^2 ≤ Σ_{i=1}^n (εn)^2 ≤ ε^2 n^3.

Therefore ‖G − G'‖_F ≤ εn^{1.5}. We can also show a more interesting bound by considering G_k and G'_k, the best rank-k approximations of G and G', respectively.

Lemma 2.5. Given that ‖G − G'‖_2 ≤ εn, we can bound ‖G − G'_k‖_F ≤ ‖G − G_k‖_F + ε sqrt(k) n.
Proof. Let [u_1, . . . , u_n] and [v_1, . . . , v_n] be the eigenvectors of G and of G − G', respectively. Then

  ‖G − G'_k‖_F^2 = Σ_{i=1}^k (v_i^T (G − G'_k) v_i)^2 + Σ_{i=k+1}^n (v_i^T (G − G'_k) v_i)^2
   ≤ Σ_{i=1}^k (v_i^T (G − G'_k) v_i)^2 + Σ_{i=k+1}^n (v_i^T G v_i)^2
   ≤ Σ_{i=1}^k (v_i^T (ΦΦ^T − YY^T) v_i)^2 + Σ_{i=k+1}^n (u_i^T G u_i)^2
   = Σ_{i=1}^k (‖Φ^T v_i‖^2 − ‖Y^T v_i‖^2)^2 + ‖G − G_k‖_F^2
   ≤ k(εn)^2 + ‖G − G_k‖_F^2.

The second transition is true because G' is positive semidefinite, and therefore v_i^T (G − G'_k) v_i ≤ v_i^T G v_i; the third transition holds because if u_i is the ith eigenvector of G, then u_i^T G u_i ≥ v_i^T G v_i, where v_i is the ith eigenvector of G − G'. Taking the square root yields

  ‖G − G'_k‖_F ≤ sqrt( ‖G − G_k‖_F^2 + (εn)^2 k ) ≤ ‖G − G_k‖_F + εn sqrt(k).
Thus we can get error bounds for the best rank-k approximation of the data in the RKHS that depend on the "tail" ‖G − G_k‖_F, which is typically small. We can also make the second term εn sqrt(k) equal to ε'n by using a value of ε = ε'/sqrt(k) in the previously described algorithms.
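An informal numeric illustration of Lemma 2.5 (ours, on small synthetic matrices): given ‖G − G'‖_2 ≤ εn, the rank-k Frobenius error of G' should exceed the tail ‖G − G_k‖_F by at most εn sqrt(k). This is a sanity check, not a proof.

```python
import numpy as np

def best_rank_k(M, k):
    U, s, Vt = np.linalg.svd(M)
    return (U[:, :k] * s[:k]) @ Vt[:k]

rng = np.random.default_rng(9)
n, k = 200, 5
Phi = rng.normal(size=(n, 30))
G = Phi @ Phi.T                                   # exact PSD gram matrix
Y = Phi + 0.05 * rng.normal(size=Phi.shape)       # a perturbed factor, so G' = Y Y^T is PSD
Gp = Y @ Y.T

eps_n = np.linalg.norm(G - Gp, 2)                 # plays the role of eps * n
lhs = np.linalg.norm(G - best_rank_k(Gp, k), "fro")
rhs = np.linalg.norm(G - best_rank_k(G, k), "fro") + eps_n * np.sqrt(k)
print(lhs <= rhs, lhs, rhs)
```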
2.3 For Each Spectral Error
In this section, we bound |‖Φ^T x‖^2 − ‖Z^T x‖^2|, where Φ and Z are the mappings of the data to the RKHS and to the RFF space, respectively, and x is a fixed unit vector in R^n.
Note that Lemma 2.1 essentially already gave a stronger proof, where using m = O((1/ε^2) log(n/δ)) the bound ‖G − Ĝ‖_2 ≤ εn holds along all directions (which makes progress towards addressing an open question of constructing oblivious subspace embeddings for Gaussian kernel feature spaces, posed in [1]). The advantage of the proof in this section is that the bound on m is independent of n. Unfortunately, going from this "for each" bound to the stronger "for all" bound would seem to require a net of size 2^{O(n)} and a union bound, resulting in a worse "for all" bound with m = O(n/ε^2).
On the other hand, the main objective of the Test Time procedure, which maps a single data point to the D-dimensional or k-dimensional kernel space, already makes it interesting to ask what the error is for a single vector x. This scenario corresponds to the "for each" setting that we prove in this section.
In our proof, we use a variant of the Chernoff-Hoeffding inequality, stated next. Consider a set of r independent random variables {X_1, · · · , X_r} where 0 ≤ X_i ≤ Δ. Let M = Σ_{i=1}^r X_i; then for any α ∈ (0, 1/2),

  Pr[ |M − E[M]| > α ] ≤ 2 exp( −2α^2 / (rΔ^2) ).
For this proof we are more careful with the notation for row and column vectors. The matrix Z ∈ R^{n×m} can be written as a set of rows [z_{1,:}; z_{2,:}; . . . ; z_{n,:}], where each z_{i,:} is a vector of length m, or as a set of columns [z_{:,1}, z_{:,2}, . . . , z_{:,m}], where each z_{:,j} is a vector of length n. We denote the (i, j)-th entry of this matrix as z_{i,j}.
Theorem 2.2. For n points in arbitrary dimension and a shift-invariant kernel, let G = ΦΦ^T ∈ R^{n×n} be the exact gram matrix, and let Ĝ = ZZ^T ∈ R^{n×n} be the approximate kernel matrix using m = O((1/ε^2) log(1/δ)) random features from Algorithm 2.2. Then for any fixed unit vector x ∈ R^n, it holds that |‖Φ^T x‖^2 − ‖Z^T x‖^2| ≤ εn with probability at least 1 − δ.
Proof. Note that R^n here is not the dimension of the data. Consider any fixed unit vector x ∈ R^n, and let D denote the (possibly infinite) dimension of the lifted representation Φ ∈ R^{n×D}. Define m independent random variables {X_i = ⟨z_{:,i}, x⟩^2}_{i=1}^m. We can bound each X_i as 0 ≤ X_i ≤ ‖z_{:,i}‖^2 ≤ 2n/m, therefore Δ = 2n/m for all X_i. Setting M = Σ_{i=1}^m X_i = ‖Z^T x‖^2, we observe

  E[M] = Σ_{i=1}^m E[⟨z_{:,i}, x⟩^2]
   = Σ_{i=1}^m E[( Σ_{j=1}^n z_{ji} x_j )^2]
   = Σ_{i=1}^m E[ Σ_{j=1}^n (z_{ji} x_j)^2 + 2 Σ_{j=1}^n Σ_{k>j} z_{ji} z_{ki} x_j x_k ]
   = Σ_{j=1}^n x_j^2 E[ Σ_{i=1}^m z_{ji}^2 ] + 2 Σ_{j=1}^n Σ_{k>j} x_j x_k E[ Σ_{i=1}^m z_{ji} z_{ki} ]
   = Σ_{j=1}^n x_j^2 E[⟨z_{j,:}, z_{j,:}⟩] + 2 Σ_{j=1}^n Σ_{k>j} x_j x_k E[⟨z_{j,:}, z_{k,:}⟩]
   = Σ_{j=1}^n x_j^2 ⟨φ_{j,:}, φ_{j,:}⟩ + 2 Σ_{j=1}^n Σ_{k>j} x_j x_k ⟨φ_{j,:}, φ_{k,:}⟩
   = Σ_{j=1}^n x_j^2 Σ_{i=1}^D φ_{ji}^2 + 2 Σ_{j=1}^n Σ_{k>j} x_j x_k Σ_{i=1}^D φ_{ji} φ_{ki}
   = Σ_{i=1}^D ⟨φ_{:,i}, x⟩^2 = ‖Φ^T x‖^2.

Since x is a fixed unit vector, it can be pulled out of all the expectations. Using the Chernoff-Hoeffding bound and setting α = εn yields

  Pr[ |‖Φ^T x‖^2 − ‖Z^T x‖^2| > εn ] ≤ 2 exp( −2(εn)^2 / (m(2n/m)^2) ) = 2 exp( −ε^2 m/2 ) ≤ δ.

Solving the last inequality for m gives m = (2/ε^2) ln(2/δ) = O((1/ε^2) log(1/δ)).
3 Experiments
We measure the Space, Train Time, and Test Time of our SKPCA algorithms with ℓ taking values {2, 5, 10, 20, 30, 50}. We use spectral and Frobenius-based error measures and compare against the Nyström sampling approach and the RNCA approach using random Fourier features (RFF). All methods are implemented in Julia and run on an OpenSUSE 13.1 machine with 80 Intel(R) Xeon(R) 2.40GHz CPUs and 750 GB of RAM.
Data sets. We run experiments on several real (CPU, Forest, Adult) and synthetic (RandomNoisy) data sets. Each data set is an n × d matrix A (CPU is 7373 × 21, Forest is 523910 × 54, Adult is 33561 × 123, and RandomNoisy is 20000 × 1000) with n data points and d attributes. For each, a random subset of size 1000 is removed as the test set, except for CPU where the test set size is 800. We generate the RandomNoisy synthetic dataset using the approach of [13]. We create A = SDU + F/ζ, where SDU is an s-dimensional signal matrix (for s < d) and F/ζ is (full) d-dimensional noise with ζ controlling the signal-to-noise ratio. Each entry F_{i,j} of F is generated i.i.d. from a normal distribution N(0, 1), and we use ζ = 10. For the signal matrix, S ∈ R^{n×s}, we again generate each S_{i,j} ∼ N(0, 1) i.i.d.; D is diagonal with entries D_{i,i} = 1 − (i − 1)/d linearly decreasing; and U ∈ R^{s×d} is a random rotation. We set s = 50.
Error measures. We consider two error measures comparing the true gram matrix G and an approximated gram matrix G' (constructed in various ways). Kernel Spectral Error = ‖G − G'‖_2 / n represents the worst-case error. Kernel Frobenius Error = ‖G − G'‖_F / n^2 represents the global error. We normalize the error measures by 1/n and 1/n^2, respectively, so they are comparable across data sets. These measures require another pass over the data to compute, but give a more holistic view of how accurate our approaches are.
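Written out explicitly (our phrasing), the two measures are simply normalized matrix norms of G − G':

```python
import numpy as np

def kernel_spectral_error(G, G_prime):
    return np.linalg.norm(G - G_prime, 2) / G.shape[0]          # worst-case error, / n

def kernel_frobenius_error(G, G_prime):
    return np.linalg.norm(G - G_prime, "fro") / G.shape[0]**2   # global error, / n^2
```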
We measure the Space requirements of each algorithm as follows. The SKPCA sketch has size md + mℓ, Nyström has space c^2 + cd, and RNCA has space m^2 + md, where m is the dimension of the RFF and c is the number of data points Nyström samples. In our experiments, we set m and c similarly, calling these parameters SampleSize. Note that SampleSize and Space usage are different: both RNCA and Nyström have Space quadratic in SampleSize, while for SKPCA it is linear.
Results. Figures 1, 2, 3, and 4 show log-log plots of results for the RandomNoisy, CPU, Forest, and Adult datasets, respectively. In each figure, row 1 from left to right shows Kernel Frobenius Error, Kernel Spectral Error, and Train Time with respect to SampleSize. Row 2 of each figure shows Kernel Frobenius Error and Kernel Spectral Error with respect to Space, and Test Time with respect to SampleSize.
For small SampleSize we observe that Nyström performs quite well under all error measures, corroborating results reported by Lopez-Paz et al. [14]. However, we note that the error range is quite small, typically less than 0.01. Also, for Kernel Frobenius Error we typically observe a cross-over point after which RNCA and often most versions of SKPCA have better error for the same size. Under Kernel Spectral Error we often see a cross-over point for SKPCA, but not for RNCA. We suspect that this is related to how FD only maintains the most dominant directions while ignoring other (potentially spurious) directions introduced by the RFF coordinate space.
In general, SKPCA has as good or better error than RNCA for the same size, with smaller size being required for smaller ℓ values. This difference is more pronounced in Space than in SampleSize, where our theoretical results predict a polynomial advantage.
Considering the timing experiments, we see that as a function of SampleSize, RNCA is slowest in Train Time and Nyström is slowest in Test Time, by at least an order of magnitude. In both cases all versions of SKPCA are among the fastest algorithms. For the Train Time results, RNCA's slow time is dominated by summing n outer products of dimension m × m. This is avoided in SKPCA by only keeping the top ℓ dimensions, requiring similar computation only on the order of ℓ × m, where typically ℓ ≪ m. The Nyström approach only needs to compute a c × c gram matrix once.
For the Test Time results, Nyström requires constructing and projecting a c-dimensional vector onto a c-dimensional subspace, using O(dc + c^2) time, which is likely the bottleneck. Both RNCA and SKPCA avoid this by directly computing an m-dimensional representation of a test data point in O(dm) time (plus an m-dimensional to ℓ-dimensional projection for SKPCA).
Overall, our proposed method SKPCA has superior timing and error results to RNCA, which it achieves by sketching in the kernel feature space. Its error is typically a bit worse than the Nyström approach, but the difference is small, and SKPCA is superior to Nyström in run time, especially in Test Time.
References
[1] Haim Avron, Huy L. Nguyen, and David P. Woodruff. Subspace embeddings for the polynomial kernel.
In NIPS, 2014.
[2] Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. An improved approximation algorithm
for the column subset selection problem. In Proceedings of 20th ACM-SIAM Symposium on Discrete
Algorithms, 2009.
[3] Kenneth L Clarkson and David P Woodruff. Low rank approximation and regression in input sparsity
time. In Proceedings of the 45th Annual ACM symposium on Theory of computing, 2013.
[4] Petros Drineas, Ravi Kannan, and Michael W Mahoney. Fast monte carlo algorithms for matrices ii:
Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158–183, 2006.
[Figure 1: log-log plots for the RandomNoisy dataset, comparing RNCA, Nyström, and SKPCA with ℓ ∈ {2, 5, 10, 20, 30, 50}.]
Figure 1: Results for the RandomNoisy dataset. Row 1: Kernel Frobenius Error (left), Kernel Spectral Error (middle), and Train Time (right) vs. SampleSize. Row 2: Kernel Frobenius Error (left) and Kernel Spectral Error (middle) vs. Space, and Test Time vs. SampleSize (right).
[Figure 2: log-log plots for the CPU dataset, comparing RNCA, Nyström, and SKPCA with ℓ ∈ {2, 5, 10, 20, 30, 50}.]
Figure 2: Results for the CPU dataset. Row 1: Kernel Frobenius Error (left), Kernel Spectral Error (middle), and Train Time (right) vs. SampleSize. Row 2: Kernel Frobenius Error (left) and Kernel Spectral Error (middle) vs. Space, and Test Time vs. SampleSize (right).
[5] Petros Drineas and Michael W Mahoney. On the nyström method for approximating a gram matrix for
improved kernel-based learning. The Journal of Machine Learning Research, 6:2153–2175, 2005.
[6] Mina Ghashami and Jeff M. Phillips. Relative errors for deterministic low-rank matrix approximations.
In SODA, pages 707–717, 2014.
[Figure 3: log-log plots for the Forest dataset, comparing RNCA, Nyström, and SKPCA with ℓ ∈ {2, 5, 10, 20, 30, 50}.]
Figure 3: Results for the Forest dataset. Row 1: Kernel Frobenius Error (left), Kernel Spectral Error (middle), and Train Time (right) vs. SampleSize. Row 2: Kernel Frobenius Error (left) and Kernel Spectral Error (middle) vs. Space, and Test Time vs. SampleSize (right).
[Figure 4: log-log plots for the Adult dataset, comparing RNCA, Nyström, and SKPCA with ℓ ∈ {2, 5, 10, 20, 30, 50}.]
Figure 4: Results for the Adult dataset. Row 1: Kernel Frobenius Error (left), Kernel Spectral Error (middle), and Train Time (right) vs. SampleSize. Row 2: Kernel Frobenius Error (left) and Kernel Spectral Error (middle) vs. Space, and Test Time vs. SampleSize (right).
[7] Peter M Hall, A David Marshall, and Ralph R Martin. Incremental eigenanalysis for classification. In
BMVC, volume 98, pages 286–295, 1998.
[8] Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis DeCoste. Compact random feature maps. arXiv
preprint arXiv:1312.4626, 2013.
[9] Ian Jolliffe. Principal component analysis. Wiley Online Library, 2005.
[10] Purushottam Kar and Harish Karnick. Random feature maps for dot product kernels. arXiv preprint
arXiv:1201.6530, 2012.
[11] Shosuke Kimura, Seiichi Ozawa, and Shigeo Abe. Incremental kernel pca for online learning of feature
space. In Computational Intelligence for Modelling, Control and Automation, 2005 and International
Conference on Intelligent Agents, Web Technologies and Internet Commerce, International Conference
on, volume 1, pages 595–600. IEEE, 2005.
[12] Fuxin Li, Catalin Ionescu, and Cristian Sminchisescu. Random fourier approximations for skewed
multiplicative histogram kernels. In Pattern Recognition, pages 262–271. Springer, 2010.
[13] Edo Liberty. Simple and deterministic matrix sketching. In KDD, pages 581–588, 2013.
[14] David Lopez-Paz, Suvrit Sra, Alex Smola, Zoubin Ghahramani, and Bernhard Schölkopf. Randomized
nonlinear component analysis. arXiv preprint arXiv:1402.0119, 2014.
[15] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in
Machine Learning, 3, 2011.
[16] Subhransu Maji and Alexander C Berg. Max-margin additive classifiers for detection. In Computer
Vision, 2009 IEEE 12th International Conference on, pages 40–47. IEEE, 2009.
[17] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in
neural information processing systems, pages 1177–1184, 2007.
[18] Tamás Sarlós. Improved approximation algorithms for large matrices via random projections. In
FOCS, pages 143–152, 2006.
[19] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel principal component analysis. In Artificial Neural Networks—ICANN’97, pages 583–588. Springer, 1997.
[20] Yohei Takeuchi, Seiichi Ozawa, and Shigeo Abe. An efficient incremental kernel principal component
analysis for online feature selection. In Neural Networks, 2007. IJCNN 2007. International Joint
Conference on, pages 2346–2351. IEEE, 2007.
[21] Takaomi Tokumoto and Seiichi Ozawa. A fast incremental kernel principal component analysis for
learning stream of data chunks. In Neural Networks (IJCNN), The 2011 International Joint Conference
on, pages 2881–2888. IEEE, 2011.
[22] Christopher Williams and Matthias Seeger. Using the nyström method to speed up kernel machines.
In Proceedings of the 14th Annual Conference on Neural Information Processing Systems, number
EPFL-CONF-161322, pages 682–688, 2001.
[23] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in
Theoretical Computer Science, 10:1–157, 2014.