INVARIANCE OF PRINCIPAL COMPONENTS UNDER LOW-DIMENSIONAL RANDOM
PROJECTION OF THE DATA
Hanchao Qi, Shannon M. Hughes
Department of Electrical, Computer, and Energy Engineering
University of Colorado at Boulder
ABSTRACT
Algorithms that can efficiently recover principal components of
high-dimensional data from compressive sensing measurements
(e.g. low-dimensional random projections) of it have been an important topic of recent interest in the literature. In this paper, we show
that, under certain conditions, normal principal component analysis
(PCA) on such low-dimensional random projections of data actually
returns the same result as PCA on the original data set would. In
particular, as the number of data samples increases, the center of the
randomly projected data converges to the true center of the original
data (up to a known scaling factor) and the principal components
converge to the true principal components of the original data as
well, even if the dimension of each random subspace used is very
low. Indeed, experimental results verify that this approach does
estimate the original center and principal components very well for
both synthetic and real-world datasets, including hyperspectral data.
Its performance is even superior to that of other algorithms recently
developed in the literature for this purpose.
Index Terms— Compressive sensing, Principal component
analysis, Random projections, Low-rank matrix recovery, Hyperspectral data
1. INTRODUCTION
Principal component analysis (PCA) [1] selects the best low-dimensional linear projection of a set of data points to minimize
error between the original and projected data. It can also be thought
of as finding the linear subspace that maximally preserves the variance of, or in some cases, the information in, the data. PCA is
frequently used for dimensionality reduction, or as a summary of
interesting features of the data. It is also often used as a precursor to
signal classification.
To obtain the principal components (PCs) of data, one typically
centers the data and then computes the eigenvectors of the data’s
covariance matrix, using full knowledge of all data. However, in this
paper, we will show that when the PCA algorithm is instead applied
to low-dimensional random projections of each data point, as are
acquired in many compressive sensing (CS) measurement schemes
[2], it will often return the same center (up to a known scaling factor)
and principal components as it would for the original dataset.
More precisely, we show that the center of the low-dimensional
random projections of the data converges to the true center of the
original data (up to a known scaling factor) almost surely as the
number of data samples increases. We then show that under certain
conditions the top d eigenvectors of the randomly projected data’s
covariance matrix converge to the true d principal components of
the original data as the number of data samples increases. Moreover, both of the above conclusions are true regardless of how few
dimensions we use in our random projections (i.e. how few CS measurements we take of each sample).
This material is based upon work supported by the National Science
Foundation under Grant CCF-1117775.
Furthermore, experimentally, we find that on both synthetic and
real-world examples, including hyperspectral data, normal PCA on
low-dimensional random projections of the data recovers the center and the principal components of the original data very well. In
fact, the principal components recovered using normal PCA on the
randomly projected data are significantly more accurate than those
returned by other algorithms previously designed for this task, such
as Compressive-Projection Principal Component Analysis (CPPCA)
[3]. Normal PCA on the randomly-projected data thus provides both
a simpler and a more effective approach for recovering principal
components from CS measurements.
We further note that the principal component recovery problem
considered here is connected to the problem of low-rank matrix recovery from CS measurements (see e.g. [4]). Our work here thus
provides a simple and very computationally efficient way of solving
certain types of CS low-rank matrix recovery problems. Moreover,
due to the extreme simplicity of our approach (in contrast with other
strategies for PC recovery), it has the advantage of being readily extended to other related tasks. One such task is “compressed kernel
PCA”, the computation of kernel PCA on the original data using only
CS measurements of it, which we will present in an upcoming work.
Section 2 presents the notation and assumptions that we will use
throughout the paper. Section 3 presents two theorems (with proofs
deferred to Section 5) verifying that computing the center and PCs of
the original data using PCA on low-dimensional random projections
is possible. In Section 4, we show experimental results that verify our theoretical conclusions, while benchmarking this approach
against other approaches in the literature.
2. NOTATION AND ASSUMPTIONS
We assume that our original data are centered at $\bar{x} \in \mathbb{R}^p$ with principal components $v_1, \ldots, v_d \in \mathbb{R}^p$. Each data sample can then be represented as $x^{(i)} = \bar{x} + \sum_{j=1}^{d} w_{ij} \sigma_j v_j + z_i$, where $\{w_i\}_{i=1}^{n}$ are drawn i.i.d. from $N(0, I_d)$ and $\{z_i\}_{i=1}^{n}$ are drawn i.i.d. from $N(0, \epsilon^2 I_p)$ with small $\epsilon \geq 0$. The vectors $v_1, \ldots, v_d$ are assumed orthonormal and span a $d$-dimensional subspace of $\mathbb{R}^p$. Suppose we have $n$ such data samples $x^{(1)}, \ldots, x^{(n)} \in \mathbb{R}^p$.
We then draw $m$ i.i.d. random vectors $e^i_j$, $j = 1, \ldots, m$, from $N(0, I_p)$ for each $x^{(i)}$. We will project each $x^{(i)}$ onto the subspace spanned by these associated $e^i_j$. Here, it is important to assume different random vectors for different $i$. If we used the same random subspace for projecting each $x^{(i)}$, then we could only recover principal components within this subspace. We will assume that $mn \geq p$ so that the $e^i_j$'s span $\mathbb{R}^p$ almost surely. Denoting by $E^i$ the matrix with columns $\{e^i_j\}_{j=1}^{m}$, we see that $E^i \in \mathbb{R}^{p \times m}$. The projection matrix onto the subspace spanned by the columns of $E^i$ is then $P_i = E^i (E^{iT} E^i)^{-1} E^{iT} \in \mathbb{R}^{p \times p}$. The random projection of each $x^{(i)}$ is then $P_i x^{(i)}$. We note that if we have typical CS measurements of each $x^{(i)}$ of the form $m_i = E^{iT} x^{(i)}$, then for each $i$, $P_i x^{(i)} = E^i (E^{iT} E^i)^{-1} m_i$. Hence, each random projection $P_i x^{(i)}$ can be recovered directly from the measurements $m_i$ and measurement matrix $E^i$, without knowledge of the original data $x^{(i)}$.
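As a concrete illustration, the following sketch (our own, not from the paper; NumPy and all variable names are our choices) forms one measurement matrix $E^i$, the corresponding projection $P_i$, and recovers $P_i x^{(i)}$ from the measurements $m_i$ alone:

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 50, 5                      # ambient dimension and measurements per sample

x_i = rng.normal(size=p)          # one (hypothetical) data sample x^(i)
E_i = rng.normal(size=(p, m))     # columns e_j^i drawn from N(0, I_p)

# Projection onto the column span of E^i: P_i = E^i (E^iT E^i)^{-1} E^iT
P_i = E_i @ np.linalg.solve(E_i.T @ E_i, E_i.T)

m_i = E_i.T @ x_i                 # CS measurements m_i = E^iT x^(i)

# The same projection P_i x^(i), recovered from the measurements alone
Px_from_meas = E_i @ np.linalg.solve(E_i.T @ E_i, m_i)
assert np.allclose(P_i @ x_i, Px_from_meas)
```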
3. RECOVERY OF CENTER AND PRINCIPAL
COMPONENTS VIA PCA ON RANDOM PROJECTIONS
In this section, we will argue that using normal PCA to find the center and principal components of $\{P_i x^{(i)}\}_{i=1}^{n}$ recovers the center (up to a scaling factor) and principal components of the original data $\{x^{(i)}\}_{i=1}^{n}$. We start by presenting the following theorem, whose proof is deferred to Section 5.
Theorem 1. Suppose $\{P_i\}_{i=1}^{n}$, $\{x^{(i)}\}_{i=1}^{n}$, etc. are defined as in Section 2 with fixed $1 \leq m < p$. Then as the number of data samples $n \to \infty$, the center of the randomly projected data converges to $\frac{m}{p}$ times the true center of the original data $\bar{x}$ almost surely:
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} P_i x^{(i)} = \frac{m}{p} \bar{x} \qquad (1)$$
This result allows recovery of the original data's center (after rescaling by $\frac{p}{m}$). We note that the convergence holds regardless of the number of measurements per sample $m$.
Once we have found the center of the original data $\hat{\bar{x}}$, we can adjust the random projections $P_i x^{(i)}$ to reflect centering the original data. This involves either replacing $P_i x^{(i)}$ with $P_i(x^{(i)} - \hat{\bar{x}})$ or, if we are working with CS measurements, replacing $m_i$ with
$$m_i^{\mathrm{centered}} = E^{iT}(x^{(i)} - \hat{\bar{x}}) = m_i - E^{iT} \hat{\bar{x}} \qquad (2)$$
In the rest of this section, we will assume that the original data samples are centered around the origin, i.e. that we are working with $x^{(i)} - \hat{\bar{x}}$ instead of $x^{(i)}$ and/or $m_i^{\mathrm{centered}}$ instead of $m_i$.
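A minimal sketch of these two steps, under assumed variable names (`Px` holding the projected samples as rows, `meas` and `E_list` holding the $m_i$ and $E^i$):

```python
import numpy as np

def estimate_center(Px, m, p):
    """Theorem 1: the sample mean of the projections P_i x^(i) approaches
    (m/p) * x_bar, so rescaling the mean by p/m estimates the original center.
    Px is an (n, p) array whose i-th row is P_i x^(i)."""
    return (p / m) * Px.mean(axis=0)

def center_measurements(meas, E_list, x_bar_hat):
    """Eq. (2): replace each measurement vector m_i with m_i - E^iT x_bar_hat."""
    return [m_i - E_i.T @ x_bar_hat for m_i, E_i in zip(meas, E_list)]
```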
In typical PCA, the principal components of centered data are found as the eigenvectors of the empirical covariance matrix of the data, $C_{\mathrm{emp}} = \frac{1}{n} \sum_{i=1}^{n} x^{(i)} x^{(i)T}$, which often comes very close to the true underlying covariance $C = \sum_{i=1}^{d} \sigma_i^2 v_i v_i^T$.
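For reference, this full-data baseline is just an eigendecomposition of $C_{\mathrm{emp}}$; a minimal sketch (our own illustration):

```python
import numpy as np

def pca_full_data(X, ell):
    """Ordinary PCA on fully observed, centered data X of shape (n, p):
    the top `ell` eigenvectors of C_emp = (1/n) X^T X."""
    C_emp = X.T @ X / len(X)
    evals, evecs = np.linalg.eigh(C_emp)   # eigenvalues in ascending order
    return evecs[:, ::-1][:, :ell]         # columns = leading principal components
```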
We propose that the covariance matrix $C_P$ of the projected data $\{P_i x^{(i)}\}_{i=1}^{n}$, defined as
$$C_P = \frac{1}{n} \sum_{i=1}^{n} P_i x^{(i)} (P_i x^{(i)})^T = \frac{1}{n} \sum_{i=1}^{n} E^i (E^{iT} E^i)^{-1} m_i m_i^T (E^{iT} E^i)^{-1} E^{iT} \qquad (3)$$
will also be similar to $C$, allowing us to obtain the same principal components $v_1, \ldots, v_d$ from $C_P$.

Fig. 1. Randomly projecting the data preserves the principal component. In each of the three figures ((a) $p = 3$, (b) $p = 10$, (c) $p = 50$), there are $n = 3000$ points uniformly distributed on a line in $\mathbb{R}^3$, $\mathbb{R}^{10}$ and $\mathbb{R}^{50}$ respectively. We randomly project each point onto a two-dimensional random subspace, and view two dimensions of the result (the original principal component's and one other). Blue stars are the original points and red circles are the projected points. We observe that the original principal component remains intact even for a very small ratio $m/p$.
To gain some intuition for this, consider a simple example in which we take two-dimensional random projections of data points in $\mathbb{R}^p$ generated from one principal component, for varying $p$ (see Fig. 1). We can see that in all cases the projected points have the same principal component as the original points, and that the projected points are nicely symmetrically distributed around the original principal component. Although random projection scatters the energy of a principal component into other directions, the direction with the most energy remains the original direction, even for small ratio $m/p$. This observation is verified in the following theorem, whose proof is also deferred to Section 5.

Theorem 2. Suppose data samples $\{x^{(i)}\}_{i=1}^{n}$ centered at $\bar{x} = 0$ and $v_1, \ldots, v_d \in \mathbb{R}^p$, the orthonormal principal components, are as defined in Section 2. Let us select $v_{d+1}, \ldots, v_p$ so that $v_1, \ldots, v_p$ is an orthonormal basis for $\mathbb{R}^p$. Let $V \in \mathbb{R}^{p \times p}$ be the matrix with $i$th column $v_i$. Then
$$\lim_{n \to \infty} C_P = V \Sigma V^T + \frac{m}{p} \epsilon^2 I \qquad (4)$$
where $\Sigma$ is a diagonal matrix,
$$\Sigma = \mathrm{diag}\Big(\sigma_1^2 k_1 + \sum_{j \neq 1} \frac{\sigma_j^2 k_2}{p-1}, \; \ldots, \; \sigma_d^2 k_1 + \sum_{j \neq d} \frac{\sigma_j^2 k_2}{p-1}, \; \sum_{j=1}^{d} \frac{\sigma_j^2 k_2}{p-1}, \; \ldots, \; \sum_{j=1}^{d} \frac{\sigma_j^2 k_2}{p-1}\Big) \qquad (5)$$
where $k_1 = \frac{m^2}{p^2} + \frac{2m(p-m)}{p^3 + 2p^2}$ and $k_2 = \frac{m}{p} - k_1$.

Corollary 1. As $n \to \infty$, taking the top $\ell \leq d$ eigenvectors of $C_P$ will recover the true $\ell$ principal components $v_1, \ldots, v_\ell$ of the original data as long as $k_1 - \frac{k_2}{p-1} > 0$.

Proof. Clearly, the $v_1, \ldots, v_\ell$ are recovered from $C_P$ as $n \to \infty$ as long as $\Sigma_{rr} > \Sigma_{ss}$ for all $r \leq \ell$, $s > \ell$. We note that
$$\Sigma_{rr} - \Sigma_{ss} = \begin{cases} \sigma_r^2 \left(k_1 - \frac{k_2}{p-1}\right) & \text{if } s > d \\ (\sigma_r^2 - \sigma_s^2)\left(k_1 - \frac{k_2}{p-1}\right) & \text{if } s \leq d \end{cases}$$
The conclusion then immediately follows.

In fact, we note that $k_1 \approx \frac{m^2}{p^2}$ and $\frac{k_2}{p-1} \approx \frac{m}{p^2}$, so the condition of the corollary will generally be satisfied.

Although the center and principal components of the original data can be recovered directly from PCA on the randomly projected data, we found that the results were improved slightly in practice by iterative procedures. It is these procedures that we will use in our experimental results in Section 4. For the center estimator, we found that iteratively re-estimating the center and re-centering the data using Eq. (2) more than once improved the accuracy of the result. For the principal components, a small improvement is obtained by estimating the principal components $v_1, \ldots, v_\ell$ one by one instead of all at once. To do this, we obtain $\hat{v}_1$ as the first eigenvector from Eq. (3) as expected, but then use least squares as in [3, 5] to estimate the coefficient $\hat{\beta}_{i1} = \widehat{w_{i1} \sigma_1}$ of $\hat{v}_1$ for each original data sample $x^{(i)}$. We then subtract $\hat{\beta}_{i1} P_i \hat{v}_1$ from each $P_i x^{(i)}$ (or equivalently subtract $\hat{\beta}_{i1} E^{iT} \hat{v}_1$ from each $m_i$) to get the randomly projected data after removal of the first principal component from the original data set. We then repeat the process $\ell - 1$ times, estimating the first principal component and subtracting it each time, to recover $\hat{v}_2, \ldots, \hat{v}_\ell$.
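To make the procedure concrete, here is a rough sketch of the whole pipeline (our own illustration, not code from the paper): the eigenvectors of $C_P$ as in Eq. (3), combined with the one-by-one least-squares deflation described above. The container names `meas` and `E_list` are assumptions.

```python
import numpy as np

def estimate_pcs(meas, E_list, ell):
    """Estimate the top `ell` principal components from centered CS measurements.

    meas   : list of m-vectors m_i (assumed already centered via Eq. (2))
    E_list : list of (p, m) measurement matrices E^i

    Each round forms C_P as in Eq. (3), keeps its top eigenvector, and then
    removes that component's least-squares contribution from every m_i.
    """
    meas = [m_i.copy() for m_i in meas]
    n = len(meas)
    pcs = []
    for _ in range(ell):
        # Back-project: P_i x^(i) = E^i (E^iT E^i)^{-1} m_i
        Px = np.array([E @ np.linalg.solve(E.T @ E, m_i)
                       for E, m_i in zip(E_list, meas)])
        C_P = Px.T @ Px / n                  # empirical covariance of projections
        evals, evecs = np.linalg.eigh(C_P)
        v_hat = evecs[:, -1]                 # top eigenvector = estimated PC
        pcs.append(v_hat)
        # Deflation: least-squares coefficient of v_hat for each sample, then
        # subtract beta * E^iT v_hat from m_i (cf. [3, 5]).
        for i, E in enumerate(E_list):
            Ev = E.T @ v_hat
            beta = (Ev @ meas[i]) / (Ev @ Ev)
            meas[i] = meas[i] - beta * Ev
    return np.array(pcs).T                   # columns are the estimated PCs
```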
4. EXPERIMENTAL VERIFICATION
Now that we have shown that regular PCA on random projections permits recovery of the original data's center and principal components, we present experimental verification of this. For our first experiment, we synthetically generate data samples $\{x^{(i)}\}_{i=1}^{n} \in \mathbb{R}^{100}$ with 5 significant underlying principal components. Using the notation of Section 2, the five significant principal components have
$(\sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5) = (20, 15, 10, 8, 6)$ and $\epsilon = 1$. Each component of the center is drawn from a uniform distribution on $[0, 10)$. Fig. 2(a) shows how the center estimated by normal PCA on the randomly projected data (with iterative enhancement as noted above, and scaled by $\frac{p}{m}$) compares to the true center for varying measurement ratio $\frac{m}{p}$. The error is measured as the distance between the estimated and true centers, normalized by the true center's magnitude. We verify that this error becomes very small for large $n$, even for small numbers of measurements $m$.
To evaluate error for the principal components, we use the magnitude of the normalized inner product between each principal component obtained from random projections and the corresponding true principal component. Hence, a value of one indicates the best possible estimate. Figs. 2(b,c,d) show the results for the first 5 significant principal components for varying measurement ratio $\frac{m}{p}$, varying number of samples $n$, and varying noise level $\epsilon/\sigma_1$, respectively. We see that the principal components are indeed recovered from PCA on the randomly projected data if $n$ is sufficiently large.
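For reference, the two error measures just described amount to the following (a small sketch with assumed variable names):

```python
import numpy as np

def center_error(x_bar_hat, x_bar):
    """Distance between estimated and true centers, normalized by the true magnitude."""
    return np.linalg.norm(x_bar_hat - x_bar) / np.linalg.norm(x_bar)

def pc_accuracy(v_hat, v_true):
    """Magnitude of the normalized inner product; a value of 1.0 is a perfect estimate."""
    return abs(v_hat @ v_true) / (np.linalg.norm(v_hat) * np.linalg.norm(v_true))
```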
Fig. 2. Results for synthetic data. (a) Plot of normalized distance between the estimated center and the true center for varying $n$ and $\frac{m}{p}$. (b,c,d) Plots of normalized inner product magnitude between estimated PCs and the corresponding true PCs for (b) varying measurement ratios $\frac{m}{p}$ for $n = 2000$, (c) varying $n$ when $\frac{m}{p} = 0.2$, and (d) varying noise ratio $\epsilon/\sigma_1$.

Second, we examine the Lankershim Boulevard Data, a set of real-world videos of traffic passing through a busy intersection. Considering individual frames of video as data points, we expect these data to have a low-rank-plus-sparse structure, represented by a collection of strong principal components plus sparse anomalies. Here we compare the first 5 principal components as estimated using normal PCA vs. CPPCA [3], another method for estimating principal components from CS measurements. Due to the large size of each frame of the raw video, we have resized it to 24 × 32 pixels. We have also centered the data before proceeding, since CPPCA cannot estimate the center. The results, in the form of the normalized inner product between the estimated and true principal components, are shown in Fig. 3. We see that using normal PCA to estimate the principal components for this dataset results in more accurate estimated principal components than CPPCA [3].

Fig. 3. Results for Lankershim Boulevard Video. The normalized inner product magnitude between the first 5 estimated principal components and the true first 5 principal components for (a) normal PCA on the randomly projected data vs. (b) Compressive-Projection Principal Component Analysis (CPPCA) [3].

Finally, we use the estimated principal components to reconstruct hyperspectral images. We use the "Cuprite" and "Jasper Ridge" image datasets with $p = 224$ spectral bands and $n = 10000$ samples each. In each case, we use normal PCA on random projections of the data to estimate the original data's principal components, followed by least squares estimation of the coefficients of each piece of data within the principal components basis to estimate the original data point (see [3, 5]). We compare the average SNR of the resulting reconstructed hyperspectral images (see [3] for details) with that obtained reconstructing using CPPCA [3] and MTBCS [6]. The comparison of average SNR of the reconstructed hyperspectral images as a function of different measurement ratios $m/p$ is shown in Fig. 4. We can see that the SNR using our approach is much higher than that using the other two methods.

Fig. 4. Plots of average SNR of reconstructed hyperspectral images ((a) Cuprite dataset, (b) Jasper Ridge dataset) for various measurement ratios. We see that using normal PCA on random projections to estimate the principal components of hyperspectral data outperforms both Compressive-Projection Principal Component Analysis and Multi-Task Bayesian Compressive Sensing for hyperspectral data reconstruction.
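As an illustration of the reconstruction step described above (our own sketch; `V_hat` holds the estimated principal components as columns and `x_bar_hat` the estimated center):

```python
import numpy as np

def reconstruct_sample(m_i, E_i, V_hat, x_bar_hat):
    """Reconstruct one data point from its CS measurements m_i = E^iT x^(i).

    Solves min_beta || m_i - E^iT (x_bar_hat + V_hat beta) ||^2 and returns
    x_hat = x_bar_hat + V_hat beta_hat (cf. the least-squares step of [3, 5])."""
    A = E_i.T @ V_hat                   # (m, ell) design matrix
    b = m_i - E_i.T @ x_bar_hat         # measurements of the centered sample
    beta_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x_bar_hat + V_hat @ beta_hat
```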
5. THEORETICAL VERIFICATION
In this section, we present proofs of Theorems 1 and 2 from Section 3.
We start by introducing two lemmas used in the proof of Theorem 1.
Lemma 1 (From [7]; symmetry of the distribution of $P_i x$ under reflection across $x$). Suppose $x$ is a fixed point in $\mathbb{R}^p$ and let $P_i x$ be a random vector with $P_i$ defined as in Section 2. Define the reflection operator $R_x$ as
$$R_x(y) = y + 2(\langle y, \hat{x} \rangle \hat{x} - y) \qquad (6)$$
where $\hat{x} = \frac{x}{\|x\|}$. Then the distribution of $P_i x$ is the same as the distribution of $R_x(P_i x)$.
Lemma 2 (Theorem 1.1 of [8]). Let $e_1, \ldots, e_m \in \mathbb{R}^p$ be $m$ points drawn i.i.d. from $N(0, I_p)$. If $m < p$, then the vectors $\{e_i\}_{i=1}^{m}$ span an $m$-dimensional linear subspace of $\mathbb{R}^p$ almost surely. This subspace is then called a random $m$-space in $\mathbb{R}^p$. Let $H$ be a random $m$-space in $\mathbb{R}^p$, $L$ be a fixed 1-space in $\mathbb{R}^p$, and $\theta$ be the principal angle between $H$ and $L$. Then the random variable $\cos^2\theta$ has the beta distribution $\beta\!\left(\frac{m}{2}, \frac{p-m}{2}\right)$.
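Lemma 2 can be sanity-checked by simulation (our own sketch): the squared cosine of the principal angle between a random $m$-space and a fixed direction should have mean $\frac{m}{p}$ and second moment $\frac{m^2}{p^2} + \frac{2m(p-m)}{p^3+2p^2}$, the first two moments of a Beta$(\frac{m}{2}, \frac{p-m}{2})$ distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, trials = 30, 4, 50_000

u = np.zeros(p); u[0] = 1.0                    # fixed 1-space L = span{u}
cos2 = np.empty(trials)
for t in range(trials):
    E = rng.normal(size=(p, m))                # random m-space H = span of columns
    Pu = E @ np.linalg.solve(E.T @ E, E.T @ u) # projection of u onto H
    cos2[t] = Pu @ Pu                          # cos^2(theta), since ||u|| = 1

k1 = m**2 / p**2 + 2 * m * (p - m) / (p**3 + 2 * p**2)
print(cos2.mean(), m / p)                      # first moment,  should be close to m/p
print((cos2**2).mean(), k1)                    # second moment, should be close to k_1
```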
Proof of Theorem 1. Because $\{P_i x^{(i)}\}_{i=1}^{n}$ are i.i.d., we will try to evaluate $E(P_i x^{(i)})$ so that we may use the law of large numbers to show (1). Because $w_{ij}$, $P_i$, and $z_i$ are independent with $E(w_{ij}) = 0$, $E(z_i) = 0$,
$$E(P_i x^{(i)}) = E(P_i \bar{x}) + \sum_{j=1}^{d} \sigma_j E(w_{ij}) E(P_i v_j) + E(P_i) E(z_i) = E(P_i \bar{x}) \qquad (7)$$
From Lemma 1, the distribution of $P_i \bar{x}$ is the same as the distribution of $R_{\bar{x}}(P_i \bar{x})$. Thus, $E(P_i \bar{x}) = E(R_{\bar{x}}(P_i \bar{x}))$. Hence,
$$E(P_i \bar{x}) = \frac{1}{2} E\left(P_i \bar{x} + R_{\bar{x}}(P_i \bar{x})\right) = E\left(\frac{\langle P_i \bar{x}, \bar{x} \rangle}{\|\bar{x}\|^2} \bar{x}\right) \qquad (8)$$
Now, $P_i \bar{x}$ is the projection of $\bar{x}$ onto a random $m$-space. Suppose the principal angle between this space and the span of $\bar{x}$ is $\theta$. Then
$$\langle P_i \bar{x}, \bar{x} \rangle = \|P_i \bar{x}\| \|\bar{x}\| \cos\theta = \|\bar{x}\|^2 \cos^2\theta \qquad (9)$$
Thus, from Lemma 2 and (9), we have
$$E\left(\frac{\langle P_i \bar{x}, \bar{x} \rangle}{\|\bar{x}\|^2}\right) = E(\cos^2\theta) = \frac{m}{p} \qquad (10)$$
Combining (7), (8) and (10), we have that $E(P_i x^{(i)}) = \frac{m}{p} \bar{x}$, and Theorem 1 follows from the law of large numbers.
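The expectation identity $E(P_i \bar{x}) = \frac{m}{p}\bar{x}$ at the heart of this proof can also be checked numerically (a small sketch with arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)
p, m, trials = 20, 3, 20_000
x_bar = rng.normal(size=p)                     # an arbitrary fixed center

acc = np.zeros(p)
for _ in range(trials):
    E = rng.normal(size=(p, m))
    acc += E @ np.linalg.solve(E.T @ E, E.T @ x_bar)   # one draw of P_i x_bar

# The empirical mean should approach (m/p) * x_bar
print(np.max(np.abs(acc / trials - (m / p) * x_bar)))  # close to zero
```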
To prove Theorem 2, we first introduce the following lemma.
Lemma 3 (From [7]; symmetry of the distribution of $P_i x$ under rotation about $x$). Suppose $x$ is a fixed point in $\mathbb{R}^p$ and let $P_i$ be defined as in Section 2. Let $V \in \mathbb{R}^{p \times p}$ be an orthogonal matrix with first column $x/\|x\|$ and let
$$Q'_x = V \begin{pmatrix} 1 & 0_{1 \times (p-1)} \\ 0_{(p-1) \times 1} & Q \end{pmatrix} V^T \qquad (11)$$
where $Q$ is in the special orthogonal group $SO_{p-1}$, so that $Q'_x$ represents an arbitrary rotation of $\mathbb{R}^p$ about $x$. Then the distribution of $P_i x$ is the same as the distribution of $Q'_x(P_i x)$.
Proof of Theorem 2. Since the data is assumed centered with $\bar{x} = 0$, $x^{(i)} = \sum_{j=1}^{d} w_{ij} \sigma_j v_j + z_i$. Thus, since all the $w_{ij}$ and $z_i$ are independent and zero-mean, we can show that
$$E\left(P_i x^{(i)} x^{(i)T} P_i^T\right) = \sum_{j=1}^{d} \sigma_j^2 E\left(P_i v_j v_j^T P_i^T\right) + E\left(P_i z_i z_i^T P_i^T\right). \qquad (12)$$
Let's first analyze a single term $C_1 = E(P_i v_1 v_1^T P_i^T)$. If we consider $P_i v_1$ as a random variable, then we can define
$$P_{v_1}(P_i v_1) = \langle v_1, P_i v_1 \rangle \in \mathbb{R}^1 \qquad (13)$$
$$P_{v_1^\perp}(P_i v_1) = (\langle v_2, P_i v_1 \rangle, \ldots, \langle v_p, P_i v_1 \rangle)^T \in \mathbb{R}^{p-1} \qquad (14)$$
We will abbreviate $P_{v_1}(P_i v_1)$ by $P_v$, and $P_{v_1^\perp}(P_i v_1)$ by $P_{v^\perp}$ when no confusion will arise. Then,
$$P_i v_1 = V \begin{pmatrix} P_v \\ P_{v^\perp} \end{pmatrix} \qquad (15)$$
$$C_1 = V \, E\begin{pmatrix} \|P_v\|^2 & P_v P_{v^\perp}^T \\ P_{v^\perp} P_v^T & P_{v^\perp} P_{v^\perp}^T \end{pmatrix} V^T \qquad (16)$$
We now evaluate the four terms in the block matrix in (16). For the first term $E(\|P_v\|^2)$, our earlier analysis in (9) gives $\|P_v\| = \cos^2\theta$, where $\theta$ is the principal angle between $v_1$ and the random $m$-space $P_i$ projects onto. Thus, from Lemma 2,
$$E\left(\|P_v\|^2\right) = E\left((\cos^2\theta)^2\right) = \frac{m^2}{p^2} + \frac{2m(p-m)}{p^3 + 2p^2} \qquad (17)$$
We use $k_1$ to represent this quantity: $k_1 = \frac{m^2}{p^2} + \frac{2m(p-m)}{p^3 + 2p^2}$.
To compute the remaining three terms in (16), we take advantage of Lemma 3. From Lemma 3, the distribution of $P_i v_1$ is rotationally symmetric about $v_1$. This implies that $E(P_v P_{v^\perp}^T) = 0_{1 \times (p-1)}$ and $E(P_{v^\perp} P_v^T) = 0_{(p-1) \times 1}$. Furthermore, since the distribution of $P_i v_1$ is rotationally symmetric about $v_1$, the distribution of $P_{v^\perp}$, the projection of $P_i v_1$ onto the orthogonal complement of $v_1$, is rotationally symmetric about 0. This implies that $E(P_{v^\perp} P_{v^\perp}^T)$ is a multiple of the identity. Now,
$$\mathrm{trace}\left(E\left(P_{v^\perp} P_{v^\perp}^T\right)\right) = E(\mathrm{trace}(C_1)) - E(\|P_v\|^2) = E(\|P_i v_1\|^2) - k_1 = \frac{m}{p} - k_1$$
since the expected norm squared of a random $m$-dimensional projection of a unit vector in $\mathbb{R}^p$ is well-known to be $\frac{m}{p}$. We let $k_2 = \frac{m}{p} - k_1$. Then, since $E(P_{v^\perp} P_{v^\perp}^T)$ is a multiple of the identity, we must have
$$E\left(P_{v^\perp} P_{v^\perp}^T\right) = \frac{k_2}{p-1} I_{p-1}$$
Then, from (16),
$$C_1 = V \, \mathrm{diag}\left(k_1, \frac{k_2}{p-1}, \ldots, \frac{k_2}{p-1}\right) V^T \qquad (18)$$
We may perform a similar analysis for each of the other terms $E(P_i v_j v_j^T P_i^T)$, resulting in the same answer, except with $k_1$ occupying the $j$th entry of the diagonal instead.
For the term $E(P_i z_i z_i^T P_i^T)$ in (12), because both $z_i$ and $\{e^i_j\}_{j=1}^{m}$ are random with completely isotropic distributions, we know that the distribution of $P_i z_i$ will also be isotropic. Thus, $E(P_i z_i z_i^T P_i^T)$ is also a multiple of the identity. Because $\mathrm{trace}(E(P_i z_i z_i^T P_i^T)) = E(\|P_i z_i\|^2) = \frac{m}{p} E(\|z_i\|^2) = \frac{m}{p} \, p \epsilon^2 = m \epsilon^2$, we have
$$E\left(P_i z_i z_i^T P_i^T\right) = \frac{m}{p} \epsilon^2 I \qquad (19)$$
From (12), (18), and (19), we have
$$E\left(P_i x^{(i)} x^{(i)T} P_i^T\right) = \sum_{j=1}^{d} \sigma_j^2 E\left(P_i v_j v_j^T P_i^T\right) + E\left(P_i z_i z_i^T P_i^T\right) = V \Sigma V^T + \frac{m}{p} \epsilon^2 I \qquad (20)$$
where $\Sigma$ is defined in (5).
Since the terms $\{P_i x^{(i)} x^{(i)T} P_i^T\}_{i=1}^{n}$ are i.i.d., Theorem 2 then follows from the law of large numbers.
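The block structure in (18) can likewise be sanity-checked by simulation (our own sketch): averaging $P_i v_1 v_1^T P_i^T$ over many random projections should give a matrix whose diagonal entry along $v_1$ is close to $k_1$ and whose remaining diagonal entries are close to $\frac{k_2}{p-1}$:

```python
import numpy as np

rng = np.random.default_rng(3)
p, m, trials = 15, 3, 50_000
v1 = np.zeros(p); v1[0] = 1.0                  # take v_1 to be the first basis vector

C1 = np.zeros((p, p))
for _ in range(trials):
    E = rng.normal(size=(p, m))
    Pv1 = E @ np.linalg.solve(E.T @ E, E.T @ v1)   # P_i v_1
    C1 += np.outer(Pv1, Pv1)
C1 /= trials

k1 = m**2 / p**2 + 2 * m * (p - m) / (p**3 + 2 * p**2)
k2 = m / p - k1
print(C1[0, 0], k1)                            # entry along v_1,     ~ k_1
print(np.diag(C1)[1:].mean(), k2 / (p - 1))    # remaining diagonal,  ~ k_2/(p-1)
```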
6. CONCLUSIONS
We have demonstrated, both through theoretical analysis and experimentally, that PCA performed on low-dimensional random projections of the data recovers both the center and the principal components of the original data quite well, indeed better than previous approaches in the literature for recovering principal components from
compressive sensing measurements.
7. REFERENCES
[1] I. Jolliffe, Principal Component Analysis, Springer-Verlag, NY, 2002.
[2] R. Baraniuk, "Compressive sensing," IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 118–121, Jul. 2007.
[3] J. Fowler, "Compressive-projection principal component analysis," IEEE Trans. on Image Processing, vol. 18, no. 10, pp. 2230–2242, Oct. 2009.
[4] M. Fazel et al., "Compressed sensing and robust recovery of low rank matrices," in ACSSC, Oct. 2008, pp. 1043–1047.
[5] H. Qi et al., "Using the kernel trick in compressive sensing: Accurate signal recovery from fewer measurements," in ICASSP, May 2011, pp. 3940–3943.
[6] S. Ji et al., "Multitask compressive sensing," IEEE Trans. on Signal Processing, vol. 57, no. 1, pp. 92–106, Jan. 2009.
[7] H. Qi and S. Hughes, "Technical report: Two observations on probability distribution symmetries for randomly projected data," arXiv, 2012.
[8] P. Frankl et al., "Some geometric applications of the beta distribution," Annals of the Institute of Stat. Math., vol. 42, no. 3, pp. 463–474, Sep. 1990.