INVARIANCE OF PRINCIPAL COMPONENTS UNDER LOW-DIMENSIONAL RANDOM PROJECTION OF THE DATA

Hanchao Qi, Shannon M. Hughes
Department of Electrical, Computer, and Energy Engineering
University of Colorado at Boulder

(This material is based upon work supported by the National Science Foundation under Grant CCF-1117775.)

ABSTRACT

Algorithms that can efficiently recover the principal components of high-dimensional data from compressive sensing measurements (e.g. low-dimensional random projections) of it have been an important topic of recent interest in the literature. In this paper, we show that, under certain conditions, normal principal component analysis (PCA) on such low-dimensional random projections of the data actually returns the same result as PCA on the original data set would. In particular, as the number of data samples increases, the center of the randomly projected data converges to the true center of the original data (up to a known scaling factor), and the principal components converge to the true principal components of the original data as well, even if the dimension of each random subspace used is very low. Indeed, experimental results verify that this approach estimates the original center and principal components very well for both synthetic and real-world datasets, including hyperspectral data. Its performance is even superior to that of other algorithms recently developed in the literature for this purpose.

Index Terms— Compressive sensing, Principal component analysis, Random projections, Low-rank matrix recovery, Hyperspectral data

1. INTRODUCTION

Principal component analysis (PCA) [1] selects the best low-dimensional linear projection of a set of data points, in the sense of minimizing the error between the original and projected data. It can also be thought of as finding the linear subspace that maximally preserves the variance of, or in some cases the information in, the data. PCA is frequently used for dimensionality reduction, or as a summary of interesting features of the data. It is also often used as a precursor to signal classification.

To obtain the principal components (PCs) of data, one typically centers the data and then computes the eigenvectors of the data's covariance matrix, using full knowledge of all data. However, in this paper, we will show that when the PCA algorithm is instead applied to low-dimensional random projections of each data point, as are acquired in many compressive sensing (CS) measurement schemes [2], it will often return the same center (up to a known scaling factor) and principal components as it would for the original dataset. More precisely, we show that the center of the low-dimensional random projections of the data converges to the true center of the original data (up to a known scaling factor) almost surely as the number of data samples increases. We then show that under certain conditions the top d eigenvectors of the randomly projected data's covariance matrix converge to the true d principal components of the original data as the number of data samples increases. Moreover, both of the above conclusions hold regardless of how few dimensions we use in our random projections (i.e. how few CS measurements we take of each sample).

Furthermore, experimentally, we find that on both synthetic and real-world examples, including hyperspectral data, normal PCA on low-dimensional random projections of the data recovers the center and the principal components of the original data very well.
In fact, the principal components recovered using normal PCA on the randomly projected data are significantly more accurate than those returned by other algorithms previously designed for this task, such as Compressive-Projection Principal Component Analysis (CPPCA) [3]. Normal PCA on the randomly projected data thus provides both a simpler and a more effective approach for recovering principal components from CS measurements.

We further note that the principal component recovery problem considered here is connected to the problem of low-rank matrix recovery from CS measurements (see e.g. [4]). Our work here thus provides a simple and very computationally efficient way of solving certain types of CS low-rank matrix recovery problems. Moreover, due to the extreme simplicity of our approach (in contrast with other strategies for PC recovery), it has the advantage of being readily extended to other related tasks. One such task is "compressed kernel PCA", the computation of kernel PCA on the original data using only CS measurements of it, which we will present in an upcoming work.

Section 2 presents the notation and assumptions that we will use throughout the paper. Section 3 presents two theorems (with proofs deferred to Section 5) verifying that computing the center and PCs of the original data using PCA on low-dimensional random projections is possible. In Section 4, we show experimental results that verify our theoretical conclusions, while benchmarking this approach against other approaches in the literature.

2. NOTATION AND ASSUMPTIONS

We assume that our original data are centered at x̄ ∈ R^p with principal components v_1, ..., v_d ∈ R^p. Each data sample can then be represented as

    x^{(i)} = x̄ + Σ_{j=1}^d w_{ij} σ_j v_j + z_i,

where the {w_i}_{i=1}^n are drawn i.i.d. from N(0, I_d) and the {z_i}_{i=1}^n are drawn i.i.d. from N(0, ε² I_p) with small ε ≥ 0. The vectors v_1, ..., v_d are assumed orthonormal and span a d-dimensional subspace of R^p.

Suppose we have n such data samples x^{(1)}, ..., x^{(n)} ∈ R^p. We then draw m i.i.d. random vectors e_{ij}, j = 1, ..., m, from N(0, I_p) for each x^{(i)}. We will project each x^{(i)} onto the subspace spanned by its associated e_{ij}. Here, it is important to use different random vectors for different i: if we used the same random subspace for projecting every x^{(i)}, then we could only recover principal components within that subspace. We will assume that mn ≥ p so that the e_{ij} together span R^p almost surely. Denoting by E^i ∈ R^{p×m} the matrix with columns {e_{ij}}_{j=1}^m, the projection matrix onto the subspace spanned by the columns of E^i is

    P_i = E^i (E^{iT} E^i)^{-1} E^{iT} ∈ R^{p×p}.

The random projection of each x^{(i)} is then P_i x^{(i)}. We note that if we have typical CS measurements of each x^{(i)} of the form m_i = E^{iT} x^{(i)}, then for each i,

    P_i x^{(i)} = E^i (E^{iT} E^i)^{-1} m_i.

Hence, each random projection P_i x^{(i)} can be recovered directly from the measurements m_i and the measurement matrix E^i, without knowledge of the original data x^{(i)}.
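To make this last point concrete, the following minimal NumPy sketch (an illustration only; the variable names are not from the paper) verifies that P_i x^{(i)} can be formed from the measurements m_i and the matrix E^i alone, without access to x^{(i)}:

import numpy as np

rng = np.random.default_rng(0)
p, m = 50, 5                         # ambient dimension, measurements per sample
x = rng.normal(size=p)               # a hypothetical data sample x^(i)

E = rng.normal(size=(p, m))          # measurement matrix E^i with i.i.d. N(0,1) columns e_ij
y = E.T @ x                          # CS measurements m_i = E^{iT} x^(i)

# P_i x^(i) = E^i (E^{iT} E^i)^{-1} m_i, formed from y and E only
Px_from_meas = E @ np.linalg.solve(E.T @ E, y)

# Sanity check against the explicit projection P_i = E^i (E^{iT} E^i)^{-1} E^{iT}
P = E @ np.linalg.solve(E.T @ E, E.T)
assert np.allclose(Px_from_meas, P @ x)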
3. RECOVERY OF CENTER AND PRINCIPAL COMPONENTS VIA PCA ON RANDOM PROJECTIONS

In this section, we will argue that using normal PCA to find the center and principal components of {P_i x^{(i)}}_{i=1}^n recovers the center (up to a scaling factor) and the principal components of the original data {x^{(i)}}_{i=1}^n. We start by presenting the following theorem, whose proof is deferred to Section 5.

Theorem 1. Suppose {P_i}_{i=1}^n, {x^{(i)}}_{i=1}^n, etc. are defined as in Section 2 with fixed 1 ≤ m < p. Then as the number of data samples n → ∞, the center of the randomly projected data converges almost surely to m/p times the true center x̄ of the original data:

    lim_{n→∞} (1/n) Σ_{i=1}^n P_i x^{(i)} = (m/p) x̄.    (1)

This result allows recovery of the original data's center, and we note that the convergence holds regardless of the number of measurements m taken per sample.

Once we have found an estimate x̄̂ of the center of the original data, we can adjust the random projections P_i x^{(i)} to reflect centering the original data. This involves either replacing P_i x^{(i)} with P_i(x^{(i)} − x̄̂) or, if we are working with CS measurements, replacing m_i with

    m_i^{centered} = E^{iT}(x^{(i)} − x̄̂) = m_i − E^{iT} x̄̂.    (2)

In the rest of this section, we will assume that the original data samples are centered around the origin, i.e. that we are working with x^{(i)} − x̄̂ instead of x^{(i)} and/or m_i^{centered} instead of m_i.

In typical PCA, the principal components of centered data are found as the eigenvectors of the empirical covariance matrix of the data, C_emp = (1/n) Σ_{i=1}^n x^{(i)} x^{(i)T}, which often comes very close to the true underlying covariance C = Σ_{i=1}^d σ_i² v_i v_i^T. We propose that the covariance matrix C_P of the projected data {P_i x^{(i)}}_{i=1}^n, defined as

    C_P = (1/n) Σ_{i=1}^n P_i x^{(i)} (P_i x^{(i)})^T = (1/n) Σ_{i=1}^n E^i (E^{iT} E^i)^{-1} m_i m_i^T (E^{iT} E^i)^{-1} E^{iT},    (3)

will also be similar to C, allowing us to obtain the same principal components v_1, ..., v_d from C_P.

To gain some intuition for this, consider a simple example in which we take two-dimensional random projections of data points in R^p generated from one principal component, for varying p (see Fig. 1). We can see that in all cases the projected points have the same principal component as the original points, and that the projected points are nicely symmetrically distributed around the original principal component. Although random projection scatters the energy of a principal component into other directions, the direction with the most energy remains the original direction, even for a small ratio m/p. This observation is verified in the following theorem, whose proof is also deferred to Section 5.

Fig. 1. Randomly projecting the data preserves the principal component. In each of the three panels, (a) p = 3, (b) p = 10, (c) p = 50, there are n = 3000 points uniformly distributed on a line in R^3, R^10, and R^50 respectively. We randomly project each point onto a two-dimensional random subspace and view two dimensions of the result (the original principal component's and one other). Blue stars are the original points and red circles are the projected points. We observe that the original principal component remains intact even for a very small ratio m/p.

Theorem 2. Suppose the data samples {x^{(i)}}_{i=1}^n, centered at x̄ = 0, and the orthonormal principal components v_1, ..., v_d ∈ R^p are as defined in Section 2. Let us select v_{d+1}, ..., v_p so that v_1, ..., v_p is an orthonormal basis for R^p, and let V ∈ R^{p×p} be the matrix with i-th column v_i. Then

    lim_{n→∞} C_P = V Σ V^T + ε² (m/p) I,    (4)

where Σ is the diagonal matrix

    Σ = diag( σ_1² k_1 + Σ_{j≤d, j≠1} σ_j² k_2/(p−1), ..., σ_d² k_1 + Σ_{j≤d, j≠d} σ_j² k_2/(p−1), Σ_{j=1}^d σ_j² k_2/(p−1), ..., Σ_{j=1}^d σ_j² k_2/(p−1) ),    (5)

with k_1 = m²/p² + 2m(p−m)/(p³+2p²) and k_2 = m/p − k_1.

Corollary 1. As n → ∞, taking the top ℓ ≤ d eigenvectors of C_P will recover the true ℓ principal components v_1, ..., v_ℓ of the original data as long as k_1 − k_2/(p−1) > 0.

Proof. Clearly, v_1, ..., v_ℓ are recovered from C_P as n → ∞ as long as Σ_rr > Σ_ss for all r ≤ ℓ, s > ℓ. We note that

    Σ_rr − Σ_ss = σ_r² (k_1 − k_2/(p−1))              if s > d,
    Σ_rr − Σ_ss = (σ_r² − σ_s²) (k_1 − k_2/(p−1))     if s ≤ d.

The conclusion then immediately follows.

In fact, we note that k_1 ≈ m²/p² and k_2/(p−1) ≈ m/p², so the condition of the corollary will generally be satisfied.
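As an illustration of the estimators suggested by Theorems 1 and 2, the following sketch computes them directly from the randomly projected data (this is illustrative code only, not the implementation used for the experiments; the iterative refinements described in the next paragraphs are omitted, and all function and variable names are for illustration):

import numpy as np

def pca_from_random_projections(X, m, d, seed=0):
    # Estimate the center and top-d PCs of the rows of X (n x p) using only
    # m-dimensional random projections of each sample (cf. Theorems 1 and 2).
    rng = np.random.default_rng(seed)
    n, p = X.shape

    def proj(E, v):
        return E @ np.linalg.solve(E.T @ E, E.T @ v)                 # v -> P_i v

    Es = [rng.normal(size=(p, m)) for _ in range(n)]                 # E^i, fresh per sample
    PX = np.stack([proj(E, x) for E, x in zip(Es, X)])               # P_i x^(i)

    center_hat = (p / m) * PX.mean(axis=0)                           # Theorem 1

    # Re-center: replace P_i x^(i) by P_i (x^(i) - center_hat), cf. Eq. (2)
    PXc = np.stack([Px - proj(E, center_hat) for E, Px in zip(Es, PX)])

    # Top-d eigenvectors of C_P estimate v_1, ..., v_d (Theorem 2, Corollary 1)
    C_P = PXc.T @ PXc / n
    _, U = np.linalg.eigh(C_P)                                       # ascending eigenvalues
    return center_hat, U[:, -d:][:, ::-1]                            # PCs in descending order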
Although the center and principal components of the original data can be recovered directly from PCA on the randomly projected data, we found that the results were improved slightly in practice by iterative procedures, and it is these procedures that we use in our experimental results in Section 4.

For the center estimator, we found that iteratively re-estimating the center and re-centering the data using Eq. (2) more than once improved the accuracy of the result. For the principal components, a small improvement is obtained by estimating the principal components v_1, ..., v_ℓ one by one instead of all at once. To do this, we obtain v̂_1 as the first eigenvector from Eq. (3) as expected, but then use least squares as in [3, 5] to estimate the coefficient β_{i1} = w_{i1} σ_1 of v̂_1 for each original data sample x^{(i)}, obtaining β̂_{i1}. We then subtract β̂_{i1} P_i v̂_1 from each P_i x^{(i)} (or, equivalently, subtract β̂_{i1} E^{iT} v̂_1 from each m_i) to obtain the randomly projected data after removal of the first principal component from the original data set. We then repeat the process ℓ − 1 more times, each time estimating the first principal component of the deflated data and subtracting it, to recover v̂_2, ..., v̂_ℓ.
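A sketch of this one-by-one estimation follows (illustrative code and naming only; the particular least-squares formula for β̂ shown here is one natural choice in the spirit of [3, 5], not necessarily the exact formulation used in those works):

import numpy as np

def estimate_pcs_one_by_one(meas, Es, ell):
    # meas[i]: CS measurements m_i (length m); Es[i]: measurement matrix E^i (p x m).
    # Returns estimates of the first ell principal components, estimated one at a
    # time, deflating each estimated component from the projected data in between.
    n, p = len(meas), Es[0].shape[0]
    PX = np.stack([E @ np.linalg.solve(E.T @ E, y) for E, y in zip(Es, meas)])  # P_i x^(i)
    V_hat = []
    for _ in range(ell):
        C_P = PX.T @ PX / n
        _, U = np.linalg.eigh(C_P)
        v = U[:, -1]                                    # leading eigenvector of current C_P
        V_hat.append(v)
        for i, E in enumerate(Es):
            Ev = E.T @ v                                # E^{iT} v_hat
            beta = (Ev @ (E.T @ PX[i])) / (Ev @ Ev)     # least-squares coefficient beta_hat
            PX[i] -= beta * (E @ np.linalg.solve(E.T @ E, Ev))   # remove beta_hat * P_i v_hat
    return np.column_stack(V_hat)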
4. EXPERIMENTAL VERIFICATION

Now that we have shown that regular PCA on random projections permits recovery of the original data's center and principal components, we present experimental verification of this. For our first experiment, we synthetically generate data samples {x^{(i)}}_{i=1}^n ∈ R^100 with 5 significant underlying principal components. Using the notation of Section 2, the five significant principal components have (σ_1, σ_2, σ_3, σ_4, σ_5) = (20, 15, 10, 8, 6), and ε = 1. Each component of the center is drawn from a uniform distribution on [0, 10).

Fig. 2(a) shows how the center estimated by normal PCA on the randomly projected data (with the iterative enhancement noted above, and scaled by p/m) compares to the true center for varying measurement ratio m/p. The error is measured as the distance between the estimated and true centers, normalized by the true center's magnitude. We verify that this error becomes very small for large n, even for small numbers of measurements m.

To evaluate the error for the principal components, we use the magnitude of the normalized inner product between each principal component obtained from the random projections and the corresponding true principal component; a value of one thus indicates the best possible estimate. Figs. 2(b,c,d) show the results for the first 5 significant principal components for varying measurement ratio m/p, varying number of samples n, and varying noise level ε/σ_1, respectively. We see that the principal components are indeed recovered from PCA on the randomly projected data if n is sufficiently large.

Fig. 2. Results for synthetic data. (a) Plot of the normalized distance between the estimated center and the true center for varying n and m/p. (b,c,d) Plots of the normalized inner product magnitude between estimated PCs and the corresponding true PCs for (b) varying measurement ratio m/p with n = 2000, (c) varying n with m/p = 0.2, and (d) varying noise ratio ε/σ_1.

Second, we examine the Lankershim Boulevard data, a set of real-world videos of traffic passing through a busy intersection. Considering individual frames of video as data points, we expect these data to have a low-rank-plus-sparse structure, represented by a collection of strong principal components plus sparse anomalies. Here we compare the first 5 principal components as estimated using normal PCA on the randomly projected data vs. using CPPCA [3], another method for estimating principal components from CS measurements. Due to the large size of each frame of the raw video, we have resized each frame to 24 × 32 pixels. We have also centered the data before proceeding, since CPPCA cannot estimate the center. The results, in the form of the normalized inner product between the estimated and true principal components, are shown in Fig. 3. We see that using normal PCA to estimate the principal components for this dataset results in more accurate estimates than CPPCA [3].

Fig. 3. Results for Lankershim Boulevard video. The normalized inner product magnitude between the first 5 estimated principal components and the true first 5 principal components for (a) normal PCA on the randomly projected data vs. (b) Compressive-Projection Principal Component Analysis (CPPCA) [3].

Finally, we use the estimated principal components to reconstruct hyperspectral images. We use the "Cuprite" and "Jasper Ridge" image datasets, with p = 224 spectral bands and n = 10000 samples each. In each case, we use normal PCA on random projections of the data to estimate the original data's principal components, followed by least squares estimation of the coefficients of each piece of data within the principal component basis to estimate the original data point (see [3, 5]). We compare the average SNR of the resulting reconstructed hyperspectral images (see [3] for details) with that obtained by reconstructing using CPPCA [3] and MTBCS [6]. The average SNR of the reconstructed hyperspectral images as a function of the measurement ratio m/p is shown in Fig. 4. We can see that the SNR using our approach is much higher than that using the other two methods.

Fig. 4. Plots of the average SNR of reconstructed hyperspectral images for various measurement ratios: (a) Cuprite dataset, (b) Jasper Ridge dataset. We see that using normal PCA on random projections to estimate the principal components of hyperspectral data outperforms both Compressive-Projection Principal Component Analysis and Multi-Task Bayesian Compressive Sensing for hyperspectral data reconstruction.

5. THEORETICAL VERIFICATION

In this section, we present the proofs of Theorems 1 and 2 from Section 3. We start by introducing two lemmas used in the proof of Theorem 1.

Lemma 1 (From [7]; symmetry of the distribution of P_i x under reflection across x). Suppose x is a fixed point in R^p and let P_i x be a random vector with P_i defined as in Section 2. Define the reflection operator R_x as

    R_x(y) = y + 2(⟨y, x̂⟩ x̂ − y),    (6)

where x̂ = x/‖x‖. Then the distribution of P_i x is the same as the distribution of R_x(P_i x).

Lemma 2 (Theorem 1.1 of [8]). Let e_1, ..., e_m ∈ R^p be m points drawn i.i.d. from N(0, I_p). If m < p, then the vectors {e_i}_{i=1}^m span an m-dimensional linear subspace of R^p almost surely; this subspace is then called a random m-space in R^p. Let H be a random m-space in R^p, let L be a fixed 1-space in R^p, and let θ be the principal angle between H and L. Then the random variable cos²θ has the beta distribution β(m/2, (p−m)/2).
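Lemma 2 is easy to check by simulation; the short Monte Carlo sketch below (an illustration, with arbitrary parameter choices) confirms that the empirical mean of cos²θ is close to m/p and that its second moment is close to the quantity k_1 = m²/p² + 2m(p−m)/(p³+2p²) that appears in the proof of Theorem 2, as the Beta(m/2, (p−m)/2) moments predict.

import numpy as np

rng = np.random.default_rng(0)
p, m, trials = 50, 5, 20000

# cos^2(theta) between a fixed line (take L = span(e_1)) and a random m-space
# equals e_1^T P e_1, where P projects onto the span of m i.i.d. N(0, I_p) vectors.
e1 = np.zeros(p)
e1[0] = 1.0
cos2 = np.empty(trials)
for t in range(trials):
    E = rng.normal(size=(p, m))
    cos2[t] = e1 @ (E @ np.linalg.solve(E.T @ E, E.T @ e1))

k1 = m**2 / p**2 + 2 * m * (p - m) / (p**3 + 2 * p**2)
print(cos2.mean(), m / p)         # both approximately 0.1
print((cos2**2).mean(), k1)       # both approximately k_1 (the Beta second moment)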
Proof of Theorem 1. Because the {P_i x^{(i)}}_{i=1}^n are i.i.d., we evaluate E(P_i x^{(i)}) so that we may use the law of large numbers to show (1). Because w_{ij}, P_i, and z_i are independent with E(w_{ij}) = 0 and E(z_i) = 0,

    E(P_i x^{(i)}) = E(P_i x̄) + Σ_{j=1}^d σ_j E(w_{ij}) E(P_i v_j) + E(P_i) E(z_i) = E(P_i x̄).    (7)

From Lemma 1, the distribution of P_i x̄ is the same as the distribution of R_x̄(P_i x̄). Thus E(P_i x̄) = E(R_x̄(P_i x̄)), and hence

    E(P_i x̄) = (1/2) E( P_i x̄ + R_x̄(P_i x̄) ) = E( (⟨P_i x̄, x̄⟩ / ‖x̄‖²) x̄ ).    (8)

Now, P_i x̄ is the projection of x̄ onto a random m-space. Suppose the principal angle between this space and the span of x̄ is θ. Then

    ⟨P_i x̄, x̄⟩ = ‖P_i x̄‖ ‖x̄‖ cos θ = ‖x̄‖² cos²θ.    (9)

Thus, from Lemma 2 and (9), we have

    E( ⟨P_i x̄, x̄⟩ / ‖x̄‖² ) = E(cos²θ) = m/p.    (10)

Combining (7), (8), and (10), we have E(P_i x^{(i)}) = (m/p) x̄, and Theorem 1 follows from the law of large numbers.

To prove Theorem 2, we first introduce the following lemma.

Lemma 3 (From [7]; symmetry of the distribution of P_i x under rotation about x). Suppose x is a fixed point in R^p and let P_i be defined as in Section 2. Let V ∈ R^{p×p} be an orthogonal matrix with first column x/‖x‖, and let

    Q'_x = V [ 1, 0_{1×(p−1)} ; 0_{(p−1)×1}, Q ] V^T,    (11)

where Q is in the special orthogonal group SO_{p−1}, so that Q'_x represents an arbitrary rotation of R^p about x. Then the distribution of P_i x is the same as the distribution of Q'_x(P_i x).

Proof of Theorem 2. Since the data are assumed centered with x̄ = 0, we have x^{(i)} = Σ_{j=1}^d w_{ij} σ_j v_j + z_i. Thus, since all the w_{ij} and z_i are independent and zero-mean, we can show that

    E( P_i x^{(i)} x^{(i)T} P_i^T ) = Σ_{j=1}^d σ_j² E( P_i v_j v_j^T P_i^T ) + E( P_i z_i z_i^T P_i^T ).    (12)

Let us first analyze a single term C_1 = E(P_i v_1 v_1^T P_i^T). If we consider P_i v_1 as a random variable, then we can define

    P_{v_1}(P_i v_1) = ⟨v_1, P_i v_1⟩ ∈ R^1,    (13)
    P_{v_1^⊥}(P_i v_1) = ( ⟨v_2, P_i v_1⟩, ..., ⟨v_p, P_i v_1⟩ )^T ∈ R^{p−1}.    (14)

We will abbreviate P_{v_1}(P_i v_1) by P_v and P_{v_1^⊥}(P_i v_1) by P_{v⊥} when no confusion will arise. Then

    P_i v_1 = V ( P_v ; P_{v⊥} ),    (15)

    C_1 = V E[ ‖P_v‖², P_v P_{v⊥}^T ; P_{v⊥} P_v, P_{v⊥} P_{v⊥}^T ] V^T,    (16)

where the expectation is taken of the 2×2 block matrix entrywise. We now evaluate the four blocks in (16). For the first block E(‖P_v‖²), the same analysis as in (9), applied to the unit vector v_1, gives P_v = ⟨v_1, P_i v_1⟩ = cos²θ, where θ is the principal angle between v_1 and the random m-space that P_i projects onto. Thus, from Lemma 2,

    E(‖P_v‖²) = E( (cos²θ)² ) = m²/p² + 2m(p−m)/(p³+2p²).    (17)

We use k_1 to denote this quantity: k_1 = m²/p² + 2m(p−m)/(p³+2p²).

To compute the remaining three blocks in (16), we take advantage of Lemma 3. From Lemma 3, the distribution of P_i v_1 is rotationally symmetric about v_1. This implies that E(P_v P_{v⊥}^T) = 0_{1×(p−1)} and E(P_{v⊥} P_v^T) = 0_{(p−1)×1}. Furthermore, since the distribution of P_i v_1 is rotationally symmetric about v_1, the distribution of P_{v⊥}, the projection of P_i v_1 onto the orthogonal complement of v_1, is rotationally symmetric about 0. This implies that E(P_{v⊥} P_{v⊥}^T) is a multiple of the identity. Now,

    trace E(P_{v⊥} P_{v⊥}^T) = trace(C_1) − E(‖P_v‖²) = E(‖P_i v_1‖²) − k_1 = m/p − k_1,

since the expected squared norm of a random m-dimensional projection of a unit vector in R^p is well known to be m/p. We let k_2 = m/p − k_1. Then, since E(P_{v⊥} P_{v⊥}^T) is a multiple of the identity, we must have E(P_{v⊥} P_{v⊥}^T) = (k_2/(p−1)) I_{p−1}. Then, from (16),

    C_1 = V diag( k_1, k_2/(p−1), ..., k_2/(p−1) ) V^T.    (18)

We may perform a similar analysis for each of the other terms E(P_i v_j v_j^T P_i^T), resulting in the same answer except with k_1 occupying the j-th entry of the diagonal instead.

For the term E(P_i z_i z_i^T P_i^T) in (12), because both z_i and {e_{ij}}_{j=1}^m are random with completely isotropic distributions, the distribution of P_i z_i is also isotropic. Thus E(P_i z_i z_i^T P_i^T) is likewise a multiple of the identity. Because

    trace E(P_i z_i z_i^T P_i^T) = E(‖P_i z_i‖²) = (m/p) E(‖z_i‖²) = (m/p) p ε² = m ε²,

we have

    E(P_i z_i z_i^T P_i^T) = ε² (m/p) I.    (19)

From (12), (18), and (19), we have

    E( P_i x^{(i)} x^{(i)T} P_i^T ) = Σ_{j=1}^d σ_j² E( P_i v_j v_j^T P_i^T ) + E( P_i z_i z_i^T P_i^T ) = V Σ V^T + ε² (m/p) I,    (20)

where Σ is defined in (5). Since the terms {P_i x^{(i)} x^{(i)T} P_i^T}_{i=1}^n are i.i.d., Theorem 2 then follows from the law of large numbers.
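The key identity (18) can also be verified numerically. The sketch below (illustrative only, with arbitrarily chosen sizes) estimates C_1 = E(P_i v_1 v_1^T P_i^T) by Monte Carlo with v_1 = e_1, so that V = I and (18) predicts a diagonal matrix with k_1 in the first entry and k_2/(p−1) elsewhere.

import numpy as np

rng = np.random.default_rng(1)
p, m, trials = 20, 4, 50000

k1 = m**2 / p**2 + 2 * m * (p - m) / (p**3 + 2 * p**2)
k2 = m / p - k1

v1 = np.zeros(p)
v1[0] = 1.0                                        # take v_1 = e_1, so V = I in (18)
C1 = np.zeros((p, p))
for _ in range(trials):
    E = rng.normal(size=(p, m))
    Pv1 = E @ np.linalg.solve(E.T @ E, E.T @ v1)   # P_i v_1
    C1 += np.outer(Pv1, Pv1)
C1 /= trials

target = np.diag([k1] + [k2 / (p - 1)] * (p - 1))  # right-hand side of (18) with V = I
print(np.abs(C1 - target).max())                   # small, and shrinks as trials grows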
6. CONCLUSIONS

We have demonstrated, both through theoretical analysis and experimentally, that PCA performed on low-dimensional random projections of the data recovers both the center and the principal components of the original data quite well, indeed better than previous approaches in the literature for recovering principal components from compressive sensing measurements.

7. REFERENCES

[1] I. Jolliffe, Principal Component Analysis, Springer-Verlag, NY, 2002.
[2] R. Baraniuk, "Compressive sensing," IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 118–121, Jul. 2007.
[3] J. Fowler, "Compressive-projection principal component analysis," IEEE Trans. on Image Processing, vol. 18, no. 10, pp. 2230–2242, Oct. 2009.
[4] M. Fazel et al., "Compressed sensing and robust recovery of low rank matrices," in ACSSC, Oct. 2008, pp. 1043–1047.
[5] H. Qi et al., "Using the kernel trick in compressive sensing: Accurate signal recovery from fewer measurements," in ICASSP, May 2011, pp. 3940–3943.
[6] S. Ji et al., "Multitask compressive sensing," IEEE Trans. on Signal Processing, vol. 57, no. 1, pp. 92–106, Jan. 2009.
[7] H. Qi and S. Hughes, "Technical report: Two observations on probability distribution symmetries for randomly projected data," arXiv, 2012.
[8] P. Frankl et al., "Some geometric applications of the beta distribution," Annals of the Institute of Statistical Mathematics, vol. 42, no. 3, pp. 463–474, Sep. 1990.