Recursive Projected Modified Compressed Sensing for Undersampled Measurements (Long Version)

Brian Lois, Namrata Vaswani, and Chenlu Qiu Iowa State University, Ames IA
Abstract
We study the problem of recursively reconstructing a time sequence of sparse vectors St from measurements of
the form Mt = ASt + BLt where A and B are known measurement matrices, and Lt lies in a slowly changing low
dimensional subspace. We assume that the signal of interest (St ) is sparse, and has support which is correlated over
time. We introduce a solution which we call Recursive Projected Modified Compressed Sensing (ReProMoCS),
which exploits the correlated support change of St . We show that, under weaker assumptions than previous work,
with high probability, ReProMoCS will exactly recover the support set of St and the reconstruction error of St
is upper bounded by a small time-invariant value. A motivating application where the above problem occurs is in
functional MRI imaging of the brain to detect regions that are “activated” in response to stimuli. In this case both
measurement matrices are the same (i.e. A = B). The active region image constitutes the sparse vector St and
this region changes slowly over time. The background brain image changes are global but the amount of change
is very little and hence it can be well modeled as lying in a slowly changing low dimensional subspace, i.e. this
constitutes Lt .
I. INTRODUCTION
We study the problem of recursively reconstructing a time sequence of sparse vectors St from undersampled measurements corrupted by (possibly) large magnitude but low-dimensional noise. By low-dimensional we mean that the Lt's all lie in a low-dimensional subspace, which is allowed to change
slowly over time. Specifically we observe
Mt := ASt + BLt
where A and B are known matrices, and Lt is (potentially) large but low-dimensional and non-sparse
noise. This problem has applications in fMRI imaging where the “active” region of the brain is the sparse
signal of interest, and the background brain image is slowly changing low dimensional background. In
this case, B = A is the partial Fourier matrix. Another application is compressed sensing in the presence
of large but low-dimensional sensor noise. In this case B = I and A is the usual CS matrix.
Related Work. This work is related to [1], [2] which study the problem of decomposing an observed
matrix M = [M1 · · · Mt ] into the sum of a sparse matrix S = [S1 · · · St ] and a low rank matrix L =
[L1 · · · Lt ]. These papers can all be viewed as batch solutions to the problem studied here. A limitation
of these works is that they either require independence over time of the support of S [1] or stronger
restrictions on the sparsity pattern of S [2], [3]. A more detailed discussion is provided in Section IV-E.
Both of these are not practical assumptions for image analysis because foreground objects often move in
a correlated fashion over time. This can result in some rows of S having many non-zero entries, while
other rows have very few or no non-zero entries. These papers also do not consider the case where the
sparse vectors are undersampled. Other related works, some of which consider the under-sampled case,
include [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14].
Contribution. The problem of recursively reconstructing sparse vectors in large but low-dimensional
noise was studied in [15], where the solution called Recursive Projected CS was presented and analyzed.
This work extends these results in two ways. First, we study the case where the sparse and low rank
vectors may be under-sampled (multiplied by the matrices A and B). Second, we modify the algorithm
in [15] by using modified-CS [16] in place of the simple ℓ1 minimization (CS) step in ReProCS. This
allows us to prove the same performance guarantees under weaker assumptions as long as the support of
St changes slowly over time.
II. NOTATION AND BACKGROUND
A. Notation
Let [1 : n] := {1, 2, . . . , n}. For a set T ⊆ [1 : n], |T| denotes the number of elements in T. For a vector v, vi denotes the ith entry of v and vT denotes the vector formed from the entries of v indexed by T. We use ‖v‖p for the ℓp norm of v.
For a matrix M, M' denotes its transpose and M† denotes its Moore–Penrose pseudo-inverse. ‖M‖2 refers to the induced 2-norm of M, i.e. ‖M‖2 = max_{‖x‖2=1} ‖Mx‖2. For a Hermitian matrix H, we use H EVD= UΛU' to denote the eigenvalue decomposition, where U is a unitary matrix and Λ is diagonal with entries arranged in decreasing order. For Hermitian matrices H1 and H2, the notation H1 ⪯ H2 means that H2 − H1 is positive semi-definite. I is an identity matrix of appropriate size. For an indexing set
T and matrix M, MT is the matrix formed by retaining the columns of M indexed by T . Notice that
MT = M IT . For matrices M1 and M2 with the same number of rows, [M1 M2 ] is the matrix formed by
horizontally concatenating the matrices M1 and M2 . We use range(M ) to refer to the subspace spanned
by the columns of M .
Definition 2.1. We refer to a tall matrix P as a basis matrix if it satisfies P 0 P = I.
Definition 2.2. We use the notation Q = basis(C) to mean: Q is a matrix such that the columns of Q
form an orthonormal basis for range(C) i.e. Q0 Q = I and range(Q) = range(C).
Definition 2.3. The s-restricted isometry constant (RIC) [17], δs , for an n × m matrix Ψ is the smallest
real number satisfying
(1 − δs )kxk22 ≤ kΨxk22 ≤ (1 + δs )kxk22
for all s-sparse vectors x.
It follows easily that max_{|T|≤s} ‖(Ψ'_T Ψ_T)^{−1}‖2 ≤ 1/(1 − δs(Ψ)) [17].
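As a quick illustration of Definition 2.3, the following brute-force sketch computes δs for a small matrix by checking every support of size s. The Gaussian Ψ, its dimensions, and s = 2 are illustrative assumptions, not quantities from this paper; the enumeration is exponential in n, so this is feasible only for toy sizes.

import numpy as np
from itertools import combinations

def restricted_isometry_constant(Psi, s):
    # smallest delta with (1-delta)||x||^2 <= ||Psi x||^2 <= (1+delta)||x||^2 for all
    # s-sparse x, computed from the extreme singular values of every s-column submatrix
    n = Psi.shape[1]
    delta = 0.0
    for T in combinations(range(n), s):
        sv = np.linalg.svd(Psi[:, list(T)], compute_uv=False)
        delta = max(delta, sv[0] ** 2 - 1.0, 1.0 - sv[-1] ** 2)
    return delta

rng = np.random.default_rng(0)
Psi = rng.standard_normal((20, 40))
Psi /= np.linalg.norm(Psi, axis=0)          # unit-norm columns
print(restricted_isometry_constant(Psi, 2))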
Definition 2.4. We give some notation for random variables in this definition.
1) We let E[Z] denote the expectation of a random variable (r.v.) Z and E[Z|X] denote its conditional
expectation given another r.v. X.
2) Let B be a set of values that a r.v. Z can take. We use B e to denote the event Z ∈ B, i.e.
B e := {Z ∈ B}.
3) The probability of any event B e can be expressed as [18],
P(B e ) := E[IB (Z)].
where
I_B(Z) := 1 if Z ∈ B, and 0 otherwise,
is the indicator function on the set B.
4) For two events B e , B˜e , P(B e |B˜e ) refers to the conditional probability of B e given B˜e , i.e. P(B e |B˜e ) :=
P(B e , B˜e )/P(B˜e ).
5) For a r.v. X, and a set B of values that the r.v. Z can take, the notation P(B e |X) is defined as
P(B e |X) := E[IB (Z)|X].
Notice that P(B e |X) is a r.v. (it is a function of the r.v. X) that always lies between zero and one.
B. High Probability Tail Bounds for Sums of Independent Random Matrices
Lemma 2.5. Suppose that B is the set of values that the r.v.s X, Y can take. Suppose that D is a set of values that the r.v. X can take. For a 0 ≤ p ≤ 1, if P(B^e|X) ≥ p for all X ∈ D, then P(B^e|D^e) ≥ p as long as P(D^e) > 0.
The proof is in the Appendix.
The following lemma is an easy consequence of the chain rule of probability applied to a contracting
sequence of events.
Lemma 2.6. For a sequence of events E0^e, E1^e, . . . , Em^e that satisfy E0^e ⊇ E1^e ⊇ E2^e ⊇ · · · ⊇ Em^e, the following holds:
P(Em^e | E0^e) = ∏_{k=1}^{m} P(Ek^e | E_{k−1}^e).
Proof. P(Em^e|E0^e) = P(Em^e, E_{m−1}^e, . . . , E0^e | E0^e) = ∏_{k=1}^{m} P(Ek^e | E_{k−1}^e, E_{k−2}^e, . . . , E0^e) = ∏_{k=1}^{m} P(Ek^e | E_{k−1}^e).
Next, we state the matrix Hoeffding inequality [19, Theorem 1.3] which gives tail bounds for sums of
independent random matrices.
Theorem 2.7 (Matrix Hoeffding for a zero mean Hermitian matrix [19]). Consider a finite sequence {Zt} of independent, random, Hermitian matrices of size n × n, and let {At} be a sequence of fixed Hermitian matrices. Assume that each random matrix satisfies (i) P(Zt² ⪯ At²) = 1 and (ii) E(Zt) = 0. Then, for all ε > 0,
P( λmax(Σt Zt) ≤ ε ) ≥ 1 − n exp( −ε² / (8σ²) ),  where σ² := ‖Σt At²‖2.
Corollary 2.8 (Matrix Hoeffding conditioned on another random variable for a nonzero mean Hermitian matrix). Given an α-length sequence {Zt} of random Hermitian matrices of size n × n, a r.v. X, and a set D of values that X can take. Assume that, for all X ∈ D, (i) the Zt's are conditionally independent given X; (ii) P(b1 I ⪯ Zt ⪯ b2 I | X) = 1; and (iii) b3 I ⪯ (1/α) Σt E(Zt|X) ⪯ b4 I. Then for all ε > 0,
P( λmax( (1/α) Σt Zt ) ≤ b4 + ε | X ) ≥ 1 − n exp( −αε² / (8(b2 − b1)²) )  for all X ∈ D,
P( λmin( (1/α) Σt Zt ) ≥ b3 − ε | X ) ≥ 1 − n exp( −αε² / (8(b2 − b1)²) )  for all X ∈ D.
The proof is in the Appendix.
Corollary 2.9 (Matrix Hoeffding conditioned on another random variable for an arbitrary nonzero mean matrix). Given an α-length sequence {Zt} of random matrices of size n1 × n2, a r.v. X, and a set C of values that X can take. Assume that, for all X ∈ C, (i) the Zt's are conditionally independent given X; (ii) P(‖Zt‖2 ≤ b1 | X) = 1; and (iii) ‖(1/α) Σt E(Zt|X)‖2 ≤ b2. Then, for all ε > 0,
P( ‖(1/α) Σt Zt‖2 ≤ b2 + ε | X ) ≥ 1 − (n1 + n2) exp( −αε² / (32 b1²) )  for all X ∈ C.
The proof is in the Appendix.
C. Results from Linear Algebra
Davis and Kahan’s sin θ theorem [20] studies the rotation of eigenvectors by perturbation.
Theorem 2.10 (sin θ theorem [20]). Given two Hermitian matrices 𝒞 and ℋ satisfying
𝒞 = [E E⊥] [C 0; 0 C⊥] [E E⊥]',   ℋ = [E E⊥] [H B'; B H⊥] [E E⊥]'     (1)
where [E E⊥] is an orthogonal matrix. The two ways of representing 𝒞 + ℋ are
𝒞 + ℋ = [E E⊥] [C + H  B'; B  C⊥ + H⊥] [E E⊥]' = [F F⊥] [Λ 0; 0 Λ⊥] [F F⊥]'
where [F F⊥] is another orthogonal matrix. Let R := (𝒞 + ℋ)E − EC = ℋE. If λmin(C) > λmax(Λ⊥), then
‖(I − FF')E‖2 ≤ ‖R‖2 / (λmin(C) − λmax(Λ⊥)).
The above result bounds the amount by which the two subspaces range(E) and range(F ) differ as a
function of the norm of the perturbation kRk2 and of the gap between the minimum eigenvalue of C and
the maximum eigenvalue of Λ⊥ .
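The following small numerical sketch illustrates the theorem: it builds 𝒞 with a well-separated top block, adds a small Hermitian perturbation ℋ, and compares ‖(I − FF')E‖2 with the bound. All sizes and magnitudes below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
n, r = 30, 3
EE = np.linalg.qr(rng.standard_normal((n, n)))[0]           # orthonormal [E E_perp]
diag = np.concatenate([5 + rng.random(r), 0.1 * rng.random(n - r)])
Cmat = EE @ np.diag(diag) @ EE.T                             # script-C, top block C well separated
H = 0.05 * rng.standard_normal((n, n)); H = (H + H.T) / 2    # small Hermitian perturbation

w, V = np.linalg.eigh(Cmat + H)
order = np.argsort(w)[::-1]
F = V[:, order[:r]]                                          # top-r eigenvectors of C + H
lam_perp_max = w[order[r]]                                   # lambda_max(Lambda_perp)

E = EE[:, :r]
lhs = np.linalg.norm((np.eye(n) - F @ F.T) @ E, 2)
rhs = np.linalg.norm(H @ E, 2) / (diag[:r].min() - lam_perp_max)
print(lhs, rhs)                                              # lhs <= rhs by Theorem 2.10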
Next, we state Weyl’s theorem which bounds the eigenvalues of a perturbed Hermitian matrix, followed
by Ostrowski’s theorem.
Theorem 2.11 (Weyl [21]). Let C and H be two n × n Hermitian matrices. For each i = 1, 2, . . . , n we
have
λi (C) + λmin (H) ≤ λi (C + H) ≤ λi (C) + λmax (H)
Theorem 2.12 (Ostrowski [21]). Let H and W be n × n matrices, with H Hermitian and W nonsingular.
For each i = 1, 2 . . . n, there exists a positive real number θi such that λmin (W W 0 ) ≤ θi ≤ λmax (W W 0 )
and λi (W HW 0 ) = θi λi (H). Therefore,
λmin (W HW 0 ) ≥ λmin (W W 0 )λmin (H)
The following lemma is a consequence of Weyl’s theorem (Theorem 2.11) and the sin θ theorem
(Theorem 2.10)
Lemma 2.13. Suppose that two Hermitian matrices 𝒞 and ℋ can be decomposed as in (1), where [E E⊥] is an orthogonal matrix and C is a c × c matrix. Also suppose that the EVD of 𝒞 + ℋ is
𝒞 + ℋ EVD= [F F⊥] [Λ 0; 0 Λ⊥] [F F⊥]'.
If λmin(C) − ‖C⊥‖2 − ‖ℋ‖2 > 0, then
‖(I − FF')E‖2 ≤ ‖ℋ‖2 / (λmin(C) − ‖C⊥‖2 − ‖ℋ‖2).
Proof. By definition of EVD, [F F⊥] is an orthogonal matrix. By the sin θ theorem, if λmin(C) > λmax(Λ⊥), then ‖(I − FF')E‖2 ≤ ‖R‖2 / (λmin(C) − λmax(Λ⊥)) where R := ℋE. Clearly ‖R‖ ≤ ‖ℋ‖. Since the assumption implies λmin(C) > λmax(C⊥) ≤ ‖C⊥‖2 and C is a c × c matrix, λc+1(𝒞) = λmax(C⊥). By definition of EVD (eigenvalues arranged in decreasing order), and since Λ is a c × c matrix, λc+1(𝒞 + ℋ) = λmax(Λ⊥). By Weyl's theorem, λmax(Λ⊥) = λc+1(𝒞 + ℋ) ≤ λc+1(𝒞) + λmax(ℋ). Since λmax(ℋ) ≤ ‖ℋ‖2, the result follows.
Lemma 2.14. Suppose that P, P̂ and Q are three basis matrices. Also, P and P̂ are of the same size, Q'P = 0 and ‖(I − P̂P̂')P‖2 = ζ∗. Then,
1) ‖(I − P̂P̂')PP'‖2 = ‖(I − PP')P̂P̂'‖2 = ‖(I − PP')P̂‖2 = ‖(I − P̂P̂')P‖2 = ζ∗
2) ‖PP' − P̂P̂'‖2 ≤ 2‖(I − P̂P̂')P‖2 = 2ζ∗
3) ‖P̂'Q‖2 ≤ ζ∗
4) √(1 − ζ∗²) ≤ σi((I − P̂P̂')Q) ≤ 1
The proof is in the Appendix.
Lemma 2.15. Let B be any matrix and let Q(B) be a matrix such that the columns of Q(B) form an orthonormal basis for range(B). Then,
‖A'_T B‖2 ≤ ‖A'_T Q(B)‖2 ‖B‖2.
The proof is in the Appendix.
D. Compressive Sensing Result
Lemma 2.16 ([22], Lemma 4). Consider a sequence of vectors xt with support Nt ⊆ [1 : n] such that |Nt| ≤ s, and yt = Axt + βt where ‖βt‖2 ≤ ε. Suppose that at each time t at most sa elements enter the support of xt and at most sa elements leave it. Let T̂t be an estimate of the support of xt such that |T̂t \ Nt| ≤ s∆e and |Nt \ T̂t| ≤ s∆. For t ≥ 2, let x̂t be the solution to the optimization problem
min_{x̃} ‖(x̃)_{T̂t^c}‖1  s.t.  ‖yt − Ax̃‖2 ≤ ε
and set
T̂t+1 = N̂t = {i ∈ [1 : n] : |(x̂t)i| > ω}.
Here N̂t is the final estimate of the support of xt. This value also serves as the initial support estimate for time t + 1. Let ϱ be the smallest real number such that
‖xt − x̂t‖∞ ≤ (ϱ/√sa) ‖xt − x̂t‖2.
Notice that ϱ ≤ √sa because the infinity norm is always less than or equal to the two norm. Then,
1) All elements of the set {i ∈ Nt : |(xt)i| ≥ b1} will get detected (be included in N̂t) if δ_{s+s∆e+2s∆}(A) < 0.207 and b1 > ω + (ϱ/√sa) 7.50 ε.
2) There will be no false additions, and all the true removals from the support (T̂t \ Nt) will get deleted, if δ_{s+s∆e+2s∆}(A) < 0.207 and ω ≥ (ϱ/√sa) 7.50 ε.
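A minimal numerical sketch of the modified-CS step in Lemma 2.16 is given below, assuming the cvxpy package is available as the convex solver; the problem sizes, the Gaussian A, and the threshold ω are illustrative assumptions. It minimizes the ℓ1 norm of x outside the prior support estimate T̂ subject to the data constraint, then thresholds to obtain N̂t.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n, s = 40, 100, 8
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n); x_true[:s] = 1.0
eps = 0.01
noise = rng.standard_normal(m); noise *= eps / np.linalg.norm(noise)
y = A @ x_true + noise

T_hat = np.arange(1, s + 1)                       # prior support estimate, off by one index
outside = np.setdiff1d(np.arange(n), T_hat)       # complement of T_hat

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm1(x[outside])),
                  [cp.norm(y - A @ x, 2) <= eps])
prob.solve()

omega = 0.5                                       # support threshold
N_hat = np.flatnonzero(np.abs(x.value) > omega)   # new support estimate
print(N_hat)                                      # expected: indices 0..s-1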
III. PROBLEM FORMULATION
The measurement vector at time t, Mt , is an m-dimensional vector which is constituted as
Mt = ASt + BLt .
A and B are known m × n matrices where n can be greater than m. St is a sparse vector in Rn with
support size at most s and whose non-zero entries have magnitude at least Smin . Furthermore, the number
of elements that enter the support and the number of elements that exit the support at a given time are
both bounded by sa . Lt is a dense vector which lies in a slowly changing low-dimensional subspace of
Rn . Since all of the Lt ’s lie in a low dimensional subspace so do the vectors BLt . Therefore let L̃t = BLt ,
then the observed vector can be expressed as Mt = ASt + L̃t .
We assume that we have access to an accurate estimate of the subspace in which the first ttrain
vectors L̃t lie. That is, we have a basis matrix Q̂0 such that ‖(I − Q0Q0')Q̂0‖2 is small. Here
Q0 = basis([L̃1 , L̃2 , . . . , L̃ttrain ]). Our goal is to estimate St at every time t > ttrain .
Notation for St. Let Nt := {i ∈ [1 : n] : (St)i ≠ 0} be the support of St. At each time t > ttrain, Nt changes as
Nt = (Nt−1 ∪ Nadd,t) \ Ndel,t
where Nadd,t are the elements added to the support at time t, and Ndel,t are the elements deleted from the support at time t. Define
s := max_t |Nt|  and  sa := max_t max{|Nadd,t|, |Ndel,t|}.
Let Smin := min_t min_{i∈Nt} |(St)i| be a lower bound on the magnitude of any non-zero entry of St at any time. Let T̂t be an estimate of the support Nt. Denote by ∆e,t the elements of T̂t that are not actually in Nt, i.e. ∆e,t = T̂t \ Nt. Similarly, ∆t denotes the elements of Nt that are not in T̂t, i.e. ∆t = Nt \ T̂t.
Model on Lt.
1) The Lt's lie in a low-dimensional subspace that changes every so often. Let P(t) be a basis matrix for this subspace at time t. Then Lt = P(t) at where P(t) changes every so often. Denote the subspace change times by tj, j = 0, 1, 2, · · · , J (there are a maximum of J subspace change times). For tj ≤ t < tj+1, P(t) = Pj where Pj is an n × rj basis matrix with rj ≪ n and rj ≪ (tj+1 − tj).
2) At the change times tj, Pj changes as Pj = [Pj−1 Pj,new] where Pj,new is an n × cj,new basis matrix with columns orthogonal to the columns of Pj−1. Therefore rj = rj−1 + cj,new. We can then write at = [at,∗ ; at,new] conformal with the partition of Pj.
3) There exists a constant cmx such that 0 ≤ cj,new ≤ cmx.
4) The vector of coefficients at = P(t)' Lt is a zero mean bounded random variable. That is, E(at) = 0 and there is a constant γ∗ such that ‖at‖∞ ≤ γ∗ for all t. The covariance matrix Λt := Cov[at] = E(at at') is diagonal with λ− := mint λmin(Λt) > 0 and λ+ := maxt λmax(Λt) < ∞. Thus the condition number of any Λt is bounded by f := λ+/λ−.
5) Finally, the at's are mutually independent over time.
Definition 3.1. Define Qj := basis(BPj) and Qj,new := basis((I − Qj−1Q'j−1)BPj). Notice that Q'j−1 Qj,new = 0.
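A minimal simulation sketch of this measurement model is given below. For simplicity it assumes a single, non-changing subspace and illustrative sizes (not quantities from the paper); the support of St shifts by one index per frame, one example of slow, correlated support change.

import numpy as np

rng = np.random.default_rng(0)
n, m, r0, s, T = 200, 120, 5, 10, 300                # illustrative sizes (assumptions)
gamma_star, S_min = 10.0, 2.0
A = rng.standard_normal((m, n)) / np.sqrt(m)         # measurement matrix for S_t
B = A                                                # fMRI-type case where B = A
P0 = np.linalg.qr(rng.standard_normal((n, r0)))[0]   # basis for the subspace of the L_t's

support = np.arange(s)
M, S, L = [], [], []
for t in range(T):
    a_t = rng.uniform(-gamma_star, gamma_star, r0)   # bounded, zero-mean coefficients
    L_t = P0 @ a_t
    S_t = np.zeros(n)
    S_t[support] = S_min                             # non-zero entries of magnitude S_min
    M.append(A @ S_t + B @ L_t)                      # M_t = A S_t + B L_t
    S.append(S_t); L.append(L_t)
    support = (support + 1) % n                      # slow support change (one index per t)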
A. Slow Subspace Change
By slow subspace change we mean all of the following:
1) The delay between consecutive subspace change times, tj+1 − tj, is large enough.
2) The vector of coefficients along the new directions, at,new, is initially small, i.e.
max_{tj ≤ t < tj+α} ‖at,new‖∞ ≤ γnew,  with γnew ≪ γ∗ and γnew ≪ Smin,
but can increase gradually. This is modeled as follows. Split the interval [tj, tj+1 − 1] into α-length periods. We assume that
max_j max_{t ∈ [tj+(k−1)α, tj+kα−1]} ‖at,new‖∞ ≤ min(v^{k−1} γnew, γ∗)
for a v > 1 but not too large.
3) The number of newly added directions is small, i.e. cj,new ≤ cmx ≪ r0.
B. Matrix Coherence Parameter and its Relation with RIC
Definition 3.2. For an m × n matrix C and an m × p matrix A, define the incoherence coefficient κs,A(C) by
κs,A(C) := max_{|T|≤s} ‖A_T' basis(C)‖2
and let κs(C) := κs,I(C).
κs,A (C) can be thought of as a measure of the coherence between range(AT ) and range(C) for any
set T of size less than or equal to s. When A = I, κs (C) measures the denseness of columns of basis(C)
and hence in [15] it was referred to as the denseness coefficient.
We will use the notation κ2s,A (C) to mean (κs,A (C))2 .
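For small problem sizes, κ_{s,A}(C) can be computed directly from Definition 3.2 by enumerating all supports of size s; the sketch below does exactly that. It assumes C has full column rank so that a QR factorization yields basis(C), and it is exponential in n, so it is meant only as an illustration.

import numpy as np
from itertools import combinations

def kappa_s_A(C, A, s):
    # kappa_{s,A}(C) = max_{|T| <= s} || A_T' basis(C) ||_2
    Q = np.linalg.qr(C)[0]                # basis(C); assumes C has full column rank
    n = A.shape[1]
    best = 0.0
    for T in combinations(range(n), s):   # |T| = s attains the max over |T| <= s
        best = max(best, np.linalg.norm(A[:, list(T)].T @ Q, 2))
    return best

rng = np.random.default_rng(0)
A = rng.standard_normal((15, 30)) / np.sqrt(15)
C = rng.standard_normal((15, 3))
print(kappa_s_A(C, A, 2))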
Lemma 3.3. For any set T such that |T| ≤ s and matrices A and C,
‖A_T' C‖2 ≤ κs,A(C) ‖C‖2.
Proof. Let C QR= QR. Then ‖A_T' C‖ = ‖A_T' QR‖ ≤ ‖A_T' Q‖ ‖R‖ ≤ κs,A(C) ‖R‖ = κs,A(C) ‖C‖.
Lemma 3.4. For a basis matrix P (P'P = I) and a matrix A,
δs((I − PP')A) ≤ δs(A) + κs,A(P)².
The proof is in the appendix.
IV. RECURSIVE PROJECTED MODIFIED CS ALGORITHM AND ITS PERFORMANCE GUARANTEES
The Recursive Projected Modified CS algorithm is summarized in Algorithm 2. First we need the following definition.
Definition 4.1. Define the time intervals Ij,k := [tj + (k − 1)α, tj + kα − 1] for k = 1, . . . , K and Ij,K+1 := [tj + Kα, tj+1 − 1], where K is the algorithm parameter in Algorithm 2.
A. The Projection-PCA algorithm
Given a data matrix D, a basis matrix P and an integer r, projection-PCA (proj-PCA) applies PCA on Dproj := (I − PP')D, i.e., it computes the top r eigenvectors (the eigenvectors with the r largest eigenvalues) of (1/αD) Dproj Dproj'. Here αD is the number of column vectors in D. This is summarized in Algorithm 1.
If P = [.], then projection-PCA reduces to standard PCA, i.e. it computes the top r eigenvectors of (1/αD) DD'.
We should mention that the idea of projecting perpendicular to a partly estimated subspace has been
used in different contexts in past work [23], [4].
Algorithm 1 projection-PCA: R ← proj-PCA(D, P, r)
1) Projection: compute Dproj ← (I − PP')D.
2) PCA: compute (1/αD) Dproj Dproj' EVD= [R R⊥] [Λ 0; 0 Λ⊥] [R R⊥]', where R is an n × r basis matrix and αD is the number of columns in D.
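A short Python sketch of Algorithm 1 is given below (np.linalg.eigh and the variable names are implementation choices, not part of the paper; P may be passed as an n × 0 array, in which case the step reduces to standard PCA).

import numpy as np

def proj_pca(D, P, r):
    # projection-PCA: top-r eigenvectors of (1/alpha_D) * Dproj @ Dproj.T,
    # where Dproj = (I - P P') D
    n, alpha_D = D.shape
    Dproj = D - P @ (P.T @ D) if P.size else D      # avoid forming I - P P' explicitly
    eigvals, eigvecs = np.linalg.eigh((Dproj @ Dproj.T) / alpha_D)
    top = np.argsort(eigvals)[::-1][:r]             # indices of the r largest eigenvalues
    return eigvecs[:, top]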
Algorithm 2 Recursive Projected Modified CS (ReProMoCS)
Parameters: algorithm parameters ξ, ω, α, K; model parameters tj, r0, cj,new (set as in Theorem 4.2, or as in [15, Sec X-B] when the model is not known)
Input: Mt. Output: Ŝt, L̃̂t, Q̂(t).
Initialization: Compute Q̂0 ← proj-PCA([L̃1, L̃2, · · · , L̃ttrain], [.], r0) and set Q̂(t) ← Q̂0. Let T̂ttrain+1 = ∅, and let j ← 1, k ← 1.
For t > ttrain, do the following:
1) Estimate Nt and St via Projected Modified CS:
a) Nullify most of L̃t: compute Φ(t) ← I − Q̂(t−1)Q̂'(t−1) and yt ← Φ(t)Mt.
b) Sparse Recovery: compute Ŝt,modcs as the solution of min_x ‖x_{T̂t^c}‖1 s.t. ‖yt − Φ(t)Ax‖2 ≤ ξ.
c) Support Estimate: compute N̂t = {i : |(Ŝt,modcs)i| > ω} and set T̂t+1 = N̂t.
d) LS Estimate of St: compute (Ŝt)_{N̂t} = ((Φ(t)A)_{N̂t})† yt and (Ŝt)_{N̂t^c} = 0.
2) Estimate L̃t: compute L̃̂t = Mt − AŜt.
3) Update Q̂(t) by K Projection PCA steps:
a) If t = tj + kα − 1,
i) Q̂j,new,k ← proj-PCA([L̃̂_{tj+(k−1)α}, . . . , L̃̂_{tj+kα−1}], Q̂j−1, cj,new),
ii) set Q̂(t) ← [Q̂j−1 Q̂j,new,k]; increment k ← k + 1.
Else set Q̂(t) ← Q̂(t−1).
b) If t = tj + Kα − 1, then set Q̂j ← [Q̂j−1 Q̂j,new,K]. Increment j ← j + 1. Reset k ← 1.
4) Increment t ← t + 1 and go to step 1.
B. Recursive Projected Modified CS (ReProMoCS)
The idea behind ReProMoCS is as follows. Assume that the current basis matrix Q̂(t−1) is an accurate estimate of the current subspace in which the L̃t's lie. We project the measurement Mt perpendicular to range(Q̂(t−1)) in order to nullify most of L̃t and obtain yt := Φ(t)Mt where Φ(t) = I − Q̂(t−1)Q̂'(t−1). Since the projection matrix has rank m − r∗, where r∗ = rank(Q̂(t−1)), there are only m − r∗ "effective" linear measurements. Recovering the n-dimensional sparse vector St now becomes a sparse recovery problem in small noise. Using the support estimate from the previous time, we use the modified-CS algorithm to recover Ŝt,modcs and estimate its support by thresholding the recovered vector. By computing a least squares (LS) estimate of St on the estimated support and setting it to zero everywhere else (step 1d), we can get a more accurate final estimate, Ŝt. This Ŝt is used to estimate L̃t as L̃̂t = Mt − AŜt (step 2). The sparse recovery error is then et := Ŝt − St. Since L̃̂t = Mt − AŜt, et also satisfies Aet = L̃t − L̃̂t = BLt − L̃̂t. Thus, a small et means that L̃t is also recovered accurately. The estimated L̃t's are used to obtain new estimates of Qj,new every α frames for a total of Kα frames via a modification of the standard PCA procedure, which we call projection PCA (step 3).
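A sketch of one time step of ReProMoCS (steps 1 and 2 of Algorithm 2) is shown below. It assumes cvxpy is available for the modified-CS subproblem; the function name and interface are illustrative choices, not part of the paper.

import numpy as np
import cvxpy as cp

def repromocs_step(M_t, A, Q_hat, T_prev, xi, omega):
    # One iteration of steps 1-2 of Algorithm 2 (a sketch, not the full algorithm).
    m, n = A.shape
    # 1a) project perpendicular to range(Q_hat) to nullify most of L~_t
    y = M_t - Q_hat @ (Q_hat.T @ M_t)
    Phi_A = A - Q_hat @ (Q_hat.T @ A)
    # 1b) modified-CS: l1-minimize outside the previous support estimate
    x = cp.Variable(n)
    outside = np.setdiff1d(np.arange(n), T_prev)
    cp.Problem(cp.Minimize(cp.norm1(x[outside])),
               [cp.norm(y - Phi_A @ x, 2) <= xi]).solve()
    # 1c) support estimate by thresholding
    N_hat = np.flatnonzero(np.abs(x.value) > omega)
    # 1d) least-squares refinement on the estimated support
    S_hat = np.zeros(n)
    S_hat[N_hat] = np.linalg.lstsq(Phi_A[:, N_hat], y, rcond=None)[0]
    # 2) estimate of L~_t
    L_tilde_hat = M_t - A @ S_hat
    return S_hat, N_hat, L_tilde_hat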
C. Performance Guarantees
Theorem 4.2. Consider Algorithm 2. Let c := cmx and r := r0 + (J − 1)c. Assume that Lt obeys the
model given in Sec. III and there are a total of J change times. Assume also that the initial subspace
estimate is accurate enough, i.e. k(I − Q̂0 Q̂00 )Q0 k ≤ r0 ζ, for a ζ that satisfies
ζ ≤ min( 10^{−4}/r², 1.5 × 10^{−4}/(r²f), 1/(r³γ∗²) ),  where f := λ+/λ−.
If the following conditions hold:
1) the algorithm parameters are set as ξ = ξ0(ζ), (ϱ/√sa) 7.50 ξ ≤ ω < Smin − (ϱ/√sa) 7.50 ξ, K = K(ζ), α ≥ αadd,A = 1.05 αadd(ζ), where ξ0(ζ), ϱ, K(ζ), αadd(ζ) are defined in Definition 5.2;
2) A, Pj−1, Pj,new, Dj,new,k := (I − Q̂j−1Q̂'j−1 − Q̂j,new,kQ̂'j,new,k)Qj,new, and Gj,new,k := (I − Pj,newP'j,new)P̂j,new,k satisfy
δ2s(A) ≤ 0.05,  κ2s(Q1) ≤ 0.3,  κ2s(Q1,new) ≤ 0.15,  κs(D1,new,1) ≤ 0.15,  κ2s(Q1,new,1) ≤ 0.15,
and
κ_{s+3sa,A}(Q_{J−1}) ≤ 0.22,  max_j κ_{s+3sa,A}(Qj,new) ≤ 0.14,
max_j max_{0≤k≤K} κ_{s+3sa,A}(Dj,new,k) ≤ 0.15,  max_j max_{0≤k≤K} κ_{s+3sa,A}(Gj,new,k) ≤ 0.14,
max_j max_{0≤k≤K} κ_{s+3sa,A}(Dj,∗,k) ≤ 1,
with P̂j,new,0 = [.] (empty matrix);
3) for a given value of Smin, the subspace change is slow enough, i.e.
min_j (tj+1 − tj) > Kα,  14ρξ0(ζ) ≤ Smin,  max_j max_{tj+(k−1)α ≤ t < tj+kα} ‖at,new‖∞ ≤ min(1.2^{k−1}γnew, γ∗);
4) the condition number of the covariance matrix of at,new is bounded, i.e.
gt := λmax[Cov(at,new)] / λmin[Cov(at,new)] ≤ √2;
then, with probability at least (1 − n^{−10}),
1) at all times t, T̂t = Tt and et satisfies
et = I_{Nt} [(Φk−1A)'_{Nt} (Φk−1A)_{Nt}]^{−1} A'_{Nt} Φk−1 Lt        (2)
and ‖et‖2 = ‖Ŝt − St‖2 ≤ 0.18√c γnew + 1.2√ζ(√r + 0.06√c);
2) the subspace error SE(t) := ‖(I − Q̂(t)Q̂'(t))Q(t)‖2 satisfies the bounds given in [15, Theorem 18].
Definition 4.3.
1) Define K(ζ) := ⌈ log(0.6cζ) / log 0.6 ⌉.
2) Define g+ := √2 to be the upper bound on gt.
3) Define ξ0(ζ) := √c γnew + √ζ (√r + √c).
4) Define ϱ to be the smallest real number such that
‖St − Ŝt,ModCS‖∞ ≤ (ϱ/√sa) ‖St − Ŝt,ModCS‖2
for all t. Notice that ϱ ≤ √sa because the infinity norm is always less than or equal to the two norm.
5) Let K = K(ζ). Define
αadd := ⌈ (log 6KJ + 11 log n) · (4608/(ζ²(λ−)²)) · max( min(1.2^{4K}γnew⁴, γ∗⁴), 16/c², 4(0.186γnew² + 0.0034γnew + 2.3)² ) ⌉.
In words, αadd is the smallest value of the number of data points, α, needed for one projection PCA
step to ensure that Theorem 4.2 holds w.p. at least (1 − n−10 ).
This result says that if (a) the algorithm parameters are set appropriately; (b) The RIC of A is small
enough; (c) the basis matrices for the previous subspace, the newly added subspace, and the unestimated
part of the newly added subspace are sufficiently incoherent with A; (d) the low-dimensional subspace
changes slowly enough; (e) the condition number of the covariance matrix of at,new is small enough, then
with high probability, the algorithm exactly recovers the support of St at all times t.
D. Discussion
This work extends the results of [15] to the case where St and Lt are multiplied by the matrices A and
B which could be fat. This obviously makes the problem harder, so we need an additional assumption,
namely that δs (A) < 0.05. Also, because St is replaced by ASt and Lt is replaced by BLt , the denseness
assumption on the subspaces in which Lt ’s lie or on the changed part of these subspaces is replaced by a
similar assumption on the incoherence between the range of any set of s columns of A and the subspaces
or subspace changes of BLt (quantified by bounds on κs,A (Qj ) and κs,A (Qj,new )).
In addition to considering the undersampled case, we also substitute modified-CS for basis pursuit in
the sparse recovery step of the algorithm. To study the effect of this change and compare it with the
result of [15], consider the case where A = B = I. In this case, κs (P ) is the denseness coefficient of
range(P ). Since modified-CS utilizes the previous support estimate to aid the current sparse recovery,
we are able to replace the bounds on κ2s with similar bounds on κs+3sa . This is a weaker condition if
3sa ≤ s. So if the support of St changes slowly (which is a practical assumption in certain applications)
we can get exact support recovery under weaker conditions than [15]. However, we still need stronger
conditions for the very first step or some initial support estimate. At the same time, we note that denseness
of Dnew,k requires the support of St to not be constant in an α-duration period. As observed in simulation
experiments described in [15], even a support change by one index at each t is sufficient to ensure that
Dnew,k is dense enough. A simple practical example where both of the above requirements are satisfied
is a fixed size object moving by one or a few pixels in each frame. The relation between the denseness
requirement on Dnew,k and support change of St needs to be studied further and this is being done in our
ongoing work.
To compare with [1], [2], consider the case where A = B = I. A direct comparison is not possible
because the solution methods and proof techniques are very different. Specifically both [1] and [2] solve a
batch version of the problem, whereas we provide a recursive solution. ReProMoCS requires knowledge of
some subspace change model parameters and appropriate setting of a few algorithm parameters (of course,
we also have a practical version that works without any model knowledge) while this is not needed for the
approach studied in [1], [2]. Also, ReProMoCS needs slow subspace change (this, however, is a practically
valid assumption). The advantage of ReProMoCS is that it requires weaker denseness assumptions on Lt ,
and most importantly, it needs weaker assumptions on the uncorrelated-ness of the support change of St
over time. [1] needs the support sets to be mutually independent over time. [2] needs an assumption that
is satisfied if the minimum number of non-zeros in any row of S is small enough. However, if subspace
changes slowly enough, ReProMoCS will work even if many rows of S are dense.
E. Comparison with Rank-Sparsity Incoherence
To compare with [2] consider the case where A = B = I. A direct comparison is not possible because
the solution methods and proof techniques are very different. Specifically Rank-Sparsity Incoherence
solves a batch version of the problem, whereas we provide a recursive solution. The authors of [2]
provide sufficient conditions under which the minimization problem
min_{S̃,L̃} γ‖S̃‖1 + ‖L̃‖∗  s.t.  S̃ + L̃ = M
will exactly recover S and L. (Here M = S + L.) In order to state the conditions needed, they define two quantities:
ξ(L) := max_{N ∈ T(L), ‖N‖=1} ‖N‖∞
where T(L) is the span of all matrices with column space contained in the column space of L or row space contained in the row space of L, and
µ(S) := max_{supp(N) ⊆ supp(S), ‖N‖∞=1} ‖N‖.
If µ(S)ξ(L) < 1/6, then the optimization problem will exactly recover both S and L. This differs from
our work in that we only guarantee exact recovery of the support of S. The requirement that ξ(L) be
small is related to our assumption that κs+3sa ,A (QJ−1 ) be small. Matrices with their row/column space
incoherent with the coordinate axes will have small ξ(L). Since we are studying the undersampled case,
we require that QJ−1 be incoherent with the columns of A. Our assumptions differ in regard to the
sparse matrix S. Here we assume that supp(St ) changes slowly over time. This will result in rows of
S = [S1 · · · St ] with many non-zero entries. Rank-Sparsity Incoherence [2] uses the number of non-zero
entries per row or column to bound µ(S). If degmax (S) is the maximum number of non-zero entries in any
row or column of S, and degmin (S) is the minimum number of non-zero entries in any row or column,
then degmin (S) ≤ µ(S) ≤ degmax (S). They also show that if the support of S is chosen uniformly at
random, then µ(S) will be small with high probability. For our work, denseness of Dnew,k requires some
support change of the St ’s as seen in simulations. The number of support changes required seems to be
proportional to the number of newly added directions, for example if c = 1 new direction is added, then
even one support change in an α length duration is sufficient to ensure that κs (Dnew,k ) is less than one
and as a result the recovery error eventually stabilizes (see Section IV-C of [15]).
V. DEFINITIONS AND PROOF OUTLINE FOR THEOREM 4.2
Remark 5.1. We will prove Theorem 4.2 for the case when B = I. In this case Qj = Pj and Qnew = Pnew .
A. Definitions
A few quantities are already defined in the model (Sec III), Definition 4.1, Algorithm 2, and Theorem
4.2. Here we define more quantities needed for the proofs.
Definition 5.2. We define here the parameters used in Theorem 4.2.
1) Define K(ζ) := ⌈ log(0.6cζ) / log 0.6 ⌉.
2) Define ξ0(ζ) := √c γnew + √ζ (√r + √c).
3) Define ρ := maxt {κ1(Ŝt,cs − St)}. Notice that ρ ≤ 1.
4) Define
αadd := ⌈ (log 6KJ + 11 log n) · (8 · 24²/(ζ²(λ−)²)) · max( min(1.2^{4K}γnew⁴, γ∗⁴), 16/c², 4(0.186γnew² + 0.0034γnew + 2.3)² ) ⌉.
In words, αadd is the smallest value of the number of data points, α, needed for one projection PCA step to ensure that Theorem 4.2 holds w.p. at least (1 − n^{−10}).
Definition 5.3. We define the noise seen by the sparse recovery step at time t as
βt := (I − P̂(t−1)P̂'(t−1)) Lt.
Also define the reconstruction error of St as
et := Ŝt − St.
Here Ŝt is the final estimate of St after the LS step in Algorithm 2. Notice that et also satisfies Aet =
Lt − L̂t .
Definition 5.4. We define the subspace estimation errors as follows. Recall that P̂j,new,0 = [.] (empty matrix).
SE(t) := ‖(I − P̂(t)P̂'(t))P(t)‖2,
ζj,∗ := ‖(I − P̂j−1P̂'j−1)Pj−1‖2,
ζj,k := ‖(I − P̂j−1P̂'j−1 − P̂j,new,kP̂'j,new,k)Pj,new‖2.
Remark 5.5. Recall from the model given in Sec III and from Algorithm 2 that
1) P̂j,new,k is orthogonal to P̂j−1, i.e. P̂'j,new,k P̂j−1 = 0;
2) P̂j−1 := [P̂0, P̂1,new,K, . . . , P̂j−1,new,K] and Pj−1 := [P0, P1,new, . . . , Pj−1,new];
3) for t ∈ Ij,k+1, P̂(t) = [P̂j−1, P̂j,new,k] and P(t) = Pj = [Pj−1, Pj,new];
4) Φ(t) := I − P̂(t−1)P̂'(t−1).
From Definition 5.4 and the above, it is easy to see that
1) ζj,∗ ≤ ζ1,∗ + Σ_{j′=1}^{j−1} ζ_{j′,K};
2) SE(t) ≤ ζj,∗ + ζj,k ≤ ζ1,∗ + Σ_{j′=1}^{j−1} ζ_{j′,K} + ζj,k  for t ∈ Ij,k+1.
Definition 5.6. Define the following.
1) Φj,k, Φj,0 and φk:
a) Φj,k := I − P̂j−1P̂'j−1 − P̂j,new,kP̂'j,new,k is part of the CS matrix (the CS matrix will be Φj,kA) for t ∈ Ij,k+1, i.e. Φ(t) = Φj,k for this duration.
b) Φj,0 := I − P̂j−1P̂'j−1 is part of the CS matrix for t ∈ Ij,1, i.e. Φ(t) = Φj,0 for this duration. Φj,0 is also the matrix used in all of the projection PCA steps for t ∈ [tj, tj+1 − 1].
c) φk := maxj max_{T:|T|≤s} ‖(((Φj,kA)T)'(Φj,kA)T)^{−1}‖2. It is easy to see that φk ≤ 1/(1 − maxj δs(Φj,kA)) [17].
2) Dj,new,k, Dj,new and Dj,∗:
a) Dj,new,k := Φj,kPj,new. span(Dj,new,k) is the unestimated part of the newly added subspace for any t ∈ Ij,k+1.
b) Dj,new := Dj,new,0 = Φj,0Pj,new. span(Dj,new) is interpreted similarly for any t ∈ Ij,1.
c) Dj,∗,k := Φj,kPj−1. span(Dj,∗,k) is the unestimated part of the existing subspace for any t ∈ Ij,k+1.
d) Dj,∗ := Dj,∗,0 = Φj,0Pj−1. span(Dj,∗) is interpreted similarly for any t ∈ Ij,1.
e) Notice that ζj,0 = ‖Dj,new‖2, ζj,k = ‖Dj,new,k‖2, ζj,∗ = ‖Dj,∗‖2. Also, clearly, ‖Dj,∗,k‖2 ≤ ζj,∗.
Definition 5.7.
1) Let Dj,new QR= Ej,new Rj,new denote its QR decomposition. Here Ej,new is a basis matrix and Rj,new is upper triangular.
2) Let Ej,new,⊥ be a basis matrix for the orthogonal complement of span(Ej,new) = span(Dj,new). To be precise, Ej,new,⊥ is an n × (n − cj,new) basis matrix that satisfies E'j,new,⊥ Ej,new = 0.
3) Using Ej,new and Ej,new,⊥, define Aj,k, Aj,k,⊥, Hj,k, Hj,k,⊥ and Bj,k as
Aj,k := (1/α) Σ_{t∈Ij,k} E'j,new Φj,0 Lt L't Φj,0 Ej,new,
Aj,k,⊥ := (1/α) Σ_{t∈Ij,k} E'j,new,⊥ Φj,0 Lt L't Φj,0 Ej,new,⊥,
Hj,k := (1/α) Σ_{t∈Ij,k} E'j,new Φj,0 ((Aet)(Aet)' − Lt(Aet)' − Aet L't) Φj,0 Ej,new,
Hj,k,⊥ := (1/α) Σ_{t∈Ij,k} E'j,new,⊥ Φj,0 ((Aet)(Aet)' − Lt(Aet)' − Aet L't) Φj,0 Ej,new,⊥,
Bj,k := (1/α) Σ_{t∈Ij,k} E'j,new,⊥ Φj,0 L̂t L̂'t Φj,0 Ej,new = (1/α) Σ_{t∈Ij,k} E'j,new,⊥ Φj,0 (Lt − Aet)(Lt − Aet)' Φj,0 Ej,new.
4) Define
𝒜j,k := [Ej,new Ej,new,⊥] [Aj,k 0; 0 Aj,k,⊥] [Ej,new Ej,new,⊥]',
ℋj,k := [Ej,new Ej,new,⊥] [Hj,k B'j,k; Bj,k Hj,k,⊥] [Ej,new Ej,new,⊥]'.
5) From the above, it is easy to see that
𝒜j,k + ℋj,k = (1/α) Σ_{t∈Ij,k} Φj,0 L̂t L̂'t Φj,0.
6) Recall from Algorithm 2 that
𝒜j,k + ℋj,k EVD= [P̂j,new,k P̂j,new,k,⊥] [Λk 0; 0 Λk,⊥] [P̂j,new,k P̂j,new,k,⊥]'
is the EVD of 𝒜j,k + ℋj,k. Here P̂j,new,k is an n × cj,new basis matrix.
7) Using the above, 𝒜j,k + ℋj,k can be decomposed in two ways as follows:
𝒜j,k + ℋj,k = [P̂j,new,k P̂j,new,k,⊥] [Λk 0; 0 Λk,⊥] [P̂j,new,k P̂j,new,k,⊥]' = [Ej,new Ej,new,⊥] [Aj,k + Hj,k  B'j,k; Bj,k  Aj,k,⊥ + Hj,k,⊥] [Ej,new Ej,new,⊥]'.
Remark 5.8. Thus, from the above definition, ℋj,k = (1/α)[ Φ0 Σt (−Lt(Aet)' − Aet L't + (Aet)(Aet)') Φ0 + F + F' ] where F := Enew,⊥E'new,⊥ Φ0 (Σt Lt L't) Φ0 Enew E'new = Enew,⊥E'new,⊥ Σt (D∗,k−1 at,∗)(D∗,k−1 at,∗ + Dnew,k−1 at,new)' Enew E'new. Since E[at,∗ a't,new] = 0, ‖(1/α)F‖2 ≲ r²ζ²λ+ w.h.p. Recall that ≲ means (in an informal sense) that the RHS contains the dominant terms in the bound.
Definition 5.9. In the sequel, we let
1) r := r0 + (J − 1)cmx and c := cmx = maxj cj,new;
2) κs,∗ := maxj κs(Pj−1), κs,new := maxj κs(Pj,new), κs,k := maxj κs(Dj,new,k), κ̃s,k := maxj κs((I − Pj,newP'j,new)P̂j,new,k), and gk := maxj gj,k;
3) δ+_{s+3sa} := 0.05, κ+_{s+3sa,∗} := 0.3, κ+_{s+3sa,new} := 0.15, κ+_s := 0.15, κ̃+_{s+3sa} := 0.15 and g+ := √2 are the upper bounds assumed in Theorem 4.2 on δ_{s+3sa}(A), maxj κ_{s+3sa}(Pj), maxj κ_{s+3sa}(Pj,new), maxj maxk κs(Dj,new,k), maxj κ_{s+3sa}(Qj,new,k) and maxt gt respectively;
4) φ+ := 1.1735; we see later that this is an upper bound on φk under the assumptions of Theorem 4.2;
5) γnew,k := min(1.2^{k−1}γnew, γ∗) (recall that the theorem assumes maxj max_{t∈Ij,k} ‖at,new‖∞ ≤ γnew,k);
6) Pj,∗ := Pj−1 and P̂j,∗ := P̂j−1 (see Remark 5.10).
Remark 5.10. Notice that the subscript j always appears as the first subscript, while k is the last one. At
many places in this paper, we remove the subscript j for simplicity. Whenever there is only one subscript,
it refers to the value of k, e.g., Φ0 refers to Φj,0 , P̂new,k refers to P̂j,new,k . Also, P∗ := Pj−1 and P̂∗ := P̂j−1 .
Definition 5.11. Define the following:
1) ζ+_∗ := rζ (we note that ζ+_∗ = (r0 + (j − 1)c)ζ will also work);
2) define the sequence {ζ+_k}_{k=0,1,2,···,K} recursively as follows:
ζ+_0 := 1,
ζ+_k := (b + 0.125cζ) / (1 − (ζ+_∗)² − (ζ+_∗)²f − 0.125cζ − b)   for k ≥ 1,     (3)
where
b := C κ+_s g+ ζ+_{k−1} + C̃ (κ+_s)² g+ (ζ+_{k−1})² + C′ f (ζ+_∗)²,
C := 2κ+_s φ+ / √(1 − (ζ+_∗)²) + φ+,
C′ := (φ+)² + (κ+_s φ+ + κ+_s (φ+)² + 2φ+) / √(1 − (ζ+_∗)²) + 1 + φ+,
C̃ := (φ+)² + κ+_s (φ+)² / √(1 − (ζ+_∗)²).
As we will see, ζ∗+ and ζk+ are the high probability upper bounds on ζj,∗ and ζj,k (defined in Definition
5.4) under the assumptions of Theorem 4.2.
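The short numerical sketch below iterates the recursion of Definition 5.11 using the constants as reconstructed above and illustrative parameter values (r, c, f and ζ are assumptions chosen to satisfy the bound on ζ in Theorem 4.2); it illustrates the roughly 0.6^k decay established later in Lemma 6.1.

import numpy as np

kappa_s, phi, g = 0.15, 1.1735, np.sqrt(2)     # kappa_s^+, phi^+, g^+
r, c, f = 10, 1, 100.0                         # illustrative model parameters (assumptions)
zeta = 1e-8                                    # satisfies zeta <= 1.5e-4 / (r^2 f)
zeta_star = r * zeta                           # zeta_*^+

den = np.sqrt(1 - zeta_star ** 2)
C = 2 * kappa_s * phi / den + phi
Cp = phi ** 2 + (kappa_s * phi + kappa_s * phi ** 2 + 2 * phi) / den + 1 + phi
Ct = phi ** 2 + kappa_s * phi ** 2 / den

zeta_k = 1.0                                   # zeta_0^+ = 1
for k in range(1, 9):
    b = (C * kappa_s * g * zeta_k + Ct * kappa_s ** 2 * g * zeta_k ** 2
         + Cp * f * zeta_star ** 2)
    zeta_k = (b + 0.125 * c * zeta) / (1 - zeta_star ** 2 - zeta_star ** 2 * f
                                       - 0.125 * c * zeta - b)
    print(k, zeta_k, 0.6 ** k + 0.4 * c * zeta)   # zeta_k^+ stays below the Lemma 6.1 bound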
Definition 5.12. Define the random variable Xj,k := {a1 , a2 , · · · , atj +kα−1 }.
Recall that the at ’s are mutually independent over t.
Definition 5.13. Define the set Γ̌j,k as follows:
Γ̌j,k := {Xj,k : ζj,k ≤ ζk+ and T̂t = Tt and et satisfies (2) for all t ∈ Ij,k }
Γ̌j,K+1 := {Xj+1,0 : T̂t = Tt and et satisfies (2) for all t ∈ Ij,K+1 }
Definition 5.14. Recursively define the sets Γj,k as follows:
Γ1,0 := {X1,0 : ζ1,∗ ≤ rζ and T̂t = Tt and et satisfies (2) for all t ∈ [ttrain+1 : t1 − 1]}
Γj,k := Γj,k−1 ∩ Γ̌j,k
k = 1, 2, . . . , K, j = 1, 2, . . . , J
Γj+1,0 := Γj,K ∩ Γ̌j,K+1
j = 1, 2, . . . , J
B. Proof Outline for Theorem 4.2
The proof of Theorem 4.2 essentially follows from two main lemmas, 6.1 and 6.2. Lemma 6.1 gives an exponentially decaying upper bound on ζ+_k defined in Definition 5.11. ζ+_k will be shown to be a high probability upper bound for ζk under the assumptions of the Theorem. Lemma 6.2 says that, conditioned on Xj,k−1 ∈ Γj,k−1, Xj,k will be in Γj,k w.h.p. In words, this says that if, during the time interval Ij,k−1, the algorithm has worked well (recovered the support of St exactly and recovered the background subspace with subspace recovery error below ζ+_{k−1} + ζ+_∗), then it will also work well in Ij,k w.h.p. The proof of Lemma 6.2 requires two lemmas: one for the projected CS step and one for the projection PCA step of the algorithm. These are Lemmas 8.1 and 8.2. The proof of Lemma 8.1 follows using Lemmas 6.1, 3.4, 2.14, the ModCS error bound (Lemma 2.16), and some straightforward steps. The proof of Lemma 8.2 is longer and uses Lemma 2.13 to get a bound on ζk. From here we use the matrix Hoeffding inequalities (Corollaries 2.8 and 2.9) to bound each of the terms in the bound on ζk and finally show that, conditioned on Γ^e_{j,k−1}, ζk ≤ ζ+_k w.h.p. These are Lemmas 8.6 and 8.7.
VI. MAIN LEMMAS AND PROOF OF THEOREM 4.2
Recall that when there is only one subscript, it refers to the value of k (i.e. ζk = ζj,k ).
Lemma 6.1 (Exponential decay of ζ+_k). Assume that the bounds on ζ from Theorem 4.2 hold. Define the sequence ζ+_k as in Definition 5.11. Then
1) ζ+_0 = 1 and ζ+_k ≤ 0.6^k + 0.4cζ for all k = 1, 2, . . . , K;
2) the denominator of ζ+_k is positive for all k = 1, 2, . . . , K.
We will prove this lemma in Section VII.
Lemma 6.2. Assume that all the conditions of Theorem 4.2 hold. Also assume that P(Γej,k−1 ) > 0. Then
P(Γej,k |Γej,k−1 ) ≥ pk (α, ζ) ≥ pK (α, ζ) for all k = 1, 2, . . . , K,
where pk (α, ζ) is defined in equation (5).
Remark 6.3. Under the assumptions of Theorem 4.2, it is easy to see that the following holds. For any k = 1, 2, . . . , K, Γ^e_{j,k} implies that ζj,∗ ≤ ζ+_∗.
From the definition of Γ^e_{j,k}, ζ_{j′,K} ≤ ζ+_K for all j′ ≤ j − 1. By Lemma 6.1 and the definition of K in Definition 5.2, ζ+_K ≤ 0.6^K + 0.4cζ ≤ cζ. Using Remark 5.5, ζj,∗ ≤ ζ1,∗ + Σ_{j′=1}^{j−1} ζ_{j′,K} ≤ r0ζ + (j − 1)cζ ≤ ζ+_∗.
Proof of Theorem 4.2. The theorem is a direct consequence of Lemmas 6.1, 6.2, and Lemma 2.6.
Notice that Γ^e_{j,0} ⊇ Γ^e_{j,1} ⊇ · · · ⊇ Γ^e_{j,K} ⊇ Γ^e_{j+1,0}. Thus, by Lemma 2.6, P(Γ^e_{j+1,0}|Γ^e_{j,0}) = P(Γ^e_{j+1,0}|Γ^e_{j,K}) ∏_{k=1}^{K} P(Γ^e_{j,k}|Γ^e_{j,k−1}) and P(Γ^e_{J+1,0}|Γ^e_{1,0}) = ∏_{j=1}^{J} P(Γ^e_{j+1,0}|Γ^e_{j,0}).
Using Lemma 6.2, and the fact that pk(α, ζ) ≥ pK(α, ζ) (see their respective definitions in Lemma 8.7 and equation (5)), we get P(Γ^e_{J+1,0}|Γ^e_{1,0}) ≥ pK(α, ζ)^{KJ}. Also, P(Γ^e_{1,0}) = 1. This follows by the assumption on P̂0 and Lemma 8.1. Thus, P(Γ^e_{J+1,0}) ≥ pK(α, ζ)^{KJ}.
Using the definition of αadd we get that P(Γ^e_{J+1,0}) ≥ pK(α, ζ)^{KJ} ≥ 1 − n^{−10} whenever α ≥ αadd.
The event Γ^e_{J+1,0} implies that T̂t = Tt and et satisfies (2) for all t < tJ+1. Using Remarks 5.5 and 6.3, Γ^e_{J+1,0} implies that all the bounds on the subspace error hold. Using these, ‖at,new‖2 ≤ √c γnew,k, and ‖at‖2 ≤ √r γ∗, Γ^e_{J+1,0} implies that all the bounds on ‖et‖2 hold (the bounds are obtained in Lemma 8.1). Thus, all conclusions of the result hold w.p. at least 1 − n^{−10}.
VII. PROOF OF LEMMA 6.1
Proof. First recall the definition of ζ+_k (Definition 5.11). Recall from Definition 5.9 that κ+_s := 0.15, φ+ := 1.1735, and g+ := √2, so we can make these substitutions directly. Notice that ζ+_k is an increasing function of ζ+_∗, ζ, c, and f. Therefore we can use upper bounds on each of these quantities to get an upper bound on ζ+_k. From the definition of ζ in Theorem 4.2 and ζ+_∗ := rζ we get
• ζ+_∗ ≤ 10^{−4}
• ζ+_∗ f ≤ 1.5 × 10^{−4}
• cζ ≤ 10^{−4}
• ζ+_∗/(cζ) = rζ/(cζ) = r/c ≤ r (we assume that c ≥ 1; recall that c is the upper bound on the number of new directions added)
• ζ+_∗ f r = r²fζ ≤ 1.5 × 10^{−4}
First we prove by induction that ζ+_k ≤ ζ+_{k−1} and ζ+_k ≤ 0.6 for all k ≥ 1. Notice that ζ+_0 = 1 by definition.
• Base case (k = 1): Using the above bounds we get that ζ+_1 < 0.5985 < 1 = ζ+_0.
• For the induction step, assume that ζ+_{k−1} ≤ ζ+_{k−2}. Then, because ζ+_k is an increasing function finc of ζ+_{k−1}, we get that ζ+_k = finc(ζ+_{k−1}) ≤ finc(ζ+_{k−2}) = ζ+_{k−1}.
1) To prove the first claim, first rewrite ζ+_k as
ζ+_k = ζ+_{k−1} · (C κ+_s g+ + C̃ (κ+_s)² g+ ζ+_{k−1}) / (1 − (ζ+_∗)² − (ζ+_∗)²f − 0.125cζ − b) + cζ · (C′ (ζ+_∗ f)(ζ+_∗/(cζ)) + 0.125) / (1 − (ζ+_∗)² − (ζ+_∗)²f − 0.125cζ − b),
where C, C′, C̃, and b are as in Definition 5.11. Using the above bounds, including ζ+_{k−1} ≤ 0.6, we get that
ζ+_k ≤ ζ+_{k−1}(0.6) + cζ(0.16).
Iterating this inequality,
ζ+_k ≤ ζ+_0 (0.6)^k + Σ_{i=0}^{k−1} (0.6)^i (0.16)cζ ≤ (0.6)^k + Σ_{i=0}^{∞} (0.6)^i (0.16)cζ ≤ 0.6^k + 0.4cζ.
2) To see that the denominator is positive, observe that the denominator is decreasing in all of its arguments: ζ+_∗, ζ+_∗f, cζ, and b. Using the same upper bounds as before, we get that the denominator is greater than or equal to 0.78 > 0.
VIII. PROOF OF LEMMA 6.2
The proof of Lemma 6.2 follows from two lemmas. The first is the final conclusion for the projected CS step for t ∈ Ij,k. The second is the final conclusion for one projection PCA step, i.e. for t ∈ Ij,k. We will state the two lemmas first and then proceed to prove them in order.
Lemma 8.1 (Projected Modified Compressed Sensing Lemma). Assume that all conditions of Theorem 4.2 hold.
1) For all t ∈ Ij,k, for any k = 1, 2, . . . , K, if Xj,k−1 ∈ Γj,k−1,
a) the projection noise βt satisfies ‖βt‖2 ≤ ζ+_{k−1}√c γnew,k + ζ+_∗√r γ∗ ≤ √c 0.72^{k−1}γnew + 1.06√ζ ≤ ξ0;
b) N̂t = Nt;
c) et satisfies (2) and ‖et‖2 ≤ φ+[κ+_s ζ+_{k−1}√c γnew,k + ζ+_∗√r γ∗] ≤ 0.18 · 0.72^{k−1}√c γnew + 1.17 · 1.06√ζ. Recall that (2) is
et = I_{Nt}[(Φ(t)A)'_{Nt}(Φ(t)A)_{Nt}]^{−1} A'_{Nt} Φ(t) Lt.
2) For all k = 1, 2, . . . , K, P(T̂t = Tt and et satisfies (2) for all t ∈ Ij,k | Γ^e_{j,k−1}) = 1.
Lemma 8.2 (Subspace Recovery Lemma). Assume that all the conditions of Theorem 4.2 hold. Let
ζ∗+ = rζ. Then, for all k = 1, 2, . . . K,
P(ζk ≤ ζk+ |Γej,k−1 ) ≥ pk (α, ζ)
where ζk+ is defined in Definition 5.11 and pk (α, ζ) is defined in (5).
Proof of Lemma 6.2. Observe that P(Γj,k |Γj,k−1 ) = P(Γ̌j,k |Γj,k−1 ). The lemma then follows by combining
Lemma 8.2 and item 2 of Lemma 8.1.
A. Proof of Lemma 8.1
In order to prove Lemma 8.1 we first need a bound on the RIC of the compressed sensing matrix Φk A.
Lemma 8.3 (Bounding the RIC). Recall that ζ∗ := ‖(I − P̂∗P̂'∗)P∗‖. The following hold.
1) κs,A(P̂∗)² ≤ 2ζ∗(1 + δs(A)) + κs,A(P∗)²
2) κs,A(P̂new,k) ≤ κs,A,new + κ̃s,A,k(ζk + ζ∗)
3) δs((I − P̂∗P̂'∗)A) ≤ δs(A) + κs,A(P̂∗)² ≤ δs(A) + 2ζ∗(1 + δs(A)) + κs,A(P∗)²
4) δs((I − P̂∗P̂'∗ − P̂new,kP̂'new,k)A) ≤ δs(A) + κs,A(P̂∗)² + κs,A(P̂new,k)² ≤ δs(A) + (2ζ∗(1 + δs(A)) + κs,A(P∗)²) + (κs,A,new + κ̃s,A,k(ζk + ζ∗))²
Proof.
1) For any set T such that |T| ≤ s,
‖A'_T P̂∗‖² = ‖A'_T P̂∗P̂'∗ A_T‖ = ‖A'_T (P̂∗P̂'∗ − P∗P'∗ + P∗P'∗) A_T‖ ≤ ‖A'_T (P̂∗P̂'∗ − P∗P'∗) A_T‖ + ‖A'_T P∗P'∗ A_T‖ ≤ ‖A'_T‖ ‖P̂∗P̂'∗ − P∗P'∗‖ ‖A_T‖ + ‖A'_T P∗P'∗ A_T‖ ≤ 2ζ∗‖A_T‖² + κs,A(P∗)² ≤ 2ζ∗(1 + δs) + κs,A(P∗)².
2) For any set T such that |T| ≤ s,
‖A'_T P̂new,k‖ ≤ ‖A'_T (I − PnewP'new)P̂new,k‖ + ‖A'_T PnewP'new P̂new,k‖
≤ κ̃s,A,k ‖(I − PnewP'new)P̂new,k‖ + ‖A'_T Pnew‖
= κ̃s,A,k ‖(I − P̂new,kP̂'new,k)Pnew‖ + κs,A,new
= κ̃s,A,k ‖(I − P̂∗P̂'∗ − P̂new,kP̂'new,k + P̂∗P̂'∗)Pnew‖ + κs,A,new
≤ κ̃s,A,k ( ‖(I − P̂∗P̂'∗ − P̂new,kP̂'new,k)Pnew‖ + ‖P̂∗P̂'∗Pnew‖ ) + κs,A,new
≤ κ̃s,A,k ζk + κ̃s,A,k ζ∗ + κs,A,new
≤ κ̃s,A,k ζk + ζ∗ + κs,A,new.
Lines 3 and 7 use Lemma 2.14. Line 2 uses Lemma 2.15.
3) Follows from (1) and Lemma 3.4.
4) Follows from (2) and Lemma 3.4.
Corollary 8.4. If the conditions of Theorem 4.2 are satisfied, and Xj,k−1 ∈ Γj,k−1, then
1) δs(Φ0A) ≤ δ_{s+3sa}(Φ0A) ≤ δ+_{s+3sa} + 2ζ+_∗(1 + δ+_{s+3sa}) + (κ+_{s+3sa,∗})² < 0.1 < 0.207
2) δs(Φk−1A) ≤ δ_{s+3sa}(Φk−1A) ≤ δ+_{s+3sa} + 2ζ+_∗(1 + δ+_{s+3sa}) + (κ+_{s+3sa,∗})² + (κ+_{s+3sa,new} + κ̃+_{s+3sa}(ζ+_{k−1} + ζ+_∗))² < 0.207
3) φk−1 ≤ 1/(1 − δs(Φk−1A)) < φ+
Proof. This follows using Lemma 8.3, the definition of Γj,k−1, and the bound on ζ+_{k−1} from Lemma 6.1.
The following are straightforward bounds that will be useful for the proof of Lemma 8.1 and later.
Fact 8.5. Under the assumptions of Theorem 4.2:
1) ζγ∗ ≤ √ζ / (r0 + (J − 1)c)^{3/2} ≤ √ζ
2) ζ+_∗ ≤ 10^{−4} / (r0 + (J − 1)c) ≤ 10^{−4}
3) ζ+_∗ γ∗² ≤ 1 / (r0 + (J − 1)c)² ≤ 1
4) ζ+_∗ γ∗ ≤ √ζ / √(r0 + (J − 1)c) ≤ √ζ
5) ζ+_∗ f ≤ 1.5 × 10^{−4} / (r0 + (J − 1)c) ≤ 1.5 × 10^{−4}
6) ζ+_{k−1} ≤ 0.6^{k−1} + 0.4cζ (from Lemma 6.1)
7) ζ+_{k−1} γnew,k ≤ (0.6 · 1.2)^{k−1}γnew + 0.4cζγ∗ ≤ 0.72^{k−1}γnew + 0.4√ζ / (r0 + (J − 1)c) ≤ 0.72^{k−1}γnew + 0.4√ζ
8) ζ+_{k−1} γnew,k² ≤ (0.6 · 1.2²)^{k−1}γnew² + 0.4cζγ∗² ≤ 0.864^{k−1}γnew² + 0.4 / (r0 + (J − 1)c)² ≤ 0.864^{k−1}γnew² + 0.4
Proof of Lemma 8.1. For t = ttrain + 1 we have no prior support estimate (this is why we need the first set of bounds in assumption 2 of Theorem 4.2), so step 1b of Algorithm 2 is just basis pursuit and we refer to the proof in [15].
For t > ttrain + 1, recall that Xj,k−1 ∈ Γj,k−1 implies that ζ∗ ≤ ζ+_∗, ζk−1 ≤ ζ+_{k−1} and T̂t = Nt−1 (at time t − 1 the support was recovered exactly).
1) a) For t ∈ Ij,k, βt := (I − P̂(t−1)P̂'(t−1))Lt = D∗,k−1 at,∗ + Dnew,k−1 at,new. Thus, using Fact 8.5,
‖βt‖2 ≤ ζ∗√r γ∗ + ζk−1√c γnew,k ≤ √ζ√r + (0.72^{k−1}γnew + 0.4√ζ)√c = √c 0.72^{k−1}γnew + √ζ(√r + 0.4√c) ≤ ξ0.
b) By Corollary 8.4, δ_{s+3sa}(Φk−1A) < 0.207. Since T̂t = Nt−1, we can use Lemma 2.16 with s∆ = s∆e = sa (i.e. the 'misses' and 'extras' in the support estimate are the newly added and deleted indices). We also have that |Nt| ≤ s and ‖βt‖2 ≤ ξ0 = ξ. So, by part 1 of Lemma 2.16, if |(St)i| > ω + (ϱ/√sa) 7.50 ξ, then i ∈ N̂t; that is to say, (St)i is large enough to be detected. Also, according to part 2 of Lemma 2.16, if ω ≥ (ϱ/√sa) 7.50 ξ, then there will be no false additions, and all true removals will be eliminated from the support estimate N̂t. This, along with the assumption that (ϱ/√sa) 7.50 ξ ≤ ω < Smin − (ϱ/√sa) 7.50 ξ, gives exact support recovery, i.e. N̂t = Nt.
c) We use the fact that we have exact support recovery in the least squares step of the algorithm to write an expression for et := Ŝt − St. To begin, observe that
(Ŝt)_{Nt} = [(Φk−1A)_{Nt}]† yt
= [(Φk−1A)_{Nt}]† (Φk−1ASt + Φk−1Lt)
= [(Φk−1AI_{Nt})'(Φk−1AI_{Nt})]^{−1}(Φk−1AI_{Nt})'(Φk−1ASt + Φk−1Lt)
= [(Φk−1AI_{Nt})'(Φk−1AI_{Nt})]^{−1}(Φk−1AI_{Nt})'(Φk−1AI_{Nt}I'_{Nt}St + Φk−1Lt)
= (St)_{Nt} + [(Φk−1AI_{Nt})'(Φk−1AI_{Nt})]^{−1}(Φk−1AI_{Nt})' Φk−1Lt
= (St)_{Nt} + [(Φk−1A)'_{Nt}(Φk−1A)_{Nt}]^{−1} A'_{Nt} Φk−1Lt.
The fourth equality holds because supp(St) = Nt. Note that since δ_{|Nt|}(Φk−1A) < 1, the columns of (Φk−1A)_{Nt} are linearly independent, so (Φk−1A)'_{Nt}(Φk−1A)_{Nt} is invertible. So (et)_{Nt} = [(Φk−1A)'_{Nt}(Φk−1A)_{Nt}]^{−1} A'_{Nt} Φk−1Lt and (et)_{Nt^c} = 0. Therefore
et = I_{Nt}[(Φk−1A)'_{Nt}(Φk−1A)_{Nt}]^{−1} A'_{Nt} Φk−1Lt = I_{Nt}[(Φk−1A)'_{Nt}(Φk−1A)_{Nt}]^{−1} A'_{Nt}(Φk−1P∗ at,∗ + Dnew,k−1 at,new).
So
‖et‖2 ≤ φ+[√(1 + δs(A)) ζ+_∗√r γ∗ + κ_{s,k−1} ζ+_{k−1}√c γnew,k] ≤ 1.25[√r√ζ + √c 0.15(0.72)^{k−1}γnew + √c 0.06√ζ] ≤ 0.19√c 0.72^{k−1}γnew + 1.25√ζ(√r + 0.06√c).
2) The second claim is just a restatement of the first.
B. Proof of Lemma 8.2
The proof of Lemma 8.2 will use the next two lemmas (8.6, and 8.7).
Lemma 8.6. If λmin(Ak) − ‖Ak,⊥‖2 − ‖ℋk‖2 > 0, then
ζk ≤ ‖Rk‖2 / (λmin(Ak) − ‖Ak,⊥‖2 − ‖ℋk‖2) ≤ ‖ℋk‖2 / (λmin(Ak) − ‖Ak,⊥‖2 − ‖ℋk‖2)      (4)
where Rk := ℋk Enew and Ak, Ak,⊥, ℋk are defined in Definition 5.7.
Proof. Since ζk = ‖(I − P̂new,kP̂'new,k)Dnew‖2 = ‖(I − P̂new,kP̂'new,k)Enew Rnew‖2 ≤ ‖(I − P̂new,kP̂'new,k)Enew‖2, the result follows by Lemma 2.13.
Lemma 8.7 (High probability bounds for each of the terms in the ζk bound (4)). Assume the conditions of Theorem 4.2 hold. Also assume that P(Γ^e_{j,k−1}) > 0 for all 1 ≤ k ≤ K + 1. Then, for all 1 ≤ k ≤ K,
1) P( λmin(Ak) ≥ λ−_{new,k}(1 − (ζ+_∗)² − cζ/12) | Γ^e_{j,k−1} ) > 1 − pa,k(α, ζ), where
pa,k(α, ζ) := c exp( −αζ²(λ−)² / (8 · 24² · min(1.2^{4k}γnew⁴, γ∗⁴)) ) + c exp( −αc²ζ²(λ−)² / (8 · 24² · 4²) );
2) P( λmax(Ak,⊥) ≤ λ−_{new,k}((ζ+_∗)²f + cζ/24) | Γ^e_{j,k−1} ) > 1 − pb(α, ζ), where
pb(α, ζ) := (n − c) exp( −αc²ζ²(λ−)² / (8 · 24²) );
3) P( ‖ℋk‖2 ≤ λ−_{new,k}(b + 0.125cζ) | Γ^e_{j,k−1} ) ≥ 1 − pc(α, ζ), where b is as defined in Definition 5.11 and
pc(α, ζ) := n exp( −αζ²(λ−)² / (8 · 24²(0.0324γnew² + 0.0072γnew + 0.0004)) )
+ n exp( −αζ²(λ−)² / (32 · 24²(0.06γnew² + 0.0006γnew + 0.4)²) )
+ n exp( −αζ²(λ−)² / (32 · 24²(0.186γnew² + 0.00034γnew + 2.3)²) ).
Proof of Lemma 8.2. Lemma 8.2 now follows by combining Lemmas 8.6 and 8.7 and defining
pk (α, ζ) := 1 − pa,k (α, ζ) − pb (α, ζ) − pc (α, ζ).
(5)
As above, we will start with some simple facts that will be used to prove Lemma 8.7. For convenience, we will use (1/α)Σt to denote (1/α)Σ_{t∈Ij,k}.
Fact 8.8. Under the assumptions of Theorem 4.2, the following are true.
1) The matrices Dnew, Rnew, Enew, D∗, Dnew,k−1, Φk−1 are functions of the r.v. Xj,k−1. All terms that we bound for the first two claims of the lemma are of the form (1/α)Σ_{t∈Ij,k} Zt where Zt = f1(Xj,k−1) Yt f2(Xj,k−1), Yt is a sub-matrix of at a't, and f1(.) and f2(.) are functions of Xj,k−1.
2) Xj,k−1 is independent of any at for t ∈ Ij,k, and hence the same is true for the matrices Dnew, Rnew, Enew, D∗, Dnew,k−1, Φk−1. Also, the at's for different t ∈ Ij,k are mutually independent. Thus, conditioned on Xj,k−1, the Zt's defined above are mutually independent.
3) All the terms that we bound for the third claim contain et. Using the second claim of Lemma 8.1, conditioned on Xj,k−1, et satisfies (2) w.p. one whenever Xj,k−1 ∈ Γj,k−1. Conditioned on Xj,k−1, all these terms are also of the form (1/α)Σ_{t∈Ij,k} Zt with Zt as defined above, whenever Xj,k−1 ∈ Γj,k−1. Thus, conditioned on Xj,k−1, the Zt's for these terms are mutually independent, whenever Xj,k−1 ∈ Γj,k−1.
4) It is easy to see that ‖Φk−1P∗‖2 ≤ ζ∗, ζ0 = ‖Dnew‖2 ≤ 1, Φ0Dnew = Dnew, ‖Rnew‖ ≤ 1, ‖(Rnew)^{−1}‖ ≤ 1/√(1 − ζ∗²), E'new,⊥ Dnew = 0, and ‖E'new Φ0 Aet‖ = ‖(R'new)^{−1} D'new Φ0 Aet‖ = ‖(R'new)^{−1} D'new Aet‖ ≤ ‖(R'new)^{−1} D'new A_{Nt}‖ ‖et‖ ≤ (κ+_{s,A}/√(1 − ζ∗²)) ‖et‖. The bounds on ‖Rnew‖ and ‖(Rnew)^{−1}‖ follow using Lemma 2.14 and the fact that σi(Rnew) = σi(Dnew).
5) Xj,k−1 ∈ Γj,k−1 implies that
a) ζ∗ ≤ ζ+_∗ (see Remark 6.3);
b) ζk−1 ≤ ζ+_{k−1} ≤ 0.6^{k−1} + 0.4cζ (this follows by the definition of Γj,k−1 and Lemma 6.1).
6) Item 5 implies that
a) λmin(Rnew R'new) ≥ 1 − (ζ+_∗)². This follows from Lemma 2.14 and the fact that σmin(Rnew) = σmin(Dnew).
b) ‖I'_{Nt} Φk−1 P∗‖2 ≤ ‖Φk−1P∗‖2 ≤ ζ∗ ≤ ζ+_∗ and ‖I'_{Nt} Dnew,k−1‖2 ≤ κ_{s,k−1} ζk−1 ≤ κ+_s ζ+_{k−1}.
7) By Weyl's theorem (Theorem 2.11), for a sequence of matrices Bt, λmin(Σt Bt) ≥ Σt λmin(Bt) and λmax(Σt Bt) ≤ Σt λmax(Bt).
8) ‖Aet‖2 ≤ ‖A_{Nt}‖2 ‖et‖2 ≤ √(1 + δs(A)) ‖et‖2.
Proof of Lemma 8.7. Consider Ak := (1/α) Σt E'new Φ0 Lt L't Φ0 Enew. Notice that E'new Φ0 Lt = Rnew at,new + E'new D∗ at,∗. Let Zt := Rnew at,new a't,new R'new and let Yt := Rnew at,new a't,∗ D'∗ Enew + E'new D∗ at,∗ a't,new R'new; then
Ak ⪰ (1/α) Σt Zt + (1/α) Σt Yt.       (6)
Consider Σt Zt = Σt Rnew at,new a't,new R'new.
1) Using item 2 of Fact 8.8, the Zt's are conditionally independent given Xj,k−1.
2) Using item 2, Ostrowski's theorem (Theorem 2.12), and item 6, for all Xj,k−1 ∈ Γj,k−1,
λmin( E((1/α)Σt Zt | Xj,k−1) ) = λmin( Rnew (1/α)Σt E(at,new a't,new) R'new ) ≥ λmin(Rnew R'new) λmin( (1/α)Σt E(at,new a't,new) ) ≥ (1 − (ζ+_∗)²) λ−_{new,k}.
3) Finally, using item 4 and the bound on ‖at‖∞ from the model, conditioned on Xj,k−1, 0 ⪯ Zt ⪯ c γnew,k² I ⪯ c min((1.2)^{2k}γnew², γ∗²) I holds w.p. one for all Xj,k−1 ∈ Γj,k−1.
Thus, applying Corollary 2.8 with ε = cζλ−/24, we get
P( λmin((1/α)Σt Zt) ≥ (1 − (ζ+_∗)²)λ−_{new,k} − cζλ−/24 | Xj,k−1 ) ≥ 1 − c exp( −αζ²(λ−)² / (8 · 24² · min(1.2^{4k}γnew⁴, γ∗⁴)) )     (7)
for all Xj,k−1 ∈ Γj,k−1.
Consider Yt = Rnew at,new a't,∗ D'∗ Enew + E'new D∗ at,∗ a't,new R'new.
1) Using item 2, the Yt's are conditionally independent given Xj,k−1.
2) Using item 2 and the fact that at,new and at,∗ are mutually uncorrelated, E( (1/α)Σt Yt | Xj,k−1 ) = 0 for all Xj,k−1 ∈ Γj,k−1.
3) Using the bound on ‖at‖∞, items 4 and 6, and Fact 8.5, conditioned on Xj,k−1, ‖Yt‖ ≤ 2√(cr) ζ+_∗ γ∗ γnew,k ≤ 2√(cr) ζ+_∗ γ∗² ≤ 2 holds w.p. one for all Xj,k−1 ∈ Γj,k−1. Thus, under the same conditioning, −bI ⪯ Yt ⪯ bI with b = 2 w.p. one.
Thus, applying Corollary 2.8 with ε = cζλ−/24, we get
P( λmin((1/α)Σt Yt) ≥ −cζλ−/24 | Xj,k−1 ) ≥ 1 − c exp( −αc²ζ²(λ−)² / (8 · 24² · (2b)²) )  for all Xj,k−1 ∈ Γj,k−1.       (8)
Combining (6), (7) and (8) and using the union bound, P( λmin(Ak) ≥ λ−_{new,k}(1 − (ζ+_∗)²) − cζλ−/12 | Xj,k−1 ) ≥ 1 − pa,k(α, ζ) for all Xj,k−1 ∈ Γj,k−1. The first claim of the lemma follows by using λ−_{new,k} ≥ λ− and then applying Lemma 2.5 with X ≡ Xj,k−1 and D ≡ Γj,k−1.
Now consider Ak,⊥ := (1/α) Σt E'new,⊥ Φ0 Lt L't Φ0 Enew,⊥. Using item 4, E'new,⊥ Φ0 Lt = E'new,⊥ D∗ at,∗. Thus, Ak,⊥ = (1/α)Σt Zt with Zt = E'new,⊥ D∗ at,∗ a't,∗ D'∗ Enew,⊥, which is of size (n − c) × (n − c). Using the same ideas as above we can show that 0 ⪯ Zt ⪯ r(ζ+_∗)²γ∗² I ⪯ ζ I and E( (1/α)Σt Zt | Xj,k−1 ) ⪯ (ζ+_∗)²λ+ I. Thus, by Corollary 2.8 with ε = cζλ−/24 and Lemma 2.5, the second claim follows.
Using the expression for ℋk given in Definition 5.7, it is easy to see that
‖ℋk‖2 ≤ max{‖Hk‖2, ‖Hk,⊥‖2} + ‖Bk‖2 ≤ ‖(1/α)Σt Aet(Aet)'‖2 + max(‖T2‖2, ‖T4‖2) + ‖Bk‖2       (9)
where T2 := (1/α)Σt E'new Φ0 (Lt(Aet)' + Aet L't) Φ0 Enew and T4 := (1/α)Σt E'new,⊥ Φ0 (Lt(Aet)' + Aet L't) Φ0 Enew,⊥. The second inequality follows by using the facts that (i) Hk = T1 − T2 where T1 := (1/α)Σt E'new Φ0 Aet(Aet)' Φ0 Enew, (ii) Hk,⊥ = T3 − T4 where T3 := (1/α)Σt E'new,⊥ Φ0 Aet(Aet)' Φ0 Enew,⊥, and (iii) max(‖T1‖2, ‖T3‖2) ≤ ‖(1/α)Σt Aet(Aet)'‖2. Next, we obtain high probability bounds on each of the terms on the RHS of (9) using the Hoeffding corollaries.
Consider ‖(1/α)Σt Aet(Aet)'‖2. Let Zt = Aet(Aet)'.
1) Using item 2, conditioned on Xj,k−1, the various Zt's in the summation are independent, for all Xj,k−1 ∈ Γj,k−1.
2) Using item 6 and the bound on ‖at‖∞, conditioned on Xj,k−1, 0 ⪯ Zt ⪯ b1 I w.p. one for all Xj,k−1 ∈ Γj,k−1. Here b1 := (1 + δ+_s)(κ+_{s,A} ζ+_{k−1} φ+ √c γnew,k + ζ+_∗ φ+ √r γ∗)².
3) Also using item 6 and the bound on ‖at‖∞, 0 ⪯ (1/α)Σt E(Zt | Xj,k−1) ⪯ b2 I, with b2 := (1 + δ+_s)((κ+_{s,A})²(ζ+_{k−1})²(φ+)² λ+_{new,k} + (ζ+_∗)²(φ+)² λ+), for all Xj,k−1 ∈ Γj,k−1.
Thus, applying Corollary 2.8 with ε = cζλ−/24,
P( ‖(1/α)Σt Aet(Aet)'‖2 ≤ b2 + cζλ−/24 | Xj,k−1 ) ≥ 1 − n exp( −αc²ζ²(λ−)² / (8 · 24² b1²) )  for all Xj,k−1 ∈ Γj,k−1.     (10)
Consider T2. Let Zt := E'new Φ0 (Lt(Aet)' + Aet L't) Φ0 Enew, which is of size c × c. Then T2 = (1/α)Σt Zt.
1) Using item 2, conditioned on Xj,k−1, the various Zt's used in the summation are mutually independent, for all Xj,k−1 ∈ Γj,k−1. Using item 4, E'new Φ0 Lt = Rnew at,new + E'new D∗ at,∗ and E'new Φ0 Aet = (R'new)^{−1} D'new Aet.
2) Thus, using items 4 and 6 and the bound on ‖at‖∞, it follows that conditioned on Xj,k−1, ‖Zt‖2 ≤ 2b̃3 ≤ 2b3 w.p. one for all Xj,k−1 ∈ Γj,k−1. Here,
b̃3 := (κ+_{s,A}/√(1 − (ζ+_∗)²)) φ+ (κ+_{s,A} ζ+_{k−1} √c γnew,k + √r ζ+_∗ γ∗)(√c γnew,k + √r ζ+_∗ γ∗)
and
b3 := (1/√(1 − (ζ+_∗)²)) ( φ+ c (κ+_{s,A})² ζ+_{k−1} γnew,k² + φ+ √(rc) κ+_{s,A} ζ+_{k−1} ζ+_∗ γnew,k γ∗ + φ+ √(rc) κ+_s ζ+_∗ γ∗ γnew,k + φ+ r (ζ+_∗)² γ∗² ).
3) Also, ‖(1/α)Σt E(Zt | Xj,k−1)‖2 ≤ 2b̃4 ≤ 2b4, where
b̃4 := (κ+_{s,A}/√(1 − (ζ+_∗)²)) φ+ κ+_s ζ+_{k−1} λ+_{new,k} + (κ+_{s,A}/(1 − (ζ+_∗)²)) φ+ (ζ+_∗)² λ+
and
b4 := (1/√(1 − (ζ+_∗)²)) φ+ κ+_{s,A} ζ+_{k−1} λ+_{new,k} + (1/√(1 − (ζ+_∗)²)) φ+ (ζ+_∗)² λ+.
Thus, applying Corollary 2.9 with ε = cζλ−/24,
P( ‖T2‖2 ≤ 2b4 + cζλ−/24 | Xj,k−1 ) ≥ 1 − c exp( −αc²ζ²(λ−)² / (32 · 24² · 4b3²) )  for all Xj,k−1 ∈ Γj,k−1.
Consider T4. Let Zt := Enew,⊥'Φ0 (Lt (Aet)' + Aet Lt')Φ0 Enew,⊥, which is of size (n − c) × (n − c). Then T4 = (1/α) Σt Zt.
1) Using item 2, conditioned on Xj,k−1, the various Zt's used in the summation are mutually independent, for all Xj,k−1 ∈ Γj,k−1. Using item 4, Enew,⊥'Φ0 Lt = Enew,⊥'D∗ at,∗.
2) Thus, conditioned on Xj,k−1, ‖Zt‖2 ≤ 2b5 w.p. one for all Xj,k−1 ∈ Γj,k−1. Here b5 := √(1 + δs+) (φ+ r (ζ∗+)² γ∗² + φ+ √(rc) κ+s ζ∗+ ζ+k−1 γ∗ γnew,k). This follows using item 6 and the bound on ‖at‖∞.
3) Also, ‖(1/α) Σt E(Zt |Xj,k−1)‖2 ≤ 2b6, with b6 := √(1 + δs+) (φ+ (ζ∗+)² λ+).
Applying Corollary 2.9 with ε = cζλ−/24,
P( ‖T4‖2 ≤ 2b6 + cζλ−/24 | Xj,k−1 ) ≥ 1 − (n − c) exp( −αc²ζ²(λ−)² / (32 · 24² · 4b5²) ) for all Xj,k−1 ∈ Γj,k−1
Consider max(‖T2‖2 , ‖T4‖2). Since b3 > b5 (this follows because ζ+k−1 ≤ 1) and b4 > b6, we have 2b6 + cζλ−/24 < 2b4 + cζλ−/24 and 1 − (n − c) exp( −αc²ζ²(λ−)² / (32 · 24² · 4b5²) ) > 1 − (n − c) exp( −αc²ζ²(λ−)² / (32 · 24² · 4b3²) ). Therefore, for all Xj,k−1 ∈ Γj,k−1,
P( ‖T4‖2 ≤ 2b4 + cζλ−/24 | Xj,k−1 ) ≥ 1 − (n − c) exp( −αc²ζ²(λ−)² / (32 · 24² · 4b3²) ).
By the union bound, for all Xj,k−1 ∈ Γj,k−1,
P( max(‖T2‖2 , ‖T4‖2) ≤ 2b4 + cζλ−/24 | Xj,k−1 ) ≥ 1 − n exp( −αc²ζ²(λ−)² / (32 · 24² · 4b3²) )    (11)
Consider ‖Bk‖2. Let Zt := Enew,⊥'Φ0 (Lt − Aet)(Lt' − (Aet)')Φ0 Enew, which is of size (n − c) × c. Then Bk = (1/α) Σt Zt. Using item 4, Enew,⊥'Φ0 (Lt − Aet) = Enew,⊥'(D∗ at,∗ − Φ0 Aet) and Enew'Φ0 (Lt − Aet) = Rnew at,new + Enew'D∗ at,∗ + (Rnew')−1 Dnew'Aet. Also, ‖Zt‖2 ≤ b7 w.p. one for all Xj,k−1 ∈ Γj,k−1 and ‖(1/α) Σt E(Zt |Xj,k−1)‖2 ≤ b8 for all Xj,k−1 ∈ Γj,k−1. Here
b7 := (1 + δs+)(√r ζ∗+ (1 + φ+) γ∗ + κ+s,A ζ+k−1 φ+ √c γnew,k) · ( √c γnew,k + √r ζ∗+ (1 + (1 / √(1 − (ζ∗+)²)) κ+s,A φ+) γ∗ + (1 / √(1 − (ζ∗+)²)) κ+s,A ζ+k−1 φ+ √c γnew,k )
and
b8 := (1 + δs+)( κ+s,A ζ+k−1 φ+ + (1 / √(1 − (ζ∗+)²)) (κ+s,A)³ (ζ+k−1)² (φ+)² ) λ+new,k + (1 + δs+)(ζ∗+)² ( 1 + φ+ + (1 / √(1 − (ζ∗+)²)) κ+s,A φ+ + (1 / √(1 − (ζ∗+)²)) κ+s,A (φ+)² ) λ+
Thus, applying Corollary 2.9 with ε = cζλ−/24,
P( ‖Bk‖2 ≤ b8 + cζλ−/24 | Xj,k−1 ) ≥ 1 − n exp( −αc²ζ²(λ−)² / (32 · 24² b7²) ) for all Xj,k−1 ∈ Γj,k−1    (12)
Using (9), (10), (11) and (12) and the union bound, for any Xj,k−1 ∈ Γj,k−1,
P( ‖ℋk‖2 ≤ b9 + cζλ−/8 | Xj,k−1 ) ≥ 1 − n exp( −αc²ζ²(λ−)² / (8 · 24² b1²) ) − n exp( −αc²ζ²(λ−)² / (32 · 24² · 4b3²) ) − n exp( −αc²ζ²(λ−)² / (32 · 24² b7²) )
where b9 := b2 + 2b4 + b8.
Using λ−new,k ≥ λ− and f := λ+/λ−, b9 + cζλ−/8 ≤ λ−new,k (b + 0.125cζ) (recall that b is defined in Definition 5.11). Using Fact 8.5 and substituting κ+s = 0.15, φ+ = 1.2, δs+ = 0.05, one can upper bound b1, b3 and b7 and show that the above probability is lower bounded by 1 − pc(α, ζ). Finally, applying Lemma 2.5, the third claim of the lemma follows.
IX. APPENDIX
A. Proof of Lemma 2.5
Proof. It is easy to see that P(Be, De) = E[IB(X, Y) ID(X)].
If E[IB(X, Y)|X] ≥ p for all X ∈ D, this means that E[IB(X, Y)|X] ID(X) ≥ p ID(X). This, in turn, implies that
P(Be, De) = E[IB(X, Y) ID(X)] = E[E[IB(X, Y)|X] ID(X)] ≥ p E[ID(X)].
Recall from Definition 2.4 that P(Be|X) = E[IB(X, Y)|X] and P(De) = E[ID(X)]. Thus, we conclude that if P(Be|X) ≥ p for all X ∈ D, then P(Be, De) ≥ p P(De). Using the definition of P(Be|De), the claim follows.
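As a quick numerical illustration of the inequality P(Be, De) ≥ p P(De), one can simulate a toy example. The following is a minimal Monte Carlo sketch; the distribution, the events and the value p = 0.7 are arbitrary illustrative choices and are not part of our model.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200000

    # Toy setup: X uniform on {0, 1, 2}; D^e = {X >= 1}; given X, Y ~ Bernoulli(q[X]); B^e = {Y = 1}
    q = np.array([0.2, 0.7, 0.9])
    X = rng.integers(0, 3, N)
    Y = rng.random(N) < q[X]

    D = X >= 1
    B = Y

    p = q[1:].min()                                 # p = min over X in D of P(B^e | X) = 0.7
    print(np.mean(B & D), ">=", p * np.mean(D))     # empirically, P(B^e, D^e) >= p P(D^e)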
B. Proof of Corollary 2.8
Proof.
1) Since, for any X ∈ D, conditioned on X, the Zt's are independent, the same is also true for Zt − g(X) for any function g of X. Let Yt := Zt − E(Zt |X). Thus, for any X ∈ D, conditioned on X, the Yt's are independent. Also, clearly E(Yt |X) = 0. Since for all X ∈ D, P(b1 I ⪯ Zt ⪯ b2 I | X) = 1 and since λmax(·) is a convex function, and λmin(·) a concave function, of a Hermitian matrix, b1 I ⪯ E(Zt |X) ⪯ b2 I w.p. one for all X ∈ D. Therefore, P(Yt² ⪯ (b2 − b1)² I | X) = 1 for all X ∈ D. Thus, for Theorem 2.7, σ² = ‖Σt (b2 − b1)² I‖2 = α(b2 − b1)². For any X ∈ D, applying Theorem 2.7 to the Yt's conditioned on X, we get that, for any ε > 0,
P( λmax( (1/α) Σt Yt ) ≤ ε | X ) > 1 − n exp( −αε² / (8(b2 − b1)²) ) for all X ∈ D.
By Weyl's theorem, λmax( (1/α) Σt Yt ) = λmax( (1/α) Σt (Zt − E(Zt |X)) ) ≥ λmax( (1/α) Σt Zt ) + λmin( (1/α) Σt −E(Zt |X) ). Since λmin( (1/α) Σt −E(Zt |X) ) = −λmax( (1/α) Σt E(Zt |X) ) ≥ −b4, we get λmax( (1/α) Σt Yt ) ≥ λmax( (1/α) Σt Zt ) − b4. Therefore,
P( λmax( (1/α) Σt Zt ) ≤ b4 + ε | X ) > 1 − n exp( −αε² / (8(b2 − b1)²) ) for all X ∈ D.
2) Let Yt = E(Zt |X) − Zt. As before, E(Yt |X) = 0 and, conditioned on any X ∈ D, the Yt's are independent and P(Yt² ⪯ (b2 − b1)² I | X) = 1. As before, applying Theorem 2.7, we get that for any ε > 0,
P( λmax( (1/α) Σt Yt ) ≤ ε | X ) > 1 − n exp( −αε² / (8(b2 − b1)²) ) for all X ∈ D.
By Weyl's theorem, λmax( (1/α) Σt Yt ) = λmax( (1/α) Σt (E(Zt |X) − Zt) ) ≥ λmin( (1/α) Σt E(Zt |X) ) + λmax( (1/α) Σt −Zt ) = λmin( (1/α) Σt E(Zt |X) ) − λmin( (1/α) Σt Zt ) ≥ b3 − λmin( (1/α) Σt Zt ). Therefore, for any ε > 0,
P( λmin( (1/α) Σt Zt ) ≥ b3 − ε | X ) ≥ 1 − n exp( −αε² / (8(b2 − b1)²) ) for all X ∈ D.
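To get a feel for the bound in part 1, the following short simulation (an illustrative sketch only: the diagonal-matrix ensemble, all parameter values, and the trivial conditioning on X are arbitrary choices) compares the empirical probability that λmax of the sample average exceeds b4 + ε with the bound n exp(−αε²/(8(b2 − b1)²)).

    import numpy as np

    rng = np.random.default_rng(0)
    n, alpha, eps, trials = 20, 5000, 0.1, 200
    b1, b2 = 0.0, 1.0      # each Z_t satisfies b1 I <= Z_t <= b2 I
    b4 = 0.5               # lambda_max of (1/alpha) sum_t E[Z_t]

    exceed = 0
    for _ in range(trials):
        # alpha independent diagonal PSD matrices with Uniform(0, 1) diagonal entries
        diag_entries = rng.uniform(0.0, 1.0, size=(alpha, n))
        avg_diag = diag_entries.mean(axis=0)      # diagonal of (1/alpha) sum_t Z_t
        if avg_diag.max() > b4 + eps:             # lambda_max of the average
            exceed += 1

    bound = n * np.exp(-alpha * eps**2 / (8 * (b2 - b1) ** 2))
    print(exceed / trials, "<=", bound)           # empirical failure rate vs. the Hoeffding bound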
C. Proof of Corollary 2.9
Proof. Define the dilation of an n1 × n2 matrix M as dilation(M) := [0, M' ; M, 0]. Notice that this is an (n1 + n2) × (n1 + n2) Hermitian matrix [19]. As shown in [19, equation 2.12],
λmax(dilation(M)) = ‖dilation(M)‖2 = ‖M‖2    (13)
Thus, the corollary assumptions imply that P(‖dilation(Zt)‖2 ≤ b1 | X) = 1 for all X ∈ D. Thus, P(−b1 I ⪯ dilation(Zt) ⪯ b1 I | X) = 1 for all X ∈ D.
Using (13), the corollary assumptions also imply that (1/α) Σt E(dilation(Zt)|X) = dilation( (1/α) Σt E(Zt |X) ) ⪯ b2 I for all X ∈ D.
Finally, the Zt's being conditionally independent given X, for any X ∈ D, implies that the same also holds for the dilation(Zt)'s.
Thus, applying Corollary 2.8 to the sequence {dilation(Zt)}, we get that
P( λmax( (1/α) Σt dilation(Zt) ) ≤ b2 + ε | X ) ≥ 1 − (n1 + n2) exp( −αε² / (32 b1²) ) for all X ∈ D.
Using (13), λmax( (1/α) Σt dilation(Zt) ) = λmax( dilation( (1/α) Σt Zt ) ) = ‖(1/α) Σt Zt‖2 and this gives the final result.
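Equation (13) is easy to verify numerically; the following sketch (the sizes and the random M are arbitrary, for illustration only) checks that λmax(dilation(M)) equals ‖M‖2.

    import numpy as np

    rng = np.random.default_rng(0)
    n1, n2 = 5, 3
    M = rng.standard_normal((n1, n2))

    # dilation(M) = [[0, M'], [M, 0]] is an (n1 + n2) x (n1 + n2) symmetric matrix
    dil = np.block([[np.zeros((n2, n2)), M.T],
                    [M, np.zeros((n1, n1))]])

    lam_max = np.linalg.eigvalsh(dil).max()
    spec_norm = np.linalg.norm(M, 2)
    print(np.isclose(lam_max, spec_norm))   # True: lambda_max(dilation(M)) = ||M||_2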
D. Proof of Lemma 2.14
Proof. Because P, Q and P̂ are basis matrices, P'P = I, Q'Q = I and P̂'P̂ = I.
1) Using P'P = I and ‖M‖2² = ‖M M'‖2, ‖(I − P̂ P̂')P P'‖2 = ‖(I − P̂ P̂')P‖2. Similarly, ‖(I − P P')P̂ P̂'‖2 = ‖(I − P P')P̂‖2. Let D1 = (I − P̂ P̂')P P' and let D2 = (I − P P')P̂ P̂'. Notice that ‖D1‖2 = √λmax(D1'D1) = √‖D1'D1‖2 and ‖D2‖2 = √λmax(D2'D2) = √‖D2'D2‖2. So, in order to show ‖D1‖2 = ‖D2‖2, it suffices to show that ‖D1'D1‖2 = ‖D2'D2‖2. Let P'P̂ = UΣV' be its SVD. Then, D1'D1 = P(I − P'P̂ P̂'P)P' = P U(I − Σ²)U'P' and D2'D2 = P̂(I − P̂'P P'P̂)P̂' = P̂ V(I − Σ²)V'P̂' are the compact SVDs of D1'D1 and D2'D2 respectively. Therefore, ‖D1'D1‖2 = ‖D2'D2‖2 = ‖I − Σ²‖2 and hence ‖(I − P̂ P̂')P P'‖2 = ‖(I − P P')P̂ P̂'‖2.
2) ‖P P' − P̂ P̂'‖2 = ‖P P' − P̂ P̂'P P' + P̂ P̂'P P' − P̂ P̂'‖2 ≤ ‖(I − P̂ P̂')P P'‖2 + ‖(I − P P')P̂ P̂'‖2 = 2ζ∗.
3) Since Q'P = 0, ‖Q'P̂‖2 = ‖Q'(I − P P')P̂‖2 ≤ ‖(I − P P')P̂‖2 = ζ∗.
4) Let M = (I − P̂ P̂')Q. Then M'M = Q'(I − P̂ P̂')Q and so σi((I − P̂ P̂')Q) = √λi(Q'(I − P̂ P̂')Q). Clearly, λmax(Q'(I − P̂ P̂')Q) ≤ 1. By Weyl's theorem, λmin(Q'(I − P̂ P̂')Q) ≥ 1 − λmax(Q'P̂ P̂'Q) = 1 − ‖Q'P̂‖2² ≥ 1 − ζ∗². Therefore, √(1 − ζ∗²) ≤ σi((I − P̂ P̂')Q) ≤ 1.
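The equality in item 1 can also be checked numerically for two randomly generated basis matrices. This is a minimal sketch; the dimensions are arbitrary and both basis matrices are taken to have the same rank, which is what the argument above uses.

    import numpy as np

    rng = np.random.default_rng(0)
    n, r = 30, 4

    # Random n x r basis matrices P and P_hat (orthonormal columns)
    P, _ = np.linalg.qr(rng.standard_normal((n, r)))
    Phat, _ = np.linalg.qr(rng.standard_normal((n, r)))

    I = np.eye(n)
    D1 = (I - Phat @ Phat.T) @ P @ P.T
    D2 = (I - P @ P.T) @ Phat @ Phat.T

    # Item 1 of Lemma 2.14: the two subspace error norms coincide
    print(np.isclose(np.linalg.norm(D1, 2), np.linalg.norm(D2, 2)))   # True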
E. Proof of Lemma 3.4
Proof. Let B = (I − P P')A. By the Rayleigh-Ritz theorem,
(1 − δs(B)) ≤ λmin(BT'BT) ≤ λmax(BT'BT) ≤ (1 + δs(B)),
which implies
1 − λmin(BT'BT) ≤ δs(B) and λmax(BT'BT) − 1 ≤ δs(B),
with one of the two holding with equality. Now BT'BT = AT'AT − (AT'P)(AT'P)'. By Weyl's theorem,
λmin(AT'AT) + λmin(−(AT'P)(AT'P)') ≤ λmin(BT'BT)    (14)
and
λmax(BT'BT) ≤ λmax(AT'AT) + λmax(−(AT'P)(AT'P)').    (15)
From (14), using the fact that λmin(−H) = −λmax(H), we get 1 − δs(A) − λmax((AT'P)(AT'P)') ≤ λmin(BT'BT), which implies
1 − λmin(BT'BT) ≤ δs(A) + λmax((AT'P)(AT'P)') ≤ δs(A) + κs,A(P)².
From (15) we get λmax(BT'BT) ≤ 1 + δs(A) − λmin((AT'P)(AT'P)'), which implies
λmax(BT'BT) − 1 ≤ δs(A) − λmin((AT'P)(AT'P)') ≤ δs(A).
Therefore,
δs(B) = δs((I − P P')A) ≤ δs(A) + κs,A(P)².
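For small sizes, the bound can be verified by brute force, computing δs exactly by enumerating all size-s supports. The sketch below is illustrative only: the sizes, the Gaussian A and the random basis matrix P are arbitrary choices, and κs,A(P) is computed as the maximum of ‖AT'P‖2 over supports of size s, which is how it enters the proof above.

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)
    n, m, s, r = 8, 20, 2, 2       # small enough that the RIC can be computed exactly

    A = rng.standard_normal((n, m)) / np.sqrt(n)        # columns with roughly unit norm
    P, _ = np.linalg.qr(rng.standard_normal((n, r)))    # basis matrix P
    B = (np.eye(n) - P @ P.T) @ A

    def delta_s(M, s):
        # RIC: max over |T| = s of max(1 - lambda_min, lambda_max - 1) of M_T' M_T
        d = 0.0
        for T in combinations(range(M.shape[1]), s):
            cols = list(T)
            ev = np.linalg.eigvalsh(M[:, cols].T @ M[:, cols])
            d = max(d, 1.0 - ev.min(), ev.max() - 1.0)
        return d

    kappa = max(np.linalg.norm(A[:, list(T)].T @ P, 2)
                for T in combinations(range(m), s))     # kappa_{s,A}(P)

    print(delta_s(B, s), "<=", delta_s(A, s) + kappa ** 2)   # Lemma 3.4 inequality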
REFERENCES
[1] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of ACM, vol. 58, no. 3, 2011.
[2] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, “Rank-sparsity incoherence for matrix decomposition,” SIAM Journal
on Optimization, vol. 21, 2011.
[3] H. Xu, C. Caramanis, and S. Sanghavi, “Robust pca via outlier pursuit,” IEEE Trans. on Information Theory, vol. 58, no. 5, 2012.
[4] M. McCoy and J. Tropp, “Two proposals for robust pca using semidefinite programming,” arXiv:1012.1086v3, 2010.
[5] M. B. McCoy and J. A. Tropp, “Sharp recovery bounds for convex deconvolution, with applications,” arXiv:1205.1580.
[6] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of linear inverse problems,” Foundations of
Computational Mathematics, no. 6, 2012.
[7] Y. Hu, S. Goud, and M. Jacob, “A fast majorize-minimize algorithm for the recovery of sparse and low-rank matrices,” IEEE Transactions on Image Processing, vol. 21, no. 2, pp. 742–753, Feb 2012.
[8] A. E. Waters, A. C. Sankaranarayanan, and R. G. Baraniuk, “Sparcs: Recovering low-rank and sparse matrices from compressive
measurements,” in Proc. of Neural Information Processing Systems (NIPS), 2011.
[9] E. Richard, P.-A. Savalle, and N. Vayatis, “Estimation of simultaneously sparse and low rank matrices,” arXiv:1206.6474, appears in
Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
[10] D. Hsu, S. M. Kakade, and T. Zhang, “Robust matrix decomposition with outliers,” arXiv:1011.1518.
[11] M. Mardani, G. Mateos, and G. B. Giannakis, “Recovery of low-rank plus compressed sparse matrices with application to unveiling
traffic anomalies,” arXiv:1204.6537.
[12] J. Wright, A. Ganesh, K. Min, and Y. Ma, “Compressive principal component pursuit,” arXiv:1202.4596.
[13] A. Ganesh, K. Min, J. Wright, and Y. Ma, “Principal component pursuit with reduced linear measurements,” arXiv:1202.6445.
[14] M. Tao and X. Yuan, “Recovering low-rank and sparse components of matrices from incomplete and noisy observations,” SIAM Journal
on Optimization, vol. 21, no. 1, pp. 57–81, 2011.
[15] C. Qiu, N. Vaswani, and L. Hogben, “Recursive robust pca or recursive sparse recovery in large but structured noise,” in IEEE Intl.
Conf. Acoustics, Speech, Sig. Proc. (ICASSP), 2013, longer version in arXiv: 1211.3754 [cs.IT].
[16] N. Vaswani and W. Lu, “Modified-cs: Modifying compressive sensing for problems with partially known support,” IEEE Trans. Signal
Processing, September 2010.
[17] E. Candes and T. Tao, “Decoding by linear programming,” IEEE Trans. Info. Th., vol. 51(12), pp. 4203 – 4215, Dec. 2005.
[18] G. Grimmett and D. Stirzaker, Probability and Random Processes. Oxford University Press, 2001.
[19] J. A. Tropp, “User-friendly tail bounds for sums of random matrices,” Foundations of Computational Mathematics, vol. 12, no. 4, 2012.
[20] C. Davis and W. M. Kahan, “The rotation of eigenvectors by a perturbation. III,” SIAM Journal on Numerical Analysis, Mar. 1970.
[21] R. Horn and C. Johnson, Matrix Analysis. Cambridge University Press, 1985.
[22] J. Zhan and N. Vaswani, “Time invariant error bounds for modified-cs based sparse signal sequence recovery,” arXiv:1305.0842.
[23] G. Li and Z. Chen, “Projection-pursuit approach to robust dispersion matrices and principal components: Primary theory and Monte Carlo,” Journal of the American Statistical Association, vol. 80, no. 391, pp. 759–766, 1985.