Recursive Projected Modified Compressed Sensing for Undersampled Measurements (Long Version)

Brian Lois, Namrata Vaswani, and Chenlu Qiu Iowa State University, Ames IA
Abstract
We study the problem of recursively reconstructing a time sequence of sparse vectors St from measurements of
the form Mt = ASt + BLt where A and B are known measurement matrices, and Lt lies in a slowly changing low
dimensional subspace. We assume that the signal of interest (St ) is sparse, and has support which is correlated over
time. We introduce a solution which we call Recursive Projected Modified Compressed Sensing (ReProMoCS),
which exploits the correlated support change of St . We show that, under weaker assumptions than previous work,
with high probability, ReProMoCS will exactly recover the support set of St and the reconstruction error of St
is upper bounded by a small time-invariant value. A motivating application where the above problem occurs is in
functional MRI imaging of the brain to detect regions that are “activated” in response to stimuli. In this case both
measurement matrices are the same (i.e. A = B). The active region image constitutes the sparse vector St and
this region changes slowly over time. The background brain image changes are global but the amount of change
is very little and hence it can be well modeled as lying in a slowly changing low dimensional subspace, i.e. this
constitutes Lt .
I. INTRODUCTION
We study the problem of recursively reconstructing a time sequence of sparse vectors St from undersampled measurements corrupted by (possibly) large magnitude but low-dimensional noise. By low-dimensional we mean that the Lt's all lie in a low-dimensional subspace, which is allowed to change
slowly over time. Specifically we observe
Mt := ASt + BLt
where A and B are known matrices, and Lt is (potentially) large but low-dimensional and non-sparse
noise. This problem has applications in fMRI imaging where the “active” region of the brain is the sparse
signal of interest, and the background brain image is slowly changing low dimensional background. In
this case, B = A is the partial Fourier matrix. Another application is compressed sensing in the presence
of large but low-dimensional sensor noise. In this case B = I and A is the usual CS matrix.
Related Work. This work is related to [1], [2] which study the problem of decomposing an observed
matrix M = [M1 · · · Mt ] into the sum of a sparse matrix S = [S1 · · · St ] and a low rank matrix L =
[L1 · · · Lt ]. These papers can all be viewed as batch solutions to the problem studied here. A limitation
of these works is that they either require independence over time of the support of S [1] or stronger
restrictions on the sparsity pattern of S [2], [3]. A more detailed discussion is provided in Section IV-E.
Both of these are not practical assumptions for image analysis because foreground objects often move in
a correlated fashion over time. This can result in some rows of S having many non-zero entries, while
other rows have very few or no non-zero entries. These papers also do not consider the case where the
sparse vectors are undersampled. Other related works, some of which consider the under-sampled case,
include [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14].
Contribution. The problem of recursively reconstructing sparse vectors in large but low-dimensional
noise was studied in [15], where the solution called Recursive Projected CS was presented and analyzed.
This work extends these results in two ways. First, we study the case where the sparse and low rank
vectors may be under-sampled (multiplied by the matrices A and B). Second, we modify the algorithm
in [15] by using modified-CS [16] in place of the simple ℓ1 minimization (CS) step in ReProCS. This
allows us to prove the same performance guarantees under weaker assumptions as long as the support of
St changes slowly over time.
II. NOTATION AND BACKGROUND
A. Notation
Let [1 : n] := {1, 2, . . . , n}. For a set T ⊆ [1 : n], |T| denotes the number of elements in T. For a vector v, vi denotes the ith entry of v and vT denotes the vector formed from the entries of v indexed by T. We use ‖v‖p for the ℓp norm of v.
For a matrix M, M' denotes its transpose and M† denotes its Moore–Penrose pseudo-inverse. ‖M‖2 refers to the induced 2-norm of M, i.e. ‖M‖2 = max_{‖x‖2=1} ‖Mx‖2. For a Hermitian matrix H, we use H EVD= UΛU' to denote the eigenvalue decomposition, where U is a unitary matrix and Λ is diagonal with entries arranged in decreasing order. For Hermitian matrices H1 and H2, the notation H1 ⪯ H2 means that H2 − H1 is positive semi-definite. I is an identity matrix of appropriate size. For an indexing set
T and matrix M, MT is the matrix formed by retaining the columns of M indexed by T . Notice that
MT = M IT . For matrices M1 and M2 with the same number of rows, [M1 M2 ] is the matrix formed by
horizontally concatenating the matrices M1 and M2 . We use range(M ) to refer to the subspace spanned
by the columns of M .
Definition 2.1. We refer to a tall matrix P as a basis matrix if it satisfies P 0 P = I.
Definition 2.2. We use the notation Q = basis(C) to mean: Q is a matrix such that the columns of Q
form an orthonormal basis for range(C) i.e. Q0 Q = I and range(Q) = range(C).
Definition 2.3. The s-restricted isometry constant (RIC) [17], δs , for an n × m matrix Ψ is the smallest
real number satisfying
(1 − δs )kxk22 ≤ kΨxk22 ≤ (1 + δs )kxk22
for all s-sparse vectors x.
It follows easily that max_{|T|≤s} ‖(Ψ'_T Ψ_T)^{−1}‖2 ≤ 1/(1 − δs(Ψ)) [17].
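As a quick illustration of Definition 2.3, the following brute-force sketch computes δs for a small matrix by checking every support of size s. The Gaussian Ψ, its dimensions, and s = 2 are illustrative assumptions, not quantities from this paper; the enumeration is exponential in n, so this is feasible only for toy sizes.

import numpy as np
from itertools import combinations

def restricted_isometry_constant(Psi, s):
    # smallest delta with (1-delta)||x||^2 <= ||Psi x||^2 <= (1+delta)||x||^2 for all
    # s-sparse x, computed from the extreme singular values of every s-column submatrix
    n = Psi.shape[1]
    delta = 0.0
    for T in combinations(range(n), s):
        sv = np.linalg.svd(Psi[:, list(T)], compute_uv=False)
        delta = max(delta, sv[0] ** 2 - 1.0, 1.0 - sv[-1] ** 2)
    return delta

rng = np.random.default_rng(0)
Psi = rng.standard_normal((20, 40))
Psi /= np.linalg.norm(Psi, axis=0)          # unit-norm columns
print(restricted_isometry_constant(Psi, 2))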
Definition 2.4. We give some notation for random variables in this definition.
1) We let E[Z] denote the expectation of a random variable (r.v.) Z and E[Z|X] denote its conditional
expectation given another r.v. X.
2) Let B be a set of values that a r.v. Z can take. We use B e to denote the event Z ∈ B, i.e.
B e := {Z ∈ B}.
3) The probability of any event B e can be expressed as [18],
P(B e ) := E[IB (Z)].
where
I_B(Z) := 1 if Z ∈ B, and 0 otherwise,
is the indicator function on the set B.
4) For two events B e , B˜e , P(B e |B˜e ) refers to the conditional probability of B e given B˜e , i.e. P(B e |B˜e ) :=
P(B e , B˜e )/P(B˜e ).
5) For a r.v. X, and a set B of values that the r.v. Z can take, the notation P(B e |X) is defined as
P(B e |X) := E[IB (Z)|X].
Notice that P(B e |X) is a r.v. (it is a function of the r.v. X) that always lies between zero and one.
B. High Probability Tail Bounds for Sums of Independent Random Matrices
Lemma 2.5. Suppose that B is the set of values that the r.v.s X, Y can take. Suppose that D is a set of values that the r.v. X can take. For a 0 ≤ p ≤ 1, if P(B^e|X) ≥ p for all X ∈ D, then P(B^e|D^e) ≥ p as long as P(D^e) > 0.
The proof is in the Appendix.
The following lemma is an easy consequence of the chain rule of probability applied to a contracting
sequence of events.
Lemma 2.6. For a sequence of events E0^e, E1^e, . . . , Em^e that satisfy E0^e ⊇ E1^e ⊇ E2^e ⊇ · · · ⊇ Em^e, the following holds:
P(Em^e | E0^e) = ∏_{k=1}^{m} P(Ek^e | E_{k−1}^e).
Proof. P(Em^e|E0^e) = P(Em^e, E_{m−1}^e, . . . , E0^e | E0^e) = ∏_{k=1}^{m} P(Ek^e | E_{k−1}^e, E_{k−2}^e, . . . , E0^e) = ∏_{k=1}^{m} P(Ek^e | E_{k−1}^e).
Next, we state the matrix Hoeffding inequality [19, Theorem 1.3] which gives tail bounds for sums of
independent random matrices.
Theorem 2.7 (Matrix Hoeffding for a zero mean Hermitian matrix [19]). Consider a finite sequence {Zt} of independent, random, Hermitian matrices of size n × n, and let {At} be a sequence of fixed Hermitian matrices. Assume that each random matrix satisfies (i) P(Zt² ⪯ At²) = 1 and (ii) E(Zt) = 0. Then, for all ε > 0,
P( λmax(Σt Zt) ≤ ε ) ≥ 1 − n exp( −ε² / (8σ²) ),  where σ² := ‖Σt At²‖2.
Corollary 2.8 (Matrix Hoeffding conditioned on another random variable for a nonzero mean Hermitian matrix). Given an α-length sequence {Zt} of random Hermitian matrices of size n × n, a r.v. X, and a set D of values that X can take. Assume that, for all X ∈ D, (i) the Zt's are conditionally independent given X; (ii) P(b1 I ⪯ Zt ⪯ b2 I | X) = 1; and (iii) b3 I ⪯ (1/α) Σt E(Zt|X) ⪯ b4 I. Then for all ε > 0,
P( λmax( (1/α) Σt Zt ) ≤ b4 + ε | X ) ≥ 1 − n exp( −αε² / (8(b2 − b1)²) )  for all X ∈ D,
P( λmin( (1/α) Σt Zt ) ≥ b3 − ε | X ) ≥ 1 − n exp( −αε² / (8(b2 − b1)²) )  for all X ∈ D.
The proof is in the Appendix.
Corollary 2.9 (Matrix Hoeffding conditioned on another random variable for an arbitrary nonzero mean matrix). Given an α-length sequence {Zt} of random matrices of size n1 × n2, a r.v. X, and a set C of values that X can take. Assume that, for all X ∈ C, (i) the Zt's are conditionally independent given X; (ii) P(‖Zt‖2 ≤ b1 | X) = 1; and (iii) ‖(1/α) Σt E(Zt|X)‖2 ≤ b2. Then, for all ε > 0,
P( ‖(1/α) Σt Zt‖2 ≤ b2 + ε | X ) ≥ 1 − (n1 + n2) exp( −αε² / (32 b1²) )  for all X ∈ C.
The proof is in the Appendix.
C. Results from Linear Algebra
Davis and Kahan’s sin θ theorem [20] studies the rotation of eigenvectors by perturbation.
Theorem 2.10 (sin θ theorem [20]). Given two Hermitian matrices 𝒞 and ℋ satisfying
𝒞 = [E E⊥] [C 0; 0 C⊥] [E E⊥]',   ℋ = [E E⊥] [H B'; B H⊥] [E E⊥]'     (1)
where [E E⊥] is an orthogonal matrix. The two ways of representing 𝒞 + ℋ are
𝒞 + ℋ = [E E⊥] [C + H  B'; B  C⊥ + H⊥] [E E⊥]' = [F F⊥] [Λ 0; 0 Λ⊥] [F F⊥]'
where [F F⊥] is another orthogonal matrix. Let R := (𝒞 + ℋ)E − EC = ℋE. If λmin(C) > λmax(Λ⊥), then
‖(I − FF')E‖2 ≤ ‖R‖2 / (λmin(C) − λmax(Λ⊥)).
The above result bounds the amount by which the two subspaces range(E) and range(F ) differ as a
function of the norm of the perturbation kRk2 and of the gap between the minimum eigenvalue of C and
the maximum eigenvalue of Λ⊥ .
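The following small numerical sketch illustrates the theorem: it builds 𝒞 with a well-separated top block, adds a small Hermitian perturbation ℋ, and compares ‖(I − FF')E‖2 with the bound. All sizes and magnitudes below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
n, r = 30, 3
EE = np.linalg.qr(rng.standard_normal((n, n)))[0]           # orthonormal [E E_perp]
diag = np.concatenate([5 + rng.random(r), 0.1 * rng.random(n - r)])
Cmat = EE @ np.diag(diag) @ EE.T                             # script-C, top block C well separated
H = 0.05 * rng.standard_normal((n, n)); H = (H + H.T) / 2    # small Hermitian perturbation

w, V = np.linalg.eigh(Cmat + H)
order = np.argsort(w)[::-1]
F = V[:, order[:r]]                                          # top-r eigenvectors of C + H
lam_perp_max = w[order[r]]                                   # lambda_max(Lambda_perp)

E = EE[:, :r]
lhs = np.linalg.norm((np.eye(n) - F @ F.T) @ E, 2)
rhs = np.linalg.norm(H @ E, 2) / (diag[:r].min() - lam_perp_max)
print(lhs, rhs)                                              # lhs <= rhs by Theorem 2.10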
Next, we state Weyl’s theorem which bounds the eigenvalues of a perturbed Hermitian matrix, followed
by Ostrowski’s theorem.
Theorem 2.11 (Weyl [21]). Let C and H be two n × n Hermitian matrices. For each i = 1, 2, . . . , n we
have
λi (C) + λmin (H) ≤ λi (C + H) ≤ λi (C) + λmax (H)
Theorem 2.12 (Ostrowski [21]). Let H and W be n × n matrices, with H Hermitian and W nonsingular.
For each i = 1, 2 . . . n, there exists a positive real number θi such that λmin (W W 0 ) ≤ θi ≤ λmax (W W 0 )
and λi (W HW 0 ) = θi λi (H). Therefore,
λmin (W HW 0 ) ≥ λmin (W W 0 )λmin (H)
The following lemma is a consequence of Weyl’s theorem (Theorem 2.11) and the sin θ theorem
(Theorem 2.10)
Lemma 2.13. Suppose that two Hermitian matrices 𝒞 and ℋ can be decomposed as in (1), where [E E⊥] is an orthogonal matrix and C is a c × c matrix. Also suppose that the EVD of 𝒞 + ℋ is
𝒞 + ℋ EVD= [F F⊥] [Λ 0; 0 Λ⊥] [F F⊥]'.
If λmin(C) − ‖C⊥‖2 − ‖ℋ‖2 > 0, then
‖(I − FF')E‖2 ≤ ‖ℋ‖2 / (λmin(C) − ‖C⊥‖2 − ‖ℋ‖2).
Proof. By definition of EVD, [F F⊥] is an orthogonal matrix. By the sin θ theorem, if λmin(C) > λmax(Λ⊥), then ‖(I − FF')E‖2 ≤ ‖R‖2 / (λmin(C) − λmax(Λ⊥)) where R := ℋE. Clearly ‖R‖ ≤ ‖ℋ‖. Since the assumption implies λmin(C) > λmax(C⊥) ≤ ‖C⊥‖2 and C is a c × c matrix, λc+1(𝒞) = λmax(C⊥). By definition of EVD (eigenvalues arranged in decreasing order), and since Λ is a c × c matrix, λc+1(𝒞 + ℋ) = λmax(Λ⊥). By Weyl's theorem, λmax(Λ⊥) = λc+1(𝒞 + ℋ) ≤ λc+1(𝒞) + λmax(ℋ). Since λmax(ℋ) ≤ ‖ℋ‖2, the result follows.
Lemma 2.14. Suppose that P, P̂ and Q are three basis matrices. Also, P and P̂ are of the same size, Q'P = 0 and ‖(I − P̂P̂')P‖2 = ζ∗. Then,
1) ‖(I − P̂P̂')PP'‖2 = ‖(I − PP')P̂P̂'‖2 = ‖(I − PP')P̂‖2 = ‖(I − P̂P̂')P‖2 = ζ∗
2) ‖PP' − P̂P̂'‖2 ≤ 2‖(I − P̂P̂')P‖2 = 2ζ∗
3) ‖P̂'Q‖2 ≤ ζ∗
4) √(1 − ζ∗²) ≤ σi((I − P̂P̂')Q) ≤ 1
The proof is in the Appendix.
Lemma 2.15. Let B be any matrix and let Q(B) be a matrix such that the columns of Q(B) form an orthonormal basis for range(B). Then,
‖A'_T B‖2 ≤ ‖A'_T Q(B)‖2 ‖B‖2.
The proof is in the Appendix.
D. Compressive Sensing Result
Lemma 2.16 ([22], Lemma 4). Consider a sequence of vectors xt with support Nt ⊆ [1 : n] such that |Nt| ≤ s, and yt = Axt + βt where ‖βt‖2 ≤ ε. Suppose that at each time t at most sa elements enter the support of xt and at most sa elements leave it. Let T̂t be an estimate of the support of xt such that |T̂t \ Nt| ≤ s∆e and |Nt \ T̂t| ≤ s∆. For t ≥ 2, let x̂t be the solution to the optimization problem
min_{x̃} ‖(x̃)_{T̂t^c}‖1  s.t.  ‖yt − Ax̃‖2 ≤ ε
and set
T̂t+1 = N̂t = {i ∈ [1 : n] : |(x̂t)i| > ω}.
Here N̂t is the final estimate of the support of xt. This value also serves as the initial support estimate for time t + 1. Let ϱ be the smallest real number such that
‖xt − x̂t‖∞ ≤ (ϱ/√sa) ‖xt − x̂t‖2.
Notice that ϱ ≤ √sa because the infinity norm is always less than or equal to the two norm. Then,
1) All elements of the set {i ∈ Nt : |(xt)i| ≥ b1} will get detected (be included in N̂t) if δ_{s+s∆e+2s∆}(A) < 0.207 and b1 > ω + (ϱ/√sa) 7.50 ε.
2) There will be no false additions, and all the true removals from the support (T̂t \ Nt) will get deleted, if δ_{s+s∆e+2s∆}(A) < 0.207 and ω ≥ (ϱ/√sa) 7.50 ε.
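A minimal numerical sketch of the modified-CS step in Lemma 2.16 is given below, assuming the cvxpy package is available as the convex solver; the problem sizes, the Gaussian A, and the threshold ω are illustrative assumptions. It minimizes the ℓ1 norm of x outside the prior support estimate T̂ subject to the data constraint, then thresholds to obtain N̂t.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n, s = 40, 100, 8
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n); x_true[:s] = 1.0
eps = 0.01
noise = rng.standard_normal(m); noise *= eps / np.linalg.norm(noise)
y = A @ x_true + noise

T_hat = np.arange(1, s + 1)                       # prior support estimate, off by one index
outside = np.setdiff1d(np.arange(n), T_hat)       # complement of T_hat

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm1(x[outside])),
                  [cp.norm(y - A @ x, 2) <= eps])
prob.solve()

omega = 0.5                                       # support threshold
N_hat = np.flatnonzero(np.abs(x.value) > omega)   # new support estimate
print(N_hat)                                      # expected: indices 0..s-1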
III. PROBLEM FORMULATION
The measurement vector at time t, Mt , is an m-dimensional vector which is constituted as
Mt = ASt + BLt .
A and B are known m × n matrices where n can be greater than m. St is a sparse vector in Rn with
support size at most s and whose non-zero entries have magnitude at least Smin . Furthermore, the number
of elements that enter the support and the number of elements that exit the support at a given time are
both bounded by sa . Lt is a dense vector which lies in a slowly changing low-dimensional subspace of
Rn . Since all of the Lt ’s lie in a low dimensional subspace so do the vectors BLt . Therefore let L̃t = BLt ,
then the observed vector can be expressed as Mt = ASt + L̃t .
We assume that we have access to an accurate estimate of the subspace in which the first ttrain
vectors L̃t lie. That is, we have a basis matrix Q̂0 such that ‖(I − Q0Q0')Q̂0‖2 is small. Here
Q0 = basis([L̃1 , L̃2 , . . . , L̃ttrain ]). Our goal is to estimate St at every time t > ttrain .
Notation for St. Let Nt := {i ∈ [1 : n] : (St)i ≠ 0} be the support of St. At each time t > ttrain, Nt changes as
Nt = (Nt−1 ∪ Nadd,t) \ Ndel,t
where Nadd,t are the elements added to the support at time t, and Ndel,t are the elements deleted from the support at time t. Define
s := max_t |Nt|  and  sa := max_t max{|Nadd,t|, |Ndel,t|}.
Let Smin := min_t min_{i∈Nt} |(St)i| be a lower bound on the magnitude of any non-zero entry of St at any time. Let T̂t be an estimate of the support Nt. Denote by ∆e,t the elements of T̂t that are not actually in Nt, i.e. ∆e,t = T̂t \ Nt. Similarly, ∆t denotes the elements of Nt that are not in T̂t, i.e. ∆t = Nt \ T̂t.
Model on Lt.
1) The Lt's lie in a low-dimensional subspace that changes every so often. Let P(t) be a basis matrix for this subspace at time t. Then Lt = P(t) at where P(t) changes every so often. Denote the subspace change times by tj, j = 0, 1, 2, · · · , J (there are a maximum of J subspace change times). For tj ≤ t < tj+1, P(t) = Pj where Pj is an n × rj basis matrix with rj ≪ n and rj ≪ (tj+1 − tj).
2) At the change times tj, Pj changes as Pj = [Pj−1 Pj,new] where Pj,new is an n × cj,new basis matrix with columns orthogonal to the columns of Pj−1. Therefore rj = rj−1 + cj,new. We can then write at = [at,∗ ; at,new] conformal with the partition of Pj.
3) There exists a constant cmx such that 0 ≤ cj,new ≤ cmx.
4) The vector of coefficients at = P(t)' Lt is a zero mean bounded random variable. That is, E(at) = 0 and there is a constant γ∗ such that ‖at‖∞ ≤ γ∗ for all t. The covariance matrix Λt := Cov[at] = E(at at') is diagonal with λ− := mint λmin(Λt) > 0 and λ+ := maxt λmax(Λt) < ∞. Thus the condition number of any Λt is bounded by f := λ+/λ−.
5) Finally, the at's are mutually independent over time.
Definition 3.1. Define Qj := basis(BPj) and Qj,new := basis((I − Qj−1Q'j−1)BPj). Notice that Q'j−1 Qj,new = 0.
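A minimal simulation sketch of this measurement model is given below. For simplicity it assumes a single, non-changing subspace and illustrative sizes (not quantities from the paper); the support of St shifts by one index per frame, one example of slow, correlated support change.

import numpy as np

rng = np.random.default_rng(0)
n, m, r0, s, T = 200, 120, 5, 10, 300                # illustrative sizes (assumptions)
gamma_star, S_min = 10.0, 2.0
A = rng.standard_normal((m, n)) / np.sqrt(m)         # measurement matrix for S_t
B = A                                                # fMRI-type case where B = A
P0 = np.linalg.qr(rng.standard_normal((n, r0)))[0]   # basis for the subspace of the L_t's

support = np.arange(s)
M, S, L = [], [], []
for t in range(T):
    a_t = rng.uniform(-gamma_star, gamma_star, r0)   # bounded, zero-mean coefficients
    L_t = P0 @ a_t
    S_t = np.zeros(n)
    S_t[support] = S_min                             # non-zero entries of magnitude S_min
    M.append(A @ S_t + B @ L_t)                      # M_t = A S_t + B L_t
    S.append(S_t); L.append(L_t)
    support = (support + 1) % n                      # slow support change (one index per t)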
A. Slow Subspace Change
By slow subspace change we mean all of the following:
1) The delay between consecutive subspace change times, tj+1 − tj, is large enough.
2) The vector of coefficients along the new directions, at,new, is initially small, i.e.
max_{tj ≤ t < tj+α} ‖at,new‖∞ ≤ γnew,  with γnew ≪ γ∗ and γnew ≪ Smin,
but can increase gradually. This is modeled as follows. Split the interval [tj, tj+1 − 1] into α-length periods. We assume that
max_j max_{t ∈ [tj+(k−1)α, tj+kα−1]} ‖at,new‖∞ ≤ min(v^{k−1} γnew, γ∗)
for a v > 1 but not too large.
3) The number of newly added directions is small, i.e. cj,new ≤ cmx ≪ r0.
B. Matrix Coherence Parameter and its Relation with RIC
Definition 3.2. For an m × n matrix C and an m × p matrix A, define the incoherence coefficient κs,A(C) by
κs,A(C) := max_{|T|≤s} ‖A_T' basis(C)‖2
and let κs(C) := κs,I(C).
κs,A (C) can be thought of as a measure of the coherence between range(AT ) and range(C) for any
set T of size less than or equal to s. When A = I, κs (C) measures the denseness of columns of basis(C)
and hence in [15] it was referred to as the denseness coefficient.
We will use the notation κ2s,A (C) to mean (κs,A (C))2 .
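For small problem sizes, κ_{s,A}(C) can be computed directly from Definition 3.2 by enumerating all supports of size s; the sketch below does exactly that. It assumes C has full column rank so that a QR factorization yields basis(C), and it is exponential in n, so it is meant only as an illustration.

import numpy as np
from itertools import combinations

def kappa_s_A(C, A, s):
    # kappa_{s,A}(C) = max_{|T| <= s} || A_T' basis(C) ||_2
    Q = np.linalg.qr(C)[0]                # basis(C); assumes C has full column rank
    n = A.shape[1]
    best = 0.0
    for T in combinations(range(n), s):   # |T| = s attains the max over |T| <= s
        best = max(best, np.linalg.norm(A[:, list(T)].T @ Q, 2))
    return best

rng = np.random.default_rng(0)
A = rng.standard_normal((15, 30)) / np.sqrt(15)
C = rng.standard_normal((15, 3))
print(kappa_s_A(C, A, 2))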
Lemma 3.3. For any set T such that |T| ≤ s and matrices A and C,
‖A_T' C‖2 ≤ κs,A(C) ‖C‖2.
Proof. Let C QR= QR. Then ‖A_T' C‖ = ‖A_T' QR‖ ≤ ‖A_T' Q‖ ‖R‖ ≤ κs,A(C) ‖R‖ = κs,A(C) ‖C‖.
Lemma 3.4. For a basis matrix P (P'P = I) and a matrix A,
δs((I − PP')A) ≤ δs(A) + κs,A(P)².
The proof is in the appendix.
IV. RECURSIVE PROJECTED MODIFIED CS ALGORITHM AND ITS PERFORMANCE GUARANTEES
The Recursive Projected Modified CS algorithm is summarized in Algorithm 2. First we need the following definition.
Definition 4.1. Define the time intervals Ij,k := [tj + (k − 1)α, tj + kα − 1] for k = 1, . . . , K and Ij,K+1 := [tj + Kα, tj+1 − 1], where K is the algorithm parameter in Algorithm 2.
A. The Projection-PCA algorithm
Given a data matrix D, a basis matrix P and an integer r, projection-PCA (proj-PCA) applies PCA on Dproj := (I − PP')D, i.e., it computes the top r eigenvectors (the eigenvectors with the r largest eigenvalues) of (1/αD) Dproj Dproj'. Here αD is the number of column vectors in D. This is summarized in Algorithm 1.
If P = [.], then projection-PCA reduces to standard PCA, i.e. it computes the top r eigenvectors of (1/αD) DD'.
We should mention that the idea of projecting perpendicular to a partly estimated subspace has been
used in different contexts in past work [23], [4].
Algorithm 1 projection-PCA: R ← proj-PCA(D, P, r)
1) Projection: compute Dproj ← (I − PP')D.
2) PCA: compute (1/αD) Dproj Dproj' EVD= [R R⊥] [Λ 0; 0 Λ⊥] [R R⊥]', where R is an n × r basis matrix and αD is the number of columns in D.
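A short Python sketch of Algorithm 1 is given below (np.linalg.eigh and the variable names are implementation choices, not part of the paper; P may be passed as an n × 0 array, in which case the step reduces to standard PCA).

import numpy as np

def proj_pca(D, P, r):
    # projection-PCA: top-r eigenvectors of (1/alpha_D) * Dproj @ Dproj.T,
    # where Dproj = (I - P P') D
    n, alpha_D = D.shape
    Dproj = D - P @ (P.T @ D) if P.size else D      # avoid forming I - P P' explicitly
    eigvals, eigvecs = np.linalg.eigh((Dproj @ Dproj.T) / alpha_D)
    top = np.argsort(eigvals)[::-1][:r]             # indices of the r largest eigenvalues
    return eigvecs[:, top]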
Algorithm 2 Recursive Projected Modified CS (ReProMoCS)
Parameters: algorithm parameters ξ, ω, α, K; model parameters tj, r0, cj,new (set as in Theorem 4.2, or as in [15, Sec X-B] when the model is not known)
Input: Mt. Output: Ŝt, L̃̂t, Q̂(t).
Initialization: Compute Q̂0 ← proj-PCA([L̃1, L̃2, · · · , L̃ttrain], [.], r0) and set Q̂(t) ← Q̂0. Let T̂ttrain+1 = ∅, and let j ← 1, k ← 1.
For t > ttrain, do the following:
1) Estimate Nt and St via Projected Modified CS:
a) Nullify most of L̃t: compute Φ(t) ← I − Q̂(t−1)Q̂'(t−1) and yt ← Φ(t)Mt.
b) Sparse Recovery: compute Ŝt,modcs as the solution of min_x ‖x_{T̂t^c}‖1 s.t. ‖yt − Φ(t)Ax‖2 ≤ ξ.
c) Support Estimate: compute N̂t = {i : |(Ŝt,modcs)i| > ω} and set T̂t+1 = N̂t.
d) LS Estimate of St: compute (Ŝt)_{N̂t} = ((Φ(t)A)_{N̂t})† yt and (Ŝt)_{N̂t^c} = 0.
2) Estimate L̃t: compute L̃̂t = Mt − AŜt.
3) Update Q̂(t) by K Projection PCA steps:
a) If t = tj + kα − 1,
i) Q̂j,new,k ← proj-PCA([L̃̂_{tj+(k−1)α}, . . . , L̃̂_{tj+kα−1}], Q̂j−1, cj,new),
ii) set Q̂(t) ← [Q̂j−1 Q̂j,new,k]; increment k ← k + 1.
Else set Q̂(t) ← Q̂(t−1).
b) If t = tj + Kα − 1, then set Q̂j ← [Q̂j−1 Q̂j,new,K]. Increment j ← j + 1. Reset k ← 1.
4) Increment t ← t + 1 and go to step 1.
B. Recursive Projected Modified CS (ReProMoCS)
The idea behind ReProMoCS is as follows. Assume that the current basis matrix Q̂(t−1) is an accurate estimate of the current subspace in which the L̃t's lie. We project the measurement Mt perpendicular to range(Q̂(t−1)) in order to nullify most of L̃t and obtain yt := Φ(t)Mt where Φ(t) = I − Q̂(t−1)Q̂'(t−1). Since the projection matrix has rank m − r∗, where r∗ = rank(Q̂(t−1)), there are only m − r∗ "effective" linear measurements. Recovering the n-dimensional sparse vector St now becomes a sparse recovery problem in small noise. Using the support estimate from the previous time, we use the modified-CS algorithm to recover Ŝt,modcs and estimate its support by thresholding the recovered vector. By computing a least squares (LS) estimate of St on the estimated support and setting it to zero everywhere else (step 1d), we can get a more accurate final estimate, Ŝt. This Ŝt is used to estimate L̃t as L̃̂t = Mt − AŜt (step 2). The sparse recovery error is then et := Ŝt − St. Since L̃̂t = Mt − AŜt, et also satisfies Aet = L̃t − L̃̂t = BLt − L̃̂t. Thus, a small et means that L̃t is also recovered accurately. The estimated L̃t's are used to obtain new estimates of Qj,new every α frames for a total of Kα frames via a modification of the standard PCA procedure, which we call projection PCA (step 3).
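A sketch of one time step of ReProMoCS (steps 1 and 2 of Algorithm 2) is shown below. It assumes cvxpy is available for the modified-CS subproblem; the function name and interface are illustrative choices, not part of the paper.

import numpy as np
import cvxpy as cp

def repromocs_step(M_t, A, Q_hat, T_prev, xi, omega):
    # One iteration of steps 1-2 of Algorithm 2 (a sketch, not the full algorithm).
    m, n = A.shape
    # 1a) project perpendicular to range(Q_hat) to nullify most of L~_t
    y = M_t - Q_hat @ (Q_hat.T @ M_t)
    Phi_A = A - Q_hat @ (Q_hat.T @ A)
    # 1b) modified-CS: l1-minimize outside the previous support estimate
    x = cp.Variable(n)
    outside = np.setdiff1d(np.arange(n), T_prev)
    cp.Problem(cp.Minimize(cp.norm1(x[outside])),
               [cp.norm(y - Phi_A @ x, 2) <= xi]).solve()
    # 1c) support estimate by thresholding
    N_hat = np.flatnonzero(np.abs(x.value) > omega)
    # 1d) least-squares refinement on the estimated support
    S_hat = np.zeros(n)
    S_hat[N_hat] = np.linalg.lstsq(Phi_A[:, N_hat], y, rcond=None)[0]
    # 2) estimate of L~_t
    L_tilde_hat = M_t - A @ S_hat
    return S_hat, N_hat, L_tilde_hat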
C. Performance Guarantees
Theorem 4.2. Consider Algorithm 2. Let c := cmx and r := r0 + (J − 1)c. Assume that Lt obeys the
model given in Sec. III and there are a total of J change times. Assume also that the initial subspace
estimate is accurate enough, i.e. k(I − Q̂0 Q̂00 )Q0 k ≤ r0 ζ, for a ζ that satisfies
ζ ≤ min( 10^{−4}/r², 1.5 × 10^{−4}/(r²f), 1/(r³γ∗²) ),  where f := λ+/λ−.
If the following conditions hold:
1) the algorithm parameters are set as ξ = ξ0(ζ), (ϱ/√sa) 7.50 ξ ≤ ω < Smin − (ϱ/√sa) 7.50 ξ, K = K(ζ), α ≥ αadd,A = 1.05 αadd(ζ), where ξ0(ζ), ϱ, K(ζ), αadd(ζ) are defined in Definition 5.2;
2) A, Pj−1, Pj,new, Dj,new,k := (I − Q̂j−1Q̂'j−1 − Q̂j,new,kQ̂'j,new,k)Qj,new, and Gj,new,k := (I − Pj,newP'j,new)P̂j,new,k satisfy
δ2s(A) ≤ 0.05,  κ2s(Q1) ≤ 0.3,  κ2s(Q1,new) ≤ 0.15,  κs(D1,new,1) ≤ 0.15,  κ2s(Q1,new,1) ≤ 0.15,
and
κ_{s+3sa,A}(Q_{J−1}) ≤ 0.22,  max_j κ_{s+3sa,A}(Qj,new) ≤ 0.14,
max_j max_{0≤k≤K} κ_{s+3sa,A}(Dj,new,k) ≤ 0.15,  max_j max_{0≤k≤K} κ_{s+3sa,A}(Gj,new,k) ≤ 0.14,
max_j max_{0≤k≤K} κ_{s+3sa,A}(Dj,∗,k) ≤ 1,
with P̂j,new,0 = [.] (empty matrix);
3) for a given value of Smin, the subspace change is slow enough, i.e.
min_j (tj+1 − tj) > Kα,  14ρξ0(ζ) ≤ Smin,  max_j max_{tj+(k−1)α ≤ t < tj+kα} ‖at,new‖∞ ≤ min(1.2^{k−1}γnew, γ∗);
4) the condition number of the covariance matrix of at,new is bounded, i.e.
gt := λmax[Cov(at,new)] / λmin[Cov(at,new)] ≤ √2;
then, with probability at least (1 − n^{−10}),
1) at all times t, T̂t = Tt and et satisfies
et = I_{Nt} [(Φk−1A)'_{Nt} (Φk−1A)_{Nt}]^{−1} A'_{Nt} Φk−1 Lt        (2)
and ‖et‖2 = ‖Ŝt − St‖2 ≤ 0.18√c γnew + 1.2√ζ(√r + 0.06√c);
2) the subspace error SE(t) := ‖(I − Q̂(t)Q̂'(t))Q(t)‖2 satisfies the bounds given in [15, Theorem 18].
Definition 4.3.
1) Define K(ζ) := ⌈ log(0.6cζ) / log 0.6 ⌉.
2) Define g+ := √2 to be the upper bound on gt.
3) Define ξ0(ζ) := √c γnew + √ζ (√r + √c).
4) Define ϱ to be the smallest real number such that
‖St − Ŝt,ModCS‖∞ ≤ (ϱ/√sa) ‖St − Ŝt,ModCS‖2
for all t. Notice that ϱ ≤ √sa because the infinity norm is always less than or equal to the two norm.
5) Let K = K(ζ). Define
αadd := ⌈ (log 6KJ + 11 log n) · (4608/(ζ²(λ−)²)) · max( min(1.2^{4K}γnew⁴, γ∗⁴), 16/c², 4(0.186γnew² + 0.0034γnew + 2.3)² ) ⌉.
In words, αadd is the smallest value of the number of data points, α, needed for one projection PCA
step to ensure that Theorem 4.2 holds w.p. at least (1 − n−10 ).
This result says that if (a) the algorithm parameters are set appropriately; (b) The RIC of A is small
enough; (c) the basis matrices for the previous subspace, the newly added subspace, and the unestimated
part of the newly added subspace are sufficiently incoherent with A; (d) the low-dimensional subspace
changes slowly enough; (e) the condition number of the covariance matrix of at,new is small enough, then
with high probability, the algorithm exactly recovers the support of St at all times t.
D. Discussion
This work extends the results of [15] to the case where St and Lt are multiplied by the matrices A and
B which could be fat. This obviously makes the problem harder, so we need an additional assumption,
namely that δs (A) < 0.05. Also, because St is replaced by ASt and Lt is replaced by BLt , the denseness
assumption on the subspaces in which Lt ’s lie or on the changed part of these subspaces is replaced by a
similar assumption on the incoherence between the range of any set of s columns of A and the subspaces
or subspace changes of BLt (quantified by bounds on κs,A (Qj ) and κs,A (Qj,new )).
In addition to considering the undersampled case, we also substitute modified-CS for basis pursuit in
the sparse recovery step of the algorithm. To study the effect of this change and compare it with the
result of [15], consider the case where A = B = I. In this case, κs (P ) is the denseness coefficient of
range(P ). Since modified-CS utilizes the previous support estimate to aid the current sparse recovery,
we are able to replace the bounds on κ2s with similar bounds on κs+3sa . This is a weaker condition if
3sa ≤ s. So if the support of St changes slowly (which is a practical assumption in certain applications)
we can get exact support recovery under weaker conditions than [15]. However, we still need stronger
conditions for the very first step or some initial support estimate. At the same time, we note that denseness
of Dnew,k requires the support of St to not be constant in an α-duration period. As observed in simulation
experiments described in [15], even a support change by one index at each t is sufficient to ensure that
Dnew,k is dense enough. A simple practical example where both of the above requirements are satisfied
is a fixed size object moving by one or a few pixels in each frame. The relation between the denseness
requirement on Dnew,k and support change of St needs to be studied further and this is being done in our
ongoing work.
To compare with [1], [2], consider the case where A = B = I. A direct comparison is not possible
because the solution methods and proof techniques are very different. Specifically both [1] and [2] solve a
batch version of the problem, whereas we provide a recursive solution. ReProMoCS requires knowledge of
some subspace change model parameters and appropriate setting of a few algorithm parameters (of course,
we also have a practical version that works without any model knowledge) while this is not needed for the
approach studied in [1], [2]. Also, ReProMoCS needs slow subspace change (this, however, is a practically
valid assumption). The advantage of ReProMoCS is that it requires weaker denseness assumptions on Lt ,
and most importantly, it needs weaker assumptions on the uncorrelated-ness of the support change of St
over time. [1] needs the support sets to be mutually independent over time. [2] needs an assumption that
is satisfied if the minimum number of non-zeros in any row of S is small enough. However, if subspace
changes slowly enough, ReProMoCS will work even if many rows of S are dense.
E. Comparison with Rank-Sparsity Incoherence
To compare with [2] consider the case where A = B = I. A direct comparison is not possible because
the solution methods and proof techniques are very different. Specifically Rank-Sparsity Incoherence
solves a batch version of the problem, whereas we provide a recursive solution. The authors of [2]
provide sufficient conditions under which the minimization problem
min_{S̃,L̃} γ‖S̃‖1 + ‖L̃‖∗  s.t.  S̃ + L̃ = M
will exactly recover S and L. (Here M = S + L.) In order to state the conditions needed, they define two quantities:
ξ(L) := max_{N ∈ T(L), ‖N‖=1} ‖N‖∞
where T(L) is the span of all matrices with column space contained in the column space of L or row space contained in the row space of L, and
µ(S) := max_{supp(N) ⊆ supp(S), ‖N‖∞=1} ‖N‖.
If µ(S)ξ(L) < 1/6, then the optimization problem will exactly recover both S and L. This differs from
our work in that we only guarantee exact recovery of the support of S. The requirement that ξ(L) be
small is related to our assumption that κs+3sa ,A (QJ−1 ) be small. Matrices with their row/column space
incoherent with the coordinate axes will have small ξ(L). Since we are studying the undersampled case,
we require that QJ−1 be incoherent with the columns of A. Our assumptions differ in regard to the
sparse matrix S. Here we assume that supp(St ) changes slowly over time. This will result in rows of
S = [S1 · · · St ] with many non-zero entries. Rank-Sparsity Incoherence [2] uses the number of non-zero
entries per row or column to bound µ(S). If degmax (S) is the maximum number of non-zero entries in any
row or column of S, and degmin (S) is the minimum number of non-zero entries in any row or column,
then degmin (S) ≤ µ(S) ≤ degmax (S). They also show that if the support of S is chosen uniformly at
random, then µ(S) will be small with high probability. For our work, denseness of Dnew,k requires some
support change of the St ’s as seen in simulations. The number of support changes required seems to be
proportional to the number of newly added directions, for example if c = 1 new direction is added, then
even one support change in an α length duration is sufficient to ensure that κs (Dnew,k ) is less than one
and as a result the recovery error eventually stabilizes (see Section IV-C of [15]).
V. DEFINITIONS AND PROOF OUTLINE FOR THEOREM 4.2
Remark 5.1. We will prove Theorem 4.2 for the case when B = I. In this case Qj = Pj and Qnew = Pnew .
A. Definitions
A few quantities are already defined in the model (Sec III), Definition 4.1, Algorithm 2, and Theorem
4.2. Here we define more quantities needed for the proofs.
Definition 5.2. We define here the parameters used in Theorem 4.2.
1) Define K(ζ) := ⌈ log(0.6cζ) / log 0.6 ⌉.
2) Define ξ0(ζ) := √c γnew + √ζ (√r + √c).
3) Define ρ := maxt {κ1(Ŝt,cs − St)}. Notice that ρ ≤ 1.
4) Define
αadd := ⌈ (log 6KJ + 11 log n) · (8 · 24²/(ζ²(λ−)²)) · max( min(1.2^{4K}γnew⁴, γ∗⁴), 16/c², 4(0.186γnew² + 0.0034γnew + 2.3)² ) ⌉.
In words, αadd is the smallest value of the number of data points, α, needed for one projection PCA step to ensure that Theorem 4.2 holds w.p. at least (1 − n^{−10}).
Definition 5.3. We define the noise seen by the sparse recovery step at time t as
βt := (I − P̂(t−1)P̂'(t−1)) Lt.
Also define the reconstruction error of St as
et := Ŝt − St.
Here Ŝt is the final estimate of St after the LS step in Algorithm 2. Notice that et also satisfies Aet =
Lt − L̂t .
Definition 5.4. We define the subspace estimation errors as follows. Recall that P̂j,new,0 = [.] (empty matrix).
SE(t) := ‖(I − P̂(t)P̂'(t))P(t)‖2,
ζj,∗ := ‖(I − P̂j−1P̂'j−1)Pj−1‖2,
ζj,k := ‖(I − P̂j−1P̂'j−1 − P̂j,new,kP̂'j,new,k)Pj,new‖2.
Remark 5.5. Recall from the model given in Sec III and from Algorithm 2 that
1) P̂j,new,k is orthogonal to P̂j−1, i.e. P̂'j,new,k P̂j−1 = 0;
2) P̂j−1 := [P̂0, P̂1,new,K, . . . , P̂j−1,new,K] and Pj−1 := [P0, P1,new, . . . , Pj−1,new];
3) for t ∈ Ij,k+1, P̂(t) = [P̂j−1, P̂j,new,k] and P(t) = Pj = [Pj−1, Pj,new];
4) Φ(t) := I − P̂(t−1)P̂'(t−1).
From Definition 5.4 and the above, it is easy to see that
1) ζj,∗ ≤ ζ1,∗ + Σ_{j′=1}^{j−1} ζ_{j′,K};
2) SE(t) ≤ ζj,∗ + ζj,k ≤ ζ1,∗ + Σ_{j′=1}^{j−1} ζ_{j′,K} + ζj,k  for t ∈ Ij,k+1.
Definition 5.6. Define the following.
1) Φj,k, Φj,0 and φk:
a) Φj,k := I − P̂j−1P̂'j−1 − P̂j,new,kP̂'j,new,k is part of the CS matrix (the CS matrix will be Φj,kA) for t ∈ Ij,k+1, i.e. Φ(t) = Φj,k for this duration.
b) Φj,0 := I − P̂j−1P̂'j−1 is part of the CS matrix for t ∈ Ij,1, i.e. Φ(t) = Φj,0 for this duration. Φj,0 is also the matrix used in all of the projection PCA steps for t ∈ [tj, tj+1 − 1].
c) φk := maxj max_{T:|T|≤s} ‖(((Φj,kA)T)'(Φj,kA)T)^{−1}‖2. It is easy to see that φk ≤ 1/(1 − maxj δs(Φj,kA)) [17].
2) Dj,new,k, Dj,new and Dj,∗:
a) Dj,new,k := Φj,kPj,new. span(Dj,new,k) is the unestimated part of the newly added subspace for any t ∈ Ij,k+1.
b) Dj,new := Dj,new,0 = Φj,0Pj,new. span(Dj,new) is interpreted similarly for any t ∈ Ij,1.
c) Dj,∗,k := Φj,kPj−1. span(Dj,∗,k) is the unestimated part of the existing subspace for any t ∈ Ij,k+1.
d) Dj,∗ := Dj,∗,0 = Φj,0Pj−1. span(Dj,∗) is interpreted similarly for any t ∈ Ij,1.
e) Notice that ζj,0 = ‖Dj,new‖2, ζj,k = ‖Dj,new,k‖2, ζj,∗ = ‖Dj,∗‖2. Also, clearly, ‖Dj,∗,k‖2 ≤ ζj,∗.
Definition 5.7.
1) Let Dj,new QR= Ej,new Rj,new denote its QR decomposition. Here Ej,new is a basis matrix and Rj,new is upper triangular.
2) Let Ej,new,⊥ be a basis matrix for the orthogonal complement of span(Ej,new) = span(Dj,new). To be precise, Ej,new,⊥ is an n × (n − cj,new) basis matrix that satisfies E'j,new,⊥ Ej,new = 0.
3) Using Ej,new and Ej,new,⊥, define Aj,k, Aj,k,⊥, Hj,k, Hj,k,⊥ and Bj,k as
Aj,k := (1/α) Σ_{t∈Ij,k} E'j,new Φj,0 Lt L't Φj,0 Ej,new,
Aj,k,⊥ := (1/α) Σ_{t∈Ij,k} E'j,new,⊥ Φj,0 Lt L't Φj,0 Ej,new,⊥,
Hj,k := (1/α) Σ_{t∈Ij,k} E'j,new Φj,0 ((Aet)(Aet)' − Lt(Aet)' − Aet L't) Φj,0 Ej,new,
Hj,k,⊥ := (1/α) Σ_{t∈Ij,k} E'j,new,⊥ Φj,0 ((Aet)(Aet)' − Lt(Aet)' − Aet L't) Φj,0 Ej,new,⊥,
Bj,k := (1/α) Σ_{t∈Ij,k} E'j,new,⊥ Φj,0 L̂t L̂'t Φj,0 Ej,new = (1/α) Σ_{t∈Ij,k} E'j,new,⊥ Φj,0 (Lt − Aet)(Lt − Aet)' Φj,0 Ej,new.
4) Define
𝒜j,k := [Ej,new Ej,new,⊥] [Aj,k 0; 0 Aj,k,⊥] [Ej,new Ej,new,⊥]',
ℋj,k := [Ej,new Ej,new,⊥] [Hj,k B'j,k; Bj,k Hj,k,⊥] [Ej,new Ej,new,⊥]'.
5) From the above, it is easy to see that
𝒜j,k + ℋj,k = (1/α) Σ_{t∈Ij,k} Φj,0 L̂t L̂'t Φj,0.
6) Recall from Algorithm 2 that
𝒜j,k + ℋj,k EVD= [P̂j,new,k P̂j,new,k,⊥] [Λk 0; 0 Λk,⊥] [P̂j,new,k P̂j,new,k,⊥]'
is the EVD of 𝒜j,k + ℋj,k. Here P̂j,new,k is an n × cj,new basis matrix.
7) Using the above, 𝒜j,k + ℋj,k can be decomposed in two ways as follows:
𝒜j,k + ℋj,k = [P̂j,new,k P̂j,new,k,⊥] [Λk 0; 0 Λk,⊥] [P̂j,new,k P̂j,new,k,⊥]' = [Ej,new Ej,new,⊥] [Aj,k + Hj,k  B'j,k; Bj,k  Aj,k,⊥ + Hj,k,⊥] [Ej,new Ej,new,⊥]'.
Remark 5.8. Thus, from the above definition, ℋj,k = (1/α)[ Φ0 Σt (−Lt(Aet)' − Aet L't + (Aet)(Aet)') Φ0 + F + F' ] where F := Enew,⊥E'new,⊥ Φ0 (Σt Lt L't) Φ0 Enew E'new = Enew,⊥E'new,⊥ Σt (D∗,k−1 at,∗)(D∗,k−1 at,∗ + Dnew,k−1 at,new)' Enew E'new. Since E[at,∗ a't,new] = 0, ‖(1/α)F‖2 ≲ r²ζ²λ+ w.h.p. Recall that ≲ means (in an informal sense) that the RHS contains the dominant terms in the bound.
Definition 5.9. In the sequel, we let
1) r := r0 + (J − 1)cmx and c := cmx = maxj cj,new;
2) κs,∗ := maxj κs(Pj−1), κs,new := maxj κs(Pj,new), κs,k := maxj κs(Dj,new,k), κ̃s,k := maxj κs((I − Pj,newP'j,new)P̂j,new,k), and gk := maxj gj,k;
3) δ+_{s+3sa} := 0.05, κ+_{s+3sa,∗} := 0.3, κ+_{s+3sa,new} := 0.15, κ+_s := 0.15, κ̃+_{s+3sa} := 0.15 and g+ := √2 are the upper bounds assumed in Theorem 4.2 on δ_{s+3sa}(A), maxj κ_{s+3sa}(Pj), maxj κ_{s+3sa}(Pj,new), maxj maxk κs(Dj,new,k), maxj κ_{s+3sa}(Qj,new,k) and maxt gt respectively;
4) φ+ := 1.1735; we see later that this is an upper bound on φk under the assumptions of Theorem 4.2;
5) γnew,k := min(1.2^{k−1}γnew, γ∗) (recall that the theorem assumes maxj max_{t∈Ij,k} ‖at,new‖∞ ≤ γnew,k);
6) Pj,∗ := Pj−1 and P̂j,∗ := P̂j−1 (see Remark 5.10).
Remark 5.10. Notice that the subscript j always appears as the first subscript, while k is the last one. At
many places in this paper, we remove the subscript j for simplicity. Whenever there is only one subscript,
it refers to the value of k, e.g., Φ0 refers to Φj,0 , P̂new,k refers to P̂j,new,k . Also, P∗ := Pj−1 and P̂∗ := P̂j−1 .
Definition 5.11. Define the following:
1) ζ+_∗ := rζ (we note that ζ+_∗ = (r0 + (j − 1)c)ζ will also work);
2) define the sequence {ζ+_k}_{k=0,1,2,···,K} recursively as follows:
ζ+_0 := 1,
ζ+_k := (b + 0.125cζ) / (1 − (ζ+_∗)² − (ζ+_∗)²f − 0.125cζ − b)   for k ≥ 1,     (3)
where
b := C κ+_s g+ ζ+_{k−1} + C̃ (κ+_s)² g+ (ζ+_{k−1})² + C′ f (ζ+_∗)²,
C := 2κ+_s φ+ / √(1 − (ζ+_∗)²) + φ+,
C′ := (φ+)² + (κ+_s φ+ + κ+_s (φ+)² + 2φ+) / √(1 − (ζ+_∗)²) + 1 + φ+,
C̃ := (φ+)² + κ+_s (φ+)² / √(1 − (ζ+_∗)²).
As we will see, ζ∗+ and ζk+ are the high probability upper bounds on ζj,∗ and ζj,k (defined in Definition
5.4) under the assumptions of Theorem 4.2.
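The short numerical sketch below iterates the recursion of Definition 5.11 using the constants as reconstructed above and illustrative parameter values (r, c, f and ζ are assumptions chosen to satisfy the bound on ζ in Theorem 4.2); it illustrates the roughly 0.6^k decay established later in Lemma 6.1.

import numpy as np

kappa_s, phi, g = 0.15, 1.1735, np.sqrt(2)     # kappa_s^+, phi^+, g^+
r, c, f = 10, 1, 100.0                         # illustrative model parameters (assumptions)
zeta = 1e-8                                    # satisfies zeta <= 1.5e-4 / (r^2 f)
zeta_star = r * zeta                           # zeta_*^+

den = np.sqrt(1 - zeta_star ** 2)
C = 2 * kappa_s * phi / den + phi
Cp = phi ** 2 + (kappa_s * phi + kappa_s * phi ** 2 + 2 * phi) / den + 1 + phi
Ct = phi ** 2 + kappa_s * phi ** 2 / den

zeta_k = 1.0                                   # zeta_0^+ = 1
for k in range(1, 9):
    b = (C * kappa_s * g * zeta_k + Ct * kappa_s ** 2 * g * zeta_k ** 2
         + Cp * f * zeta_star ** 2)
    zeta_k = (b + 0.125 * c * zeta) / (1 - zeta_star ** 2 - zeta_star ** 2 * f
                                       - 0.125 * c * zeta - b)
    print(k, zeta_k, 0.6 ** k + 0.4 * c * zeta)   # zeta_k^+ stays below the Lemma 6.1 bound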
Definition 5.12. Define the random variable Xj,k := {a1 , a2 , · · · , atj +kα−1 }.
Recall that the at ’s are mutually independent over t.
Definition 5.13. Define the set Γ̌j,k as follows:
Γ̌j,k := {Xj,k : ζj,k ≤ ζk+ and T̂t = Tt and et satisfies (2) for all t ∈ Ij,k }
Γ̌j,K+1 := {Xj+1,0 : T̂t = Tt and et satisfies (2) for all t ∈ Ij,K+1 }
Definition 5.14. Recursively define the sets Γj,k as follows:
Γ1,0 := {X1,0 : ζ1,∗ ≤ rζ and T̂t = Tt and et satisfies (2) for all t ∈ [ttrain+1 : t1 − 1]}
Γj,k := Γj,k−1 ∩ Γ̌j,k
k = 1, 2, . . . , K, j = 1, 2, . . . , J
Γj+1,0 := Γj,K ∩ Γ̌j,K+1
j = 1, 2, . . . , J
B. Proof Outline for Theorem 4.2
The proof of Theorem 4.2 essentially follows from two main lemmas, 6.1 and 6.2. Lemma 6.1 gives an exponentially decaying upper bound on ζ+_k defined in Definition 5.11. ζ+_k will be shown to be a high probability upper bound for ζk under the assumptions of the Theorem. Lemma 6.2 says that, conditioned on Xj,k−1 ∈ Γj,k−1, Xj,k will be in Γj,k w.h.p. In words, this says that if, during the time interval Ij,k−1, the algorithm has worked well (recovered the support of St exactly and recovered the background subspace with subspace recovery error below ζ+_{k−1} + ζ+_∗), then it will also work well in Ij,k w.h.p. The proof of Lemma 6.2 requires two lemmas: one for the projected CS step and one for the projection PCA step of the algorithm. These are Lemmas 8.1 and 8.2. The proof of Lemma 8.1 follows using Lemmas 6.1, 3.4, 2.14, the ModCS error bound (Lemma 2.16), and some straightforward steps. The proof of Lemma 8.2 is longer and uses Lemma 2.13 to get a bound on ζk. From here we use the matrix Hoeffding inequalities (Corollaries 2.8 and 2.9) to bound each of the terms in the bound on ζk and finally show that, conditioned on Γ^e_{j,k−1}, ζk ≤ ζ+_k w.h.p. These are Lemmas 8.6 and 8.7.
VI. MAIN LEMMAS AND PROOF OF THEOREM 4.2
Recall that when there is only one subscript, it refers to the value of k (i.e. ζk = ζj,k ).
Lemma 6.1 (Exponential decay of ζ+_k). Assume that the bounds on ζ from Theorem 4.2 hold. Define the sequence ζ+_k as in Definition 5.11. Then
1) ζ+_0 = 1 and ζ+_k ≤ 0.6^k + 0.4cζ for all k = 1, 2, . . . , K;
2) the denominator of ζ+_k is positive for all k = 1, 2, . . . , K.
We will prove this lemma in Section VII.
Lemma 6.2. Assume that all the conditions of Theorem 4.2 hold. Also assume that P(Γej,k−1 ) > 0. Then
P(Γej,k |Γej,k−1 ) ≥ pk (α, ζ) ≥ pK (α, ζ) for all k = 1, 2, . . . , K,
where pk (α, ζ) is defined in equation (5).
Remark 6.3. Under the assumptions of Theorem 4.2, it is easy to see that the following holds. For any k = 1, 2, . . . , K, Γ^e_{j,k} implies that ζj,∗ ≤ ζ+_∗.
From the definition of Γ^e_{j,k}, ζ_{j′,K} ≤ ζ+_K for all j′ ≤ j − 1. By Lemma 6.1 and the definition of K in Definition 5.2, ζ+_K ≤ 0.6^K + 0.4cζ ≤ cζ. Using Remark 5.5, ζj,∗ ≤ ζ1,∗ + Σ_{j′=1}^{j−1} ζ_{j′,K} ≤ r0ζ + (j − 1)cζ ≤ ζ+_∗.
Proof of Theorem 4.2. The theorem is a direct consequence of Lemmas 6.1, 6.2, and Lemma 2.6.
Notice that Γ^e_{j,0} ⊇ Γ^e_{j,1} ⊇ · · · ⊇ Γ^e_{j,K} ⊇ Γ^e_{j+1,0}. Thus, by Lemma 2.6, P(Γ^e_{j+1,0}|Γ^e_{j,0}) = P(Γ^e_{j+1,0}|Γ^e_{j,K}) ∏_{k=1}^{K} P(Γ^e_{j,k}|Γ^e_{j,k−1}) and P(Γ^e_{J+1,0}|Γ^e_{1,0}) = ∏_{j=1}^{J} P(Γ^e_{j+1,0}|Γ^e_{j,0}).
Using Lemma 6.2, and the fact that pk(α, ζ) ≥ pK(α, ζ) (see their respective definitions in Lemma 8.7 and equation (5)), we get P(Γ^e_{J+1,0}|Γ^e_{1,0}) ≥ pK(α, ζ)^{KJ}. Also, P(Γ^e_{1,0}) = 1. This follows by the assumption on P̂0 and Lemma 8.1. Thus, P(Γ^e_{J+1,0}) ≥ pK(α, ζ)^{KJ}.
Using the definition of αadd we get that P(Γ^e_{J+1,0}) ≥ pK(α, ζ)^{KJ} ≥ 1 − n^{−10} whenever α ≥ αadd.
The event Γ^e_{J+1,0} implies that T̂t = Tt and et satisfies (2) for all t < tJ+1. Using Remarks 5.5 and 6.3, Γ^e_{J+1,0} implies that all the bounds on the subspace error hold. Using these, ‖at,new‖2 ≤ √c γnew,k, and ‖at‖2 ≤ √r γ∗, Γ^e_{J+1,0} implies that all the bounds on ‖et‖2 hold (the bounds are obtained in Lemma 8.1). Thus, all conclusions of the result hold w.p. at least 1 − n^{−10}.
VII. PROOF OF LEMMA 6.1
Proof. First recall the definition of ζ+_k (Definition 5.11). Recall from Definition 5.9 that κ+_s := 0.15, φ+ := 1.1735, and g+ := √2, so we can make these substitutions directly. Notice that ζ+_k is an increasing function of ζ+_∗, ζ, c, and f. Therefore we can use upper bounds on each of these quantities to get an upper bound on ζ+_k. From the definition of ζ in Theorem 4.2 and ζ+_∗ := rζ we get
• ζ+_∗ ≤ 10^{−4}
• ζ+_∗ f ≤ 1.5 × 10^{−4}
• cζ ≤ 10^{−4}
• ζ+_∗/(cζ) = rζ/(cζ) = r/c ≤ r (we assume that c ≥ 1; recall that c is the upper bound on the number of new directions added)
• ζ+_∗ f r = r²fζ ≤ 1.5 × 10^{−4}
First we prove by induction that ζ+_k ≤ ζ+_{k−1} and ζ+_k ≤ 0.6 for all k ≥ 1. Notice that ζ+_0 = 1 by definition.
• Base case (k = 1): Using the above bounds we get that ζ+_1 < 0.5985 < 1 = ζ+_0.
• For the induction step, assume that ζ+_{k−1} ≤ ζ+_{k−2}. Then, because ζ+_k is an increasing function finc of ζ+_{k−1}, we get that ζ+_k = finc(ζ+_{k−1}) ≤ finc(ζ+_{k−2}) = ζ+_{k−1}.
1) To prove the first claim, first rewrite ζ+_k as
ζ+_k = ζ+_{k−1} · (C κ+_s g+ + C̃ (κ+_s)² g+ ζ+_{k−1}) / (1 − (ζ+_∗)² − (ζ+_∗)²f − 0.125cζ − b) + cζ · (C′ (ζ+_∗ f)(ζ+_∗/(cζ)) + 0.125) / (1 − (ζ+_∗)² − (ζ+_∗)²f − 0.125cζ − b),
where C, C′, C̃, and b are as in Definition 5.11. Using the above bounds, including ζ+_{k−1} ≤ 0.6, we get that
ζ+_k ≤ ζ+_{k−1}(0.6) + cζ(0.16).
Iterating this inequality,
ζ+_k ≤ ζ+_0 (0.6)^k + Σ_{i=0}^{k−1} (0.6)^i (0.16)cζ ≤ (0.6)^k + Σ_{i=0}^{∞} (0.6)^i (0.16)cζ ≤ 0.6^k + 0.4cζ.
2) To see that the denominator is positive, observe that the denominator is decreasing in all of its arguments: ζ+_∗, ζ+_∗f, cζ, and b. Using the same upper bounds as before, we get that the denominator is greater than or equal to 0.78 > 0.
VIII. PROOF OF LEMMA 6.2
The proof of Lemma 6.2 follows from two lemmas. The first is the final conclusion for the projected CS step for t ∈ Ij,k. The second is the final conclusion for one projection PCA step, i.e. for t ∈ Ij,k. We will state the two lemmas first and then proceed to prove them in order.
Lemma 8.1 (Projected Modified Compressed Sensing Lemma). Assume that all conditions of Theorem 4.2 hold.
1) For all t ∈ Ij,k, for any k = 1, 2, . . . , K, if Xj,k−1 ∈ Γj,k−1,
a) the projection noise βt satisfies ‖βt‖2 ≤ ζ+_{k−1}√c γnew,k + ζ+_∗√r γ∗ ≤ √c 0.72^{k−1}γnew + 1.06√ζ ≤ ξ0;
b) N̂t = Nt;
c) et satisfies (2) and ‖et‖2 ≤ φ+[κ+_s ζ+_{k−1}√c γnew,k + ζ+_∗√r γ∗] ≤ 0.18 · 0.72^{k−1}√c γnew + 1.17 · 1.06√ζ. Recall that (2) is
et = I_{Nt}[(Φ(t)A)'_{Nt}(Φ(t)A)_{Nt}]^{−1} A'_{Nt} Φ(t) Lt.
2) For all k = 1, 2, . . . , K, P(T̂t = Tt and et satisfies (2) for all t ∈ Ij,k | Γ^e_{j,k−1}) = 1.
Lemma 8.2 (Subspace Recovery Lemma). Assume that all the conditions of Theorem 4.2 hold. Let
ζ∗+ = rζ. Then, for all k = 1, 2, . . . K,
P(ζk ≤ ζk+ |Γej,k−1 ) ≥ pk (α, ζ)
where ζk+ is defined in Definition 5.11 and pk (α, ζ) is defined in (5).
Proof of Lemma 6.2. Observe that P(Γj,k |Γj,k−1 ) = P(Γ̌j,k |Γj,k−1 ). The lemma then follows by combining
Lemma 8.2 and item 2 of Lemma 8.1.
A. Proof of Lemma 8.1
In order to prove Lemma 8.1 we first need a bound on the RIC of the compressed sensing matrix Φk A.
Lemma 8.3 (Bounding the RIC). Recall that ζ∗ := ‖(I − P̂∗P̂'∗)P∗‖. The following hold.
1) κs,A(P̂∗)² ≤ 2ζ∗(1 + δs(A)) + κs,A(P∗)²
2) κs,A(P̂new,k) ≤ κs,A,new + κ̃s,A,k(ζk + ζ∗)
3) δs((I − P̂∗P̂'∗)A) ≤ δs(A) + κs,A(P̂∗)² ≤ δs(A) + 2ζ∗(1 + δs(A)) + κs,A(P∗)²
4) δs((I − P̂∗P̂'∗ − P̂new,kP̂'new,k)A) ≤ δs(A) + κs,A(P̂∗)² + κs,A(P̂new,k)² ≤ δs(A) + (2ζ∗(1 + δs(A)) + κs,A(P∗)²) + (κs,A,new + κ̃s,A,k(ζk + ζ∗))²
Proof.
1) For any set T such that |T| ≤ s,
‖A'_T P̂∗‖² = ‖A'_T P̂∗P̂'∗ A_T‖ = ‖A'_T (P̂∗P̂'∗ − P∗P'∗ + P∗P'∗) A_T‖ ≤ ‖A'_T (P̂∗P̂'∗ − P∗P'∗) A_T‖ + ‖A'_T P∗P'∗ A_T‖ ≤ ‖A'_T‖ ‖P̂∗P̂'∗ − P∗P'∗‖ ‖A_T‖ + ‖A'_T P∗P'∗ A_T‖ ≤ 2ζ∗‖A_T‖² + κs,A(P∗)² ≤ 2ζ∗(1 + δs) + κs,A(P∗)².
2) For any set T such that |T| ≤ s,
‖A'_T P̂new,k‖ ≤ ‖A'_T (I − PnewP'new)P̂new,k‖ + ‖A'_T PnewP'new P̂new,k‖
≤ κ̃s,A,k ‖(I − PnewP'new)P̂new,k‖ + ‖A'_T Pnew‖
= κ̃s,A,k ‖(I − P̂new,kP̂'new,k)Pnew‖ + κs,A,new
= κ̃s,A,k ‖(I − P̂∗P̂'∗ − P̂new,kP̂'new,k + P̂∗P̂'∗)Pnew‖ + κs,A,new
≤ κ̃s,A,k ( ‖(I − P̂∗P̂'∗ − P̂new,kP̂'new,k)Pnew‖ + ‖P̂∗P̂'∗Pnew‖ ) + κs,A,new
≤ κ̃s,A,k ζk + κ̃s,A,k ζ∗ + κs,A,new
≤ κ̃s,A,k ζk + ζ∗ + κs,A,new.
Lines 3 and 7 use Lemma 2.14. Line 2 uses Lemma 2.15.
3) Follows from (1) and Lemma 3.4.
4) Follows from (2) and Lemma 3.4.
Corollary 8.4. If the conditions of Theorem 4.2 are satisfied, and Xj,k−1 ∈ Γj,k−1, then
1) δs(Φ0A) ≤ δ_{s+3sa}(Φ0A) ≤ δ+_{s+3sa} + 2ζ+_∗(1 + δ+_{s+3sa}) + (κ+_{s+3sa,∗})² < 0.1 < 0.207
2) δs(Φk−1A) ≤ δ_{s+3sa}(Φk−1A) ≤ δ+_{s+3sa} + 2ζ+_∗(1 + δ+_{s+3sa}) + (κ+_{s+3sa,∗})² + (κ+_{s+3sa,new} + κ̃+_{s+3sa}(ζ+_{k−1} + ζ+_∗))² < 0.207
3) φk−1 ≤ 1/(1 − δs(Φk−1A)) < φ+
Proof. This follows using Lemma 8.3, the definition of Γj,k−1, and the bound on ζ+_{k−1} from Lemma 6.1.
The following are straightforward bounds that will be useful for the proof of Lemma 8.1 and later.
Fact 8.5. Under the assumptions of Theorem 4.2:
1) ζγ∗ ≤ √ζ / (r0 + (J − 1)c)^{3/2} ≤ √ζ
2) ζ+_∗ ≤ 10^{−4} / (r0 + (J − 1)c) ≤ 10^{−4}
3) ζ+_∗ γ∗² ≤ 1 / (r0 + (J − 1)c)² ≤ 1
4) ζ+_∗ γ∗ ≤ √ζ / √(r0 + (J − 1)c) ≤ √ζ
5) ζ+_∗ f ≤ 1.5 × 10^{−4} / (r0 + (J − 1)c) ≤ 1.5 × 10^{−4}
6) ζ+_{k−1} ≤ 0.6^{k−1} + 0.4cζ (from Lemma 6.1)
7) ζ+_{k−1} γnew,k ≤ (0.6 · 1.2)^{k−1}γnew + 0.4cζγ∗ ≤ 0.72^{k−1}γnew + 0.4√ζ / (r0 + (J − 1)c) ≤ 0.72^{k−1}γnew + 0.4√ζ
8) ζ+_{k−1} γnew,k² ≤ (0.6 · 1.2²)^{k−1}γnew² + 0.4cζγ∗² ≤ 0.864^{k−1}γnew² + 0.4 / (r0 + (J − 1)c)² ≤ 0.864^{k−1}γnew² + 0.4
Proof of Lemma 8.1. For t = ttrain + 1 we have no prior support estimate (this is why we need the first set of bounds in assumption 2 of Theorem 4.2), so step 1b of Algorithm 2 is just basis pursuit and we refer to the proof in [15].
For t > ttrain + 1, recall that Xj,k−1 ∈ Γj,k−1 implies that ζ∗ ≤ ζ+_∗, ζk−1 ≤ ζ+_{k−1} and T̂t = Nt−1 (at time t − 1 the support was recovered exactly).
1) a) For t ∈ Ij,k, βt := (I − P̂(t−1)P̂'(t−1))Lt = D∗,k−1 at,∗ + Dnew,k−1 at,new. Thus, using Fact 8.5,
‖βt‖2 ≤ ζ∗√r γ∗ + ζk−1√c γnew,k ≤ √ζ√r + (0.72^{k−1}γnew + 0.4√ζ)√c = √c 0.72^{k−1}γnew + √ζ(√r + 0.4√c) ≤ ξ0.
b) By Corollary 8.4, δ_{s+3sa}(Φk−1A) < 0.207. Since T̂t = Nt−1, we can use Lemma 2.16 with s∆ = s∆e = sa (i.e. the 'misses' and 'extras' in the support estimate are the newly added and deleted indices). We also have that |Nt| ≤ s and ‖βt‖2 ≤ ξ0 = ξ. So, by part 1 of Lemma 2.16, if |(St)i| > ω + (ϱ/√sa) 7.50 ξ, then i ∈ N̂t; that is to say, (St)i is large enough to be detected. Also, according to part 2 of Lemma 2.16, if ω ≥ (ϱ/√sa) 7.50 ξ, then there will be no false additions, and all true removals will be eliminated from the support estimate N̂t. This, along with the assumption that (ϱ/√sa) 7.50 ξ ≤ ω < Smin − (ϱ/√sa) 7.50 ξ, gives exact support recovery, i.e. N̂t = Nt.
c) We use the fact that we have exact support recovery in the least squares step of the algorithm to write an expression for et := Ŝt − St. To begin, observe that
(Ŝt)_{Nt} = [(Φk−1A)_{Nt}]† yt
= [(Φk−1A)_{Nt}]† (Φk−1ASt + Φk−1Lt)
= [(Φk−1AI_{Nt})'(Φk−1AI_{Nt})]^{−1}(Φk−1AI_{Nt})'(Φk−1ASt + Φk−1Lt)
= [(Φk−1AI_{Nt})'(Φk−1AI_{Nt})]^{−1}(Φk−1AI_{Nt})'(Φk−1AI_{Nt}I'_{Nt}St + Φk−1Lt)
= (St)_{Nt} + [(Φk−1AI_{Nt})'(Φk−1AI_{Nt})]^{−1}(Φk−1AI_{Nt})' Φk−1Lt
= (St)_{Nt} + [(Φk−1A)'_{Nt}(Φk−1A)_{Nt}]^{−1} A'_{Nt} Φk−1Lt.
The fourth equality holds because supp(St) = Nt. Note that since δ_{|Nt|}(Φk−1A) < 1, the columns of (Φk−1A)_{Nt} are linearly independent, so (Φk−1A)'_{Nt}(Φk−1A)_{Nt} is invertible. So (et)_{Nt} = [(Φk−1A)'_{Nt}(Φk−1A)_{Nt}]^{−1} A'_{Nt} Φk−1Lt and (et)_{Nt^c} = 0. Therefore
et = I_{Nt}[(Φk−1A)'_{Nt}(Φk−1A)_{Nt}]^{−1} A'_{Nt} Φk−1Lt = I_{Nt}[(Φk−1A)'_{Nt}(Φk−1A)_{Nt}]^{−1} A'_{Nt}(Φk−1P∗ at,∗ + Dnew,k−1 at,new).
So
‖et‖2 ≤ φ+[√(1 + δs(A)) ζ+_∗√r γ∗ + κ_{s,k−1} ζ+_{k−1}√c γnew,k] ≤ 1.25[√r√ζ + √c 0.15(0.72)^{k−1}γnew + √c 0.06√ζ] ≤ 0.19√c 0.72^{k−1}γnew + 1.25√ζ(√r + 0.06√c).
2) The second claim is just a restatement of the first.
B. Proof of Lemma 8.2
The proof of Lemma 8.2 will use the next two lemmas (8.6, and 8.7).
Lemma 8.6. If λmin(Ak) − ‖Ak,⊥‖2 − ‖ℋk‖2 > 0, then
ζk ≤ ‖Rk‖2 / (λmin(Ak) − ‖Ak,⊥‖2 − ‖ℋk‖2) ≤ ‖ℋk‖2 / (λmin(Ak) − ‖Ak,⊥‖2 − ‖ℋk‖2)      (4)
where Rk := ℋk Enew and Ak, Ak,⊥, ℋk are defined in Definition 5.7.
Proof. Since ζk = ‖(I − P̂new,kP̂'new,k)Dnew‖2 = ‖(I − P̂new,kP̂'new,k)Enew Rnew‖2 ≤ ‖(I − P̂new,kP̂'new,k)Enew‖2, the result follows by Lemma 2.13.
Lemma 8.7 (High probability bounds for each of the terms in the ζk bound (4)). Assume the conditions of Theorem 4.2 hold. Also assume that P(Γ^e_{j,k−1}) > 0 for all 1 ≤ k ≤ K + 1. Then, for all 1 ≤ k ≤ K,
1) P( λmin(Ak) ≥ λ−_{new,k}(1 − (ζ+_∗)² − cζ/12) | Γ^e_{j,k−1} ) > 1 − pa,k(α, ζ), where
pa,k(α, ζ) := c exp( −αζ²(λ−)² / (8 · 24² · min(1.2^{4k}γnew⁴, γ∗⁴)) ) + c exp( −αc²ζ²(λ−)² / (8 · 24² · 4²) );
2) P( λmax(Ak,⊥) ≤ λ−_{new,k}((ζ+_∗)²f + cζ/24) | Γ^e_{j,k−1} ) > 1 − pb(α, ζ), where
pb(α, ζ) := (n − c) exp( −αc²ζ²(λ−)² / (8 · 24²) );
3) P( ‖ℋk‖2 ≤ λ−_{new,k}(b + 0.125cζ) | Γ^e_{j,k−1} ) ≥ 1 − pc(α, ζ), where b is as defined in Definition 5.11 and
pc(α, ζ) := n exp( −αζ²(λ−)² / (8 · 24²(0.0324γnew² + 0.0072γnew + 0.0004)) )
+ n exp( −αζ²(λ−)² / (32 · 24²(0.06γnew² + 0.0006γnew + 0.4)²) )
+ n exp( −αζ²(λ−)² / (32 · 24²(0.186γnew² + 0.00034γnew + 2.3)²) ).
Proof of Lemma 8.2. Lemma 8.2 now follows by combining Lemmas 8.6 and 8.7 and defining
pk (α, ζ) := 1 − pa,k (α, ζ) − pb (α, ζ) − pc (α, ζ).
(5)
As above, we will start with some simple facts that will be used to prove Lemma 8.7. For convenience, we will use (1/α)Σt to denote (1/α)Σ_{t∈Ij,k}.
Fact 8.8. Under the assumptions of Theorem 4.2, the following are true.
1) The matrices Dnew, Rnew, Enew, D∗, Dnew,k−1, Φk−1 are functions of the r.v. Xj,k−1. All terms that we bound for the first two claims of the lemma are of the form (1/α)Σ_{t∈Ij,k} Zt where Zt = f1(Xj,k−1) Yt f2(Xj,k−1), Yt is a sub-matrix of at a't, and f1(.) and f2(.) are functions of Xj,k−1.
2) Xj,k−1 is independent of any at for t ∈ Ij,k, and hence the same is true for the matrices Dnew, Rnew, Enew, D∗, Dnew,k−1, Φk−1. Also, the at's for different t ∈ Ij,k are mutually independent. Thus, conditioned on Xj,k−1, the Zt's defined above are mutually independent.
3) All the terms that we bound for the third claim contain et. Using the second claim of Lemma 8.1, conditioned on Xj,k−1, et satisfies (2) w.p. one whenever Xj,k−1 ∈ Γj,k−1. Conditioned on Xj,k−1, all these terms are also of the form (1/α)Σ_{t∈Ij,k} Zt with Zt as defined above, whenever Xj,k−1 ∈ Γj,k−1. Thus, conditioned on Xj,k−1, the Zt's for these terms are mutually independent, whenever Xj,k−1 ∈ Γj,k−1.
4) It is easy to see that ‖Φk−1P∗‖2 ≤ ζ∗, ζ0 = ‖Dnew‖2 ≤ 1, Φ0Dnew = Dnew, ‖Rnew‖ ≤ 1, ‖(Rnew)^{−1}‖ ≤ 1/√(1 − ζ∗²), E'new,⊥ Dnew = 0, and ‖E'new Φ0 Aet‖ = ‖(R'new)^{−1} D'new Φ0 Aet‖ = ‖(R'new)^{−1} D'new Aet‖ ≤ ‖(R'new)^{−1} D'new A_{Nt}‖ ‖et‖ ≤ (κ+_{s,A}/√(1 − ζ∗²)) ‖et‖. The bounds on ‖Rnew‖ and ‖(Rnew)^{−1}‖ follow using Lemma 2.14 and the fact that σi(Rnew) = σi(Dnew).
5) Xj,k−1 ∈ Γj,k−1 implies that
a) ζ∗ ≤ ζ+_∗ (see Remark 6.3);
b) ζk−1 ≤ ζ+_{k−1} ≤ 0.6^{k−1} + 0.4cζ (this follows by the definition of Γj,k−1 and Lemma 6.1).
6) Item 5 implies that
a) λmin(Rnew R'new) ≥ 1 − (ζ+_∗)². This follows from Lemma 2.14 and the fact that σmin(Rnew) = σmin(Dnew).
b) ‖I'_{Nt} Φk−1 P∗‖2 ≤ ‖Φk−1P∗‖2 ≤ ζ∗ ≤ ζ+_∗ and ‖I'_{Nt} Dnew,k−1‖2 ≤ κ_{s,k−1} ζk−1 ≤ κ+_s ζ+_{k−1}.
7) By Weyl's theorem (Theorem 2.11), for a sequence of matrices Bt, λmin(Σt Bt) ≥ Σt λmin(Bt) and λmax(Σt Bt) ≤ Σt λmax(Bt).
8) ‖Aet‖2 ≤ ‖A_{Nt}‖2 ‖et‖2 ≤ √(1 + δs(A)) ‖et‖2.
Proof of Lemma 8.7. Consider Ak := (1/α) Σt E'new Φ0 Lt L't Φ0 Enew. Notice that E'new Φ0 Lt = Rnew at,new + E'new D∗ at,∗. Let Zt := Rnew at,new a't,new R'new and let Yt := Rnew at,new a't,∗ D'∗ Enew + E'new D∗ at,∗ a't,new R'new; then
Ak ⪰ (1/α) Σt Zt + (1/α) Σt Yt.       (6)
Consider Σt Zt = Σt Rnew at,new a't,new R'new.
1) Using item 2 of Fact 8.8, the Zt's are conditionally independent given Xj,k−1.
2) Using item 2, Ostrowski's theorem (Theorem 2.12), and item 6, for all Xj,k−1 ∈ Γj,k−1,
λmin( E((1/α)Σt Zt | Xj,k−1) ) = λmin( Rnew (1/α)Σt E(at,new a't,new) R'new ) ≥ λmin(Rnew R'new) λmin( (1/α)Σt E(at,new a't,new) ) ≥ (1 − (ζ+_∗)²) λ−_{new,k}.
3) Finally, using item 4 and the bound on ‖at‖∞ from the model, conditioned on Xj,k−1, 0 ⪯ Zt ⪯ c γnew,k² I ⪯ c min((1.2)^{2k}γnew², γ∗²) I holds w.p. one for all Xj,k−1 ∈ Γj,k−1.
Thus, applying Corollary 2.8 with ε = cζλ−/24, we get
P( λmin((1/α)Σt Zt) ≥ (1 − (ζ+_∗)²)λ−_{new,k} − cζλ−/24 | Xj,k−1 ) ≥ 1 − c exp( −αζ²(λ−)² / (8 · 24² · min(1.2^{4k}γnew⁴, γ∗⁴)) )     (7)
for all Xj,k−1 ∈ Γj,k−1.
Consider Yt = Rnew at,new a't,∗ D'∗ Enew + E'new D∗ at,∗ a't,new R'new.
1) Using item 2, the Yt's are conditionally independent given Xj,k−1.
2) Using item 2 and the fact that at,new and at,∗ are mutually uncorrelated, E( (1/α)Σt Yt | Xj,k−1 ) = 0 for all Xj,k−1 ∈ Γj,k−1.
3) Using the bound on ‖at‖∞, items 4 and 6, and Fact 8.5, conditioned on Xj,k−1, ‖Yt‖ ≤ 2√(cr) ζ+_∗ γ∗ γnew,k ≤ 2√(cr) ζ+_∗ γ∗² ≤ 2 holds w.p. one for all Xj,k−1 ∈ Γj,k−1. Thus, under the same conditioning, −bI ⪯ Yt ⪯ bI with b = 2 w.p. one.
Thus, applying Corollary 2.8 with ε = cζλ−/24, we get
P( λmin((1/α)Σt Yt) ≥ −cζλ−/24 | Xj,k−1 ) ≥ 1 − c exp( −αc²ζ²(λ−)² / (8 · 24² · (2b)²) )  for all Xj,k−1 ∈ Γj,k−1.       (8)
Combining (6), (7) and (8) and using the union bound, P( λmin(Ak) ≥ λ−_{new,k}(1 − (ζ+_∗)²) − cζλ−/12 | Xj,k−1 ) ≥ 1 − pa,k(α, ζ) for all Xj,k−1 ∈ Γj,k−1. The first claim of the lemma follows by using λ−_{new,k} ≥ λ− and then applying Lemma 2.5 with X ≡ Xj,k−1 and D ≡ Γj,k−1.
Now consider Ak,⊥ := (1/α) Σt E'new,⊥ Φ0 Lt L't Φ0 Enew,⊥. Using item 4, E'new,⊥ Φ0 Lt = E'new,⊥ D∗ at,∗. Thus, Ak,⊥ = (1/α)Σt Zt with Zt = E'new,⊥ D∗ at,∗ a't,∗ D'∗ Enew,⊥, which is of size (n − c) × (n − c). Using the same ideas as above we can show that 0 ⪯ Zt ⪯ r(ζ+_∗)²γ∗² I ⪯ ζ I and E( (1/α)Σt Zt | Xj,k−1 ) ⪯ (ζ+_∗)²λ+ I. Thus, by Corollary 2.8 with ε = cζλ−/24 and Lemma 2.5, the second claim follows.
Using the expression for ℋk given in Definition 5.7, it is easy to see that
‖ℋk‖2 ≤ max{‖Hk‖2, ‖Hk,⊥‖2} + ‖Bk‖2 ≤ ‖(1/α)Σt Aet(Aet)'‖2 + max(‖T2‖2, ‖T4‖2) + ‖Bk‖2       (9)
where T2 := (1/α)Σt E'new Φ0 (Lt(Aet)' + Aet L't) Φ0 Enew and T4 := (1/α)Σt E'new,⊥ Φ0 (Lt(Aet)' + Aet L't) Φ0 Enew,⊥. The second inequality follows by using the facts that (i) Hk = T1 − T2 where T1 := (1/α)Σt E'new Φ0 Aet(Aet)' Φ0 Enew, (ii) Hk,⊥ = T3 − T4 where T3 := (1/α)Σt E'new,⊥ Φ0 Aet(Aet)' Φ0 Enew,⊥, and (iii) max(‖T1‖2, ‖T3‖2) ≤ ‖(1/α)Σt Aet(Aet)'‖2. Next, we obtain high probability bounds on each of the terms on the RHS of (9) using the Hoeffding corollaries.
Consider ‖(1/α)Σt Aet(Aet)'‖2. Let Zt = Aet(Aet)'.
1) Using item 2, conditioned on Xj,k−1, the various Zt's in the summation are independent, for all Xj,k−1 ∈ Γj,k−1.
2) Using item 6 and the bound on ‖at‖∞, conditioned on Xj,k−1, 0 ⪯ Zt ⪯ b1 I w.p. one for all Xj,k−1 ∈ Γj,k−1. Here b1 := (1 + δ+_s)(κ+_{s,A} ζ+_{k−1} φ+ √c γnew,k + ζ+_∗ φ+ √r γ∗)².
3) Also using item 6 and the bound on ‖at‖∞, 0 ⪯ (1/α)Σt E(Zt | Xj,k−1) ⪯ b2 I, with b2 := (1 + δ+_s)((κ+_{s,A})²(ζ+_{k−1})²(φ+)² λ+_{new,k} + (ζ+_∗)²(φ+)² λ+), for all Xj,k−1 ∈ Γj,k−1.
Thus, applying Corollary 2.8 with ε = cζλ−/24,
P( ‖(1/α)Σt Aet(Aet)'‖2 ≤ b2 + cζλ−/24 | Xj,k−1 ) ≥ 1 − n exp( −αc²ζ²(λ−)² / (8 · 24² b1²) )  for all Xj,k−1 ∈ Γj,k−1.     (10)
Consider T2. Let Zt := E'new Φ0 (Lt(Aet)' + Aet L't) Φ0 Enew, which is of size c × c. Then T2 = (1/α)Σt Zt.
1) Using item 2, conditioned on Xj,k−1, the various Zt's used in the summation are mutually independent, for all Xj,k−1 ∈ Γj,k−1. Using item 4, E'new Φ0 Lt = Rnew at,new + E'new D∗ at,∗ and E'new Φ0 Aet = (R'new)^{−1} D'new Aet.
2) Thus, using items 4 and 6 and the bound on ‖at‖∞, it follows that conditioned on Xj,k−1, ‖Zt‖2 ≤ 2b̃3 ≤ 2b3 w.p. one for all Xj,k−1 ∈ Γj,k−1. Here,
b̃3 := (κ+_{s,A}/√(1 − (ζ+_∗)²)) φ+ (κ+_{s,A} ζ+_{k−1} √c γnew,k + √r ζ+_∗ γ∗)(√c γnew,k + √r ζ+_∗ γ∗)
and
b3 := (1/√(1 − (ζ+_∗)²)) ( φ+ c (κ+_{s,A})² ζ+_{k−1} γnew,k² + φ+ √(rc) κ+_{s,A} ζ+_{k−1} ζ+_∗ γnew,k γ∗ + φ+ √(rc) κ+_s ζ+_∗ γ∗ γnew,k + φ+ r (ζ+_∗)² γ∗² ).
3) Also, ‖(1/α)Σt E(Zt | Xj,k−1)‖2 ≤ 2b̃4 ≤ 2b4, where
b̃4 := (κ+_{s,A}/√(1 − (ζ+_∗)²)) φ+ κ+_s ζ+_{k−1} λ+_{new,k} + (κ+_{s,A}/(1 − (ζ+_∗)²)) φ+ (ζ+_∗)² λ+
and
b4 := (1/√(1 − (ζ+_∗)²)) φ+ κ+_{s,A} ζ+_{k−1} λ+_{new,k} + (1/√(1 − (ζ+_∗)²)) φ+ (ζ+_∗)² λ+.
Thus, applying Corollary 2.9 with ε = cζλ−/24,
P( ‖T2‖2 ≤ 2b4 + cζλ−/24 | Xj,k−1 ) ≥ 1 − c exp( −αc²ζ²(λ−)² / (32 · 24² · 4b3²) )  for all Xj,k−1 ∈ Γj,k−1.
Consider T4. Let Zt := Enew,⊥'Φ0 (Lt (Aet)' + Aet Lt')Φ0 Enew,⊥, which is of size (n − c) × (n − c). Then T4 = (1/α) Σt Zt.
1) Using item 2, conditioned on Xj,k−1, the various Zt's used in the summation are mutually independent, for all Xj,k−1 ∈ Γj,k−1. Using item 4, Enew,⊥'Φ0 Lt = Enew,⊥'D∗ at,∗.
2) Thus, conditioned on Xj,k−1, ‖Zt‖2 ≤ 2b5 w.p. one for all Xj,k−1 ∈ Γj,k−1. Here b5 := √(1 + δs+) (φ+ r (ζ∗+)² γ∗² + φ+ √(rc) κ+s ζ∗+ ζ+k−1 γ∗ γnew,k). This follows using item 6 and the bound on ‖at‖∞.
3) Also, ‖(1/α) Σt E(Zt |Xj,k−1)‖2 ≤ 2b6, with b6 := √(1 + δs+) (φ+ (ζ∗+)² λ+).
Applying Corollary 2.9 with ε = cζλ−/24,
P( ‖T4‖2 ≤ 2b6 + cζλ−/24 | Xj,k−1 ) ≥ 1 − (n − c) exp( −αc²ζ²(λ−)² / (32 · 24² · 4b5²) ) for all Xj,k−1 ∈ Γj,k−1
Consider max(‖T2‖2 , ‖T4‖2). Since b3 > b5 (this follows because ζ+k−1 ≤ 1) and b4 > b6, we have 2b6 + cζλ−/24 < 2b4 + cζλ−/24 and 1 − (n − c) exp( −αc²ζ²(λ−)² / (32 · 24² · 4b5²) ) > 1 − (n − c) exp( −αc²ζ²(λ−)² / (32 · 24² · 4b3²) ). Therefore, for all Xj,k−1 ∈ Γj,k−1,
P( ‖T4‖2 ≤ 2b4 + cζλ−/24 | Xj,k−1 ) ≥ 1 − (n − c) exp( −αc²ζ²(λ−)² / (32 · 24² · 4b3²) ).
By the union bound, for all Xj,k−1 ∈ Γj,k−1,
P( max(‖T2‖2 , ‖T4‖2) ≤ 2b4 + cζλ−/24 | Xj,k−1 ) ≥ 1 − n exp( −αc²ζ²(λ−)² / (32 · 24² · 4b3²) )    (11)
Consider ‖Bk‖2. Let Zt := Enew,⊥'Φ0 (Lt − Aet)(Lt' − (Aet)')Φ0 Enew, which is of size (n − c) × c. Then Bk = (1/α) Σt Zt. Using item 4, Enew,⊥'Φ0 (Lt − Aet) = Enew,⊥'(D∗ at,∗ − Φ0 Aet) and Enew'Φ0 (Lt − Aet) = Rnew at,new + Enew'D∗ at,∗ + (Rnew')−1 Dnew'Aet. Also, ‖Zt‖2 ≤ b7 w.p. one for all Xj,k−1 ∈ Γj,k−1 and ‖(1/α) Σt E(Zt |Xj,k−1)‖2 ≤ b8 for all Xj,k−1 ∈ Γj,k−1. Here
b7 := (1 + δs+)(√r ζ∗+ (1 + φ+) γ∗ + κ+s,A ζ+k−1 φ+ √c γnew,k) · ( √c γnew,k + √r ζ∗+ (1 + (1 / √(1 − (ζ∗+)²)) κ+s,A φ+) γ∗ + (1 / √(1 − (ζ∗+)²)) κ+s,A ζ+k−1 φ+ √c γnew,k )
and
b8 := (1 + δs+)( κ+s,A ζ+k−1 φ+ + (1 / √(1 − (ζ∗+)²)) (κ+s,A)³ (ζ+k−1)² (φ+)² ) λ+new,k + (1 + δs+)(ζ∗+)² ( 1 + φ+ + (1 / √(1 − (ζ∗+)²)) κ+s,A φ+ + (1 / √(1 − (ζ∗+)²)) κ+s,A (φ+)² ) λ+
Thus, applying Corollary 2.9 with ε = cζλ−/24,
P( ‖Bk‖2 ≤ b8 + cζλ−/24 | Xj,k−1 ) ≥ 1 − n exp( −αc²ζ²(λ−)² / (32 · 24² b7²) ) for all Xj,k−1 ∈ Γj,k−1    (12)
Using (9), (10), (11) and (12) and the union bound, for any Xj,k−1 ∈ Γj,k−1,
P( ‖ℋk‖2 ≤ b9 + cζλ−/8 | Xj,k−1 ) ≥ 1 − n exp( −αc²ζ²(λ−)² / (8 · 24² b1²) ) − n exp( −αc²ζ²(λ−)² / (32 · 24² · 4b3²) ) − n exp( −αc²ζ²(λ−)² / (32 · 24² b7²) )
where b9 := b2 + 2b4 + b8.
Using λ−new,k ≥ λ− and f := λ+/λ−, b9 + cζλ−/8 ≤ λ−new,k (b + 0.125cζ) (recall that b is defined in Definition 5.11). Using Fact 8.5 and substituting κ+s = 0.15, φ+ = 1.2, δs+ = 0.05, one can upper bound b1, b3 and b7 and show that the above probability is lower bounded by 1 − pc(α, ζ). Finally, applying Lemma 2.5, the third claim of the lemma follows.
IX. APPENDIX
A. Proof of Lemma 2.5
Proof. It is easy to see that P(Be, De) = E[IB(X, Y) ID(X)].
If E[IB(X, Y)|X] ≥ p for all X ∈ D, this means that E[IB(X, Y)|X] ID(X) ≥ p ID(X). This, in turn, implies that
P(Be, De) = E[IB(X, Y) ID(X)] = E[E[IB(X, Y)|X] ID(X)] ≥ p E[ID(X)].
Recall from Definition 2.4 that P(Be|X) = E[IB(X, Y)|X] and P(De) = E[ID(X)]. Thus, we conclude that if P(Be|X) ≥ p for all X ∈ D, then P(Be, De) ≥ p P(De). Using the definition of P(Be|De), the claim follows.
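As a quick numerical illustration of the inequality P(Be, De) ≥ p P(De), one can simulate a toy example. The following is a minimal Monte Carlo sketch; the distribution, the events and the value p = 0.7 are arbitrary illustrative choices and are not part of our model.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200000

    # Toy setup: X uniform on {0, 1, 2}; D^e = {X >= 1}; given X, Y ~ Bernoulli(q[X]); B^e = {Y = 1}
    q = np.array([0.2, 0.7, 0.9])
    X = rng.integers(0, 3, N)
    Y = rng.random(N) < q[X]

    D = X >= 1
    B = Y

    p = q[1:].min()                                 # p = min over X in D of P(B^e | X) = 0.7
    print(np.mean(B & D), ">=", p * np.mean(D))     # empirically, P(B^e, D^e) >= p P(D^e)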
B. Proof of Corollary 2.8
Proof.
1) Since, for any X ∈ D, conditioned on X, the Zt's are independent, the same is also true for Zt − g(X) for any function g of X. Let Yt := Zt − E(Zt |X). Thus, for any X ∈ D, conditioned on X, the Yt's are independent. Also, clearly E(Yt |X) = 0. Since for all X ∈ D, P(b1 I ⪯ Zt ⪯ b2 I | X) = 1 and since λmax(·) is a convex function, and λmin(·) a concave function, of a Hermitian matrix, b1 I ⪯ E(Zt |X) ⪯ b2 I w.p. one for all X ∈ D. Therefore, P(Yt² ⪯ (b2 − b1)² I | X) = 1 for all X ∈ D. Thus, for Theorem 2.7, σ² = ‖Σt (b2 − b1)² I‖2 = α(b2 − b1)². For any X ∈ D, applying Theorem 2.7 to the Yt's conditioned on X, we get that, for any ε > 0,
P( λmax( (1/α) Σt Yt ) ≤ ε | X ) > 1 − n exp( −αε² / (8(b2 − b1)²) ) for all X ∈ D.
By Weyl's theorem, λmax( (1/α) Σt Yt ) = λmax( (1/α) Σt (Zt − E(Zt |X)) ) ≥ λmax( (1/α) Σt Zt ) + λmin( (1/α) Σt −E(Zt |X) ). Since λmin( (1/α) Σt −E(Zt |X) ) = −λmax( (1/α) Σt E(Zt |X) ) ≥ −b4, we get λmax( (1/α) Σt Yt ) ≥ λmax( (1/α) Σt Zt ) − b4. Therefore,
P( λmax( (1/α) Σt Zt ) ≤ b4 + ε | X ) > 1 − n exp( −αε² / (8(b2 − b1)²) ) for all X ∈ D.
2) Let Yt = E(Zt |X) − Zt. As before, E(Yt |X) = 0 and, conditioned on any X ∈ D, the Yt's are independent and P(Yt² ⪯ (b2 − b1)² I | X) = 1. As before, applying Theorem 2.7, we get that for any ε > 0,
P( λmax( (1/α) Σt Yt ) ≤ ε | X ) > 1 − n exp( −αε² / (8(b2 − b1)²) ) for all X ∈ D.
By Weyl's theorem, λmax( (1/α) Σt Yt ) = λmax( (1/α) Σt (E(Zt |X) − Zt) ) ≥ λmin( (1/α) Σt E(Zt |X) ) + λmax( (1/α) Σt −Zt ) = λmin( (1/α) Σt E(Zt |X) ) − λmin( (1/α) Σt Zt ) ≥ b3 − λmin( (1/α) Σt Zt ). Therefore, for any ε > 0,
P( λmin( (1/α) Σt Zt ) ≥ b3 − ε | X ) ≥ 1 − n exp( −αε² / (8(b2 − b1)²) ) for all X ∈ D.
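To get a feel for the bound in part 1, the following short simulation (an illustrative sketch only: the diagonal-matrix ensemble, all parameter values, and the trivial conditioning on X are arbitrary choices) compares the empirical probability that λmax of the sample average exceeds b4 + ε with the bound n exp(−αε²/(8(b2 − b1)²)).

    import numpy as np

    rng = np.random.default_rng(0)
    n, alpha, eps, trials = 20, 5000, 0.1, 200
    b1, b2 = 0.0, 1.0      # each Z_t satisfies b1 I <= Z_t <= b2 I
    b4 = 0.5               # lambda_max of (1/alpha) sum_t E[Z_t]

    exceed = 0
    for _ in range(trials):
        # alpha independent diagonal PSD matrices with Uniform(0, 1) diagonal entries
        diag_entries = rng.uniform(0.0, 1.0, size=(alpha, n))
        avg_diag = diag_entries.mean(axis=0)      # diagonal of (1/alpha) sum_t Z_t
        if avg_diag.max() > b4 + eps:             # lambda_max of the average
            exceed += 1

    bound = n * np.exp(-alpha * eps**2 / (8 * (b2 - b1) ** 2))
    print(exceed / trials, "<=", bound)           # empirical failure rate vs. the Hoeffding bound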
C. Proof of Corollary 2.9
Proof. Define the dilation of an n1 × n2 matrix M as dilation(M) := [0, M' ; M, 0]. Notice that this is an (n1 + n2) × (n1 + n2) Hermitian matrix [19]. As shown in [19, equation 2.12],
λmax(dilation(M)) = ‖dilation(M)‖2 = ‖M‖2    (13)
Thus, the corollary assumptions imply that P(‖dilation(Zt)‖2 ≤ b1 | X) = 1 for all X ∈ D. Thus, P(−b1 I ⪯ dilation(Zt) ⪯ b1 I | X) = 1 for all X ∈ D.
Using (13), the corollary assumptions also imply that (1/α) Σt E(dilation(Zt)|X) = dilation( (1/α) Σt E(Zt |X) ) ⪯ b2 I for all X ∈ D.
Finally, the Zt's being conditionally independent given X, for any X ∈ D, implies that the same also holds for the dilation(Zt)'s.
Thus, applying Corollary 2.8 to the sequence {dilation(Zt)}, we get that
P( λmax( (1/α) Σt dilation(Zt) ) ≤ b2 + ε | X ) ≥ 1 − (n1 + n2) exp( −αε² / (32 b1²) ) for all X ∈ D.
Using (13), λmax( (1/α) Σt dilation(Zt) ) = λmax( dilation( (1/α) Σt Zt ) ) = ‖(1/α) Σt Zt‖2 and this gives the final result.
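Equation (13) is easy to verify numerically; the following sketch (the sizes and the random M are arbitrary, for illustration only) checks that λmax(dilation(M)) equals ‖M‖2.

    import numpy as np

    rng = np.random.default_rng(0)
    n1, n2 = 5, 3
    M = rng.standard_normal((n1, n2))

    # dilation(M) = [[0, M'], [M, 0]] is an (n1 + n2) x (n1 + n2) symmetric matrix
    dil = np.block([[np.zeros((n2, n2)), M.T],
                    [M, np.zeros((n1, n1))]])

    lam_max = np.linalg.eigvalsh(dil).max()
    spec_norm = np.linalg.norm(M, 2)
    print(np.isclose(lam_max, spec_norm))   # True: lambda_max(dilation(M)) = ||M||_2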
D. Proof of Lemma 2.14
Proof. Because P, Q and P̂ are basis matrices, P'P = I, Q'Q = I and P̂'P̂ = I.
1) Using P'P = I and ‖M‖2² = ‖M M'‖2, ‖(I − P̂ P̂')P P'‖2 = ‖(I − P̂ P̂')P‖2. Similarly, ‖(I − P P')P̂ P̂'‖2 = ‖(I − P P')P̂‖2. Let D1 = (I − P̂ P̂')P P' and let D2 = (I − P P')P̂ P̂'. Notice that ‖D1‖2 = √λmax(D1'D1) = √‖D1'D1‖2 and ‖D2‖2 = √λmax(D2'D2) = √‖D2'D2‖2. So, in order to show ‖D1‖2 = ‖D2‖2, it suffices to show that ‖D1'D1‖2 = ‖D2'D2‖2. Let P'P̂ = UΣV' be its SVD. Then, D1'D1 = P(I − P'P̂ P̂'P)P' = P U(I − Σ²)U'P' and D2'D2 = P̂(I − P̂'P P'P̂)P̂' = P̂ V(I − Σ²)V'P̂' are the compact SVDs of D1'D1 and D2'D2 respectively. Therefore, ‖D1'D1‖2 = ‖D2'D2‖2 = ‖I − Σ²‖2 and hence ‖(I − P̂ P̂')P P'‖2 = ‖(I − P P')P̂ P̂'‖2.
2) ‖P P' − P̂ P̂'‖2 = ‖P P' − P̂ P̂'P P' + P̂ P̂'P P' − P̂ P̂'‖2 ≤ ‖(I − P̂ P̂')P P'‖2 + ‖(I − P P')P̂ P̂'‖2 = 2ζ∗.
3) Since Q'P = 0, ‖Q'P̂‖2 = ‖Q'(I − P P')P̂‖2 ≤ ‖(I − P P')P̂‖2 = ζ∗.
4) Let M = (I − P̂ P̂')Q. Then M'M = Q'(I − P̂ P̂')Q and so σi((I − P̂ P̂')Q) = √λi(Q'(I − P̂ P̂')Q). Clearly, λmax(Q'(I − P̂ P̂')Q) ≤ 1. By Weyl's theorem, λmin(Q'(I − P̂ P̂')Q) ≥ 1 − λmax(Q'P̂ P̂'Q) = 1 − ‖Q'P̂‖2² ≥ 1 − ζ∗². Therefore, √(1 − ζ∗²) ≤ σi((I − P̂ P̂')Q) ≤ 1.
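The equality in item 1 can also be checked numerically for two randomly generated basis matrices. This is a minimal sketch; the dimensions are arbitrary and both basis matrices are taken to have the same rank, which is what the argument above uses.

    import numpy as np

    rng = np.random.default_rng(0)
    n, r = 30, 4

    # Random n x r basis matrices P and P_hat (orthonormal columns)
    P, _ = np.linalg.qr(rng.standard_normal((n, r)))
    Phat, _ = np.linalg.qr(rng.standard_normal((n, r)))

    I = np.eye(n)
    D1 = (I - Phat @ Phat.T) @ P @ P.T
    D2 = (I - P @ P.T) @ Phat @ Phat.T

    # Item 1 of Lemma 2.14: the two subspace error norms coincide
    print(np.isclose(np.linalg.norm(D1, 2), np.linalg.norm(D2, 2)))   # True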
E. Proof of Lemma 3.4
Proof. Let B = (I − P P')A. By the Rayleigh-Ritz theorem,
(1 − δs(B)) ≤ λmin(BT'BT) ≤ λmax(BT'BT) ≤ (1 + δs(B)),
which implies
1 − λmin(BT'BT) ≤ δs(B) and λmax(BT'BT) − 1 ≤ δs(B),
with one of the two holding with equality. Now BT'BT = AT'AT − (AT'P)(AT'P)'. By Weyl's theorem,
λmin(AT'AT) + λmin(−(AT'P)(AT'P)') ≤ λmin(BT'BT)    (14)
and
λmax(BT'BT) ≤ λmax(AT'AT) + λmax(−(AT'P)(AT'P)').    (15)
From (14), using the fact that λmin(−H) = −λmax(H), we get 1 − δs(A) − λmax((AT'P)(AT'P)') ≤ λmin(BT'BT), which implies
1 − λmin(BT'BT) ≤ δs(A) + λmax((AT'P)(AT'P)') ≤ δs(A) + κs,A(P)².
From (15) we get λmax(BT'BT) ≤ 1 + δs(A) − λmin((AT'P)(AT'P)'), which implies
λmax(BT'BT) − 1 ≤ δs(A) − λmin((AT'P)(AT'P)') ≤ δs(A).
Therefore,
δs(B) = δs((I − P P')A) ≤ δs(A) + κs,A(P)².
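For small sizes, the bound can be verified by brute force, computing δs exactly by enumerating all size-s supports. The sketch below is illustrative only: the sizes, the Gaussian A and the random basis matrix P are arbitrary choices, and κs,A(P) is computed as the maximum of ‖AT'P‖2 over supports of size s, which is how it enters the proof above.

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)
    n, m, s, r = 8, 20, 2, 2       # small enough that the RIC can be computed exactly

    A = rng.standard_normal((n, m)) / np.sqrt(n)        # columns with roughly unit norm
    P, _ = np.linalg.qr(rng.standard_normal((n, r)))    # basis matrix P
    B = (np.eye(n) - P @ P.T) @ A

    def delta_s(M, s):
        # RIC: max over |T| = s of max(1 - lambda_min, lambda_max - 1) of M_T' M_T
        d = 0.0
        for T in combinations(range(M.shape[1]), s):
            cols = list(T)
            ev = np.linalg.eigvalsh(M[:, cols].T @ M[:, cols])
            d = max(d, 1.0 - ev.min(), ev.max() - 1.0)
        return d

    kappa = max(np.linalg.norm(A[:, list(T)].T @ P, 2)
                for T in combinations(range(m), s))     # kappa_{s,A}(P)

    print(delta_s(B, s), "<=", delta_s(A, s) + kappa ** 2)   # Lemma 3.4 inequality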
REFERENCES
[1] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of ACM, vol. 58, no. 3, 2011.
[2] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, “Rank-sparsity incoherence for matrix decomposition,” SIAM Journal
on Optimization, vol. 21, 2011.
[3] H. Xu, C. Caramanis, and S. Sanghavi, “Robust pca via outlier pursuit,” IEEE Trans. on Information Theory, vol. 58, no. 5, 2012.
[4] M. McCoy and J. Tropp, “Two proposals for robust pca using semidefinite programming,” arXiv:1012.1086v3, 2010.
[5] M. B. McCoy and J. A. Tropp, “Sharp recovery bounds for convex deconvolution, with applications,” arXiv:1205.1580.
[6] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of linear inverse problems,” Foundations of
Computational Mathematics, no. 6, 2012.
[7] Y. Hu, S. Goud, and M. Jacob, “A fast majorize-minimize algorithm for the recovery of sparse and low-rank matrices,” IEEE Transactions on Image Processing, vol. 21, no. 2, pp. 742–753, Feb 2012.
[8] A. E. Waters, A. C. Sankaranarayanan, and R. G. Baraniuk, “Sparcs: Recovering low-rank and sparse matrices from compressive
measurements,” in Proc. of Neural Information Processing Systems (NIPS), 2011.
[9] E. Richard, P.-A. Savalle, and N. Vayatis, “Estimation of simultaneously sparse and low rank matrices,” arXiv:1206.6474, appears in
Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
[10] D. Hsu, S. M. Kakade, and T. Zhang, “Robust matrix decomposition with outliers,” arXiv:1011.1518.
[11] M. Mardani, G. Mateos, and G. B. Giannakis, “Recovery of low-rank plus compressed sparse matrices with application to unveiling
traffic anomalies,” arXiv:1204.6537.
[12] J. Wright, A. Ganesh, K. Min, and Y. Ma, “Compressive principal component pursuit,” arXiv:1202.4596.
[13] A. Ganesh, K. Min, J. Wright, and Y. Ma, “Principal component pursuit with reduced linear measurements,” arXiv:1202.6445.
[14] M. Tao and X. Yuan, “Recovering low-rank and sparse components of matrices from incomplete and noisy observations,” SIAM Journal
on Optimization, vol. 21, no. 1, pp. 57–81, 2011.
[15] C. Qiu, N. Vaswani, and L. Hogben, “Recursive robust pca or recursive sparse recovery in large but structured noise,” in IEEE Intl.
Conf. Acoustics, Speech, Sig. Proc. (ICASSP), 2013, longer version in arXiv: 1211.3754 [cs.IT].
[16] N. Vaswani and W. Lu, “Modified-cs: Modifying compressive sensing for problems with partially known support,” IEEE Trans. Signal
Processing, September 2010.
[17] E. Candes and T. Tao, “Decoding by linear programming,” IEEE Trans. Info. Th., vol. 51(12), pp. 4203 – 4215, Dec. 2005.
[18] G. Grimmett and D. Stirzaker, Probability and Random Processes. Oxford University Press, 2001.
[19] J. A. Tropp, “User-friendly tail bounds for sums of random matrices,” Foundations of Computational Mathematics, vol. 12, no. 4, 2012.
[20] C. Davis and W. M. Kahan, “The rotation of eigenvectors by a perturbation. III,” SIAM Journal on Numerical Analysis, Mar. 1970.
[21] R. Horn and C. Johnson, Matrix Analysis. Cambridge University Press, 1985.
[22] J. Zhan and N. Vaswani, “Time invariant error bounds for modified-cs based sparse signal sequence recovery,” arXiv:1305.0842.
[23] G. Li and Z. Chen, “Projection-pursuit approach to robust dispersion matrices and principal components: Primary theory and Monte Carlo,” Journal of the American Statistical Association, vol. 80, no. 391, pp. 759–766, 1985.