Exploiting Statistical Dependencies in Sparse Representations *

Michael Elad
The Computer Science Department
The Technion – Israel Institute of Technology
Haifa 32000, Israel

* Joint work with Tomer Faktor & Yonina Eldar.
This research was supported by the European Community's FP7-FET program SMALL under grant agreement no. 225913.
Agenda

Part I – Motivation for Modeling Dependencies
Part II – Basics of the Boltzmann Machine
Part III – Sparse Recovery Revisited
Part IV – Construct Practical Pursuit Algorithms
Part V – First Steps Towards Adaptive Recovery

Today we will show that: using the BM, sparse and redundant representation modeling can be extended naturally and comfortably to cope with dependencies within the representation vector, while keeping most of the simplicity of the original model.
Part I – Motivation for Modeling Dependencies
Sparse & Redundant Representations

- We model a signal x as a multiplication of a (given) dictionary D by a sparse vector α.
- The classical (Bayesian) approach for modeling α is as a Bernoulli-Gaussian RV, implying that the entries of α are iid.
- Typical recovery: find α from y = x + e = Dα + e. This is done with pursuit algorithms (OMP, BP, ...) that assume this iid structure.
- Is this fair? Is it correct?

[Figure: the dictionary D (of size N×K) multiplies a sparse vector α of length K to produce the signal x of length N.]
Let's Apply a Simple Experiment

- We gather a set of 100,000 randomly chosen image patches of size 8×8, extracted from the image 'Barbara'. These are our signals {y_k}.
- We apply sparse coding to these patches using the OMP algorithm (error mode with ε=2) and the redundant DCT dictionary (256 atoms). Thus, we approximate the solution of

  $\hat{\alpha}_k = \arg\min_{\alpha} \|\alpha\|_0 \quad \text{s.t.} \quad \|y_k - D\alpha\|_2^2 \le N\varepsilon^2$

- Let's observe the obtained representations' supports. Specifically, let's check whether the iid assumption is true.
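A minimal sketch of this sparse-coding step, assuming a 1D stand-in for the redundant 2D-DCT dictionary (the actual experiment works on 8×8 patches with a 256-atom 2D-DCT); the function names and the dictionary construction are illustrative, not the original code:

```python
import numpy as np

def overcomplete_dct(n=64, k=256):
    """Illustrative overcomplete DCT-like dictionary: k cosine atoms of length n, L2-normalized."""
    t = np.arange(n)[:, None]
    f = np.arange(k)[None, :]
    D = np.cos(np.pi * (t + 0.5) * f / k)
    return D / np.linalg.norm(D, axis=0)

def omp_error_mode(D, y, eps=2.0):
    """Error-mode OMP: grow the support until ||y - D*alpha||_2^2 <= N*eps^2."""
    n, k = D.shape
    support, coef = [], np.zeros(0)
    residual = y.copy()
    while residual @ residual > n * eps ** 2 and len(support) < n:
        j = int(np.argmax(np.abs(D.T @ residual)))                # atom most correlated with the residual
        support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)  # LS fit on the current support
        residual = y - D[:, support] @ coef
    alpha = np.zeros(k)
    alpha[support] = coef
    return alpha, np.array(support)
```

Stacking the sign patterns of the resulting representations as columns gives the support set analyzed in the next slides.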
The Experiment's Results (1)

- The supports are denoted by s, of length K (=256), with +1 for on-support elements and -1 for off-support ones: s = [s_1, s_2, ..., s_K] ∈ {-1,+1}^K.
- First, we check the probability of each atom participating in the support: P(s_i = +1), 1 ≤ i ≤ K.

[Figure: the empirical probabilities P(s_i = +1) for the 256 atoms.]

As can be seen, these probabilities are far from being uniform, even though OMP gave all atoms the same weight.
The Experiment's Results (2)

- Next, we check whether there are dependencies between the atoms, by evaluating

  $\left|\log_{10}\frac{P(s_i = 1,\, s_j = 1)}{P(s_i = 1)\,P(s_j = 1)}\right|, \quad 1 \le i, j \le K$

- The ratio inside the log is 1 if s_i and s_j are independent, and away from 1 when there is a dependency. After the abs-log, a value away from zero indicates a dependency.

[Figure: the 256×256 pairwise dependency map.]

As can be seen, there are some dependencies between the elements, even though OMP did not promote them.
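A small sketch of both empirical checks (the marginals of the previous slide and the pairwise dependency map above), assuming the supports are stacked as a K×M matrix S with entries in {-1,+1}; the helper name is illustrative:

```python
import numpy as np

def support_statistics(S, eps=1e-12):
    """S: K x M matrix of supports in {-1,+1}, one column per coded patch."""
    A = (S == 1).astype(float)                # on-support indicators as 0/1
    p = A.mean(axis=1)                        # empirical P(s_i = +1)
    joint = (A @ A.T) / S.shape[1]            # empirical P(s_i = +1, s_j = +1)
    ratio = joint / (np.outer(p, p) + eps)    # equals 1 when s_i and s_j are independent
    dependency_map = np.abs(np.log10(ratio + eps))
    return p, dependency_map
```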
The Experiment's Results (3)

- We saw that there are dependencies within the support vector. Could we claim that these form a block-sparsity pattern?
- If s_i and s_j belong to the same block, we should get that P(s_i = 1 | s_j = 1) ≈ 1. Thus, we evaluate

  $\left|\log_{10}\frac{1}{P(s_i = 1 \mid s_j = 1)}\right|$

  and, after the abs-log, any value away from 0 means a non-block structure.

[Figure: the 256×256 conditional-probability map.]

We can see that the supports do not form a block-sparsity structure, but rather a more complex one.
Our Goals

- We would like to (statistically) model these dependencies somehow.
- We will adopt a generative model for α of the following form: first generate the support vector s (of length K) by drawing it from the distribution P(s); then generate the non-zero values for the on-support entries, drawn from the distributions N(0, σ_i²).
- What should P(s) be? We shall follow recent work and choose the Boltzmann Machine [Garrigues & Olshausen '08], [Cevher, Duarte, Hedge & Baraniuk '09].
- Our goals in this work are:
  - Develop general-purpose pursuit algorithms for this model.
  - Estimate this model's parameters for a given bulk of data.
Part II – Basics of the Boltzmann Machine
BM – Definition

- Recall that the support vector s is a binary vector of length K, indicating the on- and off-support elements: s = [s_1, s_2, ..., s_K] ∈ {-1,+1}^K.
- The Boltzmann Machine assigns the following probability to this vector:

  $P(s) = \frac{1}{Z(W,b)}\exp\!\left(b^T s + \frac{1}{2}s^T W s\right)$

- The Boltzmann Machine (BM) is a special case of the more general Markov-Random-Field (MRF) model.
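A direct transcription of this definition, for intuition only; the brute-force partition function sums over all 2^K supports and is feasible only for small K (the helper names are mine):

```python
import numpy as np
from itertools import product

def bm_energy(s, W, b):
    """Un-normalized BM log-probability: b^T s + 0.5 * s^T W s, with s in {-1,+1}^K."""
    return b @ s + 0.5 * s @ W @ s

def bm_probability(s, W, b):
    """Exact P(s) by brute-force normalization; practical only for K up to ~20."""
    K = len(b)
    Z = sum(np.exp(bm_energy(np.array(c), W, b)) for c in product([-1.0, 1.0], repeat=K))
    return np.exp(bm_energy(s, W, b)) / Z
```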
The Vector b

- If we assume that W = 0, the BM becomes

  $P(s) = \frac{1}{Z}\exp\!\left(b^T s\right) = \frac{1}{Z}\prod_{k=1}^{K}\exp(b_k s_k)$

- In this case, the entries of s are independent.
- In this case, the probability of s_k = +1 is given by

  $p_k = P(s_k = 1) = \frac{\exp(b_k)}{\exp(b_k) + \exp(-b_k)}$

  Thus, p_k becomes very small (promoting sparsity) if b_k ≪ 0.
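A one-line numerical check of this formula; the chosen value of b is an arbitrary illustration:

```python
import numpy as np

b = np.full(64, -2.0)                       # strongly negative biases promote sparsity
pk = np.exp(b) / (np.exp(b) + np.exp(-b))   # P(s_k = +1); equals the logistic sigmoid of 2*b_k
print(pk[0], pk.sum())                      # ~0.018 per atom, ~1.15 expected on-support atoms out of 64
```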
The Matrix W

- W introduces the inter-dependencies between the entries of s:
  - A zero value W_ij implies an 'independence' between s_i and s_j.
  - A positive value W_ij implies an 'excitatory' interaction between s_i and s_j.
  - A negative value W_ij implies an 'inhibitory' interaction between s_i and s_j.
  - The larger the magnitude of W_ij, the stronger these forces are.
- W is symmetric, since the interaction between i and j is the same as the one between j and i.
- We assume that diag(W) = 0: since s_k² = 1, the diagonal contributes only a constant,

  $P(s) = \frac{1}{Z}\exp\!\left(b^T s + \frac{1}{2}s^T W s\right)
        = \frac{1}{Z}\exp\!\left(b^T s + \frac{1}{2}s^T\big(W - \mathrm{diag}(W)\big)s + \underbrace{\frac{1}{2}s^T\mathrm{diag}(W)\,s}_{\frac{1}{2}\mathrm{trace}(W)}\right)$

  so the diagonal can be absorbed into the partition function.
Special Cases of Interest

- When W = 0 and b = c·1, we return to the iid case we started with. If c ≪ 0, this model promotes sparsity.
- When W is block-diagonal (up to permutation) with very high and positive entries, this corresponds to a block-sparsity structure.
- A banded W (up to permutation) with 2L+1 main diagonals corresponds to a chordal graph that leads to a decomposable model.
- A tree structure can also fit this model easily.

[Figure: the W patterns for the iid, block-sparsity, chordal, and tree cases.]
Part III – MAP/MMSE Sparse Recovery Revisited
The Signal Model

- D is a fixed and known dictionary of size N×K.
- We assume that α is built by:
  - Choosing the support s with BM probability P(s) from all the 2^K possibilities Ω.
  - Choosing the α_s coefficients using independent Gaussian entries N(0, σ_i²); we shall assume that the σ_i are all the same (= σ_α).
- The ideal signal is x = Dα = D_s α_s.

The p.d.f.'s P(α) and P(x) are clear and known.
Adding Noise

- The noise e is an additive white Gaussian vector, so that y = x + e with

  $P(y \mid x) = C\exp\!\left(-\frac{\|y - x\|_2^2}{2\sigma_e^2}\right)$

- The conditional p.d.f.'s P(y|α), P(α|y), and even P(y|s), P(s|y), are all clear and well-defined (although they may appear nasty).
Pursuit via the Posterior

We have access to P(s|y), which leads to three estimators:

- MAP: $\hat{s}^{MAP} = \arg\max_s P(s \mid y)$, followed by $\hat{\alpha}^{MAP}$ computed on the found support.
- Oracle: $\hat{\alpha}^{oracle}$, assuming a known support s.
- MMSE: $\hat{\alpha}^{MMSE} = E[\alpha \mid y]$.

- There is a delicate problem due to the mixture of continuous and discrete p.d.f.'s. This is why we estimate the support using MAP first.
- MAP estimation of the representation is then obtained using the oracle's formula.
The Oracle

Given the support s, the posterior of the non-zero coefficients is

  $P(\alpha_s \mid y, s) \propto P(y \mid \alpha_s)\,P(\alpha_s)$

with

  $P(y \mid \alpha_s) \propto \exp\!\left(-\frac{\|y - D_s\alpha_s\|_2^2}{2\sigma_e^2}\right), \qquad
   P(\alpha_s) \propto \exp\!\left(-\frac{\|\alpha_s\|_2^2}{2\sigma_\alpha^2}\right)$

and therefore

  $\hat{\alpha}_s = \left(D_s^T D_s + \frac{\sigma_e^2}{\sigma_\alpha^2}I\right)^{-1} D_s^T y = Q_s^{-1}h_s$

Comments:
- This estimate is both the MAP and the MMSE (for a known support).
- The oracle estimate of x is obtained by multiplication by D_s.
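A sketch of the oracle formula exactly as written above; the helper name and argument conventions are mine:

```python
import numpy as np

def oracle_estimate(D, y, support, sigma_e, sigma_a):
    """Oracle coefficients alpha_s = Q_s^{-1} h_s for a known support (array of atom indices)."""
    support = np.asarray(support, dtype=int)
    alpha = np.zeros(D.shape[1])
    if support.size == 0:
        return alpha
    Ds = D[:, support]
    Qs = Ds.T @ Ds + (sigma_e ** 2 / sigma_a ** 2) * np.eye(support.size)
    hs = Ds.T @ y
    alpha[support] = np.linalg.solve(Qs, hs)   # the oracle x-estimate is then D @ alpha
    return alpha
```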
MAP Estimation

  $\hat{s}^{MAP} = \arg\max_s P(s \mid y) = \arg\max_s \frac{P(y \mid s)\,P(s)}{P(y)},
   \qquad P(y \mid s) = \int P(y \mid s, \alpha_s)\,P(\alpha_s)\,d\alpha_s$

Based on our prior for generating the support,

  $P(s) = C\exp\!\left(b^T s + 0.5\,s^T W s\right)$

and marginalizing over α_s gives

  $P(s \mid y) \propto \left(\frac{\sigma_e^2}{\sigma_\alpha^2}\right)^{|s|/2}
   \exp\!\left(\frac{h_s^T Q_s^{-1} h_s}{2\sigma_e^2} - \frac{\log\det(Q_s)}{2} + b^T s + \frac{1}{2}s^T W s\right)$
The MAP

Taking the logarithm and using |s| = (1ᵀs + K)/2, the MAP estimate becomes

  $\hat{s}^{MAP} = \arg\max_s \left\{\frac{h_s^T Q_s^{-1} h_s}{2\sigma_e^2} - \frac{\log\det(Q_s)}{2}
   + \left(b + \frac{1}{4}\log\frac{\sigma_e^2}{\sigma_\alpha^2}\cdot\mathbf{1}\right)^{T} s + \frac{1}{2}s^T W s\right\}$

Implications:
- The MAP estimator requires testing all the possible supports for the maximization. For the found support, the oracle formula is used.
- In typical problems this process is impossible, as there is a combinatorial set of possibilities.
- In order to use the MAP approach, we shall turn to approximations.
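This log-posterior is the single score that the pursuit algorithms of Part IV will maximize, so it is worth writing out once. A sketch under the reconstruction above (in particular the 1/4·log(σ_e²/σ_α²) bias term is my reading of the slide), with s in {-1,+1}^K:

```python
import numpy as np

def map_score(D, y, s, W, b, sigma_e, sigma_a):
    """log P(s|y) up to an additive constant, for a candidate support s in {-1,+1}^K."""
    b_eff = b + 0.25 * np.log(sigma_e ** 2 / sigma_a ** 2)   # |s|-penalty folded into the bias
    prior_term = b_eff @ s + 0.5 * s @ W @ s
    support = np.flatnonzero(s == 1)
    if support.size == 0:
        return prior_term
    Ds = D[:, support]
    Qs = Ds.T @ Ds + (sigma_e ** 2 / sigma_a ** 2) * np.eye(support.size)
    hs = Ds.T @ y
    data_term = hs @ np.linalg.solve(Qs, hs) / (2 * sigma_e ** 2)
    return data_term - 0.5 * np.linalg.slogdet(Qs)[1] + prior_term
```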
The MMSE

  $\hat{\alpha}^{MMSE} = E[\alpha \mid y] = \sum_s P(s \mid y)\,E[\alpha \mid y, s]$

Here E[α | y, s] is the oracle we have seen before, $\hat{\alpha}_s = Q_s^{-1}h_s$, and P(s|y) is the posterior derived above, so

  $\hat{\alpha}^{MMSE} = \sum_s P(s \mid y)\,\hat{\alpha}_s$
The MMSE – Implications

- The best estimator (in terms of L2 error) is a weighted average of many sparse representations!
- As in the MAP case, in typical problems one cannot compute this expression, as the summation is over a combinatorial set of possibilities. We should propose approximations here as well.
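For small K the MMSE sum can actually be carried out. This sketch enumerates all 2^K supports and reuses the map_score and oracle_estimate helpers sketched earlier (names are mine):

```python
import numpy as np
from itertools import product

def exact_mmse(D, y, W, b, sigma_e, sigma_a):
    """Exact MMSE estimate by enumerating all 2^K supports; feasible only for small K."""
    K = D.shape[1]
    scores, estimates = [], []
    for bits in product([-1.0, 1.0], repeat=K):
        s = np.array(bits)
        scores.append(map_score(D, y, s, W, b, sigma_e, sigma_a))
        estimates.append(oracle_estimate(D, y, np.flatnonzero(s == 1), sigma_e, sigma_a))
    weights = np.exp(np.array(scores) - max(scores))   # proportional to P(s|y), numerically stabilized
    weights /= weights.sum()
    return weights @ np.array(estimates)               # sum_s P(s|y) * alpha_hat_s
```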
Part IV – Construct Practical Pursuit Algorithms
An OMP-Like Algorithm

We approximate the maximization of the MAP score derived above.

The core idea – gather s one element at a time, greedily:
- Initialize all entries with -1.
- Sweep through all entries and check which gives the maximal growth of the expression – set it to +1.
- Proceed this way until the overall expression decreases for any additional +1.
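A sketch of this greedy search, reusing the map_score helper from Part III (not the talk's implementation):

```python
import numpy as np

def bm_omp_like(D, y, W, b, sigma_e, sigma_a):
    """Greedy MAP pursuit: turn one entry of s from -1 to +1 at a time while the posterior grows."""
    K = D.shape[1]
    s = -np.ones(K)
    best = map_score(D, y, s, W, b, sigma_e, sigma_a)
    while True:
        candidates = np.flatnonzero(s == -1)
        if candidates.size == 0:
            return s
        gains = []
        for k in candidates:
            s[k] = 1
            gains.append((map_score(D, y, s, W, b, sigma_e, sigma_a), k))
            s[k] = -1
        score, k = max(gains)
        if score <= best:          # no additional +1 increases the posterior: stop
            return s
        s[k], best = 1, score
```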
A THR-Like Algorithm

Again we work with the MAP score derived above.

The core idea – find the on-support in one shot:
- Initialize all entries with -1.
- Sweep through all entries and check for each the obtained growth of the expression – sort the elements by this growth (in descending order).
- Choose the first |s| elements as +1. Their number should be chosen so as to maximize the MAP expression (by checking 1, 2, 3, ...).
A Subspace-Pursuit-Like Algorithm

The core idea – add and remove entries from s iteratively:
- Operate as in the THR algorithm, and choose the first |s|+Δ entries as +1, where |s| is the cardinality that maximizes the posterior.
- Consider removing each of the |s|+Δ elements, and check the obtained decrease in the posterior – remove the weakest Δ of them.
- Add/remove Δ entries iteratively, similar to [Needell & Tropp '09], [Dai & Milenkovic '09].
An MMSE Approximation Algorithm

Recall that the MMSE estimate is $\hat{\alpha}^{MMSE} = \sum_s P(s \mid y)\,Q_s^{-1}h_s$.

The core idea – draw random supports from the posterior and average their oracle estimates:
- Operate like the OMP-like algorithm, but choose the next +1 by randomly drawing it based on the marginal posteriors.
- Repeat this process several times, and average the oracle solutions of the drawn supports.
- This is an extension of the Random-OMP algorithm [Elad & Yavneh '09].
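A sketch of this randomized variant, again reusing map_score and oracle_estimate from above; the softmax draw over the per-entry gains is my stand-in for "drawing from the marginal posteriors", and the stopping rule mirrors the greedy version:

```python
import numpy as np

def rand_omp_like(D, y, W, b, sigma_e, sigma_a, n_runs=20, seed=0):
    """Approximate MMSE: average the oracle estimates of several randomly grown supports."""
    rng = np.random.default_rng(seed)
    K = D.shape[1]
    accumulated = np.zeros(K)
    for _ in range(n_runs):
        s = -np.ones(K)
        current = map_score(D, y, s, W, b, sigma_e, sigma_a)
        while True:
            off = np.flatnonzero(s == -1)
            if off.size == 0:
                break
            gains = []
            for k in off:
                s[k] = 1
                gains.append(map_score(D, y, s, W, b, sigma_e, sigma_a))
                s[k] = -1
            gains = np.array(gains)
            if gains.max() <= current:                 # the posterior cannot grow any more
                break
            p = np.exp(gains - gains.max())            # draw the next +1 with posterior-like weights
            k = rng.choice(off, p=p / p.sum())
            s[k] = 1
            current = map_score(D, y, s, W, b, sigma_e, sigma_a)
        accumulated += oracle_estimate(D, y, np.flatnonzero(s == 1), sigma_e, sigma_a)
    return accumulated / n_runs
```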
Let's Experiment

The synthetic experiment pipeline:
- Choose W and b; generate Boltzmann Machine supports {s_i} using a Gibbs sampler.
- Choose σ_α; generate representations {α_i} using the supports.
- Choose D; generate signals {x_i} by multiplication by D.
- Choose σ_e; generate measurements {y_i} by adding noise.
- Recover the representations using the Oracle, the OMP-like, the THR-like, and the Rand-OMP algorithms, as well as plain OMP, and compare.
Chosen Parameters

- The Boltzmann parameters b and W, and the dictionary D, are shown in the figures below.
- The signal and noise STDs: σ_α = 60, with σ_e varying in the range [4, 80].

[Figures: the chosen vector b, the matrix W, and the dictionary D.]
Gibbs Sampler for Generating s

- Initialize s randomly in {-1,+1}^K and set k = 1.
- Fix all entries of s apart from s_k, and draw s_k randomly from the marginal

  $P\!\left(s_k \mid s_k^C\right) = \frac{P\!\left(s_k, s_k^C\right)}{\sum_{s_k} P\!\left(s_k, s_k^C\right)}
   = \frac{\exp\!\left(s_k b_k + s_k W_k^T s_k^C\right)}{2\cosh\!\left(b_k + W_k^T s_k^C\right)}$

- Set k = k+1, until k = K; repeat the whole sweep J (=100) times.

[Figures: the obtained vs. the true support-cardinality histograms for K = 5.]
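A sketch of this sampler, assuming diag(W)=0 as before: J full sweeps over the K entries, each entry drawn from the conditional above.

```python
import numpy as np

def gibbs_sample_support(W, b, n_sweeps=100, seed=0):
    """Draw one support vector s in {-1,+1}^K from the Boltzmann Machine P(s)."""
    rng = np.random.default_rng(seed)
    K = len(b)
    s = rng.choice([-1.0, 1.0], size=K)                   # random initialization
    for _ in range(n_sweeps):                             # J (=100) full sweeps
        for k in range(K):
            field = b[k] + W[k] @ s - W[k, k] * s[k]      # b_k + W_k^T s_k^C
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))   # = exp(field) / (2 cosh(field))
            s[k] = 1.0 if rng.random() < p_plus else -1.0
    return s
```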
Results – Representation Error

The relative representation error $E\!\left[\|\hat{\alpha} - \alpha\|_2^2\right] / E\!\left[\|\alpha\|_2^2\right]$ is plotted as a function of σ_e for the Oracle, MAP-OMP, MAP-THR, Rand-OMP, and Plain-OMP.

- As expected, the oracle performs the best.
- The Rand-OMP is the second best, being an MMSE approximation.
- MAP-OMP is better than MAP-THR.
- There is a strange behavior in MAP-THR for weak noise – this should be studied.
- The plain OMP is the worst, with practically useless results.

[Figure: relative representation error vs. σ_e.]
Results – Support Error

The support error $E\!\left[\frac{\|\hat{s} - s\|_2^2}{\max(|\hat{s}|, |s|)}\right]$ is plotted as a function of σ_e for the same algorithms.

- Generally the same story.
- The Rand-OMP result is not sparse! It is compared by choosing its first |s| dominant elements, where |s| is the number of elements chosen by the MAP-OMP. As can be seen, it is not better than the MAP-OMP.

[Figure: support error vs. σ_e.]
The Unitary Case

- What happens when D is square and unitary?
- We know that in such a case, if W = 0 (i.e., no interactions), then MAP and MMSE have closed-form solutions in the form of a shrinkage, $\hat{\alpha} = \rho\!\left(D^T y\right)$ [Turek, Yavneh, Protter, Elad '10].
- Can we hope for something similar for the more general W? The answer is (partly) positive!

[Figure: the MAP and MMSE shrinkage curves.]
The Posterior Again

Recall the posterior

  $P(s \mid y) \propto \left(\frac{\sigma_e^2}{\sigma_\alpha^2}\right)^{|s|/2}
   \exp\!\left(\frac{h_s^T Q_s^{-1} h_s}{2\sigma_e^2} - \frac{\log\det(Q_s)}{2} + b^T s + \frac{1}{2}s^T W s\right)$

When D is unitary, $D_s^T D_s = I$, so

  $Q_s = D_s^T D_s + \frac{\sigma_e^2}{\sigma_\alpha^2}I = \frac{\sigma_\alpha^2 + \sigma_e^2}{\sigma_\alpha^2}I, \qquad h_s = D_s^T y$

and the posterior collapses to a Boltzmann Machine again:

  $P(s \mid y) \propto \exp\!\left(q^T s + \frac{1}{2}s^T W s\right), \qquad
   q = b + \frac{\sigma_\alpha^2}{4\sigma_e^2(\sigma_\alpha^2 + \sigma_e^2)}\,\big|D^T y\big|^2
       - \frac{1}{4}\log\!\left(1 + \frac{\sigma_\alpha^2}{\sigma_e^2}\right)\mathbf{1}$

  (|Dᵀy|² is taken element-wise.)

When D is unitary, the BM prior and the resulting posterior are conjugate distributions.
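A sketch of the unitary-case potential; the 1/4 constants follow the reconstruction above and should be read as my interpretation of the slide:

```python
import numpy as np

def unitary_bm_potential(D, y, b, sigma_e, sigma_a):
    """For unitary D, P(s|y) is proportional to exp(q^T s + 0.5 s^T W s); this returns the vector q."""
    beta = D.T @ y                                        # all atom/signal inner products at once
    se2, sa2 = sigma_e ** 2, sigma_a ** 2
    return (b
            + sa2 * beta ** 2 / (4 * se2 * (sa2 + se2))   # data-driven excitation per atom
            - 0.25 * np.log(1 + sa2 / se2))               # constant sparsity-promoting bias
```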
Implications

The MAP estimation becomes the following Boolean quadratic problem (NP-hard!):

  $\hat{s} = \arg\max_{s:\ s_k^2 = 1} P(s \mid y) = \arg\max_{s:\ s_k^2 = 1} \exp\!\left(q^T s + \frac{1}{2}s^T W s\right)$

Three ways forward:
- Approximate it, e.g. as in multi-user detection in communication [Verdu, 1998].
- Assume W ≥ 0: graph-cuts can be used for an exact solution [Cevher, Duarte, Hedge & Baraniuk 2009].
- Assume a banded W: our work – belief propagation for an exact solution with O(K·2^L) complexity.

What about exact MMSE? It is possible too! We are working on it NOW.
Results

- W (of size 64×64) in this experiment has 11 diagonals.
- All algorithms behave as expected.
- Interestingly, the exact-MAP outperforms the approximated MMSE algorithm.
- The plain OMP (which would have been the ideal MAP for the iid case) performs VERY poorly.

[Figure: relative representation error $E\!\left[\|\hat{\alpha} - \alpha\|_2^2\right] / E\!\left[\|\alpha\|_2^2\right]$ vs. σ_e for the Oracle, MAP-OMP, Exact-MAP, Rand-OMP, and Plain-OMP.]
Results

- As before, the support error tells the same story.

[Figure: support error $E\!\left[\frac{\|\hat{s} - s\|_2^2}{\max(|\hat{s}|, |s|)}\right]$ vs. σ_e for the Oracle, MAP-OMP, Exact-MAP, Rand-OMP, and Plain-OMP.]
Part V – First Steps Towards Adaptive Recovery
The Grand-Challenge

Given the measurements {y_i}_{i=1}^N, estimate the model parameters and the representations. A natural block-coordinate strategy alternates between:

- Assume (W, b, D, σ_α, σ_e) are known, and estimate the representations {α_i}_{i=1}^N.
- Assume the supports {s_i}_{i=1}^N (taken from {α_i}) are known, and estimate the BM parameters (W, b).
- Assume {α_i} and (W, b, σ's) are known, and estimate the dictionary D.
- Assume {α_i} are known, and estimate the signal STDs.
Maximum-Likelihood?

The ML estimate of the BM parameters from the supports {s_i}_{i=1}^N is

  $\hat{W},\hat{b} = \arg\max_{W,b} P\!\left(\{s_i\}_{i=1}^N \mid W, b\right)
   = \arg\max_{W,b} \prod_{i=1}^N P(s_i \mid W, b)
   = \arg\max_{W,b} \frac{1}{Z(W,b)^N}\prod_{i=1}^N \exp\!\left(b^T s_i + \frac{1}{2}s_i^T W s_i\right)$

Problem!!! We do not have access to the partition function Z(W, b).
Solution: Maximum-Pseudo-Likelihood

Instead of the likelihood $P(s \mid W, b)$, use the pseudo-likelihood $\prod_{k=1}^{K} P\!\left(s_k \mid W, b, s_k^C\right)$, with the conditionals

  $P\!\left(s_k \mid s_k^C\right) = \frac{\exp\!\left(s_k b_k + s_k W_k^T s_k^C\right)}{2\cosh\!\left(b_k + W_k^T s_k^C\right)}$

The MPL estimate of the BM parameters is then

  $\hat{W},\hat{b} = \arg\max_{W,b}\left\{\mathrm{tr}\!\left(S^T W S\right) + \mathbf{1}^T S^T b
   - \mathbf{1}^T \varphi\!\left(W S + b\,\mathbf{1}^T\right)\mathbf{1}\right\}$

- The function φ(·) is log(cosh(·)). This is a convex programming task.
- MPL is known to be a consistent estimator.
- We solve this problem using SESOP+SD.
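A sketch of the (negative) MPL objective above; the talk optimizes it with SESOP+SD, but any generic smooth optimizer over the free entries of W and b could stand in for a quick experiment. Here S is the K×N matrix of training supports in {-1,+1}, and diag(W)=0 is assumed:

```python
import numpy as np

def neg_log_pseudo_likelihood(W, b, S):
    """Negative of the MPL objective: -(tr(S^T W S) + 1^T S^T b - 1^T phi(W S + b 1^T) 1)."""
    F = W @ S + b[:, None]                           # fields b_k + W_k^T s_i for every (k, i) pair
    log_cosh = np.logaddexp(F, -F) - np.log(2.0)     # phi(F) = log(cosh(F)), computed stably
    return -(np.sum(S * F) - np.sum(log_cosh))
```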
Simple Experiment & Results

We perform the following experiment:
- W is 64×64 and banded, with L = 3, and all its non-zero values fixed (0.2).
- b is a constant vector (-0.1).
- 10,000 supports are created from this BM distribution – the average cardinality is 16.5 (out of 64).
- We employ 50 iterations of MPL-SESOP to estimate the BM parameters.
- Initialization: W = 0, and b is set by the scalar probabilities of each entry.
Simple Experiment & Results

[Figures: the estimation errors $\|\hat{b} - b\|_2^2 / \|b\|_2^2$ and $\|\hat{W} - W\|_2^2 / \|W\|_2^2$, and the MPL objective value, as functions of the iteration count (50 iterations).]
Testing on Image Patches

The adaptive recovery loop, applied to image patches of size 8×8 pixels with a unitary DCT dictionary D (2 rounds):
- Assume (W, b, D, σ's) are known, and estimate the representations {α_i}_{i=1}^N using Exact-MAP.
- Assume the supports {s_i}_{i=1}^N are known, and estimate the BM parameters (W, b) using MPL-SESOP.*
- Assume {α_i} are known, and estimate the signal STDs using simple ML.

Denoising performance? This experiment was done for patches from 5 images (Lenna, Barbara, House, Boat, Peppers), with σ_e varying in the range [2, 25]. The results show an improvement of ~0.9 dB over plain OMP.

* We assume that W is banded, and estimate it as such by (greedily) searching for the best permutation.
Part VI – Summary and Conclusion
Today We Have Seen that ...

Sparsity and redundancy are important ideas that can be used in designing better tools in signal/image processing.

Is it enough? No! We know that the found representations exhibit dependencies. Taking these into account could further improve the model and its behavior.

In this work we used the Boltzmann Machine and proposed:
- A general set of pursuit methods.
- Exact MAP pursuit for the unitary case.
- Estimation of the BM parameters.

What next? We intend to explore:
- Other pursuit methods.
- Theoretical study of these algorithms.
- K-SVD + this model.
- Applications ...