Overcomplete dictionaries

Over-complete Representations
for Signals/Images
IT 530, Lecture Notes
Introduction: Complete and overcomplete bases
• Signals are often represented as a linear combination of
basis functions (e.g. Fourier or wavelet representation).
• The basis functions always have the same dimensionality as
the (discrete) signals they represent.
• The number of basis vectors is traditionally the same as the
dimensionality of the signals they represent.
• These bases may be orthonormal (Fourier, wavelet, PCA) or
may not be orthonormal (non-orthonormalized ICA).
Introduction: Complete and overcomplete bases
• A more general representation for signals uses
so called “over-complete bases”, where the
number of basis functions is MORE than the
dimensionality of the signals.
• Complete and over-complete bases:
  Complete:       x = As, \quad x \in \mathbb{R}^n, \; A \in \mathbb{R}^{n \times n}, \; s \in \mathbb{R}^n
  Over-complete:  x = As, \quad x \in \mathbb{R}^n, \; A \in \mathbb{R}^{n \times m}, \; s \in \mathbb{R}^m \; (m > n)
Introduction: Construction of overcomplete bases
• Over-complete bases can be created by taking the union of
multiple complete bases.
• Example 1: A signal with n values can be
represented using a union of the n x n Fourier and the n x
n Haar wavelet bases, yielding an n x 2n basis
matrix.
• Example 2: A signal with n values can be
represented by adding sinusoids of more
frequencies to an existing Fourier basis matrix
with n vectors.
Introduction: uniqueness?
• With complete bases, the representation of
the signal is always unique.
• Example: Signals are uniquely defined by their
wavelet or Fourier transforms; likewise, the eigen-coefficients
of any signal (given a PCA basis) are uniquely defined.
• This uniqueness is LOST with over-complete
bases.
Introduction: compactness!
• Advantage: over-complete bases afford much
greater compactness in signal representation.
• Example: Consider two types of audio-signals
– whistles and claps. Signals of either type can
be represented in a complete Fourier or
wavelet basis (power-law of compressibility
will apply).
• BUT: imagine two complete bases respectively
learned for whistles and claps – B1 and B2.
Introduction: compactness!
• Suppose B1 and B2 are such that a whistle
(resp. clap) signal will likely have a compact
representation in the whistle (resp. clap) basis
and not in the other one.
• A whistle+clap signal will NOT have a compact
representation in either basis - B1 or B2 !
• But the whistle+clap signal WILL have a
compact representation in the overcomplete
basis B = [B1 B2].
More problems
• Since a signal can have many representations in
an over-complete basis, which one do we pick?
• Pick the sparsest one, i.e. the one with the least
number of non-zero elements that either
perfectly reconstructs the signal or reconstructs
the signal up to some error.
• Finding the sparsest representation of a signal in
an over-complete basis is an NP-hard problem
(the best known exact algorithms for this task have
exponential time complexity).
More problems
• In other words, the following problem is NP-hard:
  x = As, \quad x \in \mathbb{R}^n, \; A \in \mathbb{R}^{n \times m}, \; s \in \mathbb{R}^m \; (m > n)
  s^* = \arg\min_s \|s\|_0 \; \text{subject to} \; x = As
Solving (?) those problems
• The NP-hard problem has several approximation
methods – basis pursuit (BP), matching
pursuit (MP), orthogonal matching pursuit
(OMP) and many others.
• None of them is guaranteed to give the sparsest
solution – but they will (under different
conditions) yield a solution that is sparse
enough.
Bayesian approach
• Consider
  x = As + \eta, \quad x \in \mathbb{R}^n, \; A \in \mathbb{R}^{n \times m}, \; s \in \mathbb{R}^m \; (m > n)
• Assume a suitable prior P(s) on s.
• Assume Gaussian noise, so that
  -\log p(x \mid A, s) \propto \frac{\|x - As\|_2^2}{2\sigma^2}
• The MAP estimate is then
  s^* = \arg\max_s \; p(x \mid A, s) \, p(s)
Bayesian approach
• For zero noise and a Gaussian prior on s, the
solution for s is obtained by solving:
  \min_s \|s\|_2 \; \text{subject to} \; x = As
  \Rightarrow s^* = A^T (A A^T)^{-1} x = A^\dagger x \quad \text{(pseudo-inverse)}
• For zero noise and a Laplacian prior on s, the
solution for s is obtained by solving:
  \min_s \|s\|_1 \; \text{subject to} \; x = As \quad \text{(Basis Pursuit)}
  \Rightarrow no closed-form solution; solve it using linear programming.
Linear programming: canonical form
  \min_z \; c^T z   (linear objective / energy function)
  \text{subject to} \; Gz = b, \; z \ge 0   (linear equality and inequality constraints)
Linear programming for Basis Pursuit
  \min_s \|s\|_1 \; \text{subject to} \; x = As
  \Rightarrow no closed-form solution; use linear programming.
Write s = u - v, where
  u = vector of the same size as s, with the negative elements set to 0 and the
      positive elements the same as in s;
  v = vector of the same size as s, with the non-negative elements set to 0 and the
      negative elements multiplied by -1.
The problem then becomes the linear program
  \min_{u,v} \|[u; v]\|_1 \; \text{subject to} \; [A, -A][u; v] = x \;\text{ and }\; u \ge 0, \; v \ge 0
Linear programming problems can be solved in
polynomial time! There are various algorithms: the
simplex method (worst-case exponential), interior-point
methods and the ellipsoid algorithm (the latter two are
polynomial-time). A minimal code sketch of this reformulation follows.
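To make the reformulation concrete, here is a minimal sketch (not from the original notes) that solves the basis-pursuit linear program with NumPy and SciPy's linprog; the function name basis_pursuit_lp and the choice of the HiGHS solver are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit_lp(A, x):
    """min ||s||_1 s.t. x = A s, via the split s = u - v with u, v >= 0."""
    n, m = A.shape
    c = np.ones(2 * m)               # objective: sum(u) + sum(v) = ||s||_1
    A_eq = np.hstack([A, -A])        # equality constraint [A, -A][u; v] = x
    res = linprog(c, A_eq=A_eq, b_eq=x, bounds=(0, None), method="highs")
    uv = res.x
    return uv[:m] - uv[m:]           # recover s = u - v

# usage (illustrative): s_hat = basis_pursuit_lp(A, x), with A of size n x m (m > n)
```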
L1 norm and L0 norm
• There is a special relationship between the
following two problems (which we will study
in compressive sensing later on):
  s^* = \arg\min_s \|s\|_0 \; \text{subject to} \; x = As
  s^* = \arg\min_s \|s\|_1 \; \text{subject to} \; x = As
  where, in both cases, x \in \mathbb{R}^n, \; A \in \mathbb{R}^{n \times m}, \; s \in \mathbb{R}^m \; (m > n).
The L1 norm is a "softer" version of the L0 norm. Other Lp-norms where 0 < p < 1
are possible and impose a stronger form of sparsity, but they lead to non-convex
problems. Hence L1 is preferred.
  \|x\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}
Matching Pursuit
• One of the simplest approximation algorithms
to obtain the coefficients s of a signal y in an
over-complete basis A.
• Developed by Mallat and Zhang in 1993 (ref: S. G.
Mallat and Z. Zhang, Matching Pursuits with Time-Frequency Dictionaries,
IEEE Transactions on Signal Processing, December 1993)
• Based on successively choosing the vector in
A which has maximal inner product with a so-called
residual vector (initialized to y at the start).
Pseudo-code
  r^(0) = y;  s = 0;  i = 0
  while ( ||r^(i)||^2 > ε )
  {
      j = argmax_l | r^(i)T a_l | / ||a_l||      // "j" or "l" is an index for dictionary columns
      s_j = r^(i)T a_j
      r^(i+1) = r^(i) - s_j a_j
      i = i + 1
  }
  OUTPUT: {s_j}
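A minimal NumPy sketch of this pseudo-code (not from the original notes), assuming the columns of A have already been normalized to unit norm; the coefficient of an atom is accumulated because MP may select the same atom more than once.

```python
import numpy as np

def matching_pursuit(A, y, eps=1e-6, max_iter=1000):
    """Greedy matching pursuit, assuming unit-norm dictionary columns."""
    n, m = A.shape
    s = np.zeros(m)
    r = y.astype(float).copy()
    for _ in range(max_iter):
        if r @ r <= eps:                  # stopping criterion ||r||^2 <= eps
            break
        corr = A.T @ r                    # inner products r^T a_l
        j = int(np.argmax(np.abs(corr)))  # best-matching atom
        s[j] += corr[j]                   # accumulate (MP may re-select an atom)
        r = r - corr[j] * A[:, j]         # remove that atom's contribution from the residual
    return s
```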
Properties of matching pursuit
• The reconstruction error \|r^{(i)}\|^2 is always
guaranteed to decrease, and the decrease is at an
exponential rate.
• At any iteration, the following relationship
holds true:
  \|y\|^2 = \sum_{j=0}^{i-1} s_j^2 + \|r^{(i)}\|^2
Orthogonal Matching Pursuit (OMP)
• A more sophisticated algorithm than matching
pursuit (MP).
• The signal is approximated by successive
projection onto those dictionary columns (i.e.
columns of A) that are associated with a
current “support set”.
• The support set is also successively updated.
Pseudo-code
  r^(0) = y;  s = 0;  T^(0) = ∅;  i = 0          // T^(i) is the current support set
  while ( ||r^(i)||^2 > ε )
  {
      (1)  j = argmax_l | r^(i)T a_l | / ||a_l||
      (2)  T^(i) = T^(i-1) ∪ {j}
      (3)  s_T^(i) = argmin_{s_T^(i)} || y - A_T^(i) s_T^(i) ||^2 = A_T^(i)^† y
           // A_T^(i) is the sub-matrix containing only those columns which lie in the support set;
           // several coefficients are re-computed in each iteration
      (4)  r^(i+1) = y - A s;  i = i + 1
  }
  OUTPUT: {s_j, a_j}
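A minimal NumPy sketch of the OMP pseudo-code above (again assuming unit-norm columns); the least-squares call implements the pseudo-inverse step s_T = A_T^† y.

```python
import numpy as np

def omp(A, y, eps=1e-6, max_iter=None):
    """Orthogonal matching pursuit: grow a support set and re-fit all its coefficients."""
    n, m = A.shape
    max_iter = max_iter if max_iter is not None else m
    support = []
    s = np.zeros(m)
    r = y.astype(float).copy()
    for _ in range(max_iter):
        if r @ r <= eps:
            break
        j = int(np.argmax(np.abs(A.T @ r)))               # step (1): best-matching atom
        if j not in support:
            support.append(j)                             # step (2): grow the support set
        A_T = A[:, support]
        coeffs, *_ = np.linalg.lstsq(A_T, y, rcond=None)  # step (3): s_T = pinv(A_T) @ y
        s = np.zeros(m)
        s[support] = coeffs
        r = y - A_T @ coeffs                              # step (4): new residual
    return s
```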
OMP versus MP
• Unlike MP, OMP never re-selects any element.
• Unlike MP, in OMP, the residual at an iteration is always orthogonal
to all currently selected elements.
• OMP is costlier per iteration (due to the pseudo-inverse computation) but generally
more accurate than MP.
• Unlike MP, OMP converges in at most K iterations for a dictionary with K
elements.
• OMP always gives the optimal approximation w.r.t. the selected
subset of the dictionary (note: this does not mean that the selected
subset itself was optimal).
OMP and MP for noisy signals
• It is trivial to extend OMP and MP for noisy
signals.
• The stopping criterion is a small residual
magnitude ε (not zero).
BP under noise
y  x  z , z ~ N (0,1)
x  As
s  arg mins
A quadratic programming
problem that is
structurally similar to a
linear program
1
2
y  As   s 1
2
minw, p cT w 
p
2
2
subject to
[ A; A]w  p  b; w  0;   1
where
c  [1;1],s  u - v, w  [u;-v]
Learning the bases
• So far we assumed that the basis (i.e. A) was
fixed, and optimized for the sparse
representation.
• Now, we need to learn A as well!
• We’ve learned about PCA, ICA. But they don’t
always give the best representation!
(Figure: basis directions recovered by PCA, by ICA, and by an over-complete dictionary.)
Learning the bases: analogy with K-means
• In K-means, we start with a bunch of K cluster
centers and assign each point in the dataset to
the nearest cluster center.
• The cluster centers are re-computed by taking
the mean of all points assigned to a cluster.
• The assignment and cluster-center
computation problems are iterated until a
convergence criterion is met.
Learning the bases: analogy with K-means
• K-means is a special sparse coding problem
where each point is represented by strictly
one of K dictionary elements.
• Our dictionary (or bases) learning problem is
more complex: we are trying to express each
point as a linear combination of a subset of
dictionary elements (or a sparse linear
combination of dictionary elements).
Learning the Bases!
• Find the model (i.e. over-complete basis) A for which the
likelihood of the data is maximized:
  A^* = \arg\max_A \prod_{k=1}^N P(x_k \mid A), \quad \text{where} \; P(x_k \mid A) = \int P(s) \, P(x_k \mid A, s) \, ds
• The above integral is not available in closed form for most
priors on s (e.g. a Laplacian prior – intractable).
• Approximation (Method 1): Assume that the volume of
the pdf is concentrated around the mode (w.r.t. s).
Ref: Olshausen and Field, “Natural image statistics and efficient coding”
Learning the bases: Method 1
  \int P(s) \, P(x_k \mid A, s) \, ds \; \approx \; P(s_k^*) \, P(x_k \mid A, s_k^*)

  A^* = \arg\max_A \prod_{k=1}^N P(s_k^*) \, P(x_k \mid A, s_k^*)
      = \arg\min_A \sum_{k=1}^N \Big( \lambda \|s_k^*\|_1 + \|x_k - A s_k^*\|_2^2 \Big)
(the first term comes from -\log P(s_k^*) for a Laplacian prior and does not depend on A)

Gradient descent:
  A^{(t+1)} = A^{(t)} - \eta \sum_{k=1}^N \big( A^{(t)} s_k^* - x_k \big) s_k^{*T}
Two-step iterative procedure
• Fix the basis A and obtain the sparse coefficients
for each signal using MP, OMP or BP (some
papers – like the one by Olshausen and Field - use
gradient descent for this step!).
• Now fix the coefficients, and update the basis
vectors (using various techniques, one of which
was described on the previous slide).
• Normalize each basis vector to unit norm.
• Repeat the previous two steps until some error
criterion is met (a minimal sketch of this loop is given below).
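A minimal sketch of this two-step loop (an illustrative assumption, not Olshausen and Field's exact implementation): it reuses the omp sketch from earlier for the sparse coding step and the gradient-descent dictionary update of Method 1; the initialization, step size and iteration counts are arbitrary choices.

```python
import numpy as np

def learn_dictionary_gd(X, K, n_iters=50, eta=1e-2, eps=1e-6):
    """Alternate sparse coding (omp from the earlier sketch) with a gradient step on A.
    X is n x N with one training signal per column; K is the number of dictionary atoms."""
    n, N = X.shape
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, K))
    A /= np.linalg.norm(A, axis=0) + 1e-12            # unit-norm initial atoms
    for _ in range(n_iters):
        S = np.column_stack([omp(A, X[:, k], eps=eps) for k in range(N)])  # fix A, code
        A -= eta * (A @ S - X) @ S.T                  # gradient step on sum_k ||x_k - A s_k||^2
        A /= np.linalg.norm(A, axis=0) + 1e-12        # re-normalize each basis vector
    return A
```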
Toy Experiment 1
Result of basis learning
(dictionary with 144 elements)
with sparsity constraints on the
codes. Training performed on
12 x 12 patches extracted from
natural images.
Ref: Olshausen and Field, “Natural image statistics and efficient coding”
Toy Experiment 2
• Data-points generated as a (Laplacian/super-Laplacian) random linear combination of some
arbitrarily chosen basis vectors.
• In the no-noise situation, the aim is to extract
the basis vectors and the coefficients of the
linear combination.
• The fitting results are shown on the next slide.
The true and estimated directions agree quite
well.
Ref: Lewicki and Sejnowski, “Learning overcomplete representations”
Learning the Bases: Method of
Optimal Directions (MOD) - Method 2
• Given a fixed dictionary A, assume sparse
codes for every signal are computed using
OMP, MP etc.
• The overall error is now given as
  E(A) = \sum_{k=1}^N \| y_k - A s_k \|_2^2 = \| Y - AS \|_F^2
• We want to find dictionary A that minimizes
this error.
Ref: Engan et al, “Method of optimal directions for frame design”
Learning the Bases: Method of
Optimal Directions (MOD) - Method 2
• Take the derivative of E(A) w.r.t. A and set it to 0.
This gives us the following update:
  (Y - AS) S^T = 0 \;\Rightarrow\; A^{(t+1)} = Y S^{(t)T} \big( S^{(t)} S^{(t)T} \big)^{-1}
• Following the update of A, each column in A is
independently rescaled to unit norm.
• The updates of A and S alternate with each other
till some convergence criterion is reached.
• This method is more efficient than the one by
Olshausen and Field.
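A short sketch of the MOD update (the use of a pseudo-inverse to guard against a singular S S^T is my own implementation choice):

```python
import numpy as np

def mod_update(Y, S):
    """MOD dictionary update: A = Y S^T (S S^T)^{-1}, then rescale each column to unit norm."""
    A = Y @ S.T @ np.linalg.pinv(S @ S.T)   # pinv in case S S^T is singular
    A /= np.linalg.norm(A, axis=0) + 1e-12
    return A

# In the full algorithm this alternates with a sparse coding step (OMP, MP, ...) on the columns of Y.
```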
Learning the Bases: Method 3- Union
of Orthonormal Bases
• Like before, we represent a signal in the
following way:
  X = AS + \eta
  (A, S) = \arg\min_{A, S} \| X - AS \|_2^2 + \lambda \|S\|_1
• A is an over-complete dictionary, but let us
assume that it is a union of ortho-normal
bases, in the form
A  [ A1 | A2 | ...| AM ]
i,1  i  M , Ai A  I
T
i
Learning the Bases: Method 3- Union
of Ortho-normal Bases
• The coefficient matrix S can now be written as
follows (M subsets, each corresponding to a
single orthonormal basis):
S  [S1 | S2 | ... | SM ]
• Assuming we have fixed bases stored in A, the
coefficients in S can be estimated using block
coordinate descent, described on the
following slide.
Learning the Bases: Method 3- Union
of Ortho-normal Bases
for t  1 : T {
t  0 (1  t / T );
UpdateS using BCR with parametert
}
BCR( t , S ){
for m  1 : M {
X m  X   Aj S j
There is a quick way of performing this
optimization given an ortho-normal basis –
SOFT THRESHOLDING (could be replaced by
hard thresholding if you had a stronger
sparseness prior than an L1 norm
jm
UpdateS m as follows: S m  arg minS * X m  Am S *  t S * 1
2
}}
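A sketch of the coefficient update for one orthonormal block, assuming the standard soft-thresholding solution of the L1-penalized least-squares problem (with threshold λ_t/2 for this squared-error form):

```python
import numpy as np

def soft_threshold(Z, tau):
    """Entrywise soft thresholding: shrink each value toward zero by tau."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def bcr_coefficient_update(X_m, A_m, lam_t):
    """For an orthonormal A_m, argmin_S ||X_m - A_m S||_F^2 + lam_t ||S||_1
    is soft thresholding of A_m^T X_m."""
    return soft_threshold(A_m.T @ X_m, lam_t / 2.0)
```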
Learning the Bases: Method 3- Union
of Ortho-normal Bases
• Given the coefficients, we now want to update
the dictionary which is done as follows:
for m  1 : M
{
X m  X   Aj S j ;
jm
S m X  UV ;
T
m
Am  VU T ;
}
T
Why are we doing this? It is
related to the so-called
orthogonal Procrustes
problem - a well-known
application of SVD. We will see
this on the next slide.
The specific problem we are solving is given below. Note
that it cannot be solved using a pseudo-inverse, as that
would not impose the orthonormality constraint when there is
noise in the data or when the coefficients are perturbed or
thresholded.
  \min_{A_m} \| X_m - A_m S_m \|^2 \;\; \text{s.t.} \;\; A_m A_m^T = I

In generic notation:
  A^* = \arg\min_A \| X - AS \|^2 \;\; \text{s.t.} \;\; A^T A = I

  \min_A \| X - AS \|^2 = \min_A \mathrm{trace}\big( (X - AS)^T (X - AS) \big)
                        = \min_A \mathrm{trace}\big( X^T X - 2 X^T A S + S^T S \big)
                        = \max_A \mathrm{trace}\big( X^T A S \big)
                        = \max_A \mathrm{trace}\big( A S X^T \big) \qquad [\;\mathrm{trace}(FG) = \mathrm{trace}(GF)\;]

Let S X^T = Q; the SVD of Q gives Q = U D V^T. Then
  \max_A \mathrm{trace}\big( A U D V^T \big) = \max_A \mathrm{trace}\big( V^T A U D \big)
                                             = \max_A \mathrm{trace}\big( Z(A) D \big), \quad \text{where } Z(A) = V^T A U
                                             = \max_A \sum_i z_{ii} d_{ii} \;\le\; \sum_i d_{ii} \qquad (\text{since } Z(A)^T Z(A) = I)

The maximum is achieved for Z(A) = I, i.e.
  V^T A U = I \;\Rightarrow\; A = V U^T
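A sketch of the resulting dictionary update A_m = V U^T, computed from the SVD of S_m X_m^T exactly as in the derivation above:

```python
import numpy as np

def procrustes_update(X_m, S_m):
    """argmin_A ||X_m - A S_m||_F^2 s.t. A^T A = I, via the SVD of Q = S_m X_m^T."""
    U, _, Vt = np.linalg.svd(S_m @ X_m.T)   # Q = U D V^T
    return Vt.T @ U.T                       # A = V U^T
```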
Learning the Bases: Method 3- Union
of Ortho-normal Bases
• Keeping all bases in A fixed, update the
coefficients in S using a known sparse coding
technique.
• Keeping the coefficients in S fixed, update the
bases in A using the aforementioned SVD-based method.
• Repeat the above two steps until a
convergence criterion is reached.
Learning the Bases: Method 4 – K-SVD
• Recall: we want to learn a dictionary and sparse
codes on that dictionary given some data-points:
  \min_{A, S} \| Y - AS \|_F^2 \;\; \text{subject to} \;\; \forall i, \; \| s_i \|_0 \le T_0
• Starting with a fixed dictionary, sparse coding
follows as usual – OMP, BP, MP etc. The stopping criterion
could be based on the reconstruction error or the L0-norm of the sparse codes.
• The dictionary is updated one column at a time.
  \| Y - AS \|_F^2 = \Big\| Y - \sum_{j=1}^K a_j s_T^j \Big\|_F^2 = \Big\| \Big( Y - \sum_{j \ne k} a_j s_T^j \Big) - a_k s_T^k \Big\|_F^2 = \| E_k - a_k s_T^k \|_F^2

Here s_T^j denotes row 'j' (NOT column 'j') of the matrix S, and E_k = Y - \sum_{j \ne k} a_j s_T^j does
NOT depend on the k-th dictionary column.

How do we find a_k, given the above expression? We have
decomposed the original error matrix, i.e. Y - AS, into a sum of
rank-1 matrices, out of which only the last term depends on a_k. So
we are trying to find a rank-1 approximation for E_k, and this can
be done by computing the SVD of E_k and using the singular
vectors corresponding to the largest singular value:

  E_k = U \Delta V^T, \quad a_k = U_1, \quad s_T^k = \Delta(1,1) \, V_1^T

Problem! The dictionary codes may no longer be sparse! SVD does
not have any built-in sparsity constraint! So we proceed as
follows. Define

  \omega_k = \{ \, i \mid s_T^k(i) \ne 0 \, \}

i.e. the indices of those columns of Y (i.e. only those data-points) that
actually USE the k-th dictionary atom. Restricting the error matrix to these columns
effectively yields a smaller matrix E_k^R, of size n by |\omega_k|; likewise s_R^k is the
row s_T^k restricted to \omega_k. We now minimize

  \| E_k^R - a_k s_R^k \|_F^2

and again take the rank-1 SVD solution:

  E_k^R = U \Delta V^T, \quad a_k = U_1, \quad s_R^k = \Delta(1,1) \, V_1^T

This update affects the sparse codes of only those data-points
that used the k-th dictionary element.
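A sketch of the K-SVD update for a single atom, following the restricted-SVD derivation above; here Y is n x N (one signal per column), A is n x K and S is K x N.

```python
import numpy as np

def ksvd_atom_update(Y, A, S, k):
    """Update atom k of A and row k of S using a rank-1 SVD of the restricted error matrix."""
    omega = np.nonzero(S[k, :])[0]                 # signals whose sparse code uses atom k
    if omega.size == 0:
        return A, S                                # unused atom: leave it (or re-initialize it)
    E = Y - A @ S + np.outer(A[:, k], S[k, :])     # error matrix ignoring atom k's contribution
    E_R = E[:, omega]                              # keep only the columns in the support
    U, sig, Vt = np.linalg.svd(E_R, full_matrices=False)
    A[:, k] = U[:, 0]                              # new atom: first left singular vector (unit norm)
    S[k, omega] = sig[0] * Vt[0, :]                # new coefficients, only for those signals
    return A, S
```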
KSVD and K-means
• Limit the sparsity factor to T0 = 1.
• Enforce all the sparse codes to be either 0 or
1.
• Then you get the K-means algorithm!
Implementation issues
• KSVD is a popular and effective method. But some implementation
issues haunt K-SVD (life is never easy).
• KSVD is susceptible to local minima and over-fitting if K is too large
(just as you can get meaningless clusters if your number of clusters
is too large, or meaningless densities if the number of
histogram bins is too large).
• KSVD convergence is not fully guaranteed. The dictionary update
given fixed sparse codes ensures that the error decreases. However the
sparse coding step given a fixed dictionary may not decrease the error – it
is affected by the behaviour of the sparse coding approximation
algorithms.
• You can speed up the algorithm by removing extremely
"unpopular" dictionary elements, or removing duplicate (or near-duplicate)
columns of the dictionary.
(Many) Applications of KSVD
• Image denoising
• Image inpainting
• Image deblurring
• Blind compressive sensing
• Classification
• Compression
• And you can work on many more
Application: Image Compression
(Training Phase)
• Training set for dictionary learning: a set of 11000
patches of size 8 x 8 – taken from a face image
database. Dictionary size K = 441 atoms
(elements).
• OMP used in the sparse coding step during
training – stopping criterion is a fixed number of
coefficients T0 = 10.
• Over-complete Haar and DCT dictionaries – of
size 64 x 441 – and an ortho-normal DCT basis of size
64 x 64 (as in JPEG) were also used for comparison.
Application: Image Compression
(Testing Phase)
• A lossy image compression algorithm is evaluated
using a rate-distortion curve – the X-axis contains the
average number of bits to store the signal, and the Y-axis is the associated error or PSNR. Normally, the
acceptable error is fixed and the number of bits is
calculated.
• The test image is divided into non-overlapping
patches of size 8 x 8.
• Each patch is projected onto the trained
dictionary and its sparse code is obtained using
OMP given a fixed error e.
Application: Image Compression
(Testing Phase)
• The encoded image contains the following:
1. Sparse codes for each patch and the indices
of each coefficient (in the dictionary).
2. The number of coefficients used to represent
each patch (different patches will need
different numbers of coefficients).
Application: Image Compression
(Testing Phase)
• The average number of bits per pixel (RPP) is
calculated as:
  \text{RPP} = \frac{ a \cdot \#\text{patches} \; + \; \sum_{\text{patches}} \#\text{coeffs} \cdot (b + Q) }{ \#\text{pixels} }
where
  \#\text{coeffs} = the total number of coefficients representing each patch,
  a = number of bits required to store the number of coefficients for each patch,
  b = number of bits required to store the dictionary index for each coefficient,
  Q = number of bits required to code each coefficient (quantization level).
(Huffman encoding can be used when storing these quantities.)
Even if the error ‘e’ for the OMP was fixed, we need to compute the total
MSE between the true and the compressed image. This is due to effects of
quantization while storing the sparse coefficient values for each patch.
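A small illustrative helper for evaluating the RPP formula; the function name and the example bit budgets (7-bit coefficient counts, 9-bit indices into 441 atoms, 8-bit quantized values) are hypothetical choices, not values from the notes.

```python
def bits_per_pixel(num_patches, coeffs_per_patch, a_bits, b_bits, q_bits, num_pixels):
    """RPP = (a * #patches + sum over patches of #coeffs * (b + Q)) / #pixels."""
    total_bits = a_bits * num_patches + sum(c * (b_bits + q_bits) for c in coeffs_per_patch)
    return total_bits / num_pixels

# Example: a 256 x 256 image cut into 1024 non-overlapping 8 x 8 patches,
# roughly 10 coefficients per patch:
# bits_per_pixel(1024, [10] * 1024, a_bits=7, b_bits=9, q_bits=8, num_pixels=256 * 256)  # ~2.77 bpp
```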
Application: Image Denoising
• KSVD for denoising seeks to minimize the
following objective function:
  (\hat{X}, \hat{D}, \{\hat{\alpha}_{ij}\}) = \arg\min_{X, D, \{\alpha_{ij}\}} \; \lambda \| X - Y \|_2^2 + \sum_{i,j} \mu_{ij} \| \alpha_{ij} \|_0 + \sum_{i,j} \| D \alpha_{ij} - R_{ij} X \|_2^2
where
  Y = noisy image
  X = underlying clean image (to be estimated)
  \alpha_{ij} = sparse dictionary coefficients for the patch at location (i, j)
  D = dictionary
  R_{ij} = matrix that extracts the patch x_{ij} from image X, i.e. x_{ij} = R_{ij} X
Application: Image Denoising
• Note: The dictionary may be learned a priori from a
corpus of image patches. The patches from the noisy
image can then be denoised by mere sparse coding.
• The more preferable method is to train the dictionary
directly on the noisy image in tandem with the sparse
coding step (as in the previous slide).
• This avoids having to depend on the training set and
allows for tuning of the dictionary to the underlying
image structure (as opposed to the structure of some
other images).
KSVD Algorithm for Denoising (Dictionary
learned on the noisy image)
• Set X = Y, D = overcomplete DCT
• Until some “convergence” criterion is satisfied, repeat the
following:
1. Obtain the sparse codes for every patch (typically using
OMP) as follows:
  \forall i, j: \quad \min_{\alpha_{ij}} \| \alpha_{ij} \|_0 \;\; \text{s.t.} \;\; \| R_{ij} X - D \alpha_{ij} \|_2^2 \le C \sigma^2
2. Perform the dictionary learning update typical for KSVD.
• Estimate the final image X by averaging the reconstructed
overlapping patches, OR estimate X given D and the \alpha_{ij}:
  \hat{X} = \arg\min_X \; \lambda \| X - Y \|_2^2 + \sum_{ij} \| D \alpha_{ij} - R_{ij} X \|_2^2
  \Rightarrow \hat{X} = \Big( \lambda I + \sum_{ij} R_{ij}^T R_{ij} \Big)^{-1} \Big( \lambda Y + \sum_{ij} R_{ij}^T D \alpha_{ij} \Big)
KSVD Algorithm for Denoising
(Dictionary learned on the noisy
image)
  \hat{X} = \arg\min_X \; \lambda \| X - Y \|_2^2 + \sum_{ij} \| D \alpha_{ij} - R_{ij} X \|_2^2
  \Rightarrow \hat{X} = \Big( \lambda I + \sum_{ij} R_{ij}^T R_{ij} \Big)^{-1} \Big( \lambda Y + \sum_{ij} R_{ij}^T D \alpha_{ij} \Big) \qquad (\lambda \text{ set to } 30/\sigma)
This equation is a mathematically rigorous way to show how X is
reconstructed by averaging the overlapping denoised patches together with
the noisy image as well.
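Because \sum_{ij} R_{ij}^T R_{ij} is diagonal (it simply counts how many patches cover each pixel), the closed form reduces to a per-pixel weighted average. A minimal sketch, where coded_patches holds the reconstructed patches D α_ij (reshaped to patch_size x patch_size) and positions their top-left corners; both names are hypothetical.

```python
import numpy as np

def aggregate_patches(Y, coded_patches, positions, patch_size, lam):
    """Per-pixel weighted average implementing
    X = (lam*I + sum R^T R)^{-1} (lam*Y + sum R^T D alpha)."""
    num = lam * Y.astype(float)            # lam * Y term
    den = lam * np.ones_like(Y, dtype=float)
    p = patch_size
    for (i, j), patch in zip(positions, coded_patches):
        num[i:i + p, j:j + p] += patch     # accumulate the denoised patch in place
        den[i:i + p, j:j + p] += 1.0       # count how many patches cover each pixel
    return num / den
```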
Baseline for comparison: method by Portilla
et al, “Image denoising using scale mixture of
Gaussians in the wavelet domain”, IEEE TIP
2003.
http://www.cs.technion.ac.il/~elad/Various/KSVD_Matlab_ToolBox.zip
Application of KSVD: Filling in Missing
Pixels
• An over-complete dictionary is trained a priori on a set
of face images.
• A test face image (not part of the training set) is
synthetically degraded by masking out 50 to 70 percent
of the pixels.
• Patches from the degraded image are sparse coded by
projection onto the trained dictionary using OMP.
• OMP is modified so that only the non-degraded pixels
are considered during any error computation (the
dictionary elements are therefore appropriately rescaled to unit norm).
Application of KSVD: Filling in Missing
Pixels
• OMP is modified so that only the non-degraded pixels are considered during any
error computation (the dictionary elements
are therefore appropriately re-scaled to unit
norm).
  \min_{s_{ij}} \| y_{ij} - D s_{ij} \|_2^2 \;\; \text{such that} \;\; \| s_{ij} \|_0 \le T
becomes
  \min_{s_{ij}} \| ( y_{ij} - D s_{ij} ) \odot \text{mask}_{ij} \|_2^2 \;\; \text{such that} \;\; \| s_{ij} \|_0 \le T
Application of KSVD: Filling in Missing
Pixels
• The patches are reconstructed in the following
manner:
  \hat{y}_{ij} = D s_{ij}
NOTE: Dictionary elements without masking!!
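A minimal sketch of this masked coding step, reusing the omp sketch from earlier; the function name inpaint_patch and the way the mask is applied are illustrative assumptions consistent with the description above.

```python
import numpy as np

def inpaint_patch(D, y, mask, sparsity=10, eps=1e-6):
    """Sparse-code a patch using only its known pixels (mask == 1), then reconstruct
    every pixel with the full, unmasked dictionary."""
    Dm = D * mask[:, None]                         # zero the rows at missing pixels
    norms = np.linalg.norm(Dm, axis=0) + 1e-12
    Dm = Dm / norms                                # re-scale the masked atoms to unit norm
    s = omp(Dm, y * mask, eps=eps, max_iter=sparsity)
    s = s / norms                                  # undo the rescaling for the full dictionary
    return D @ s                                   # reconstruction, missing pixels included
```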
Dictionary learning for classification
• Dictionary learning is useful for signal
reconstruction (we have seen many techniques).
• We now explore applications for classification:
example, texture classification or object
recognition.
• Application in texture classification: Ref: Skretting
and Husoy, “Texture classification using sparse
frame-based representations”, EURASIP Journal
on Signal Processing, 2006.
Texture classification/segmentation:
dictionary learning
• Training phase: Patches from representative
texture images of a given class Ci are arranged
in a matrix Xi of size N by L.
• We will represent Xi in the following form:
  X_i = F_i W_i, \quad X_i \in \mathbb{R}^{N \times L}, \; F_i \in \mathbb{R}^{N \times K}, \; W_i \in \mathbb{R}^{K \times L}
• Here Fi is a dictionary and Wi is a set of sparse
coefficients.
• Starting with an initial Fi , we estimate Wi with
a typical sparse coding step (with constraints).
Texture classification: dictionary
learning
• Then given the new Wi , the dictionary Fi is
updated using the following:
T 1
Fi  X iWi (WiWi )
T
• The afore-mentioned two steps are repeated
until some convergence criterion is met.
Texture classification: Test phase
• Blocks from a test image are projected onto each
of the dictionaries and the residual errors are
computed after finding the coefficients with a
sparse coding method (assuming a fixed number
of nonzero elements in the sparse code).
• Each block is assigned to the class Ci if the
dictionary Fi gave the least reconstruction error.
• Unfortunately, this method produces very high
errors!
Texture classification: Test phase
• Significant improvement is obtained if the error
images are smoothed prior to classification. The
smoothing can be done with a Gaussian filter.
• If the Gaussian σ is low, errors will be high in
many places.
• If the Gaussian σ is large, the classification errors
will be low within a segment, though high along
the borders between two or more segments.
• The large Gaussian σ imposes the belief that
nearby pixels are from the same class (a belief
that gets violated at the borders).
Texture classification: observations
• It is observed that class-specific dictionary
learning outperforms classification with
universal bases like wavelets or DCT.
• Classification performance is affected by
parameters such as the patch size, the number of
dictionary atoms, and the Gaussian σ.
Discriminative Dictionary Learning
• The earlier scheme enforces good
reconstruction of patches from a given texture
class if a good dictionary is available.
• But it is not discriminative enough – it does
not enforce poorer reconstructions given the
“wrong” class!
• This issue is dealt with in the paper – Mairal et
al, “Discriminative Learned Dictionaries for
Local Image Analysis”, CVPR 2008.
Discriminative Dictionary Learning
• The main discriminative function is given as
follows:
  C_i^\lambda (y_1, y_2, \dots, y_N) = \log \Big( \sum_{j=1}^N e^{-\lambda (y_j - y_i)} \Big)
• This is called a "softmax" discriminative
function.
• It is close to zero when y_i is distinctly the smallest
amongst {y_1, y_2, ..., y_N}. In that case the summation
is dominated by the term e^{-\lambda (y_i - y_i)} = 1,
whereas the other terms will be close to zero
(this is especially true for large values of λ).
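A small sketch of this cost; y holds the N reconstruction errors and lam is the sharpness parameter λ:

```python
import numpy as np

def softmax_discriminative(y, i, lam):
    """C_i^lam(y_1, ..., y_N) = log(sum_j exp(-lam * (y_j - y_i))):
    close to 0 when y_i is clearly the smallest error, large otherwise."""
    y = np.asarray(y, dtype=float)
    return float(np.log(np.sum(np.exp(-lam * (y - y[i])))))
```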
Discriminative Dictionary Learning
• The objective function for our task is as
follows. For a data-point x and a dictionary D, define
  \alpha^*(x, D) = \arg\min_\alpha \| x - D \alpha \|_2^2 \;\; \text{subject to} \;\; \| \alpha \|_0 \le L   (sparse code, sparsity level L)
  R^*(x, D) = \| x - D \, \alpha^*(x, D) \|_2^2   (reconstruction error)
• The dictionaries are then learned by solving
  \min_{\{D_j\}_{j=1}^N} \; \sum_{i=1}^N \sum_{l \in S_i} C_i^\lambda \big( \{ R^*(x_l, D_j) \}_{j=1}^N \big) + \mu \, R^*(x_l, D_i)
  High μ = purely reconstructive; low μ = purely discriminative.
Discriminative Dictionary Learning:
Overall Algorithm
• Sparse coding step proceeds exactly as before.
• Dictionary vectors are updated by optimizing
the aforementioned objective function:
  \min_{\{D_j\}_{j=1}^N} \; \sum_{i=1}^N \sum_{l \in S_i} C_i^\lambda \big( \{ R^*(x_l, D_j) \}_{j=1}^N \big) + \mu \, R^*(x_l, D_i)
Discriminative Dictionary Learning:
MOD-like update
• Setting the derivative of the energy with
respect to Dj to zero, we get an equation of the form
  \sum_{l \in S_j} \frac{\partial C_j \big( \{ R(x_l, D_j, \alpha_{lj}) \}_{j=1}^N \big)}{\partial R(x_l, D_j)} \, \big( x_l - D_j \alpha_{lj} \big) \, \alpha_{lj}^T = 0
• This update is very much like the MOD
algorithm for dictionary learning
(reconstruction), with each data-point weighted by the derivative of the
softmax cost.
• It is solved by computing the SVD of the
resulting matrix and taking the
singular vectors corresponding to the
largest singular value (just as in
standard KSVD – go back and
compare).
Discriminative Dictionary Learning:
Parameter Choice
• λ is initially chosen to be small. Its value is
increased gradually (more discriminative)
across iterations to update the dictionary.
• μ is initially large (more reconstructive). Its
value is gradually decreased (less
reconstructive) across iterations to update the
dictionary.
Application 1: Texture
Segmentation/Classification
• Experiments performed on a mosaic of
textures from the well-known Brodatz texture
database (see next slide).
• Experiments performed on 12 x 12 patches.
Sparsity level set to L = 4. Dictionary size was K
= 128.
• During training, 30 iterations of discriminative
learning were performed.
Application 1: Texture
Segmentation/Classification
• During testing, 12 x 12 patches from the test
image were classified by first computing the
reconstruction errors on the respective
dictionaries corresponding to each class,
followed by smoothing of the obtained error
images.
• The patch is assigned to the class yielding the
least of the smoothed error-values.
Method of Skretting
and Husoy (cited
earlier in the slides)
Application 2: Learning Discriminative
Patches
• Consider a distinct object (e.g. a car, bicycle
etc.) against a background.
• Consider a rectangular bounding box drawn
around the object.
• Let small-sized patches from within the
bounding box form a set S1.
• Let the corresponding set from the
background be denoted as S2.
Application 2: Learning Discriminative
Patches
• These patches, on their own, are poor features for distinguishing
between the foreground object and the
background, because many patches from the background and
foreground are similar, and because the bounding box does not
coincide with the object boundary.
• However SOME patches of the foreground
object carry discriminative information!
(Figure: an example of a discriminative patch on the foreground object.)
Application 2: Learning Discriminative
Patches
• Two discriminative dictionaries were learned –
D1 for S1, and D2 for S2, with K = 128, L = 4.
• After the first few iterations, in each
subsequent iteration for dictionary updates,
only the 90% best classified patches were
retained (as per the “C” function for
discriminative learning).