AN INTRODUCTION TO
GENERALIZED K-MEANS WITH
IMAGE PROCESSING APPLICATIONS
JEREMY WATT AND REZA BORHANI
What we'll discuss
1. The cutting edge – Dictionary Learning and large-scale image compression
2. From K-means to G-K-means (a.k.a. Dictionary Learning)
3. A tried and true application of G-K-means in image processing
DICTIONARY LEARNING AND
JOINT IMAGE COMPRESSION
Image compression via G-K-means
[Figure: an 8×8 image block, flattened into a pixel vector y]
Image compression via G-K-means
 • Say we have a robust matrix D that can represent image blocks from similar images well as a sparse sum of its columns
 • So we essentially have

       y ≈ Dx,   where   ||x||_0 ≤ S
       ||x||_0 = # of nonzero values in x

 • Standard basis vector: e_j = (0, 0, …, 0, 1, 0, …, 0)ᵀ, with the 1 in the jth position
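To make the sparse-sum picture concrete, here is a minimal NumPy sketch (not from the slides; the dictionary size and the support of x are made up for illustration) of a block y that is, up to a small residual, a sum of just S = 3 dictionary columns:

import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))           # dictionary for 8x8 blocks: 64 pixels, 256 columns
x = np.zeros(256)
x[[3, 17, 101]] = [0.7, -1.2, 0.4]           # only S = 3 nonzero coefficients: ||x||_0 = 3
y = D @ x + 0.01 * rng.standard_normal(64)   # a block that is (nearly) a sparse sum of columns

print(np.count_nonzero(x))                   # 3
print(np.linalg.norm(y - D @ x))             # small residual, so y ≈ D x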
Image compression via G-K-means
(Image adapted from Irina Rish's Sparse Statistical Models course)
Image compression via G-K-means
 • Can record far fewer coefficients than pixel values!
 • Since sender and receiver both have the dictionary, we just send a few coefficients – much cheaper!
How do we find the right fit?
min π’š − 𝑫𝒙
𝒙
subject to π‘₯
0
≤𝑆
 Greed is Good - what is a greedy approach to
solving this problem?
• Repeat S times
• Cycle through each column of D not yet
used and find the single best fit
• Subtract the best column’s influence
from y
Francis Bach et al. Sparse Coding and Dictionary Learning for Image Analysis Part III:
Optimization for Sparse Coding and Dictionary Learning
How do we find the right fit?

    min_x ||y − Dx||   subject to   ||x||_0 ≤ S

 • Greed is good – a greedy approach to solving this problem:

    r = y,  x = 0,  L = ∅
    for s = 1 … S
        (t, a_t) = argmin_{j ∉ L, a_j} ||r − a_j d_j||
        r ← r − a_t d_t
        L ← L ∪ {t}
        x ← x + a_t e_t
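A minimal NumPy sketch of this greedy scheme (a matching-pursuit-style routine; the function name and dimensions are ours, not the slides'). For a fixed column d_j the best coefficient is a_j = d_jᵀ r / ||d_j||², so the best unused column is the one whose single-column fit shrinks the residual the most:

import numpy as np

def greedy_sparse_code(y, D, S):
    """Approximately solve  min_x ||y - D x||  subject to  ||x||_0 <= S."""
    n_cols = D.shape[1]
    x = np.zeros(n_cols)
    r = y.astype(float).copy()                    # residual r = y
    used = []                                     # the set L of columns already used
    col_norms_sq = np.sum(D ** 2, axis=0) + 1e-12
    for _ in range(S):
        corr = D.T @ r                            # best coefficient for column j is corr[j] / ||d_j||^2
        score = corr ** 2 / col_norms_sq          # drop in squared residual if column j is used alone
        score[used] = -np.inf                     # skip columns already in L
        t = int(np.argmax(score))
        a_t = corr[t] / col_norms_sq[t]
        r -= a_t * D[:, t]                        # subtract the best column's influence
        used.append(t)                            # L <- L ∪ {t}
        x[t] += a_t                               # x <- x + a_t e_t
    return x

Usage: x = greedy_sparse_code(y, D, S); the reconstruction is then D @ x.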
The era of Big Data
250 million images are uploaded to Facebook every day. There are currently ~90 billion photos total on Facebook.
Joint Compressor
[Diagram: three images compressed separately with JPEG produce α1, α2, and α3 bits; a joint compressor for all three produces A bits total.]
In 2007, almost 70 million CT scans were performed in the U.S. alone.
Joint Compressor
[Diagram: three CT scans compressed separately with DICOM produce β1, β2, and β3 bits; a joint compressor for all three produces B bits total.]
Joint image compressor
FROM K-MEANS TO G-KMEANS
(AKA DICTIONARY LEARNING)
What is data clustering?
 • A method for understanding the structure of a given (training) data set by grouping its points into clusters
 • A way of classifying points in a newly received (test) data set
 • K-means is the most popular method
Begin Training Phase
[Figure sequence: initialize centroids → assign data → recompute centroids → re-assign data → … until convergence]
 • Pick K centroids
 • Repeat until convergence:
   • Assign each (training) point to its closest centroid
   • Recompute centroid locations
 • Output final K centroids and assignments
Begin Testing Phase
 • A new point arrives (shown as "?") – which centroid is closest?
Shortcomings
 • Number of centroids?
   • Initialization?
   • Are yours the right ones?
 • Non-globular clusters
 • Complexity
Classification
 • Not the best, but not bad right out of the box
 • MNIST dataset – 60,000 training, 10,000 test
 • Competitors include SVMs, logistic regression, etc.
(Pseudo) Image Compression
 • Per pixel: instead of (R,G,B), store the index of the closest centroid
 • Also called vector quantization
 • (Image taken from Bishop's Pattern Recognition and Machine Learning)
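A sketch of the per-pixel vector quantization just described, assuming the RGB centroids have already been found (e.g., by running K-means on the pixel colors); the function names and array shapes are ours:

import numpy as np

def quantize(image, centroids):
    """image: (H, W, 3) float array; centroids: (K, 3). Returns (H, W) centroid indices."""
    pixels = image.reshape(-1, 3)
    d2 = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)   # (H*W, K) squared distances
    return d2.argmin(axis=1).reshape(image.shape[:2])    # index of the closest centroid per pixel

def dequantize(indices, centroids):
    return centroids[indices]          # (H, W, 3) reconstruction from the stored indices

Storing one small integer per pixel plus the K centroids is what makes this a (pseudo) compression scheme.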
 • For K centroids:    D = [d_1 | … | d_K]    (e.g., centroids d_1, d_2, d_3)
 • For P points:       Y = [y_1 | … | y_P]    (e.g., points y_1, y_10, y_27)
 • Each point gets a one-hot assignment vector; with K = 3 centroids, for example:

       x_1 = (0, 1, 0)ᵀ,   x_10 = (1, 0, 0)ᵀ,   x_27 = (0, 0, 1)ᵀ

   (each entry of an assignment vector corresponds to one centroid – magenta, yellow, blue in the figure)
 • For P assignments:  X = [x_1 | … | x_P]

Notice
 • Columns of X – the assignment of one data point
 • Rows of X – all assignments to a single centroid
K-means algorithm: Notation
 • Centroid matrix     D = [d_1 | … | d_K]
 • Data matrix         Y = [y_1 | … | y_P]
 • Assignment matrix   X = [x_1 | … | x_P]
 • C_k – the kth cluster
 • |C_k| – the cardinality of the kth cluster
K-means algorithm
Input: Y, initial D and X
Output: final D and X
Until convergence:
    for p = 1 … P   (update assignments)
        t = argmin_j ||y_p − d_j||
        x_p ← e_t
    for k = 1 … K   (update centroids)
        d_k ← (1/|C_k|) Σ_{y ∈ C_k} y
K-means algorithm
Input: Y, initial D and X
Output: final D and X
Until convergence:
    for p = 1 … P   (update assignments)
        t = argmin_j ||y_p − D e_j||
        x_p ← e_t
    for k = 1 … K   (update centroids)
        d_k ← (1/|C_k|) Σ_{y ∈ C_k} y
K-means algorithm
Input: Y, initial D and X
Output: final D and X
Until convergence:
    for p = 1 … P   (update assignments)
        x_p ← argmin_{x_p ∈ {e_1, …, e_K}} ||y_p − D x_p||
    for k = 1 … K   (update centroids)
        d_k ← (1/|C_k|) Σ_{y ∈ C_k} y
K-means algorithm
Input: Y, initial D and X
Output: final D and X
Until convergence:
    for p = 1 … P   (update assignments)
        x_p ← argmin_{x_p ∈ {e_1, …, e_K}} ||y_p − D x_p||
    for k = 1 … K   (update centroids)
        d_k ← argmin_{d_k} ||Y − d_k x^k||    (x^k = the kth row of X)
For example, say we have
 • 4 data points y_1, y_2, y_3, y_4
 • 2 centroids d_1, d_2
Then x^1 = (1, 1, 0, 0) records which points are assigned to the first centroid,
and d_1 x^1 = [d_1 | d_1 | 0 | 0].
    argmin_{d_1} ||Y − d_1 x^1||² = argmin_{d_1} ( ||y_1 − d_1||² + ||y_2 − d_1||² )

What value of d_1 minimizes the sum of squared distances?
[Figure: y_1, y_2, and several candidate locations for d_1 – "Not quite…", "Not quite…"]
The average of y_1 and y_2 is the minimizer!
π’Œ
Generally argmin 𝒀 − π’…π’Œ 𝒙
π’…π’Œ
=
𝟏
π‘ͺπ’Œ
π’š∈π‘ͺπ’Œ π’š
K-means algorithm
Input: Y, initial D and X
Output: final D and X
Until convergence:
    for p = 1 … P   (update assignments)
        x_p ← argmin_{x_p ∈ {e_1, …, e_K}} ||y_p − D x_p||
    for k = 1 … K   (update centroids)
        d_k ← argmin_{d_k} ||Y − d_k x^k||

As we generalize, the form of the centroid-update optimization problem will not change! However, its solution (the explicit centroid update) will.
K-means algorithm
argmin 𝒀 − 𝑫𝑿
𝑿,𝑫
subject to
𝒙𝒑 ∈ π’†πŸ , … , 𝒆𝑲 p= 1 … 𝑃
How would I solve this greedily?
• Repeat until convergence
• Fix D, update each column of X
• Fix X, update each column of D
K-means algorithm

    argmin_{X,D} ||Y − DX||   subject to   x_p ∈ {e_1, …, e_K},  p = 1 … P

Training Phase
    for p = 1 … P   (update assignments)
        x_p ← argmin_{x_p ∈ {e_1, …, e_K}} ||y_p − D x_p||
    for k = 1 … K   (update centroids)
        d_k ← argmin_{d_k} ||Y − d_k x^k||
K-means algorithm

    argmin_{X,D} ||Y − DX||   subject to   x_p ∈ {e_1, …, e_K},  p = 1 … P

Training Phase (as above)

Testing Phase (a new data point y arrives)

    argmin_x ||y − Dx||   subject to   x ∈ {e_1, …, e_K}
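The testing phase is then just a nearest-centroid lookup; a minimal sketch (function name ours):

import numpy as np

def assign(y_new, D):
    """Return the index t of the closest centroid, i.e. the solution x = e_t."""
    d2 = ((D - y_new[:, None]) ** 2).sum(axis=0)   # squared distance to each centroid
    return int(d2.argmin())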
What if assignments could have arbitrary weights? For example:

    x_1 = (1.2, 4.2, 0)ᵀ,   x_10 = (2.6, 0, 0.1)ᵀ,   x_27 = (0, 1.6, 0.3)ᵀ

(entries again correspond to the magenta, yellow, and blue centroids in the figure)
K-means algorithm

    argmin_{X,D} ||Y − DX||   subject to   x_p ∈ {e_1, …, e_K},  p = 1 … P

But why limit ourselves to a weight-one assignment? And why only one centroid each?
Why not allow each point to use at most S centroids, with arbitrary assignment values?

    argmin_{X,D} ||Y − DX||   subject to   ||x_p||_0 ≤ S,  p = 1 … P
K-means algorithm
Input: Y, initial D and X
Output: final D and X
Until convergence:
    for p = 1 … P   (update assignments)
        t = argmin_j ||y_p − d_j||
        x_p ← e_t
    for k = 1 … K   (update centroids)
        d_k ← argmin_{d_k} ||Y − d_k x^k||
Tweaked K-means algorithm
Input: Y, initial D and X
Output: final D and X
Until convergence:
    for p = 1 … P   (update assignments)
        (t, a_t) = argmin_{j, a_j} ||y_p − a_j d_j||
        x_p ← a_t e_t
    for k = 1 … K   (update centroids)
        d_k ← argmin_{d_k} ||Y − d_k x^k||
Tweaked K-means algorithm
Input: Y, initial D and X
Output: final D and X
Until convergence:
    for p = 1 … P   (update assignments)
        r_p = y_p,  x_p = 0,  L = ∅
        for s = 1 … S
            (t, a_t) = argmin_{j ∉ L, a_j} ||r_p − a_j d_j||
            r_p ← r_p − a_t d_t
            L ← L ∪ {t}
            x_p ← x_p + a_t e_t
    for k = 1 … K   (update centroids)
        d_k ← argmin_{d_k} ||Y − d_k x^k||
G-K-means algorithm

Training Phase

    min_{X,D} ||Y − DX||   subject to   ||x_p||_0 ≤ S,  p = 1 … P

Testing Phase

    argmin_x ||y − Dx||   subject to   ||x||_0 ≤ S
Flashback: K-means algorithm

Training Phase

    argmin_{X,D} ||Y − DX||   subject to   x_p ∈ {e_1, …, e_K},  p = 1 … P

Testing Phase

    argmin_x ||y − Dx||   subject to   x ∈ {e_1, …, e_K}
Where do we go from here?
min 𝒀 − 𝑫𝑿
𝑿,𝑫
subject to 𝒙𝑝
0
≤𝑆
min 𝒀 − 𝑫𝑿
𝑿,𝑫
subject to 𝒙𝑝
1
≤𝜏
Where do we go from here?
min 𝒀 − 𝑫𝑿
𝑿,𝑫
subject to 𝒙𝑝
0
≤𝑆
Relaxed
Greedy
min 𝒀 − 𝑫𝑿
𝑿,𝑫
subject to 𝒙𝑝
1
≤𝜏
Where do we go from here?
min 𝒀 − 𝑫𝑿
𝑿,𝑫
subject to 𝒙𝑝
0
See
References
from this
presentation!
≤𝑆
Relaxed
min 𝒀 − 𝑫𝑿
𝑿,𝑫
subject to 𝒙𝑝
1
≤𝜏
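For the relaxed (L1) version, the assignment step becomes a Lasso-type problem. A sketch of that step with scikit-learn's Lasso; note that it solves the penalized form min_x (1/2n)||y − Dx||² + α||x||_1 rather than the constrained form above, with α playing the role of the constraint level τ:

import numpy as np
from sklearn.linear_model import Lasso

def l1_sparse_code(y, D, alpha=0.1):
    """L1-relaxed sparse coding of one data point y against dictionary D."""
    model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10_000)
    model.fit(D, y)            # columns of D act as features
    return model.coef_         # sparse coefficient vector x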
Flashback to section 1

    min_x ||y − Dx||   subject to   ||x||_0 ≤ S

(Image adapted from Irina Rish's Sparse Statistical Models course)
A TRIED AND TRUE APPLICATION OF G-K-MEANS IN IMAGE PROCESSING

Tried and true applications of Dictionary Learning
 • Inpainting
 • Denoising
 • Super-resolution
Inpainting
 • Train a dictionary D that can represent image blocks well as a sparse sum of its columns
 • So we essentially have

       y ≈ Dx,   where   ||x||_0 ≤ S

 • In the experiment shown:
   • D has 441 columns (centroids)
   • Training set of 11,000 8×8 blocks from a database of images
   • For training and testing, ||x||_0 ≤ 10
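A sketch of how such a trained dictionary is used for block-wise inpainting (our own illustration of the standard recipe, not code from the paper): sparse-code each 8×8 block using only its observed pixels, then fill the missing pixels in from the reconstruction D x. It reuses greedy_sparse_code from earlier; mask marks which of the 64 pixels are known:

import numpy as np

def inpaint_block(y, mask, D, S=10):
    """y: length-64 block with missing entries; mask: boolean, True where the pixel is observed."""
    x = greedy_sparse_code(y[mask], D[mask, :], S)   # fit the sparse code on observed rows only
    y_hat = D @ x                                    # reconstruct the whole block
    out = y.copy()
    out[~mask] = y_hat[~mask]                        # keep observed pixels, fill in the missing ones
    return out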
Inpainting
(From Elad's K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation)
QUESTIONS?
Jeremy Watt – jermwatt@gmail.com
Reza Borhani – borhani@u.northwestern.edu
Image and Video Processing Lab at Northwestern University
http://ivpl.eecs.northwestern.edu
References and Further Reading
• Francis Bach et al., Sparse Coding and Dictionary Learning for Image Analysis, Part III: Optimization for Sparse Coding and Dictionary Learning, available at http://lear.inrialpes.fr/people/mairal/tutorial_cvpr2010/
• Michael Elad et al., Sparse and Redundant Representations (book), and many great presentations on his website at http://www.cs.technion.ac.il/~elad/
  • For the greedy methods discussed in this presentation, see in particular Chapters 3 and 12 of Elad's book
  • For applications to image processing, see Chapters 13–15 and the references within
• Elad's K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation, available at http://intranet.daiict.ac.in/~ajit_r/IT530/KSVD_IEEETSP.pdf
• Irina Rish's Sparse Statistical Models course – great presentations and problem sets available at https://sites.google.com/site/eecs6898sparse2011/
• Hastie et al., The Elements of Statistical Learning, available from the authors as a PDF at http://www-stat.stanford.edu/~tibs/ElemStatLearn/
  • For L1 (== Lasso) recovery, see Sections 3.2 and 3.4