AN INTRODUCTION TO GENERALIZED K-MEANS WITH IMAGE PROCESSING APPLICATIONS
Jeremy Watt and Reza Borhani

What we'll discuss
1. The cutting edge: dictionary learning and large-scale image compression
2. From K-means to G-Kmeans (a.k.a. dictionary learning)
3. A tried and true application of G-Kmeans in image processing

DICTIONARY LEARNING AND JOINT IMAGE COMPRESSION

Image compression via G-Kmeans
• Work block by block: each 8x8 image block is treated as a vector $y$.
• Say we have a dictionary, a matrix $D$, that can represent image blocks from similar images well as a sparse sum of its columns.
• So we essentially have $y \approx Dx$ with $\|x\|_0 \le S$, where $\|x\|_0$ counts the nonzero values in $x$.
• Notation: $e_k = (0, 0, \dots, 0, 1, 0, \dots, 0)^T$ is the kth standard basis vector.
(Image adapted from Irina Rish's Sparse Statistical Models course.)
• We can record far fewer coefficients than pixel values!
• Since sender and receiver both have the dictionary, we just send a few coefficients, which is much cheaper.

How do we find the right fit?
$$\min_{x} \; \|y - Dx\|_2 \quad \text{subject to} \quad \|x\|_0 \le S$$
• Greed is good: what is a greedy approach to solving this problem?
  • Repeat S times: cycle through each column of D not yet used, find the single best fit to what remains of y, and subtract that column's influence from y.
• Written out (a matching-pursuit-style loop):
  $r = y$, $L = \emptyset$
  for $s = 1, \dots, S$
    $(k^*, \alpha^*) = \operatorname*{argmin}_{\alpha,\; k \notin L} \|r - \alpha\, d_k\|_2$
    $r \leftarrow r - \alpha^* d_{k^*}$
    $L \leftarrow L \cup \{k^*\}$
    $x \leftarrow x + \alpha^* e_{k^*}$
(See Francis Bach et al., Sparse Coding and Dictionary Learning for Image Analysis, Part III: Optimization for Sparse Coding and Dictionary Learning.)

The era of Big Data
• 250 million images are uploaded to Facebook every day; there are currently ~90 billion photos on Facebook in total.
• In 2007, almost 70 million CT scans were performed in the U.S. alone.
(Figures: several photos compressed separately with JPEG cost $B_1$, $B_2$, $B_3$, ... bits each, and several DICOM scans compressed separately likewise; a joint image compressor built on a shared dictionary instead produces a single stream of $B$ bits.)

FROM K-MEANS TO G-KMEANS (A.K.A. DICTIONARY LEARNING)

What is data clustering?
• A method for understanding the structure of a given (training) data set by grouping similar points.
• A way of classifying points in a newly received (test) data set.
• K-means is the most popular method.

Begin training phase
(Figure sequence: initialize centroids, assign data, recompute centroids, re-assign data, and so on.)
• Pick K centroids.
• Repeat until convergence:
  • Assign each (training) point to its closest centroid.
  • Recompute the centroid locations.
• Output the final K centroids and assignments.

Begin testing phase
• Given a new point: which centroid is closest?

Shortcomings
• How many centroids? How should they be initialized? Are yours the right ones?
• Non-globular clusters.
• Computational complexity.

Classification
• Not the best, but not bad right out of the box.
• MNIST dataset: 60,000 training examples, 10,000 test examples.
• Competitors include SVMs, logistic regression, etc.
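To make the training and testing phases above concrete, here is a minimal NumPy sketch of plain K-means. It is only an illustration under assumed conventions (points stored as columns, random data-point initialization, a fixed number of iterations); it is not the implementation behind the MNIST remark above.

```python
import numpy as np

def kmeans_train(Y, K, n_iters=50, seed=0):
    """Y: (N, P) data matrix, one point per column. Returns centroids D (N, K)
    and a length-P array of hard assignments (centroid indices)."""
    rng = np.random.default_rng(seed)
    N, P = Y.shape
    # Initialize centroids as K randomly chosen data points.
    D = Y[:, rng.choice(P, size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Update assignments: each point goes to its closest centroid.
        dists = ((Y[:, :, None] - D[:, None, :]) ** 2).sum(axis=0)  # (P, K)
        assign = dists.argmin(axis=1)
        # Update centroids: each centroid becomes the mean of its cluster.
        for k in range(K):
            members = Y[:, assign == k]
            if members.shape[1] > 0:
                D[:, k] = members.mean(axis=1)
    return D, assign

def kmeans_test(D, y):
    """Testing phase: index of the centroid closest to a new point y."""
    return int(((D - y[:, None]) ** 2).sum(axis=0).argmin())
```

For classification in the spirit described above, one would label each trained centroid (for instance by the majority class of its cluster) and give a test image the label of its closest centroid.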
(Pseudo) image compression
• Per pixel: instead of storing an (R, G, B) triple, store the index of the closest centroid.
• This is also called vector quantization.
(Image taken from Bishop's Pattern Recognition.)

K-means: notation
• For K centroids, the centroid matrix is $D = [\,d_1 \,|\, \dots \,|\, d_K\,]$.
• For P points, the data matrix is $Y = [\,y_1 \,|\, \dots \,|\, y_P\,]$.
• For P assignments, the assignment matrix is $X = [\,x_1 \,|\, \dots \,|\, x_P\,]$.
• With K = 3 centroids, example assignment vectors look like $x_1 = (0, 1, 0)^T$, $x_{10} = (1, 0, 0)^T$, $x_{27} = (0, 0, 1)^T$: each is a standard basis vector marking the chosen centroid.
• Notice: each column of X is the assignment of one data point; each row of X collects all assignments to a single centroid.
• $C_k$ denotes the kth cluster and $|C_k|$ its cardinality.

K-means algorithm
Input: Y, initial D and X. Output: final D and X.
Until convergence:
  for $p = 1, \dots, P$ (update assignments)
    $k^* = \operatorname*{argmin}_{k} \|y_p - d_k\|_2$, then $x_p \leftarrow e_{k^*}$
  for $k = 1, \dots, K$ (update centroids)
    $d_k \leftarrow \frac{1}{|C_k|} \sum_{p \in C_k} y_p$

K-means algorithm, with both updates written as optimization problems
Input: Y, initial D and X. Output: final D and X.
Until convergence:
  for $p = 1, \dots, P$ (update assignments)
    $\operatorname*{argmin}_{x_p \in \{e_1, \dots, e_K\}} \|y_p - Dx_p\|_2$
  for $k = 1, \dots, K$ (update centroids)
    $\operatorname*{argmin}_{d_k} \|Y - d_k x^{(k)}\|_F$, where $x^{(k)}$ is the kth row of X

Why the centroid update is just an average
• For example, say we have 4 data points $y_1, y_2, y_3, y_4$ and 2 centroids $d_1, d_2$. Then $x^{(1)} = (1, 1, 0, 0)$ records which points are assigned to the first centroid, and $d_1 x^{(1)} = [\,d_1 \,|\, d_1 \,|\, 0 \,|\, 0\,]$.
• In $\operatorname*{argmin}_{d_1} \|Y - d_1 x^{(1)}\|_F$, the only terms that depend on $d_1$ are $\|y_1 - d_1\|_2^2 + \|y_2 - d_1\|_2^2$.
• What value of $d_1$ minimizes these squared distances? The average of $y_1$ and $y_2$.
• Generally, $\operatorname*{argmin}_{d_k} \|Y - d_k x^{(k)}\|_F = \frac{1}{|C_k|} \sum_{p \in C_k} y_p$.
• As we generalize K-means, the form of the centroid update problem will not change; only its explicit solution (the cluster average) will.
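As a quick numerical check of the claim that the cluster average solves the centroid update, here is a tiny experiment; the two points are made up purely for illustration.

```python
import numpy as np

# Hypothetical 2-D points assigned to the first centroid (x^(1) = (1, 1, 0, 0)).
y1, y2 = np.array([1.0, 2.0]), np.array([3.0, 0.0])

def cluster_cost(d):
    # Sum of squared distances from a candidate centroid d to its assigned points.
    return np.sum((y1 - d) ** 2) + np.sum((y2 - d) ** 2)

d_mean = (y1 + y2) / 2  # the claimed minimizer: the cluster average
rng = np.random.default_rng(0)
perturbed = [d_mean + 0.5 * rng.standard_normal(2) for _ in range(1000)]
assert all(cluster_cost(d_mean) <= cluster_cost(d) for d in perturbed)
print("cost at the cluster average:", cluster_cost(d_mean))
```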
K-means as a single optimization problem
$$\operatorname*{argmin}_{X,\,D} \; \|Y - DX\|_F \quad \text{subject to} \quad x_p \in \{e_1, \dots, e_K\}, \; p = 1, \dots, P$$
• How would we solve this greedily? Alternate:
  • Repeat until convergence:
    • Fix D, update each column of X.
    • Fix X, update each column of D.
• This alternation is exactly the training phase above: the assignment update $\operatorname*{argmin}_{x_p \in \{e_1, \dots, e_K\}} \|y_p - Dx_p\|_2$ for $p = 1, \dots, P$, followed by the centroid update $\operatorname*{argmin}_{d_k} \|Y - d_k x^{(k)}\|_F$ for $k = 1, \dots, K$.

Testing phase
• For a new data point $y$, find its closest centroid:
$$\operatorname*{argmin}_{x} \; \|y - Dx\|_2 \quad \text{subject to} \quad x \in \{e_1, \dots, e_K\}$$

But why limit ourselves?
• Why limit ourselves to a weight-one assignment? And why only one centroid per point?
• Why not allow up to S centroids per point, each with an arbitrary weight? That is, replace the constraint $x_p \in \{e_1, \dots, e_K\}$ with $\|x_p\|_0 \le S$.
• The assignment vectors then look like $x_1 = (1.2,\, 4.2,\, 0)^T$, $x_{10} = (2.6,\, 0,\, 0.1)^T$, $x_{27} = (0,\, 1.6,\, 0.3)^T$ instead of standard basis vectors.

Tweaked K-means algorithm (one weighted centroid per point)
Input: Y, initial D and X. Output: final D and X.
Until convergence:
  for $p = 1, \dots, P$ (update assignments)
    $(k^*, \alpha^*) = \operatorname*{argmin}_{k,\,\alpha} \|y_p - \alpha\, d_k\|_2$, then $x_p \leftarrow \alpha^* e_{k^*}$
  for $k = 1, \dots, K$ (update centroids)
    $\operatorname*{argmin}_{d_k} \|Y - d_k x^{(k)}\|_F$

Tweaked K-means algorithm (greedy, up to S weighted centroids per point)
Input: Y, initial D and X. Output: final D and X.
Until convergence:
  for $p = 1, \dots, P$ (update assignments)
    $r_p = y_p$, $L = \emptyset$, $x_p = \mathbf{0}$
    for $s = 1, \dots, S$
      $(k^*, \alpha^*) = \operatorname*{argmin}_{\alpha,\; k \notin L} \|r_p - \alpha\, d_k\|_2$
      $r_p \leftarrow r_p - \alpha^* d_{k^*}$
      $L \leftarrow L \cup \{k^*\}$
      $x_p \leftarrow x_p + \alpha^* e_{k^*}$
  for $k = 1, \dots, K$ (update centroids)
    $\operatorname*{argmin}_{d_k} \|Y - d_k x^{(k)}\|_F$

G-Kmeans algorithm
Training phase:
$$\min_{X,\,D} \; \|Y - DX\|_F \quad \text{subject to} \quad \|x_p\|_0 \le S, \; p = 1, \dots, P$$
Testing phase:
$$\operatorname*{argmin}_{x} \; \|y - Dx\|_2 \quad \text{subject to} \quad \|x\|_0 \le S$$

Flashback: K-means algorithm
Training phase:
$$\operatorname*{argmin}_{X,\,D} \; \|Y - DX\|_F \quad \text{subject to} \quad x_p \in \{e_1, \dots, e_K\}, \; p = 1, \dots, P$$
Testing phase:
$$\operatorname*{argmin}_{x} \; \|y - Dx\|_2 \quad \text{subject to} \quad x \in \{e_1, \dots, e_K\}$$

Where do we go from here?
• Greedy methods attack the $\ell_0$-constrained problem directly: $\min_{X,\,D} \|Y - DX\|_F$ subject to $\|x_p\|_0 \le S$.
• Relaxed methods replace the constraint with its convex surrogate: $\min_{X,\,D} \|Y - DX\|_F$ subject to $\|x_p\|_1 \le S$.
• See the references at the end of this presentation!
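To tie the greedy branch back to code, here is a minimal NumPy sketch of the $\ell_0$-constrained training loop: the greedy (matching-pursuit-style) assignment update from the tweaked K-means algorithm, alternated with a per-column least-squares update of D. The initialization, iteration counts, lack of column renormalization, and the least-squares dictionary update are all assumptions made for this sketch; K-SVD (referenced below) instead updates each column of D and its nonzero coefficients jointly via an SVD.

```python
import numpy as np

def sparse_code_greedy(y, D, S):
    """Greedy assignment update: pick up to S columns of D, each with an
    arbitrary weight, to approximate y."""
    x = np.zeros(D.shape[1])
    r = y.astype(float)
    used = set()
    for _ in range(S):
        best = None
        for k in range(D.shape[1]):
            if k in used:
                continue
            alpha = (D[:, k] @ r) / (D[:, k] @ D[:, k])  # best weight for column k
            cost = np.sum((r - alpha * D[:, k]) ** 2)
            if best is None or cost < best[0]:
                best = (cost, k, alpha)
        _, k, alpha = best
        r -= alpha * D[:, k]          # remove this column's influence
        used.add(k)
        x[k] += alpha
    return x

def gkmeans(Y, K, S, n_iters=20, seed=0):
    """Alternate the greedy assignment update with a per-column update of D."""
    rng = np.random.default_rng(seed)
    N, P = Y.shape
    D = Y[:, rng.choice(P, size=K, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0)    # start from unit-norm columns
    X = np.zeros((K, P))
    for _ in range(n_iters):
        # Update assignments (columns of X), one data point at a time.
        for p in range(P):
            X[:, p] = sparse_code_greedy(Y[:, p], D, S)
        # Update centroids (columns of D) by least squares on each column's residual.
        for k in range(K):
            if np.any(X[k, :] != 0):
                E_k = Y - D @ X + np.outer(D[:, k], X[k, :])  # residual without column k
                D[:, k] = (E_k @ X[k, :]) / (X[k, :] @ X[k, :])
    return D, X
```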
Flashback to section 1
• The G-Kmeans testing phase is exactly the sparse coding problem from the compression discussion:
$$\min_{x} \; \|y - Dx\|_2 \quad \text{subject to} \quad \|x\|_0 \le S$$
(Image adapted from Irina Rish's Sparse Statistical Models course.)

A TRIED AND TRUE APPLICATION OF G-KMEANS IN IMAGE PROCESSING

Tried and true applications of dictionary learning
• Inpainting
• Denoising
• Super-resolution

Inpainting
• Train a dictionary D that can represent image blocks well as a sparse sum of its columns, so that $y \approx Dx$ with $\|x\|_0 \le S$.
• In the example shown: D has 441 columns (centroids); the training set is 11,000 8x8 blocks drawn from a database of images; and $\|x\|_0 \le 10$ is used for both training and testing.
(Inpainting results from Elad's K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation.)

QUESTIONS?
Jeremy Watt, jermwatt@gmail.com
Reza Borhani, borhani@u.northwestern.edu
Image and Video Processing Lab at Northwestern University
http://ivpl.eecs.northwestern.edu

References and further reading
• Francis Bach et al., Sparse Coding and Dictionary Learning for Image Analysis, Part III: Optimization for Sparse Coding and Dictionary Learning, available at http://lear.inrialpes.fr/people/mairal/tutorial_cvpr2010/
• Michael Elad, Sparse and Redundant Representations (book), and many great presentations on his website at http://www.cs.technion.ac.il/~elad/
  • For the greedy methods discussed in this presentation, see in particular Chapters 3 and 12 of Elad's book.
  • For applications to image processing, see Chapters 13-15 and the references therein.
• Elad's K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation, available at http://intranet.daiict.ac.in/~ajit_r/IT530/KSVD_IEEETSP.pdf
• Irina Rish's Sparse Statistical Models course: great presentations and problem sets, available at https://sites.google.com/site/eecs6898sparse2011/
• Hastie et al., The Elements of Statistical Learning, available from the authors as a PDF at http://www-stat.stanford.edu/~tibs/ElemStatLearn/
  • For $\ell_1$ (i.e., Lasso) recovery, see Chapters 3.2 and 3.4.