AN INTRODUCTION TO GENERALIZED K-MEANS WITH IMAGE PROCESSING APPLICATIONS
Jeremy Watt and Reza Borhani

What we'll discuss
1. The cutting edge: dictionary learning and large-scale image compression
2. From K-means to G-Kmeans (a.k.a. dictionary learning)
3. A tried and true application of G-Kmeans in image processing

DICTIONARY LEARNING AND JOINT IMAGE COMPRESSION

Image compression via G-Kmeans
• Work block by block: each 8x8 image block is treated as a vector $y$.
• Say we have a dictionary, a matrix $D$, that can represent image blocks from similar images well as a sparse sum of its columns.
• So we essentially have $y \approx Dx$ with $\|x\|_0 \le S$, where $\|x\|_0$ counts the nonzero values in $x$.
• Notation: $e_k = (0, 0, \dots, 0, 1, 0, \dots, 0)^T$ is the kth standard basis vector.
(Image adapted from Irina Rish's Sparse Statistical Models course.)
• We can record far fewer coefficients than pixel values!
• Since sender and receiver both have the dictionary, we just send a few coefficients, which is much cheaper.

How do we find the right fit?
$$\min_{x} \; \|y - Dx\|_2 \quad \text{subject to} \quad \|x\|_0 \le S$$
• Greed is good: what is a greedy approach to solving this problem?
  • Repeat S times: cycle through each column of D not yet used, find the single best fit to what remains of y, and subtract that column's influence from y.
• Written out (a matching-pursuit-style loop):
  $r = y$, $L = \emptyset$
  for $s = 1, \dots, S$
    $(k^*, \alpha^*) = \operatorname*{argmin}_{\alpha,\; k \notin L} \|r - \alpha\, d_k\|_2$
    $r \leftarrow r - \alpha^* d_{k^*}$
    $L \leftarrow L \cup \{k^*\}$
    $x \leftarrow x + \alpha^* e_{k^*}$
(See Francis Bach et al., Sparse Coding and Dictionary Learning for Image Analysis, Part III: Optimization for Sparse Coding and Dictionary Learning.)

The era of Big Data
• 250 million images are uploaded to Facebook every day; there are currently ~90 billion photos on Facebook in total.
• In 2007, almost 70 million CT scans were performed in the U.S. alone.
(Figures: several photos compressed separately with JPEG cost $B_1$, $B_2$, $B_3$, ... bits each, and several DICOM scans compressed separately likewise; a joint image compressor built on a shared dictionary instead produces a single stream of $B$ bits.)

FROM K-MEANS TO G-KMEANS (A.K.A. DICTIONARY LEARNING)

What is data clustering?
• A method for understanding the structure of a given (training) data set by grouping similar points.
• A way of classifying points in a newly received (test) data set.
• K-means is the most popular method.

Begin training phase
(Figure sequence: initialize centroids, assign data, recompute centroids, re-assign data, and so on.)
• Pick K centroids.
• Repeat until convergence:
  • Assign each (training) point to its closest centroid.
  • Recompute the centroid locations.
• Output the final K centroids and assignments.

Begin testing phase
• Given a new point: which centroid is closest?

Shortcomings
• How many centroids? How should they be initialized? Are yours the right ones?
• Non-globular clusters.
• Computational complexity.

Classification
• Not the best, but not bad right out of the box.
• MNIST dataset: 60,000 training examples, 10,000 test examples.
• Competitors include SVMs, logistic regression, etc.
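To make the training and testing phases above concrete, here is a minimal NumPy sketch of plain K-means. It is only an illustration under assumed conventions (points stored as columns, random data-point initialization, a fixed number of iterations); it is not the implementation behind the MNIST remark above.

```python
import numpy as np

def kmeans_train(Y, K, n_iters=50, seed=0):
    """Y: (N, P) data matrix, one point per column. Returns centroids D (N, K)
    and a length-P array of hard assignments (centroid indices)."""
    rng = np.random.default_rng(seed)
    N, P = Y.shape
    # Initialize centroids as K randomly chosen data points.
    D = Y[:, rng.choice(P, size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Update assignments: each point goes to its closest centroid.
        dists = ((Y[:, :, None] - D[:, None, :]) ** 2).sum(axis=0)  # (P, K)
        assign = dists.argmin(axis=1)
        # Update centroids: each centroid becomes the mean of its cluster.
        for k in range(K):
            members = Y[:, assign == k]
            if members.shape[1] > 0:
                D[:, k] = members.mean(axis=1)
    return D, assign

def kmeans_test(D, y):
    """Testing phase: index of the centroid closest to a new point y."""
    return int(((D - y[:, None]) ** 2).sum(axis=0).argmin())
```

For classification in the spirit described above, one would label each trained centroid (for instance by the majority class of its cluster) and give a test image the label of its closest centroid.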
(Pseudo) image compression
• Per pixel: instead of storing an (R, G, B) triple, store the index of the closest centroid.
• This is also called vector quantization.
(Image taken from Bishop's Pattern Recognition.)

K-means: notation
• For K centroids, the centroid matrix is $D = [\,d_1 \,|\, \dots \,|\, d_K\,]$.
• For P points, the data matrix is $Y = [\,y_1 \,|\, \dots \,|\, y_P\,]$.
• For P assignments, the assignment matrix is $X = [\,x_1 \,|\, \dots \,|\, x_P\,]$.
• With K = 3 centroids, example assignment vectors look like $x_1 = (0, 1, 0)^T$, $x_{10} = (1, 0, 0)^T$, $x_{27} = (0, 0, 1)^T$: each is a standard basis vector marking the chosen centroid.
• Notice: each column of X is the assignment of one data point; each row of X collects all assignments to a single centroid.
• $C_k$ denotes the kth cluster and $|C_k|$ its cardinality.

K-means algorithm
Input: Y, initial D and X. Output: final D and X.
Until convergence:
  for $p = 1, \dots, P$ (update assignments)
    $k^* = \operatorname*{argmin}_{k} \|y_p - d_k\|_2$, then $x_p \leftarrow e_{k^*}$
  for $k = 1, \dots, K$ (update centroids)
    $d_k \leftarrow \frac{1}{|C_k|} \sum_{p \in C_k} y_p$

K-means algorithm, with both updates written as optimization problems
Input: Y, initial D and X. Output: final D and X.
Until convergence:
  for $p = 1, \dots, P$ (update assignments)
    $\operatorname*{argmin}_{x_p \in \{e_1, \dots, e_K\}} \|y_p - Dx_p\|_2$
  for $k = 1, \dots, K$ (update centroids)
    $\operatorname*{argmin}_{d_k} \|Y - d_k x^{(k)}\|_F$, where $x^{(k)}$ is the kth row of X

Why the centroid update is just an average
• For example, say we have 4 data points $y_1, y_2, y_3, y_4$ and 2 centroids $d_1, d_2$. Then $x^{(1)} = (1, 1, 0, 0)$ records which points are assigned to the first centroid, and $d_1 x^{(1)} = [\,d_1 \,|\, d_1 \,|\, 0 \,|\, 0\,]$.
• In $\operatorname*{argmin}_{d_1} \|Y - d_1 x^{(1)}\|_F$, the only terms that depend on $d_1$ are $\|y_1 - d_1\|_2^2 + \|y_2 - d_1\|_2^2$.
• What value of $d_1$ minimizes these squared distances? The average of $y_1$ and $y_2$.
• Generally, $\operatorname*{argmin}_{d_k} \|Y - d_k x^{(k)}\|_F = \frac{1}{|C_k|} \sum_{p \in C_k} y_p$.
• As we generalize K-means, the form of the centroid update problem will not change; only its explicit solution (the cluster average) will.
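As a quick numerical check of the claim that the cluster average solves the centroid update, here is a tiny experiment; the two points are made up purely for illustration.

```python
import numpy as np

# Hypothetical 2-D points assigned to the first centroid (x^(1) = (1, 1, 0, 0)).
y1, y2 = np.array([1.0, 2.0]), np.array([3.0, 0.0])

def cluster_cost(d):
    # Sum of squared distances from a candidate centroid d to its assigned points.
    return np.sum((y1 - d) ** 2) + np.sum((y2 - d) ** 2)

d_mean = (y1 + y2) / 2  # the claimed minimizer: the cluster average
rng = np.random.default_rng(0)
perturbed = [d_mean + 0.5 * rng.standard_normal(2) for _ in range(1000)]
assert all(cluster_cost(d_mean) <= cluster_cost(d) for d in perturbed)
print("cost at the cluster average:", cluster_cost(d_mean))
```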
K-means as a single optimization problem
$$\operatorname*{argmin}_{X,\,D} \; \|Y - DX\|_F \quad \text{subject to} \quad x_p \in \{e_1, \dots, e_K\}, \; p = 1, \dots, P$$
• How would we solve this greedily? Alternate:
  • Repeat until convergence:
    • Fix D, update each column of X.
    • Fix X, update each column of D.
• This alternation is exactly the training phase above: the assignment update $\operatorname*{argmin}_{x_p \in \{e_1, \dots, e_K\}} \|y_p - Dx_p\|_2$ for $p = 1, \dots, P$, followed by the centroid update $\operatorname*{argmin}_{d_k} \|Y - d_k x^{(k)}\|_F$ for $k = 1, \dots, K$.

Testing phase
• For a new data point $y$, find its closest centroid:
$$\operatorname*{argmin}_{x} \; \|y - Dx\|_2 \quad \text{subject to} \quad x \in \{e_1, \dots, e_K\}$$

But why limit ourselves?
• Why limit ourselves to a weight-one assignment? And why only one centroid per point?
• Why not allow up to S centroids per point, each with an arbitrary weight? That is, replace the constraint $x_p \in \{e_1, \dots, e_K\}$ with $\|x_p\|_0 \le S$.
• The assignment vectors then look like $x_1 = (1.2,\, 4.2,\, 0)^T$, $x_{10} = (2.6,\, 0,\, 0.1)^T$, $x_{27} = (0,\, 1.6,\, 0.3)^T$ instead of standard basis vectors.

Tweaked K-means algorithm (one weighted centroid per point)
Input: Y, initial D and X. Output: final D and X.
Until convergence:
  for $p = 1, \dots, P$ (update assignments)
    $(k^*, \alpha^*) = \operatorname*{argmin}_{k,\,\alpha} \|y_p - \alpha\, d_k\|_2$, then $x_p \leftarrow \alpha^* e_{k^*}$
  for $k = 1, \dots, K$ (update centroids)
    $\operatorname*{argmin}_{d_k} \|Y - d_k x^{(k)}\|_F$

Tweaked K-means algorithm (greedy, up to S weighted centroids per point)
Input: Y, initial D and X. Output: final D and X.
Until convergence:
  for $p = 1, \dots, P$ (update assignments)
    $r_p = y_p$, $L = \emptyset$, $x_p = \mathbf{0}$
    for $s = 1, \dots, S$
      $(k^*, \alpha^*) = \operatorname*{argmin}_{\alpha,\; k \notin L} \|r_p - \alpha\, d_k\|_2$
      $r_p \leftarrow r_p - \alpha^* d_{k^*}$
      $L \leftarrow L \cup \{k^*\}$
      $x_p \leftarrow x_p + \alpha^* e_{k^*}$
  for $k = 1, \dots, K$ (update centroids)
    $\operatorname*{argmin}_{d_k} \|Y - d_k x^{(k)}\|_F$

G-Kmeans algorithm
Training phase:
$$\min_{X,\,D} \; \|Y - DX\|_F \quad \text{subject to} \quad \|x_p\|_0 \le S, \; p = 1, \dots, P$$
Testing phase:
$$\operatorname*{argmin}_{x} \; \|y - Dx\|_2 \quad \text{subject to} \quad \|x\|_0 \le S$$

Flashback: K-means algorithm
Training phase:
$$\operatorname*{argmin}_{X,\,D} \; \|Y - DX\|_F \quad \text{subject to} \quad x_p \in \{e_1, \dots, e_K\}, \; p = 1, \dots, P$$
Testing phase:
$$\operatorname*{argmin}_{x} \; \|y - Dx\|_2 \quad \text{subject to} \quad x \in \{e_1, \dots, e_K\}$$

Where do we go from here?
• Greedy methods attack the $\ell_0$-constrained problem directly: $\min_{X,\,D} \|Y - DX\|_F$ subject to $\|x_p\|_0 \le S$.
• Relaxed methods replace the constraint with its convex surrogate: $\min_{X,\,D} \|Y - DX\|_F$ subject to $\|x_p\|_1 \le S$.
• See the references at the end of this presentation!
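To tie the greedy branch back to code, here is a minimal NumPy sketch of the $\ell_0$-constrained training loop: the greedy (matching-pursuit-style) assignment update from the tweaked K-means algorithm, alternated with a per-column least-squares update of D. The initialization, iteration counts, lack of column renormalization, and the least-squares dictionary update are all assumptions made for this sketch; K-SVD (referenced below) instead updates each column of D and its nonzero coefficients jointly via an SVD.

```python
import numpy as np

def sparse_code_greedy(y, D, S):
    """Greedy assignment update: pick up to S columns of D, each with an
    arbitrary weight, to approximate y."""
    x = np.zeros(D.shape[1])
    r = y.astype(float)
    used = set()
    for _ in range(S):
        best = None
        for k in range(D.shape[1]):
            if k in used:
                continue
            alpha = (D[:, k] @ r) / (D[:, k] @ D[:, k])  # best weight for column k
            cost = np.sum((r - alpha * D[:, k]) ** 2)
            if best is None or cost < best[0]:
                best = (cost, k, alpha)
        _, k, alpha = best
        r -= alpha * D[:, k]          # remove this column's influence
        used.add(k)
        x[k] += alpha
    return x

def gkmeans(Y, K, S, n_iters=20, seed=0):
    """Alternate the greedy assignment update with a per-column update of D."""
    rng = np.random.default_rng(seed)
    N, P = Y.shape
    D = Y[:, rng.choice(P, size=K, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0)    # start from unit-norm columns
    X = np.zeros((K, P))
    for _ in range(n_iters):
        # Update assignments (columns of X), one data point at a time.
        for p in range(P):
            X[:, p] = sparse_code_greedy(Y[:, p], D, S)
        # Update centroids (columns of D) by least squares on each column's residual.
        for k in range(K):
            if np.any(X[k, :] != 0):
                E_k = Y - D @ X + np.outer(D[:, k], X[k, :])  # residual without column k
                D[:, k] = (E_k @ X[k, :]) / (X[k, :] @ X[k, :])
    return D, X
```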
Flashback to section 1
• The G-Kmeans testing phase is exactly the sparse coding problem from the compression discussion:
$$\min_{x} \; \|y - Dx\|_2 \quad \text{subject to} \quad \|x\|_0 \le S$$
(Image adapted from Irina Rish's Sparse Statistical Models course.)

A TRIED AND TRUE APPLICATION OF G-KMEANS IN IMAGE PROCESSING

Tried and true applications of dictionary learning
• Inpainting
• Denoising
• Super-resolution

Inpainting
• Train a dictionary D that can represent image blocks well as a sparse sum of its columns, so that $y \approx Dx$ with $\|x\|_0 \le S$.
• In the example shown: D has 441 columns (centroids); the training set is 11,000 8x8 blocks drawn from a database of images; and $\|x\|_0 \le 10$ is used for both training and testing.
(Inpainting results from Elad's K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation.)

QUESTIONS?
Jeremy Watt, jermwatt@gmail.com
Reza Borhani, borhani@u.northwestern.edu
Image and Video Processing Lab at Northwestern University
http://ivpl.eecs.northwestern.edu

References and further reading
• Francis Bach et al., Sparse Coding and Dictionary Learning for Image Analysis, Part III: Optimization for Sparse Coding and Dictionary Learning, available at http://lear.inrialpes.fr/people/mairal/tutorial_cvpr2010/
• Michael Elad, Sparse and Redundant Representations (book), and many great presentations on his website at http://www.cs.technion.ac.il/~elad/
  • For the greedy methods discussed in this presentation, see in particular Chapters 3 and 12 of Elad's book.
  • For applications to image processing, see Chapters 13-15 and the references therein.
• Elad's K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation, available at http://intranet.daiict.ac.in/~ajit_r/IT530/KSVD_IEEETSP.pdf
• Irina Rish's Sparse Statistical Models course: great presentations and problem sets, available at https://sites.google.com/site/eecs6898sparse2011/
• Hastie et al., The Elements of Statistical Learning, available from the authors as a PDF at http://www-stat.stanford.edu/~tibs/ElemStatLearn/
  • For $\ell_1$ (i.e., Lasso) recovery, see Chapters 3.2 and 3.4.