Review: Learning Bimodal Structures in Audio-Visual Data CSE 704 : Readings in Joint Visual, Lingual and Physical Models and Inference Algorithms Suren Kumar Vision and Perceptual Machines Lab 106 Davis Hall UB North Campus surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19 Summary What? Learn bimodally informative structures from audio-visual signals Why? Understand complex relationship between the inputs to different sensory modalities How? Represent audio-video signals as sparse sum of kernels consisting of audio-waveform and a spatio-temporal visual basis. Motivation: Biological Evidence, Evolution Suren Kumar January 28, 2014 Introduction 2/19 Literature I Detect synchronous co-occurrences of transient structures in different modalities. 1. Extract fixed and predefined unimodal features in audio and video stream separately 2. Analyze correlation between the resulting feature representation What is the problem with it? Suren Kumar January 28, 2014 Introduction 3/19 Literature I Detect synchronous co-occurrences of transient structures in different modalities. 1. Extract fixed and predefined unimodal features in audio and video stream separately 2. Analyze correlation between the resulting feature representation What is the problem with it? Suren Kumar January 28, 2014 Introduction 3/19 Basic Steps I Capture bimodal signal structure by shift-invariance sparse generative model. I Unsupervised learning for forming an overcomplete dictionary Suren Kumar January 28, 2014 Introduction 4/19 p th norm of a vector x in Euclidean space is defined as 1 P ||x||p = ( ni=1 |xi |p ) p Suren Kumar January 28, 2014 Background 5/19 Matching Pursuit Consider a dictionary D = (g )g ∈D in a Hilbert space H. The dictionary is countable, normalized (||g || = 1), complete, but not orthogonal and possibly redundant. Sparse Approximation Problem: Given PN > 0 and f ∈ H, construct an N-term combination fN = k=1,2,..N ck gk with gk ∈ D which approximates f “at best”, and study how fast fN converges to f . Sparse P Recovery Problem: if f has an unknown representation f = g cg g with (cg )g ∈D a (possibly) sparse sequence, recover this sequence exactly or approximately from the data of f . Building an optimal sparse representation of arbitrary signals is NP-hard problem. Suren Kumar January 28, 2014 Background:Matching Pursuit 6/19 Matching Pursuit Matching pursuit is a greedy approximation to the problem. Initialization f0 = 0 Projection Step: At step k − 1, projection step is the approximation of f fk−1 = Span{g1 , ..., gk−1 } Selection Step: Choice of next element based on residual rk−1 = f − fk−1 gk = arg maxg ∈D | < rk−1 , g > | Suren Kumar January 28, 2014 Background:Matching Pursuit 7/19 Convolutional Generative Model Audio visual data s = (a, v ) , a(t), v (x, y , t) (a) (v ) Dictionary {φk }, φk = (φk (t), φk (x, y , t)) Each atom can be translated to any point in space and time using operator T(p,q,r ) (a) (v ) T(p,q,r ) = (φk (t − r ), φk (x − p, y − q, t − r )) Thus an audio-visual signal can be represented as P Pnk (a) (v ) s≈ K k=1 i=1 cki T(p,q,r )k φk , cki = (cki , cki ) i Coding Find the coefficients (sparse) Learning Learning the dictionary Suren Kumar January 28, 2014 Audio-Visual Representation 8/19 Audio-Visual Matching Pursuit Transient substructures that co-occur simultaneously are indicative of common underlying physical cause. Discrete audio-visual translation τ Signal Approximation. Start with R 0 s = s Suren Kumar January 28, 2014 Audio-Visual Representation 9/19 Audio-Visual Matching Pursuit The function φ0 and its spatio-temporal translation are chosen maximizing similarity measures C (R 0 s, φ) When to Stop? Suren Kumar January 28, 2014 Audio-Visual Representation 10/19 Audio-Visual Matching Pursuit The function φ0 and its spatio-temporal translation are chosen maximizing similarity measures C (R 0 s, φ) When to Stop? Number of iterations N or the maximum value of C between residual and dictionary elements falls below a threshold. I Similarity measure should reflect properties of human perception I Unaffected by small relative time-shifts Suren Kumar January 28, 2014 Audio-Visual Representation 10/19 Choosing the norm Suren Kumar January 28, 2014 Audio-Visual Representation 11/19 Learning Find the kernel functions from a given set of audio-visual data. Assuming the noise to be gaussian Suren Kumar January 28, 2014 Learning:Gradient Ascent Learning 12/19 Learning Find the kernel functions from a given set of audio-visual data. Assuming the noise to be gaussian Suren Kumar January 28, 2014 Learning:Gradient Ascent Learning 12/19 Learning Update: φk [j] = φk [j − 1] + ηδφk Suren Kumar January 28, 2014 Learning:Gradient Ascent Learning 13/19 Learning Key Idea : Update only one atom φk Update each function with principal component of the residual errors. In general has faster convergence compared to GA. Suren Kumar January 28, 2014 Learning:K-SVD algorithm 14/19 Synthetic Data Audio - 3 sine waves and video has four black shapes Suren Kumar January 28, 2014 Experiments 15/19 Audio-Visual Speech Speaker uttering the digits from zero to nine in english Study convergence for 1 and 2 norm with GA and K-SVD I C1 encodes joint audio-visual structures better. I K-SVD is faster compared to GA. Suren Kumar January 28, 2014 Experiments 16/19 Suren Kumar January 28, 2014 Experiments 17/19 Sound Source Localization CUAVE - Visual Distractor and Acoustin Distractor I I I I Filter Audio Corresponding audio filter audio function and store the maximum projection Filter with corresponding video function and store maximum projection. Cluster video position Suren Kumar January 28, 2014 Experiments 18/19 Summary I Audio - Visual Matching Pursuit I Similarity measures and learning algorithms I Testing for speaker localization with distractors. Suren Kumar January 28, 2014 Experiments 19/19 Reference I http://www.math.tamu.edu/ popov/Learning/Cohen.pdf Suren Kumar January 28, 2014 Experiments 20/19