Review: Learning Bimodal Structures in Audio-Visual Data Physical Models and Inference Algorithms

advertisement
Review: Learning Bimodal Structures
in Audio-Visual Data
CSE 704 : Readings in Joint Visual, Lingual and
Physical Models and Inference Algorithms
Suren Kumar
Vision and Perceptual Machines Lab
106 Davis Hall
UB North Campus
surenkum@buffalo.edu
January 28, 2014
Suren Kumar
January 28, 2014
1/19
Summary
What? Learn bimodally informative structures from audio-visual
signals
Why? Understand complex relationship between the inputs to
different sensory modalities
How? Represent audio-video signals as sparse sum of kernels
consisting of audio-waveform and a spatio-temporal visual
basis.
Motivation: Biological Evidence, Evolution
Suren Kumar
January 28, 2014
Introduction
2/19
Literature
I
Detect synchronous co-occurrences of transient structures in
different modalities.
1. Extract fixed and predefined unimodal features in audio and
video stream separately
2. Analyze correlation between the resulting feature
representation
What is the problem with it?
Suren Kumar
January 28, 2014
Introduction
3/19
Literature
I
Detect synchronous co-occurrences of transient structures in
different modalities.
1. Extract fixed and predefined unimodal features in audio and
video stream separately
2. Analyze correlation between the resulting feature
representation
What is the problem with it?
Suren Kumar
January 28, 2014
Introduction
3/19
Basic Steps
I
Capture bimodal signal structure by shift-invariance sparse
generative model.
I
Unsupervised learning for forming an overcomplete dictionary
Suren Kumar
January 28, 2014
Introduction
4/19
p th norm of a vector x in Euclidean space is defined as
1
P
||x||p = ( ni=1 |xi |p ) p
Suren Kumar
January 28, 2014
Background
5/19
Matching Pursuit
Consider a dictionary D = (g )g ∈D in a Hilbert space H.
The dictionary is countable, normalized (||g || = 1), complete, but
not orthogonal and possibly redundant.
Sparse Approximation Problem: Given
PN > 0 and f ∈ H,
construct an N-term combination fN = k=1,2,..N ck gk with
gk ∈ D which approximates f “at best”, and study how fast fN
converges to f .
Sparse
P Recovery Problem: if f has an unknown representation
f = g cg g with (cg )g ∈D a (possibly) sparse sequence, recover
this sequence exactly or approximately from the data of f .
Building an optimal sparse representation of arbitrary signals is
NP-hard problem.
Suren Kumar
January 28, 2014
Background:Matching Pursuit
6/19
Matching Pursuit
Matching pursuit is a greedy approximation to the problem.
Initialization f0 = 0
Projection Step: At step k − 1, projection step is the
approximation of f
fk−1 = Span{g1 , ..., gk−1 }
Selection Step: Choice of next element based on residual
rk−1 = f − fk−1
gk = arg maxg ∈D | < rk−1 , g > |
Suren Kumar
January 28, 2014
Background:Matching Pursuit
7/19
Convolutional Generative Model
Audio visual data s = (a, v ) , a(t), v (x, y , t)
(a)
(v )
Dictionary {φk }, φk = (φk (t), φk (x, y , t))
Each atom can be translated to any point in space and time using
operator T(p,q,r )
(a)
(v )
T(p,q,r ) = (φk (t − r ), φk (x − p, y − q, t − r ))
Thus an audio-visual signal can be represented as
P
Pnk
(a) (v )
s≈ K
k=1
i=1 cki T(p,q,r )k φk , cki = (cki , cki )
i
Coding Find the coefficients (sparse)
Learning Learning the dictionary
Suren Kumar
January 28, 2014
Audio-Visual Representation
8/19
Audio-Visual Matching Pursuit
Transient substructures that co-occur simultaneously are indicative
of common underlying physical cause.
Discrete audio-visual translation τ
Signal Approximation. Start with R 0 s = s
Suren Kumar
January 28, 2014
Audio-Visual Representation
9/19
Audio-Visual Matching Pursuit
The function φ0 and its spatio-temporal translation are chosen
maximizing similarity measures C (R 0 s, φ)
When to Stop?
Suren Kumar
January 28, 2014
Audio-Visual Representation
10/19
Audio-Visual Matching Pursuit
The function φ0 and its spatio-temporal translation are chosen
maximizing similarity measures C (R 0 s, φ)
When to Stop?
Number of iterations N or the maximum value of C between
residual and dictionary elements falls below a threshold.
I Similarity measure should reflect properties of human
perception
I Unaffected by small relative time-shifts
Suren Kumar
January 28, 2014
Audio-Visual Representation
10/19
Choosing the norm
Suren Kumar
January 28, 2014
Audio-Visual Representation
11/19
Learning
Find the kernel functions from a given set of audio-visual data.
Assuming the noise to be gaussian
Suren Kumar
January 28, 2014
Learning:Gradient Ascent Learning
12/19
Learning
Find the kernel functions from a given set of audio-visual data.
Assuming the noise to be gaussian
Suren Kumar
January 28, 2014
Learning:Gradient Ascent Learning
12/19
Learning
Update:
φk [j] = φk [j − 1] + ηδφk
Suren Kumar
January 28, 2014
Learning:Gradient Ascent Learning
13/19
Learning
Key Idea : Update only one atom φk
Update each function with principal component of the residual
errors. In general has faster convergence compared to GA.
Suren Kumar
January 28, 2014
Learning:K-SVD algorithm
14/19
Synthetic Data
Audio - 3 sine waves and video has four black shapes
Suren Kumar
January 28, 2014
Experiments
15/19
Audio-Visual Speech
Speaker uttering the digits from zero to nine in english
Study convergence for 1 and 2 norm with GA and K-SVD
I
C1 encodes joint audio-visual structures better.
I
K-SVD is faster compared to GA.
Suren Kumar
January 28, 2014
Experiments
16/19
Suren Kumar
January 28, 2014
Experiments
17/19
Sound Source Localization
CUAVE - Visual Distractor and Acoustin Distractor
I
I
I
I
Filter Audio
Corresponding audio filter audio function and store the
maximum projection
Filter with corresponding video function and store maximum
projection.
Cluster video position
Suren Kumar
January 28, 2014
Experiments
18/19
Summary
I
Audio - Visual Matching Pursuit
I
Similarity measures and learning algorithms
I
Testing for speaker localization with distractors.
Suren Kumar
January 28, 2014
Experiments
19/19
Reference
I
http://www.math.tamu.edu/ popov/Learning/Cohen.pdf
Suren Kumar
January 28, 2014
Experiments
20/19
Download