Dynamic Bayesian Networks for Multimodal Interaction
Tony Jebara
Machine Learning Lab
Columbia University
joint work with A. Howard and N. Gu
Outline
•Introduction: Multi-Modal and Multi-Person
•Bayesian Networks and the Junction Tree Algorithm
•Maximum Likelihood and Expectation Maximization
•Dynamic Bayesian Networks (HMMs, Kalman Filters)
•Hidden ARMA Models
•Maximum Conditional Likelihood and Conditional EM
•Two-Person Visual Interaction (Gesture Games)
•Input-Output Hidden Markov Models
•Audio-Visual Interaction (Conversation)
•Intractable DBNs, Minimum Free Energy, Generalized EM
•Dynamical System Trees
•Multi-Person Visual Interaction (Football Plays)
•Haptic-Visual Modeling (Surgical Drills)
•Ongoing Directions
Introduction
•Simplest dynamical systems: a single Markovian process
•Hidden Markov Model and Kalman Filter
•But multi-modal data (audio, video and haptics) have:
 •processes at different time scales
 •processes at different amplitude scales
 •processes with different noise characteristics
•Also, multi-person data (multi-limb, two-person, group) are:
 •weakly coupled
 •conditionally dependent
•Dangerous to slam all the time data into one single series
•Find new ways to zipper multiple interacting processes
Bayesian Networks
[Figure: example directed graph over nodes x1,…,x6]
•Also called Graphical Models
•Marry graph theory & statistics
•Directed graph which efficiently encodes a large p(x1,…,xN) as a
 product of conditionals of each node given its parents:
 $p(x_1, \ldots, x_N) = \prod_{i=1}^{N} p(x_i \mid \pi_i)$
•Avoids storing huge hypercube over all variables x1,…,xN
•Here, xi discrete (multinomial) or continuous (Gaussian)
•Split BNs over sets of hidden XH and observed XV variables
•Three basic operations for BNs $p(X_V, X_H \mid \theta)$ (a toy sketch follows below):
 1) Infer marginals/conditionals of hidden variables (JTA): $p(X_H \mid X_V, \theta)$
 2) Compute likelihood of data (JTA): $\sum_{X_H} p(X_H, X_V \mid \theta)$
 3) Maximize likelihood of data (EM): $\max_\theta \sum_{X_H} p(X_H, X_V \mid \theta)$
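To make the factorization concrete, here is a minimal sketch (toy numbers, not from the talk) of a three-node chain x1 → x2 → x3 stored as conditional probability tables: the joint is the product of each node's conditional given its parent, and marginalizing out the "hidden" middle node follows directly.

```python
# Minimal sketch (toy numbers): a Bayesian network x1 -> x2 -> x3 over binary
# variables, stored as conditional probability tables. The joint is the
# product of each node's conditional given its parents, so the full
# hypercube over all variables is never stored explicitly.
import itertools

p_x1 = {0: 0.6, 1: 0.4}                      # p(x1)
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3},        # p(x2 | x1)
                 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1},        # p(x3 | x2)
                 1: {0: 0.5, 1: 0.5}}

def joint(x1, x2, x3):
    """p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2)."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# Sanity check: the factorized joint sums to one over all configurations.
total = sum(joint(*xs) for xs in itertools.product([0, 1], repeat=3))
print(round(total, 6))  # 1.0

# Marginalizing over a "hidden" x2 gives p(x1, x3) = sum_x2 p(x1, x2, x3).
p_x1_x3 = {(a, c): sum(joint(a, b, c) for b in (0, 1))
           for a in (0, 1) for c in (0, 1)}
print(p_x1_x3)
```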
Bayes Nets to Junction Trees
•Workhorse of BNs is Junction Tree Algorithm
[Figure: converting the example network in four steps —
 1) Bayes Net   2) Moral Graph   3) Triangulated Graph
 4) Junction Tree with cliques {x1,x2,x3}, {x2,x3,x5}, {x2,x5,x6}, {x2,x4}
    joined through separators {x2,x3}, {x2,x5}, {x2}]
Junction Tree Algorithm
•The JTA sends messages from cliques through separators
 (these are just tables or potential functions)
•Ensures that the various tables in the junction tree agree
 (are consistent) over shared variables via their marginals.
Cliques $V = \{A,B\}$ and $W = \{B,C\}$ with separator $S = \{B\}$:

If the cliques agree:
 $\sum_{V \setminus S} \psi_V \;=\; \phi_S \;=\; p(S) \;=\; \sum_{W \setminus S} \psi_W$

Else, send a message from V to W:
 $\phi_S^{*} = \sum_{V \setminus S} \psi_V, \qquad \psi_W^{*} = \frac{\phi_S^{*}}{\phi_S}\,\psi_W, \qquad \psi_V^{*} = \psi_V$

Then send a message from W to V:
 $\phi_S^{**} = \sum_{W \setminus S} \psi_W^{*}, \qquad \psi_V^{**} = \frac{\phi_S^{**}}{\phi_S^{*}}\,\psi_V^{*}, \qquad \psi_W^{**} = \psi_W^{*}$

Then the cliques agree (a toy numeric example follows).
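A toy numeric version of the update above (the potentials and separator values are made-up tables, not from the talk): after one message in each direction, both cliques report the same marginal over the shared variable B.

```python
# Minimal sketch (toy potentials): one round of junction-tree message passing
# between cliques V = {A,B} and W = {B,C} through separator S = {B}, using
# exactly the update rules on the slide.
import numpy as np

psi_V = np.array([[0.3, 0.7],     # psi_V[a, b], unnormalized
                  [0.6, 0.4]])
psi_W = np.array([[0.5, 0.5],     # psi_W[b, c], unnormalized
                  [0.1, 0.9]])
phi_S = np.ones(2)                # separator potential over B

# Message V -> W: marginalize A out of psi_V, rescale psi_W.
phi_S_star = psi_V.sum(axis=0)                       # sum over A
psi_W = (phi_S_star / phi_S)[:, None] * psi_W
phi_S = phi_S_star

# Message W -> V: marginalize C out of psi_W, rescale psi_V.
phi_S_star2 = psi_W.sum(axis=1)                      # sum over C
psi_V = psi_V * (phi_S_star2 / phi_S)[None, :]
phi_S = phi_S_star2

# After both messages the cliques agree on the shared variable B.
print(psi_V.sum(axis=0))   # marginal over B from clique V
print(psi_W.sum(axis=1))   # marginal over B from clique W (identical)
```

On a whole junction tree, the collect and distribute passes of the next slide are just this two-way exchange applied along every edge.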
Junction Tree Algorithm
•On trees, JTA is guaranteed: 1) Init  2) Collect  3) Distribute
•Ends with potentials as marginals or conditionals of hidden variables
 given the data: p(Xh1|Xv), p(Xh2|Xv), p(Xh1,Xh2|Xv)
•And the likelihood p(Xv) is the potential normalizer
Maximum Likelihood with EM
•We wish to maximize the likelihood over $\theta$ for learning:
 $\max_\theta \sum_{X_H} p(X_H, X_V \mid \theta)$
•EM instead iteratively maximizes a lower bound on the log-likelihood:
 $\log \sum_{X_H} p(X_H, X_V \mid \theta) \;\ge\; \sum_{X_H} q(X_H)\,\log \frac{p(X_H, X_V \mid \theta)}{q(X_H)} \;=\; \mathcal{L}(q,\theta)$
 where the gap is $\mathrm{KL}\big(q(X_H)\,\|\,p(X_H \mid X_V, \theta)\big)$
[Figure: the bound $\mathcal{L}(q,\theta)$ touching the log-likelihood $\ell(\theta)$ at $\theta^t$]
•E-step:  $q^{t+1} = \arg\max_q \mathcal{L}(q, \theta^t) = p(X_H \mid X_V, \theta^t)$
•M-step:  $\theta^{t+1} = \arg\max_\theta \mathcal{L}(q^{t+1}, \theta)$
 (a toy EM sketch follows)
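A minimal EM sketch on a stand-in model (a 1-D mixture of two Gaussians, chosen only to keep the example short and not a model from the talk): the E-step sets q(X_H) to the posterior over the hidden component, and the M-step re-maximizes the bound over the parameters.

```python
# Minimal EM sketch on a toy 1-D mixture of two Gaussians.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0]); var = np.array([1.0, 1.0]); pi = np.array([0.5, 0.5])

def gauss(v, m, s2):
    return np.exp(-(v - m) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

for _ in range(50):
    # E-step: q(h | x) proportional to pi_h N(x | mu_h, var_h)
    resp = pi * gauss(x[:, None], mu, var)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood
    Nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(x)

print(mu.round(2), var.round(2), pi.round(2))  # means should approach (-2, 3)
```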
Dynamic Bayes Nets
•Dynamic Bayesian Networks are BNs unrolled in time
•Simplest and most classical examples are:
Hidden Markov Model (discrete states s0 → s1 → s2 → s3, emissions y0 … y3):
 State Transition Model:
  $P(S_{t+1} = i \mid S_t = j) = \alpha(i,j)$
  $P(S_0 = i) = \pi(i)$
 Emission Model:
  $P(Y_t = i \mid S_t = j) = \beta(i,j)$
  or $P(Y_t = y_t \mid S_t = i) = \mathcal{N}(y_t \mid \mu_i, \Sigma_i)$

Linear Dynamical System (Kalman filter; continuous states x0 → x1 → x2 → x3, emissions y0 … y3):
 State Transition Model:
  $P(X_{t+1} = x_{t+1} \mid X_t = x_t) = \mathcal{N}(x_{t+1} \mid A x_t, Q)$
  $P(X_0 = x_0) = \mathcal{N}(x_0 \mid \mu_0, Q_0)$
 Emission Model:
  $P(Y_t = y_t \mid X_t = x_t) = \mathcal{N}(y_t \mid C x_t, R)$

(A toy HMM forward-pass sketch follows.)
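A toy sketch of the HMM side (all parameters are made up): the forward recursion that inference reduces to on this chain, computing the data likelihood by summing over hidden state paths.

```python
# Minimal sketch (toy parameters): the HMM forward pass, computing
# p(y_0..y_T) under the transition matrix alpha(i,j), the discrete emission
# matrix beta(i,j) and the initial distribution pi(i) defined on the slide.
import numpy as np

alpha = np.array([[0.9, 0.1],     # alpha[j, i] = P(S_{t+1}=i | S_t=j)
                  [0.2, 0.8]])
beta = np.array([[0.7, 0.3],      # beta[j, i]  = P(Y_t=i  | S_t=j)
                 [0.1, 0.9]])
prior = np.array([0.5, 0.5])      # pi(i) = P(S_0=i)

def likelihood(obs):
    """Forward recursion: f_t(i) = P(y_0..y_t, S_t=i)."""
    f = prior * beta[:, obs[0]]
    for y in obs[1:]:
        f = (f @ alpha) * beta[:, y]
    return f.sum()

print(likelihood([0, 0, 1, 1, 1]))
```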
Two-Person Interaction
•Learn from two interacting people (person Y and person X) to mimic the
 interaction via a simulated person Y:
  learn from two users to get p(y|x),
  then interact with a single user via p(y|x)
•One hidden Markov model for each user…no coupling!
[Figure: two independent HMM chains s0…s3 → y0…y3, one per user]
•One time series for both users… too rigid!
DBN: Hidden ARMA Model
•Learn to imitate a behavior by watching a teacher exhibit it
 (e.g. unsupervised observation of a 2-agent interaction; e.g. track lip motion)
•Discover correlations between past action and subsequent reaction
•Estimate p(Y | past X, past Y)
DBN: Hidden ARMA Model
•Focus on predicting person Y from past of both X and Y
•Have multiple linear models of the past to the future
•Use a window for moving average (compressed with PCA)
•But, select among them using S (nonlinear)
•Here, we show only a 2nd-order moving average: predict the next Y given the
 past two Y’s, the past two X’s, the current X, and a random choice of ARMA
 linear model (a sketch of this prediction step follows)
[Figure: hidden ARMA DBN unrolled over switch states s0…s6, outputs y0…y6
 and inputs x0…x6]
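A minimal sketch of the prediction step only (dimensions, number of regimes and regression matrices are hypothetical): given the hidden switch state, the next Y is a regime-specific linear function of the past two Y's, the past two X's and the current X.

```python
# Minimal sketch (hypothetical sizes): the 2nd-order hidden ARMA prediction
# step. The discrete switch state s_t selects one of K linear regressions
# from the stacked past window to the next y.
import numpy as np

rng = np.random.default_rng(0)
d = 2                      # dimensionality of y and x (e.g. tracked coordinates)
K = 3                      # number of ARMA regimes selected by the hidden state

# One regression matrix per regime, mapping the stacked window to y_t.
W = [rng.normal(scale=0.3, size=(d, 5 * d)) for _ in range(K)]

def predict_y(s_t, y_hist, x_hist, x_t):
    """y_t = W[s_t] @ [y_{t-1}, y_{t-2}, x_t, x_{t-1}, x_{t-2}]."""
    z = np.concatenate([y_hist[-1], y_hist[-2], x_t, x_hist[-1], x_hist[-2]])
    return W[s_t] @ z

y_hist = [np.zeros(d), np.zeros(d)]
x_hist = [rng.normal(size=d), rng.normal(size=d)]
print(predict_y(1, y_hist, x_hist, rng.normal(size=d)))
```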
Hidden ARMA Features:
•Model skin color as mixture of RGB Gaussians
•Track person as mixture of spatial Gaussians
•But, want to predict only Y from X… Be discriminative
•Use maximum conditional likelihood (CEM)
Conditional EM
•Only need a conditional $p(Y_V \mid X_V, \theta)$?
•Then maximize the conditional likelihood (a sketch follows below):
 $\max_\theta \; \log \sum_{X_H} p(X_H, Y_V, X_V \mid \theta) \;-\; \log \sum_{X_H} p(X_H, X_V \mid \theta)$
•EM: divide & conquer.  CEM: discriminative divide & conquer.
[Figure: two fits of the same data — one with $\ell = -8.0$, $\ell_c = -1.7$,
 the other with $\ell = -54.7$, $\ell_c = +0.4$]
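A small sketch of the objective itself on a stand-in model (a toy Gaussian-mixture joint over scalar x and y, not the talk's model): the conditional log-likelihood CEM climbs is just the joint minus the marginal, log p(y|x) = log Σ_h p(h,x,y) − log Σ_h p(h,x).

```python
# Minimal sketch (toy mixture): the conditional log-likelihood as the
# difference between the joint and marginal log-likelihoods.
import numpy as np

def log_gauss(v, m, s):
    return -0.5 * ((v - m) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))

# Mixture components h: weights, means and stds for (x, y).
w = np.array([0.5, 0.5])
mx, my = np.array([-1.0, 2.0]), np.array([0.5, -0.5])
sx, sy = np.array([1.0, 1.0]),  np.array([0.8, 0.8])

def conditional_loglik(x, y):
    log_joint = np.log(w) + log_gauss(x, mx, sx) + log_gauss(y, my, sy)
    log_marg  = np.log(w) + log_gauss(x, mx, sx)
    return np.logaddexp.reduce(log_joint) - np.logaddexp.reduce(log_marg)

print(conditional_loglik(0.3, -0.2))
```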
Conditional EM
[Figure: CEM vs. EM fits of p(y|x) and p(c|x,y);
 CEM accuracy = 100%, EM accuracy = 51%]
Conditional EM for hidden ARMA
Estimate Prediction Discriminatively/Conditionally p(future|past)
2 Users gesture to each other for a few minutes
Model: Mix of 25 Gaussians, STM: T=120, Dims=22+15
Nearest Neighbor:   1.57% RMS
Constant Velocity:  0.85% RMS
Hidden ARMA:        0.64% RMS
Hidden ARMA on Gesture
[Video stills: SCARE, WAVE and CLAP gestures]
DBN: Input-Output HMM
•Similarly, learn a person’s response to audio-video stimuli to predict Y
 (or agent A) from X (or world W)
•Wearable collects audio & video A,W
-Sony Picturebook Laptop
-2 Cameras (7 Hz) (USB & Analog)
-2 Microphones (USB & Analog)
-100 Megs per hour (10$/Gig)
DBN: Input-Output HMM
•Consider simulating the agent given the world
•A hidden Markov model on its own is insufficient since it does not
 distinguish between the input role the world has and the output
 we need to generate
•Instead, form an input-output HMM
•One IOHMM predicts the agent’s audio using all 3 past channels
•One IOHMM predicts the agent’s video
•Use CEM to learn the IOHMM discriminatively:
 $\log p(A \mid W) = \log p(A, W) - \log p(W)$
[Figure: plain HMM over agent outputs a0…a3 vs. IOHMM with world inputs
 w0…w3, both with hidden states s0…s3]
Input-Output HMM Data
Video
-Histogram lighting correction
-RGB Mixture of Gaussians to detect skin
-Face: 2000 pixels at 7Hz (X,Y,Intensity)
Audio
-Hamming Window, FFT, Equalization
-Spectrograms at 60Hz
-200 bands (Amplitude, Frequency)
Very noisy data set!
Video Representation
- Principal Components Analysis
- linear vectors in Euclidean space
- Images, spectrograms, time series X = vectors
- But vectorization is bad: the data is nonlinear
  PCA cost: $\sum_{n} \sum_{i=1}^{T} \sum_{d=1}^{D} \Big( X^n_{id} - \sum_{m=1}^{K} c_{nm} V_{m,id} \Big)^2$
- Images = collections of (X,Y,I) tuples “pixels”
- Spectrograms = collections of (A,F) tuples
…therefore...
- Corresponded Principal Components Analysis
  CPCA cost (a PCA compression sketch follows below):
  $\sum_{n} \sum_{i=1}^{T} \sum_{j=1}^{T} \sum_{d=1}^{D} M^n_{ij} \Big( X^n_{id} - \sum_{m=1}^{K} c_{nm} V_{m,jd} \Big)^2$
  where the $M^n$ are soft permutation matrices
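A minimal sketch of the ordinary-PCA baseline on synthetic data (sizes are arbitrary, not the talk's data): compress vectorized frames to K coefficients and measure the reconstruction error, the cost that the corresponded variant then reduces further by also optimizing the soft permutations M.

```python
# Minimal sketch (synthetic data): ordinary PCA compression of vectorized
# frames to K coefficients, the baseline that CPCA improves on.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 60))          # 500 frames, 60-dim vectorized features
K = 20

Xc = X - X.mean(axis=0)                 # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
coeffs = Xc @ Vt[:K].T                  # K coefficients per frame
recon = coeffs @ Vt[:K] + X.mean(axis=0)
print(np.mean((X - recon) ** 2))        # mean squared reconstruction error
```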
Video Representation
[Figure: Original vs. PCA vs. CPCA reconstructions;
 2000 XYI pixels compressed to 20 dims]
Input-Output HMM
For agent and world:
1 Loudness scalar
20 Spectro Coeffs
20 Face Coeffs
Estimate the hidden trellis from partial data
Input-Output HMM with CEM
Conditionally model
p(Agent Audio | World Audio , World Video)
p(Agent Video | World Audio, World Video)
Don’t care how well we can model world audio and video
Just as long as we can map it to agent audio or agent video
Avoids temporal scale problems too (Video 5Hz, Audio 60 Hz)
Audio IOHMM:  $\log p(A \mid W) = \log p(A, W) - \log p(W)$
[Figure: IOHMM with agent audio outputs a0…a3, hidden states s0…s3 and world
 inputs w0…w3]
CEM: 60-state, 82-dim HMM with diagonal Gaussian emissions
90,000 samples train / 36,000 samples test
Input-Output HMM with CEM
TRAINING & TESTING
 Audio and video: joint likelihood and conditional likelihood,
 EM (red) vs. CEM (blue):  99.61 vs. 100.58,  −122.46 vs. −121.26
RESYNTHESIS
 Spectrograms from eigenspace
 KD-tree on video coefficients to the closest image in training
 (point-cloud too confusing)
Input-Output HMM Results
[Video results on training and test data]
Intractable Dynamic Bayes Nets
Interaction through the output → Factorial Hidden Markov Model:
 several hidden chains s¹, s², s³ jointly generate one output sequence y0…y3
Interaction through the hidden states → Coupled Hidden Markov Model:
 two hidden chains s¹, s² with cross-coupled transitions, each with its own
 outputs y¹, y²
Intractable DBNs: Generalized EM
•As before, we use the bound on the likelihood:
 $\log \sum_{X_H} p(X_H, X_V \mid \theta) \;\ge\; \sum_{X_H} q(X_H)\,\log \frac{p(X_H, X_V \mid \theta)}{q(X_H)} \;=\; \mathcal{L}(q,\theta)$
•But the best q over the hidden variables (the one minimizing the KL) is intractable!
•Thus, restrict q to explore only factorized distributions
•EM still converges under partial E-steps and partial M-steps:
 E-step:  $q^{t+1} = \arg\max_{q \in \mathrm{FACTORIZED}} \mathcal{L}(q, \theta^t)$
 M-step:  $\theta^{t+1} = \arg\max_\theta \mathcal{L}(q^{t+1}, \theta)$
 (a toy mean-field sketch follows)
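A tiny mean-field sketch on a stand-in posterior (a toy 2-variable coupled distribution, not an fHMM or cHMM): q is restricted to a fully factorized family q(a)q(b) and coordinate-ascent updates reduce KL(q || p), the same kind of restricted E-step used for these intractable DBNs.

```python
# Minimal sketch (toy coupled posterior): mean-field coordinate ascent with a
# factorized q(a) q(b), each update set proportional to exp(E_q[log p]).
import numpy as np

# A coupled "posterior" p(a, b) over two binary hidden variables (sums to 1).
p = np.array([[0.5, 0.1],
              [0.1, 0.3]])
log_p = np.log(p)

qa = np.array([0.5, 0.5])
qb = np.array([0.5, 0.5])

def normalize(v):
    return v / v.sum()

for _ in range(50):
    # q(a) ∝ exp( E_{q(b)}[ log p(a, b) ] ), and symmetrically for q(b).
    qa = normalize(np.exp(log_p @ qb))
    qb = normalize(np.exp(qa @ log_p))

kl = sum(qa[i] * qb[j] * (np.log(qa[i] * qb[j]) - log_p[i, j])
         for i in range(2) for j in range(2))
print(qa, qb, kl)
```

The structured variant on the next slide keeps whole chains inside q instead of fully factorizing, which tightens the bound while staying tractable.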
Intractable DBNs Variational EM
•Now, the q distributions are limited to be chains
•Tractable as an iterative method
•Also known as variational EM or structured mean field
[Figure: chain-structured q approximations for the factorial HMM and the
 coupled HMM]
Dynamical System Trees
•How to handle more people and a hierarchy of coupling?
•DSTs couple processes hierarchically, e.g.:
 students -> department -> school -> university
[Figure: a DST unrolled over two time steps — root state s0, s1; community
 states s^{1,2} and s^{3,4}; leaf state chains s¹…s⁴ with emissions x¹…x⁴ and y]
Interaction through an aggregated community state.
Internal nodes are states; leaf nodes are emissions.
Any subtree is also a DST. The DST above is unrolled over 2 time steps
(a hedged sampling sketch follows).
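A heavily hedged sketch of one possible generative reading of the figure (made-up sizes and parameters, and not necessarily the DST's exact conditional structure): a root "game" chain drives two "team" chains, which in turn drive per-player leaf chains emitting Gaussian observations.

```python
# Hedged sketch (assumed generative reading, toy sizes): a two-level tree of
# state chains. Each node's state is sampled given its own previous state and
# its parent's current state; leaves emit noisy Gaussian observations.
import numpy as np

rng = np.random.default_rng(0)
S = 3          # states per node
T = 4          # time steps

def rand_trans(shape):
    m = rng.random(shape)
    return m / m.sum(axis=-1, keepdims=True)

root_trans = rand_trans((S, S))           # p(root_t | root_{t-1})
child_trans = rand_trans((S, S, S))       # p(child_t | child_{t-1}, parent_t)
means = rng.normal(size=(S, 2))           # 2-D emission means per leaf state

def sample_chain(parent_states, trans):
    states, prev = [], rng.integers(S)    # random initial state for simplicity
    for t in range(T):
        p = trans[prev, parent_states[t]] if parent_states is not None else trans[prev]
        prev = rng.choice(S, p=p)
        states.append(prev)
    return states

root = sample_chain(None, root_trans)
teams = [sample_chain(root, child_trans) for _ in range(2)]
players = [sample_chain(team, child_trans) for team in teams for _ in range(2)]
obs = [[means[s] + 0.1 * rng.normal(size=2) for s in chain] for chain in players]
print(root, teams[0], np.round(obs[0][0], 2))
```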
Dynamical System Trees
•Also apply the generalization of EM and do variational structured
 mean field for the q distribution.
•Becomes formulaic for any DST topology!
•Code available at http://www.cs.columbia.edu/~jebara/dst
DSTs and Generalized EM
[Figure: the DST with variational parameters introduced at each level,
 alternating "introduce v.p." and "inference" steps down the tree]
Structured Mean Field:
 Use a tractable distribution Q to approximate P
 Introduce variational parameters
 Find min KL(Q || P)
DSTs for American Football
Initial frame of a typical play
Trajectories of players
DSTs for American Football
~20 time series of two types of plays (wham and digs)
Likelihood ratio of the two models used as a classifier
DST1 puts all players into one game state
DST2 combines players into two teams and then into the game state
DSTs for Gene Networks
•Time series of the cell cycle
•Hundreds of gene expression levels over time
•Use a given hierarchical clustering
•A DST with the hierarchical clustering structure gives the best test likelihood
Robotic Surgery, Haptics & Video
•da Vinci laparoscopic robot
•Used in hundreds of hospitals
•Surgeon works at a console
•Robot mimics the movement on the (local) patient
•Captures all actuator/robot data as a 300 Hz time series
•Multi-channel video from cameras inside the patient
Robotic Surgery, Haptics & Video
64-dimensional time series @ 300 Hz: console and actuator parameters
[Figure: expert vs. novice suturing traces]
Robotic Surgical Drills Results
•Compress Haptic & Video data with PCA to 60 dims.
•Collected data from novices and experts and built several DBNs
 (IOHMMs, DSTs, etc.) of expert and novice behavior for 3 different
 drills (6 models total).
•Preliminary results:
Minefield
Russian Roulette
Suture
Conclusion
•Dynamic Bayesian networks are a natural upgrade to HMMs.
•Relevant for structured, multi-modal and multi-person temporal data.
•Several examples of dynamic Bayesian networks for
 audio, video and haptic channels;
 single, two-person and multi-person activity.
•DBNs: HMMs, Kalman Filters, hidden ARMA, input-output HMMs.
•Use max likelihood (EM) or max conditional likelihood (CEM).
•Intractable DBNs: switched Kalman filters, dynamical systems trees.
•Use max free energy (GEM) and structured mean field.
•Examples of applications:
gesture interaction (gesture games)
audio-video interaction (social conversation)
multi-person game playing (American football)
haptic-video interaction (robotic laparoscopy).
•Funding provided in part by the National Science Foundation, the
Central Intelligence Agency, Alphastar and Microsoft.