Dynamic Bayesian Networks for Multimodal Interaction
Tony Jebara, Machine Learning Lab, Columbia University
Joint work with A. Howard and N. Gu

Outline
•Introduction: Multi-Modal and Multi-Person
•Bayesian Networks and the Junction Tree Algorithm
•Maximum Likelihood and Expectation Maximization
•Dynamic Bayesian Networks (HMMs, Kalman Filters)
•Hidden ARMA Models
•Maximum Conditional Likelihood and Conditional EM
•Two-Person Visual Interaction (Gesture Games)
•Input-Output Hidden Markov Models
•Audio-Visual Interaction (Conversation)
•Intractable DBNs, Minimum Free Energy, Generalized EM
•Dynamical System Trees
•Multi-Person Visual Interaction (Football Plays)
•Haptic-Visual Modeling (Surgical Drills)
•Ongoing Directions

Introduction
•The simplest dynamical systems model a single Markovian process:
 the hidden Markov model and the Kalman filter.
•But multi-modal data (audio, video and haptics) have:
 •different time-scale processes
 •different amplitude-scale processes
 •different noise-characteristic processes
•Also, multi-person data (multi-limb, two-person, group) are:
 •weakly coupled
 •conditionally dependent
•It is dangerous to slam all the data into one single time series:
 find new ways to zipper multiple interacting processes together.

Bayesian Networks
•Also called graphical models: they marry graph theory and statistics.
•A directed graph efficiently encodes a large joint $p(x_1,\dots,x_N)$ as a
 product of conditionals of each node given its parents:
   $p(x_1,\dots,x_N) = \prod_{i=1}^{N} p(x_i \mid \pi_i)$
•This avoids storing a huge hypercube over all variables $x_1,\dots,x_N$.
•Here the $x_i$ are discrete (multinomial) or continuous (Gaussian).
•Split a BN's variables into hidden $X_H$ and observed $X_V$ sets.
•Three basic operations for BNs:
 1) Infer marginals/conditionals of hidden variables (JTA): $p(X_H \mid X_V, \theta)$
 2) Compute the likelihood of the data (JTA): $\sum_{X_H} p(X_H, X_V \mid \theta)$
 3) Maximize the likelihood of the data (EM): $\max_\theta \sum_{X_H} p(X_H, X_V \mid \theta)$
[Figure: an example Bayes net over x1,…,x6]

Bayes Nets to Junction Trees
•The workhorse of BNs is the Junction Tree Algorithm. Four steps:
 1) Bayes net  2) Moral graph  3) Triangulated graph  4) Junction tree
[Figure: the example net over x1,…,x6 becomes a junction tree with cliques
 x1x2x3, x2x3x5, x2x5x6, x2x4 joined by separators x2x3, x2x5, x2]

Junction Tree Algorithm
•The JTA sends messages from cliques through separators
 (these are just tables, i.e. potential functions).
•It ensures that the various tables in the junction tree agree and are
 consistent over their shared variables (via marginals).
•Example: cliques $V = \{A,B\}$ and $W = \{B,C\}$ with separator $S = \{B\}$.
 The cliques agree if
   $\sum_{V \setminus S} \psi_V = \phi_S = p(S) = \sum_{W \setminus S} \psi_W$
 Else, send a message from V to W:
   $\phi_S^* = \sum_{V \setminus S} \psi_V, \quad \psi_W^* = \frac{\phi_S^*}{\phi_S}\,\psi_W, \quad \psi_V^* = \psi_V$
 then send a message from W to V:
   $\phi_S^{**} = \sum_{W \setminus S} \psi_W^*, \quad \psi_V^{**} = \frac{\phi_S^{**}}{\phi_S^*}\,\psi_V^*, \quad \psi_W^{**} = \psi_W^*$
 Then the cliques agree.
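To make the message pass concrete, here is a minimal numpy sketch of the two-message exchange above on toy binary tables. The potential values and names (psi_V, psi_W, phi_S) are invented for illustration and are not from the talk.

import numpy as np

# Clique potentials psi_V[a, b] and psi_W[b, c]; separator table phi_S[b].
psi_V = np.array([[0.9, 0.3],
                  [0.1, 0.7]])
psi_W = np.array([[0.5, 0.5],
                  [0.2, 0.8]])
phi_S = np.ones(2)

# Message from V to W: marginalize psi_V onto the separator, then rescale
# psi_W by the ratio of the new separator table to the old one.
phi_S_star = psi_V.sum(axis=0)                 # sum over A = V \ S
psi_W = psi_W * (phi_S_star / phi_S)[:, None]  # absorb along B
phi_S = phi_S_star

# Message from W to V: the same update in the reverse direction.
phi_S_star = psi_W.sum(axis=1)                 # sum over C = W \ S
psi_V = psi_V * (phi_S_star / phi_S)[None, :]  # absorb along B
phi_S = phi_S_star

# After both messages, the cliques agree on the shared variable B.
assert np.allclose(psi_V.sum(axis=0), psi_W.sum(axis=1))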
Junction Tree Algorithm
•On trees, the JTA is guaranteed to converge after three phases:
 1) Initialize  2) Collect  3) Distribute
•It ends with the potentials equal to marginals or conditionals of the
 hidden variables given the data: $p(X_{h1} \mid X_v)$, $p(X_{h2} \mid X_v)$,
 $p(X_{h1}, X_{h2} \mid X_v)$.
•The likelihood $p(X_v)$ is the potential normalizer.

Maximum Likelihood with EM
•We wish to maximize the likelihood over $\theta$ for learning:
   $\max_\theta \sum_{X_H} p(X_H, X_V \mid \theta)$
•EM instead iteratively maximizes a lower bound on the log-likelihood:
   $\log \sum_{X_H} p(X_H, X_V \mid \theta) = \mathcal{L}(q,\theta) + KL\big(q(X_H)\,\|\,p(X_H \mid X_V, \theta)\big) \ge \mathcal{L}(q,\theta)$
 where $\mathcal{L}(q,\theta) = \sum_{X_H} q(X_H) \log \frac{p(X_H, X_V \mid \theta)}{q(X_H)}$.
•E-step: $q^{t+1} = \arg\max_q \mathcal{L}(q, \theta^t) = p(X_H \mid X_V, \theta^t)$
•M-step: $\theta^{t+1} = \arg\max_\theta \mathcal{L}(q^{t+1}, \theta)$
[Figure: the bound $\mathcal{L}(q,\theta)$ touching the log-likelihood $l(\theta)$ at $\theta^t$]

Dynamic Bayes Nets
•Dynamic Bayesian networks are BNs unrolled in time.
•The simplest and most classical examples are:

 Hidden Markov Model
 •State transition model: $P(S_{t+1}{=}i \mid S_t{=}j) = \alpha(i,j)$, $P(S_0{=}i) = \pi(i)$
 •Emission model: $P(Y_t{=}i \mid S_t{=}j) = \beta(i,j)$
  or $P(Y_t{=}y_t \mid S_t{=}i) = \mathcal{N}(y_t \mid \mu_i, \Sigma_i)$

 Linear Dynamical System (Kalman filter)
 •State transition model: $P(X_{t+1}{=}x_{t+1} \mid X_t{=}x_t) = \mathcal{N}(x_{t+1} \mid A x_t, Q)$,
  $P(X_0{=}x_0) = \mathcal{N}(x_0 \mid \mu_0, Q_0)$
 •Emission model: $P(Y_t{=}y_t \mid X_t{=}x_t) = \mathcal{N}(y_t \mid C x_t, R)$

Two-Person Interaction
•Learn from two interacting people (person Y and person X) to mimic the
 interaction via a simulated person Y: learn p(y|x) from two users, then
 interact with a single user via p(y|x).
•One hidden Markov model for each user… no coupling!
•One time series for both users… too rigid!

DBN: Hidden ARMA Model
•Learn to imitate a behavior by watching a teacher exhibit it,
 e.g. unsupervised observation of two-agent interaction, or tracked lip motion.
•Discover correlations between past action and subsequent reaction.
•Estimate p(Y | past X, past Y).

DBN: Hidden ARMA Model
•Focus on predicting person Y from the past of both X and Y.
•Have multiple linear models from the past to the future.
•Use a window for the moving average (compressed with PCA).
•But select among the linear models using a (nonlinear) switch state S.
•Here we show only a 2nd-order moving average that predicts the next Y
 given the past two Y's, the past two X's, the current X and a random
 choice of ARMA linear model.
[Figure: switch states s0…s6 over observed series y0…y6 and x0…x6]

Hidden ARMA Features
•Model skin color as a mixture of RGB Gaussians.
•Track each person as a mixture of spatial Gaussians.
•But we want to predict only Y from X… be discriminative:
 use maximum conditional likelihood (CEM).

Conditional EM
•Only need a conditional $p(Y_V \mid X_V, \theta)$?
•Then maximize the conditional likelihood:
   $\max_\theta \; \log \sum_{X_H} p(X_H, Y_V, X_V \mid \theta) - \log \sum_{X_H} p(X_H, X_V \mid \theta)$
•EM: divide & conquer. CEM: discriminative divide & conquer.
[Figure: two fits compared by joint vs. conditional likelihood,
 $l = -8.0$, $l_c = -1.7$ versus $l = -54.7$, $l_c = +0.4$]

Conditional EM
•CEM vs. EM when only the conditional p(y|x) matters:
 CEM accuracy = 100%, EM accuracy = 51%.
[Figure: CEM p(y|x) versus EM p(y|x) fits and the resulting p(c|x,y)]
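To make the CEM objective concrete, here is a small numpy/scipy sketch, assuming an invented two-component Gaussian mixture over (x, y): it evaluates the joint score l = log p(x, y) that EM climbs and the conditional score l_c = log p(y | x) = log p(x, y) − log p(x) that CEM climbs.

import numpy as np
from scipy.stats import multivariate_normal

# Invented mixture parameters over (x, y); unit covariances.
weights = np.array([0.5, 0.5])
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
covs = [np.eye(2), np.eye(2)]

def log_p_xy(x, y):
    # Joint log-likelihood log p(x, y) under the mixture.
    p = sum(w * multivariate_normal.pdf([x, y], mean=m, cov=c)
            for w, m, c in zip(weights, means, covs))
    return np.log(p)

def log_p_x(x):
    # Marginal log p(x): each Gaussian marginalizes analytically.
    p = sum(w * multivariate_normal.pdf(x, mean=m[0], cov=c[0, 0])
            for w, m, c in zip(weights, means, covs))
    return np.log(p)

x, y = 1.0, 0.5
l = log_p_xy(x, y)    # what EM maximizes (summed over the data)
l_c = l - log_p_x(x)  # what CEM maximizes: log p(y | x)
print(f"l = {l:.2f}, l_c = {l_c:.2f}")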
Conditional EM for Hidden ARMA
•Estimate the prediction p(future | past) discriminatively/conditionally.
•Two users gesture to each other for a few minutes.
•Model: mixture of 25 Gaussians; STM: T = 120, dims = 22 + 15.
•RMS prediction error:
  Nearest Neighbor    1.57% RMS
  Constant Velocity   0.85% RMS
  Hidden ARMA         0.64% RMS

Hidden ARMA on Gesture
[Figure: synthesized responses to SCARE, WAVE and CLAP gestures]

DBN: Input-Output HMM
•Similarly, learn a person's response to audio-video stimuli in order to
 predict Y (or agent A) from X (or world W).
•A wearable collects audio and video of A and W:
 -Sony Picturebook laptop
 -2 cameras at 7 Hz (USB & analog)
 -2 microphones (USB & analog)
 -100 MB per hour ($10/GB)

DBN: Input-Output HMM
•Consider simulating the agent given the world.
•A hidden Markov model on its own is insufficient since it does not
 distinguish between the input role the world has and the output we need
 to generate.
•Instead, form an input-output HMM:
 one IOHMM predicts the agent's audio using all 3 past channels,
 and another IOHMM predicts the agent's video.
•Use CEM to learn the IOHMM discriminatively:
   $\log p(A \mid W) = \log p(A, W) - \log p(W)$
[Figure: an HMM over agent outputs a0…a3 vs. an IOHMM that also
 conditions on world inputs w0…w3]

Input-Output HMM Data
•Video:
 -histogram lighting correction
 -RGB mixture of Gaussians to detect skin
 -face: 2000 pixels at 7 Hz as (X, Y, Intensity) tuples
•Audio:
 -Hamming window, FFT, equalization
 -spectrograms at 60 Hz
 -200 bands as (Amplitude, Frequency) tuples
•A very noisy data set!

Video Representation
•Principal components analysis assumes linear vectors in Euclidean space:
 images, spectrograms and time series X become vectors, and PCA minimizes
   $\sum_{i=1}^{T} \sum_{d=1}^{D} \Big( X_n^{id} - \sum_{m=1}^{K} c_{nm} V_m^{id} \Big)^2$
•But vectorization is bad and nonlinear: images are really collections of
 (X,Y,I) "pixel" tuples and spectrograms are collections of (A,F) tuples…
 therefore…
•Corresponded principal components analysis minimizes
   $\sum_{i=1}^{T} \sum_{j=1}^{T} M_n^{ij} \sum_{d=1}^{D} \Big( X_n^{id} - \sum_{m=1}^{K} c_{nm} V_m^{jd} \Big)^2$
 where the $M_n$ are soft permutation matrices.
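The plain PCA objective above is easy to sketch. The following numpy snippet, with invented toy shapes, finds the basis V by SVD and evaluates the per-image squared reconstruction error; CPCA would additionally optimize the soft correspondence matrices M_n, which this sketch omits.

import numpy as np

rng = np.random.default_rng(0)
N, T, D, K = 50, 2000, 3, 20     # images, pixels per image, (X,Y,I) dims, components
X = rng.normal(size=(N, T * D))  # each row: one image's stacked pixel tuples

# Principal directions from the SVD of the centered data matrix.
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
V = Vt[:K]                       # K x (T*D) eigenbasis

# Coefficients c_nm and the squared reconstruction error per image.
c = (X - mu) @ V.T               # N x K coefficients
X_hat = mu + c @ V               # reconstructions
err = ((X - X_hat) ** 2).sum(axis=1)
print(err.mean())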
Video Representation
•Original vs. PCA vs. CPCA reconstructions:
 2000 (X,Y,I) pixels compressed to 20 dimensions.

Input-Output HMM
•For both agent and world: 1 loudness scalar, 20 spectrogram coefficients
 and 20 face coefficients.
•Estimate the hidden trellis from partial data.

Input-Output HMM with CEM
•Conditionally model
  p(Agent Audio | World Audio, World Video)
  p(Agent Video | World Audio, World Video)
•We don't care how well we model world audio and video, just as long as
 we can map them to agent audio or agent video.
•This also avoids temporal scale problems (video at 5 Hz, audio at 60 Hz).
•Audio IOHMM: $\log p(A \mid W) = \log p(A, W) - \log p(W)$
•CEM: 60-state, 82-dimensional HMM with diagonal Gaussian emissions;
 90,000 training samples, 36,000 test samples.

Input-Output HMM with CEM
•Training & testing:
                          EM (red)   CEM (blue)
  Joint likelihood         100.58      99.61
  Conditional likelihood  -122.46    -121.26
•Resynthesis: spectrograms are rendered from the eigenspace; a KD-tree on
 video coefficients retrieves the closest training image (a raw
 point-cloud rendering was too confusing).

Input-Output HMM Results
[Figure: resynthesis results on train and test sequences]

Intractable Dynamic Bayes Nets
•Interaction through outputs: the factorial hidden Markov model couples
 several hidden Markov chains through a shared observation.
•Interaction through hidden states: the coupled hidden Markov model links
 the chains' hidden states directly.
[Figure: factorial HMM and coupled HMM unrolled over four time steps]

Intractable DBNs: Generalized EM
•As before, we use the bound on the likelihood:
   $\log \sum_{X_H} p(X_H, X_V \mid \theta) \ge \mathcal{L}(q, \theta)$
•But the best q over the hidden variables, the one that minimizes the KL,
 is intractable!
•Thus, restrict q to explore only factorized distributions.
•EM still converges under partial E-steps and partial M-steps:
 E-step: $q^{t+1} = \arg\max_{q \in \mathrm{FACTORIZED}} \mathcal{L}(q, \theta^t)$
 M-step: $\theta^{t+1} = \arg\max_\theta \mathcal{L}(q^{t+1}, \theta)$

Intractable DBNs: Variational EM
•Now the q distributions are limited to be chains.
•This is tractable as an iterative method.
•Also known as variational EM or structured mean field.
[Figure: chain-factorized q's for the factorial HMM and the coupled HMM]

Dynamical System Trees
•How to handle more people and a hierarchy of coupling?
•DSTs consider coupling, e.g. among university staff:
 students → department → school → university.
•Interaction happens through an aggregated community state.
•Internal nodes are states; leaf nodes are emissions.
•Any subtree is also a DST.
[Figure: a DST unrolled over two time steps]

Dynamical System Trees
•Again apply the generalization of EM and use variational structured
 mean field for the q distribution.
•This becomes formulaic for any DST topology!
•Code available at http://www.cs.columbia.edu/~jebara/dst

DSTs and Generalized EM
•Structured mean field: use a tractable distribution Q to approximate P.
•Introduce variational parameters at each node, run inference, and find
 the minimum of KL(Q || P).
[Figure: variational parameters introduced down the tree, with inference
 at each stage]
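As a toy illustration of the mean-field step, and not the talk's structured mean field over whole chains, here is a sketch that approximates an invented coupled joint P over two binary states with a factorized Q = q1·q2 by coordinate descent on KL(Q || P).

import numpy as np

P = np.array([[0.40, 0.10],
              [0.10, 0.40]])  # a strongly coupled joint over (s1, s2)
q1 = np.array([0.5, 0.5])
q2 = np.array([0.9, 0.1])     # asymmetric start breaks the mode symmetry

def normalize(v):
    return v / v.sum()

for _ in range(50):
    # Mean-field updates: q1(s1) ∝ exp(E_q2[log P(s1, s2)]), then q2.
    q1 = normalize(np.exp(np.log(P) @ q2))
    q2 = normalize(np.exp(q1 @ np.log(P)))

Q = np.outer(q1, q2)                   # the factorized approximation
kl = float((Q * np.log(Q / P)).sum())  # the objective being descended
print(q1, q2, f"KL(Q||P) = {kl:.3f}")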
DSTs for American Football
[Figure: initial frame of a typical play and the players' trajectories]

DSTs for American Football
•~20 time series of two types of plays (wham and digs).
•The likelihood ratio of the two learned models is used as a classifier.
•DST1 puts all players into one game state; DST2 combines players into
 two teams and then into a game state.

DSTs for Gene Networks
•Time series of the cell cycle: hundreds of gene expression levels over time.
•Use a given hierarchical clustering.
•A DST with the hierarchical clustering structure gives the best test
 likelihood.

Robotic Surgery, Haptics & Video
•The da Vinci laparoscopic robot is used in hundreds of hospitals.
•The surgeon works at a console; the robot mimics the movement on the
 (local) patient.
•Captures all actuator/robot data as a 300 Hz time series.
•Multi-channel video from cameras inside the patient.

Robotic Surgery, Haptics & Video
[Video: the robot and console in operation]

Robotic Surgery, Haptics & Video
•64-dimensional time series at 300 Hz: console and actuator parameters.
[Figure: expert vs. novice suturing traces]

Robotic Surgical Drills Results
•Compress haptic & video data with PCA to 60 dimensions.
•Collected data from novices and experts and built several DBNs (IOHMMs,
 DSTs, etc.) of expert and novice behavior for 3 different drills
 (6 models total).
•Preliminary results on the Minefield, Russian Roulette and Suture drills.

Conclusion
•Dynamic Bayesian networks are a natural upgrade to HMMs.
•They are relevant for structured, multi-modal and multi-person temporal data.
•Several examples of dynamic Bayesian networks for audio, video and haptic
 channels in single-person, two-person and multi-person activity.
•DBNs: HMMs, Kalman filters, hidden ARMA, input-output HMMs.
•Use maximum likelihood (EM) or maximum conditional likelihood (CEM).
•Intractable DBNs: switched Kalman filters, dynamical system trees.
•Use minimum free energy (GEM) and structured mean field.
•Example applications: gesture interaction (gesture games), audio-visual
 interaction (social conversation), multi-person game playing (American
 football) and haptic-visual interaction (robotic laparoscopy).
•Funding provided in part by the National Science Foundation, the Central
 Intelligence Agency, Alphastar and Microsoft.