Variational filtering and DEM
EPSRC Symposium Workshop on Computational Neuroscience
Monday 8 – Thursday 11 December 2008

Abstract
This presentation reviews variational treatments of dynamic models that furnish time-dependent conditional densities on the path or trajectory of a system's states, and time-independent densities on its parameters. These are obtained by maximising a variational action with respect to the conditional densities. The action, or path-integral of free-energy, is a lower bound on the model's log-evidence or marginal likelihood, which is required for model selection and averaging. The approach rests on formulating the optimisation in generalised coordinates of motion. The resulting scheme can be used for online Bayesian inversion of nonlinear hierarchical dynamic causal models and is shown to outperform existing approaches, such as Kalman and particle filtering. Furthermore, it provides for inference on a model's states, parameters and hyperparameters using exactly the same principles. Free-form (variational filtering) and fixed-form (dynamic expectation maximisation; DEM) variants of the scheme are demonstrated using simulated data (bird song) and real data (from hemodynamic systems studied in neuroimaging).
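Before turning to the slides, the free-energy bound itself can be made concrete with a minimal sketch. For the toy conjugate model \theta \sim N(0,1), y|\theta \sim N(\theta,1) (an assumption chosen purely for illustration, not a model from the talk), the free energy F = \langle\ln p(y,\theta)\rangle_q - \langle\ln q\rangle_q is available in closed form for a Gaussian q, and one can check numerically that it lower-bounds the log-evidence and is tight at the exact posterior:

```python
import math

def free_energy(y, m, s2):
    """F = <ln p(y, theta)>_q - <ln q>_q for theta ~ N(0,1), y|theta ~ N(theta,1)
    and a Gaussian recognition density q = N(m, s2); all expectations are exact."""
    expected_energy = (-0.5 * math.log(2 * math.pi) - 0.5 * ((y - m) ** 2 + s2)  # <ln p(y|theta)>_q
                       - 0.5 * math.log(2 * math.pi) - 0.5 * (m ** 2 + s2))      # <ln p(theta)>_q
    neg_entropy = -0.5 * math.log(2 * math.pi * math.e * s2)                     # <ln q>_q
    return expected_energy - neg_entropy

def log_evidence(y):
    # marginalising theta gives y ~ N(0, 2), so ln p(y|m) is exact
    return -0.5 * math.log(2 * math.pi * 2) - y ** 2 / 4

y = 1.3
# any q gives a lower bound on the log-evidence ...
assert free_energy(y, 0.0, 1.0) <= log_evidence(y)
# ... and the bound is tight at the exact posterior N(y/2, 1/2)
assert abs(free_energy(y, y / 2, 0.5) - log_evidence(y)) < 1e-9
```

The gap between the two quantities is exactly the KL divergence D(q || p(theta|y,m)) in the free-energy decomposition on the next slide, which is why maximising F both approximates the posterior and bounds the evidence.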
Bayesian filtering and the brain
y = g(x^{(1)}, v^{(1)}, \theta^{(1)}) + z^{(1)}
\dot{x}^{(1)} = f(x^{(1)}, v^{(1)}, \theta^{(1)}) + w^{(1)}
v^{(1)} = \eta
[Figure: generation of data y_1 ... y_4 from hidden states x_1, x_2 and cause v_1, and their inversion, in which conditional expectations \mu^x_1, \mu^v_1, \mu^x_2 are updated by prediction errors \varepsilon^{(1)}_x, \varepsilon^{(1)}_v, \varepsilon^{(2)}_v]

Overview
\dot{\tilde{\mu}} = V(t)_u + D\tilde{\mu}
Variational learning and free-energy
Hierarchical dynamic models: generalised coordinates (dynamical priors); hierarchical forms (structural priors)
Model inversion: variational filtering (free-form); Laplace approximation and DEM (fixed-form)
Comparative evaluations: hemodynamics and bird songs

Notation (generalised coordinates of motion)
\dot{x} := dx/dt
x' := x^{[1]}
f_x := \partial f/\partial x = \nabla_x f
\tilde{x} := (x, x', x'', x''', \ldots)

Variational learning, free-energy and action
Aim: to optimise the path-integral (action) of a free-energy bound on model evidence with respect to a recognition density q
Action: \bar{F} = \int dt\, F(y, q(\vartheta))
Free-energy: F = \ln p(y|m) - D(q(\vartheta) \,\|\, p(\vartheta|y,m)) = G - H
Expected energy: G = \langle \ln p(y,\vartheta) \rangle_q = \langle U(\vartheta) \rangle_q
Entropy term: H = \langle \ln q(\vartheta) \rangle_q
When optimised, the recognition density approximates the true conditional density and the action becomes a bound on the integrated log-evidence; these can then be used for inference on parameters and models respectively.

The mean-field approximation
q(\vartheta) = \prod_i q(\vartheta^i) = q(u(t))\, q(\theta)\, q(\lambda)
Lemma: the free-energy is maximised with respect to each recognition density q(\vartheta^i) when \delta_{q(\vartheta^i)} F = 0, giving
q(u(t)) \propto \exp(V(t)),   V(t) = \langle U(t) \rangle_{q(\theta,\lambda)}
q(\theta) \propto \exp(V^\theta),   V^\theta = \int \langle U(t) \rangle_{q(u,\lambda)}\, dt + U^\theta
q(\lambda) \propto \exp(V^\lambda),   V^\lambda = \int \langle U(t) \rangle_{q(u,\theta)}\, dt + U^\lambda
where U^\theta = \ln p(\theta) and U^\lambda = \ln p(\lambda) are the prior energies, and the instantaneous energy is specified by a generative model: U(t) = \ln p(\tilde{y}, u \mid \theta, \lambda). These are the variational energies; their path-integrals are the variational actions.

Generalised coordinates and dynamic models
y = g(x, v) + z
\dot{x} = f(x, v) + w
Differentiating repeatedly gives the generalised motion of the data and states:
y = g(x,v) + z,   y' = g_x x' + g_v v' + z',   y'' = g_x x'' + g_v v'' + z'', ...
x' = f(x,v) + w,   x'' = f_x x' + f_v v' + w',   x''' = f_x x'' + f_v v'' + w'', ...
or, compactly, \tilde{y} = \tilde{g} + \tilde{z} and D\tilde{x} = \tilde{f} + \tilde{w}, with \tilde{g} = (g, g', g'', \ldots), \tilde{f} = (f, f', f'', \ldots) and the block-matrix derivative operator
D = \begin{bmatrix} 0 & 1 & & \\ & 0 & \ddots & \\ & & \ddots & 1 \\ & & & 0 \end{bmatrix} \otimes I
Likelihood: p(\tilde{y} \mid \tilde{x}, \tilde{v}) = N(\tilde{g}, \tilde{\Sigma}^z)
(Dynamical) prior: p(D\tilde{x} \mid \tilde{v}) = N(\tilde{f}, \tilde{\Sigma}^w)

Energies and generalised precisions
The instantaneous energy is a simple function of prediction error:
U(t) := \ln p(\tilde{y}, u) = \ln p(\tilde{y} \mid \tilde{x}, \tilde{v})\, p(\tilde{x} \mid \tilde{v})\, p(\tilde{v}) = \tfrac{1}{2} \ln|\tilde{\Pi}| - \tfrac{1}{2} \tilde{\varepsilon}^T \tilde{\Pi} \tilde{\varepsilon}
\tilde{\varepsilon} = \begin{bmatrix} \tilde{\varepsilon}^v \\ \tilde{\varepsilon}^x \end{bmatrix},   \tilde{\varepsilon}^v = \tilde{y} - \tilde{g},   \tilde{\varepsilon}^x = D\tilde{x} - \tilde{f},   \tilde{\Pi} = \begin{bmatrix} \tilde{\Pi}^z & \\ & \tilde{\Pi}^w \end{bmatrix}
The precisions factorise over generalised coordinates and time: \tilde{\Pi} = S(\gamma) \otimes \Pi(\lambda), where the temporal precision S(\gamma) is determined by the autocorrelation \rho of the fluctuations; in general
S(\gamma)^{-1} = \begin{bmatrix} 1 & 0 & \ddot{\rho}(0) & \cdots \\ 0 & -\ddot{\rho}(0) & 0 & \\ \ddot{\rho}(0) & 0 & \ddot{\ddot{\rho}}(0) & \\ \vdots & & & \ddots \end{bmatrix}
and for Gaussian autocorrelations
S(\gamma)^{-1} = \begin{bmatrix} 1 & 0 & -\tfrac{1}{2}\gamma & \cdots \\ 0 & \tfrac{1}{2}\gamma & 0 & \\ -\tfrac{1}{2}\gamma & 0 & \tfrac{3}{4}\gamma^2 & \\ \vdots & & & \ddots \end{bmatrix}
[Figure: images of the precision matrices \tilde{\Pi} = S(\gamma) and \tilde{\Pi} = S(\gamma) \otimes \Pi(\lambda) in generalised coordinates and time]

Hierarchical forms
y = g(x, v) + z,   \dot{x} = f(x, v) + w   =>   \tilde{y} = \tilde{g} + \tilde{z},   D\tilde{x} = \tilde{f} + \tilde{w}
v^{(i-1)} = g(x^{(i)}, v^{(i)}) + z^{(i)},   \dot{x}^{(i)} = f(x^{(i)}, v^{(i)}) + w^{(i)}   =>   \tilde{v}^{(i-1)} = \tilde{g}^{(i)} + \tilde{z}^{(i)},   D\tilde{x}^{(i)} = \tilde{f}^{(i)} + \tilde{w}^{(i)}
with unknowns \vartheta = \{u, \theta, \lambda\} and states u = \{\tilde{x}, \tilde{v}\}, and conditional densities
p(\tilde{y} \mid u^{(1)}, \vartheta^{(1)}) = N(\tilde{y}: \tilde{g}^{(1)}, \tilde{\Sigma}^{(1)z})
p(\tilde{x}^{(i)} \mid \tilde{v}^{(i)}, \vartheta^{(i)}) = N(D\tilde{x}^{(i)}: \tilde{f}^{(i)}, \tilde{\Sigma}^{(i)w})
p(\tilde{v}^{(i-1)} \mid u^{(i)}, \vartheta^{(i)}) = N(\tilde{v}^{(i-1)}: \tilde{g}^{(i)}, \tilde{\Sigma}^{(i)z})
p(\tilde{v}) = N(\tilde{v}: \eta^v, C^v)

Hierarchical forms and empirical priors
y = g(x^{(1)}, v^{(1)}) + z^{(1)},   \dot{x}^{(1)} = f(x^{(1)}, v^{(1)}) + w^{(1)}
  \vdots
v^{(i-1)} = g(x^{(i)}, v^{(i)}) + z^{(i)},   \dot{x}^{(i)} = f(x^{(i)}, v^{(i)}) + w^{(i)}
  \vdots
v^{(m)} = \eta + z^{(m+1)}
The instantaneous energy is a function of prediction error:
U(t) = \ln p(\tilde{y}, \tilde{x}, \tilde{v} \mid \theta, \lambda) = \tfrac{1}{2} \ln|\tilde{\Pi}| - \tfrac{1}{2} \tilde{\varepsilon}^T \tilde{\Pi} \tilde{\varepsilon},   \varepsilon^v = \begin{bmatrix} y - g^{(1)} \\ \vdots \\ v^{(m-1)} - g^{(m)} \\ v^{(m)} - \eta \end{bmatrix}
and the generative density factorises over levels:
p(\tilde{y}, \tilde{x}, \tilde{v}) = p(\tilde{y} \mid \tilde{x}, \tilde{v})\, p(\tilde{x}, \tilde{v})
p(\tilde{x}, \tilde{v}) = p(\tilde{v}^{(m)}) \prod_i p(\tilde{x}^{(i)} \mid \tilde{v}^{(i)})\, p(\tilde{v}^{(i)} \mid \tilde{x}^{(i+1)}, \tilde{v}^{(i+1)})
p(\tilde{x}^{(i)} \mid \tilde{v}^{(i)}) = N(D\tilde{x}^{(i)}: \tilde{f}^{(i)}, \tilde{\Sigma}^{(i)w})   -- dynamical priors (empirical)
p(\tilde{v}^{(i)} \mid \tilde{x}^{(i+1)}, \tilde{v}^{(i+1)}) = N(\tilde{v}^{(i)}: \tilde{g}^{(i+1)}, \tilde{\Sigma}^{(i)z})   -- structural priors (empirical)
p(\tilde{v}^{(m)}) = N(\tilde{v}^{(m)}: \tilde{\eta}, \tilde{\Sigma}^v)   -- priors (full)

Variational filtering (free-form)
Aim: to approximate the recognition density, given the variational (instantaneous) energy.
Lemma: q(u(t)) \propto \exp(V(t)) is the stationary solution, in a moving frame of reference, for an ensemble of particles whose equations of motion and ensemble dynamics are
\dot{u} = V(t)_u + D\tilde{\mu} + \Gamma
(componentwise: \dot{u} = V(t)_u + \mu' + \Gamma,   \dot{u}' = V(t)_{u'} + \mu'' + \Gamma, ...)
\dot{q} = \nabla_u \cdot [\nabla_u q - q \nabla_u V(t)] - \nabla_u \cdot (q D\tilde{\mu})
Proof: substituting the recognition density q \propto \exp(V(t)) gives \dot{q} = -\nabla_u \cdot (q D\tilde{\mu}). This describes a density that is stationary in a frame of reference moving with velocity D\tilde{\mu}, as seen using the coordinate transform \dot{\upsilon} = D\tilde{\mu} \Rightarrow \dot{q}(\upsilon) = \dot{q}(u) + \nabla_u \cdot (q \dot{\upsilon}) = 0.

A toy example
V(\tilde{u}, t) = -4(u - \mu)^2 - (u' - \mu')^2 - \tfrac{1}{4}(u'' - \mu'')^2
with mode dynamics \dot{\mu}(t) = \mu'(t),   \dot{\mu}'(t) = \mu''(t),   \dot{\mu}''(t) = \tfrac{1}{4}\mu(t)
and particle dynamics \dot{u} = V(t)_u + \mu' + \Gamma(t),   \dot{u}' = V(t)_{u'} + \mu'' + \Gamma(t), ...
[Figure: an ensemble of particles u(t) tracking the moving conditional mode in generalised coordinates (u, u', u''), plotted over 120 time bins]

Optimising free-energy under the Laplace approximation
Mean-field approximation: q(\vartheta) = \prod_i q(\vartheta^i) = q(u(t))\, q(\theta)\, q(\lambda)
Laplace approximation: q(\vartheta^i) = N(\mu^i, \Sigma^i)
The Laplace approximation enables us to specify the sufficient statistics of the recognition density very simply:
Conditional modes: \tilde{\mu}(t) = \arg\max_u V(t),   \mu^\theta = \arg\max_\theta V^\theta,   \mu^\lambda = \arg\max_\lambda V^\lambda
Conditional precisions: \Pi^u = -U(t)_{uu},   \Pi^\theta = -\int U(t)_{\theta\theta}\, dt - U^\theta_{\theta\theta},   \Pi^\lambda = -\int U(t)_{\lambda\lambda}\, dt - U^\lambda_{\lambda\lambda}
Under these approximations, all we need to do is optimise the conditional modes.

Fixed-form schemes ... a gradient ascent in moving coordinates
Taking the expectation of the ensemble dynamics gives
\dot{u} = V(t)_u + D\tilde{\mu} + \Gamma   =>   \dot{\tilde{\mu}} = V(t)_u + D\tilde{\mu}   <=>   \dot{\tilde{\mu}} - D\tilde{\mu} = V(t)_u
Here, \dot{\tilde{\mu}} - D\tilde{\mu} can be regarded as a gradient ascent in a frame of reference that moves along the trajectory encoded in generalised coordinates. The stationary solution, in this moving frame of reference, maximises variational action:
\dot{\tilde{\mu}} - D\tilde{\mu} = 0   =>   \partial_u V(t) = 0   <=>   \delta_u \bar{V} = 0
by the fundamental lemma; c.f., Hamilton's principle of stationary action.
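The free-form (ensemble) scheme above can be sketched numerically. Particles follow the Langevin equation du = V_u dt + sqrt(2) dW, whose Fokker-Planck equation matches the ensemble dynamics on the slide, so the sample density should converge to q(u) proportional to exp(V(u)). This minimal sketch uses the toy quadratic energy V(u) = -4(u - mu)^2 in a stationary frame (i.e. D mu-tilde = 0, an assumption made to keep the example one-dimensional); the stationary density is then Gaussian with mean mu and variance 1/8:

```python
import math
import random

random.seed(1)

mu = 1.5                          # conditional mode (held fixed here)
dV = lambda u: -8.0 * (u - mu)    # V_u for V(u) = -4 (u - mu)^2

# Euler-Maruyama integration of du = V_u dt + sqrt(2) dW for an ensemble
dt, n_steps = 0.01, 2000
particles = [random.gauss(0.0, 1.0) for _ in range(1000)]
for _ in range(n_steps):
    particles = [u + dV(u) * dt + math.sqrt(2 * dt) * random.gauss(0, 1)
                 for u in particles]

# sample statistics of the ensemble approximate the sufficient statistics of q
m = sum(particles) / len(particles)
v = sum((u - m) ** 2 for u in particles) / (len(particles) - 1)
assert abs(m - mu) < 0.05         # ensemble mean ~ mu
assert abs(v - 0.125) < 0.03      # ensemble variance ~ 1/8
```

Adding the drift D mu-tilde to each particle, as in the slides, makes the same ensemble track a moving mode; the fixed-form schemes that follow replace the ensemble by its expectation.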
Dynamic expectation maximisation
D-step (inference, q(u(t))):
\Delta\tilde{\mu} = (\exp(\Im) - I)\, \Im^{-1} (V(t)_u + D\tilde{\mu}),   \Im = V(t)_{uu} + D,   \Pi^u = -U(t)_{uu}
(local linearisation; Ozaki 1992)
E-step (learning, q(\theta)):
\Delta\mu^\theta = -(V^\theta_{\theta\theta})^{-1} V^\theta_\theta,   \Pi^\theta = -V^\theta_{\theta\theta}
M-step (uncertainty, q(\lambda)):
\Delta\mu^\lambda = -(V^\lambda_{\lambda\lambda})^{-1} V^\lambda_\lambda,   \Pi^\lambda = -V^\lambda_{\lambda\lambda}
Together these optimise q(\vartheta) = q(u(t))\, q(\theta)\, q(\lambda): a dynamic recognition system that minimises prediction error.

A linear convolution model
y = g(x^{(1)}, v^{(1)}, \theta^{(1)}) + z^{(1)}
\dot{x}^{(1)} = f(x^{(1)}, v^{(1)}, \theta^{(1)}) + w^{(1)}
v^{(1)} = \eta
[Figure: generation of data y_1 ... y_4 from hidden states x_1, x_2 and cause v_1, and inversion via prediction errors \varepsilon^{(1)}_x, \varepsilon^{(1)}_v, \varepsilon^{(2)}_v]

Variational filtering on states and causes
\dot{u}(t) = V(t)_u + D\tilde{\mu}(t) + \Gamma(t)
u(t) = [v, v', v'', \ldots, x, x', x'', \ldots]^T
[Figure: ensembles of particles inferring the cause v(t) and hidden states x(t) of the linear convolution model over 32 time bins]

Linear deconvolution with variational filtering (SDE) -- free-form
[Figure: predicted response and error, hidden states and causal states (level 2) recovered by variational filtering]

Linear deconvolution with dynamic expectation maximisation (ODE) -- fixed-form
\dot{\tilde{\mu}}(t) = V(t)_u + D\tilde{\mu}(t)
[Figure: the corresponding predicted response and error, hidden states and causal states (level 2) recovered by DEM]

The order of generalised motion
\tilde{\Pi} = S(\gamma) \otimes \Pi(\lambda): precision in generalised coordinates
[Figure: accuracy and embedding order; sum of squared error on the causal states as a function of the embedding order n = 1, ..., 13, with reconstructions of v(t) for n = 1 and n = 7]

DEM and extended Kalman filtering
[Figure: conditional estimates of the hidden states under DEM(0), DEM(4) and extended Kalman filtering (EKF) against the true values, with the sum of squared error over time; DEM converges on the EKF when \gamma \to \infty \equiv n = 1]

A nonlinear convolution model
level m = 1:   g(x, v) = \tfrac{1}{5} x^2,   f(x, v) = v - x \ln 2,   \Pi^z = e^4,   \Pi^w = e^{16}
level m = 2:   \eta^v = 2 + \sin(\tfrac{1}{16}\pi t)
This system has a slow sinusoidal input or cause that excites increases in a single hidden state. The response is a quadratic function of the hidden state (c.f., Arulampalam et al 2002).
[Figure: predicted response and error and hidden states for the nonlinear convolution model]

DEM and particle filtering
[Figure: conditional expectation and conditional covariance of the hidden states under DEM, particle filtering (PF) and EKF against the true values; the sum of squared error (log scale) shows DEM attaining the smallest error]

Triple estimation (DEM)
[Figure: inference on states and parameters with \dot{\tilde{\mu}} = V(t)_u + D\tilde{\mu}: predicted response and error, hidden states \tilde{\mu}^x, causal states \tilde{\mu}^v and learned parameters \mu^\theta against their true values]

An fMRI study of attention
Stimuli: 250 radially moving dots at 4.7 degrees/s
Pre-scanning: 5 x 30 s trials with 5 speed changes (reducing to 1%); task: detect change in radial velocity
Scanning (no speed changes): 4 x 100-scan sessions, each comprising 10 scans of 4 different conditions: F A F N F A F N S ...
A -- dots, motion and attention (detect changes); N -- dots and motion; S -- dots; F -- fixation
Data from V5 (motion-sensitive area); Buchel et al 1999

A hemodynamic model
Inputs u(t): visual input, motion and attention, entering through a convolution kernel
State equations \dot{x} = f(x, u, \theta):
signal:   \dot{s} = \theta u - \kappa s - \gamma(f - 1)
flow:   \dot{f} = s
volume:   \tau \dot{v} = f - v^{1/\alpha}
dHb:   \tau \dot{q} = f E(f, \rho)/\rho - v^{1/\alpha} q/v
Output equation y = g(x, \theta): a mixture of intra- and extravascular signal
y(t) = g(v, q) = V_0 (k_1 (1 - q) + k_2 (1 - q/v) + k_3 (1 - v))

Hemodynamic deconvolution (V5)
[Figure: inference on states; predicted response and error and hidden states over 350 time bins; causal states; and learned parameters coupling visual stimulation (vis), motion (mot) and attention (att) to the signal]

... and a closer look at the states
[Figure: hidden states (signal, flow, volume, dHb) under visual stimulation, and the inferred neuronal causes, over 140 time bins]

Synthetic song-birds
[Figure: a synthetic bird generates song with its syrinx, driven by a hierarchy of Lorenz attractors; sonogram of the stimulus, 2000-5000 Hz over 1.5 s]

Song recognition with DEM
[Figure: sonograms of the stimulus and the percept, with the inferred hidden states and second-level causes over 120 time bins]

... and broken birds
[Figure: percepts (sonograms) and simulated LFPs (micro-volts, over 2000 ms peristimulus time) for the intact bird, for a bird with no structural priors (m = 1) and for a bird with no dynamical priors (n = 1)]

Summary
\dot{\tilde{\mu}} = V(t)_u + D\tilde{\mu}
Variational learning and free-energy
Hierarchical dynamic models: generalised coordinates (dynamical priors); hierarchical forms (structural priors)
Model inversion: variational filtering (free-form); Laplace approximation and DEM (fixed-form)
Comparative evaluations: hemodynamics and bird songs
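The hemodynamic state equations used in the fMRI example can be integrated directly; this sketch uses a simple Euler scheme with E(f, rho) = 1 - (1 - rho)^{1/f}. The parameter values are typical of the hemodynamic-model literature rather than taken from these slides, so treat them as placeholders:

```python
# Hemodynamic (balloon) model: Euler integration of
#   s' = u - kappa*s - gamma*(f - 1),  f' = s,
#   tau*v' = f - v**(1/alpha),  tau*q' = f*E(f,rho)/rho - v**(1/alpha)*q/v,
# with output y = V0*(k1*(1-q) + k2*(1-q/v) + k3*(1-v)).
# Parameter values below are placeholders, not taken from the slides.
kappa, gamma, tau, alpha, rho, V0 = 0.65, 0.41, 0.98, 0.32, 0.34, 0.02
k1, k2, k3 = 7 * rho, 2.0, 2 * rho - 0.2

def step(x, u, dt):
    """One Euler step of the state equations x = (s, f, v, q)."""
    s, f, v, q = x
    E = 1 - (1 - rho) ** (1 / f)                    # oxygen extraction fraction
    ds = u - kappa * s - gamma * (f - 1)
    df = s
    dv = (f - v ** (1 / alpha)) / tau
    dq = (f * E / rho - v ** (1 / alpha) * q / v) / tau
    return (s + ds * dt, f + df * dt, v + dv * dt, q + dq * dt)

def bold(x):
    """BOLD output: mixture of intra- and extravascular signal."""
    s, f, v, q = x
    return V0 * (k1 * (1 - q) + k2 * (1 - q / v) + k3 * (1 - v))

dt = 0.01
x = (0.0, 1.0, 1.0, 1.0)                            # resting state: s=0, f=v=q=1
ys = []
for i in range(6000):                               # 60 s; 1 s input pulse at t=1 s
    u = 1.0 if 100 <= i < 200 else 0.0
    x = step(x, u, dt)
    ys.append(bold(x))

assert max(ys) > 1e-3                               # positive BOLD response to input
assert abs(ys[-1]) < 1e-3                           # which decays back to baseline
```

In DEM this forward model supplies g and f for the first level of the hierarchy; inverting it recovers the hidden states (signal, flow, volume, dHb) and neuronal causes from the measured BOLD response, as in the V5 deconvolution above.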