Variational filtering and DEM

EPSRC Symposium Workshop on Computational Neuroscience
Monday 8 – Thursday 11 December 2008
Abstract
This presentation reviews variational treatments of dynamic models that furnish time-dependent conditional densities on the path or trajectory of a system's states, and time-independent densities on its parameters. These are obtained by maximizing a variational action with respect to the conditional densities. The action, or path-integral of free-energy, represents a lower bound on the model's log-evidence (marginal likelihood) required for model selection and averaging. The approach rests on formulating the optimization in generalized coordinates of motion. The resulting scheme can be used for online Bayesian inversion of nonlinear hierarchical dynamic causal models and is shown to outperform existing approaches, such as Kalman and particle filtering. Furthermore, it provides for simultaneous inference on a model's states, parameters and hyperparameters using exactly the same principles. Free-form (variational filtering) and fixed-form (dynamic expectation maximization) variants of the scheme will be demonstrated using simulated (bird-song) and real data (from hemodynamic systems studied in neuroimaging).
Bayesian filtering and the brain

$$y = g(x^{(1)}, v^{(1)}, \theta^{(1)}) + z^{(1)}$$
$$\dot{x}^{(1)} = f(x^{(1)}, v^{(1)}, \theta^{(1)})$$
$$v^{(1)} = \eta$$

[Figure: generation maps a cause v1 through hidden states x1, x2 to responses y1–y4; inversion maps the responses back to conditional expectations µ̃ˣ, µ̃ᵛ via prediction errors ε⁽¹⁾ˣ, ε⁽¹⁾ᵛ, ε⁽²⁾ᵛ.]
Overview

$$\dot{\tilde{\mu}} = V(t)_u + D\tilde{\mu}$$

Variational learning and free-energy
Hierarchical dynamic models
  Generalised coordinates (dynamical priors)
  Hierarchical forms (structural priors)
Model inversion
  Variational filtering (free-form)
  Laplace approximation and DEM (fixed-form)
Comparative evaluations
Hemodynamics and bird songs
Notation:

$$\dot{x} := \frac{dx}{dt} \qquad x' := x[1] \qquad f_x := \frac{\partial f}{\partial x} = \nabla_x f \qquad \tilde{x} := (x, x', x'', x''', \ldots)$$
Variational learning, free-energy and action

Aim: to optimise the path-integral (action) of a free-energy bound on model evidence with respect to a recognition density q:

$$\bar{F} = \int dt\, F(y, q(\vartheta)) \;\Leftrightarrow\; \partial_t \bar{F} = F(y, q(\vartheta))$$

Free-energy:
$$F = \ln p(y \mid m) - D\big(q(\vartheta) \,\|\, p(\vartheta \mid y, m)\big) = G - H$$

Expected energy:
$$G = \langle \ln p(y, \vartheta) \rangle_q = \langle U(\vartheta) \rangle_q$$

Entropy term (note that, as defined here, H = ⟨ln q⟩_q is the negative entropy):
$$H = \langle \ln q(\vartheta) \rangle_q$$

When optimised, the recognition density approximates the true conditional density and the action becomes a bound approximation to the integrated log-evidence; these can then be used for inference on parameters and models respectively.
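To make the bound concrete, here is a minimal numerical check (not from the presentation) using a one-parameter linear-Gaussian model, where F = G − H can be estimated by Monte Carlo and compared with the exact log-evidence:

```python
import numpy as np

# Toy linear-Gaussian model (illustration only): y = theta + z with
# p(theta) = N(0, 1) and p(z) = N(0, 1), so the evidence is
# p(y|m) = N(y; 0, 2) and the posterior is p(theta|y) = N(y/2, 1/2).
y = 1.5
log_evidence = -0.5 * (np.log(2 * np.pi * 2.0) + y**2 / 2.0)

def free_energy(mu, sigma2, n=200_000, seed=0):
    """Monte Carlo estimate of F = G - H = <ln p(y,theta) - ln q(theta)>_q."""
    th = np.random.default_rng(seed).normal(mu, np.sqrt(sigma2), n)
    ln_joint = (-0.5 * (np.log(2 * np.pi) + (y - th) ** 2)   # ln p(y|theta)
                - 0.5 * (np.log(2 * np.pi) + th ** 2))       # ln p(theta)
    ln_q = -0.5 * (np.log(2 * np.pi * sigma2) + (th - mu) ** 2 / sigma2)
    return np.mean(ln_joint - ln_q)

print(log_evidence)             # exact ln p(y|m)
print(free_energy(y / 2, 0.5))  # q = true posterior: F equals ln p(y|m)
print(free_energy(0.0, 1.0))    # any other q: F is strictly smaller
```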
The mean-field approximation

$$q(\vartheta) = \prod_i q(\vartheta^i) = q(u(t))\, q(\theta)\, q(\lambda)$$

Lemma: the free energy is maximised with respect to each marginal $q(\vartheta^i)$, i.e. $\delta_{q(\vartheta^i)} \bar{F} = 0$, when

$$q(u(t)) \propto \exp(V(t)), \qquad V(t) = \langle U(t) \rangle_{q(\theta,\lambda)}$$
$$q(\theta) \propto \exp(V^\theta), \qquad V^\theta = \int \langle U(t) \rangle_{q(u,\lambda)}\, dt + U^\theta$$
$$q(\lambda) \propto \exp(V^\lambda), \qquad V^\lambda = \int \langle U(t) \rangle_{q(u,\theta)}\, dt + U^\lambda$$

where $U^\theta = \ln p(\theta)$ and $U^\lambda = \ln p(\lambda)$ are the prior energies and the instantaneous energy is specified by a generative model:

$$U(t) = \ln p(\tilde{y}, u \mid \theta, \lambda)$$

These are the variational energies and actions.
Overview

$$\dot{\tilde{\mu}} = V(t)_u + D\tilde{\mu}$$

Variational learning and free-energy
Hierarchical dynamic models
  Generalised coordinates (dynamical priors)
  Hierarchical forms (structural priors)
Model inversion
  Variational filtering (free-form)
  Laplace approximation and DEM (fixed-form)
Comparative evaluations
Hemodynamics and bird songs
Generalised coordinates and dynamic models

$$y = g(x, v) + z \qquad \dot{x} = f(x, v) + w$$

Differentiating repeatedly gives the generalised motion of the response and the states:

$$\begin{aligned} y &= g(x,v) + z & x' &= f(x,v) + w\\ y' &= g_x x' + g_v v' + z' & x'' &= f_x x' + f_v v' + w'\\ y'' &= g_x x'' + g_v v'' + z'' & x''' &= f_x x'' + f_v v'' + w''\\ &\;\vdots & &\;\vdots \end{aligned}$$

In compact form, with $g = g(x,v),\; g' = g_x x' + g_v v',\; g'' = g_x x'' + g_v v'', \ldots$ (and similarly for $\tilde{f}$):

$$\tilde{y} = \tilde{g} + \tilde{z} \qquad D\tilde{x} = \tilde{f} + \tilde{w}$$

Likelihood: $p(\tilde{y} \mid \tilde{x}, \tilde{v}) = N(\tilde{g}, \tilde{\Sigma}^z)$
(Dynamical) prior: $p(D\tilde{x} \mid \tilde{v}) = N(\tilde{f}, \tilde{\Sigma}^w)$

where D is the block-matrix derivative operator

$$D = \begin{bmatrix} 0 & 1 & & \\ & 0 & \ddots & \\ & & \ddots & 1 \\ & & & 0 \end{bmatrix} \otimes I$$
Energies and generalised precisions

The instantaneous energy is a simple function of prediction error:

$$U(t) := \ln p(\tilde{y}, u) = \ln p(\tilde{y} \mid \tilde{x}, \tilde{v})\, p(\tilde{x} \mid \tilde{v})\, p(\tilde{v}) = \tfrac{1}{2}\ln|\tilde{\Pi}| - \tfrac{1}{2}\tilde{\varepsilon}^T \tilde{\Pi}\, \tilde{\varepsilon}$$

$$\tilde{\Pi} = \begin{bmatrix} \tilde{\Pi}^z & \\ & \tilde{\Pi}^w \end{bmatrix} \qquad \tilde{\varepsilon} = \begin{bmatrix} \tilde{\varepsilon}^v = \tilde{y} - \tilde{g} \\ \tilde{\varepsilon}^x = D\tilde{x} - \tilde{f} \end{bmatrix}$$
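The operator D that appears in the dynamical prediction error ε̃ˣ = Dx̃ − f̃ is just a block shift matrix. A minimal sketch (function name illustrative):

```python
import numpy as np

def shift_operator(n, d):
    """D = S (x) I_d, where S is the n x n upper-shift matrix: D maps
    generalised coordinates (x, x', x'', ...) to (x', x'', ..., 0)."""
    S = np.diag(np.ones(n - 1), k=1)
    return np.kron(S, np.eye(d))

D = shift_operator(3, 1)             # scalar state, embedding order n = 3
x_tilde = np.array([1.0, 2.0, 3.0])  # (x, x', x'')
print(D @ x_tilde)                   # -> [2. 3. 0.]
```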
Precision matrices in generalised coordinates and time

$$\tilde{\Pi} = S(\gamma) \otimes \Pi(\lambda)$$

The temporal part $S(\gamma)$ is determined by the derivatives of the noise autocorrelation $\rho$ at zero lag (general form) and, for Gaussian autocorrelations, by a single roughness parameter $\gamma$ (Gaussian form):

$$S(\gamma) = \begin{bmatrix} 1 & 0 & \ddot{\rho}(0) & \cdots \\ 0 & -\ddot{\rho}(0) & 0 & \\ \ddot{\rho}(0) & 0 & \ddddot{\rho}(0) & \\ \vdots & & & \ddots \end{bmatrix} = \begin{bmatrix} 1 & 0 & -\tfrac{1}{2}\gamma & \cdots \\ 0 & \tfrac{1}{2}\gamma & 0 & \\ -\tfrac{1}{2}\gamma & 0 & \tfrac{3}{4}\gamma^2 & \\ \vdots & & & \ddots \end{bmatrix}$$

Discretely sampled data are mapped to generalised coordinates with a Taylor-expansion operator E, $y_r = E\tilde{y}(t)$, with corresponding precision $E^{-T} \tilde{\Pi} E^{-1}$.

[Figure: images of the precision matrices $\tilde{\Pi} = S(\gamma)$ and $\tilde{\Pi} = S(\gamma) \otimes \Pi(\lambda)$.]
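A sketch of the Gaussian form of S(γ) shown above, built from the derivatives of a Gaussian autocorrelation ρ(t) = exp(−γt²/4) at zero lag (this parameterisation of ρ is an assumption, chosen because it reproduces the entries on the slide):

```python
import numpy as np
from math import factorial

def rho_derivs(gamma, k):
    """Derivatives rho^(0..k)(0) of rho(t) = exp(-gamma t^2 / 4):
    odd orders vanish; rho^(2m)(0) = (2m)!/m! * (-gamma/4)^m."""
    d = np.zeros(k + 1)
    for m in range(k // 2 + 1):
        d[2 * m] = factorial(2 * m) / factorial(m) * (-gamma / 4) ** m
    return d

def S_gamma(gamma, n):
    """Temporal matrix over n orders of motion: S[i,j] = (-1)^i rho^(i+j)(0)."""
    d = rho_derivs(gamma, 2 * (n - 1))
    return np.array([[(-1) ** i * d[i + j] for j in range(n)]
                     for i in range(n)])

print(S_gamma(1.0, 3))
# [[ 1.    0.   -0.5 ]
#  [ 0.    0.5   0.  ]
#  [-0.5   0.    0.75]]
# The generalised precision then follows as np.kron(S_gamma(g, n), Pi_lambda).
```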
Hierarchical forms

$$\begin{aligned} y &= g(x,v) + z \\ \dot{x} &= f(x,v) + w \end{aligned} \quad\Rightarrow\quad \begin{aligned} \tilde{y} &= \tilde{g} + \tilde{z} \\ D\tilde{x} &= \tilde{f} + \tilde{w} \end{aligned}$$

$$\begin{aligned} v^{(i-1)} &= g(x^{(i)}, v^{(i)}) + z^{(i)} \\ \dot{x}^{(i)} &= f(x^{(i)}, v^{(i)}) + w^{(i)} \end{aligned} \quad\Rightarrow\quad \begin{aligned} \tilde{v}^{(i-1)} &= \tilde{g}^{(i)} + \tilde{z}^{(i)} \\ D\tilde{x}^{(i)} &= \tilde{f}^{(i)} + \tilde{w}^{(i)} \end{aligned}$$

With $\vartheta = \{u, \theta, \lambda\}$ and $u = \{\tilde{x}, \tilde{v}\}$, the densities of a two-level hierarchy are

$$p(\tilde{y} \mid u^{(1)}, \vartheta^{(1)}) = N(\tilde{y} : \tilde{g}^{(1)}, \tilde{\Sigma}^{(1)z})$$
$$p(\tilde{x}^{(1)} \mid \tilde{v}^{(1)}, \vartheta^{(1)}) = N(D\tilde{x}^{(1)} : \tilde{f}^{(1)}, \tilde{\Sigma}^{(1)w})$$
$$p(\tilde{v}^{(1)} \mid u^{(2)}, \vartheta^{(2)}) = N(\tilde{v}^{(1)} : \tilde{g}^{(2)}, \tilde{\Sigma}^{(2)z})$$
$$p(\tilde{x}^{(2)} \mid \tilde{v}^{(2)}, \vartheta^{(2)}) = N(D\tilde{x}^{(2)} : \tilde{f}^{(2)}, \tilde{\Sigma}^{(2)w})$$
$$p(\tilde{v}^{(2)} \mid u^{(3)}, \vartheta^{(3)}) = N(\tilde{v}^{(2)} : \tilde{g}^{(3)}, \tilde{\Sigma}^{(3)z})$$
$$p(\tilde{v}) = N(\tilde{v} : \eta^v, C^v)$$

[Figure: directed graphical model linking $\tilde{v}^{(2)}, \tilde{x}^{(2)}$ (with $\theta^{(2)}, \lambda^{(2)}$) to $\tilde{v}^{(1)}, \tilde{x}^{(1)}$ (with $\theta^{(1)}, \lambda^{(1)}$) and the data $\tilde{y}$.]
Hierarchical forms and empirical priors

$$\begin{aligned} y &= g(x^{(1)}, v^{(1)}) + z^{(1)} \\ \dot{x}^{(1)} &= f(x^{(1)}, v^{(1)}) + w^{(1)} \\ &\vdots \\ v^{(i-1)} &= g(x^{(i)}, v^{(i)}) + z^{(i)} \\ \dot{x}^{(i)} &= f(x^{(i)}, v^{(i)}) + w^{(i)} \\ &\vdots \\ v^{(m)} &= \eta + z^{(m+1)} \end{aligned}$$

The instantaneous energy is a function of prediction error:

$$U(t) = \ln p(\tilde{y}, \tilde{x}, \tilde{v} \mid \theta, \lambda) = \tfrac{1}{2}\ln|\tilde{\Pi}| - \tfrac{1}{2}\tilde{\varepsilon}^T \tilde{\Pi}\, \tilde{\varepsilon} \qquad \varepsilon = \begin{bmatrix} y \\ v^{(1)} \\ \vdots \\ v^{(m)} \end{bmatrix} - \begin{bmatrix} g^{(1)} \\ \vdots \\ g^{(m)} \\ \eta \end{bmatrix}$$

$$p(\tilde{y}, \tilde{x}, \tilde{v}) = p(\tilde{y} \mid \tilde{x}, \tilde{v})\, p(\tilde{x}, \tilde{v}) \qquad p(\tilde{x}, \tilde{v}) = p(\tilde{v}^{(m)}) \prod_{i=1}^{m-1} p(\tilde{x}^{(i)} \mid \tilde{v}^{(i)})\, p(\tilde{v}^{(i)} \mid \tilde{x}^{(i+1)}, \tilde{v}^{(i+1)})$$

$$p(\tilde{x}^{(i)} \mid \tilde{v}^{(i)}) = N(D\tilde{x}^{(i)} : \tilde{f}^{(i)}, \tilde{\Sigma}^{(i)w}) \qquad \text{Dynamical priors (empirical)}$$
$$p(\tilde{v}^{(i)} \mid \tilde{x}^{(i+1)}, \tilde{v}^{(i+1)}) = N(\tilde{v}^{(i)} : \tilde{g}^{(i+1)}, \tilde{\Sigma}^{(i)z}) \qquad \text{Structural priors (empirical)}$$
$$p(\tilde{v}^{(m)}) = N(\tilde{v}^{(m)} : \tilde{\eta}, \tilde{\Sigma}^v) \qquad \text{Priors (full)}$$
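To show the structure of the stacked prediction error and the energy U(t), here is a minimal sketch for a hypothetical two-level linear model (all matrices and precisions invented for illustration), at a single time point and without generalised coordinates:

```python
import numpy as np

# Hypothetical two-level linear model: y = G1 x + z1, v = eta + z2,
# and hidden dynamics dx = F1 x + B1 v + w.
G1 = np.array([[1.0], [0.5]])      # level-1 observation mapping g(x) = G1 x
F1 = np.array([[-0.5]])            # flow f(x, v) = F1 x + B1 v
B1 = np.array([[1.0]])
eta = np.array([0.0])              # prior expectation of the top-level cause

Pi = np.diag([np.e**4, np.e**4,    # precision of z(1) (observation noise)
              np.e**2,             # precision of z(2) (prior on the cause)
              np.e**8])            # precision of w(1) (state noise)

def energy(y, x, dx, v):
    """U = 1/2 ln|Pi| - 1/2 eps' Pi eps, eps = (y - g, v - eta, dx - f)."""
    eps = np.concatenate([y - G1 @ x,                 # sensory error
                          v - eta,                    # structural (prior) error
                          dx - (F1 @ x + B1 @ v)])    # dynamical error
    sign, logdet = np.linalg.slogdet(Pi)
    return 0.5 * logdet - 0.5 * eps @ Pi @ eps

print(energy(y=np.array([1.0, 0.4]), x=np.array([0.9]),
             dx=np.array([-0.2]), v=np.array([0.3])))
```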
Overview

$$\dot{\tilde{\mu}} = V(t)_u + D\tilde{\mu}$$

Variational learning and free-energy
Hierarchical dynamic models
  Generalised coordinates (dynamical priors)
  Hierarchical forms (structural priors)
Model inversion
  Variational filtering (free-form)
  Laplace approximation and DEM (fixed-form)
Comparative evaluations
Hemodynamics and bird songs
Variational filtering (free-form)

Aim: to approximate the recognition density given the variational (instantaneous) energy.

Lemma: $q(u(t)) \propto \exp(V(t))$ is the stationary solution, in a moving frame of reference, for an ensemble of particles whose equations of motion and ensemble dynamics are

$$\begin{aligned} \dot{u} &= V(t)_u + \mu' + \Gamma \\ \dot{u}' &= V(t)_{u'} + \mu'' + \Gamma \\ \dot{u}'' &= \cdots \end{aligned} \quad\Rightarrow\quad \dot{\tilde{u}} = V(t)_{\tilde{u}} + D\tilde{\mu} + \Gamma$$

$$\dot{q} = \nabla_u \cdot [\nabla_u q - q \nabla_u V(t)] - \nabla_u \cdot (q D\tilde{\mu})$$

Proof: substituting the recognition density $q \propto \exp(V(t))$ gives

$$\dot{q} = -\nabla_u \cdot (q D\tilde{\mu})$$

This describes a density that is stationary in a frame of reference moving with velocity $D\tilde{\mu}$, as seen using the coordinate transform

$$\dot{\upsilon} = D\tilde{\mu} \;\Rightarrow\; \dot{q}(\upsilon) = \dot{q}(u) + \nabla_u \cdot (q\dot{\upsilon}) = 0$$
A toy example

$$V(\tilde{u}, t) = -4(u - \mu)^2 - (u' - \mu')^2 - \tfrac{1}{4}(u'' - \mu'')^2$$

(the quadratic terms enter with negative sign, so that $q \propto \exp(V)$ is a proper density), with mode dynamics

$$\dot{\mu}(t) = \mu'(t) \qquad \dot{\mu}'(t) = \mu''(t) \qquad \dot{\mu}''(t) = \tfrac{1}{4}\mu(t)$$

and particle dynamics

$$\begin{aligned} \dot{u} &= V(t)_u + \mu' + \Gamma(t) \\ \dot{u}' &= V(t)_{u'} + \mu'' + \Gamma(t) \\ \dot{u}'' &= \cdots \end{aligned}$$

[Figure: an ensemble of particles u(t) tracking the moving mode over 120 time bins, with the corresponding trajectory in the phase-space of (u, u', u'').]
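A minimal simulation of this toy example (step size, particle count and initial conditions are arbitrary choices; the unit-diffusion fluctuations Γ follow from the Fokker–Planck equation above):

```python
import numpy as np

rng = np.random.default_rng(1)
dt, T, P = 1 / 32, 120, 64          # step size, steps, number of particles

W = np.diag([4.0, 1.0, 0.25])       # V = -(u - mu)' W (u - mu)
D = np.diag(np.ones(2), k=1)        # shift operator on (u, u', u'')

mu = np.array([1.0, 0.0, 0.0])      # generalised mean (mu, mu', mu'')
A = D.copy()
A[2, 0] = 0.25                      # mean dynamics: mu''' = mu / 4

u = mu + 0.1 * rng.standard_normal((P, 3))   # the ensemble of particles
for _ in range(T):
    grad_V = -2.0 * (u - mu) @ W             # gradient V_u, per particle
    u += dt * (grad_V + D @ mu) + np.sqrt(2 * dt) * rng.standard_normal(u.shape)
    mu += dt * (A @ mu)
# At any time, the particles are samples from q(u) ~ exp(V): a Gaussian
# travelling with the mode mu(t), i.e. stationary in the moving frame.
```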
Optimizing free-energy under the Laplace approximation

Mean-field approximation: $q(\vartheta) = \prod_i q(\vartheta^i) = q(u(t))\, q(\theta)\, q(\lambda)$
Laplace approximation: $q(\vartheta^i) = N(\mu^i, \Sigma^i)$

The Laplace approximation enables us to specify the sufficient statistics of the recognition density very simply:

Conditional modes:
$$\tilde{\mu}(t) = \arg\max_u V(t) \qquad \mu^\theta = \arg\max_\theta V^\theta \qquad \mu^\lambda = \arg\max_\lambda V^\lambda$$

Conditional precisions:
$$\Pi^u = -U(t)_{uu} \qquad \Pi^\theta = -\int U(t)_{\theta\theta}\, dt - U^\theta_{\theta\theta} \qquad \Pi^\lambda = -\int U(t)_{\lambda\lambda}\, dt - U^\lambda_{\lambda\lambda}$$

Under these approximations, all we need to do is optimise the conditional modes.
Fixed-form schemes … a gradient ascent in moving coordinates

Taking the expectation of the ensemble dynamics, we get:

$$\dot{u} = V(t)_u + D\tilde{\mu} + \Gamma \;\Rightarrow\; \dot{\tilde{\mu}} = V(t)_u + D\tilde{\mu} \;\Leftrightarrow\; \dot{\tilde{\mu}} - D\tilde{\mu} = V(t)_u$$

Here, $\dot{\tilde{\mu}} - D\tilde{\mu}$ can be regarded as a gradient ascent in a frame of reference that moves along the trajectory encoded in generalised coordinates. The stationary solution, in this moving frame of reference, maximises variational action:

$$\dot{\tilde{\mu}} - D\tilde{\mu} = 0 \;\Rightarrow\; \partial_u V(t) = 0 \;\Leftrightarrow\; \delta_u \bar{V} = 0$$

by the fundamental lemma; c.f. Hamilton's principle of stationary action.
Dynamic expectation maximization

$$\dot{\tilde{\mu}} = V(t)_u + D\tilde{\mu} \qquad \Im = V(t)_{uu} + D$$

D-step (inference, $q(u(t))$), using local linearisation (Ozaki 1992):
$$\Delta\tilde{\mu} = (\exp(\Im) - I)\,\Im^{-1}(V(t)_u + D\tilde{\mu}) \qquad \Pi^u = -U(t)_{uu}$$

E-step (learning, $q(\theta)$):
$$\Delta\mu^\theta = -(V^\theta_{\theta\theta})^{-1} V^\theta_\theta \qquad \Pi^\theta = -V^\theta_{\theta\theta}$$

M-step (uncertainty, $q(\lambda)$):
$$\Delta\mu^\lambda = -(V^\lambda_{\lambda\lambda})^{-1} V^\lambda_\lambda \qquad \Pi^\lambda = -V^\lambda_{\lambda\lambda}$$

$$q(\vartheta) = q(u(t))\, q(\theta)\, q(\lambda)$$

A dynamic recognition system that minimises prediction error.
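A sketch of the D-step's local-linearisation update on a toy quadratic variational energy (the quadratic V and its parameters are invented for illustration):

```python
import numpy as np
from scipy.linalg import expm

def d_step(mu, grad_V, hess_V, D, dt=1.0):
    """Local-linearisation update (Ozaki 1992) of the conditional mode:
    dmu = (expm(dt*J) - I) J^-1 (V_u + D mu), with Jacobian J = V_uu + D."""
    J = hess_V + D
    f = grad_V + D @ mu
    return mu + (expm(dt * J) - np.eye(len(mu))) @ np.linalg.solve(J, f)

# Toy static target: V = -1/2 (u - m)' P (u - m), in (u, u') coordinates.
m = np.array([1.0, 0.0])
P = np.diag([4.0, 1.0])
D = np.diag(np.ones(1), k=1)
mu = np.zeros(2)
for _ in range(8):
    mu = d_step(mu, -P @ (mu - m), -P, D)
print(mu)   # converges to the fixed point where V_u + D mu = 0, i.e. (1, 0)
```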
Overview

$$\dot{\tilde{\mu}} = V(t)_u + D\tilde{\mu}$$

Variational learning and free-energy
Hierarchical dynamic models
  Generalised coordinates (dynamical priors)
  Hierarchical forms (structural priors)
Model inversion
  Variational filtering (free-form)
  Laplace approximation and DEM (fixed-form)
Comparative evaluations
Hemodynamics and bird songs
A linear convolution model

$$y = g(x^{(1)}, v^{(1)}, \theta^{(1)}) + z^{(1)}$$
$$\dot{x}^{(1)} = f(x^{(1)}, v^{(1)}, \theta^{(1)})$$
$$v^{(1)} = \eta$$

[Figure: generation maps a cause v1 through hidden states x1, x2 to responses y1–y4; inversion maps the responses back to conditional expectations µ̃ˣ, µ̃ᵛ via prediction errors ε⁽¹⁾ˣ, ε⁽¹⁾ᵛ, ε⁽²⁾ᵛ.]
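Data for a demonstration like this can be generated as follows; a sketch with hypothetical coefficients, producing a four-channel response from one cause and two hidden states:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 32
A = np.array([[-0.25, 1.000],
              [-0.125, -0.25]])         # flow:        x' = A x + B v + w
B = np.array([[1.0], [0.0]])
C = np.array([[0.13, 0.16],             # observation: y  = C x + z
              [0.13, 0.07],
              [0.13, -0.07],
              [0.13, -0.16]])

t = np.arange(T)
v = np.exp(-0.25 * (t - 12.0) ** 2)     # cause: a Gaussian bump at t = 12
x = np.zeros(2)
Y = []
for k in range(T):                      # Euler integration with unit bins
    x = x + (A @ x + B[:, 0] * v[k] + 0.01 * rng.standard_normal(2))
    Y.append(C @ x + 0.05 * rng.standard_normal(4))
Y = np.asarray(Y)                       # 4-channel data y1..y4 to deconvolve
```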
Variational filtering on states and causes

$$\dot{u}(t) = V(t)_u + D\tilde{\mu}(t) + \Gamma(t) \qquad u(t) = [v, v', v'', \ldots, x, x', x'', \ldots]^T$$

[Figure: ensemble trajectories of the particles for the cause v(t) and the hidden states x(t) over 30 time bins.]
Linear deconvolution with variational filtering (SDE, free-form) and with dynamic expectation maximisation (ODE, fixed-form)

$$\dot{u}(t) = V(t)_u + D\tilde{\mu}(t) + \Gamma(t) \quad \text{(variational filtering)}$$
$$\dot{\tilde{\mu}}(t) = V(t)_u + D\tilde{\mu}(t) \quad \text{(DEM)}$$

[Figure: predicted response and error, hidden states, and level-2 causal states over 30 time bins for the free-form (SDE) and fixed-form (ODE) schemes.]
The order of generalised motion: accuracy and embedding (n)

$$\tilde{\Pi} = S(\gamma)$$

[Figure: precision in generalised coordinates; conditional estimates of the cause v(t) with embedding orders n = 1 and n = 7; sum of squared error on the causal states as a function of embedding order n = 1, 3, 5, …, 13.]
DEM and extended Kalman filtering

With convergence when $\gamma \to \infty \equiv n = 1$ (i.e., uncorrelated noise and no temporal embedding).

[Figure: conditional estimates of the hidden states for DEM(0), DEM(4) and EKF against the true trajectories; sum of squared error on the hidden states over time for EKF, DEM(0) and DEM(4).]
A nonlinear convolution model

level    g(x, v)                     f(x, v)           Π^z    Π^w
m = 1    x²/5                        eᵛ − x ln 2       e⁴     e¹⁶
m = 2    η^v = ½ + sin(πt/16)

This system has a slow sinusoidal input or cause that excites increases in a single hidden state. The response is a quadratic function of the hidden states (c.f. Arulampalam et al. 2002).

[Figure: predicted response and error, hidden states, and causal states over 30 time bins.]
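A simulation of this system as reconstructed above; the functional forms and constants are read off the slide and should be treated as assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 32
t = np.arange(T)
v = 0.5 + np.sin(np.pi * t / 16)        # slow sinusoidal cause
x = np.zeros(T)
for k in range(1, T):                   # f(x, v) = e^v - x ln 2
    x[k] = x[k - 1] + (np.exp(v[k - 1]) - x[k - 1] * np.log(2)
                       + np.exp(-8) * rng.standard_normal())  # w: precision e^16
y = x ** 2 / 5 + np.exp(-2) * rng.standard_normal(T)          # z: precision e^4
```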
DEM and particle filtering: comparative performance

[Figure: conditional expectations of the hidden states (true, DEM, PF, EKF) over 35 time bins; conditional covariances of the hidden states for DEM, PF and EKF; sum of squared error on the hidden states (log scale) for EKF, PF and DEM.]
Inference on states

$$\dot{\tilde{\mu}} = V(t)_u + D\tilde{\mu}$$

[Figure: predicted response and error, and conditional expectations of the hidden states $\tilde{\mu}^x$, over 30 time bins.]
Triple estimation (DEM): learning parameters

[Figure: conditional expectations of the causal states $\tilde{\mu}^v$ (true versus DEM) over 30 time bins, and conditional estimates of the parameters $\mu^\theta$.]
Overview

$$\dot{\tilde{\mu}} = V(t)_u + D\tilde{\mu}$$

Variational learning and free-energy
Hierarchical dynamic models
  Generalised coordinates (dynamical priors)
  Hierarchical forms (structural priors)
Model inversion
  Variational filtering (free-form)
  Laplace approximation and DEM (fixed-form)
Comparative evaluations
Hemodynamics and bird songs
An fMRI study of attention

Stimuli: 250 radially moving dots at 4.7 degrees/s.

Pre-scanning: 5 × 30 s trials with 5 speed changes (reducing to 1%). Task: detect changes in radial velocity.

Scanning (no speed changes): 4 × 100-scan sessions, each comprising 10 scans of 4 different conditions:
F A F N F A F N S .................
A – dots, motion and attention (detect changes)
N – dots and motion
S – dots
F – fixation

V5 (motion-sensitive area). Buchel et al. 1999
A hemodynamic model

Inputs u(t): visual stimulation, motion and attention (a convolution kernel maps input to response).

State equations $\dot{x} = f(x, u, \theta)$:

signal: $\dot{s} = \theta u - \kappa s - \gamma(f - 1)$
flow: $\dot{f} = s$
volume: $\tau\dot{v} = f - v^{1/\alpha}$
dHb: $\tau\dot{q} = f\,E(f,\rho)/\rho - v^{1/\alpha} q/v$

Output equation $y = g(x, \theta)$: a mixture of intra- and extravascular signal

$$y(t) = g(v, q) = V_0\big(k_1(1 - q) + k_2(1 - q/v) + k_3(1 - v)\big)$$
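An Euler integration of these state and output equations; the parameter values (κ, γ, τ, α, ρ, V₀, k₁–k₃) are typical values from the hemodynamic-modelling literature rather than values given on the slide:

```python
import numpy as np

# Typical parameter values (assumptions, not read from the slide).
kappa, gam, tau, alpha, rho = 0.65, 0.41, 0.98, 0.32, 0.34
V0, k1, k2, k3 = 0.04, 7 * rho, 2.0, 2 * rho - 0.2

def E(f):
    """Oxygen extraction fraction as a function of flow."""
    return 1.0 - (1.0 - rho) ** (1.0 / f)

def hemodynamics(u, theta=1.0, dt=0.1):
    s, f, v, q = 0.0, 1.0, 1.0, 1.0
    y = []
    for ut in u:
        ds = theta * ut - kappa * s - gam * (f - 1.0)             # signal
        df = s                                                    # flow
        dv = (f - v ** (1.0 / alpha)) / tau                       # volume
        dq = (f * E(f) / rho - v ** (1.0 / alpha) * q / v) / tau  # dHb
        s, f, v, q = s + dt * ds, f + dt * df, v + dt * dv, q + dt * dq
        y.append(V0 * (k1 * (1 - q) + k2 * (1 - q / v) + k3 * (1 - v)))
    return np.asarray(y)

# BOLD response to a 10 s boxcar of neuronal input:
bold = hemodynamics(np.r_[np.zeros(50), np.ones(100), np.zeros(150)])
```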
Hemodynamic deconvolution (V5): inference on states and learning parameters

[Figure: predicted response and error, hidden states and causal states over 350 time bins; conditional estimates of the parameters coupling the visual (vis), motion (mot) and attention (att) inputs.]
… and a closer look at the states

[Figure: hidden states (signal, flow, volume, dHb) under visual stimulation, and the estimated neuronal causes, over 140 time bins.]
Overview

$$\dot{\tilde{\mu}} = V(t)_u + D\tilde{\mu}$$

Variational learning and free-energy
Hierarchical dynamic models
  Generalised coordinates (dynamical priors)
  Hierarchical forms (structural priors)
Model inversion
  Variational filtering (free-form)
  Laplace approximation and DEM (fixed-form)
Comparative evaluations
Hemodynamics and bird songs
Synthetic song-birds

A hierarchy of Lorenz attractors provides time-varying control of a synthetic syrinx, generating song.

[Figure: sonogram of the stimulus; frequency 2000–5000 Hz over 1.5 s.]
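A sketch of such a hierarchy (the coupling and constants are assumptions): a slow Lorenz system sets the Rayleigh parameter of a fast one, whose states would supply the time-varying frequency and amplitude that drive the syrinx:

```python
import numpy as np

def lorenz(x, p=(10.0, 32.0, 8.0 / 3.0)):
    """Lorenz flow; p = (Prandtl, Rayleigh, beta)."""
    s, r, b = p
    return np.array([s * (x[1] - x[0]),
                     r * x[0] - x[1] - x[0] * x[2],
                     x[0] * x[1] - b * x[2]])

dt, T = 0.004, 4000
x2 = np.array([0.9, 0.8, 30.0])      # slow, level-2 attractor
x1 = np.array([1.0, 1.0, 30.0])      # fast, level-1 attractor
song = []
for _ in range(T):
    x2 = x2 + (dt / 8.0) * lorenz(x2)                   # slowed by a factor of 8
    x1 = x1 + dt * lorenz(x1, (10.0, x2[2], 8.0 / 3.0)) # level 2 sets Rayleigh r
    song.append(x1[:2].copy())       # e.g. frequency and amplitude controls
song = np.asarray(song)
```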
Song recognition with DEM

[Figure: sonograms of the stimulus and the percept (2000–5000 Hz over 1.5 s); prediction and error, hidden states and level-2 causes over 120 time bins.]
… and broken birds

[Figure: percept sonograms and simulated LFPs (micro-volts) over 2000 ms of peristimulus time for the intact model, for a model with no structural priors (m = 1), and for a model with no dynamical priors (n = 1).]
Summary

$$\dot{\tilde{\mu}} = V(t)_u + D\tilde{\mu}$$

Variational learning and free-energy
Hierarchical dynamic models
  Generalised coordinates (dynamical priors)
  Hierarchical forms (structural priors)
Model inversion
  Variational filtering (free-form)
  Laplace approximation and DEM (fixed-form)
Comparative evaluations
Hemodynamics and bird songs