On the Selection of Multiregression Dynamic Models of fMRI
networked time series
Lilia Costa
The University of Warwick, UK; Universidade Federal da Bahia, BR
E-mail: L.C.Carneiro-da-Costa@warwick.ac.uk
Jim Smith
The University of Warwick, UK
Thomas Nichols
The University of Warwick, UK
March 20, 2013
This research was partially supported by CAPES, Brazil.
Abstract
The Multiregression Dynamic Model (MDM) is a class of multivariate time series models that allows various dynamic causal processes to be represented in a graphical way. In contrast
with many other Dynamic Bayesian Networks, the hypothesized relationships accommodate
conditional conjugate inference. This means that it is straightforward to search over many
different connectivity networks with dynamically changing intensity of transmission to find
the MAP model within a class of models. In this paper we customize this conjugate search
within scientific models describing the dynamic connectivity of the brain. As well as demonstrating the efficacy of our dynamic models, we illustrate how diagnostic methods, analogous
to those defined for static Bayesian Networks, can be used to suggest embellishment of the
model class to extend the process of model selection.
Keywords: Multiregression Dynamic Model, Bayesian Network, Markov Equivalent Graph,
Model Selection, Functional magnetic resonance imaging (fMRI).
1 Introduction
We consider the application of a class of Dynamic Bayesian Network (DBN) models, called
the Multiregression Dynamic Model (MDM), to resting state functional Magnetic Resonance
Imaging (fMRI) data. Functional MRI consists of a dynamic acquisition, i.e. a series of images,
which provides a time series at each volume element. These data reflect the blood oxygenation
level, which is related to the activity of brain neurons. A traditional fMRI experiment consists of
alternating periods of active and control experimental conditions and the purpose is to compare
brain activity between two different cognitive states (e.g. remembering a list of words versus
just passively reading a list of words). In contrast, a “resting state” experiment is conducted by
having the subject remain in a state of quiet repose, and the analysis focuses on understanding
the pattern of connectivity among different cerebral areas. The ultimate (and ambitious) goal is
to understand how one neural system influences another (Poldrack et al., 2011). Some studies
assume that the connection strengths between different brain regions are constant. These static
models are used to define dependence networks, a subset of which is then revised to form a more
scientifically accurate dynamic model (Smith et al., 2011; Friston, 2011). However, clearly a
more promising strategy would be to first perform a search over a class of models which is rich
enough to capture the dynamic changes in the connectivity strengths that are known to exist
in this application. The Multiregression Dynamic Model (MDM) can do just this (Queen and
Smith, 1993; Queen and Albers, 2009) and in this paper we demonstrate how it can be applied
to resting fMRI.
Currently the most popular approach is to hypothesize relationships with an undirected
graph or a directed acyclic graph (DAG). Smith et al. (2011) compared different connectivity
estimation approaches for fMRI data. Whilst they found that BNs are one of the most successful methods for detecting (undirected) network edges, none of the methods (except Patel’s
τ ) was remotely successful in estimating connection direction. In this paper we describe the
first application — to our knowledge — of Bayes’ factor MDM search in this domain. As with
standard BNs the Bayes’ factor of MDM can be written in closed form, so the model space can
be scored quickly. However unlike a static BN, the MDM models dynamic links and so allows
us to discriminate between models that would be Markov equivalent in their static versions.
Furthermore the directionality exhibited in the graph of an MDM can be associated with a causal
directionality in a very natural way (Queen and Albers, 2009) that is also scientifically meaningful. We are therefore able to demonstrate that the MDM is not only a useful method for
detecting the existence of brain connectivity but also for estimating its direction.
This paper also presents new prequential diagnostics for this model class analogous to those
originally developed for static BNs (Cowell et al., 1999) using the closed form of the one-step
ahead predictive distribution. It is well known that Bayes' factor model selection methods can
break down whenever no representative in the associated model class fits the data well. It is
therefore extremely important to check that selected models are consistent with the observed
series. Here we recommend initially selecting across a class of simple linear MDMs which are
time homogeneous and have no change points. We then check the best model using
these new prequential diagnostics. In practice we have found the linear MDMs usually perform
well for most nodes receiving inputs from other nodes. However when diagnostics discover
a discrepancy of fit, the MDM class is sufficiently expressive that it can be embellished to
accommodate other anomalous features. For example, it is possible to include time dependent
error variances, change points, interaction terms in the regression and so on, to better reflect the
underlying model and refine the analysis. Often, even after such embellishment, the model still
stays within a conditionally conjugate class. So if our diagnostics identify serious deviation from
the highest scoring simple MDM, we can adapt it and its high scoring neighbors with features
explaining the deviations. The model selection process using Bayes’ factors can then be reapplied
to discover models that describe the process even better. In this way we can iteratively augment
the fitted model and its highest scoring competitors with embellishments until the search class
accommodates the main features observed in the dynamic processes well.
The remainder of this paper is structured as follows. Section 2 describes the resting state
data we use here and Section 3 reviews the class of MDMs. The performance of the MDM will
be investigated using synthetic data in Section 4. Section 5 then develops diagnostic statistics
for an MDM using real data that can be applied to the model selection as described above.
Directions for future work are given in Section 6.
2 Resting State Data
This study focuses on fMRI time series data, where 36 healthy adults at rest were observed
over 6 minute intervals (Smith et al., 2009). The first step in the study of fMRI connectivity
is data reduction. This is usually achieved by summarizing the data as a set of time series
derived from predefined regions of interest (ROIs). The choice of ROI set to use is somewhat
arbitrary, poorly chosen ROIs may mix heterogeneous brain regions, and, further, an ROI-based
network cannot easily represent spatially overlapping networks (Smith et al., 2011). An
alternative is to use Independent Components Analysis (ICA) to generate data-driven spatial
patterns that describe the structure of local and long-range connectivity (Cole et al., 2010).
Here we have used such an ICA approach. Each subject’s image data was transformed into
a standard atlas space, and then all subjects’ data concatenated temporally. After an initial
principal components data reduction to 20 components, ICA produced a set of 20 matched pairs
of spatial components (that are orthogonal and maximally statistically independent) and temporal
loadings (that may be correlated). Of these 20, 10 spatial components were identified as well-known resting-state networks (Smith et al., 2009). The temporal loadings, split back into 36
separate time series, express the temporal evolution of the corresponding spatial pattern in each
subject and were the source of the data for our MDM modeling. To demonstrate the methodology
developed in this paper, it is sufficient to consider the networks of just three of these components.
Henceforth we refer to those components as regions (to distinguish them from the network we
build between these three regions): region 1, a visual network composed of medial, occipital pole
and lateral visual areas (comprised of the average of the 3 visual networks in Smith et al., 2009);
region 2, the “default mode network” (DMN) comprising posterior cingulate, bilateral inferior-lateral-parietal and ventromedial frontal areas; and region 3, an “executive control” network
that covers several medial-frontal areas including anterior cingulate and paracingulate cortex
(Figure 1).
For each of the 36 subjects and for each of the 3 regions, we have time series of length 176
that represent the neural activity; for a given subject we write the time series as Y (1), Y (2),
and Y (3), for the 3 regions.
Figure 1: Three networks from independent component analysis of the 36-subject resting fMRI dataset (Smith et al., 2009): (a) Region 1 - Visual, (b) Region 2 - Default Mode Network and (c) Region 3 - Executive Control.
A simple preliminary analysis of this data clearly demonstrates the scientifically predicted
dynamically evolving changes in strength of dependency between these series. For example plots
of estimates of regression coefficients of one of the subjects using simple regression models, fitted
to estimate the linear relationship between Y (1) and Y (2); Y (1) and Y (3); and Y (2) and Y (3)
and based on a moving window of 30 time points are given in Figure 2. The plots clearly exhibit
some large drifts in dependency strengths over time. This strongly suggests that a time varying
flexible model such as the simple linear MDM defined below should be used for this application.
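For concreteness, the sketch below shows one way such moving-window slope estimates could be computed. It is an illustrative reconstruction in Python rather than the authors' code, and the synthetic series standing in for two ICA time courses are assumptions made only so that the snippet runs.

```python
import numpy as np

def moving_window_slopes(x, y, window=30):
    """OLS slope of y on x over a sliding window of fixed length.

    Mimics the exploratory estimate of time-varying dependency strength
    plotted in Figure 2 (one slope per window position).
    """
    T = len(x)
    slopes = np.empty(T - window + 1)
    for start in range(T - window + 1):
        xs, ys = x[start:start + window], y[start:start + window]
        # slope of the simple regression ys = a + b * xs
        slopes[start] = np.cov(xs, ys, bias=True)[0, 1] / np.var(xs)
    return slopes

# Illustrative use with synthetic series standing in for two regions.
rng = np.random.default_rng(0)
y1 = rng.standard_normal(176)
y2 = 0.3 * y1 + rng.standard_normal(176)
print(moving_window_slopes(y1, y2, window=30)[:5])
```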
Figure 2: The posterior mean and 95% HPD intervals for the regression parameters between each pair of regions: (a) Visual → DMN; (b) Visual → Executive Control; (c) DMN → Executive Control, over 147 time intervals.
3 The Multiregression Dynamic Model
The MDM (Queen and Smith, 1993) is a graphical multivariate model for an n-dimensional
time series Yt(1), . . . , Yt(n), t = 1, . . . , T . It is a composition of simpler univariate regression
dynamic linear models (DLMs; West and Harrison, 1997), which can model smooth changes in
the parents’ effect on a given node during the period of investigation. There are five features of
this model class that are useful for this study.
1. Each MDM is defined in part by a directed acyclic graph whose vertices are components
of the series at a given time, and edges denote that contemporaneous (and possibly past)
observations are included as regressors. These directed edges therefore denote that direct
contemporaneous relationships might exist between these components and also that the
directionality of the edges is ‘causal’ in a sense that is carefully argued in Queen and Albers
(2009);
2. Dependence relationships between each component and its contemporaneous parents — as
represented by the corresponding regression coefficients — are allowed to drift with time.
We note that this drift can be set to zero and then the MDM simplifies to a standard
Gaussian graphical model;
3. Each one of these multivariate models admits a conjugate analysis. In particular its
marginal likelihood can be expressed as a product of multivariate student t distributions,
as detailed below. Its closed form allows us to perform fast model selection;
4. Although the predictive distribution of each component given its parents is multivariate
Student t distributed, because covariates enter the scale function of these conditionals,
the joint distribution can be highly non-Gaussian (see e.g. Queen and Smith, 1993, for
an example of this). Because ICA and related techniques applied to the study of fMRI
data often work on the assumption that processes are not jointly Gaussian, the MDM is
more compatible with these methods than many of its competitors that assume marginal
Gaussianity;
5. The class of MDM can be further modified to include other features that might be necessary
in a straightforward and convenient way, for example by adding dependence on the past
of the parents (not just the present), allowing for change points, and other embellishments
that we illustrate below.
At time t the data at the n regions are denoted by the column vector $\mathbf{Y}_t' = (Y_t(1), \dots, Y_t(n))$, with observed values $\mathbf{y}_t' = (y_t(1), \dots, y_t(n))$. Let the time series up to time t for region $r = 1, \dots, n$ be $\mathbf{Y}^t(r)' = (Y_1(r), \dots, Y_t(r))$ and the vector of possible parents of region r at time t be $\mathbf{X}_t(r)' = (Y_t(1), \dots, Y_t(r-1))$ for $r = 2, \dots, n$. Note that the n regions can always be ordered to ensure that $Pa(r) \subseteq \mathbf{X}_t(r)$, where $Pa(r)$ denotes the parent set of $Y_t(r)$. The MDM is defined by n observation equations, a system equation and initial information (Queen and Smith, 1993). The observation equations specify the time-varying regression of each region on its parents; the system equation is a multivariate autoregressive model for the evolution of the time-varying regression coefficients; and the initial information is given through a prior density for the regression coefficients. The linear multiregression dynamic model (LMDM; Queen and Albers, 2008) is specified in terms of a collection of conditional regression DLMs (West and Harrison, 1997) as follows.
We write the observation equations as
$$Y_t(r) = \mathbf{F}_t(r)'\boldsymbol{\theta}_t(r) + v_t(r), \qquad v_t(r) \sim N(0, V_t(r));$$
where $r = 1, \dots, n$; $t = 1, \dots, T$; $N(\cdot, \cdot)$ is a Gaussian distribution; $\mathbf{F}_t(r)'$ is a covariate vector of dimension $p_r = |Pa(r)| + 1$ determined by $Pa(r)$; the first element of $\mathbf{F}_t(r)'$ is 1, representing an intercept, and the remaining elements are $Y_t(r^*)$ for $Y_t(r^*) \in Pa(r)$; $\boldsymbol{\theta}_t(r)$ is the $p_r$-dimensional time-varying regression coefficient; and $v_t(r)$ is the independent residual error with variance $V_t(r)$. Concatenating the n regression coefficients as $\boldsymbol{\theta}_t' = (\boldsymbol{\theta}_t(1)', \dots, \boldsymbol{\theta}_t(n)')$ gives a vector of length $p = \sum_{r=1}^{n} p_r$. We next write the system equation as
$$\boldsymbol{\theta}_t = \mathbf{G}_t\boldsymbol{\theta}_{t-1} + \mathbf{w}_t, \qquad \mathbf{w}_t \sim N(\mathbf{0}, \mathbf{W}_t);$$
where $\mathbf{G}_t = \mathrm{blockdiag}\{\mathbf{G}_t(1), \dots, \mathbf{G}_t(n)\}$, each $\mathbf{G}_t(r)$ being a $p_r \times p_r$ matrix, $\mathbf{w}_t$ are innovations for the latent regression coefficients, and $\mathbf{W}_t = \mathrm{blockdiag}\{\mathbf{W}_t(1), \dots, \mathbf{W}_t(n)\}$, each $\mathbf{W}_t(r)$ being a $p_r \times p_r$ matrix. The error $\mathbf{w}_t$ is assumed independent of $v_s$ for all t and s. For most of the development we need only consider $\mathbf{G}_t(r) = \mathbf{I}_{p_r}$, where $\mathbf{I}_{p_r}$ is the $p_r$-dimensional identity matrix.
Finally, the initial information is written as
$$(\boldsymbol{\theta}_0\,|\,\mathbf{y}^0) \sim N(\mathbf{m}_0, \mathbf{C}_0);$$
where $\boldsymbol{\theta}_0$ expresses the prior knowledge of the regression parameters, before observing any data, given the information $\mathbf{y}^0$ available at time $t = 0$. The mean vector $\mathbf{m}_0$ is an initial estimate of the parameters and $\mathbf{C}_0$ is the $p \times p$ variance-covariance matrix, which can be defined as $\mathrm{blockdiag}\{\mathbf{C}_0(1), \dots, \mathbf{C}_0(n)\}$, with each $\mathbf{C}_0(r)$ being a $p_r$ square matrix. When the observational variances are unknown and constant, i.e. $V_t(r) = V(r)$ for all t, defining $\phi(r) = V(r)^{-1}$ and assigning the prior
$$(\phi(r)\,|\,\mathbf{y}^0) \sim G\!\left(\frac{n_0(r)}{2}, \frac{d_0(r)}{2}\right),$$
where $G(\cdot, \cdot)$ denotes a Gamma distribution, leads to a conjugate analysis in which each component of the marginal likelihood conditionally has a Student t distribution. In order to use this conjugate analysis it is convenient to reparameterise the model as $\mathbf{W}_t(r) = V(r)\mathbf{W}^*_t(r)$ and $\mathbf{C}_0(r) = V(r)\mathbf{C}^*_0(r)$. For a fixed innovation signal matrix $\mathbf{W}^*_t(r)$ this change implies no loss of generality (West and Harrison, 1997).
It simplifies the analysis, and so the interpretation of the model class, to define the innovation signal matrix indirectly in terms of a single hyperparameter for each component DLM called a discount factor (West and Harrison, 1997; Petris et al., 2009), especially for the model selection purposes we have in mind here. This vastly reduces the dimensionality of the model class whilst in practice often losing very little in the quality of fit. This well-used technique expresses different values of $\mathbf{W}^*_t$ in terms of the loss of information in the change of $\boldsymbol{\theta}$ between times $t-1$ and $t$. More precisely, for some $\delta \in (0, 1]$,
$$\mathbf{W}^*_t = \frac{1-\delta}{\delta}\,\mathbf{C}^*_{t-1};$$
where $\mathbf{C}_t(r) = V(r)\mathbf{C}^*_t(r)$ is the posterior variance of $\boldsymbol{\theta}_t(r)$. Note that when $\delta = 1$, $\mathbf{W}^*_t = 0\mathbf{I}_p$, there are no stochastic changes in the state vector and we degenerate to a conventional standard multivariate Gaussian prior-to-posterior analysis.
For any choice of discount factor $\delta$ the recurrences given above give, for any MDM, a closed-form expression for the marginal likelihood. This means that we can estimate $\delta$ simply by maximizing this marginal likelihood, performing a direct one-dimensional optimization over $\delta$, such as that used in Heard et al. (2006), to complete the search algorithm. The selected model is then the one with the discount factor giving the highest associated Bayes' factor score, see below.
The joint density of the observations associated with any MDM can be factorized as the product of the distribution of the first node and transition distributions between the subsequent nodes (Queen and Smith, 1993). Moreover, the conditional one-step forecast distribution can be written as $(Y_t(r)\,|\,\mathbf{y}^{t-1}, \mathbf{x}_t(r)) \sim T_{n_{t-1}(r)}(f_t(r), Q_t(r))$, where $T_{n_t(r)}(\cdot, \cdot)$ is a noncentral t distribution with $n_t(r)$ degrees of freedom whose parameters are easily found through Kalman filter recurrences (see e.g. West and Harrison, 1997). The joint log predictive likelihood (LPL) is then calculated as
$$\mathrm{LPL}(m) = \log p_{\mathbf{y}}(\mathbf{y}\,|\,m) = \sum_{r=1}^{n} \log p_r(\mathbf{y}(r)\,|\,\mathbf{x}(r), m) = \sum_{r=1}^{n}\sum_{t=1}^{T} \log p_{tr}(y_t(r)\,|\,\mathbf{y}^{t-1}, \mathbf{x}_t(r), m), \qquad (1)$$
where m denotes the current choice of model, which determines the relationship between the n regions expressed graphically through the underlying graph. The most popular Bayesian scoring method, and the one we use here, is the Bayes' factor measure (Jeffreys, 1961; West and Harrison, 1997), defined as the ratio between the predictive likelihoods of two models.
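To make these recurrences concrete, the following Python sketch filters a single node given its parents and returns its contribution to the LPL, together with a simple grid search over the discount factor. It follows the variance-learning DLM recurrences of West and Harrison (1997) with $G_t = I$ and a discounted state variance; the function names, the grid bounds and the default prior settings (which mirror the weakly informative choices quoted later in the paper) are ours, so this is a minimal illustration rather than the implementation used for the results reported below.

```python
import numpy as np
from scipy import stats

def node_lpl(y, X, delta=0.95, c0=3.0, n0=1e-3, d0=1e-3):
    """Log predictive likelihood of one MDM node given its parents.

    y     : (T,) series of the node.
    X     : (T, p) regressor matrix; first column ones (intercept), the
            remaining columns the contemporaneous parent series.
    delta : discount factor in (0, 1]; delta = 1 recovers the static model.
    """
    T, p = X.shape
    m = np.zeros(p)               # prior mean m0
    C = c0 * np.eye(p)            # scale-free prior covariance C0*
    n, S = n0, d0 / n0            # degrees of freedom and point estimate of V
    lpl = 0.0
    for t in range(T):
        F = X[t]
        R = C / delta                       # discounted prior covariance R_t*
        f = F @ m                           # one-step forecast mean
        q = F @ R @ F + 1.0                 # scale-free forecast variance
        Q = S * q                           # forecast scale
        e = y[t] - f
        lpl += stats.t.logpdf(y[t], df=n, loc=f, scale=np.sqrt(Q))
        A = R @ F / q                       # adaptive (Kalman gain) vector
        m = m + A * e
        n += 1
        S = S + (S / n) * (e * e / Q - 1.0)
        C = R - np.outer(A, A) * q          # scale-free posterior covariance
    return lpl

def best_discount(y, X, grid=np.arange(0.50, 1.001, 0.01)):
    """Choose the discount factor for a node by maximising its LPL."""
    scores = [node_lpl(y, X, delta=d) for d in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]
```

By the modularity in equation (1), the score of a candidate DAG is simply the sum of such node scores over its vertices, each node with its own parent set and, if desired, its own discount factor.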
4 A LMDM Analysis of a Synthetic fMRI Dataset
It is first interesting to investigate the potential of an MDM to discriminate between effective
connectivities that change over time and those that remain the same, i.e. whether they originated
from a dynamic or static process. Static BNs with the same skeleton are often Markov equivalent
(Lauritzen, 1996). This is not so for their dynamic MDM analogues. So it is possible for an MDM
to detect directions of relationships in DAGs which are Markov equivalent in static analysis. We
will explore these questions below and demonstrate how this is possible using a simulation
experiment. This in turn allows us to search over hypotheses expressing potential directions of causation.
Here simulations of observations from known MDMs were studied using sample sizes T = 100, 200 and 300, and different dynamic levels $\mathbf{W}^*(r) = 0\mathbf{I}_{p_r}$ (static), $0.001\mathbf{I}_{p_r}$, $0.01\mathbf{I}_{p_r}$ and $0.1\mathbf{I}_{p_r}$. The impact of these different scenarios on the MDM's results was assessed for both 2 and 3 regions, with 100 simulated datasets for each $(T, \mathbf{W}^*(r))$ pair.
For two nodes, data were generated using the MDM with graph DAG1 given in Figure 3(a). The initial value of the regression parameter for the connection between Y(1) and Y(2), i.e. $\theta_0^{(2)}(2)$, was 0.3, and the other $\theta$'s (intercept parameters) were set to 0. The observational variance was defined as 12.5 for Y(1) and 6.3 for Y(2) so that the marginal variances were almost the same for both regions. Thus we set
$$\theta_{ti}^{(k)}(r) = \theta_{t-1,i}^{(k)}(r) + w_{ti}^{(k)}(r), \qquad w_{ti}^{(k)}(r) \sim N(0, W^{(k)}(r)),$$
for $r = 1, 2$; $t = 1, \dots, T$; replications $i = 1, \dots, 100$; $k = 1, \dots, p_r$; $p_1 = 1$; $p_2 = 2$; where $W^{(k)}(r) = W^{*(k)}(r) \times V(r)$ and $W^{*(k)}(r)$ is the $k$th element of the diagonal of the matrix $\mathbf{W}^*(r)$ defined above.
Observed values were then simulated using the following equations:
$$Y_{ti}(1) = \theta_{ti}^{(1)}(1) + v_{ti}(1), \qquad v_{ti}(1) \sim N(0, V(1));$$
$$Y_{ti}(2) = \theta_{ti}^{(1)}(2) + \theta_{ti}^{(2)}(2)\,Y_{ti}(1) + v_{ti}(2), \qquad v_{ti}(2) \sim N(0, V(2)).$$
Figure 3: Directed acyclic graphs used in the synthetic study. With 2 nodes, (a) DAG1 and (b) DAG2 are Markov equivalent as static BNs. For 3 nodes, (c) DAG3 and (d) DAG4 are Markov equivalent whilst neither is equivalent to (e) DAG5.
For three nodes, the graphical structure used to obtain the synthetic data is DAG3, shown in Figure 3(c). The initial values for the regression parameters were 0.3 for the connectivity between Y(1) and Y(2), i.e. $\theta_0^{(2)}(2)$, 0.2 for the connectivity between Y(2) and Y(3), i.e. $\theta_0^{(2)}(3)$, and the value 0 for the other $\theta$'s (intercept parameters). The observational variances for the first and second variables were set as in the two-node case, and to 5.0 for Y(3). The observation and system equations were also set as for node 2, except that $r = 1, \dots, 3$; $p_3 = 2$; and
$$Y_{ti}(3) = \theta_{ti}^{(1)}(3) + \theta_{ti}^{(2)}(3)\,Y_{ti}(2) + v_{ti}(3), \qquad v_{ti}(3) \sim N(0, V(3)).$$
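A minimal sketch of this data-generating process is given below, assuming the chain DAG3 (Y(1) → Y(2) → Y(3)) and the variances quoted above; it is an illustrative reconstruction rather than the authors' simulation code.

```python
import numpy as np

def simulate_dag3(T=300, W_star=0.001, V=(12.5, 6.3, 5.0), seed=0):
    """Simulate one replication from DAG3: Y(1) -> Y(2) -> Y(3).

    Each regression coefficient follows an independent random walk with
    innovation variance W_star * V(r); W_star = 0 gives a static system.
    Initial connectivities are 0.3 (1 -> 2) and 0.2 (2 -> 3), intercepts 0.
    """
    rng = np.random.default_rng(seed)
    theta1 = np.array([0.0])        # intercept of node 1
    theta2 = np.array([0.0, 0.3])   # intercept and slope of node 2 on Y(1)
    theta3 = np.array([0.0, 0.2])   # intercept and slope of node 3 on Y(2)
    Y = np.empty((T, 3))
    for t in range(T):
        # random-walk evolution of the state vectors
        theta1 += rng.normal(0.0, np.sqrt(W_star * V[0]), size=1)
        theta2 += rng.normal(0.0, np.sqrt(W_star * V[1]), size=2)
        theta3 += rng.normal(0.0, np.sqrt(W_star * V[2]), size=2)
        y1 = theta1[0] + rng.normal(0.0, np.sqrt(V[0]))
        y2 = theta2[0] + theta2[1] * y1 + rng.normal(0.0, np.sqrt(V[1]))
        y3 = theta3[0] + theta3[1] * y2 + rng.normal(0.0, np.sqrt(V[2]))
        Y[t] = (y1, y2, y3)
    return Y
```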
The log predictive likelihood (LPL) was first computed for different values of the discount factor $\delta$, using a weakly informative prior with $n_0(r) = d_0(r) = 0.001$ and $\mathbf{C}^*_0(r) = 3\mathbf{I}_{p_r}$ for all r. The discount factor for each node was then chosen as the value that maximized the LPL.
In order to better understand the effect of estimating the connectivity when applying the wrong static/dynamic model, the largest static datasets ($\mathbf{W}^*(r) = 0\mathbf{I}_{p_r}$ and T = 300) were fitted with a dynamic model, using graph DAG1 and $\delta = 0.93$, the average fitted discount factor for data generated with $\mathbf{W}^*(r) = 0.001\mathbf{I}_{p_r}$. Then, dynamic datasets from the same scenario, i.e. DAG1, T = 300 and $\mathbf{W}^*(r) = 0.001\mathbf{I}_{p_r}$, were fitted using the static model with $\delta = 1$. Figure 4 shows the true values (blue lines) and the smoothed estimates of the parameter $\theta_t^{(2)}(2)$ - the connectivity (Y(1), Y(2)) - versus time t for the dynamic (violet lines) and static (green lines) models. When the data are generated from a static model (Figure 4(a)), dynamic models usually estimate the true values quite well. Most of the time they simply alternate between under- and over-estimating the true values of the parameters but nevertheless centre around the true value. In contrast, when the data are simulated from the dynamic models (Figure 4(b)), static models fail to appropriately describe the series at each time point. This phenomenon is particularly pertinent to this application. Because we know connectivities change over time, by fitting static models (as we typically do when fitting BNs) we fit models which can score very poorly even when the topology of the connectivity is right! So in particular any preliminary model search using static BN models is likely to be unreliable and potentially misleading in this dynamically evolving environment.
Figure 4: The true values (blue lines) and smoothed estimates of the parameter $\theta_t^{(2)}(2)$ - connectivity (Y(1), Y(2)) from DAG1 - under the dynamic model (mean $\delta$ of 0.93, violet lines) and the static model ($\delta = 1$, green lines), for a particular replication. The dashed lines represent the 95% HPD intervals. (a) shows results from data simulated from the static model ($\mathbf{W}^*(r) = 0\mathbf{I}_{p_r}$) while (b) shows results for data from the dynamic model ($\mathbf{W}^*(r) = 0.001\mathbf{I}_{p_r}$).
Figure 5 shows the log predictive likelihood versus different values of the discount factor, considering DAG3 (solid lines), DAG4 (dashed lines) and DAG5 (dotted lines). The sample size increases from the first to the last row whilst the dynamic level (innovation variance) increases from the first to the last column. Although the ranges of LPL differ across the graphs, the range sizes are the same (500), so the panels are easy to compare. We can see in this figure that the choice of $\delta$ is consistent with the innovation variance. We found the average estimated $\delta$ to be about 1 when data come from a static system and less than 1 for dynamic synthetic data. Thus, at least in a simulation study, the MDM was able to identify clearly whether the generating system was static or dynamic on the basis of series of the lengths we record in experiments like these.
Note that when the data are static ($\mathbf{W}^*(r)$ is the zero matrix, first column) it is difficult to distinguish the DAGs. Using the model selection criterion proposed by West and Harrison (1997), whereby $-1 < \log \mathrm{BF} < 1$ suggests no significant difference amongst the three DAGs, this occurs for almost 70% and 55% of the replications for T = 100 and T = 200, respectively. This percentage decreases to 38% for the largest sample size (T = 300). However the equivalent DAGs remain indistinguishable (DAG3 and DAG4 were both selected for 56% of replications).
Another interesting result is that even when data follow a dynamic system but are fitted by a static model, the non-Markov equivalent DAGs are distinguishable whilst equivalent DAGs are not. For instance, when $\mathbf{W}^*(r) = 0.01\mathbf{I}_{p_r}$ and T = 100 (first row, third column), the value of LPL for DAG5 is smaller than the value for the other DAGs, but there is not a significant difference between the values of LPL for DAG3 and DAG4 when $\delta = 1$, which we could deduce anyway since these models are Markov equivalent (see e.g. Ali et al., 2009). In contrast, there are important differences between the LPLs of the DAGs when dynamic data are fitted with dynamic models, DAG3 having the largest value of LPL. In particular, MDMs appear to select the appropriate direction of connectivity with a high success rate. Note however that their performance varies as a function of the innovation variance and sample size (note that the distance between the lines of the DAGs changes from one graph to another). In general we find that these directionalities are more difficult to detect than the existence of such a connection: a phenomenon noticed for real series as well.
Figure 6 shows the percentage of replications in which the correct DAG was selected, DAG1 and DAG3, considering 2 nodes (a) and 3 nodes (b) respectively. As might be expected, the larger the sample size, the higher the chance of identifying the true DAG correctly. But T has the largest impact on the results when the dynamics of the data are very slowly changing ($\mathbf{W}^*(r) = 0.001\mathbf{I}_{p_r}$). An interesting result concerns the different values of the innovation variance. When the connectivity does not change over time ($\mathbf{W}^*(r) = 0\mathbf{I}_{p_r}$), all three DAGs or the two Markov equivalent DAGs were selected for the overwhelming majority of replications (around 95%). However, there is a sharp increase in performance from the model where $\mathbf{W}^*(r) = 0\mathbf{I}_{p_r}$ (static model) to the model where $\mathbf{W}^*(r) = 0.001\mathbf{I}_{p_r}$, and it continues to improve as $\mathbf{W}^*(r) = 0.01\mathbf{I}_{p_r}$. This demonstrates that the additional dynamic structure allows causal interactions to be identified more clearly.
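The replication experiment summarized in Figure 6 can be sketched as below, reusing node_lpl and simulate_dag3 from the earlier sketches. The exact edge orientations of DAG4 and DAG5 are not spelled out in the text, so the reversed chain and the collider used here are assumptions chosen only to provide a Markov equivalent and a non-equivalent competitor; the discount grid is also ours.

```python
import numpy as np

# Candidate parent sets (0-based node indices: 0 = Y(1), 1 = Y(2), 2 = Y(3)).
# DAG3 is the generating chain; DAG4/DAG5 orientations are illustrative guesses.
DAGS = {
    "DAG3": {0: [], 1: [0], 2: [1]},    # 1 -> 2 -> 3
    "DAG4": {0: [1], 1: [2], 2: []},    # 3 -> 2 -> 1, Markov equivalent statically
    "DAG5": {0: [], 1: [0, 2], 2: []},  # 1 -> 2 <- 3, a collider, not equivalent
}

def score_dag(Y, parents, grid=np.arange(0.6, 1.001, 0.02)):
    """Joint LPL of a DAG: sum of per-node scores, each maximised over delta."""
    T = Y.shape[0]
    total = 0.0
    for node, pa in parents.items():
        X = np.column_stack([np.ones(T)] + [Y[:, j] for j in pa])
        total += max(node_lpl(Y[:, node], X, delta=d) for d in grid)
    return total

def selection_rate(n_rep=100, T=300, W_star=0.001):
    """Fraction of replications in which the generating DAG3 scores highest."""
    wins = 0
    for i in range(n_rep):
        Y = simulate_dag3(T=T, W_star=W_star, seed=i)
        scores = {name: score_dag(Y, pa) for name, pa in DAGS.items()}
        wins += max(scores, key=scores.get) == "DAG3"
    return wins / n_rep
```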
5 The Use of Diagnostics in an MDM using Real fMRI Data
In the last section we demonstrated that if the data generating process was indeed an MDM,
then we can expect standard BF model selection techniques to find the generating model. In this
section we apply the MDM to a typical real data set. We demonstrate how the model selection
still appears to work well, provided that the methodology is used in conjunction with diagnostic
checks.
In the past Cowell et al. (1999) have argued convincingly that when fitting graphical models
it is extremely important to customize diagnostic methods not only to determine whether the
model appears to be capturing the data generating mechanism well but also to suggest embellishments of the class that might fit better. Their preferred methods are based on one step ahead
prediction. These can be simply modified to give analogous diagnostics for use in our dynamic
context. Here we give three types of diagnostic monitors based on analogues for probabilistic
networks (Cowell et al., 1999).
First the global monitor is used to compare networks. After identifying a DAG providing the
best explanation over LMDM candidate models, the predicted relationship between a particular
node and its parents can be explored through the parent-child monitor. Finally the node monitor
diagnostic can indicate whether or not the selected model in a global monitor fits adequately.
If this is not so then a more complex model will be substituted in a way illustrated below, and
the search repeated.
I - Global Monitor
Figure 5: The log predictive likelihood versus different values of the discount factor (DF or $\delta$), for DAG3 (solid lines), DAG4 (dashed lines) and DAG5 (dotted lines). The sample size increases from the first to the last row whilst the dynamic level (innovation variance) increases from the first to the last column. The range of the y-axis (LPL) has the same size of 500 in all panels.
'!!"
'!!"
!"#$%#&'#()*+#$,-$$',&&#')$.#/#'0,($
!"#$%#&'#()*+#$,-$$',&&#')$.#/#'0,($
&!"
&!"
%!"
%!"
)*'!!"
)*#!!"
$!"
$!"
)*+!!"
#!"
#!"
!"
!"
!"
!(!!'"
!(!'"
!('"
!"
!(!!'"
!(!'"
12$
12$
(a) 2 Nodes (b) 3 Nodes )*'!!"
)*#!!"
)*+!!"
!('"
Figure 6: The number out of 100 replications in which the correct DAG was selected, for 2 nodes (a - DAG1) and 3 nodes (b - DAG3), considering different values of the innovation variance ($W^*$) and sample size (T).
The first stage of our analysis is to select the best candidate DAG using simple LMDMs as described in the last section and Bayes' factor scores. It is well known that for BF techniques to be successfully applied to the selection of non-stochastic graphs from real data, the prior distributions on the hyperparameters of candidate models sharing the same features must first be matched (Heckerman, 1999). If this is not done then one model can be preferred to another not for structural reasons but for quite spurious ones. This is also true for the dynamic class of models we fit here.
However fortunately the dynamic nature of the class of MDM actually helps dilute the misleading effect of any such mismatch since after a few time steps, evidence about the conditional
variances and the predictive means is discounted and the marginal likelihood of each model
usually repositions itself. In particular the different priors usually have only a small effect on
the relative values of subsequent conditional marginal likelihoods. We describe below how we
have nevertheless matched priors to minimize this small effect in the consequent Bayes’ factor
scores driving the model selection.
Figure 7: A graphical structure considering 4 nodes.
Just as for BNs, to match priors we can exploit a useful decomposition of the Bayes' factor score for MDMs. By equation (1), the joint log predictive likelihood can be written as the sum of the log predictive likelihoods of each observation series given its parents: a modularity property. Therefore when new features are incorporated within the model class, the relative scores of such models discriminate only on the components of the model where they differ. Thus consider again the graphical structure in Figure 7. BFs score each of these four components separately
and then these are summed to give the full model score. For instance, suppose the LMDM is updated because node 3 exhibits heteroscedasticity. On observing this violation, the conditional one-step forecast distribution for node 3 can be replaced by one relating to a more complex model. Thus, for example, a new one-step-ahead forecast density is $p^*_{t3}(y_t(3)\,|\,\mathbf{y}^{t-1}, y_t(1), y_t(2))$ of $(Y_t(3)\,|\,\mathbf{y}^{t-1}, y_t(1), y_t(2)) \sim T_{n_{t-1}(3)}(f_t(3), Q^h_t(3))$, where the parameters $f_t(3)$ and $n_{t-1}(3)$ are defined as before, but $Q^h_t(3)$ is now defined as a function of a random variance, say $k_{t3}(\mathbf{F}_t(3)'\boldsymbol{\theta}_t(3))$ (see details in West and Harrison, 1997, section 10.7). The log Bayes' factor comparing the original model with the model modified to account for heteroscedasticity can then be calculated as
$$\log(\mathrm{BF}) = \sum_{t=1}^{T} \log p_{t3}(y_t(3)\,|\,\mathbf{y}^{t-1}, y_t(1), y_t(2)) - \sum_{t=1}^{T} \log p^*_{t3}(y_t(3)\,|\,\mathbf{y}^{t-1}, y_t(1), y_t(2)).$$
Thus, provided we set prior densities over the same component parameters over different
models, because the model structure is common for all other nodes, the BF discriminates between
two models by finding the one that best fits the data only from the component where they differ:
Region                   Parent     Score
1 - Visual               No         -418.09
                         2          -443.66
                         3          -451.20
                         2 and 3    -455.52
2 - DMN                  No         -421.93
                         1          -444.68
                         3          -443.57
                         1 and 3    -444.39
3 - Executive Control    No         -414.11
                         1          -412.40
                         2          -407.16
                         1 and 2    -407.09
Table 1: Evidence for each region under all possible sets of parents. The score is calculated as LPL[Y(r) | Pa(r)], and LPL = LPL[Y(1) | Pa(1)] + LPL[Y(2) | Pa(2)] + LPL[Y(3) | Pa(3)]. The higher the score, the stronger the evidence for that particular model.
in our example the component associated with node 3. This simplifies the search across our model space significantly, and in this small example means that an exhaustive search is trivial.
The setting of the distributions for the hyperparameters of different candidate parent sets is not critical to the BF model selection provided that the early predictive densities are comparable. We have found that a very simple way of achieving this is to set the prior covariance matrices over the regression parameters of each model a priori to be independent with a shared variance. Note that the hyperparameters and the parameter $\delta$ of nodes 1, 2 and 4 were the same for both models: homoscedastic and heteroscedastic for node 3. Many numerical checks have convinced us that the results of the model selection we describe above are insensitive to these settings provided that the high scoring models pass diagnostic tests, some of which we discuss below.
To illustrate the model selection process for the MDM, the real resting state data described in Section 2 were used. The MDMs were fitted using a weakly informative prior with $n_0(r) = d_0(r) = 0.001$ and $\mathbf{C}^*_0(r) = 3\mathbf{I}_{p_r}$ for all r. The discount factor $\delta$ was estimated for each node and each graphical structure. Table 1 shows the score, calculated as the LPL, for each region under all possible sets of parents. The best scoring model has the variables associated with regions 1 and 2 with no parents and region 3 with the other two regions as its parents, see Figure 8. We found small values of the discount factor for regions 1 and 2. This suggests that these processes might be poorly described by an LMDM. However, because regions 1 and 2 are founders, i.e. the ones with no parents, under the fitted model their dynamics are likely to be driven exogenously. Such an exogenous process is unlikely to be well described by a steady model (West and Harrison, 1997), as assumed by default in the LMDM. But note that the effect on the comparative model score of fitting a more elaborate DLM (or other model) to these nodes will cancel for all graphs having these variables as founders. It follows that we can postpone the selection of a better model for those univariate processes to a subsequent stage in the modeling process, see below.
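The exhaustive scoring behind Table 1 can be sketched as follows, again reusing node_lpl from the sketch in Section 3. The function name, the discount grid and the assumed column ordering of the data matrix Y (Visual, DMN, Executive Control) are illustrative choices of ours; for larger graphs the chosen parent sets would additionally have to be checked for acyclicity.

```python
from itertools import combinations
import numpy as np

def parent_set_scores(Y, grid=np.arange(0.6, 1.001, 0.02)):
    """LPL of every region under every possible parent set, as in Table 1.

    Returns a dict: scores[r][parent_tuple] = best LPL over the discount grid.
    By the modularity in equation (1), the joint score of any DAG is obtained
    by summing one entry per region.
    """
    T, n_regions = Y.shape
    scores = {}
    for r in range(n_regions):
        others = [j for j in range(n_regions) if j != r]
        scores[r] = {}
        for k in range(len(others) + 1):
            for pa in combinations(others, k):
                X = np.column_stack([np.ones(T)] + [Y[:, j] for j in pa])
                scores[r][pa] = max(node_lpl(Y[:, r], X, delta=d) for d in grid)
    return scores
```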
II - Parent-child Monitor
The modularity of the MDM ensures that, to examine the relationship between a particular node and its parents, we need only consider diagnostics associated with the performance of that component. Let $Pa(\mathbf{Y}(r)) = \{\mathbf{Y}_{pa(r)}(1), \dots, \mathbf{Y}_{pa(r)}(p_r - 1)\}$; then
$$\log(\mathrm{BF})_{ri} = \log p_r\{\mathbf{y}(r)\,|\,Pa(\mathbf{y}(r))\} - \log p_{ri}\{\mathbf{y}(r)\,|\,Pa(\mathbf{y}(r)) \setminus \mathbf{y}_{pa(r)}(i)\},$$
for $r = 1, \dots, n$ and $i = 1, \dots, p_r - 1$, where $Pa(\mathbf{Y}(r)) \setminus \mathbf{Y}_{pa(r)}(i)$ denotes the set of all parents of $\mathbf{Y}(r)$ excluding the parent $\mathbf{Y}_{pa(r)}(i)$.

Figure 8: The best DAG that was selected in the global monitor: the visual region and DMN are the parents of executive control (DAG6).
To illustrate the use of this monitor, suppose we want to confirm the parent-child relations for region 3. Figure 9 provides estimates of the relevant connectivities over time. The connectivity from the visual region to the executive control region (Figure 9(a)) appears not to be significant in the second half of the series. The significance of this connectivity is reflected in
$$\log(\mathrm{BF})_{31} = \log p_3\{\mathbf{y}(3)\,|\,\mathbf{y}(1), \mathbf{y}(2)\} - \log p_{31}\{\mathbf{y}(3)\,|\,\mathbf{y}(2)\}.$$
The cumulative log(BF) shows strong evidence for DAG6, which has both the visual region and the DMN as parents of the executive control region, at the beginning of the series.
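A sketch of this parent-child check in code, reusing node_lpl from the earlier sketch; the column ordering of Y and the fixed discount factor are assumptions made only for illustration.

```python
import numpy as np

def parent_child_monitor_3_1(Y, delta=0.95):
    """log(BF)_{31}: score of region 3 with parents {1, 2} minus its score
    when the visual region (parent 1) is dropped from the parent set.

    Y is a (T, 3) matrix with columns ordered (Visual, DMN, Executive Control).
    """
    T = Y.shape[0]
    X_full = np.column_stack([np.ones(T), Y[:, 0], Y[:, 1]])  # parents {1, 2}
    X_drop = np.column_stack([np.ones(T), Y[:, 1]])           # parent {2} only
    return node_lpl(Y[:, 2], X_full, delta) - node_lpl(Y[:, 2], X_drop, delta)
```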
Figure 9: The filtered (blue) and smoothed (green) posterior means of the regression parameters: (a) $\theta^{(2)}(3)$ - connectivity (Y(1), Y(3)), the visual region influencing the executive control region, and (b) $\theta^{(3)}(3)$ - connectivity (Y(2), Y(3)), the DMN influencing the executive control region, with change points (dashed lines).
Observe now the connectivity from the DMN to the executive control region in Figure 9(b). There is a sharp increase in the strength of this connectivity in the first quarter of the series and the connectivity remains significant subsequently. This apparent dynamic change in the strength of connectivity between different regions of the brain has been conjectured elsewhere (Ge et al., 2009; Chang and Glover, 2010) and is fully consistent with the analyzed dataset. Note that this dynamic class of models can explicitly accommodate these changes in strength through the dynamic movement of the regression coefficients towards and away from zero.
III - Node Monitor
Again the modularity ensures that the model for any given node can be embellished based on a residual analysis. For instance, consider a non-linear structure for a founder node r. On the basis of the partial autocorrelation of the residuals of the logarithm of the series, a more sophisticated model of the form
$$\log Y_t(r) = \theta_t^{(1)}(r) + \theta_t^{(2)}(r)\,\log Y_{t-1}(r) + v_t(r)$$
suggests itself. Note that this model still provides a closed-form score for these components. These models and their corresponding estimates can then be substituted for the original steady models to provide a much better scoring dynamic model, whilst still respecting the same causal structure as in the original analysis.
Denoting $\log Y_t(r)$ by $Z_t(r)$, the conditional one-step forecast distribution for $Z_t(r)$ can then be calculated using a DLM on the transformed series $\{Z_t\}$. More generally, if we hypothesize that $Y_t(r)$ can be written as a continuous and monotonic function of a transformed series $Z_t(r)$, say $Y_t(r) = g(Z_t(r))$, then the conditional one-step forecast cumulative distribution for $Y_t(r)$ can be found through
$$F_{Y_t(r)}(y) = P_{tr}(Y_t(r) \leq y\,|\,\mathbf{y}^{t-1}, \mathbf{x}_t(r)) = P^*_{tr}(Z_t(r) \leq g^{-1}(y)\,|\,\mathbf{y}^{t-1}, \mathbf{x}_t(r)) = F_{Z_t(r)}(g^{-1}(y)).$$
So $p^*_{tr}(y_t(r)\,|\,\mathbf{y}^{t-1}, \mathbf{x}_t(r))$, the conditional one-step forecast density of $Y_t(r)$ under this new model, can be calculated explicitly (see details in West and Harrison, 1997, section 10.6). It is also possible to keep the previous time series $Y_t(r)$ Gaussian and simply regress on terms like $y^2\cos y$, $y\sin y$ and so on for previous values y. These non-linear functional relationships still give rise to closed-form predictives in the Bayesian DLM and so in the MDM. These therefore give an extremely flexible class of MDMs with closed-form BFs that can be used to embellish the processes demonstrated in the Global Monitor above.
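As an illustration, the log-autoregressive founder embellishment above could be scored as follows, reusing node_lpl from the earlier sketch. This assumes the founder series is strictly positive, so that the log transform applies, and corrects the score by the Jacobian of the transform so that it remains comparable with scores computed on the original scale.

```python
import numpy as np

def founder_log_ar_lpl(y, delta=0.95):
    """LPL of the embellished founder model
    log Y_t = theta^(1)_t + theta^(2)_t log Y_{t-1} + v_t.

    The series y must be strictly positive. The score is computed on the
    log scale and adjusted by the Jacobian d(log y)/dy = 1/y, so that
    log p_Y(y_t) = log p_Z(log y_t) - log y_t.
    """
    z = np.log(y)
    X = np.column_stack([np.ones(len(z) - 1), z[:-1]])  # intercept, lagged value
    lpl_z = node_lpl(z[1:], X, delta=delta)
    return lpl_z - np.sum(np.log(y[1:]))
```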
When the variance is unknown, the conditional forecast distribution is a noncentral t distribution with location parameter $f_t(r)$, scale parameter $Q_t(r)$ and degrees of freedom $n_{t-1}(r)$. The one-step forecast error is defined as $e_t(r) = Y_t(r) - f_t(r)$ and the standardized conditional one-step forecast error as $e_t(r)/Q_t(r)^{1/2}$. The assumption underlying the DLM is that the standardized conditional one-step forecast errors have approximately a Gaussian distribution when $n_{t-1}(r)$ is large, and are serially independent with constant variance (West and Harrison, 1997; Durbin and Koopman, 2001). We noted from the QQ plot that there is no strong evidence that the normality assumptions made in this analysis are violated.
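These diagnostics can be computed with a small variant of the earlier filtering sketch that records, at each time point, the standardized one-step forecast error and the log one-step forecast density (the latter is reused in the change-point sketch below). Again this is a hedged illustration with our own naming and defaults rather than the authors' code; np.cumsum of the returned errors gives the CuSum trace plotted in Figure 10.

```python
import numpy as np
from scipy import stats

def node_monitor(y, X, delta=0.95, c0=3.0, n0=1e-3, d0=1e-3):
    """Per-time diagnostics for one node: standardized one-step forecast
    errors e_t / sqrt(Q_t) and log one-step forecast densities.

    Uses the same recurrences as node_lpl. If the node's model is adequate,
    the standardized errors should look like serially independent, roughly
    standard Gaussian noise; their cumulative sum is the CuSum of Figure 10.
    """
    T, p = X.shape
    m, C = np.zeros(p), c0 * np.eye(p)
    n, S = n0, d0 / n0
    std_err, logdens = np.empty(T), np.empty(T)
    for t in range(T):
        F = X[t]
        R = C / delta
        f = F @ m
        q = F @ R @ F + 1.0
        Q = S * q
        e = y[t] - f
        std_err[t] = e / np.sqrt(Q)
        logdens[t] = stats.t.logpdf(y[t], df=n, loc=f, scale=np.sqrt(Q))
        A = R @ F / q
        m = m + A * e
        n += 1
        S = S + (S / n) * (e * e / Q - 1.0)
        C = R - np.outer(A, A) * q
    return std_err, logdens
```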
Perhaps of most scientific interest is that the plots of the cumulative sum of the standardized errors suggest the existence of change points, see Figure 10. This is most strongly evidenced near t = 15 for the executive control, region 3. West and Harrison (1997) suggest a simple method whereby, when the BF or the cumulative BF falls below a particular threshold, a change point can be thought to have occurred. Adopting this method, comparing the graph in which the visual region and DMN are the parents of executive control (DAG6) with the graph in which the executive control region has no parents, and using a threshold of 0.2, four time points were suggested as change points. We show the two most interesting points in Figure 9. It is straightforward to run a new MDM with a change point at an identified point simply by increasing the state variance of the corresponding system error at that point (West and Harrison, 1997, chapter 11). Note that although this naive approach seems to deal adequately with the identified change points, it can obviously be improved, for example by using the full power of switching state space models (see e.g. Fruhwirth-Schnatter, 2006, chapter 13) to model this apparent phenomenon more formally.
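A simplified sketch of this monitor is given below: the per-time log forecast densities of region 3 under DAG6 and under the no-parent alternative (both obtainable from node_monitor above) drive a cumulative Bayes factor that is capped at one and flags a change point whenever it drops below the chosen threshold (0.2 in the text). The capping-and-reset behaviour is our simplification of West and Harrison's sequential monitor, and the subsequent intervention (inflating the state variance at the flagged times) is only indicated in a comment.

```python
import numpy as np

def flag_change_points(logdens_model, logdens_alt, threshold=0.2):
    """Times at which the cumulative Bayes factor of the selected model
    against an alternative drops below `threshold`.

    logdens_model, logdens_alt: per-time log one-step forecast densities of
    the same node under the two models (e.g. from node_monitor above).
    """
    log_thr = np.log(threshold)
    cum, flags = 0.0, []
    for t, (a, b) in enumerate(zip(logdens_model, logdens_alt)):
        cum = min(cum + (a - b), 0.0)  # cap the cumulative log BF at 0
        if cum < log_thr:
            flags.append(t)
            cum = 0.0                  # restart the monitor after a flag
            # In a refit, one would inflate the state variance (e.g. use a
            # much smaller discount) at these times to absorb the change.
    return flags
```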
Typically of course the model space will be much larger and the underlying DAG will involve more vertices. However, we then simply replicate the methodology described above, iteratively performing the same diagnostic checks on the more numerous components of the model in exactly the same way as we have described.
Figure 10: Time series plots, ACF plots and cumulative sums of the standardized one-step-ahead conditional forecast errors, with one region (Visual, DMN, Executive Control) in each column.
6 Conclusions
In this paper we have demonstrated how the MDM can provide a flexible and relatively realistic class of models for selecting between different potential network connections in the brain, known to vary with time, on the basis of fMRI data. The conditional closure of the associated score functions makes model selection relatively fast. Diagnostic statistics for checking, and where necessary adapting, the whole class are also straightforward, as demonstrated above.
The application of these dynamic models to large scale datasets is still in its infancy. In this
paper we chose a simple example where exhaustive search was possible. However for problems
with up to 20 variables there are some challenges still remaining if we continue to use naive search
methods. First, for even moderate sized problems the number of possible causal models becomes
very large when the number of brain regions increases. For instance, Ramsey et al. (2010) counted 2,609,107,200 alternative causal DAG structures for 10 nodes, and for MDMs the number of distinguishable models is even larger. Even after using prior information to limit this number to scientifically plausible ones, it will usually be necessary to develop greedy search algorithms to
guide the selection. Currently for these problems we use a greedy search algorithm called AHC
(Heard et al., 2006) which is very quick in this context. However the authors are now exploring
a number of customized methods for performing a full search of the LMDM space. We note that, because of the close association of the MDM with the BN, the modularity of their associated Bayes' factor scores can be exploited using integer programming and dynamic programming algorithms respectively (see Cussens, 2011, and Cowell, 2013); we will report on these methods at a later date.
Second, there are exciting possibilities for using the model class to perform even more refined selection. For example, often the main focus of interest in these experiments includes not only a search for the likely model of a specific individual but also an analysis of shared between-subject effects. Currently such features are analyzed using rather coarse aggregation methods
over shared time series. Using multivariate hierarchical models and Bayesian hyperclustering
techniques however it is possible to use the full machinery of Bayesian methods to formally make
inferences in a coherent way which acknowledges hypotheses about shared dependences between
such populations of subjects. Early results we have obtained building such hierarchical models
are promising and again will be reported later.
References
Ali, R. A., Richardson, T. S., Spirtes, P., 2009. Markov equivalence for ancestral graphs. The Annals of
Statistics, vol. 37, 2808-2837.
Chang, C., Glover, G.H., 2010. “ Time-frequency dynamics of resting-state brain connectivity measured with
fMRI”. NeuroImage, 50: 81-98.
Cole, D.M., Smith, S.M., Beckmann, C. F., 2010. Advances and pitfalls in the analysis and interpretation of
resting-state FMRI data. Frontiers in systems neuroscience, 4(April), 8. doi:10.3389/fnsys.2010.00008
Cowell, R.G., 2013. A simple greedy algorithm for reconstructing pedigrees. Theoretical Population Biology.
83. 55-63.
Cowell, R.G., Dawid, A.P., Lauritzen, S.L., and Spiegelhalter, D.J., 1999. Probabilistic Networks and Expert
Systems. Springer-Verlag, New York.
Cussens, J., 2011. Bayesian network learning with cutting planes. In Fabio G. Cozman and Avi Pfeffer, editors,
Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pages 153-160,
Barcelona. AUAI Press.
Durbin, J., Koopman, S.J., 2001. Time Series Analysis by State Space Methods. Oxford University Press,
Oxford.
Friston, K.J., 2011. Functional and Effective Connectivity: a review. Brain Connectivity, 1(1): 13-36.
doi:10.1089/brain.2011.0008.
Fruhwirth-Schnatter, S., 2006. Finite Mixture and Markov Switching Models, Springer.
16
CRiSM Paper No. 13-06, www.warwick.ac.uk/go/crism
Ge, T., Kendrick, K. M., Feng, J., 2009. “A novel extended Granger causal model approach demonstrates
brain hemispheric differences during face recognition learning ”. PLoS Comput. Biol. 5, e1000570.
doi:10.1371/journal.pcbi.1000570
Heard, N. A., Holmes, C. C., Stephens, D. A., 2006. A quantitative study of gene regulation involved in the
immune response of Anopheline mosquitoes: An Application of Bayesian Hierarchical Clustering of
Curves. Journal of the American Statistical Association, 101, 473, 18-29.
Heckerman, D., 1999. A Tutorial on Learning with Bayesian Networks. In Learning in Graphical Models, M.
Jordan, ed. MIT Press, Cambridge, MA.
Jeffreys, H., 1961. Theory of Probability (3rd ed.). Oxford University Press, London.
Lauritzen, S. L., 1996. Graphical Models. Oxford, United Kingdom: Clarendon Press.
Petris, G., Petrone, S., Campagnoli, P., 2009. Dynamic Linear Models with R. Springer, New York.
Poldrack, R.A., Mumford, J.A., Nichols, T.E., 2011. Handbook of fMRI Data Analysis, Cambridge University
Press.
Queen, C.M., and Albers, C.J., 2008. Forecast covariances in the linear multiregression dynamic model. J.
Forecast., 27: 175-191. doi: 10.1002/for.1050.
Queen, C.M., and Albers, C.J., 2009. Intervention and causality: Forecasting traffic flows using a dynamic
bayesian network. Journal of the American Statistical Association. June. 104(486), 669-681.
Queen, C.M., and Smith, J.Q., 1993. “Multiregression dynamic models”. Journal of the Royal Statistical
Society, Series B, 55, 849-870.
Ramsey, J.D., Hanson, S.J., Hanson, C., Halchenko, Y.O., Poldrack, R.A., and Glymour, C., 2010. “Six
Problems for Causal Inference from fMRI”. NeuroImage, 49: 1545-1558.
Smith, S.M., Fox, P.T., Miller, K.L., Glahn, D.C., Fox, P.M., Mackay, C.E., Filippini, N., Watkins, K.E., Toro,
R., Laird, A.R., Beckmann, C.F., 2009. Correspondence of the brain’s functional architecture during
activation and rest. Proc Natl Acad Sci U S A, 106(31), 13040-5.
Smith, S. M., Miller, K. L., Salimi-Khorshidi, G., Webster, M., Beckmann, C., Nichols, T., Ramsey, J.,
Woolrich, M., 2011. Network modeling methods for FMRI. NeuroImage, 54(2), 875-891.
West, M., and Harrison, P.J., 1997. Bayesian Forecasting and Dynamic Models (2nd ed.) Springer-Verlag. New
York.