On the Selection of Multiregression Dynamic Models of fMRI Networked Time Series

Lilia Costa, The University of Warwick, UK; Universidade Federal da Bahia, BR
Jim Smith, The University of Warwick, UK
Thomas Nichols, The University of Warwick, UK

March 20, 2013

E-mail: L.C.Carneiro-da-Costa@warwick.ac.uk. This research was partially supported by CAPES, Brazil.

Abstract

A Multiregression Dynamic Model (MDM) is a class of multivariate time series models that allows various dynamic causal processes to be represented in a graphical way. In contrast with many other Dynamic Bayesian Networks, the hypothesized relationships accommodate conditionally conjugate inference. This means that it is straightforward to search over many different connectivity networks, with dynamically changing intensities of transmission, to find the MAP model within a class of models. In this paper we customize this conjugate search within scientific models describing the dynamic connectivity of the brain. As well as demonstrating the efficacy of our dynamic models, we illustrate how diagnostic methods, analogous to those defined for static Bayesian Networks, can be used to suggest embellishments of the model class that extend the process of model selection.

Keywords: Multiregression Dynamic Model, Bayesian Network, Markov Equivalent Graph, Model Selection, Functional magnetic resonance imaging (fMRI).

1 Introduction

We consider the application of a class of Dynamic Bayesian Network (DBN) models, called the Multiregression Dynamic Model (MDM), to resting state functional Magnetic Resonance Imaging (fMRI) data. Functional MRI consists of a dynamic acquisition, i.e. a series of images, which provides a time series at each volume element. These data reflect the blood oxygenation level, which is related to the activity of brain neurons. A traditional fMRI experiment consists of alternating periods of active and control experimental conditions, and its purpose is to compare brain activity between two different cognitive states (e.g. remembering a list of words versus just passively reading a list of words). In contrast, a "resting state" experiment is conducted by having the subject remain in a state of quiet repose, and the analysis focuses on understanding the pattern of connectivity among different cerebral areas. The ultimate (and ambitious) goal is to understand how one neural system influences another (Poldrack et al., 2011).

Some studies assume that the connection strengths between different brain regions are constant. These static models are used to define dependence networks, a subset of which is then revised to form a more scientifically accurate dynamic model (Smith et al., 2011; Friston, 2011). However, a more promising strategy would clearly be to first perform a search over a class of models rich enough to capture the dynamic changes in the connectivity strengths that are known to exist in this application. The Multiregression Dynamic Model (MDM) can do just this (Queen and Smith, 1993; Queen and Albers, 2009), and in this paper we demonstrate how it can be applied to resting fMRI.

Currently the most popular approach is to hypothesize relationships with an undirected graph or a directed acyclic graph (DAG). Smith et al. (2011) compared different connectivity estimation approaches for fMRI data.
Whilst they found that BNs are one of the most successful methods for detecting (undirected) network edges, none of the methods (except Patel's τ) was remotely successful in estimating connection direction. In this paper we describe what is, to our knowledge, the first application of Bayes' factor MDM search in this domain. As with standard BNs, the Bayes' factor of an MDM can be written in closed form, so the model space can be scored quickly. However, unlike a static BN, the MDM models dynamic links and so allows us to discriminate between models that would be Markov equivalent in their static versions. Furthermore, the directionality exhibited in the graph of an MDM can be associated with a causal directionality in a very natural way (Queen and Albers, 2009) that is also scientifically meaningful. We are therefore able to demonstrate that the MDM is a useful method not only for detecting the existence of brain connectivity but also for estimating its direction.

This paper also presents new prequential diagnostics for this model class, analogous to those originally developed for static BNs (Cowell et al., 1999), using the closed form of the one-step ahead predictive distribution. It is well known that Bayes' factor model selection methods can break down whenever no representative in the associated model class fits the data well. It is therefore extremely important to check that selected models are consistent with the observed series. Here we recommend initially selecting across a class of simple linear MDMs which are time homogeneous, linear and have no change points. We then check the best model using these new prequential diagnostics. In practice we have found that linear MDMs usually perform well for most nodes receiving inputs from other nodes. However, when the diagnostics discover a discrepancy of fit, the MDM class is sufficiently expressive that it can be embellished to accommodate other anomalous features. For example, it is possible to include time dependent error variances, change points, interaction terms in the regression and so on, to better reflect the underlying model and refine the analysis. Often, even after such embellishment, the model still stays within a conditionally conjugate class. So if our diagnostics identify serious deviation from the highest scoring simple MDM, we can adapt it and its high scoring neighbors with features explaining the deviations. The model selection process using Bayes' factors can then be reapplied to discover models that describe the process even better. In this way we can iteratively augment the fitted model and its highest scoring competitors with embellishments until the search class accommodates the main features observed in the dynamic processes well.

The remainder of this paper is structured as follows. Section 2 describes the resting state data we use here and Section 3 reviews the class of MDMs. The performance of the MDM will be investigated using synthetic data in Section 4. Section 5 then develops diagnostic statistics for an MDM using real data that can be applied to the model selection as described above. Directions for future work are given in Section 6.

2 Resting State Data

This study focuses on fMRI time series data, where 36 healthy adults at rest were observed over 6-minute intervals (Smith et al., 2009). The first step in the study of fMRI connectivity is data reduction.
This is usually achieved by summarizing the data as a set of time series derived from predefined regions of interest (ROIs). The choice of which ROI set to use is somewhat arbitrary, poorly chosen ROIs may mix heterogeneous brain regions and, further, an ROI-based network cannot easily represent spatially overlapping networks (Smith et al., 2011). An alternative is to use Independent Components Analysis (ICA) to generate data-driven spatial patterns that describe the structure of local and long-range connectivity (Cole et al., 2010). Here we have used such an ICA approach. Each subject's image data were transformed into a standard atlas space, and then all subjects' data were concatenated temporally. After an initial principal components data reduction to 20 components, ICA produced a set of 20 matched pairs of spatial components (that are orthogonal and maximally statistically independent) and temporal loadings (that may be correlated). Of these 20, 10 spatial components were identified as well-known resting-state networks (Smith et al., 2009). The temporal loadings, split back into 36 separate time series, express the temporal evolution of the corresponding spatial pattern in each subject and were the source of the data for our MDM modeling.

To demonstrate the methodology developed in this paper, it is sufficient to consider the networks of just three of these components. Henceforth we refer to these components as regions (to distinguish them from the network we build between these three regions): region 1, a visual network composed of medial, occipital pole and lateral visual areas (comprised of the average of the 3 visual networks in Smith et al., 2009); region 2, the "default mode network" (DMN), comprising posterior cingulate, bilateral inferior-lateral-parietal and ventromedial frontal areas; and region 3, an "executive control" network that covers several medial-frontal areas including anterior cingulate and paracingulate cortex (Figure 1). For each of the 36 subjects and for each of the 3 regions, we have a time series of length 176 that represents the neural activity; for a given subject we write the time series as $Y(1)$, $Y(2)$ and $Y(3)$ for the 3 regions.

Figure 1: Three networks from independent component analysis of the 36-subject resting fMRI dataset (Smith et al., 2009): Region 1 - Visual (a), Region 2 - Default Mode Network (b) and Region 3 - Executive Control (c).

A simple preliminary analysis of these data clearly demonstrates the scientifically predicted, dynamically evolving changes in the strength of dependency between these series. For example, Figure 2 shows estimates of regression coefficients for one of the subjects, obtained from simple regression models fitted to estimate the linear relationships between $Y(1)$ and $Y(2)$, between $Y(1)$ and $Y(3)$, and between $Y(2)$ and $Y(3)$, based on a moving window of 30 time points. The plots clearly exhibit some large drifts in dependency strengths over time. This strongly suggests that a flexible time-varying model such as the simple linear MDM defined below should be used for this application.
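As an aside, a minimal sketch of this kind of moving-window check is given below. It uses ordinary least squares within each window rather than the Bayesian fit reported in Figure 2, and it assumes (purely for illustration) that the three temporal loadings of one subject are held in a NumPy array y of shape (176, 3); variable names are our own.

```python
import numpy as np

def moving_window_slopes(x, y, window=30):
    """OLS slope of y on x within each sliding window of the given length."""
    T = len(x)
    slopes = np.full(T - window + 1, np.nan)
    for t in range(T - window + 1):
        xw, yw = x[t:t + window], y[t:t + window]
        # slope = sample cov(x, y) / sample var(x) within the window
        slopes[t] = np.cov(xw, yw)[0, 1] / np.var(xw, ddof=1)
    return slopes

# y: (176, 3) array of temporal loadings for regions 1-3 of one subject,
# giving 176 - 30 + 1 = 147 windowed estimates, as in Figure 2.
# slopes_12 = moving_window_slopes(y[:, 0], y[:, 1])   # Y(1) -> Y(2)
# slopes_13 = moving_window_slopes(y[:, 0], y[:, 2])   # Y(1) -> Y(3)
# slopes_23 = moving_window_slopes(y[:, 1], y[:, 2])   # Y(2) -> Y(3)
```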
Figure 2: The posterior mean and 95% HPD intervals for the regression parameters between every pair of regions, over 147 time intervals: (a) Visual → DMN; (b) Visual → Executive Control; (c) DMN → Executive Control.

3 The Multiregression Dynamic Model

The MDM (Queen and Smith, 1993) is a graphical multivariate model for an $n$-dimensional time series $Y_t(1), \ldots, Y_t(n)$, $t = 1, \ldots, T$. It is a composition of simpler univariate dynamic regression models (DLMs; West and Harrison, 1997), which can model smooth changes in the parents' effect on a given node during the period of investigation. There are five features of this model class that are useful for this study.

1. Each MDM is defined in part by a directed acyclic graph whose vertices are components of the series at a given time, and whose edges denote that contemporaneous (and possibly past) observations are included as regressors. These directed edges therefore denote that direct contemporaneous relationships might exist between these components, and the directionality of the edges is 'causal' in a sense that is carefully argued in Queen and Albers (2009).

2. Dependence relationships between each component and its contemporaneous parents, as represented by the corresponding regression coefficients, are allowed to drift with time. We note that this drift can be set to zero, in which case the MDM simplifies to a standard Gaussian graphical model.

3. Each one of these multivariate models admits a conjugate analysis. In particular, its marginal likelihood can be expressed as a product of multivariate Student t distributions, as detailed below. This closed form allows us to perform fast model selection.

4. Although the predictive distribution of each component given its parents is multivariate Student t, because covariates enter the scale function of these conditionals the joint distribution can be highly non-Gaussian (see e.g. Queen and Smith, 1993, for an example of this). Because ICA and related techniques applied to the study of fMRI data often work on the assumption that processes are not jointly Gaussian, the MDM is more compatible with these methods than many of its competitors that assume marginal Gaussianity.

5. The class of MDMs can be further modified to include other features that might be necessary, in a straightforward and convenient way: for example by adding dependence on the past of the parents (not just the present), allowing for change points, and other embellishments that we illustrate below.

At time $t$ the data at the $n$ regions are denoted by the column vector $Y_t' = (Y_t(1), \ldots, Y_t(n))$, with observed values $y_t' = (y_t(1), \ldots, y_t(n))$. Let the time series until time $t$ for region $r = 1, \ldots, n$ be $Y_t(r)' = (Y_1(r), \ldots, Y_t(r))$, and let the time series for the possible parents of region $r$ at time $t$ be $X_t(r)' = (Y_t(1), \ldots, Y_t(r-1))$ for $r = 2, \ldots, n$. The parent set of $Y_t(r)$ is written $Pa(r)$, and the $n$ regions can always be ordered to ensure that $Pa(r) \subseteq X_t(r)$. The MDM is defined by $n$ observation equations, a system equation and initial information (Queen and Smith, 1993).
The observation equations specify the time-varying regression of each region on its parents; the system equation is a multivariate autoregressive model for the evolution of the time-varying regression coefficients; and the initial information is given through a prior density for the regression coefficients. The linear multiregression dynamic model (LMDM; Queen and Albers, 2008) is specified in terms of a collection of conditional regression DLMs (West and Harrison, 1997) as follows. We write the observation equations as
$$Y_t(r) = F_t(r)'\theta_t(r) + v_t(r), \qquad v_t(r) \sim N(0, V_t(r)),$$
where $r = 1, \ldots, n$; $t = 1, \ldots, T$; $N(\cdot,\cdot)$ is a Gaussian distribution; $F_t(r)'$ is a covariate vector of dimension $p_r$ determined by $Pa(r)$, with $p_r = |Pa(r)| + 1$; the first element of $F_t(r)'$ is 1, representing an intercept, and the remaining elements are $Y_t(r^*)$ for $Y_t(r^*) \in Pa(r)$; $\theta_t(r)$ is the $p_r$-dimensional time-varying regression coefficient vector; and $v_t(r)$ is the independent residual error with variance $V_t(r)$. Concatenating the $n$ regression coefficient vectors as $\theta_t' = (\theta_t(1)', \ldots, \theta_t(n)')$ gives a vector of length $p = \sum_{r=1}^{n} p_r$. We next write the system equation as
$$\theta_t = G_t \theta_{t-1} + w_t, \qquad w_t \sim N(0, W_t),$$
where $G_t = \mathrm{blockdiag}\{G_t(1), \ldots, G_t(n)\}$, each $G_t(r)$ being a $p_r \times p_r$ matrix; $w_t$ are the innovations for the latent regression coefficients; and $W_t = \mathrm{blockdiag}\{W_t(1), \ldots, W_t(n)\}$, each $W_t(r)$ being a $p_r \times p_r$ matrix. The error $w_t$ is assumed independent of $v_s$ for all $t$ and $s$. For most of the development we need only consider $G_t(r) = I_{p_r}$, where $I_{p_r}$ is the $p_r$-dimensional identity matrix. Finally the initial information is written as
$$(\theta_0 \mid y_0) \sim N(m_0, C_0),$$
where $\theta_0$ expresses the prior knowledge of the regression parameters, before observing any data, given the information at time $t = 0$, i.e. $y_0$. The mean vector $m_0$ is an initial estimate of the parameters and $C_0$ is the $p \times p$ variance-covariance matrix. $C_0$ can be defined as $\mathrm{blockdiag}\{C_0(1), \ldots, C_0(n)\}$, with each $C_0(r)$ a $p_r \times p_r$ matrix.

When the observational variances are unknown and constant, i.e. $V_t(r) = V(r)$ for all $t$, defining $\phi(r) = V(r)^{-1}$ and adopting the prior
$$(\phi(r) \mid y_0) \sim G\!\left(\frac{n_0(r)}{2}, \frac{d_0(r)}{2}\right),$$
where $G(\cdot,\cdot)$ denotes a Gamma distribution, leads to a conjugate analysis in which, conditionally, each component of the marginal likelihood has a Student t distribution. In order to use this conjugate analysis it is convenient to reparameterise the model as $W_t(r) = V(r)W_t^*(r)$ and $C_0(r) = V(r)C_0^*(r)$. For a fixed innovation signal matrix $W_t^*(r)$ this change implies no loss of generality (West and Harrison, 1997).

It simplifies the analysis, and so the interpretation of the model class, to define the innovation signal matrix indirectly in terms of a single hyperparameter for each component DLM called a discount factor (West and Harrison, 1997; Petris et al., 2009), especially for the model selection purposes we have here. This vastly reduces the dimensionality of the model class whilst in practice often losing very little in the quality of fit. This well-used technique expresses different values of $W_t^*$ in terms of the loss of information in the change of $\theta$ between times $t-1$ and $t$. More precisely, for some $\delta \in (0, 1]$,
$$W_t^* = \frac{1-\delta}{\delta}\, C_{t-1}^*,$$
where $C_t(r) = V(r)C_t^*(r)$ is the posterior variance of $\theta_t(r)$. Note that when $\delta = 1$, $W_t^* = 0 I_p$: there are no stochastic changes in the state vector and we degenerate to a conventional standard multivariate Gaussian prior-to-posterior analysis.
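This specification leads to the standard conjugate DLM filtering recurrences with variance learning (West and Harrison, 1997, section 4.6). The Python sketch below is only illustrative: the function name lmdm_node_filter and its interface are our own choices, it assumes $G_t(r) = I_{p_r}$ and a constant discount factor, and it returns the node's log predictive likelihood used for scoring later in the paper.

```python
import numpy as np
from scipy.stats import t as student_t

def lmdm_node_filter(y, F, delta, m0, C0, n0, d0):
    """Forward-filter one LMDM node; return its log predictive likelihood (LPL).

    y     : (T,) observations for the node
    F     : (T, p) regressors (a leading column of ones plus the contemporaneous parents)
    delta : discount factor in (0, 1]
    m0, C0, n0, d0 : prior mean/covariance for theta_0 (C0 on the data scale) and
                     Gamma hyperparameters for the unknown precision phi = 1/V
    """
    m, C, n, d = np.asarray(m0, float), np.asarray(C0, float), float(n0), float(d0)
    lpl = 0.0
    for t in range(len(y)):
        S = d / n                            # current point estimate of V
        a, R = m, C / delta                  # G_t = I; discounting inflates C_{t-1}
        f = F[t] @ a                         # one-step forecast mean f_t
        Q = F[t] @ R @ F[t] + S              # one-step forecast scale Q_t
        e = y[t] - f
        # one-step forecast: Y_t | past ~ Student t with n d.o.f., location f, scale Q
        lpl += student_t.logpdf(y[t], df=n, loc=f, scale=np.sqrt(Q))
        # conjugate updating (West and Harrison, 1997, section 4.6)
        A = R @ F[t] / Q
        n, d = n + 1.0, d + S * e**2 / Q
        S_new = d / n
        m = a + A * e
        C = (S_new / S) * (R - np.outer(A, A) * Q)
    return lpl
```

For a founder node, F reduces to the column of ones (a steady model), and the LPL of a candidate DAG is the sum of the returned values over its nodes.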
For any choice of discount factor $\delta$, the recurrences given above give a closed form expression for the marginal likelihood of any MDM. This means that we can estimate $\delta$ simply by maximizing this marginal likelihood, performing a direct one-dimensional optimization over $\delta$, such as that used in Heard et al. (2006), to complete the search algorithm. The selected model is then the one with the discount factor giving the highest associated Bayes' factor score; see below.

The joint density of the observations associated with any MDM can be factorized as the product of the distribution of the first node and transition distributions between the subsequent nodes (Queen and Smith, 1993). Moreover, the conditional one-step forecast distribution can be written as
$$(Y_t(r) \mid y^{t-1}, x_t(r)) \sim T_{n_{t-1}(r)}(f_t(r), Q_t(r)),$$
where $T_{n_{t-1}(r)}(\cdot,\cdot)$ is a noncentral t distribution with $n_{t-1}(r)$ degrees of freedom, and the parameters are easily found through Kalman filter recurrences (see e.g. West and Harrison, 1997). The joint log predictive likelihood (LPL) is then calculated as
$$\mathrm{LPL}(m) = \log p_{y}(y \mid m) = \sum_{r=1}^{n} \log p_r(y(r) \mid x(r), m) = \sum_{r=1}^{n}\sum_{t=1}^{T} \log p_{tr}(y_t(r) \mid y^{t-1}, x_t(r), m), \qquad (1)$$
where $m$ reflects the current choice of model, which determines the relationship between the $n$ regions expressed graphically through the underlying graph. The most popular Bayesian scoring method, and the one we use here, is the Bayes' factor measure (Jeffreys, 1961; West and Harrison, 1997), which is defined as the ratio between the predictive likelihoods of two models.

4 An LMDM Analysis of a Synthetic fMRI Dataset

It is first interesting to investigate the potential of an MDM to discriminate between effective connectivities that change over time and those that remain the same, i.e. whether they originated from a dynamic or a static process. Static BNs with the same skeleton are often Markov equivalent (Lauritzen, 1996). This is not so for their dynamic MDM analogues. So it is possible for an MDM to detect the directions of relationships in DAGs which are Markov equivalent in a static analysis. We will explore these questions below and demonstrate how this is possible using a simulation experiment. This in turn allows us to search over hypotheses expressing potential causal directions.

Here, observations simulated from known MDMs were studied using sample sizes $T = 100$, 200 and 300, and different dynamic levels $W^*(r) = 0 I_{p_r}$ (static), $0.001 I_{p_r}$, $0.01 I_{p_r}$ and $0.1 I_{p_r}$. The impact of these different scenarios on the MDM's results was assessed for 2 and 3 regions, with 100 datasets for each $(T, W^*(r))$ pair. For two nodes, data were generated using the MDM with graph DAG1, given in Figure 3(a). The initial values of the regression parameters were 0.3 for the connection between $Y(1)$ and $Y(2)$, i.e. $\theta_0^{(2)}(2)$, and 0 for the other $\theta$'s (the intercept parameters). The observational variance was defined as 12.5 for $Y(1)$ and 6.3 for $Y(2)$, so that the marginal variances were almost the same for both regions. Thus we set
$$\theta_{ti}^{(k)}(r) = \theta_{t-1,i}^{(k)}(r) + w_{ti}^{(k)}(r), \qquad w_{ti}^{(k)}(r) \sim N(0, W^{(k)}(r)),$$
for $r = 1, 2$; $t = 1, \ldots, T$; $i = 1, \ldots, 100$ replications; and $k = 1, \ldots, p_r$, with $p_1 = 1$ and $p_2 = 2$; where $W^{(k)}(r) = W^{*(k)}(r) \times V(r)$ and $W^{*(k)}(r)$ is the $k$th diagonal element of the matrix $W^*(r)$ defined above.
Observed values were then simulated using the following equations:
$$Y_{ti}(1) = \theta_{ti}^{(1)}(1) + v_{ti}(1), \qquad v_{ti}(1) \sim N(0, V(1)),$$
$$Y_{ti}(2) = \theta_{ti}^{(1)}(2) + \theta_{ti}^{(2)}(2)\, Y_{ti}(1) + v_{ti}(2), \qquad v_{ti}(2) \sim N(0, V(2)).$$

Figure 3: Directed acyclic graphs used in the synthetic study: (a) DAG1, (b) DAG2, (c) DAG3, (d) DAG4 and (e) DAG5. With 2 nodes, (a) DAG1 and (b) DAG2 are Markov equivalent as static BNs. For 3 nodes, (c) DAG3 and (d) DAG4 are considered Markov equivalent whilst neither is equivalent to (e) DAG5.

For three nodes, the graphical structure used to obtain the synthetic data is DAG3, shown in Figure 3(c). The initial values of the regression parameters were 0.3 for the connectivity between $Y(1)$ and $Y(2)$, i.e. $\theta_0^{(2)}(2)$, 0.2 for the connectivity between $Y(2)$ and $Y(3)$, i.e. $\theta_0^{(2)}(3)$, and 0 for the other $\theta$'s (the intercept parameters). The observational variances were set to be the same as in the two-node case for the first and second variables, and to 5.0 for $Y(3)$. The observation and system equations were also the same as for node 2, except that $r = 1, \ldots, 3$, $p_3 = 2$, and
$$Y_{ti}(3) = \theta_{ti}^{(1)}(3) + \theta_{ti}^{(2)}(3)\, Y_{ti}(2) + v_{ti}(3), \qquad v_{ti}(3) \sim N(0, V(3)).$$

The log predictive likelihood (LPL) was first computed for different values of the discount factor $\delta$, using a weakly informative prior with $n_0(r) = d_0(r) = 0.001$ and $C_0^*(r) = 3 I_{p_r}$ for all $r$. The discount factor was chosen as the value that maximized the LPL.

In order to better understand the effect on the estimated connectivity of applying the wrong static/dynamic model, the largest static datasets ($W^*(r) = 0 I_{p_r}$ and $T = 300$) were fitted with a dynamic model, using graph DAG1 and $\delta = 0.93$, the average value of $\delta$ obtained when fitting models to the data generated with $W^*(r) = 0.001 I_{p_r}$. Then, dynamic datasets from the same scenario, i.e. DAG1, $T = 300$ and $W^*(r) = 0.001 I_{p_r}$, were fitted using the static model with $\delta = 1$. Figure 4 shows the true values (blue lines) and the smoothed estimates of the parameter $\theta_t^{(2)}(2)$, the connectivity $(Y(1), Y(2))$, versus time $t$ for the dynamic (violet lines) and static (green lines) models. When the data are generated from a static model (Figure 4(a)), the dynamic model usually estimates the true values quite well. Most of the time it simply alternates between under- and over-estimating the true parameter values, but nevertheless centres around the true value. In contrast, when the data are simulated from the dynamic model (Figure 4(b)), the static model fails to describe the series appropriately at each time point. This phenomenon is particularly pertinent to this application. Because we know connectivities change over time, by fitting static models (as we typically do when fitting BNs) we fit models which can score very poorly even when the topology of the connectivity is right. So in particular any preliminary model search using static BN models is likely to be unreliable and potentially misleading in this dynamically evolving environment.

Figure 4: The true value (blue lines) and the smoothed estimates of the parameter $\theta_t^{(2)}(2)$, the connectivity $(Y(1), Y(2))$, from DAG1, for the dynamic model (mean $\delta$ of 0.93, violet lines) and the static model (mean $\delta$ of 1, green lines), for a particular replication. The dashed lines represent the 95% HPD intervals.
Panel (a) shows results for data simulated from the static model ($W^*(r) = 0 I_{p_r}$), while panel (b) shows results for data from the dynamic model ($W^*(r) = 0.001 I_{p_r}$).

Figure 5 shows the log predictive likelihood versus different values of the discount factor, for DAG3 (solid lines), DAG4 (dashed lines) and DAG5 (dotted lines). The sample size increases from the first to the last row whilst the dynamic level (innovation variance) increases from the first to the last column. Although the ranges of the LPL differ across the panels, the size of the range is the same (500 units), so that they are easy to compare. We can see in this figure that the choice of $\delta$ is consistent with the innovation variance. We found that the average estimated $\delta$ is about 1 when the data come from a static system and less than 1 for dynamic synthetic data. Thus, at least in a simulation study, the MDM was able to identify the nature of the generating system clearly on the basis of series of the lengths we record in experiments like these.

Note that when the data are static ($W^*(r)$ is the zero matrix, in the first column) it is difficult to distinguish the DAGs. Using the model selection criterion proposed by West and Harrison (1997), whereby $-1 < \log\mathrm{BF} < 1$ suggests no significant difference, the three DAGs could not be separated for almost 70% and 55% of the replications for $T = 100$ and $T = 200$, respectively. This percentage decreases to 38% for the largest sample size ($T = 300$). However, the equivalent DAGs remain indistinguishable (DAG3 and DAG4 were both selected in 56% of replications).

Another interesting result is that, even when the data follow a dynamic system but are fitted by a static model, the non-Markov equivalent DAGs are distinguishable whilst the equivalent DAGs are not. For instance, when $W^*(r) = 0.01 I_{p_r}$ and $T = 100$ (first row and third column), the value of the LPL for DAG5 is smaller than the value for the other DAGs, but there is no significant difference between the values of the LPL for DAG3 and DAG4 when $\delta = 1$, which we could deduce anyway since these models are Markov equivalent (see e.g. Ali et al., 2009). In contrast, there are important differences between the LPLs of the DAGs when dynamic data are fitted with dynamic models, DAG3 having the largest value of the LPL. In particular, MDMs appear to select the appropriate direction of connectivity with a high success rate. Note however that their performance varies as a function of the innovation variance and the sample size (the distance between the lines for the different DAGs changes from one panel to another). In general we find that these directionalities are more difficult to detect than the existence of such a connection: a phenomenon noticed for the real series as well.

Figure 6 shows the percentage of replications in which the correct DAG was selected, DAG1 and DAG3, for 2 nodes (a) and 3 nodes (b) respectively. As might be expected, the higher the sample size, the higher the chance of identifying the true DAG correctly. But $T$ has the largest impact on the results when the dynamics of the data are very slowly changing ($W^*(r) = 0.001 I_{p_r}$). An interesting result concerns the different values of the innovation variance. When the connectivity does not change over time ($W^*(r) = 0 I_{p_r}$), all three DAGs or the two Markov equivalent DAGs were selected for the overwhelming majority of replications (around 95%). However, there is a sharp increase in performance from the static case ($W^*(r) = 0 I_{p_r}$) to $W^*(r) = 0.001 I_{p_r}$, and performance continues to improve as $W^*(r)$ increases to $0.01 I_{p_r}$.
This demonstrates that the additional dynamic structure allows causal interactions to be identified more clearly.

5 The Use of Diagnostics in an MDM using Real fMRI Data

In the last section we demonstrated that, if the data generating process is indeed an MDM, then we can expect standard BF model selection techniques to find the generating model. In this section we apply the MDM to a typical real data set. We demonstrate how the model selection still appears to work well, provided that the methodology is used in conjunction with diagnostic checks. In the past, Cowell et al. (1999) have argued convincingly that when fitting graphical models it is extremely important to customize diagnostic methods, not only to determine whether the model appears to be capturing the data generating mechanism well but also to suggest embellishments of the class that might fit better. Their preferred methods are based on one step ahead prediction. These can be simply modified to give analogous diagnostics for use in our dynamic context. Here we give three types of diagnostic monitor, based on analogues for probabilistic networks (Cowell et al., 1999). First, the global monitor is used to compare networks. After identifying the DAG providing the best explanation over the LMDM candidate models, the predicted relationship between a particular node and its parents can be explored through the parent-child monitor. Finally, the node monitor can indicate whether or not the model selected by the global monitor fits adequately. If it does not, a more complex model is substituted in a way illustrated below, and the search repeated.

I - Global Monitor

Figure 5: The log predictive likelihood versus different values of the discount factor (DF, or $\delta$), for DAG3 (solid lines), DAG4 (dashed lines) and DAG5 (dotted lines). The sample size increases from the first to the last row whilst the dynamic level (innovation variance) increases from the first to the last column. The range of the y-axis (LPL) has the same size of 500 for all panels.

Figure 6: The number out of 100 replications in which the correct DAG was selected, for 2 nodes (a, DAG1) and 3 nodes (b, DAG3), considering different values of the innovation variance ($W^*$) and sample size ($T$).
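Concretely, the global monitor score of a candidate DAG is just the sum of the per-node LPLs of equation (1), with the discount factor optimized node by node as illustrated in Figure 5. A minimal sketch of the resulting exhaustive search over parent sets is given below; it reuses the hypothetical lmdm_node_filter sketched in Section 3, and all other names are illustrative.

```python
import itertools
import numpy as np

def score_parent_sets(child, candidates, y, deltas=np.arange(0.80, 1.001, 0.01)):
    """LPL of one node under every candidate parent subset, maximized over the discount factor."""
    T = y.shape[0]
    scores = {}
    for k in range(len(candidates) + 1):
        for parents in itertools.combinations(candidates, k):
            F = np.column_stack([np.ones(T)] + [y[:, p] for p in parents])
            p = F.shape[1]
            # weakly informative prior: with n0 = d0 = 0.001, S0 = 1 so C0 = C0* = 3I
            scores[parents] = max(
                lmdm_node_filter(y[:, child], F, d,
                                 m0=np.zeros(p), C0=3.0 * np.eye(p),
                                 n0=0.001, d0=0.001)
                for d in deltas)
    return scores

# y: (T, 3) array of the three regional series (0-based indices 0, 1, 2).
# scores_region3 = score_parent_sets(2, candidates=[0, 1], y=y)
```

Summing the chosen per-node scores over an acyclic assignment of parent sets gives the model scores compared in Table 1 below.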
The first stage of our analysis is to select the best candidate DAG, using simple LMDMs as described in the last section and Bayes' factor scores. It is well known that for BF techniques to be successfully applied to the selection of non-stochastic graphs from real data, the prior distributions on the hyperparameters of candidate models sharing the same features must first be matched (Heckerman, 1999). If this is not done then one model can be preferred to another not for structural reasons but for quite spurious ones. This is also true for the dynamic class of models we fit here. Fortunately, however, the dynamic nature of the MDM class actually helps dilute the misleading effect of any such mismatch, since after a few time steps evidence about the conditional variances and the predictive means is discounted and the marginal likelihood of each model usually repositions itself. In particular, the different priors usually have only a small effect on the relative values of subsequent conditional marginal likelihoods. We describe below how we have nevertheless matched priors to minimize this small effect on the consequent Bayes' factor scores driving the model selection.

Figure 7: A graphical structure considering 4 nodes.

Just as for BNs, to match priors we can exploit a useful decomposition of the Bayes' factor score for MDMs. By equation (1), the joint log predictive likelihood can be written as the sum of the log predictive likelihoods for each observation series given its parents: a modularity property. Therefore when some features are incorporated within the model class, the relative scores of such models discriminate only through the components of the model where they differ. Thus consider again the graphical structure in Figure 7. The BF scores each of these four components separately and these are then summed to give the full model score. For instance, suppose the LMDM is updated because node 3 exhibits heteroscedasticity. On observing this violation, the conditional one-step forecast distribution for node 3 can be replaced by one relating to a more complex model. Thus, for example, a new one-step-ahead forecast density is $p^*_{t3}(y_t(3) \mid y^{t-1}, y_t(1), y_t(2))$, the density of
$$(Y_t(3) \mid y^{t-1}, y_t(1), y_t(2)) \sim T_{n_{t-1}(3)}(f_t(3), Q^h_t(3)),$$
where the parameters $f_t(3)$ and $n_{t-1}(3)$ are defined as before, but $Q^h_t(3)$ is now defined as a function of a random variance, say $k_{t3}(F_t(3)'\theta_t(3))$ (see West and Harrison, 1997, section 10.7, for details). The log Bayes' factor comparing the original model with the model modified to accommodate heteroscedasticity can then be calculated as
$$\log(\mathrm{BF}) = \sum_{t=1}^{T} \log p_{t3}(y_t(3) \mid y^{t-1}, y_t(1), y_t(2)) - \sum_{t=1}^{T} \log p^*_{t3}(y_t(3) \mid y^{t-1}, y_t(1), y_t(2)).$$
Thus, provided we set prior densities over the same component parameters in different models, because the model structure is common for all other nodes, the BF discriminates between two models by finding the one that best fits the data only from the component where they differ: in our example, the component associated with node 3.

Region                  Parents    Score
1 - Visual              None       -418.09
                        2          -443.66
                        3          -451.20
                        2 and 3    -455.52
2 - DMN                 None       -421.93
                        1          -444.68
                        3          -443.57
                        1 and 3    -444.39
3 - Executive Control   None       -414.11
                        1          -412.40
                        2          -407.16
                        1 and 2    -407.09

Table 1: Evidence for each region under all possible sets of parents. The score was calculated as LPL[Y(r) | Pa(r)], with LPL = LPL[Y(1) | Pa(1)] + LPL[Y(2) | Pa(2)] + LPL[Y(3) | Pa(3)]. The higher the score, the higher the evidence for that particular model.
This simplifies the search across our model space significantly and, in this small example, means that an exhaustive search is trivial. The setting of the distributions for the hyperparameters of different candidate parent sets is not critical to the BF model selection, provided that the early predictive densities are comparable. We have found that a very simple way of achieving this is to set the prior covariance matrices over the regression parameters of each model a priori to be independent with a shared variance. Note that in the example above the hyperparameters and the discount factor $\delta$ of nodes 1, 2 and 4 were the same for both models, the homoscedastic and the heteroscedastic version of node 3. Many numerical checks have convinced us that the results of the model selection we describe above are insensitive to these settings, provided that the high scoring models pass diagnostic tests, some of which we discuss below.

To illustrate the model selection process for the MDM, the real resting state data described in Section 2 were used. The MDMs were fitted using a weakly informative prior with $n_0(r) = d_0(r) = 0.001$ and $C_0^*(r) = 3 I_{p_r}$ for all $r$. The discount factor $\delta$ was estimated for each node and each graphical structure. Table 1 shows the score, calculated as the LPL, for each region under all possible sets of parents. The best scoring model has the variables associated with regions 1 and 2 with no parents, and region 3 with the other two regions as its parents; see Figure 8. We found small values of the discount factor for regions 1 and 2. This suggests that these processes might be poorly described by an LMDM. However, because regions 1 and 2 are founders, i.e. the nodes with no parents, under the fitted model their dynamics are likely to be driven exogenously. Such an exogenous process is unlikely to be well described by a steady model (West and Harrison, 1997), as assumed by default in the LMDM. But note that the effect on the comparative model score of fitting a more elaborate DLM (or other model) to these nodes cancels across all graphs having these variables as founders. It follows that we can postpone the selection of a better model for those univariate processes to a subsequent stage in the modeling process; see below.

Figure 8: The best DAG selected by the global monitor: the visual region and the DMN are the parents of executive control (DAG6).

II - Parent-child Monitor

The modularity of the MDM ensures that, in examining the relationship between a particular node and its parents, we need only consider diagnostics associated with the performance of that component. Let $Pa(Y(r)) = \{Y_{pa(r)}(1), \ldots, Y_{pa(r)}(p_r - 1)\}$; then
$$\log(\mathrm{BF})_{ri} = \log p_r\{y(r) \mid Pa(y(r))\} - \log p_{ri}\{y(r) \mid Pa(y(r)) \setminus y_{pa(r)}(i)\},$$
for $r = 1, \ldots, n$ and $i = 1, \ldots, p_r - 1$, where $Pa(Y(r)) \setminus Y_{pa(r)}(i)$ denotes the set of all parents of $Y(r)$ excluding the parent $Y_{pa(r)}(i)$. To illustrate the use of this monitor, suppose we want to confirm the parent-child relations for region 3. Figure 9 provides estimates of the relevant connectivities over time. The connectivity from the visual region to executive control (Figure 9(a)) appears not to be significant in the second half of the series. The significance of this connectivity is reflected in
$$\log(\mathrm{BF})_{31} = \log p_3\{y(3) \mid y(1), y(2)\} - \log p_{31}\{y(3) \mid y(2)\}.$$
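A minimal sketch of this parent-child monitor, again reusing the hypothetical lmdm_node_filter of Section 3 (y denotes the (176, 3) array of the three regional series; the discount factor shown is illustrative), is:

```python
import numpy as np

# Parent-child monitor for region 3 (index 2): compare its LPL with both parents
# against its LPL with the visual parent removed.
T = y.shape[0]
F_full = np.column_stack([np.ones(T), y[:, 0], y[:, 1]])   # intercept, visual, DMN
F_drop = np.column_stack([np.ones(T), y[:, 1]])            # intercept, DMN only

def node_lpl(F, delta):
    p = F.shape[1]
    return lmdm_node_filter(y[:, 2], F, delta,
                            m0=np.zeros(p), C0=3.0 * np.eye(p),
                            n0=0.001, d0=0.001)

# In practice delta would be chosen for each model by maximizing its LPL;
# 0.95 is a placeholder value for illustration only.
log_bf_31 = node_lpl(F_full, 0.95) - node_lpl(F_drop, 0.95)
```

Returning the per-time log predictive contributions from the filter, rather than their sum, gives the cumulative version of $\log(\mathrm{BF})_{31}$ referred to below.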
The cumulative $\log(\mathrm{BF})$ shows strong evidence for DAG6, which has both the visual region and the DMN as parents of the executive control region, at the beginning of the series.

Figure 9: The filtering (blue) and smoothing (green) posterior means for the regression parameters (a) $\theta^{(2)}(3)$, the connectivity $(Y(1), Y(3))$ (the visual region influences the executive control region), and (b) $\theta^{(3)}(3)$, the connectivity $(Y(2), Y(3))$ (the DMN influences the executive control region), with change points (dashed lines).

Observe now the connectivity from the DMN to executive control in Figure 9(b). There is a sharp increase in the strength of this connectivity in the first quarter of the series, and the connectivity remains significant subsequently. This apparent dynamic change in the strength of connectivity between different regions of the brain has been conjectured elsewhere (Ge et al., 2009; Chang and Glover, 2010) and is fully consistent with the analyzed dataset. Note that this dynamic class of models is sufficient to explicitly accommodate these changes in strength through the dynamic movement of the regression coefficients towards and away from zero.

III - Node Monitor

Again the modularity ensures that the model for any given node can be embellished on the basis of a residual analysis. For instance, consider a non-linear structure for a founder node $r$. On the basis of the partial autocorrelation of the residuals of the logarithm of the series, a more sophisticated model of the form
$$\log Y_t(r) = \theta_t^{(1)}(r) + \theta_t^{(2)}(r) \log Y_{t-1}(r) + v_t(r)$$
suggests itself. Note that this model still provides a closed form score for these components. The better scoring model and its corresponding estimates can then be substituted for the original steady model to provide a much better scoring dynamic model, while still respecting the same causal structure as in the original analysis. Denoting $\log Y_t(r)$ by $Z_t(r)$, the conditional one-step forecast distribution for $Z_t(r)$ can be calculated using a DLM on the transformed series $\{Z_t\}$. More generally, if we hypothesize that $Y_t(r)$ can be written as a continuous, monotonically increasing function of $Z_t(r)$, say $g(\cdot)$, then the conditional one-step forecast cumulative distribution function of $Y_t(r)$ can be found through
$$F_{Y_t(r)}(y) = P_{tr}(Y_t(r) \le y \mid y^{t-1}, x_t(r)) = P^*_{tr}(Z_t(r) \le g^{-1}(y) \mid y^{t-1}, x_t(r)) = F_{Z_t(r)}(g^{-1}(y)).$$
So $p^*_{tr}(y_t(r) \mid y^{t-1}, x_t(r))$, the conditional one-step forecast density of $Y_t(r)$ under this new model, can be calculated explicitly (see West and Harrison, 1997, section 10.6, for details). It is also possible to keep the series $Y_t(r)$ Gaussian and simply regress on terms like $y^2\cos y$, $y\sin y$ and so on in previous values $y$. These non-linear functional relationships still give rise to closed form predictives in the Bayesian DLM, and so in the MDM. This gives an extremely flexible class of MDMs with closed form BFs that can be used to embellish the processes scored by the Global Monitor above. When the variance is unknown, the conditional forecast distribution is a noncentral t distribution with a location parameter, say $f_t(r)$, a scale parameter, $Q_t(r)$, and degrees of freedom $n_{t-1}(r)$.
The one-step forecast errors are defined as $e_t(r) = Y_t(r) - f_t(r)$, and the standardized conditional one-step forecast errors as $e_t(r)/Q_t(r)^{1/2}$. The assumption underlying the DLM is that the standardized conditional one-step forecast errors have approximately a Gaussian distribution when $n_{t-1}(r)$ is large, and are serially independent with constant variance (West and Harrison, 1997; Durbin and Koopman, 2001). We noted from the QQ plots that there is no strong evidence that the normality assumptions made in this analysis are violated. Perhaps of most scientific interest is that the plots of the cumulative sums of the standardized errors suggest the existence of change points; see Figure 10. This is most strongly evidenced near $t = 15$ for executive control, region 3. West and Harrison (1997) suggest a simple method based on the BF: when the BF or the cumulative BF falls below a particular threshold, a change point can be deemed to have occurred. Adopting this method, comparing the graph in which the visual region and the DMN are the parents of executive control (DAG6) with the graph in which the executive control region has no parents, and using a threshold of 0.2, four time points were suggested as change points. We show the two most interesting points in Figure 9. It is straightforward to run a new MDM with a change point at each identified point, simply by increasing the variance of the corresponding system error at that point (West and Harrison, 1997, chapter 11). Note that although this naive approach seems to deal adequately with the identified change points, it can obviously be improved, for example by using the full power of switching state space models (see e.g. Fruhwirth-Schnatter, 2006, chapter 13) to model this apparent phenomenon more formally.

Typically, of course, the model space will be much larger and the underlying DAG will involve more vertices. However, then we simply replicate the methodology described above, iteratively performing the same diagnostic checks on the more numerous components of the model in exactly the same way as we have described above.

Figure 10: Time series plots, ACF plots and cumulative sums of the one-step-ahead conditional forecast errors, one region per column (Visual, DMN, Executive Control).

6 Conclusions

In this paper we have demonstrated how the MDM can provide a flexible and relatively realistic class of models for selecting between different potential network connections in the brain, known to vary with time, on the basis of fMRI data. The conditional closure of the associated score functions makes model selection relatively fast. Diagnostic statistics for checking, and where necessary adapting, the whole class are also straightforward, as demonstrated above. The application of these dynamic models to large scale datasets is still in its infancy. In this paper we chose a simple example where an exhaustive search was possible. However, for problems with up to 20 variables there are some challenges still remaining if we continue to use naive search methods.
First, for even moderately sized problems the number of possible causal models becomes very large as the number of brain regions increases. For instance, Ramsey et al. (2010) counted 2,609,107,200 alternative causal DAG structures for 10 nodes, and for MDMs the number of distinguishable models is even larger. Even after using prior information to limit this number to scientifically plausible ones, it will usually be necessary to develop greedy search algorithms to guide the selection. Currently for these problems we use a greedy search algorithm called AHC (Heard et al., 2006), which is very quick in this context. However, the authors are now exploring a number of customized methods for performing a full search of the LMDM space. We note that, because of the close association of the MDM with the BN, the modularity of their associated Bayes' factor scores can be exploited using integer programming and dynamic programming algorithms respectively (see Cussens, 2011, and Cowell, 2013); we will report on these methods at a later date.

Second, there are exciting possibilities for using the model class to perform even more refined selection. For example, the main focus of interest in these experiments often includes not only a search for the likely model of a specific individual but also an analysis of shared between-subject effects. Currently such features are analyzed using rather coarse aggregation methods over shared time series. However, using multivariate hierarchical models and Bayesian hyperclustering techniques, it is possible to use the full machinery of Bayesian methods to make formal inferences in a coherent way which acknowledges hypotheses about shared dependences between such populations of subjects. Early results we have obtained building such hierarchical models are promising and will also be reported later.

References

Ali, R. A., Richardson, T. S., Spirtes, P., 2009. Markov equivalence for ancestral graphs. The Annals of Statistics, 37, 2808-2837.

Chang, C., Glover, G. H., 2010. Time-frequency dynamics of resting-state brain connectivity measured with fMRI. NeuroImage, 50, 81-98.

Cole, D. M., Smith, S. M., Beckmann, C. F., 2010. Advances and pitfalls in the analysis and interpretation of resting-state FMRI data. Frontiers in Systems Neuroscience, 4, 8. doi:10.3389/fnsys.2010.00008

Cowell, R. G., 2013. A simple greedy algorithm for reconstructing pedigrees. Theoretical Population Biology, 83, 55-63.

Cowell, R. G., Dawid, A. P., Lauritzen, S. L., Spiegelhalter, D. J., 1999. Probabilistic Networks and Expert Systems. Springer-Verlag, New York.

Cussens, J., 2011. Bayesian network learning with cutting planes. In: Cozman, F. G., Pfeffer, A. (eds), Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pages 153-160, Barcelona. AUAI Press.

Durbin, J., Koopman, S. J., 2001. Time Series Analysis by State Space Methods. Oxford University Press, Oxford.

Friston, K. J., 2011. Functional and effective connectivity: a review. Brain Connectivity, 1(1), 13-36. doi:10.1089/brain.2011.0008

Fruhwirth-Schnatter, S., 2006. Finite Mixture and Markov Switching Models. Springer.

Ge, T., Kendrick, K. M., Feng, J., 2009. A novel extended Granger causal model approach demonstrates brain hemispheric differences during face recognition learning. PLoS Computational Biology, 5, e1000570. doi:10.1371/journal.pcbi.1000570

Heard, N. A., Holmes, C. C., Stephens, D. A., 2006. A quantitative study of gene regulation involved in the immune response of Anopheline mosquitoes: An application of Bayesian hierarchical clustering of curves. Journal of the American Statistical Association, 101(473), 18-29.
Heckerman, D., 1999. A Tutorial on Learning with Bayesian Networks. In: Jordan, M. (ed), Learning in Graphical Models. MIT Press, Cambridge, MA.

Jeffreys, H., 1961. Theory of Probability (3rd ed.). Oxford University Press, London.

Lauritzen, S. L., 1996. Graphical Models. Clarendon Press, Oxford, United Kingdom.

Petris, G., Petrone, S., Campagnoli, P., 2009. Dynamic Linear Models with R. Springer, New York.

Poldrack, R. A., Mumford, J. A., Nichols, T. E., 2011. Handbook of fMRI Data Analysis. Cambridge University Press.

Queen, C. M., Albers, C. J., 2008. Forecast covariances in the linear multiregression dynamic model. Journal of Forecasting, 27, 175-191. doi:10.1002/for.1050

Queen, C. M., Albers, C. J., 2009. Intervention and causality: Forecasting traffic flows using a dynamic Bayesian network. Journal of the American Statistical Association, 104(486), 669-681.

Queen, C. M., Smith, J. Q., 1993. Multiregression dynamic models. Journal of the Royal Statistical Society, Series B, 55, 849-870.

Ramsey, J. D., Hanson, S. J., Hanson, C., Halchenko, Y. O., Poldrack, R. A., Glymour, C., 2010. Six problems for causal inference from fMRI. NeuroImage, 49, 1545-1558.

Smith, S. M., Fox, P. T., Miller, K. L., Glahn, D. C., Fox, P. M., Mackay, C. E., Filippini, N., Watkins, K. E., Toro, R., Laird, A. R., Beckmann, C. F., 2009. Correspondence of the brain's functional architecture during activation and rest. Proc Natl Acad Sci U S A, 106(31), 13040-5.

Smith, S. M., Miller, K. L., Salimi-Khorshidi, G., Webster, M., Beckmann, C. F., Nichols, T., Ramsey, J., Woolrich, M., 2011. Network modeling methods for FMRI. NeuroImage, 54(2), 875-891.

West, M., Harrison, P. J., 1997. Bayesian Forecasting and Dynamic Models (2nd ed.). Springer-Verlag, New York.