Tom Nichols
ST416 - Advanced Topics in Biostatistics - Part 1: fMRI

Lecture 7: Multi-Subject Modelling

1 Modelling Group fMRI Data – The fully general, ugly case

Most of the lectures (until the end of last week, when I talked about permutation) have focused on modelling a single subject's data: specifically, fitting time series regression models that account for the experimental variation in the data Y while accounting for temporal autocorrelation in the errors ε. Now we move on to modelling fMRI data from a group of subjects, with the goal of making inference on the population from which they were sampled.

For N subjects' data, let the time series model for the kth subject be

    Y_k = X_k β_k + ε_k,                                                (1)

where Y_k is a T-vector of the observed data, X_k is the T × p predictor matrix, and ε_k is the T-vector of random errors. X_k contains the BOLD HRF-convolved blocks or events that define the fMRI experiment (see "Experimental Predictors" in Fig. 1), as well as possibly drift regressors. In practice T may actually differ between subjects, but we take it to be the same for all subjects for simplicity. The only real requirement is that each column of the design matrices X_k means the same thing in every subject.

Let's assemble all subjects' data into a grand multi-subject model:

    Y = [Y_1', ..., Y_N']',   X = blockdiag(X_1, ..., X_N),   ε = [ε_1', ..., ε_N']',    (2)

where Y and ε are length-NT column vectors and X is an NT × Np block-diagonal matrix. Then we can write all N "first level" models at once as

    Y = Xβ + ε,   ε ~ N(0, V),   V = blockdiag(V_1 σ_1², ..., V_N σ_N²),    (3)

where β is a length-Np vector. Recall that the V_k model temporal autocorrelation; e.g., if an AR(1) model with parameter ρ_k is used, ((V_k))_ij = ρ_k^|i−j|.
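As a toy sketch of the single-subject model (1) with AR(1) errors, the following simulates one subject's data and compares OLS with a GLS fit that whitens by the inverse Cholesky factor of V. All sizes, the box-car predictor, ρ, and β are made-up illustrations, not values from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 200, 2  # illustrative sizes; real fMRI runs have T in the hundreds

# Hypothetical first-level design: an intercept plus one box-car predictor
boxcar = np.tile(np.r_[np.ones(10), np.zeros(10)], T // 20)
X = np.column_stack([np.ones(T), boxcar])

# AR(1) error covariance, ((V))_ij = rho**|i-j|, as defined after (3)
rho = 0.3
V = rho ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))

beta_true = np.array([10.0, 2.0])
eps = np.linalg.cholesky(V) @ rng.standard_normal(T)  # correlated errors
Y = X @ beta_true + eps

# OLS estimate (unbiased even under autocorrelation, but not efficient)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# GLS: whiten data and design with V^{-1/2}, then fit by OLS
W = np.linalg.inv(np.linalg.cholesky(V))
Xw, Yw = W @ X, W @ Y
beta_gls = np.linalg.solve(Xw.T @ Xw, Xw.T @ Yw)
```

Both estimates should land near the true β = (10, 2); GLS is the efficient one when the AR(1) structure is correct.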
[Figure 1 panels: "Experimental Stimuli" and "Experimental Predictors", plotted over 0–120 s.]

Figure 1: The experimental stimuli and predictors associated with the BOLD response from a single voxel over time. The top color bar indicates when the subject was cued to tap their fingers randomly (red), sequentially (green), only the index finger (yellow), or not at all (cyan). The associated experimental stimuli are shown, as well as the experimental predictors that are created by convolving the stimuli with an HRF. Finally, the original BOLD response (black) is shown with the predicted model fit (blue) based on the model formed with the experimental predictors.

So far we have regarded the β as fixed, representing the properties of each individual subject. If we instead consider that we have sampled our N subjects at random from a population, then we can pose a "second level" or group model and consider the β's as random. Moreover, we can try to model the β's and make inferences on properties of the group as a whole. Specifically,

    β = X_G β_G + ε_G,                                                  (4)

where X_G is the Np × p_G group-level design matrix and ε_G is the Np-vector of group-level errors. For the experimental design illustrated in Figure 1, with 4 predictors, one possible X_G simply averages each of the predictors,

    X_G = [I_4', I_4', ..., I_4']',                                     (5)

where I_4 is the 4 × 4 identity matrix. Each element of β_G would then be interpreted as the population-average effect of the corresponding BOLD predictor. Or we could model the men and women separately; e.g., if all the women are listed first and then the men,

    X_G = blockdiag( [I_4', ..., I_4']', [I_4', ..., I_4']' ),          (6)

where the first stack of identities spans the women's rows and the second the men's: the first 4 elements of β_G represent the women's population responses, and the second 4 elements the men's.

Subject responses are not independent.
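The two second-level designs (5) and (6) are easy to build explicitly. Here is a small sketch with hypothetical sizes (N = 6 subjects, p = 4 predictors, the first 3 subjects women), chosen purely for illustration:

```python
import numpy as np

N, p, Nw = 6, 4, 3  # hypothetical: 6 subjects, 4 predictors, 3 women listed first

# One-group design (5): one stacked identity per subject, so each element
# of beta_G is the population-average effect of one predictor
XG_avg = np.vstack([np.eye(p)] * N)                      # shape (N*p, p)

# Two-group design (6): separate population parameters for women and men
XG_sex = np.zeros((N * p, 2 * p))
XG_sex[:Nw * p, :p] = np.vstack([np.eye(p)] * Nw)        # women's rows
XG_sex[Nw * p:, p:] = np.vstack([np.eye(p)] * (N - Nw))  # men's rows
```

Each row of either design has exactly one nonzero entry, mapping one subject-level parameter onto one group-level parameter.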
In particular, we expect that a subject with the largest BOLD response for one stimulus type is likely to have the largest BOLD response for another stimulus type. It can also happen that subjects with the largest response on one type have the smallest on another, exhibiting a negative correlation. Thus Var(ε_G) is not a multiple of the identity but rather

    Var(ε_G) = blockdiag(V_G, ..., V_G) σ_G²,                           (7)

where the p × p matrix V_G is the correlation in group effects and σ_G² is the group-level variance.

Combining the first level (3) and second level (4) models into one, we get

    Y = X X_G β_G + X ε_G + ε.                                          (8)

This is sometimes referred to as a variance components model, because we are modelling multiple sources of variation: intrasubject measurement-error variation in ε and between-subject variation in ε_G. If you estimate these variance components, you can use Generalized Least Squares (GLS) to "whiten" the data and model with the inverse matrix square root (e.g. via the Cholesky decomposition) of

    Var(Y) = X Var(ε_G) X' + Var(ε),                                    (9)

and then find estimates of β_G using OLS.

This model, however, is a disaster for fMRI. It says that if you want to make inference on the population parameters β_G you need to keep all N subjects' length-T time series around. But each subject has tens of thousands of voxels, and each voxel's time series has length T = 100 to T = 1,000. This is feasible on modern computer hardware, but still much slower and more cumbersome than it needs to be.

2 Modelling Group fMRI Data – The summary statistics approach

While the models fit to each subject are involved (Fig. 1 is quite a simple design), in practice all investigators want to make inference on is one contrast at a time. I will show how, if there is only 1 measure per subject taken to the second-level model, the maths simplify tremendously, we can see better what is actually happening, and, crucially, the computations are very fast.
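To make the "ugly case" concrete, here is a deliberately tiny simulation of the combined model (8)–(9). All sizes, variance components, and the random designs are invented for illustration, and V_k and V_G are taken as identities to keep the sketch short; the point is simply that the GLS estimate of β_G involves the full NT × NT covariance of Y:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, p = 3, 30, 2  # deliberately tiny; real data would be far larger

# Per-subject first-level designs, stacked block-diagonally as in (2)
Xs = [np.column_stack([np.ones(T), rng.standard_normal(T)]) for _ in range(N)]
X = np.zeros((N * T, N * p))
for k, Xk in enumerate(Xs):
    X[k * T:(k + 1) * T, k * p:(k + 1) * p] = Xk

XG = np.vstack([np.eye(p)] * N)     # one-group second-level design, as in (5)
sigma2_G, sigma2_k = 0.5, 1.0       # assumed (made-up) variance components

# Generate data from the two-level model: beta random, then Y given beta
betaG_true = np.array([4.0, 1.5])
beta = XG @ betaG_true + np.sqrt(sigma2_G) * rng.standard_normal(N * p)
Y = X @ beta + np.sqrt(sigma2_k) * rng.standard_normal(N * T)

# Var(Y) = X Var(eps_G) X' + Var(eps), as in (9), with identity V_k, V_G
VarY = sigma2_G * (X @ X.T) + sigma2_k * np.eye(N * T)

# GLS on the combined design X X_G from (8)
A = X @ XG
Vi = np.linalg.inv(VarY)
betaG_hat = np.linalg.solve(A.T @ Vi @ A, A.T @ Vi @ Y)
```

Even in this toy setting the covariance is a 90 × 90 matrix; at fMRI scale the same computation would involve every voxel's full multi-subject time series.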
To restate notation once more, for data on N subjects, the kth subject is modelled as Y_k = X_k β_k + ε_k, where Y_k is a length-T vector, X_k is the T × p matrix of BOLD predictors, and ε_k is the length-T vector of random errors. To keep things simple, I will forget about autocorrelation and let ε_k ~ N(0, I σ_k²); this corresponds to the setting where there is negligible autocorrelation (rare) or where we have already decorrelated the data and model by some means.

We are interested in just one particular contrast c of the parameters, cβ_k, estimated with

    cβ̂_k = c(X_k'X_k)^{-1} X_k' Y_k,
    E(cβ̂_k) = cβ_k,
    Var(cβ̂_k) = c(X_k'X_k)^{-1} c' σ_k².

It will be helpful to have this result in hand:

    cβ̂_k = cβ_k + c(X_k'X_k)^{-1} X_k' ε_k,

clearly showing how cβ̂_k is just a perturbed version of cβ_k (can you show this!?).

Again, we are not just interested in subject k, but want to make inference on the population from which these N subjects were drawn. Thus the cβ_k are random, as they will change with each new draw of N random subjects. Assemble these random variables of interest into γ = [cβ_1, ..., cβ_N]', and then our model for these unobservable quantities is

    γ = X_G β_G + ε_G,   ε_G ~ N(0, I σ_G²),

where X_G is an N × p_G "second level" design matrix, and ε_G is the N-vector of second-level errors, with "pure between subject" variance σ_G².

Of course we don't observe γ, but only γ̂ = [cβ̂_1, ..., cβ̂_N]'. A little manipulation shows the model for γ̂ is

    γ̂ = γ + γ̂ − γ
       = X_G β_G + ε_G + (γ̂ − γ)
       = X_G β_G + ε̃_G,

where ε̃_G is the "mixed effects" error. Var(ε̃_G) is the sum of I σ_G² and Var(γ̂ − γ), but what is this second term? Of course Var(γ̂ + a) = Var(γ̂) for any constant a, but γ is not a constant, it is a random variable.
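The "can you show this!?" identity is pure algebra: substitute Y_k = X_k β_k + ε_k into cβ̂_k = c(X_k'X_k)^{-1}X_k'Y_k and the X_k β_k term reduces to cβ_k. It can also be checked numerically; every number below is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
T, p = 50, 3
Xk = rng.standard_normal((T, p))      # hypothetical first-level design
beta_k = np.array([1.0, -2.0, 0.5])   # hypothetical subject parameters
eps_k = rng.standard_normal(T)
Yk = Xk @ beta_k + eps_k
c = np.array([0.0, 1.0, -1.0])        # hypothetical contrast vector

XtXi = np.linalg.inv(Xk.T @ Xk)
cb_hat = c @ XtXi @ Xk.T @ Yk                      # contrast of the OLS estimate
perturbed = c @ beta_k + c @ XtXi @ Xk.T @ eps_k   # the claimed identity
# cb_hat and perturbed agree to floating-point precision
```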
Due to the helpful result above, though, it works out as if γ were a constant. For the kth element of (γ̂ − γ),

    Var((γ̂ − γ)_k) = Var(cβ̂_k − cβ_k)
                    = Var(cβ_k + c(X_k'X_k)^{-1} X_k' ε_k − cβ_k)
                    = Var(c(X_k'X_k)^{-1} X_k' ε_k)
                    = c(X_k'X_k)^{-1} X_k' Var(ε_k) X_k (X_k'X_k)^{-1} c'
                    = c(X_k'X_k)^{-1} c' σ_k²
                    = Var(cβ̂_k),

where I have used the independence of the elements of ε_k; in fact you get the equivalent result when Var(ε_k) = V_k σ_k² and whitening has been used.

2.1 Estimating the Group fMRI Model – Summary Statistics Approach

So, to review, for group fMRI data, we fit the estimated contrast data γ̂ with the model

    γ̂ = X_G β_G + ε̃_G,                                                 (10)
    Var(γ̂) = Var(ε̃_G) = I σ_G² + diag({Var(cβ̂_k)}).

This shows that this is a "components of variance" model: the variance in the group contrast estimates γ̂ (when considered as samples from a population and not just as individual subjects) is the sum of a pure between-subject variance (the same for all subjects) and a within-subject term (possibly different for each subject k).

Once again we can use Generalised Least Squares (GLS) to estimate this. As a reminder, for a vanilla General Linear Model Y = Xβ + ε, the GLS estimate is obtained after "prewhitening" data and model,

    Var(ε)^{-1/2} Y = Var(ε)^{-1/2} X β + Var(ε)^{-1/2} ε,

and finding estimates with

    β̂_GLS = ( (Var(ε)^{-1/2} X)' (Var(ε)^{-1/2} X) )^{-1} (Var(ε)^{-1/2} X)' Var(ε)^{-1/2} Y
           = ( X' Var(ε)^{-1} X )^{-1} X' Var(ε)^{-1} Y.                (11)

Note that, while the whitening matrix is an inverse square root of the variance, it appears in the final expression for the estimate (11) as an inverse variance.

For our model (10) on γ̂, the prewhitening matrix Var(ε̃_G)^{-1/2} is diagonal, with elements

    ((Var(ε̃_G)^{-1/2}))_kk = 1 / sqrt( σ_G² + Var(cβ̂_k) ).

Thus each of the N contrast estimates γ̂_k is weighted according to the size of σ_G² + Var(cβ̂_k): "bad" subjects, those with large Var(cβ̂_k), are down-weighted relative to "good" subjects with smaller Var(cβ̂_k). But of course the exact weight depends on the whole sum σ_G² + Var(cβ̂_k).
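As a quick check that prewhitening-then-OLS and the closed form (11) agree, here is a sketch on the diagonal-variance model (10); the design, variances, and data are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
N, pG = 8, 2
XG = np.column_stack([np.ones(N), rng.standard_normal(N)])  # hypothetical X_G
gamma_hat = rng.standard_normal(N) + 2.0                    # stand-in contrasts

# Mixed-effects variance from (10): between-subject plus per-subject within terms
sigma2_G = 0.5
var_within = rng.uniform(0.1, 1.0, size=N)  # stand-ins for Var(c beta_hat_k)
V = np.diag(sigma2_G + var_within)

# Route 1: prewhiten with V^{-1/2}, then ordinary least squares
W = np.diag(1.0 / np.sqrt(sigma2_G + var_within))
Xw, yw = W @ XG, W @ gamma_hat
beta_white = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)

# Route 2: closed-form GLS (11), (X' V^{-1} X)^{-1} X' V^{-1} y
Vi = np.linalg.inv(V)
beta_gls = np.linalg.solve(XG.T @ Vi @ XG, XG.T @ Vi @ gamma_hat)
```

The two routes are algebraically identical, which is exactly the point made after (11): the square-root whitening matrix shows up in the estimator as an inverse variance.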
In particular, if σ_G² ≫ Var(cβ̂_k), it won't matter what the relative values of Var(cβ̂_k) are for different k, as σ_G² + Var(cβ̂_k) ≈ σ_G² for all k.

2.2 Summary Statistics Approach – Special case of one-sample group model

To see more exactly how this works, consider the special case of a one-sample group model, when p_G = 1 and X_G = [1, ..., 1]'. This is the case when we're just trying to infer on the population mean of the BOLD response described by cβ_k. Using the expression for the whitened estimate (11), note that the term corresponding to (X' Var(ε)^{-1} X)^{-1} reduces to ( Σ_k (σ_G² + Var(cβ̂_k))^{-1} )^{-1}, and so we get

    β̂_G = ( Σ_{k=1}^N cβ̂_k / (σ_G² + Var(cβ̂_k)) ) / ( Σ_{k=1}^N 1 / (σ_G² + Var(cβ̂_k)) ).

2.3 Summary Statistics Approach – Special case of homogeneous subjects

Consider the case when all subjects have the same intrasubject variance, i.e. Var(cβ̂_k) = Var(cβ̂_1) for all k = 2, ..., N. Then the weighting term is a constant and can be brought out:

    β̂_G = ( Σ_{k=1}^N cβ̂_k / (σ_G² + Var(cβ̂_1)) ) × ( (σ_G² + Var(cβ̂_1)) / N )
         = ( Σ_{k=1}^N cβ̂_k ) / N.

This shows that when there is no heterogeneity over subjects, the weighting vanishes and the calculation reduces to the usual one-sample estimate, i.e. the average. This result will hold approximately when the variation in Var(cβ̂_k) over subjects is small relative to σ_G². In general, for the GLS estimate (11), you can show that if the whitening matrix is a multiple of the identity matrix, the GLS estimates are identical to the OLS estimates.

2.4 Estimating the Variance Components

So far I've blithely said that GLS is easy: just prewhiten by the inverse square root of the error variance. In practice, this can be difficult. For example, how do we obtain estimates of σ_G² and Var(cβ̂_k) to perform the whitening of (10)? The standard answer is Restricted Maximum Likelihood (REML).
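The two special cases above can be seen side by side in a few lines; the contrast values and variances here are invented for illustration:

```python
# Hypothetical contrast estimates c beta_hat_k for N = 4 subjects
cb_hat = [2.1, 1.7, 3.0, 2.4]
sigma2_G = 0.5  # assumed between-subject variance


def gls_mean(values, var_within, s2g):
    # One-sample GLS estimate: inverse-variance weighted average
    w = [1.0 / (s2g + v) for v in var_within]
    return sum(wk * g for wk, g in zip(w, values)) / sum(w)


# Heterogeneous subjects: subject 3 is noisy and gets down-weighted
het = gls_mean(cb_hat, [0.2, 0.2, 2.0, 0.2], sigma2_G)

# Homogeneous subjects: the weights cancel and we recover the plain average
hom = gls_mean(cb_hat, [0.2, 0.2, 0.2, 0.2], sigma2_G)
plain = sum(cb_hat) / len(cb_hat)
```

With equal within-subject variances `hom` equals `plain` (2.3 here); with the noisy third subject, whose estimate 3.0 lies above the mean, the GLS estimate is pulled below the plain average.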
REML consists of writing down minus twice the restricted log likelihood based on the residuals e_G = γ̂ − X_G β̂_G,

    e_G' (Var(ε̃_G))^{-1} e_G + log |Var(ε̃_G)| + log |X_G' (Var(ε̃_G))^{-1} X_G|,    (12)

and minimizing it with respect to the variance parameters in Var(ε̃_G) = I σ_G² + diag({Var(cβ̂_k)}). Specifically, the variance parameters include σ_G² and all parameters inside Var(cβ̂_k) (which includes σ_k² and, if we were modelling serial autocorrelation, any parameters in V_k).

In fMRI, however, we can take a wonderful shortcut: because we have hundreds of observations for each subject's first-level model, Var(cβ̂_k) = c(X_k'X_k)^{-1} c' σ_k² is very well estimated by

    Var̂(cβ̂_k) = c(X_k'X_k)^{-1} c' σ̂_k².

Hence we can take Var(cβ̂_k) as fixed and known, and then there is but one variance parameter, σ_G². This simplifies the iterative optimization needed to minimize (12) and makes it quite fast, so much so that it can be done voxel-wise no problem. I won't go into any more details of REML, but for intuition it is handy to see how a method of moments estimator of σ_G² works.

2.5 Estimating σ_G² with the Method of Moments

Consider the usual sample variance of the N group-level observations,

    S²(γ̂) = (1/(N−1)) Σ_k (γ̂_k − γ̂·)²,

where γ̂· is the sample mean of the N observations. For the one-sample group model, when X_G = [1, ..., 1]', you can show that

    E( S²(γ̂) ) = σ_G² + V̄,

where V̄ = (1/N) Σ_k Var(cβ̂_k) is the sample mean of the N intrasubject variance estimates. The method of moments estimator of σ_G² is obtained by setting S²(γ̂) equal to E(S²(γ̂)) and solving for σ_G²:

    σ̃_G² = max{0, S²(γ̂) − V̄},

where I have used the maximum operator to prevent negative variance estimates from occurring.

Acknowledgments

These notes are based in part on Mumford & Nichols (2006).

References

Beckmann, C. F., Jenkinson, M., & Smith, S. M. (2003). General multilevel linear modeling for group analysis in FMRI. NeuroImage, 20(2), 1052-1063.
Mumford, J. A., & Nichols, T. E. (2006). Modeling and inference of multisubject fMRI data. IEEE Engineering in Medicine and Biology Magazine, 25(2), 42-51.
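To close, a numerical sketch of the method of moments estimator from section 2.5; the contrast estimates and within-subject variances below are fabricated purely for illustration:

```python
# Hypothetical group data: one contrast estimate per subject, plus the
# (assumed known) within-subject variance of each estimate
gamma_hat = [1.8, 2.5, 2.1, 3.2, 1.4, 2.6]
var_within = [0.3, 0.4, 0.2, 0.5, 0.3, 0.3]

N = len(gamma_hat)
mean_g = sum(gamma_hat) / N
S2 = sum((g - mean_g) ** 2 for g in gamma_hat) / (N - 1)  # sample variance
vbar = sum(var_within) / N            # mean within-subject variance, V-bar

# Method of moments estimate of sigma_G^2, floored at zero so a noisy
# sample variance cannot produce a negative between-subject variance
sigma2_G_mom = max(0.0, S2 - vbar)
```

Here the sample variance exceeds the average within-subject variance, so a positive between-subject variance is attributed to genuine subject-to-subject variation.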