Test-Retest Reliability Estimation of Functional MRI Data

Ranjan Maitra,1 Steven R. Roys,2 and Rao P. Gullapalli2,*

1 Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, MD 21250
2 Department of Radiology, University of Maryland School of Medicine, Baltimore, MD 21201

Corresponding Author: Rao P. Gullapalli, Functional Imaging Laboratory, Department of Radiology, 22 S. Greene St, University of Maryland, Baltimore, Baltimore, MD 21201. Phone: 410-328-2099; Fax: 410-328-0341; e-mail: rgullapalli@umm.edu

Running Title: Test-retest reliability estimate of fMRI data

ABSTRACT

Functional Magnetic Resonance Imaging (fMRI) data are commonly used to construct activation maps for the human brain. Quantifying the reliability of such maps is important. We have developed statistical models that provide precise estimates of reliability from several runs of the same paradigm over time. Specifically, our method extends the maximum likelihood approach developed in Genovese et al. (MRM 1997;38:497-507) by incorporating spatial context into the estimation process. Experiments indicate that our methodology provides more conservative estimates of true positives than those obtained by Genovese et al. The reliability estimates can be used to obtain voxel-specific reliability measures for activated as well as inactivated regions in future experiments. We derive statistical methodology to decide on optimal thresholds for determining region- and context-specific activation. Empirical guidelines are also provided on the number of repeat scans to acquire in order to arrive at accurate reliability estimates. We report all results on experiments involving a motor paradigm performed on a single subject several times over a period of two months.

Keywords: fMRI, quantitation, Markov Random Field, Iterated Conditional Modes, ROC analysis, motor task

Introduction

The past decade has seen functional Magnetic Resonance Imaging (fMRI) evolve into a standard tool to map the human brain (1). Other imaging modalities such as positron emission tomography (PET) or electro-encephalography (EEG) have similar capabilities; however, the non-invasive nature of fMRI and the consequent convenience of being able to repeat a scan with a given paradigm several times give it a substantial edge. This advantage is enhanced by its comparatively superior spatial resolution. Moreover, while the temporal resolution provided by fMRI is far lower than that of EEG, recent developments in acquisition techniques with single-trial studies show promise in closing this gap (2). In spite of these advances, however, there is little published work on the reliability and reproducibility of fMRI data. The Blood-Oxygen-Level-Dependent (BOLD) response that is usually seen in fMRI develops over several seconds after the neural stimulation. Changes in neural activity instigate changes in local hemodynamics, which result in a changing concentration of deoxy-hemoglobin. Thus any neural stimulus passes through the so-called hemodynamic filter before it can be observed as a BOLD effect in regular fMRI scanning sessions. These fMRI activations, and hence the observed local hemodynamic changes, depend on the type of paradigm used to elicit the response in addition to the acquisition technique. Most researchers assume that the activations obtained from fMRI are inherently reliable and base their conclusions on this assumption.
The goal of most fMRI experiments is to determine whether a given task elicits a response in a brain region, either in comparison to another task or in comparison to a so-called 'rest' state in which the subject is presumably doing nothing. There is also a general desire to quantitate the activation provided by these fMRI studies. This quantitation has to be considered in the context of known factors that affect activation patterns in an fMRI experiment, including physiological variation, scanner noise, and patient motion. Physiological variation may arise from cardiac and respiratory motion and any consequent flow-related artifacts. These may be monitored and digitally filtered from fMRI data (3). A second important factor is scanner noise variability, which can be minimized through implementation of good quality control programs. A third factor that can affect the activation pattern is patient motion during the scan, which can be detrimental to the quality of fMRI images. Such motion can be due to gross movement during the scan or to involuntary motion that may be stimulus-correlated (4). Since the signal differences between the activated state and the control state tend to be small, of the order of 1-5%, even sub-pixel motion can induce large signal changes that could be falsely interpreted as activation. Most studies, therefore, resort to some form of image registration that aligns the sequence of images to sub-pixel accuracy (5). However, even when all these variables are taken into consideration during an fMRI examination, quantitation of the results still poses a major challenge.

A receiver operating characteristic (ROC) approach was used by Skudlarski et al. to evaluate the performance of t-tests in accurately discriminating true-positive from false-positive pixels (6). Their studies determined the optimal t-tests and MR imaging parameters that would maximize the true-positive rate. While their suggestions provide a good framework for designing experiments, they do not provide a sense of the reliability and reproducibility of fMRI activation patterns within a subject or between several subjects. As per Genovese et al., a method is considered 'reliable' if it identifies the same regions as active across several replications of the same experiment (7); the reliability estimate of an fMRI experiment provides a quantitative measure of confidence in the identified activation patterns. Some attempts to obtain test-retest reliability measures were recently made by Genovese et al., who presented a method that estimates, at a particular threshold and from M experimental replications of a given paradigm, the probability of a truly active voxel being correctly identified as active (which we denote as πA) and the probability of an inactive voxel being falsely identified as active (πI) (7, 8). Under this framework, they modeled the number of times (out of M replications) that a voxel is identified as active as a mixture of two binomial distributions, with πA and πI common to the whole image. The mixing proportion of truly active voxels, denoted by λ in their work, was also assumed to be common over the entire image. They used the method of maximum likelihood to estimate λ, πA, and πI.
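To make this concrete, the following is a minimal Python sketch of such a two-component binomial mixture fitted by maximum likelihood. The simulated counts, the assumed rates, and the use of a Nelder-Mead optimizer are illustrative assumptions only and are not taken from either study.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

def neg_log_likelihood(params, y, M):
    """Mixture of two binomials at a single threshold: y[i] is the number of
    the M replications in which voxel i was declared active; lam, pi_A and
    pi_I are common to all voxels."""
    lam, pi_A, pi_I = np.clip(params, 1e-6, 1 - 1e-6)
    mix = lam * binom.pmf(y, M, pi_A) + (1 - lam) * binom.pmf(y, M, pi_I)
    return -np.sum(np.log(mix))

# Simulated counts for illustration only (rates and voxel numbers are made up).
rng = np.random.default_rng(0)
M, n_vox = 12, 5000
truly_active = rng.random(n_vox) < 0.02              # ~2% truly active voxels
y = np.where(truly_active,
             rng.binomial(M, 0.80, n_vox),           # assumed true-positive rate
             rng.binomial(M, 0.05, n_vox))           # assumed false-positive rate

fit = minimize(neg_log_likelihood, x0=[0.05, 0.7, 0.1],
               args=(y, M), method="Nelder-Mead")
lam_hat, piA_hat, piI_hat = np.clip(fit.x, 0, 1)
print(lam_hat, piA_hat, piI_hat)
```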
In this paper we extend their model by incorporating spatial context into the test-retest reliability metric, estimating λ on a voxel-by-voxel basis rather than as a common parameter for the entire image set (all slices). We also discuss an alternative model for the likelihood that incorporates the dependence between the numbers of replications for which a voxel is identified as active at successive thresholds. Here, we provide a description of our modeling and estimation techniques and their performance when applied to a motor task paradigm, and we compare our technique with that of Genovese et al. (7). Further, we combine the estimates of the λ's, πA's, and πI's to provide a reliability map. We define reliability measures on a voxel-by-voxel basis, depending on whether a voxel has been identified as active or inactive. We define the reliability of an active voxel to be the probability that the voxel is truly active given that it has been identified as active. The anti-reliability of an inactive voxel is the probability that the voxel is truly active given that it has been incorrectly identified as inactive. Such a map then forms the basis for comparing future studies on the same person using the same paradigm. Finally, we provide a framework for the choice of thresholds based on maximizing the maximum likelihood (ML) reliability efficient frontier. For a given threshold and a given voxel, the reliability efficient frontier is defined as the probability that the state of the voxel, whether active or inactive, is correctly identified. This approach is similar to that of Genovese et al. (7); however, our optimal thresholds are voxel-dependent, since the ML reliability efficient frontier is itself voxel-dependent. We conclude with a discussion and pointers for further work.

Methods

Imaging: All MR images were obtained on a GE 1.5 Tesla Signa system equipped with echo planar gradients using v5.8 software. Our structural T1-weighted images used a standard spin-echo sequence with TE/TR of 10 ms/500 ms. The positioning of the slices followed the procedure described by Noll et al. to minimize inter-session differences (8). This procedure is designed to allow accurate repositioning to facilitate longitudinal scanning, and uses all three planes to position the slices (9). Twenty-four slices parallel to the AC-PC line were acquired using a single-shot spiral sequence at a TE of 35 ms and a TR of 4000 ms (10). The slice thickness was 6 mm with no gap between slices (11). The paradigm consisted of eight cycles of a simple finger-thumb opposition motor task. Finger-thumb opposition was performed for 32 seconds followed by a rest period of 32 seconds, thus generating 128 time points for a single run. This paradigm was run on a single volunteer for both the left and the right hand after obtaining informed consent, and was repeated at twelve different times over a period of two months. To compare a new data set with the derived reliability maps, one additional data set using the same paradigm was acquired 6 months after the first twelve. Data were then transferred from the scanner to an SGI Origin 200 workstation where all reconstructions were performed. Motion correction was applied to each run using Automated Image Registration (AIR), following which time series were generated at each voxel (5). These time series were then normalized to a mean of zero to remove any linear drift in the data.
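As an illustration of the timing just described (TR of 4 s, 32 s of task alternating with 32 s of rest, 128 time points per run), the following Python sketch builds a task reference waveform, removes the mean from a synthetic voxel time series, and correlates the two; it anticipates the lagged cross-correlation analysis described next. The 4 s lag, the signal amplitude, and the noise level are arbitrary assumptions.

```python
import numpy as np

# Acquisition timing from the text: TR = 4 s, eight cycles of 32 s of
# finger-thumb opposition followed by 32 s of rest, i.e. 128 time points.
TR, block = 4.0, 32.0
t = np.arange(128) * TR

# Boxcar task reference (1 = task, 0 = rest) and a lagged sinusoidal
# reference of the same period; the 4 s lag below is an arbitrary example.
boxcar = ((t % (2 * block)) < block).astype(float)

def sine_ref(lag_s):
    return np.sin(2 * np.pi * (t - lag_s) / (2 * block))

def demean(ts):
    """Normalize a voxel time series to zero mean, as described above."""
    return ts - ts.mean()

# Synthetic voxel time series (2% signal change plus noise) and its
# correlation with a lagged sinusoidal reference.
rng = np.random.default_rng(1)
voxel_ts = 1000 * (1 + 0.02 * boxcar) + rng.normal(0, 5, t.size)
r = np.corrcoef(demean(voxel_ts), sine_ref(4.0))[0, 1]
print(round(r, 2))
```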
To further minimize misregistration between sessions, cross-session image registration was performed between all twelve sessions using the inter-session registration algorithms provided by Bob Cox under AFNI (12). Cross-correlations were performed using sinusoidal waveforms with lags of up to 8 s to create the functional maps. Similar processing was performed on the data set that was acquired 6 months later. These functional maps were thresholded at levels of {0.2, 0.25, ..., 0.85} to obtain maps of active/inactive voxels at each threshold; i.e., a voxel was identified as active if the correlation at the voxel was greater than the threshold, and inactive otherwise. Finally, our statistical analysis was performed using the algorithms and data analysis programs described below, all of which were built on a combination of MATLAB commands and the 'C' programming language.

Statistical Methodology: Consider the following setup. Let λi be the probability that the i'th voxel is truly active. Let K be the number of activation threshold levels. Let Yi = (yi,1, yi,2, ..., yi,K), where yi,k is the number of replications for which the i'th voxel is active at the k'th activation threshold. Without loss of generality, assume that the threshold levels are in increasing order. Further, let pAk,k-1 be the probability of a truly active voxel being identified as active at the k'th threshold level, given that it is also correctly identified as active at the (k-1)'th threshold level. Similarly, let pIk,k-1 be the corresponding probability of a truly inactive voxel being identified as active at the k'th threshold level, given that it is also incorrectly identified as active at the (k-1)'th threshold level. For notational consistency, we denote pA1,0 and pI1,0 as πA1 and πI1, respectively; these are the probabilities of a truly active and a truly inactive voxel, respectively, being identified as active at the first threshold level. Also, let yi,0 = M, the total number of replications. The likelihood function for the i'th voxel is then given by

\[
P(Y_i = y_i) \;=\; \lambda_i \prod_{k=1}^{K} \binom{y_{i,k-1}}{y_{i,k}} p_{Ak,k-1}^{\,y_{i,k}} \bigl(1 - p_{Ak,k-1}\bigr)^{y_{i,k-1}-y_{i,k}} \;+\; (1-\lambda_i) \prod_{k=1}^{K} \binom{y_{i,k-1}}{y_{i,k}} p_{Ik,k-1}^{\,y_{i,k}} \bigl(1 - p_{Ik,k-1}\bigr)^{y_{i,k-1}-y_{i,k}}. \qquad [1]
\]

The corresponding likelihood for the entire set of voxels in the image set (all slices) is then the product of the above over all voxels.

The above model differs from the one described by Genovese et al. in two major respects (7). In the first instance, it extends the modeling by incorporating voxel-specific probabilities of true activation. Additionally, and within the framework presented above, this model offers a more accurate representation of the likelihood function by explicitly incorporating the dependence structure between the successive yi,k's at the i'th voxel. Note that the derivation of the dependent likelihood described in the appendix provided by Genovese et al. is not applicable here because our λ's are voxel-specific. The unconditional probability of a truly active voxel being correctly classified at the k'th threshold is given by

\[
\begin{aligned}
\pi_{Ak} &= P\{\text{truly active voxel is identified as active at the } k\text{'th threshold}\}\\
&= \prod_{j=1}^{k} P\{\text{truly active voxel is identified as active at the } j\text{'th threshold} \mid \text{it is also so identified at the } (j-1)\text{'th threshold}\}\\
&= \prod_{j=1}^{k} p_{Aj,j-1}. \qquad [2]
\end{aligned}
\]

Similarly, for the unconditional probability of a truly inactive voxel being incorrectly classified as active,

\[
\pi_{Ik} = \prod_{j=1}^{k} p_{Ij,j-1}. \qquad [3]
\]

The pAk,k-1's and pIk,k-1's are global parameters and are the same for the entire image.
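A minimal Python sketch of Eq. [1] and of Eqs. [2]-[3] follows; the counts and conditional probabilities used below are hypothetical and serve only to illustrate the computation.

```python
import numpy as np
from scipy.stats import binom

def voxel_likelihood(y, lam_i, pA, pI, M):
    """Eq. [1]: per-voxel likelihood under the dependent (conditional binomial)
    model. y = (y_1, ..., y_K) are the counts of replications declared active
    at the K increasing thresholds; y_0 = M is the number of replications."""
    y_full = np.concatenate(([M], np.asarray(y)))
    # Product over k of Binomial(y_k | n = y_{k-1}, p_{k,k-1}) terms for the
    # truly-active and truly-inactive branches.
    active = np.prod(binom.pmf(y_full[1:], y_full[:-1], pA))
    inactive = np.prod(binom.pmf(y_full[1:], y_full[:-1], pI))
    return lam_i * active + (1 - lam_i) * inactive

def unconditional_pi(p_cond):
    """Eqs. [2]-[3]: pi_k = prod_{j <= k} p_{j,j-1}."""
    return np.cumprod(p_cond)

# Hypothetical values for K = 4 thresholds and M = 12 replications.
pA = np.array([0.95, 0.90, 0.85, 0.80])   # conditional true-positive rates
pI = np.array([0.30, 0.20, 0.15, 0.10])   # conditional false-positive rates
y = np.array([11, 10, 9, 7])              # non-increasing counts across thresholds
print(voxel_likelihood(y, lam_i=0.8, pA=pA, pI=pI, M=12))
print(unconditional_pi(pA), unconditional_pi(pI))
```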
The λ's, on the other hand, are voxel-specific. However, since truly active voxels tend to occur together, as do truly inactive voxels, we introduce three-dimensional (3-D) spatial context into the estimation of the λ's by adding the following penalty term to the log of the likelihood function:

\[
\beta \sum_{i \sim j} \frac{1}{1 + (\lambda_i - \lambda_j)^2}. \qquad [4]
\]

Here, i ~ j denotes that voxels i and j are neighboring voxels. We define the 3-D neighborhood of voxel i to be the set of voxels that share an edge or a corner with it. The inclusion of the above term is once again a generalization of Genovese et al. (as noted in their discussion) (7). Specifically, it is the logarithm of the Geman-McClure prior and provides a description of our prior beliefs via a Markov Random Field (MRF) (13-17). The above prescription penalizes configurations in which neighboring λ's are far apart; however, the penalty is not as severe as it would be under, say, a Gaussian MRF prior (17). Further, β is a hyper-parameter that measures the strength of the interaction between neighboring λ's: higher values of β penalize λ-configurations with far-apart neighbor values more severely than do lower values. Typically, β is not known and needs to be accounted for in order for inference to proceed. We include it as one of the parameters to be estimated, along with the pAk,k-1's, pIk,k-1's, and λ's.

Note also that the prior above is specified only up to a scaling constant that is computationally intractable. Further, maximizing the posterior is no longer an easy proposition, so stochastic methods such as simulated annealing or Markov Chain Monte Carlo (MCMC) would ordinarily have to be used (17). These methods are, however, computationally demanding. We therefore decided to use the method of Iterated Conditional Modes (ICM), first introduced in the statistical imaging literature by Besag (15). The ICM approach finds a local maximum in the vicinity of its initialization. We chose the initial estimate for the voxel-specific λ's to be the thresholded average of the correlations between the sinusoidal reference waveform and the signal time series for each image; the average is thresholded above zero so that the initial estimates for λ take values between zero and unity. Given these values for λ, we calculate the pAk,k-1's, pIk,k-1's, and β by maximizing the sum of the log-likelihood and the penalty term in Eq. 4. Note that the multivariate maximization over the pAk,k-1's and pIk,k-1's can be done independently of that over β. We perform this maximization using the downhill simplex method of Nelder and Mead on the likelihood part of the objective (since the prior does not involve these parameters) (18). The estimation of the hyper-parameter presents a challenge, however, given that the scaling constant in the prior is a function of β. We obtain instead a pseudo-likelihood estimate for β. This is done by maximizing the log-pseudo-likelihood function

\[
\beta \sum_{i=1}^{n} \sum_{j \in \partial i} \frac{1}{1 + (\lambda_i - \lambda_j)^2} \;-\; \sum_{i=1}^{n} \ln c_i(\beta), \qquad [5]
\]

where

\[
c_i(\beta) = \int \exp\Bigl\{ \beta \sum_{j \in \partial i} \frac{1}{1 + (\lambda_i - \lambda_j)^2} \Bigr\} \, d\lambda_i. \qquad [6]
\]

The inner summation in both cases is over the set ∂i, the voxels that are neighbors of the i'th voxel. With the above estimates of β and of the pA's and pI's, we obtain sequentially the mode of the conditional posterior distribution of each λi given the rest. This concludes one pass of the algorithm. We iterate the entire procedure, re-estimating the pAk,k-1's, pIk,k-1's, β, and λ's until convergence.
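The following Python sketch illustrates a single ICM sweep over the λ field under the Geman-McClure penalty of Eq. [4]. For simplicity it uses a 2-D, eight-voxel neighborhood and a grid search for each conditional mode, and the per-voxel likelihood terms are random stand-ins; it is an illustration of the update rule, not our actual MATLAB/C implementation, which uses the 3-D neighborhood and also re-estimates the p's and β on each pass.

```python
import numpy as np

def icm_lambda_sweep(lam, a, b, beta, grid=np.linspace(0.0, 1.0, 101)):
    """One ICM sweep over the lambda field.

    lam  : current lambda image (2-D array here; the paper uses a 3-D
           neighborhood of voxels sharing an edge or a corner).
    a, b : per-voxel values of the truly-active and truly-inactive products
           in Eq. [1], held fixed during the sweep (stand-ins below).
    beta : Geman-McClure interaction hyper-parameter.
    Each lambda_i is replaced by the mode of its conditional posterior,
    located here by a simple grid search over [0, 1].
    """
    lam = lam.copy()
    nr, nc = lam.shape
    for i in range(nr):
        for j in range(nc):
            # 8-connected neighborhood (edge- or corner-sharing) in 2-D.
            nb = np.array([lam[p, q]
                           for p in range(max(i - 1, 0), min(i + 2, nr))
                           for q in range(max(j - 1, 0), min(j + 2, nc))
                           if (p, q) != (i, j)])
            # log-likelihood term of Eq. [1] plus the penalty of Eq. [4].
            obj = (np.log(grid * a[i, j] + (1 - grid) * b[i, j] + 1e-300)
                   + beta * np.sum(1.0 / (1.0 + (grid[:, None] - nb) ** 2),
                                   axis=1))
            lam[i, j] = grid[np.argmax(obj)]
    return lam

# Illustration on a small 8 x 8 field with random stand-in likelihood terms.
rng = np.random.default_rng(2)
a = rng.uniform(1e-6, 1e-3, (8, 8))
b = rng.uniform(1e-6, 1e-3, (8, 8))
lam = rng.random((8, 8))
for _ in range(5):                 # iterate sweeps toward convergence
    lam = icm_lambda_sweep(lam, a, b, beta=2.7)
```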
We thus obtain ICM estimates of the parameters and use them to obtain the πA's and πI's for the different thresholds. With these estimates we can construct, respectively, reliability and anti-reliability maps for regions identified as active or inactive at a given threshold in a future study. To do this, note that if the i'th voxel is identified as active at the k'th threshold, its reliability measure is given by

\[
R_{Ak} = \frac{\lambda_i \pi_{A,k}}{\lambda_i \pi_{A,k} + (1-\lambda_i)\,\pi_{I,k}}. \qquad [7]
\]

On the other hand, if the i'th voxel is identified as inactive at the k'th threshold, its anti-reliability measure is defined as

\[
A_{Ik} = \frac{(1-\lambda_i)(1-\pi_{I,k})}{\lambda_i (1-\pi_{A,k}) + (1-\lambda_i)(1-\pi_{I,k})}. \qquad [8]
\]

Thus, together with an activation map, one can obtain maps of reliability and anti-reliability for active and inactive voxels, respectively.

We also used the estimates to obtain optimal thresholds that maximize the probability of correctly identifying the state of each voxel. The ML reliability efficient frontier in our method is voxel-specific and is expressed as λiπA,k + (1-λi)(1-πI,k) for the i'th voxel and the k'th threshold. Maximizing this over the pairs of πA's and πI's for the different thresholds gives the optimal pair at each voxel. This can be used by investigators to identify the appropriate threshold for deciding on activation in a future study. In particular, if the optimal thresholds vary with region, the investigator may choose a threshold based on the area in which he or she is most interested in identifying activation patterns.
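A short Python sketch of Eqs. [7]-[8] and of the voxel-wise optimal threshold choice follows; the λ map and the πA, πI values are hypothetical and chosen only to illustrate the computation.

```python
import numpy as np

def reliability_maps(lam, pi_A, pi_I, k):
    """Eqs. [7]-[8] at threshold index k: reliability of voxels identified as
    active, and anti-reliability of voxels identified as inactive."""
    R_A = lam * pi_A[k] / (lam * pi_A[k] + (1 - lam) * pi_I[k])
    A_I = ((1 - lam) * (1 - pi_I[k])
           / (lam * (1 - pi_A[k]) + (1 - lam) * (1 - pi_I[k])))
    return R_A, A_I

def optimal_threshold_index(lam, pi_A, pi_I):
    """Voxel-wise index of the threshold maximizing the ML reliability
    efficient frontier lam*pi_A[k] + (1 - lam)*(1 - pi_I[k])."""
    frontier = lam[..., None] * pi_A + (1 - lam[..., None]) * (1 - pi_I)
    return np.argmax(frontier, axis=-1)

# Hypothetical estimates over K = 4 thresholds and a tiny lambda map.
pi_A = np.array([0.92, 0.85, 0.75, 0.60])
pi_I = np.array([0.20, 0.10, 0.05, 0.02])
lam = np.array([[0.90, 0.05],
                [0.02, 0.50]])
R_A, A_I = reliability_maps(lam, pi_A, pi_I, k=1)
k_opt = optimal_threshold_index(lam, pi_A, pi_I)   # low index for high-lambda voxels
print(R_A, A_I, k_opt, sep="\n")
```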
Results

Figure 1a shows activation in the left motor cortex from performance of a right-hand finger-thumb opposition task by a volunteer scanned twelve different times over a period of two months. The activations are displayed here at a threshold level of 0.5. Figure 1b shows a similar set of images obtained from activation of the right motor cortex during left-hand finger tapping. Note that the variability of activation is quite remarkable, ranging from as many as 189 pixels activating in the best case to as few as 18 pixels in the worst case in one of the slices (slice 20 of the 24 slices) for the right finger-thumb opposition task. In our statistical analysis, the ICM procedures converged rather rapidly, in six iterations for the right hand and in nine for the left hand. The final estimates of the interaction parameter β were virtually indistinguishable between the two tasks, equal to 2.696 to the fourth decimal place in both cases. We display our results in the accompanying figures. The estimated λ image is shown in figures 2a and 2b for the right- and left-hand tasks, respectively. The most opaque voxels on the red transparency overlays are those with estimates corresponding to the maximum value of λ. Figures 3a and 3b show a comparison of ROC curves from three different methods: (a) the method of Genovese et al., denoted by 'o'; (b) our alternative dependent likelihood model with a fixed λ of 0.01776, denoted by '+'; and (c) the ICM estimates with voxel-specific λ, for the right- and left-hand finger-thumb opposition tasks, respectively. Figures 4a and 4b show the reliability and anti-reliability maps for activation obtained from the same subject six months later using the same paradigm, at a threshold of 0.3, for the left and right hand, respectively. For both figures the yellow overlays represent the reliability of voxels identified as active, with opacity directly proportional to the reliability measure of the active voxel. Additionally, in the same figures, the red overlays display voxels that were identified as inactive; the extent of opacity here represents the anti-reliability measure of the inactive voxels. Thus, we get a representation of the correctness of our identification, both for active voxels and for those that we missed. Finally, Fig. 5 shows the thresholds at which activation in various regions of the brain can be reliably detected for the left hand. Note that the confidence for a correct identification is highest for truly active voxels (for example, in the motor cortex) at a very low threshold, whereas higher thresholds are required for identification to be more accurate in other regions of the brain. This can be explained by noticing that the optimal values of the πA's and πI's for large values of λ are obtained from the ML reliability efficient frontier when the change between successive (πA, πI) pairs is large; this is typically the case at the smallest thresholds, as illustrated in Fig. 3b. In particular, this means that while at low thresholds many voxels will be identified as active, only those voxels that are in regions with high λ values and low optimal thresholds in the map (such as in the motor cortex) will have the greatest probability of having been correctly identified. It also means that while far fewer voxels with low λ will be identified as active at higher thresholds, the probability that the state of these voxels is correctly identified is higher at these higher thresholds than at lower ones. From these optimum threshold values, one can choose, depending on the region of interest in the context of the activation experiment, the threshold for which the chance of correctly identifying an activated or inactivated region is the highest.

Discussion

Genovese et al. provided a novel statistical method to obtain estimates of test-retest reliability using the maximum likelihood method (7). Their paper assumed a voxel-independent true activation rate λ. Their computations are relatively straightforward; however, they comment on the desirability of including voxel-specific true activation rates in the model. The value of this approach lies in the fact that not all voxels have the same chance of being activated: indeed, voxels in the background have no chance of being truly active. Additionally, it is more likely that voxels that are spatially close to each other have similar characteristics and hence similar λ-values. We have incorporated this aspect into the estimation process through a prior distribution using the Geman-McClure model (13, 14). Our model includes the parameter β, which measures the strength of the interaction between neighboring values of λ. As a further generalization, we note that the independent likelihood model of Genovese et al. ignored the dependence between the incidences of activation recorded at different threshold levels. The methodology developed here provides a more accurate representation of the model and sharper confidence estimates. Indeed, our results provide a more conservative estimate of πA and a general reduction of πI for a given threshold as compared to the method of Genovese et al. As a result, our estimated ROC curve has a lower area than that obtained using the method of Genovese et al.; however, our estimated curve uses a truer and more accurate representation of the model.
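For reference, the area comparison can be made with the trapezoidal rule over the estimated (πIk, πAk) pairs, as in the following sketch; the numbers shown are illustrative, not our estimates.

```python
import numpy as np

def roc_area(pi_I, pi_A):
    """Trapezoidal area under the discrete ROC curve traced by the estimated
    (pi_Ik, pi_Ak) pairs across thresholds, anchored at (0,0) and (1,1)."""
    x = np.concatenate(([0.0], np.sort(np.asarray(pi_I)), [1.0]))
    y = np.concatenate(([0.0], np.sort(np.asarray(pi_A)), [1.0]))
    return float(np.sum(0.5 * (x[1:] - x[:-1]) * (y[1:] + y[:-1])))

# Hypothetical estimates from two models, for illustration only.
print(roc_area([0.22, 0.12, 0.06], [0.95, 0.90, 0.82]))   # e.g. independent model
print(roc_area([0.18, 0.08, 0.03], [0.90, 0.80, 0.68]))   # e.g. dependent model
```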
We demonstrate the use of our methodology in Fig. 4 to determine our confidence in the results of a future activation experiment on the same subject using the same experimental paradigm. Further, we also use the estimates to suggest optimal threshold values for future experiments. Because we allow our λ's to be voxel-specific, our optimal threshold values are found to vary with region. This information can be used by the researcher to determine the optimal threshold based on the region that he or she feels is of most interest in the context of the activation experiment.

An interesting question is how many replications are needed to obtain reasonable estimates. Our original study provides estimates obtained from 12 replications. A full treatment would involve a cost-benefit analysis, but here we limit ourselves to an empirical study of the average gain from using more replications. To this end, we performed the following experiment: we obtained, via simple random sampling, ten sets of M = 2 replications from the twelve. From each set, we obtained estimates of the πA's, πI's, and λ's, and we compared the root-mean-squared (RMS) error of these estimates against those obtained using all twelve replications. The ten RMS errors for M = 2 are displayed via the box plot in Fig. 6. Repeating the above exercise for ten similarly sampled sets of M = 3, 4, ..., 11 replications provided the remaining box plots in the figure. These results indicate statistical consistency, with increased precision in the πA's, πI's, and λ as the number of replications increases. The gain is most pronounced as we move from 2 to 3 replications and tapers off substantially around 5 or 6. This possibly indicates that, for this particular task and subject, 5 or 6 replications were enough to obtain satisfactory estimates of reliability. It would be interesting to perform a complete set of studies on several subjects and tasks to determine whether a similar trend holds in general.
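The subsampling experiment just described can be sketched as follows; here estimate_fn stands for any fitting routine (such as the ICM procedure) that returns a vector of parameter estimates from a subset of runs, and the dummy estimator at the end is purely illustrative.

```python
import numpy as np

def rms_error_study(estimate_fn, runs, M_values, n_draws=10, seed=0):
    """For each M, draw n_draws simple random subsets of M runs, re-estimate
    the parameters, and record the RMS error against the full-data estimates.

    estimate_fn : routine returning a flat array of parameter estimates
                  (e.g. the pi_A's, pi_I's and lambda's) from a list of runs.
    runs        : the twelve replications (any indexable collection).
    """
    rng = np.random.default_rng(seed)
    full = estimate_fn(runs)
    rms = {}
    for M in M_values:
        errs = []
        for _ in range(n_draws):
            idx = rng.choice(len(runs), size=M, replace=False)
            est = estimate_fn([runs[i] for i in idx])
            errs.append(float(np.sqrt(np.mean((est - full) ** 2))))
        rms[M] = errs
    return rms

# Tiny demonstration with a dummy estimator (per-element mean across runs).
def dummy_fit(run_list):
    return np.mean(run_list, axis=0)

dummy_runs = [np.full(3, v) for v in np.linspace(0.0, 1.0, 12)]
rms = rms_error_study(dummy_fit, dummy_runs, M_values=range(2, 12))
# Boxplots of rms[M] against M reproduce the style of Fig. 6.
```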
Another interesting issue to be studied is whether the true-positive and false-positive rates are themselves voxel-specific. This of course introduces an additional computational burden into the estimation process. For any given study, especially those that pertain to longitudinal changes, it is necessary to establish a baseline, which would then be used to estimate changes or differences during the course of the study. It then becomes necessary to generate reliability/anti-reliability maps that incorporate the normal variability within a given sequence, paradigm, or subject performance level. We have illustrated this application in Fig. 4. Obviously, as the number of sessions increases, we will have more precise estimates of both reliability and anti-reliability. Alternatively, one may wish to strike a balance between the marginal gain in precision and the marginal cost of obtaining an additional scan. Our study indicates that, at least for the motor study and paradigm described here, at least five repeat scans are required, since the gain in precision diminishes significantly with each successive scan after that. While this subject was scanned in different sessions spanning two months, the methodology is also applicable to cases where multiple acquisitions are obtained during a single session.

We obtained the data for this study over 12 different days with the hope of incorporating all variables that might play a role in the acquisition of the data, and to mimic the situation that would exist in real life with longitudinal studies. It would be interesting to see whether one would find any differences if multiple scans were performed on the same day, or split over a couple of days, to obtain the reliability estimates. The methodology developed here could also be applied to new pilot studies that have not been explored and to studies where there might be multi-focal activation. Specifically, if the desire of the researcher is to experiment with a paradigm to understand which parts of the cerebral cortex are involved and to what extent, the methodology described here could be used to estimate the probability of activation in the various areas of the cortex. We are hopeful that such an analysis will be beneficial particularly when investigating novel paradigms. We are currently investigating the applicability of this methodology to group analysis. Of course, in this situation the researcher is well served by converting his or her data into a common co-ordinate system, such as the Talairach co-ordinate system or the Montreal brain atlas, prior to performing the analysis (19, 20).

The estimates of reliability and anti-reliability arrived at here depend entirely on the statistical analysis chosen to prepare the activation maps. This implies that it is up to the user to provide as clean an activation map as possible prior to subjecting the data to analysis. For example, if an MR angiographic scan shows that certain voxels identified as active encompass a region with draining veins, it is advisable to mark these voxels as inactive or to exclude them from further analysis while making reliability estimates. One could also incorporate more sophisticated analyses, such as the one described by Saad et al., which provides a means of separating voxels into vascular and parenchymal pools (21). Other major factors that could degrade the quality of the activation maps include physiological factors such as cardiac and respiratory motion. Once again, if such data can first be digitally filtered prior to preparing the activation maps, the reliability and anti-reliability maps can be greatly improved, just as shown by Genovese et al., where the reliability estimates for registered data were far better than those for data that were not motion corrected (7).

In conclusion, we have developed a methodology that incorporates the spatial context of activation in fMRI experiments. The methodology developed here should be applicable to studies that require a measure of reliability, especially those that are longitudinal in design. It should also be applicable to new fMRI studies that use novel paradigms to reliably detect loci of activation across different subjects. Our future studies will involve further refinement of the methodology through the incorporation of physiological information.

Acknowledgements

We thank Doug C. Noll at the University of Michigan for providing us with the spiral sequence used to perform this study. Our thanks also to Craig Mullins of the University of Maryland, Baltimore for assistance in the data collection, and to Rouben Rostamian of the University of Maryland, Baltimore County for providing us with the C routines for the Nelder-Mead optimization.
Appendix

Derivation of the Likelihood Equation: Let Yi be the random vector associated with the observed yi. Then the likelihood function is given by

\[
\begin{aligned}
P(Y_i = y_i) &= P(i\text{'th voxel is truly active})\, P(Y_i = y_i \mid i\text{'th voxel is truly active})\\
&\quad + P(i\text{'th voxel is truly inactive})\, P(Y_i = y_i \mid i\text{'th voxel is truly inactive})\\
&= \lambda_i\, P(Y_i = y_i \mid i\text{'th voxel is truly active}) + (1-\lambda_i)\, P(Y_i = y_i \mid i\text{'th voxel is truly inactive}). \qquad [A1]
\end{aligned}
\]

Note that if the i'th voxel is identified as active in only yi,k-1 (of M) replications at the (k-1)'th threshold, then at the k'th threshold it can be identified as active in at most yi,k-1 replications, with probability pAk,k-1 (for truly active voxels) or pIk,k-1 (for truly inactive voxels) in each replication. This means that for truly active voxels, the conditional distribution of Yi,k, given that Yi,k-1 = yi,k-1 (with yi,k-1 positive), is binomial with yi,k-1 independent trials and success probability pAk,k-1. For truly inactive voxels, the corresponding conditional distribution is also binomial, with the same number of independent trials and success probability pIk,k-1. Note that this holds only if yi,k-1 is positive; otherwise, for both truly active and truly inactive voxels, yi,k = 0 with probability 1. Also note that, for both truly active and inactive voxels, the conditional distribution of Yi,k given Yi,k-j = yi,k-j, j = k, k-1, ..., 1, is the same as that of Yi,k given Yi,k-1 = yi,k-1. The result follows upon noting that, for both truly active and inactive voxels,

\[
P\{Y_i = y_i \mid i\text{'th voxel is truly (in)active}\} = \prod_{k=1}^{K} P\{Y_{i,k} = y_{i,k} \mid Y_{i,k-j} = y_{i,k-j},\; j = k, k-1, \ldots, 1, \text{ and the } i\text{'th voxel is truly (in)active}\}. \qquad [A2]
\]

REFERENCES:

1. Kwong KK, Belliveau JW, Chesler DA, Goldberg IE, Weisskoff RM, Poncelet BP, Kennedy DN, Hoppel BE, Cohen MS, Turner R. Dynamic magnetic resonance imaging of human brain activity during primary sensory stimulation. Proc Natl Acad Sci USA 1992;89:5675-5679.
2. Rosen BR, Buckner RL, Dale AM. Event-related functional MRI: past, present, and future. Proc Natl Acad Sci USA 1998;95:773-780.
3. Biswal B, DeYoe EA, Hyde JS. Reduction of physiological fluctuations in fMRI using digital filters. Magn Reson Med 1996;35:107-113.
4. Hajnal JV, Myers R, Oatridge A, Schwieso JE, Young IR, Bydder GM. Artifacts due to stimulus correlated motion in functional imaging of the brain. Magn Reson Med 1994;31:283-291.
5. Woods RP, Grafton ST, Watson JDG, Sicotte NL, Mazziotta JC. Automated image registration: II. Intersubject validation of linear and nonlinear models. J Comput Assist Tomogr 1998;22:253-265.
6. Skudlarski P, Constable RT, Gore JC. ROC analysis of statistical methods used in functional MRI. NeuroImage 1998;9:311-329.
7. Genovese CR, Noll DC, Eddy WF. Estimating test-retest reliability in functional MR imaging I: statistical methodology. Magn Reson Med 1997;38:497-507.
8. Noll DC, Genovese CR, Nystrom LE, Vazquez AL, Forman SD, Eddy WF, Cohen JD. Estimating test-retest reliability in functional MR imaging II: application to motor and cognitive activation studies. Magn Reson Med 1997;38:508-517.
9. Gallagher HL, MacManus DG, Webb SL, Miller DH. A reproducible repositioning method for serial magnetic resonance imaging studies of the brain in treatment trials for multiple sclerosis. J Magn Reson Imaging 1997;7:439-441.
10. Noll DC, Cohen JD, Meyer CH, Schneider W. Spiral k-space MRI of cortical activation. J Magn Reson Imaging 1995;2:501-505.
11. Noll DC, Boada FE, Eddy WF. Movement correction in fMRI: the impact of slice profile and slice spacing. In: Proceedings of the Fifth Annual Meeting of the ISMRM, Vancouver, 1997. p 1677.
12. Cox RW, Hyde JS. Software tools for analysis and visualization of fMRI data. NMR Biomed 1997;10:171-178.
13. Geman S, McClure DE. Bayesian image analysis: application to single photon emission computed tomography. Proc Stat Comp Sect, Amer Stat Assoc 1985;12-18.
14. Geman S, McClure DE. Statistical methods for tomographic image reconstruction. Bull Int Stat Inst 1987;52:5-21.
15. Besag JE. Towards Bayesian image analysis. J Appl Stat 1989;16:395-407.
16. Besag JE. On the statistical analysis of dirty pictures (with discussion). J Roy Stat Soc Ser B 1986;48:259-302.
17. Besag JE, Green PJ, Higdon D, Mengersen K. Bayesian computation and stochastic systems (with discussion). Stat Sci 1995;10:3-41.
18. Nelder JA, Mead R. A simplex method for function minimization. Computer Journal 1965;7:308-313.
19. Talairach J, Tournoux P. Co-planar stereotaxic atlas of the human brain. New York: Thieme Medical; 1988.
20. Evans AC, Kamber M, Collins DL, MacDonald D. An MRI-based probabilistic atlas of neuroanatomy. In: Shorvon SD, Fish DR, Andermann F, Bydder GM, Stefan H, eds. Magnetic resonance scanning and epilepsy. New York: Plenum Press; 1994:263-374.
21. Saad ZS, Ropella KM, Cox RW, DeYoe EA. Analysis and use of fMRI response delays. Hum Brain Mapp 2001;13:74-93.

Figure Captions

Figure 1. Activation maps of motor function from a single volunteer, overlaid on structural T1-weighted images, during (a) right-hand and (b) left-hand finger-thumb opposition tasks on 12 different occasions over 2 months. Shown here is the variability of activation in slice 20 of 24 slices. Yellow indicates the strongest positive correlation, red indicates medium positive correlation, and blue indicates low negative correlation.

Figure 2. Estimated λ image for 12 of the 24 slices for the (a) right-hand and (b) left-hand finger-thumb opposition tasks in the regions of the motor cortex and the cerebellum. The opacity of the red overlay is directly proportional to λ.

Figure 3. Receiver operating characteristic curves at different correlation threshold values (τ) using the method of Genovese et al. (denoted by 'o'); the dependent likelihood model with a fixed λ of 0.01776 (denoted by '+'); and the ICM model with voxel-specific λ, for the (a) right- and (b) left-hand finger-thumb opposition tasks. Three threshold values for correlation are shown on the graph; higher threshold values are not displayed to avoid confusion.

Figure 4. Reliability and anti-reliability maps for the additional scan obtained 6 months later: activation maps for the (a) right- and (b) left-hand finger-thumb opposition tasks, at a threshold of 0.3. Green represents the areas identified as active, with opacity proportional to the reliability of activation, while the opacity of the red voxels represents the anti-reliability measure for voxels identified as inactive in this scan. Note the difference in scale for green and red.

Figure 5. Optimum threshold overlaid on contour images of the brain for slice 20 for the left-hand finger-thumb opposition task. The gray scale indicates the optimum threshold for a given voxel, i.e., the threshold that maximizes the ML reliability efficient frontier. Low gray-scale values indicate that at a low threshold the chance of correctly identifying a voxel, whether active or inactive, is highest.

Figure 6.
Root-mean-square errors of the estimates obtained for M = 2 through 11 replications, compared with the estimates obtained using all 12 replications, for (a) πA, (b) πI, and (c) λ for the left-hand finger-thumb opposition task. The '+' signs indicate outliers, and the bar in each box indicates the median.