SUPPLEMENTARY MATERIAL

Clinical prediction from structural brain MRI scans: A large-scale empirical study

Mert R. Sabuncu1,2 and Ender Konukoglu1, for the Alzheimer’s Disease Neuroimaging Initiative*

1 Athinoula A. Martinos Center for Biomedical Imaging, Harvard Medical School/Massachusetts General Hospital, Charlestown, MA
2 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA

Supplementary Table S1
List of univariate measurements, such as thickness or volume of an anatomical ROI, that were most frequently associated with each variable of interest over the hundred 5-fold cross-validation sessions.

Supplementary Figure S1
Prediction performance metrics for each variable and MVPA algorithm, estimated via 5-fold cross-validation. (Panels a-c) Binary variables; (Panel d) Continuous variables. (Panel a) Area under the receiver operating characteristic curve (AUC), computed using MATLAB’s perfcurve tool (MATLAB 8.2, The MathWorks Inc., Natick, MA, 2013). (Panel b) True positive rate, or sensitivity. (Panel c) True negative rate, or specificity. (Panel d) Pearson’s correlation values. The MVPA algorithms are abbreviated as follows: N for neighborhood approximation forest (NAF), S for SVMs, and R for RVMs. The number after each letter denotes the feature type (1: aseg, 2: aparc, 3: aseg+aparc, 4: thick). The shaded gray color indicates statistical significance (-log10 p-value), where the p-value is computed via DeLong’s method and the colorbar is shown in Panel d. Associations with a p-value less than 0.01 are shown in red.

Supplementary Figure S2
The difference between the performance estimates of the three algorithms over all predicted variables and feature types. (Panel a) Correct Classification Ratio (CCR) differences with respect to the RVM, which has the highest average accuracy. All three classifiers are statistically equivalent.
(Panel b) Normalized RMSE (NRMSE) differences with respect to the SVM, which offers the lowest average NRMSE. NAF and SVM are statistically equivalent, and both offer significantly better regression accuracy than RVM. On each box, the central mark is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers (which corresponds to 99.3% data coverage if the data are normally distributed), and outliers are plotted individually. Plots were generated with MATLAB’s boxplot function, with the ‘notch’ option turned on and default settings otherwise.

Supplementary Figure S3
The difference between the performance estimates of the MVPA models that employ the four image feature types, over all predicted variables and algorithms. (Panel a) Correct Classification Ratio (CCR) differences with respect to Feature 3 (aseg+aparc), which yields the highest average accuracy. Features 3 and 4 are statistically equivalent and offer slightly better accuracies than Features 1 and 2. (Panel b) Normalized RMSE (NRMSE) differences with respect to Feature 4, which yields the lowest average NRMSE. Features 1 and 3 yield performances that are statistically equivalent to that of Feature 4. On each box, the central mark is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers (which corresponds to 99.3% data coverage if the data are normally distributed), and outliers are plotted individually. Plots were generated with MATLAB’s boxplot function, with the ‘notch’ option turned on and default settings otherwise.

Supplementary Figure S4
Algorithm versus feature range values. For each variable, we computed the algorithm range as the difference between the best and worst performance metrics (CCR for classification, Panel a; NRMSE for regression, Panel b) across the three algorithms (SVM, RVM, and NAF), while fixing the feature type.
These values were then averaged over feature types. Similarly, for each variable, the feature range was defined as the difference between the best and worst performance metrics across the four feature types, while fixing the algorithm type. These values were then averaged over the algorithms.
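
The range computation described in the caption of Supplementary Figure S4 can be illustrated with a short Python sketch. This is not the code used in the study: the function name, the array layout (variables × algorithms × feature types), and the toy numbers are assumptions made for illustration only. The sketch also assumes that "best minus worst" is taken as the non-negative spread (maximum minus minimum), which has the same magnitude whether higher values are better (CCR) or lower values are better (NRMSE).

```python
import numpy as np


def algorithm_and_feature_ranges(perf):
    """Compute per-variable algorithm and feature ranges.

    perf: array of shape (n_variables, n_algorithms, n_feature_types)
          holding one performance metric (e.g. CCR or NRMSE).
          This layout is a hypothetical choice for this sketch.
    Returns two arrays of shape (n_variables,).
    """
    perf = np.asarray(perf, dtype=float)
    if perf.ndim != 3:
        raise ValueError("perf must have shape (variables, algorithms, features)")

    # Algorithm range: spread across algorithms with the feature type fixed,
    # then averaged over feature types.
    spread_over_algorithms = perf.max(axis=1) - perf.min(axis=1)
    algorithm_range = spread_over_algorithms.mean(axis=1)

    # Feature range: spread across feature types with the algorithm fixed,
    # then averaged over algorithms.
    spread_over_features = perf.max(axis=2) - perf.min(axis=2)
    feature_range = spread_over_features.mean(axis=1)

    return algorithm_range, feature_range


# Toy data (made up): one variable, three algorithms, four feature types.
toy_perf = np.array([[[0.70, 0.72, 0.74, 0.76],
                      [0.71, 0.73, 0.75, 0.77],
                      [0.69, 0.71, 0.73, 0.75]]])

algo_range, feat_range = algorithm_and_feature_ranges(toy_perf)
print("algorithm range:", algo_range)
print("feature range:", feat_range)
```

In this toy example the metric varies more across feature types than across algorithms, so the feature range exceeds the algorithm range; real values would be obtained by filling the array with the cross-validated CCR or NRMSE estimates.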