Supplementary Material

METHODS

We performed a post-hoc analysis to derive and test the optimal cut-point for separating no steatosis from any steatosis (mild, moderate, or severe) in the MRI Rosetta Stone Project data set presented in this manuscript. The cut-point that provided the greatest AUROC was taken as the optimal cut-point. Because a threshold tested in the same population in which it was derived is prone to over-fitting, we also tested this cut-point, along with the existing published cut-points, via simulations using data from two prior pediatric studies with histology. Because the distribution of liver fat in a population is an important determinant of a biomarker’s ability to separate normal from abnormal, we chose data sets that represent important scenarios encountered in clinical research and/or patient care:

1. General population: Studies of epidemiology or genetics often require a general population sample. To represent the general population, we used data from the Study of Child and Adolescent Liver Epidemiology (SCALE), which included 954 children in the County of San Diego who had an autopsy with clinical data and liver histology (1).

2. Suspected NAFLD: A second scenario in which one would want to use MRI to determine the presence or absence of fatty liver is the overweight child with elevated ALT. To represent the clinical case of suspected NAFLD, we used data from a study that included 347 children referred from primary care to pediatric gastroenterology for suspected NAFLD (31).

Imputation of MRI PDFF Values

Both of these data sets had liver histology with steatosis graded in the same manner as in the MRI Rosetta Stone Project, but did not include liver MRI. Therefore, the MRI PDFF values were imputed using the Markov Chain Monte Carlo method.
First, the means, distribution type, and uncertainty were calculated for the relationship between MRI fat fraction and the factors found to be statistically significant predictors of MRI fat fraction in the Rosetta cohort, using

$$W_{bij} = \frac{1}{P_{ij}} \times (k, A, G, \mathrm{BMIz})$$

where $P_{ij}$ = initial probability of selection; $k$ = adjustment for sub-sampling; $A$ = age; $G$ = gender; BMIz = BMI Z score. The model addressed several components, including covariate effects, sex, nonlinearity associated with age, and BMI Z scores. A set of confidence intervals was created for imputing the likely range of MRI fat fraction for each individual subject. Each subject’s confidence intervals underwent a 1,000-sample bootstrap simulation, and an MRI fat fraction value was randomly selected from the simulation.

Quality of Model Assessment

To assess the quality of the model (and thus the robustness of our conclusions), we used 10-fold cross-validation and an additional bootstrapping procedure (N = 10,000 bootstrap samples). Briefly, for 10-fold cross-validation, the original sample was partitioned into 10 subsamples. A single subsample was retained as the validation data for testing the model, and the remaining 9 subsamples were used as training data. The cross-validation process was then repeated 10 times, with each of the 10 subsamples used once as the validation data.

Sensitivity and Specificity Calculations

We calculated the sensitivity and specificity for the optimal cut-point derived in the MRI Rosetta Stone Project, along with the four previously published MRI fat fraction threshold values used in Aim 3. These thresholds were tested in the MRI Rosetta Stone Project data set and in the simulated MRI PDFF values generated for the general population and for the clinical scenario of suspected NAFLD. The area under the receiver operating characteristic curve (AUROC) was then calculated for each MRI PDFF threshold using the DeLong method.
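As an illustration of the threshold calculations described above, the following is a minimal Python sketch, not the analysis code used in this study. It assumes that a subject is classified as having steatosis when PDFF is greater than or equal to the cut-point (the comparison direction is an assumption). It also relies on the fact that a single dichotomized threshold gives an ROC curve with one operating point, so the trapezoidal AUROC equals (sensitivity + specificity)/2. The function names and the toy values are hypothetical and are not drawn from any study data. The DeLong variance estimate is not shown.

```python
from typing import Dict, Sequence


def threshold_metrics(
    pdff: Sequence[float],
    steatosis: Sequence[bool],
    cut_point: float,
) -> Dict[str, float]:
    """Sensitivity, specificity, and AUROC of the test 'PDFF >= cut_point'.

    ASSUMPTION: values at or above the cut-point are called positive.
    """
    tp = fn = tn = fp = 0
    for value, has_steatosis in zip(pdff, steatosis):
        predicted_positive = value >= cut_point
        if has_steatosis and predicted_positive:
            tp += 1
        elif has_steatosis:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one subject with and one without steatosis")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # One operating point: trapezoidal area under the ROC curve.
    auroc = (sensitivity + specificity) / 2
    return {"sensitivity": sensitivity, "specificity": specificity, "auroc": auroc}


def best_cut_point(
    pdff: Sequence[float],
    steatosis: Sequence[bool],
    candidates: Sequence[float],
) -> float:
    """Return the candidate cut-point with the greatest AUROC."""
    return max(candidates, key=lambda c: threshold_metrics(pdff, steatosis, c)["auroc"])


# Hypothetical toy data (PDFF in percent; True = any steatosis on histology).
TOY_PDFF = [1.0, 2.0, 3.0, 3.8, 4.0, 5.0, 6.0, 8.0, 12.0, 2.5]
TOY_STEATOSIS = [False, False, False, False, True, True, True, True, True, True]

if __name__ == "__main__":
    for cut in (3.5, 5.5):
        print(cut, threshold_metrics(TOY_PDFF, TOY_STEATOSIS, cut))
    print("best:", best_cut_point(TOY_PDFF, TOY_STEATOSIS, [3.5, 5.5]))
```

Applying the same function to each data set (derivation cohort, simulated general population, simulated suspected-NAFLD cohort) would show how the same cut-point behaves under different liver fat distributions.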
RESULTS

The optimal MRI PDFF cut-point in the MRI Rosetta Stone Project for separating no steatosis from any steatosis (mild, moderate, or severe) was 3.5%. As shown in Supplementary Table 1, this threshold yielded a sensitivity of 95% and a specificity of 83%. However, both sensitivity and specificity were lower when the 3.5% threshold was tested in the simulations of the general population and of a clinical population with suspected NAFLD. The widely used threshold of 5.5% had a similar AUROC in the MRI Rosetta Stone Project data set (0.88) and the general population data set (0.87); however, this threshold performed considerably worse in the setting of suspected NAFLD (0.64). For all of the thresholds, the data in Supplementary Table 1 show the trade-offs that depend on the threshold selected and the target population in which it is applied.