Supplement – Tables and Figures

Laboratory variable    Normal range (non-sickle)     Sickle population categories
AST (SGOT)             male 5-40; female 5-33        1: <40;   2: 40-100;     3: >100
ALT (SGPT)             male 7-46; female 4-35        1: <36;   2: 36-100;     3: >100
BUN/creatinine         10-20                         -1: <10;  0: 10-20;      1: >20
Bilirubin              0-1.3                         1: <1.3;  2: 1.3-3.4;    3: >3.4
HGB                    14-20; 12-16; 14-18           -1: <8;   0: 8-12;       1: >12
LDH                    300-600                       -1: <300; 0: 300-600;    1: >600
WBC                    3.8-10.8                      0: <10.8; 1: 10.8-13.5;  2: >13.5
Platelets              130-400                       0: <400;  1: 400-490;    2: >490
MCV                    80-100                        -1: <80;  0: 80-98;      1: >98
Reticulocytes                                        -1: <4.8; 0: 4.8-13;     1: >13
HbF (%)                                              -1: <2;   0: 2-9;        1: >9
Sys. BP (age <18)      infant                        -1: <80;  0: 80-120;     1: 120-140; 2: >140
Sys. BP (age >18)      adult female/male             -1: <80;  0: 80-140;     1: 140-160; 2: >160

Percentiles (approximate percentage of the SCA population in each category, in the order listed in the original table): 50% of SCA population; 2%; 25% of SCA population; about 50%; 25%; 25%; 50%; 25%; about 25%; about 50%; 25%; about 55%; about 20%; about 50%; about 75%; about 50%; about 75%; about 25%; about 65%; about 10%; 25%; 50%; 25%; 25%; 50%; 25%.

Table A: Categories used for continuous laboratory variables. The categories were defined using, in part, the normal range defined in the non-sickle population and, in part, reference values inferred from the sickle population [1].
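The cutoffs in Table A translate directly into a small lookup routine. The sketch below is illustrative only: the function name and the rule that a value falling exactly on a cutoff goes to the lower category are our own assumptions, not specified in Table A.

```python
def categorize(value, cutoffs, codes):
    """Map a raw laboratory value to its Table A category code.

    cutoffs: ascending thresholds separating the categories.
    codes:   category codes, one more entry than cutoffs.
    A value exactly equal to a cutoff is assigned to the lower
    category (our assumption; Table A does not specify this).
    """
    for cutoff, code in zip(cutoffs, codes):
        if value <= cutoff:
            return code
    return codes[-1]

# Cutoffs and category codes taken from Table A.
ldh_code = categorize(250, [300, 600], [-1, 0, 1])    # LDH below 300      -> -1
mcv_code = categorize(102, [80, 98], [-1, 0, 1])      # MCV above 98       ->  1
bili_code = categorize(2.0, [1.3, 3.4], [1, 2, 3])    # bilirubin 1.3-3.4  ->  2
```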
Variable            Log(Bayes Factor)   Log(Bayes Factor) average   Effect (OR)                                               Referent group
ACS                   85.76               84.79                     1.43 (1.28;1.60)                                          no ACS
Age                  110.43               84.79                     2.67 (2.44;2.91) [18-40]; 7.61 (5.11;1.34) [>40]          2-18 years
Bilirubin            345.09              295.56                     1.59 (1.02;2.49) [1.3-3.4]; 2.01 (1.78;2.27) [>3.4]       normal
Blood Transfusion     47.00               39.61                     1.18 (1.14;1.24)                                          no chronic BT
LDH                  196.90              157.39                     0.85 (0.76;0.94) [<300]; 1.1 (0.99;1.24) [>600]           normal [300-600]
MCV                  561.19              474.22                     0.54 (0.49;0.60) [<80]; 1.98 (1.71;2.29) [>98]            normal [80-98]
Pain                 245.70              205.31                     1.61 (1.45;1.77)                                          no pain
Priapism              42.07               38.67                     1.35 (1.09;1.67)                                          no priapism
Reticulocytes        324.12              276.23                     0.5 (0.44;0.55) [<4.8]; 1.51 (1.39;1.65) [>13]            normal
Sepsis               447.15              367.48                     67.19 (57.66;78.29)                                       no sepsis
Sex                 2302.11             1957.89                     1.16 (1.08;1.25)                                          females
Stroke               129.49              111.53                     3.81 (3.20;4.54)                                          no stroke
Sys BP                88.49               79.95                     2.84 (1.67;4.82) [low]; 3.41 (2.71;4.27) [high]           normal
WBC                    5.01                3.40                     1.37 (1.24;1.51) [10.8-13.5]; 1.92 (1.44;2.55) [>13.5]    normal

Table B: Summary of the strength of associations in the data from the Cooperative Study of Sickle Cell Disease. Column 2 reports the Bayes factor on a logarithmic scale, as explained in the technical notes. The third column reports the average Bayes factor on a logarithmic scale, computed by inducing the network of associations on the sets randomly selected from the original dataset during the assessment of the error rate. Although the associations are less strong because of the smaller sample size, the close match between the Bayes factors in column 2 and the average Bayes factors provides strong evidence in favor of the associations. The fourth column reports the effect of the variable in each row on the disease severity score, measured by the odds ratio for early death with 95% Bayesian credible intervals in round brackets; the fifth column gives the referent group.
Names in square brackets give the category compared to the referent group when the variable has more than two categories: for example, 2.67 in column 4, row 2 gives the odds for early death of a subject aged between 18 and 40 years compared with a subject aged below 18 years.

Covariate          Level   Parameter estimate   OR     95% CI for OR      p-value    Type III p-value
Reticulocytes       -1      -0.56               0.57   (0.324, 1.012)     0.0548     0.0106
                     0      -0.61               0.54   (0.361, 0.818)     0.0034
                     1      .                   .      .                  .
Age category         1      -3.37               0.03   (0.019, 0.063)     <0.0001    <0.0001
                     2      -1.19               0.30   (0.184, 0.502)     <0.0001
                     3      .                   .      .                  .
AST                  1      -0.94               0.39   (0.134, 1.129)     0.0825     0.0020
                     2      -0.27               0.77   (0.267, 2.194)     0.6196
                     3      .                   .      .                  .
Pain                 0       0.90               2.46   (1.463, 4.140)     0.0007     0.0007
                     1      .                   .      .                  .
Sepsis               0      -4.69               0.01   (0.006, 0.014)     <0.0001    <0.0001
                     1      .                   .      .                  .
Stroke               0      -1.01               0.36   (0.206, 0.642)     0.0005     0.0005
                     1      .                   .      .                  .
Systolic BP         -1       1.70               5.49   (0.335, 90.190)    0.2328     <0.0001
                     0      -1.48               0.23   (0.024, 2.192)     0.2005
                     1       0.41               1.50   (0.142, 15.936)    0.7345
                     2      .                   .      .                  .

Table C: Results of the logistic regression model on death. The first column gives the name of the covariate; the second column reports the levels of these categorical variables, as described in Table A. If a level of a variable is the referent group, the corresponding row is filled with dots. The parameters were estimated by the maximum likelihood method, and all p-values are from Wald's chi-squared statistic.

Supplement – Statistical Analysis

Network modeling. To build the BN we used a popular Bayesian approach that was developed by Cooper and Herskovits [2] and is implemented in the program Bayesware Discoverer (www.bayesware.com). The program searches for the most probable network of dependencies given the data. To find such a network, the program explores a space of different network models, scores each model by its posterior probability conditional on the available data, and returns the model with the maximum posterior probability.
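The model behind Table C can be reproduced with any standard logistic regression routine. As an illustration of the method it uses (maximum likelihood estimation with Wald tests), here is a self-contained Newton-Raphson sketch on synthetic data; the data, variable names, and coefficient values are entirely illustrative and are not the CSSCD data.

```python
import math
import numpy as np

def fit_logistic(X, y, iters=25):
    """Maximum-likelihood logistic regression via Newton-Raphson,
    returning coefficients, odds ratios, and two-sided Wald p-values."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        W = mu * (1.0 - mu)                       # IRLS weights
        H = X.T @ (X * W[:, None])                # observed information
        beta = beta + np.linalg.solve(H, X.T @ (y - mu))
    se = np.sqrt(np.diag(np.linalg.inv(H)))       # standard errors
    z = beta / se                                 # Wald z-statistics
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return beta, np.exp(beta), pvals              # estimates, ORs, p-values

# Synthetic example: one binary covariate that raises the odds of death.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=2000)
true_logit = -1.0 + 1.2 * x
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)
X = np.column_stack([np.ones(2000), x])           # intercept + covariate
beta, ors, pvals = fit_logistic(X, y)
```

The odds ratios in Table C are simply the exponentiated coefficient estimates, and each Wald p-value compares the squared z-statistic to a chi-squared distribution with one degree of freedom, as in the table.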
This probability is computed by Bayes' theorem as $p(M \mid D) \propto p(D \mid M)\,p(M)$, where $p(D \mid M)$ is the probability that the observed data are generated from the network model $M$, and $p(M)$ is the prior probability encoding knowledge about the model $M$ before seeing any data. We assumed that all models were equally likely a priori, so that $p(M)$ is uniform and the posterior probability $p(M \mid D)$ becomes proportional to $p(D \mid M)$, a quantity known as the marginal likelihood. The marginal likelihood averages the likelihood function over the parameter values and is calculated as $p(D \mid M) = \int p(D \mid \theta)\,p(\theta)\,d\theta$, where $p(D \mid \theta)$ is the traditional likelihood function and $p(\theta)$ is the parameter prior density. The set of marginal and conditional independencies represented by a network $M$ implies that, for categorical data in which $p(\theta)$ follows a Dirichlet distribution, and with complete data, the integral $\int p(D \mid \theta)\,p(\theta)\,d\theta$ has a closed-form solution [3] that is computed in product form as $p(D \mid M) = \prod_i p(D_i \mid M_i)$, where $M_i$ is the model describing the dependency of the $i$th variable on its parent nodes (those nodes with directed arcs pointing to the $i$th variable) and $D_i$ are the observed data for the $i$th variable [3]. Details of the calculations are in [4]. The factorization of the marginal likelihood implies that a model can be learned locally, by selecting the most probable set of parents for each variable and then joining these local structures into a complete network, in a procedure that closely resembles standard path analysis. This modularity property allows us to assess, locally, the strength of the local associations represented by rival models. The comparison is based on the Bayes factor, which measures the odds of a model $M_i$ versus a model $\tilde{M}_i$ by the ratio of their posterior probabilities $p(M_i \mid D_i)/p(\tilde{M}_i \mid D_i)$ or, equivalently, by the ratio of their marginal likelihoods $p(D_i \mid M_i)/p(D_i \mid \tilde{M}_i)$.
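As a concrete illustration, the closed-form term $p(D_i \mid M_i)$ for a single categorical node can be computed with nothing more than log-gamma functions. The sketch below is our own code, not Bayesware Discoverer, and for simplicity it uses the uniform Dirichlet prior ($\alpha_{ijk} = 1$) of Cooper and Herskovits rather than the prior precision chosen in this paper.

```python
from collections import Counter
from math import lgamma

def log_marginal_likelihood(child, parent_configs, r):
    """Closed-form log p(D_i | M_i) for one categorical node under a
    uniform Dirichlet prior (alpha_ijk = 1):
    sum_j [ lgamma(r) - lgamma(n_ij + r) ] + sum_{j,k} lgamma(n_ijk + 1).

    child:          list of child states, one per observation
    parent_configs: list of parent-state tuples, one per observation
    r:              number of states of the child node
    """
    n_ij = Counter(parent_configs)                 # counts per parent config j
    n_ijk = Counter(zip(parent_configs, child))    # counts per (j, k) pair
    score = sum(lgamma(r) - lgamma(nj + r) for nj in n_ij.values())
    return score + sum(lgamma(njk + 1) for njk in n_ijk.values())

# Toy data: the child is an exact copy of its candidate parent, so the
# model with the parent arc should have the larger marginal likelihood.
parent = [0, 0, 1, 1] * 25
child = list(parent)
with_parent = log_marginal_likelihood(child, [(v,) for v in parent], r=2)
no_parent = log_marginal_likelihood(child, [() for _ in parent], r=2)
```

Because each node is scored independently in this way, comparing two candidate parent sets for the same node amounts to differencing two such log scores, which is exactly the logarithmic Bayes factor reported in Table B.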
Given a fixed structure for all the other associations, and writing $B$ for the Bayes factor in favor of $M_i$, the posterior probability $p(M_i \mid D_i)$ is $B/(1+B)$, so a large Bayes factor implies that the probability $p(M_i \mid D_i)$ is close to 1, meaning that there is very strong evidence for the associations described by the model $M_i$ versus the alternative model $\tilde{M}_i$. Note that, when we explore different dependency models for the $i$th variable, the posterior probability of each model depends on the same data $D_i$. Even with this factorization the search space is very large, and to reduce computation we used a bottom-up search strategy known as the K2 algorithm [2]. The space of candidate models was reduced by first limiting attention to diagnostic rather than prognostic models, in which we modeled the dependency of SCD complications and laboratory variables on death. We also ordered the variables by their variance, so that less variable nodes could depend only on more variable nodes. Simulation results we have carried out suggest that this heuristic leads to better networks with larger marginal likelihood. As in traditional regression models, in which the outcome (death) depends on the covariates, this inverted dependency structure can represent the association of independent as well as interacting covariates with the outcome of interest [3]. However, this structure is also able to capture more complex models of dependency [5] because, in this model, the marginal likelihood measuring the association of each covariate with the outcome is functionally independent of the associations of the other covariates with the outcome. In contrast, in regression structures, the presence of an association between a covariate and the outcome affects the marginal likelihood measuring the association between the phenotype and other covariates, reducing the set of regressors that can be detected as associated with the variable of interest.
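The K2 search itself is a greedy, per-node procedure: given an ordering of the variables, it repeatedly adds the candidate parent that most increases the local marginal likelihood and stops when no addition improves the score. The following is a minimal sketch of that idea in our own simplified code (uniform Dirichlet prior, illustrative `max_parents` bound), not the Discoverer implementation:

```python
from collections import Counter
from math import lgamma

def log_ml(child, parent_cols, r):
    """Log marginal likelihood of one categorical node given its parents,
    under the uniform (alpha_ijk = 1) Dirichlet prior."""
    n = len(child)
    configs = list(zip(*parent_cols)) if parent_cols else [()] * n
    n_ij = Counter(configs)
    n_ijk = Counter(zip(configs, child))
    score = sum(lgamma(r) - lgamma(nj + r) for nj in n_ij.values())
    return score + sum(lgamma(njk + 1) for njk in n_ijk.values())

def k2_parents(child, candidates, r, max_parents=3):
    """Greedy K2-style parent selection for a single node: add, one at a
    time, the candidate parent that most improves the local score; stop
    when no remaining candidate improves it."""
    chosen = []                                   # list of (name, column)
    best = log_ml(child, [], r)
    while len(chosen) < max_parents:
        used = {name for name, _ in chosen}
        gains = [(log_ml(child, [pcol for _, pcol in chosen] + [col], r), name, col)
                 for name, col in candidates if name not in used]
        if not gains:
            break
        top = max(gains, key=lambda g: g[0])
        if top[0] <= best:                        # no improvement: stop
            break
        best = top[0]
        chosen.append((top[1], top[2]))
    return [name for name, _ in chosen], best

# Toy data: the child copies candidate parent p1, while p2 is irrelevant,
# so the greedy search should select p1 only and then stop.
p1 = [0, 1] * 50
p2 = [0, 0, 1, 1] * 25
child = list(p1)
names, score = k2_parents(child, [("p1", p1), ("p2", p2)], r=2)
```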
The BN induced by this search procedure was quantified by the conditional probability distribution of each node given its parent nodes. The conditional probabilities were estimated as $p(x_{ik} \mid \pi_{ij}) = (\alpha_{ijk} + n_{ijk})/(\alpha_{ij} + n_{ij})$, where $x_{ik}$ represents a state of the child node, $\pi_{ij}$ represents a combination of states of the parent nodes, $n_{ijk}$ is the sample frequency of $(x_{ik}, \pi_{ij})$, and $n_{ij}$ is the sample frequency of $\pi_{ij}$. The parameters $\alpha_{ijk}$ and $\alpha_{ij}$, with the constraint $\alpha_{ij} = \sum_k \alpha_{ijk}$, encode the prior distribution for all $j$, as suggested in [3]. We chose the overall prior precision $\alpha = 16$ by sensitivity analysis [3].

The network highlights the variables that are sufficient to compute the score: these are the variables that make the risk of death independent of all the other variables in the network, and they appear in red in Figure 1. These variables are the "Markov blanket" of the node death, as defined in [3].

1. West, M.S., et al., Laboratory profile of sickle cell disease: a cross-sectional analysis. The Cooperative Study of Sickle Cell Disease. J Clin Epidemiol, 1992. 45(8): p. 893-909.
2. Cooper, G.F. and E. Herskovits, A Bayesian method for the induction of probabilistic networks from data. Mach Learn, 1992. 9: p. 309-347.
3. Cowell, R.G., et al., Probabilistic Networks and Expert Systems. 1999, New York: Springer Verlag.
4. Sebastiani, P., M. Abad, and M.F. Ramoni, Bayesian networks for genomic analysis, in Genomic Signal Processing and Statistics, E. Dougherty, et al., Editors. 2005. p. 281-320.
5. Hand, D.J., H. Mannila, and P. Smyth, Principles of Data Mining. 2001, Cambridge, MA: MIT Press.