Supplement – Tables and Figures

Laboratory variable | Normal range (non-sickle population) | Categories (sickle population)
AST (SGOT) | males: 5-40; females: 5-33 | 1: less than 40; 2: 40-100; 3: above 100
ALT (SGPT) | males: 7-46; females: 4-35 | 1: less than 36; 2: 36-100; 3: above 100
BUN:creatinine | 10-20 | -1: below 10; 0: 10-20; 1: above 20
Bilirubin | 0-1.3 | 1: below 1.3; 2: 1.3-3.4; 3: >3.4
HGB | infants: 14-20; adult females: 12-16; adult males: 14-18 | -1: below 8; 0: 8-12; 1: above 12
LDH | 300-600 | -1: below 300; 0: 300-600; 1: >600
WBC | 3.8-10.8 | 0: <10.8; 1: 10.8-13.5; 2: >13.5
Platelets | 130-400 | 0: <400; 1: 400-490; 2: >490
MCV | 80-100 | -1: <80; 0: 80-98; 1: >98
Reticulocytes | | -1: <4.8; 0: 4.8-13; 1: >13
HbF (%) | | -1: <2; 0: 2-9; 1: >9
Sys. BP (infants) | | -1: <80; 0: 80-120; 1: 120-140; 2: >140
Sys. BP (adults) | | -1: <80; 0: 80-140; 1: 140-160; 2: >160

Percentiles: 50% of SCA population; 2% of population; 25% of SCA population; about 50%; 25%; 25%; 50%; 25%; about 25%; about 50%; 25%; about 55%; about 20%; about 50%; about 75%; about 50%; about 75%; about 25%; about 65%; about 10%; 25%; 50%; 25%; 25%; 50%; 25%.
Table A: Categories used for continuous laboratory variables. The categories were defined in part using the normal ranges defined in the non-sickle population and in part using reference values inferred from the sickle cell population [1].
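The category boundaries in Table A amount to a simple binning rule. As an illustrative sketch (the function and dictionary names are ours; the cutpoints are copied from Table A for four of the variables):

```python
# Illustrative binning per Table A; CUTPOINTS holds (edges, category labels).
CUTPOINTS = {
    "LDH":       ((300, 600),   (-1, 0, 1)),  # <300, 300-600, >600
    "MCV":       ((80, 98),     (-1, 0, 1)),  # <80, 80-98, >98
    "Bilirubin": ((1.3, 3.4),   (1, 2, 3)),   # <1.3, 1.3-3.4, >3.4
    "WBC":       ((10.8, 13.5), (0, 1, 2)),   # <10.8, 10.8-13.5, >13.5
}

def categorize(variable, value):
    """Map a raw laboratory value to its Table A category."""
    (low, high), (c_low, c_mid, c_high) = CUTPOINTS[variable]
    if value < low:
        return c_low
    if value <= high:
        return c_mid
    return c_high

print(categorize("LDH", 650))        # 1
print(categorize("Bilirubin", 2.0))  # 2
```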
Variable | Log(Bayes Factor) | Average Log(Bayes Factor) | Effect (OR) | Effect (OR) | Referent group
ACS | 85.76 | 84.79 | 1.43 (1.28;1.60) | | no ACS
Age | 110.43 | 84.79 | [18-40] 2.67 (2.44;2.91) | [>40] 7.61 (5.11;11.34) | 2-18 years
Bilirubin | 345.09 | 295.56 | [1.3-3.4] 1.59 (1.02;2.49) | [>3.4] 2.01 (1.78;2.27) | normal
Blood Transfusion | 47.00 | 39.61 | 1.18 (1.14;1.24) | | no chronic BT
LDH | 196.90 | 157.39 | [<300] 0.85 (0.76;0.94) | [>600] 1.1 (0.99;1.24) | normal
MCV | 561.19 | 474.22 | [<80] 0.54 (0.49;0.60) | [>98] 1.98 (1.71;2.29) | normal
Pain | 245.70 | 205.31 | 1.61 (1.45;1.77) | | no pain
Priapism | 42.07 | 38.67 | 1.35 (1.09;1.67) | | no priapism
Reticulocytes | 324.12 | 276.23 | [<4.8] 0.5 (0.44;0.55) | [>13] 1.51 (1.39;1.65) | normal
Sepsis | 447.15 | 367.48 | 67.19 (57.66;78.29) | | no sepsis
Sex | 2302.11 | 1957.89 | 1.16 (1.08;1.25) | | females
Stroke | 129.49 | 111.53 | 3.81 (3.20;4.54) | | no stroke
Sys BP | 88.49 | 79.95 | [low] 2.84 (1.67;4.82) | [high] 3.41 (2.71;4.27) | normal
WBC | 5.01 | 3.40 | [10.8-13.5] 1.37 (1.24;1.51) | [>13.5] 1.92 (1.44;2.55) | normal
Table B: Summary of the strength of associations in the data from the Cooperative Study of Sickle Cell Disease. Column 2 reports the Bayes factor on a logarithmic scale, as explained in the technical notes. The third column reports the average Bayes factor, on a logarithmic scale, computed by inducing the network of associations in the sets randomly selected from the original dataset during the assessment of the error rate. Although these associations are less strong because of the smaller sample size, the close match between the Bayes factors in column 2 and the average Bayes factors provides strong evidence in favor of the associations. The fourth and fifth columns report the effect of the variable in each row on the disease severity score, measured by the odds ratio for early death, with 95% Bayesian credible intervals in round brackets. Names in square brackets describe the category compared to the referent group when the variable has more than two categories: for example, 2.67 in column 4, row 2 is the odds ratio for early death of a subject aged between 18 and 40 years, compared to a subject aged below 18 years.
Covariate Name | Covariate Levels | Parameter estimate | OR | 95% CI OR | p-value OR | Type III p-value
Reticulocyte | -1 | -0.56 | 0.57 | (0.324, 1.012) | 0.0548 | 0.0106
Reticulocyte | 0 | -0.61 | 0.54 | (0.361, 0.818) | 0.0034 |
Reticulocyte | 1 | . | . | . | . |
Age category | 1 | -3.37 | 0.03 | (0.019, 0.063) | <0.0001 | <0.0001
Age category | 2 | -1.19 | 0.30 | (0.184, 0.502) | <0.0001 |
Age category | 3 | . | . | . | . |
AST | 1 | -0.94 | 0.39 | (0.134, 1.129) | 0.0825 | 0.0020
AST | 2 | -0.27 | 0.77 | (0.267, 2.194) | 0.6196 |
AST | 3 | . | . | . | . |
Pain | 0 | 0.90 | 2.46 | (1.463, 4.140) | 0.0007 | 0.0007
Pain | 1 | . | . | . | . |
Sepsis | 0 | -4.69 | 0.01 | (0.006, 0.014) | <0.0001 | <0.0001
Sepsis | 1 | . | . | . | . |
Stroke | 0 | -1.01 | 0.36 | (0.206, 0.642) | 0.0005 | 0.0005
Stroke | 1 | . | . | . | . |
Systolic Blood Pressure | -1 | 1.70 | 5.49 | (0.335, 90.190) | 0.2328 | <0.0001
Systolic Blood Pressure | 0 | -1.48 | 0.23 | (0.024, 2.192) | 0.2005 |
Systolic Blood Pressure | 1 | 0.41 | 1.50 | (0.142, 15.936) | 0.7345 |
Systolic Blood Pressure | 2 | . | . | . | . |
Table C: Results of the logistic regression model on death. The first column gives the name of the covariate; the second column reports the levels of these categorical variables as described in Table A. If a level of a variable is the referent group, the corresponding row is filled with dots. The parameters were estimated by the maximum likelihood method, and all p-values are from Wald's chi-squared statistic.
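Each odds ratio in Table C is the exponential of the corresponding parameter estimate, which can be checked directly (the values below are copied from Table C; the dictionary keys are our labels):

```python
import math

# OR = exp(parameter estimate); estimates copied from Table C.
estimates = {"pain=0": 0.90, "reticulocyte=-1": -0.56, "age category=1": -3.37}
for level, beta in estimates.items():
    print(level, "OR =", round(math.exp(beta), 2))
# pain=0 OR = 2.46, reticulocyte=-1 OR = 0.57, age category=1 OR = 0.03
```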
Supplement – Statistical Analysis
Network modeling. To build the BN we used a popular Bayesian approach that was developed by Cooper and Herskovits [2] and is implemented in the program Bayesware Discoverer (www.bayesware.com). The program searches for the most probable network of dependencies given the data. To find such a network, the program explores a space of different network models, scores each model by its posterior probability conditional on the available data, and returns the model with maximum posterior probability. This probability is computed by Bayes' theorem as
p(M | D) ∝ p(D | M) p(M),
where p(D | M) is the probability that the observed data are generated from the network model M, and p(M) is the prior probability encoding knowledge about the model M before seeing any data. We assumed that all models were equally likely a priori, so that p(M) is uniform and the posterior probability p(M | D) becomes proportional to p(D | M), a quantity known as the marginal likelihood. The marginal likelihood averages the likelihood function over different parameter values and is calculated as
p(D | M) = ∫ p(D | θ) p(θ) dθ,
where p(D | θ) is the traditional likelihood function and p(θ) is the parameter prior density. The set of marginal and conditional independencies represented by a network M implies that, for categorical data in which p(θ) follows a Dirichlet distribution and with complete data, the integral p(D | M) = ∫ p(D | θ) p(θ) dθ has a closed-form solution [3] that is computed in product form as:
p(D | M) = ∏_i p(D_i | M_i),
where M_i is the model describing the dependency of the ith variable on its parent nodes (those nodes with directed arcs pointing to the ith variable) and D_i are the observed data of the ith variable [3]. Details of the calculations are in [4].
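The closed-form score for a single node can be sketched as follows. This is a minimal illustration under Dirichlet priors with the prior precision split uniformly across parent configurations and child states; the function name, the uniform split, and the toy data are our assumptions, not the study's code:

```python
from math import lgamma
from collections import Counter

def log_marginal_likelihood(child, parents, alpha=16.0):
    """Closed-form log p(D_i | M_i) for one node under Dirichlet priors.
    `child` lists the observed states of the node; `parents` lists, per
    record, the tuple of parent states. The total prior precision alpha
    is split evenly over configurations and states (an assumption)."""
    states = sorted(set(child))
    configs = sorted(set(parents))
    a_ijk = alpha / (len(states) * len(configs))
    a_ij = a_ijk * len(states)
    n_ij = Counter(parents)
    n_ijk = Counter(zip(parents, child))
    logp = 0.0
    for pi in configs:
        logp += lgamma(a_ij) - lgamma(a_ij + n_ij[pi])
        for x in states:
            logp += lgamma(a_ijk + n_ijk[(pi, x)]) - lgamma(a_ijk)
    return logp

# toy data: score "death" with "sepsis" as a candidate parent
sepsis = [(0,), (0,), (0,), (1,), (1,), (1,)]
death = [0, 0, 0, 1, 1, 0]
print(log_marginal_likelihood(death, sepsis))
```

Comparing this score for rival parent sets (for instance, `sepsis` as a parent versus no parents at all) is exactly the local model comparison described in the text.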
The factorization of the marginal likelihood implies that a model can be learned locally, by selecting the most probable set of parents for each variable and then joining these local structures into a complete network, in a procedure that closely resembles standard path analysis. This modularity property allows us to assess, locally, the strength of the local associations represented by rival models. This comparison is based on the Bayes factor, which measures the odds of a model M_i versus a model M̃_i by the ratio of their posterior probabilities p(M_i | D_i)/p(M̃_i | D_i) or, equivalently, by the ratio of their marginal likelihoods λ = p(D_i | M_i)/p(D_i | M̃_i). Given a fixed structure for all the other associations, the posterior probability of M_i is p(M_i | D_i) = λ/(1 + λ), and a large Bayes factor λ implies that this probability is close to 1, meaning that there is very strong evidence for the associations described by the model M_i versus the alternative model M̃_i. Note that, when we explore different dependency models for the ith variable, the posterior probability of each model depends on the same data D_i.
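For intuition, the relation between a log Bayes factor and the posterior probability λ/(1 + λ) can be sketched as follows (assuming natural logarithms; the function name is ours):

```python
import math

def posterior_from_log_bf(log_bf):
    """Posterior probability lam/(1+lam) of model M_i versus a single
    rival model under equal prior odds, from the log Bayes factor."""
    if log_bf > 50:  # beyond this the ratio is 1.0 to double precision
        return 1.0
    lam = math.exp(log_bf)
    return lam / (1.0 + lam)

print(posterior_from_log_bf(0.0))  # 0.5: the two models are equally probable
print(posterior_from_log_bf(5.0))  # ~0.993
```

On this scale, the log Bayes factors reported in Table B all place essentially all of the posterior probability on the selected model.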
Even with this factorization, the search space is very large, and to reduce computation we used a bottom-up search strategy known as the K2 algorithm [2]. The space of candidate models was reduced by first limiting attention to diagnostic rather than prognostic models, in which we modeled the dependency of SCD complications and laboratory variables on death. We also ordered the variables according to their variance, so that less variable nodes could only depend on more variable nodes. Simulations we have carried out suggest that this heuristic leads to better networks with larger marginal likelihood. As in traditional regression models, in which the outcome (death) depends on the covariates, this inverted dependency structure can represent the association of independent as well as interacting covariates with the outcome of interest [3]. However, this structure is also able to capture more complex models of dependency [5] because, in this model, the marginal likelihood measuring the association of each covariate with the outcome is functionally independent of the association of other covariates with the outcome. In contrast, in regression structures, the presence of an association between a covariate and the outcome affects the marginal likelihood measuring the association between the phenotype and other covariates, reducing the set of regressors that can be detected as associated with the variable of interest.
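The greedy parent-selection step of a K2-style search can be sketched as follows. The `score` callback stands in for the local marginal-likelihood score; the function name, the toy score values, and the node names are all illustrative:

```python
def k2_parents(node, candidates, score, max_parents=3):
    """Greedy K2-style parent selection (sketch): starting from no
    parents, repeatedly add the candidate that most increases the local
    score, stopping when no addition improves it."""
    parents = []
    best = score(node, parents)
    improved = True
    while improved and len(parents) < max_parents:
        improved = False
        gains = [(score(node, parents + [c]), c)
                 for c in candidates if c not in parents]
        if gains:
            top, cand = max(gains)
            if top > best:
                parents.append(cand)
                best = top
                improved = True
    return parents

# hypothetical local scores for the node "death"
toy_scores = {(): -10.0, ("sepsis",): -8.0, ("age",): -9.5,
              ("age", "sepsis"): -8.5}
def toy_score(node, parents):
    return toy_scores.get(tuple(sorted(parents)), float("-inf"))

print(k2_parents("death", ["age", "sepsis"], toy_score))  # ['sepsis']
```

Note that K2 proper also relies on the variable ordering described above to restrict which nodes may serve as candidate parents.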
The BN induced by this search procedure was quantified by the conditional probability distribution of each node given its parent nodes. The conditional probabilities were estimated as
p(x_ik | π_ij) = (α_ijk + n_ijk) / (α_ij + n_ij),
where x_ik represents the state of the child node, π_ij represents a combination of states of the parent nodes, n_ijk is the sample frequency of (x_ik, π_ij), and n_ij is the sample frequency of π_ij. The parameters α_ijk and α_ij = Σ_k α_ijk encode the prior distribution, with the constraint Σ_j α_ij = α for each node, as suggested in [3]. We chose α = 16 by sensitivity analysis [3].
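This estimator can be sketched directly from the formula. As before, the uniform split of the prior precision α across parent configurations and child states is our illustrative assumption:

```python
from collections import Counter

def conditional_probability(x, pi, child, parents, alpha=16.0):
    """Estimate p(x | pi) = (a_ijk + n_ijk) / (a_ij + n_ij) with a
    Dirichlet prior of total precision alpha split uniformly (a sketch;
    the supplement chose alpha = 16 by sensitivity analysis)."""
    states = set(child)
    configs = set(parents)
    a_ijk = alpha / (len(configs) * len(states))  # uniform split (assumption)
    a_ij = a_ijk * len(states)
    n_ijk = Counter(zip(parents, child))[(pi, x)]
    n_ij = Counter(parents)[pi]
    return (a_ijk + n_ijk) / (a_ij + n_ij)

# toy data: p(death = 1 | sepsis = 1)
sepsis = [(0,), (0,), (0,), (1,), (1,), (1,)]
death = [0, 0, 0, 1, 1, 0]
print(conditional_probability(1, (1,), death, sepsis))
```

The prior counts act as a smoothing term, so no estimated probability is exactly 0 or 1 even for sparse parent configurations.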
The network highlights the variables that are sufficient to compute the score: these are
the variables that make the risk of death independent of all the other variables in the
network and appear in red in Figure 1. These variables are the “Markov blanket” of the
node death as defined in [3].
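The Markov blanket can be read directly off the graph: the node's parents, its children, and the children's other parents. A small sketch on a hypothetical mini-network (the node names are illustrative, not the actual network of Figure 1):

```python
def markov_blanket(node, parents_of):
    """Markov blanket of `node` in a DAG given as {node: set of parents}:
    its parents, its children, and the children's other parents."""
    blanket = set(parents_of.get(node, set()))
    for child, pars in parents_of.items():
        if node in pars:
            blanket.add(child)            # a child of `node`
            blanket |= (pars - {node})    # the child's other parents
    return blanket

# hypothetical mini-network
dag = {"death": {"sepsis", "age"}, "sepsis": {"wbc"}, "stroke": {"death"}}
print(sorted(markov_blanket("death", dag)))  # ['age', 'sepsis', 'stroke']
```

Conditional on its Markov blanket, the node "death" is independent of every other variable, which is why these variables suffice to compute the score.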
1. West, M.S., et al., Laboratory profile of sickle cell disease: a cross-sectional analysis. The Cooperative Study of Sickle Cell Disease. J Clin Epidemiol, 1992. 45(8): p. 893-909.
2. Cooper, G.F. and E. Herskovits, A Bayesian method for the induction of probabilistic networks from data. Mach Learn, 1992. 9: p. 309-347.
3. Cowell, R.G., et al., Probabilistic Networks and Expert Systems. 1999, New York: Springer Verlag.
4. Sebastiani, P., M. Abad, and M.F. Ramoni, Bayesian networks for genomic analysis, in Genomic Signal Processing and Statistics, E. Dougherty, et al., Editors. 2005. p. 281-320.
5. Hand, D.J., H. Mannila, and P. Smyth, Principles of Data Mining. 2001, Cambridge, MA: MIT Press.