Comment on “Causal Modeling in MS”

Comment on “The Causal Cascade to Multiple Sclerosis: A Model for MS Pathogenesis”
By Brian Healy PhD and John Healy PhD
Correspondence: Brian Healy PhD
Biostatistics Center
Massachusetts General Hospital
Boston, MA 02115
Email: bchealy@partners.org
Recently, an article was published in this journal proposing a causal framework for the
development of multiple sclerosis (MS). The author proposed a model that incorporated a
genetic component and an environmental component for MS, and attempted to estimate each
component of the model using available data. The first equation of the paper was:
$$P(G)\,P(E \mid G)\,P(MS \mid E, G) = P(MS) \tag{Goodin 1}$$
where P(G) is the probability of having a specific genotype, P(E|G) is the probability of having
an exposure given the patient’s genotype, P(MS|E,G) is the probability of having MS given the
patient’s exposure status and genotype, and P(MS) is the probability of having MS. Note that
our notation differs slightly from the author's, but it is more standard in statistics.
P(MS) is the marginal probability of having MS, allowing the author to estimate this using the
prevalence of MS in a specific region.
This equation is not correct. The correct equation is:
$$P(G)\,P(E \mid G)\,P(MS \mid E, G) = P(MS, E, G) \tag{Healy 1}$$
where P(MS,E,G) is the joint probability of having MS, exposure E and genotype G. P(MS,E,G)
does not equal P(MS), and therefore P(MS,E,G) cannot be estimated using the prevalence of MS.
Rather, the marginal probability is the sum of the joint probabilities over the distribution of the
exposure and the genotype. To demonstrate the relationship between P(MS) and P(MS,E,G), we
focus on a simple example with two genotypes (G1 and G2) and two exposure categories (E1 and
E2). Under this scenario, the marginal probability is the sum over all possible exposure and
genotype joint probabilities:
$$\begin{aligned}
P(MS) &= P(MS, E_1, G_1) + P(MS, E_2, G_1) + P(MS, E_1, G_2) + P(MS, E_2, G_2) \\
&= P(MS \mid E_1, G_1)\,P(E_1 \mid G_1)\,P(G_1) + P(MS \mid E_2, G_1)\,P(E_2 \mid G_1)\,P(G_1) \\
&\quad + P(MS \mid E_1, G_2)\,P(E_1 \mid G_2)\,P(G_2) + P(MS \mid E_2, G_2)\,P(E_2 \mid G_2)\,P(G_2)
\end{aligned} \tag{Healy 2}$$
Since a marginal probability is always greater than or equal to each joint probability that
contributes to it, the prevalence overestimates the joint probability.
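The relationship in Healy 2 can be checked numerically. The following is a minimal sketch in Python for the two-genotype, two-exposure example. Every probability in it is a hypothetical value chosen only for illustration and is not an estimate from any data set, and the variable names are our own.

```python
# All inputs below are HYPOTHETICAL values for illustration only.
p_g = {"G1": 0.01, "G2": 0.99}  # P(G)

p_e_given_g = {  # P(E | G); sums to 1 within each genotype
    ("E1", "G1"): 0.9, ("E2", "G1"): 0.1,
    ("E1", "G2"): 0.9, ("E2", "G2"): 0.1,
}

p_ms_given_eg = {  # P(MS | E, G)
    ("E1", "G1"): 0.20, ("E2", "G1"): 0.30,
    ("E1", "G2"): 0.001, ("E2", "G2"): 0.002,
}

# Joint probabilities P(MS, E, G) = P(MS | E, G) P(E | G) P(G), as in Healy 1.
joint = {
    (e, g): p_ms_given_eg[(e, g)] * p_e_given_g[(e, g)] * p_g[g]
    for (e, g) in p_ms_given_eg
}

# Marginal P(MS) is the sum of the four joint probabilities, as in Healy 2.
p_ms = sum(joint.values())

for (e, g), value in joint.items():
    print(f"P(MS, {e}, {g}) = {value:.6f}")
print(f"P(MS) = {p_ms:.6f}")
```

Under these hypothetical inputs, each of the four joint probabilities is smaller than the marginal P(MS), so substituting the prevalence for any single joint probability would overstate it.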
The second main equation provided by the author is for monozygotic twins. The author
wanted to restrict the discussion to conditional probabilities in which being genetically
susceptible was assumed, which reduces Goodin equation 1 to the following equation
$$P(E \mid G)\,P(MS \mid E, G) = CR_{MZ} \tag{Goodin 2}$$
where CRMZ is the probability that a monozygotic twin has MS given the other twin has MS.
Without making any assumptions, the conditional probabilities can be written using the
following equation:
$$P(E \mid g \in G)\,P(MS \mid E, g \in G) = P(MS, E \mid g \in G) \tag{Healy 3}$$
where G represents the set of genetic profiles for the monozygotic twins. Therefore, the author
is proposing that the concordance probability is an estimator of the joint probability of multiple
sclerosis and the exposure. Since concordant twins are very likely to have similar exposure
patterns, assuming that CRMZ=P(MS|E,G) seems potentially reasonable, but there is no
justification in the paper for assuming that CRMZ=P(MS,E|G). Therefore, Goodin equation 2
also does not hold.
The author then combines equations 1 and 2 to state that the probability of being
genetically susceptible is simply the ratio of the prevalence and the monozygotic twin
concordance:
$$P(G) = \frac{P(MS)}{P(E \mid G)\,P(MS \mid E, G)} = \frac{P(MS)}{CR_{MZ}} \tag{Goodin 3}$$
As described above, this equation is incorrect. The correct relationship is given by
$$P(G) = \frac{P(MS, E, G)}{P(E \mid G)\,P(MS \mid E, G)} \tag{Healy 4}$$
Unfortunately, the paper provides no data with which to estimate each term on the right side of
the equation, and the simplification from Goodin 3 does not estimate the desired quantity.
In the Appendix, the paper goes on to discuss genetically determined MS and purely
environmental MS. Using Healy equation 2 from above to simplify the discussion, we assume
that G1 represents genetically susceptible patients, G2 represents non-genetically susceptible
patients, E1 represents mild environmental exposure, and E2 represents sufficient environmental
exposure such that a patient would develop MS regardless of genetics. In this simple scenario,
we have 4 groups: genetically susceptible patients with mild environmental exposure (G1,E1),
genetically susceptible patients with complete environmental exposure (G1,E2), non-genetically
susceptible patients with mild environmental exposure (G2,E1), and non-genetically susceptible
patients with complete environmental exposure (G2,E2). If we assume that these four groups are
mutually exclusive and exhaustive, Healy equation 2 simplifies to:
$$\begin{aligned}
P(MS) &= P(MS \mid E_1, G_1)\,P(E_1 \mid G_1)\,P(G_1) + P(MS \mid E_2, G_1)\,[1 - P(E_1 \mid G_1)]\,P(G_1) \\
&\quad + P(MS \mid E_1, G_2)\,P(E_1 \mid G_2)\,[1 - P(G_1)] + P(MS \mid E_2, G_2)\,[1 - P(E_1 \mid G_2)]\,[1 - P(G_1)]
\end{aligned}$$
No data exist to estimate the conditional probability of MS in any of these groups, but for
illustration, we assume P(MS|E2,G1)= P(MS|E2,G2)=1 (purely environmental MS) and
P(MS|E1,G2)=0 (since these patients are neither genetically susceptible nor environmentally
susceptible, there is no chance of MS). These assumptions lead to the following equation
$$P(MS) = P(MS \mid E_1, G_1)\,P(E_1 \mid G_1)\,P(G_1) + 1 \cdot [1 - P(E_1 \mid G_1)]\,P(G_1) + 1 \cdot [1 - P(E_1 \mid G_2)]\,[1 - P(G_1)] \tag{Healy 5}$$
Unfortunately, as before, the components of this equation cannot be estimated using the data
from the paper. In particular, even if we assume that P(MS|E1,G1)=CRMZ=0.25 and
P(MS)=0.0025 (as in the Canadian sample), we have no estimate for P(E1|G1), P(E1|G2), or
P(G1). Since we have three unknowns and one equation, these probabilities can take on many
possible values to satisfy the equation, but the conclusions of the model would differ for each set
of possible values. If we further assume that P(E1|G1)= P(E1|G2), the constraints on the
probabilities in Healy 5 require that P(G1) ≤ 0.01 and
$$P(E_1 \mid G_1) = P(E_1 \mid G_2) = \frac{0.9975}{1 - 0.25\,P(G_1)}.$$
Therefore, we can place bounds on the probabilities of interest in this simplified scenario, but
these bounds are subject to the assumptions provided above. The upper bound on the probability
of being genetically susceptible is the same as the estimated probability of being genetically
susceptible in the paper, but the upper bound is only one possible conclusion. This illustration
demonstrates the many assumptions and simplifications required to make conclusions regarding
genetic susceptibility based on the available data. In addition, all of these results hold only if all
of the required assumptions are true.
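The bounds in this simplified scenario can be traced numerically. With P(E1|G1) = P(E1|G2) = p and P(G1) = g, Healy 5 reduces to P(MS) = 1 - p(1 - 0.25g), so that p = 0.9975/(1 - 0.25g), and requiring p ≤ 1 gives g ≤ 0.0025/0.25 = 0.01. The following is a minimal sketch in Python under exactly these assumptions; the values 0.25 and 0.0025 are those quoted above, while the function names and the grid of P(G1) values are our own illustrative choices.

```python
# Values quoted in the text: P(MS | E1, G1) = CRMZ = 0.25 and P(MS) = 0.0025.
CR_MZ = 0.25
P_MS = 0.0025


def implied_p_e1(p_g1, cr_mz=CR_MZ, p_ms=P_MS):
    """P(E1|G1) = P(E1|G2) implied by Healy 5 when the two are assumed equal."""
    return (1.0 - p_ms) / (1.0 - cr_mz * p_g1)


def p_ms_from_healy5(p_g1, p_e1, cr_mz=CR_MZ):
    """Right-hand side of Healy 5 with P(E1|G1) = P(E1|G2) = p_e1."""
    return (
        cr_mz * p_e1 * p_g1              # susceptible, mild exposure
        + (1.0 - p_e1) * p_g1            # susceptible, sufficient exposure
        + (1.0 - p_e1) * (1.0 - p_g1)    # non-susceptible, sufficient exposure
    )


# Largest P(G1) for which the implied exposure probability does not exceed 1.
max_p_g1 = P_MS / CR_MZ

# Illustrative grid of candidate P(G1) values (our choice).
for p_g1 in (0.0, 0.002, 0.005, max_p_g1):
    p_e1 = implied_p_e1(p_g1)
    print(f"P(G1)={p_g1:.4f}  P(E1|G)={p_e1:.6f}  "
          f"P(MS)={p_ms_from_healy5(p_g1, p_e1):.6f}")
```

Every P(G1) between 0 and the upper bound reproduces the same prevalence with a different P(E1|G), which is why the single equation cannot identify the probability of genetic susceptibility.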
The concept of estimating the genetic and environmental components of MS is an
extremely important pursuit. Future work should focus on estimating the components of the
equations presented here to ascertain the true environmental and genetic components of MS.
Genetic components could be sampled and conditional probabilities of having MS given the
subjects have those genetic components could be directly estimated. If the samples of subjects
were chosen appropriately, estimates of having MS could be made for various subpopulations.
The equations in this letter were often simplified for clarity, and true disease modeling may
require more rigorous modeling without such simplifications.