Comment on “The Causal Cascade to Multiple Sclerosis: A Model for MS Pathogenesis”

By Brian Healy PhD and John Healy PhD

Correspondence:
Brian Healy PhD
Biostatistics Center
Massachusetts General Hospital
Boston, MA 02115
Email: bchealy@partners.org

Recently, an article was published in this journal proposing a causal framework for the development of multiple sclerosis (MS). The author proposed a model that incorporated a genetic component and an environmental component for MS, and attempted to estimate each component of the model using available data. The first equation of the paper was:

$$P(G)\,P(E \mid G)\,P(\mathrm{MS} \mid E, G) = P(\mathrm{MS}) \tag{Goodin 1}$$

where P(G) is the probability of having a specific genotype, P(E|G) is the probability of having an exposure given the patient’s genotype, P(MS|E,G) is the probability of having MS given the patient’s exposure status and genotype, and P(MS) is the probability of having MS. Note that our notation differs slightly from the author’s, but it is more standard in statistics. P(MS) is the marginal probability of having MS, which allowed the author to estimate it using the prevalence of MS in a specific region.

This equation is not correct. The correct equation is:

$$P(G)\,P(E \mid G)\,P(\mathrm{MS} \mid E, G) = P(\mathrm{MS}, E, G) \tag{Healy 1}$$

where P(MS,E,G) is the joint probability of having MS, exposure E and genotype G. P(MS,E,G) does not equal P(MS), and therefore P(MS,E,G) cannot be estimated using the prevalence of MS. Rather, the marginal probability is the sum of the joint probabilities over the distribution of the exposure and the genotype. To demonstrate the relationship between P(MS) and P(MS,E,G), we focus on a simple example with two genotypes (G1 and G2) and two exposure categories (E1 and E2).
Under this scenario, the marginal probability is the sum over all possible exposure and genotype joint probabilities:

$$
\begin{aligned}
P(\mathrm{MS}) &= P(\mathrm{MS}, E_1, G_1) + P(\mathrm{MS}, E_2, G_1) + P(\mathrm{MS}, E_1, G_2) + P(\mathrm{MS}, E_2, G_2) \\
&= P(\mathrm{MS} \mid E_1, G_1)\,P(E_1 \mid G_1)\,P(G_1) + P(\mathrm{MS} \mid E_2, G_1)\,P(E_2 \mid G_1)\,P(G_1) \\
&\quad + P(\mathrm{MS} \mid E_1, G_2)\,P(E_1 \mid G_2)\,P(G_2) + P(\mathrm{MS} \mid E_2, G_2)\,P(E_2 \mid G_2)\,P(G_2)
\end{aligned}
\tag{Healy 2}
$$

Since a marginal probability will be greater than or equal to a joint probability, the prevalence is an overestimate of the joint probability.

The second main equation provided by the author is for monozygotic twins. The author wanted to restrict the discussion to conditional probabilities in which being genetically susceptible was assumed, which reduces Goodin equation 1 to the following equation:

$$P(E \mid G)\,P(\mathrm{MS} \mid E, G) = \mathrm{CR}_{MZ} \tag{Goodin 2}$$

where CRMZ is the probability that a monozygotic twin has MS given the other twin has MS. Without making any assumptions, the conditional probabilities can be written using the following equation:

$$P(E \mid g \in G)\,P(\mathrm{MS} \mid E, g \in G) = P(\mathrm{MS}, E \mid g \in G) \tag{Healy 3}$$

where G represents the set of genetic profiles for the monozygotic twins. Therefore, the author is proposing that the concordance probability is an estimator of the joint probability of multiple sclerosis and the exposure. Since concordant twins are very likely to have similar exposure patterns, assuming that CRMZ=P(MS|E,G) seems potentially reasonable, but there is no justification in the paper for assuming that CRMZ=P(MS,E|G). Therefore, Goodin equation 2 also does not hold.

The author then combines equations 1 and 2 to state that the probability of being genetically susceptible is simply the ratio of the prevalence and the monozygotic twin concordance:

$$P(G) = \frac{P(\mathrm{MS})}{P(E \mid G)\,P(\mathrm{MS} \mid E, G)} = \frac{P(\mathrm{MS})}{\mathrm{CR}_{MZ}} \tag{Goodin 3}$$

As described above, this equation is incorrect.
The correct relationship is given by

$$P(G) = \frac{P(\mathrm{MS}, E, G)}{P(E \mid G)\,P(\mathrm{MS} \mid E, G)} \tag{Healy 4}$$

Unfortunately, there is no data provided in the paper to estimate each piece of the right side of the equation, and the simplification from Goodin 3 does not estimate the desired quantity.

In the Appendix, the paper goes on to discuss genetically determined MS and purely environmental MS. Using Healy equation 2 from above to simplify the discussion, we assume that G1 represents genetically susceptible patients, G2 represents non-genetically susceptible patients, E1 represents mild environmental exposure, and E2 represents sufficient environmental exposure such that a patient would develop MS regardless of genetics. In this simple scenario, we have 4 groups: genetically susceptible patients with mild environmental exposure (G1,E1), genetically susceptible patients with complete environmental exposure (G1,E2), non-genetically susceptible patients with mild environmental exposure (G2,E1), and non-genetically susceptible patients with complete environmental exposure (G2,E2). If we assume that these four groups are mutually exclusive and exhaustive, Healy equation 2 simplifies to:

$$
\begin{aligned}
P(\mathrm{MS}) &= P(\mathrm{MS} \mid E_1, G_1)\,P(E_1 \mid G_1)\,P(G_1) + P(\mathrm{MS} \mid E_2, G_1)\,\bigl(1 - P(E_1 \mid G_1)\bigr)\,P(G_1) \\
&\quad + P(\mathrm{MS} \mid E_1, G_2)\,P(E_1 \mid G_2)\,\bigl(1 - P(G_1)\bigr) + P(\mathrm{MS} \mid E_2, G_2)\,\bigl(1 - P(E_1 \mid G_2)\bigr)\,\bigl(1 - P(G_1)\bigr)
\end{aligned}
$$

No data exists to estimate the conditional probability of MS in any of these groups, but for illustration, we assume P(MS|E2,G1)= P(MS|E2,G2)=1 (purely environmental MS) and P(MS|E1,G2)=0 (since these patients are neither genetically susceptible nor environmentally susceptible, there is no chance of MS). These assumptions lead to the following equation:

$$
\begin{aligned}
P(\mathrm{MS}) &= P(\mathrm{MS} \mid E_1, G_1)\,P(E_1 \mid G_1)\,P(G_1) + 1 \cdot \bigl(1 - P(E_1 \mid G_1)\bigr)\,P(G_1) \\
&\quad + 1 \cdot \bigl(1 - P(E_1 \mid G_2)\bigr)\,\bigl(1 - P(G_1)\bigr)
\end{aligned}
\tag{Healy 5}
$$

Unfortunately, as before, the components of this equation cannot be estimated using the data from the paper.
In particular, even if we assume that P(MS|E1,G1)=CRMZ=0.25 and P(MS)=0.0025 (as in the Canadian sample), we have no estimate for P(E1|G1), P(E1|G2), or P(G1). Since we have three unknowns and one equation, many sets of values for these probabilities satisfy the equation, and the conclusions of the model would differ for each set. If we further assume that P(E1|G1)= P(E1|G2), the constraints on the probabilities in Healy 5 require that P(G1) ≤ 0.01 and

$$P(E_1 \mid G_1) = P(E_1 \mid G_2) = \frac{0.9975}{1 - 0.25\,P(G_1)}$$

Therefore, we can place bounds on the probabilities of interest in this simplified scenario, but these bounds are subject to the assumptions provided above. The upper bound on the probability of being genetically susceptible is the same as the estimated probability of being genetically susceptible in the paper, but the upper bound is only one possible conclusion. This illustration demonstrates the many assumptions and simplifications required to make conclusions regarding genetic susceptibility based on the available data. In addition, all of these results hold only if all of the required assumptions are true.

The concept of estimating the genetic and environmental components of MS is an extremely important pursuit. Future work should focus on estimating the components of the equations presented here to ascertain the true environmental and genetic components of MS. Genetic components could be sampled, and the conditional probabilities of having MS given that subjects have those genetic components could be estimated directly. If the samples of subjects were chosen appropriately, estimates of the probability of having MS could be made for various subpopulations. The equations in this letter were often simplified for clarity, and true disease modeling may require more rigorous modeling without such simplifications.
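
As a numerical check on the bounds derived from Healy 5, the short Python sketch below may be helpful. It is only an illustration under the assumptions already stated in this letter: P(MS|E1,G1)=0.25, P(MS)=0.0025, P(MS|E2,G1)=P(MS|E2,G2)=1, P(MS|E1,G2)=0, and P(E1|G1)=P(E1|G2). Writing e for the common value of P(E1|G1) and P(E1|G2), Healy 5 collapses to P(MS) = 1 - e(1 - 0.25 P(G1)), which can be solved for e. The function and variable names are ours and purely hypothetical; nothing here is taken from the original paper.

```python
# Illustrative check of Healy 5 under the simplifying assumptions of this letter.
# All names below are hypothetical helpers introduced for this sketch.

P_MS = 0.0025        # assumed prevalence (Canadian sample, as quoted in the letter)
P_MS_E1_G1 = 0.25    # assumed P(MS | E1, G1), set equal to CR_MZ for illustration


def mild_exposure_probability(p_g1, p_ms=P_MS, p_ms_e1_g1=P_MS_E1_G1):
    """Solve Healy 5 for e = P(E1|G1) = P(E1|G2), given P(G1)."""
    return (1.0 - p_ms) / (1.0 - p_ms_e1_g1 * p_g1)


def prevalence(p_g1, e, p_ms_e1_g1=P_MS_E1_G1):
    """Right-hand side of Healy 5 with P(E1|G1) = P(E1|G2) = e."""
    return (p_ms_e1_g1 * e * p_g1            # susceptible, mild exposure
            + 1.0 * (1.0 - e) * p_g1         # susceptible, complete exposure
            + 1.0 * (1.0 - e) * (1.0 - p_g1))  # non-susceptible, complete exposure


def max_p_g1(p_ms=P_MS, p_ms_e1_g1=P_MS_E1_G1):
    """Largest P(G1) for which the solved e is still a valid probability (e <= 1)."""
    return p_ms / p_ms_e1_g1


# Each admissible P(G1) gives a different e, yet every pair reproduces P(MS).
for p_g1 in (0.0, 0.0025, 0.005, max_p_g1()):
    e = mild_exposure_probability(p_g1)
    print(f"P(G1)={p_g1:.4f}  P(E1|G)={e:.6f}  P(MS)={prevalence(p_g1, e):.6f}")
```

Every value of P(G1) between 0 and the upper bound yields a valid P(E1|G1) that reproduces the same prevalence, which is why the single equation cannot identify P(G1) without further data.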