Large-scale phenome-wide scan in twins using electronic health records
June 29th 2015
Scott Hebbring
Marshfield Clinic Research Foundation
University of Wisconsin Madison

Human Genetics

Association studies
GWAS: Thousands of variants associated with a few hundred phenotypes
a. Relatively easy to recruit unrelated individuals
b. Multiple testing challenges
c. Weak effects
d. Difficult to interpret biology
e. Clinical utility?
f. Disease limited
PheWAS: Dramatically increases the number of diseases that can be studied
a. Can start with biologically/clinically relevant variants
b. May be limited by the same challenges as GWAS

Family studies
Linkage, Segregation Analysis, Heritability…
a. Thousands of mutations in thousands of genes causing human diseases
b. Often easier to interpret biology
c. Large effect sizes
d. Clinically relevant
e. Difficult to recruit families
f. One disease at a time

Classical Twin Studies
1. Gold standard for heritability studies
   Unique family/genetic relationships (monozygotic twins)
   Strong shared environmental exposures starting in utero
2. Rare (~20/1,000 births)
3. Difficult to recruit
   Largest twin registries include the Swedish and Danish twin registries (~200,000 twins)
   Others: UK Adult, Australian, Sri Lankan, and Chinese National; Minnesota, Univ-Wash, MI-State, and Mid-Atlantic twin registries
   Sample ascertainment bias
4. Phenotypic data are often acquired by surveys and questionnaires and limited to only a few measurables.
5. Updating data is costly and labor intensive.

Twin population
2.6 Million patients
-same last name
-same date of birth
-same billing account
-same home address
-key word “twin”
Marshfield Clinic Twin Cohort (~16,000 patients)
[Map: Marshfield Clinic Personalized Medicine Research Project Study Area (19 Zip Codes in Central WI); Marshfield Clinic Primary Service Area; Madison; Milwaukee]
Genet Epidemiol. 2014 Dec;38(8):692-8.

A.
MCTC is one of the first cross-sectional twin populations
   ~80% accuracy
B. Methods are easily translatable
   ~12,000 twins have been identified in Mayo’s EHR.
C. Little to no zygosity data
D. All patients are uniquely linked to Marshfield Clinic’s EHR.
   Phenotypic data is collected in real time
   Not disease limited
   Amenable to phenome-wide strategies?
Genet Epidemiol. 2014 Dec;38(8):692-8.

Hypothesis: EHR-linked twin cohorts can be used for phenome-wide studies to identify diseases with genetic etiologies.

Methods
Population: MCTC and Mayo twin cohort (28,888 twins)
Phenotypes were defined by collapsing ICD9 coding, e.g., ICD9 100.01 → 100.0* → 100.*
For every phenotype/ICD9 code, a p-value was estimated to determine whether the disease co-occurred in twins more frequently than by chance.
For every phenotype/ICD9 code, a relative risk was estimated: the risk of disease if the other twin is affected, relative to the population risk in the twin cohorts.

9,906 and 5,987 unique phenotypes/ICD9 codes in MCTC and Mayo-TC, respectively
5,598 shared phenotypes/ICD9 codes
Diseases in MCTC were more common than in Mayo-TC

Phenome-wide Scan
A. 1,222 phenotypes/ICD9 codes were statistically enriched for concordance in MCTC (p<8.9E-6)
   929 (76%) were replicated in Mayo-TC (p<0.05)
B. 928 phenotypes/ICD9 codes were statistically enriched for concordance in Mayo-TC
   739 (80%) were replicated in MCTC
C.
1,406 phenotypes were statistically enriched for concordance by combined meta-analysis

Phenome-wide Scan
Top codes (excluding V-codes and perinatal codes)

ICD9 | Disease | MCTC Affected | MCTC P-value | MCTC RR | Mayo-TC Affected | Mayo-TC P-value | Mayo-TC RR | Combined P-value
382.9 | Unspecified otitis media | 4,318 | 5.0E-203 | 1.8 | 1,130 | 4.4E-252 | 7.1 | 2.3E-451
382.0 | Suppurative and unspecified otitis media | 4,514 | 3.4E-202 | 1.7 | 1,275 | 4.8E-231 | 5.6 | 1.6E-429
465.9 | Acute upper respiratory infections of unspecified site | 5,272 | 1.5E-138 | 1.4 | 1,223 | 8.2E-258 | 6.5 | 1.1E-392
465 | Acute upper respiratory infections of multiple or unspecified sites | 5,297 | 1.2E-137 | 1.4 | 1,250 | 2.0E-253 | 6.2 | 2.1E-387
462 | Acute pharyngitis | 5,202 | 4.9E-123 | 1.3 | 950 | 1.2E-224 | 8.0 | 4.8E-344
520.6 | Disturbances in tooth eruption | 1,350 | 8.3E-122 | 3.6 | 230 | 1.1E-90 | 27.7 | 4.4E-209
783.4 | Lack of expected normal physiological development in childhood | 726 | 6.5E-134 | 8.2 | 416 | 1.6E-73 | 9.0 | 4.9E-204
520 | Disorders of tooth development and eruption | 1,556 | 7.3E-117 | 3.0 | 311 | 5.0E-87 | 16.2 | 1.7E-200
786.2 | Cough | 4,245 | 4.1E-80 | 1.2 | 720 | 1.8E-122 | 6.6 | 3.4E-199
466.1 | Acute bronchiolitis | 575 | 3.9E-146 | 12.4 | 212 | 1.0E-45 | 15.9 | 1.7E-188
315 | Specific delays in development | 891 | 1.8E-134 | 6.3 | 501 | 2.9E-57 | 5.8 | 2.2E-188
367 | Disorders of refraction and accommodation | 3,645 | 8.1E-116 | 1.5 | 718 | 1.7E-74 | 4.5 | 6.1E-187
780.6 | Fever and other physiologic disturbances of temperature regulation | 2,875 | 1.6E-90 | 1.6 | 664 | 1.6E-81 | 5.3 | 1.0E-168
315.3 | Developmental speech or language disorder | 451 | 5.9E-113 | 14.0 | 284 | 2.7E-54 | 12.0 | 6.1E-164
367.1 | Myopia | 2,144 | 9.5E-101 | 2.1 | 215 | 6.0E-61 | 20.5 | 2.1E-158
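The Methods describe two per-phenotype statistics (a p-value for twin concordance and a relative risk), and the scan reports a combined meta-analysis p-value. The slides do not state the formulas behind them, so the Python below is only a minimal sketch under stated assumptions: the p-value is modelled as a binomial upper tail in which each twin pair is concordant by chance with probability equal to the squared prevalence; the relative risk is taken as pairwise concordance divided by cohort prevalence; and the two cohorts are combined with Fisher's method. All function names are hypothetical, and the cohort sizes used in the usage note (16,482 and 12,192 individuals) are illustrative values that are not given in the source.

```python
import math


def concordance_pvalue(n_pairs: int, affected: int, concordant: int) -> float:
    """P(X >= concordant) if twin pairs were concordant only by chance.

    Assumption (not from the slides): X ~ Binomial(n_pairs, prevalence**2),
    with prevalence = affected individuals / (2 * n_pairs).
    """
    if concordant <= 0:
        return 1.0
    p_pair = (affected / (2 * n_pairs)) ** 2
    if p_pair == 0.0:
        return 0.0
    if p_pair >= 1.0:
        return 1.0
    # Sum the upper tail term by term (via logs) rather than using 1 - CDF,
    # which would lose tiny tail probabilities to rounding.
    tail = 0.0
    for k in range(concordant, n_pairs + 1):
        log_pmf = (
            math.lgamma(n_pairs + 1)
            - math.lgamma(k + 1)
            - math.lgamma(n_pairs - k + 1)
            + k * math.log(p_pair)
            + (n_pairs - k) * math.log1p(-p_pair)
        )
        tail += math.exp(log_pmf)  # terms below the float range add 0.0
    return min(tail, 1.0)


def relative_risk(n_individuals: int, affected: int, concordant: int) -> float:
    """Pairwise concordance divided by cohort prevalence (assumed definition).

    `affected` counts individuals and `concordant` counts pairs, so
    affected - concordant is the number of pairs with at least one affected twin.
    """
    affected_pairs = affected - concordant
    pairwise_concordance = concordant / affected_pairs
    prevalence = affected / n_individuals
    return pairwise_concordance / prevalence


def fisher_combined_log10(p1: float, p2: float) -> float:
    """log10 of the Fisher's-method p-value for two independent p-values.

    For two tests the chi-square (4 df) tail has the closed form
    p1*p2 * (1 - ln(p1*p2)); logs are used so very small inputs do not underflow.
    """
    log_product = math.log(p1) + math.log(p2)
    return (log_product + math.log(1.0 - log_product)) / math.log(10)


if __name__ == "__main__":
    # Hypothetical cohort sizes, chosen only for illustration.
    print(relative_risk(16_482, affected=3, concordant=1))
    print(concordance_pvalue(8_241, affected=3, concordant=1))
    print(fisher_combined_log10(5.0e-203, 4.4e-252))
```

With these assumed cohort sizes, `relative_risk` returns about 2,747 for 3 affected twins with 1 concordant pair and about 6,096 for 2 affected twins with 1 concordant pair, close to the RRs tabulated for sickle-cell disease, and `fisher_combined_log10(5.0e-203, 4.4e-252)` gives about -450.6, i.e. roughly 2.3E-451. That agreement is suggestive only; the calculations actually used in the study may differ, and the binomial model in particular should be read as one plausible null model rather than the reported test.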
Relative Risks
For every phenotype/ICD9 code, a relative risk was estimated: the risk of disease if the other twin is affected, relative to the population risk in the twin cohorts.
RR = relative risk
ADF = average disease frequency
1,455 phenotypes/ICD9 codes had at least one concordant pair in both cohorts
498 and 139 phenotypes had RRs >10 and >100 in both cohorts, respectively

Genetic diseases with large estimated RRs

ICD9 | Disease | MCTC Affected | MCTC Concordant | MCTC P-value | MCTC RR | Mayo-TC Affected | Mayo-TC Concordant | Mayo-TC P-value | Mayo-TC RR | Combined P-value
282.6 | Sickle-cell disease | 3 | 1 | 2.7E-04 | 2,747 | 2 | 1 | 1.6E-04 | 6,096 | 8.0E-07
282 | Hereditary spherocytosis | 3 | 1 | 2.7E-04 | 2,747 | 3 | 1 | 3.7E-04 | 2,032 | 1.7E-06
356.1 | Peroneal muscular atrophy | 3 | 1 | 2.7E-04 | 2,747 | 3 | 1 | 3.7E-04 | 2,032 | 1.7E-06
282.49 | Other thalassemia | 3 | 1 | 2.7E-04 | 2,747 | 9 | 3 | 6.1E-09 | 677 | 4.7E-11
334.3 | Other cerebellar ataxia | 3 | 1 | 2.7E-04 | 2,747 | 6 | 1 | 1.5E-03 | 406 | 6.3E-06
426.82 | Long QT syndrome | 4 | 1 | 4.9E-04 | 1,374 | 18 | 8 | 1.4E-17 | 542 | 3.3E-19

[Figure: Same-Sex vs. Opposite-Sex twin pairs]

Potential limitations
1. Limited by the inherent challenges of ICD9 coding.
2. Parental/Familial biases
3. Lack of zygosity still limits this approach
   NLP or blood types may help enrich for specific twin types.

Conclusions
1. Most diseases are not random events in the twins.
   a. 1,406/5,598 (25%) of phenotypes are statistically enriched in pairs of twins
   b. ~1% of phenotypes have RRs < 1.0
2. Genetics plays an important role in the disease process for thousands of diseases.
3.
Family data may be efficiently captured in EHRs and may be used to predict, prevent, and treat human disease for the advancement of “precision medicine.”

Future of genomic research
[Diagram labels: Families, Phenome, Precision Medicine, Populations, Genome]

Acknowledgements
Marshfield Clinic: Murray Brilliant, Peggy Peissig, Steven Schrodi, Zhan (Harold) Ye, John Mayer, many more…
Mayo Clinic: Jyotishman Pathak, Yijing Cheng
Funding:
NHGRI 1U01HG006389
NLM K22LM011938
NCATS 9U54TR000021
NCRR 1UL1RR025011
Marshfield Clinic Research Foundation
Marshfield Clinic donors