Marshfield Clinic Twin Cohort

Large-scale phenome-wide scan in twins
using electronic health records
June 29th 2015
Scott Hebbring
Marshfield Clinic Research Foundation
University of Wisconsin Madison
Human Genetics
Association studies
GWAS: Thousands of variants associated with a few hundred phenotypes
a. Relatively easy to recruit unrelated individuals
b. Multiple testing challenges
c. Weak effects
d. Difficult to interpret biology
e. Clinical utility?
f. Disease limited
PheWAS: Dramatically increases the number of diseases that can be studied
a. Can start with biologically/clinically relevant variants
b. May be limited by the same challenges as GWAS
Family studies
Linkage, Segregation Analysis, Heritability…
a. Thousands of mutations in thousands of genes causing human diseases
b. Often easier to interpret biology
c. Large effect sizes
d. Clinically relevant
e. Difficult to recruit families
f. One disease at a time
Classical Twin Studies
1. Gold standard for heritability studies
Unique family/genetic relationships (monozygotic twins)
Strong shared environmental exposures starting in utero
2. Rare (~20/1,000 births)
3. Difficult to recruit
Largest twin registries include the Swedish and Danish twin registries (~200,000 twins)
Others: UK Adult, Australian, Sri Lankan, and Chinese National
Minnesota, Univ-Wash, MI-State, Mid-Atlantic twin registries.
Sample ascertainment bias
4. Phenotypic data are often acquired through surveys and questionnaires and limited to only a few measures.
5. Updating data is costly and labor intensive.
Twin population
Marshfield Clinic Personalized Medicine Research Project
Matching criteria:
-same last name
-same date of birth
-same billing account
-same home address
-key word “twin”
[Map: Marshfield Clinic Primary Service Area and Personalized Medicine Research Project Study Area (19 Zip Codes in Central WI); labeled cities: Marshfield, Madison, Milwaukee]
2.6 Million patients
Marshfield Clinic Twin Cohort (~16,000 patients)
Genet Epidemiol. 2014 Dec;38(8):692-8.
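The matching criteria above can be sketched as a simple record-linkage pass. This is a minimal illustration, not the published algorithm: the slides do not say how the five criteria are combined, so the rule used here (same last name and date of birth required, plus at least one supporting match) is an assumption, and the `Patient` fields and function name are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Patient:
    # Hypothetical record layout; real EHR fields will differ.
    patient_id: str
    last_name: str
    birth_date: str        # ISO date string, e.g. "2001-05-04"
    billing_account: str
    home_address: str
    notes: str = ""


def _same(x, y):
    """True when both values are present and equal."""
    return bool(x) and x == y


def candidate_twin_pairs(patients, min_supporting=1):
    """Return pairs of patient IDs that look like twins.

    Assumed rule: same last name and same date of birth are required,
    plus at least `min_supporting` of: same billing account,
    same home address, key word "twin" in either record's notes.
    """
    groups = defaultdict(list)
    for p in patients:
        groups[(p.last_name.lower(), p.birth_date)].append(p)

    pairs = []
    for members in groups.values():
        for a, b in combinations(members, 2):
            supporting = sum([
                _same(a.billing_account, b.billing_account),
                _same(a.home_address, b.home_address),
                "twin" in a.notes.lower() or "twin" in b.notes.lower(),
            ])
            if supporting >= min_supporting:
                pairs.append((a.patient_id, b.patient_id))
    return pairs
```

Grouping by last name and date of birth first keeps the pairwise comparison small, since only patients who already share both fields are compared.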
A. MCTC is one of the first cross-sectional twin populations
~80% accuracy
B. Methods are easily translatable
~12,000 twins have been identified in Mayo’s EHR.
C. Little to no zygosity data
D. All patients are uniquely linked to Marshfield Clinic’s EHR.
Phenotypic data are collected in real time
Not disease limited
Amenable to phenome-wide strategies?
Hypothesis: EHR-linked twin cohorts can be used for phenome-wide studies to identify diseases with genetic etiologies.
Methods
Population: MCTC and Mayo twin cohort (28,888 twins)
Phenotypes were defined by collapsing ICD9 coding
e.g., ICD9 100.01 → 100.0* → 100.*
For every phenotype/ICD9 code, a p-value was estimated to determine whether the disease co-occurred in twins more frequently than by chance.
For every phenotype/ICD9 code, a relative risk was estimated: the risk of disease if the other twin is affected, relative to the population risk in the twin cohorts.
9,906 and 5,987 unique phenotypes/ICD9 codes in MCTC and Mayo-TC, respectively
5,598 shared phenotypes/ICD9 codes
Diseases in MCTC were more common than in Mayo-TC
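The ICD9 collapsing step in the Methods can be sketched as follows. This is an illustrative reading of the 100.01, 100.0*, 100.* example, in which every code also counts toward each coarser level of the hierarchy; the function names are hypothetical and the wildcard labels simply follow the slide's notation.

```python
def collapse_icd9(code):
    """Return the code followed by each coarser level, most specific first.

    "100.01" -> ["100.01", "100.0*", "100.*"]
    """
    code = code.strip()
    levels = [code]
    if "." in code:
        category, _, decimals = code.partition(".")
        # Drop one decimal digit at a time, marking the truncation with "*".
        for n in range(len(decimals) - 1, 0, -1):
            levels.append(f"{category}.{decimals[:n]}*")
        levels.append(f"{category}.*")
    return levels


def phenotypes_for_patient(codes):
    """All phenotype labels (original and collapsed) implied by a patient's codes."""
    phenotypes = set()
    for code in codes:
        phenotypes.update(collapse_icd9(code))
    return phenotypes


print(collapse_icd9("100.01"))
```

Under this scheme a patient coded 382.9 and a co-twin coded 382.0 are discordant at the specific codes but concordant at the collapsed phenotype 382.*.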
Phenome-wide Scan
A. 1,222 phenotypes/ICD9 codes were statistically enriched
for concordance in MCTC (p<8.9E-6)
929 (76%) were replicated in Mayo-TC (p<0.05)
B. 928 phenotypes/ICD9 codes were statistically enriched
for concordance in Mayo-TC
739 (80%) were replicated in MCTC
C. 1,406 phenotypes were statistically enriched for
concordance by combined meta-analysis
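The slides do not state how the concordance p-values or the p<8.9E-6 threshold were computed. The sketch below is one plausible reading, given as an assumption: the threshold equals 0.05 divided by the 5,598 shared phenotypes (a Bonferroni correction), and the p-value is the upper tail of a binomial distribution in which each twin pair is concordant by chance with probability equal to the squared disease frequency. The function names and the cohort size in the usage lines are hypothetical.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def concordance_p_value(n_pairs, n_affected, n_concordant):
    """P(at least n_concordant concordant pairs) if twins were affected independently.

    Assumption: the number of concordant pairs is Binomial(n_pairs, freq**2),
    where freq is the disease frequency among individual twins.
    """
    if 2 * n_concordant > n_affected:
        raise ValueError("each concordant pair needs two affected twins")
    if n_concordant <= 0:
        return 1.0
    freq = n_affected / (2 * n_pairs)
    p_both = freq ** 2
    if p_both >= 1.0:
        return 1.0
    log_p, log_q = math.log(p_both), math.log1p(-p_both)
    total = 0.0
    # Sum the upper tail term by term in log space to keep tiny values accurate.
    for k in range(n_concordant, n_pairs + 1):
        log_pmf = (math.lgamma(n_pairs + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_pairs - k + 1)
                   + k * log_p + (n_pairs - k) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


# Hypothetical cohort: 8,241 pairs, 3 affected twins, 1 concordant pair.
print(f"{concordance_p_value(8241, 3, 1):.1e}")
print(f"{bonferroni_threshold(0.05, 5598):.1e}")
```

Summing the tail directly, rather than subtracting the lower tail from 1, avoids the loss of precision that would otherwise affect very small p-values.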
Phenome-wide Scan
Top codes (excluding V-codes and perinatal codes)
ICD9 | Disease | MCTC Affected | MCTC P-value | MCTC RR | Mayo-TC Affected | Mayo-TC P-value | Mayo-TC RR | Combined P-value
382.9 | Unspecified otitis media | 4,318 | 5.0E-203 | 1.8 | 1,130 | 4.4E-252 | 7.1 | 2.3E-451
382.0 | Suppurative and unspecified otitis media | 4,514 | 3.4E-202 | 1.7 | 1,275 | 4.8E-231 | 5.6 | 1.6E-429
465.9 | Acute upper respiratory infections of unspecified site | 5,272 | 1.5E-138 | 1.4 | 1,223 | 8.2E-258 | 6.5 | 1.1E-392
465 | Acute upper respiratory infections of multiple or unspecified sites | 5,297 | 1.2E-137 | 1.4 | 1,250 | 2.0E-253 | 6.2 | 2.1E-387
462 | Acute pharyngitis | 5,202 | 4.9E-123 | 1.3 | 950 | 1.2E-224 | 8.0 | 4.8E-344
520.6 | Disturbances in tooth eruption | 1,350 | 8.3E-122 | 3.6 | 230 | 1.1E-90 | 27.7 | 4.4E-209
783.4 | Lack of expected normal physiological development in childhood | 726 | 6.5E-134 | 8.2 | 416 | 1.6E-73 | 9.0 | 4.9E-204
520 | Disorders of tooth development and eruption | 1,556 | 7.3E-117 | 3.0 | 311 | 5.0E-87 | 16.2 | 1.7E-200
786.2 | Cough | 4,245 | 4.1E-80 | 1.2 | 720 | 1.8E-122 | 6.6 | 3.4E-199
466.1 | Acute bronchiolitis | 575 | 3.9E-146 | 12.4 | 212 | 1.0E-45 | 15.9 | 1.7E-188
315 | Specific delays in development | 891 | 1.8E-134 | 6.3 | 501 | 2.9E-57 | 5.8 | 2.2E-188
367 | Disorders of refraction and accommodation | 3,645 | 8.1E-116 | 1.5 | 718 | 1.7E-74 | 4.5 | 6.1E-187
780.6 | Fever and other physiologic disturbances of temperature regulation | 2,875 | 1.6E-90 | 1.6 | 664 | 1.6E-81 | 5.3 | 1.0E-168
315.3 | Developmental speech or language disorder | 451 | 5.9E-113 | 14.0 | 284 | 2.7E-54 | 12.0 | 6.1E-164
367.1 | Myopia | 2,144 | 9.5E-101 | 2.1 | 215 | 6.0E-61 | 20.5 | 2.1E-158
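The slides do not name the meta-analysis method behind the Combined P-value column. As an assumption, the sketch below uses Fisher's method for combining independent p-values, which gives values numerically close to the tabulated ones (for example, 5.0E-203 and 4.4E-252 combine to about 2.3E-451). The result is returned as a base-10 logarithm because the combined values are too small for double-precision floats. The function names are hypothetical.

```python
import math


def fisher_combined_log10(p_values):
    """log10 of Fisher's combined p-value for k independent p-values.

    Fisher's statistic X = -2 * sum(ln p) is chi-square with 2k degrees of
    freedom under the null; for even degrees of freedom the survival
    function is exp(-X/2) * sum_{i<k} (X/2)**i / i!.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)   # X / 2
    series = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return (-half_stat + math.log(series)) / math.log(10)


def format_log10(log10_p):
    """Render a log10 p-value in the table's mantissa/exponent style."""
    exponent = math.floor(log10_p)
    mantissa = 10 ** (log10_p - exponent)
    return f"{mantissa:.1f}E{exponent}"


# MCTC and Mayo-TC p-values for the first row of the table.
print(format_log10(fisher_combined_log10([5.0e-203, 4.4e-252])))
```

For two cohorts the formula reduces to p1·p2·(1 − ln(p1·p2)), so the combined value is always somewhat larger than the plain product of the two p-values.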
Relative Risks
RR=relative risk
ADF=average disease frequency
1,455 phenotypes/ICD9 codes had at least one concordant pair in both cohorts
498 and 139 phenotypes had RRs >10 and >100 in both cohorts, respectively
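The slides define the relative risk only in words. One reading, offered here as an assumption rather than the authors' stated formula, is pairwise concordance divided by disease frequency: concordance is the share of pairs with at least one affected twin in which both twins are affected, and frequency is affected twins divided by cohort size. The function name and the cohort size of 16,482 in the usage line are hypothetical, chosen near the ~16,000 patients quoted for MCTC.

```python
def relative_risk(n_individuals, n_affected, n_concordant_pairs):
    """Pairwise concordance divided by disease frequency in the cohort.

    n_affected counts affected individual twins; n_concordant_pairs counts
    pairs in which both twins are affected.
    """
    if n_affected <= 0:
        raise ValueError("relative risk is undefined with no affected twins")
    if 2 * n_concordant_pairs > n_affected:
        raise ValueError("each concordant pair needs two affected twins")
    # Pairs with at least one affected twin: discordant pairs + concordant pairs.
    pairs_with_affected = n_affected - n_concordant_pairs
    concordance = n_concordant_pairs / pairs_with_affected
    frequency = n_affected / n_individuals
    return concordance / frequency


# 3 affected twins, 1 concordant pair, in a hypothetical cohort of 16,482 twins.
print(round(relative_risk(16482, 3, 1)))
```

With so few affected twins a single concordant pair produces a relative risk in the thousands, which is why very rare diseases dominate the upper end of the RR distribution.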
Genetic diseases with large estimated RRs
ICD9 | Disease | MCTC Affected | MCTC Concordant | MCTC P-value | MCTC RR | Mayo-TC Affected | Mayo-TC Concordant | Mayo-TC P-value | Mayo-TC RR | Combined P-value
282.6 | Sickle-cell disease | 3 | 1 | 2.7E-04 | 2,747 | 2 | 1 | 1.6E-04 | 6,096 | 8.0E-07
282 | Hereditary spherocytosis | 3 | 1 | 2.7E-04 | 2,747 | 3 | 1 | 3.7E-04 | 2,032 | 1.7E-06
356.1 | Peroneal muscular atrophy | 3 | 1 | 2.7E-04 | 2,747 | 3 | 1 | 3.7E-04 | 2,032 | 1.7E-06
282.49 | Other thalassemia | 3 | 1 | 2.7E-04 | 2,747 | 9 | 3 | 6.1E-09 | 677 | 4.7E-11
334.3 | Other cerebellar ataxia | 3 | 1 | 2.7E-04 | 2,747 | 6 | 1 | 1.5E-03 | 406 | 6.3E-06
426.82 | Long QT syndrome | 4 | 1 | 4.9E-04 | 1,374 | 18 | 8 | 1.4E-17 | 542 | 3.3E-19
[Figure: Same-Sex vs. Opposite-Sex twin pairs]
Potential limitations
1. Limited by the inherent challenges of ICD9 coding.
2. Parental/Familial biases
3. Lack of zygosity still limits this approach
NLP or blood types may help enrich for specific twin types.
Conclusions
1. Most diseases are not random events in the twins.
a. 1,406/5,598 (25%) of phenotypes are statistically enriched
in pairs of twins
b. ~1% of phenotypes have RRs < 1.0
2. Genetics is an important component of the disease process for thousands of diseases.
3. Family data may be efficiently captured in the EHR and may be used to predict, prevent, and treat human disease for the advancement of “precision medicine.”
Future of genomic research
Families
Phenome
Precision
Medicine
Populations
Genome
Acknowledgements
Marshfield Clinic:
Murray Brilliant
Peggy Peissig
Steven Schrodi
Zhan (Harold) Ye
John Mayer
many more…
Mayo Clinic:
Jyotishman Pathak
Yijing Cheng
Funding:
NHGRI 1U01HG006389
NLM K22LM011938
NCATS 9U54TR000021
NCRR 1UL1RR025011
Marshfield Clinic Research Foundation
Marshfield Clinic donors