The Ashkenazi Genome Project
Shai Carmi
Pe’er lab, Columbia University
and
The Ashkenazi Genome Consortium (TAGC)
Boston
September 2013
Outline
• Ashkenazi Jewish (AJ) Genetics and TAGC
• Basic Variant Statistics
• Utility in AJ Medical Genetics
• Demographic History of AJ and Europeans
• Summary
Ashkenazi Jewish (AJ) Genetics & TAGC
Recent History of Ashkenazi Jews (AJ)
• Mediterranean origin (?)
• Ca. 1000: Small communities in Northern France, Rhineland
• Migration east
• Expansion
• Migration to US and Israel
• ≈10M today
• Relative isolation
Ashkenazi Jewish Genetics
• Recently, AJ shown to be genetically distinct
• Close to Middle-Easterners & South-Europeans
[Figure: 300 Jewish individuals; SNP arrays. Groups shown: Jewish non-AJ, Europeans, AJ, Middle-Eastern]
Price et al., PLoS Genet., 2008
Olshen et al., BMC Genet., 2008
Need et al., Genome Biol., 2009
Kopelman et al., BMC Genet., 2009
Atzmon et al., AJHG, 2010
Behar et al., Nature, 2010
Bray et al., PNAS, 2010
Guha et al., Genome Biol., 2012
Recent Demography & IBD
• Recent, strong genetic drift leads to long identical-by-descent (IBD) haplotypes.
• IBD sharing common in AJ (Gusev et al., MBE, 2011 and others)
• Inferred bottleneck of just ≈300 individuals ≈800 ya (Palamara et al., AJHG, 2012)
[Figure: a shared segment between haplotypes A and B]
Ashkenazi-Jewish (AJ) Genetic Risk Factors
• Multitude of Mendelian disorders
– Carrier screening: A success story
• Breast and ovarian cancer: BRCA1, BRCA2
• Parkinson’s disease: LRRK2, GBA
Tay-Sachs births
Gravel et al., 2001
AJ Genetics: Summary & Prospects
✓ Large population (≈10M)
✓ Narrow bottleneck (≈300)
✓ Mostly isolated
✓ Recruitable
✓ Well studied
✓ Insight on both European and Middle-Eastern past
× No genealogies
× Mobile
× Some recent admixture
× Significant ancient admixture
The Ashkenazi Genome Consortium
Goal:
• 11+5 labs, mostly from the NY area
• Sequence to high coverage hundreds of healthy AJ
o Use as a reference panel for imputation and clinical interpretation
o Improve understanding of population history and
functional genetic variation in AJ
Phase I:
• 128 AJ personal genomes
• Healthy controls
• Unrelated, PCA-validated AJ
• Technology: Complete Genomics
Basic Variant Statistics
Variant Statistics &
Comparison to Europeans
• Comparison panels:
o 1000 Genomes Europeans
o 26 Flemish from Belgium, sequenced by Complete Genomics
Projection method: Gravel et al., PNAS, 2011
Allele Frequency Spectrum
Utility in AJ Medical Genetics
Screening AJ Genomes
An ancestry-matched reference panel is expected to filter more
benign variants in clinical genomes.
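A minimal sketch of what such panel-based filtering could look like. The function name, the tuple encoding of variants, and the 1% frequency threshold are illustrative assumptions, not the consortium's pipeline.

```python
def filter_against_panel(genome_variants, panel_allele_counts, panel_haplotypes,
                         max_freq=0.01):
    """Keep variants that are absent from, or rare in, the reference panel.

    genome_variants: iterable of hashable variant keys from one clinical genome.
    panel_allele_counts: dict mapping variant key -> allele count in the panel.
    panel_haplotypes: number of haplotypes in the panel (2 x individuals).
    max_freq: hypothetical threshold above which a variant is treated as benign.
    """
    kept = []
    for variant in genome_variants:
        count = panel_allele_counts.get(variant, 0)  # absent variants count as 0
        if count / panel_haplotypes <= max_freq:
            kept.append(variant)
    return kept


# Toy data: variants are (chromosome, position, ref, alt) tuples.
clinical = [("1", 1000, "A", "G"), ("2", 5000, "C", "T"), ("7", 42, "G", "A")]
panel_counts = {("1", 1000, "A", "G"): 40, ("2", 5000, "C", "T"): 1}
# 256 haplotypes corresponds to a panel of 128 diploid genomes.
candidates = filter_against_panel(clinical, panel_counts, panel_haplotypes=256)
```

The closer the panel's ancestry is to the patient's, the more population-specific common variants appear in `panel_allele_counts`, so fewer benign variants survive the filter.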
A Catalog of Mutations in Known AJ
Disease Genes
• Tens of genes harbor known mutations for AJ-prevalent
Mendelian disorders or risk factors for multifactorial diseases.
o Tay-Sachs disease, Gaucher disease, Familial dysautonomia, Niemann-Pick
disease, Torsion dystonia, Canavan disease, Bloom syndrome, etc.
o Breast cancer (BRCA1/2), Colon cancer (APC), Parkinson’s (LRRK2), etc.
• We mapped 73 mutations in 48 genes.
• Detected carriers of 35 known disease mutations.
• Detected 184 missense and 18 loss-of-function
novel (dbSNP135) variants.
o Catalog will be made available.
Imputing AJ Arrays
• An AJ reference panel outperforms a CEU panel, even when the CEU panel is larger
• Accuracy improved across all frequencies and by all measures
– Discordance rate, r², false negatives/positives, Impute2 metrics
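Two of the accuracy measures named above can be sketched as follows. This assumes genotypes coded as 0/1/2 alternate-allele counts and imputed dosages as floats; the function names are hypothetical and this is not the Impute2 implementation.

```python
def discordance_rate(true_genotypes, imputed_genotypes):
    """Fraction of genotype calls that differ between truth and imputation."""
    pairs = list(zip(true_genotypes, imputed_genotypes))
    return sum(t != i for t, i in pairs) / len(pairs)


def r_squared(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages.

    Returns None when either vector has zero variance (r^2 is undefined).
    """
    n = len(true_genotypes)
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        return None
    return cov * cov / (var_t * var_d)
```

Discordance is dominated by common variants, while r² stays informative at low frequencies, which is one reason to report both.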
Imputation by IBD
• Impute by copying long IBD segments from a fully sequenced genome into
a sparsely genotyped one.
– Only 1-2 recent mutations per segment are expected
• IBD detected using Germline
with additional filtering.
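The copying step can be sketched on toy data. This assumes phased haplotypes stored as lists, missing study sites marked `None`, and IBD segments given as half-open index ranges; in practice the segments would come from an IBD detector such as Germline.

```python
def impute_by_ibd(study_haplotype, reference_haplotype, ibd_segments):
    """Fill ungenotyped study sites by copying alleles from an IBD reference.

    study_haplotype: list of alleles, None where the array has no genotype.
    reference_haplotype: fully sequenced haplotype on the same site list.
    ibd_segments: list of (start, end) site-index ranges, end exclusive,
                  over which the two haplotypes were detected as IBD.
    """
    imputed = list(study_haplotype)
    for start, end in ibd_segments:
        for site in range(start, end):
            if imputed[site] is None:          # genotyped sites are left as-is
                imputed[site] = reference_haplotype[site]
    return imputed


study = [0, None, None, 1, None, None]
reference = [0, 1, 0, 1, 1, 0]
result = impute_by_ibd(study, reference, [(0, 4)])
```

Sites outside every detected segment stay missing, which is why the fraction of the genome covered by IBD segments determines how much this approach can impute.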
Segments >3 cM
Fit to: $c = 1 - \left[1 - c_{\max}\left(1 - e^{-n/n_0}\right)\right]^2$
A Short Detour:
A Model for the Expected Coverage
Coverage by IBD: Theory
• Problem statement:
– Reference panel (say, fully sequenced) of size $n_r$
– Study panel (say, sparsely genotyped) of size $n_s$
– Detect all IBD segments of length >m (Morgan) between study and reference panels
– What is the average fraction of a study genome covered by IBD segments to the reference panel?
• Assumptions:
– Haploid (phased), infinite genomes
– All segments can be detected
– Coalescent with recombination
– Recombination breaks a shared segment (B>>1)
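[Figure: population model. Axis: time (generations), from the present back to generations g and g+1. Labels: N→∞ at the present, after and before the bottleneck; bottleneck of size B at generation g; Prob. 1-α]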
Coverage by IBD: Theory
• Exact solution:
– Define $x \equiv \alpha n_r / B$ and $G \equiv g m$
– Denote the average coverage as $c$
– $c = \alpha \, \dfrac{(1+G)\left[e^{G}\left(1-e^{-xe^{-G}}\right) + x^{2}e^{-G} + 2x\right] - x\,e^{-xe^{-G}}\left(2+G+xe^{-G}\right)}{\left(x+e^{G}\right)^{2}}$
• Limits:
– For $x \to 0$ (small reference panel, wide bottleneck), $c \to \alpha e^{-2G}(1+2G)\,x$
– For $x \to \infty$, $c \to \alpha e^{-G}(1+G)$
– For $G \to 0$ (short length cutoff, recent bottleneck), $c \to \alpha\left(1-e^{-x}\right)$
– For $G \to \infty$, $c \to 0$
• Approximation:
– $c \approx c_{\max}\left(1 - e^{-x/x_0}\right)$
– $c_{\max} = \alpha e^{-G}(1+G)$, $x_0 = e^{G}\,\dfrac{1+G}{1+2G}$
– Fits very well numerically
• Diploids:
– $c_{\mathrm{dip}} \approx 1 - \left(1 - c_{\mathrm{hap}}\right)^2$
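The approximation and its diploid version can be evaluated directly. This is a sketch under the slide's definitions ($x = \alpha n_r/B$, $G = gm$); the function names and the default `alpha=1.0` are illustrative assumptions.

```python
import math


def coverage_haploid(x, G, alpha=1.0):
    """Approximate haploid coverage c ~ c_max * (1 - exp(-x / x0))."""
    c_max = alpha * math.exp(-G) * (1 + G)          # plateau for a huge panel
    x0 = math.exp(G) * (1 + G) / (1 + 2 * G)        # sets the small-x slope
    return c_max * (1 - math.exp(-x / x0))


def coverage_diploid(x, G, alpha=1.0):
    """Diploid coverage, treating the two haplotypes as independent."""
    c_hap = coverage_haploid(x, G, alpha)
    return 1 - (1 - c_hap) ** 2
```

By construction the approximation reproduces both panel-size limits: the slope at small x is $c_{\max}/x_0 = \alpha e^{-2G}(1+2G)$, and the plateau at large x is $\alpha e^{-G}(1+G)$.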
Demographic History of AJ & Europeans
Recent AJ History Using IBD
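• Assume a population of historical size $N(t) = N_0 \lambda(t)$ diploids
– Time scaled by $2N_0$
• Fraction of the genome in segments of length $\ell_1 < \ell < \ell_2$:
$\displaystyle \int_0^\infty \frac{e^{-\int_0^t \frac{dt'}{\lambda(t')}}}{\lambda(t)} \left[ e^{-2\ell_1 N_0 t}\left(1 + 2\ell_1 N_0 t\right) - e^{-2\ell_2 N_0 t}\left(1 + 2\ell_2 N_0 t\right) \right] dt$
• Detect IBD in sample ⟹ Infer history $N(t)$
(Palamara et al., AJHG 2012)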
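The integral above can be evaluated numerically for any relative size history λ(t). The sketch below uses a plain midpoint rule; the function names, the truncation at `t_max`, and the step count are assumptions chosen for illustration. For constant size (λ = 1) the integral has the closed form $f(a_1) - f(a_2)$ with $f(a) = (1+2a)/(1+a)^2$ and $a_i = 2\ell_i N_0$, which serves as a check.

```python
import math


def ibd_genome_fraction(l1, l2, N0, lam=lambda t: 1.0, t_max=40.0, steps=100_000):
    """Midpoint-rule evaluation of the slide's integral.

    l1, l2: segment length bounds in Morgans; N0: reference size;
    lam: relative size lambda(t) as a function of scaled time t.
    """
    dt = t_max / steps
    cumulative = 0.0          # running inner integral of dt' / lambda(t')
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt                       # midpoint of this step
        rate = 1.0 / lam(t)
        inner = cumulative + 0.5 * dt * rate     # inner integral up to midpoint
        a1, a2 = 2 * l1 * N0 * t, 2 * l2 * N0 * t
        kernel = math.exp(-a1) * (1 + a1) - math.exp(-a2) * (1 + a2)
        total += math.exp(-inner) * rate * kernel * dt
        cumulative += dt * rate
    return total


def constant_size_fraction(l1, l2, N0):
    """Closed form of the same integral when lambda(t) = 1."""
    f = lambda a: (1 + 2 * a) / (1 + a) ** 2
    return f(2 * l1 * N0) - f(2 * l2 * N0)
```

Inference then amounts to choosing λ(t) so that the predicted fractions match the observed IBD sharing across several length bins.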
Ancient History, One Population at a Time
• Fit the allele frequency spectrum, computed using diffusion
• (∂a∂I, Gutenkunst et al., PLoS Genetics, 2009)
A Consequence
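• Number of segregating sites $S_n(t)$
– Zivkovic and Stephan, Theor. Pop. Biol. 2011
• $\displaystyle S_n(t) = \theta \sum_{k=1}^{n/2} (4k-1)\,\frac{\binom{n}{2k}}{\binom{n+2k-1}{2k}} \int_{-\infty}^{t} \exp\left[-\binom{2k}{2}\int_s^t \frac{du}{\lambda(u)}\right] ds$
n: #diploid samples; θ=4N0μ; μ: mutation rate per generation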
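A quick sanity check of the sum, as a sketch under two assumptions: constant population size (λ = 1, so the time integral for term k equals $1/\binom{2k}{2}$) and n read as the even number of sampled sequences. Under those assumptions the expression should reduce to the classical Watterson result $\theta \sum_{i=1}^{n-1} 1/i$, which the code confirms with exact fractions for small n.

```python
from fractions import Fraction
from math import comb


def segregating_sites_constant(n, theta=1):
    """Evaluate the sum for lambda = 1, where each time integral is 1 / C(2k, 2)."""
    total = Fraction(0)
    for k in range(1, n // 2 + 1):
        weight = Fraction((4 * k - 1) * comb(n, 2 * k), comb(n + 2 * k - 1, 2 * k))
        total += weight / comb(2 * k, 2)
    return theta * total


def watterson(n, theta=1):
    """Classical expectation theta * (1 + 1/2 + ... + 1/(n-1))."""
    return theta * sum(Fraction(1, i) for i in range(1, n))
```

For a non-constant history, only the inner time integral changes; the combinatorial weights stay the same.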
Principal Component Analysis
Ancient History
What we know/learned so far:
• AJ are a Middle-Eastern:European mix
• Slightly higher heterozygosity (+2.4%)
– Larger ancient population size
– Admixture
– Recent explosive growth
• Many more AJ-specific variants
– +14% for 25x25 genomes
• Out-of-Africa (Henn et al., PNAS, 2012)
– ≈50-60 kya
– Serial founder model: Africa → Middle-East → Europe
– Hunter-gatherers in Europe at ≈40-45 kya (Higham et al., Nature, 2011)
– Bottleneck and expansion at each step
The Joint AFS
• Allele frequencies correlated
but substructure exists.
• Experimenting with inference
using the joint AFS
— For our sample size, can infer at
most ≈10 parameters
— Hard to infer very recent history
— Hard to infer migration rates
A Proposed Model
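[Figure: schematic of the proposed model along a time axis ending at the present. Parameters: N0; Nb,OOA and Tb,OOA; Nb,EU and Tb,EU; Nf,EU (Flemish); Nf,AJ (AJ); Ta and fa]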
The Inferred Model
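[Figure: the inferred model, time in years ago. Rounded values shown: N0 = 6500; Nb,OOA = 2300; Tb,OOA = 52,000; Nb,EU = 1800; Tb,EU = 10,800; Ta = 1700; fa = 55%; Nf,EU (Flemish) = 58,000; Nf,AJ (AJ) = 7500. Annotations: Out-of-Africa?; Middle-East/Levant?; Early Neolithic migrants?; Jewish diaspora?]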
European Origins
Farming began in Europe ≈5-8kya (“the Neolithic revolution”)
Human migration (“demic diffusion”) or spread of ideas (“cultural diffusion”)?
• For cultural diffusion, split from Middle-Easterners at
≈40-45 kya.
• We estimate ≈11 kya
• Earlier than ≈5-8 kya perhaps due to
• Early substructure before actual migration
• Incomplete replacement of hunter-gatherers
• Traces of recovery from the Last Glacial Maximum
Confidence Intervals
• Parametric bootstrap:
o Simulate whole genomes with the maximum likelihood parameters
o MaCS, Chen et al., Genome Res., 2009
o Infer using the simulated datasets
Parameter   Maximum likelihood   Bias-corrected mean±SD   95% confidence interval
N0          6543                 6523±25                  [6475 , 6572]
Nb,OOA      2256                 2314±47                  [2223 , 2406]
Tb,OOA      53,050               52,007±1561              [48,947 , 55,067]
Nf,AJ       7632                 7494±193                 [7116 , 7872]
Nb,EU       1556                 1802±28                  [1748 , 1857]
Tb,EU       10,600               10,835±188               [10,467 , 11,202]
Nf,EU       56,519               57,977±2912              [52,270 , 63,685]
Ta          1940                 1686±98                  [1495 , 1878]
fa          55%                  55%±1%                   [53% , 57%]
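A sketch of how a bias-corrected estimate and a normal-approximation interval can be obtained from parametric-bootstrap replicates. The bias-correction formula used here (subtracting the bootstrap bias from the maximum-likelihood value) is a standard choice and an assumption about what was done; the function name is hypothetical. The intervals in the table do agree with mean ± 1.96·SD up to rounding.

```python
from statistics import mean, stdev


def bootstrap_summary(ml_estimate, bootstrap_estimates, z=1.96):
    """Bias-corrected estimate, bootstrap SD, and a normal 95% interval.

    bootstrap_estimates: parameter values re-inferred from datasets simulated
    under the maximum-likelihood parameters.
    """
    bias = mean(bootstrap_estimates) - ml_estimate   # how far inference drifts
    corrected = ml_estimate - bias
    sd = stdev(bootstrap_estimates)
    return corrected, sd, (corrected - z * sd, corrected + z * sd)


# Toy replicates, not the study's data.
corrected, sd, interval = bootstrap_summary(10.0, [11.0, 12.0, 13.0])
```

For example, the row N0 = 6523±25 gives 6523 ± 1.96·25 ≈ [6474, 6572], matching the reported interval.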
Hmmm…
• Model specification
• Mutation rate
Mutation Rate
• We used $\mu = 2.35 \cdot 10^{-8}$ per bp per generation: the “phylogenetic rate”.
• The “de-novo rate” is $\approx (1\text{–}1.5) \cdot 10^{-8}$, and would double all population sizes and times.
• We preferred the phylogenetic rate for a few (weak) reasons
– False negatives may exist in some de-novo studies
– The de-novo rate does not account for selection
– With the de-novo rate, the Out-of-Africa time would be >100 kya
• A decrease of 50% in the mutation rate will bring the split time to ≈16 kya
– Supports the LGM recovery hypothesis
– Identifies the Middle-East as the source of the recovery (Haber et al., PLoS Genetics, 2013; Pala et al., AJHG 2012)
– Still suggests genetic discontinuity from first hunter-gatherers who colonized Europe
• Debate is still open
Model Specification
We tried several alternative models
• All models support >50% European ancestry in AJ and European-Middle-Eastern
split 10-15 kya.
• For example, a two-wave model for the population of Europe supports LGM
recovery + Neolithic replacement:
Summary & Outlook
• We sequenced 128 healthy AJ genomes to high coverage.
• Our reference panel will improve:
– Screening of AJ clinical genomes or known disease genes
– Imputation of AJ SNP arrays
• IBD sharing indicates a very recent bottleneck and expansion.
• The AJ-European joint allele frequency spectrum suggests:
– Over 50% European ancestry in AJ
– Europeans diverged from Middle-Easterners only ≈10-15 kya
– Made possible by sequencing population with partly Middle-Eastern ancestry
• In the future:
– Sequence ≈200 more genomes to cover entire bottleneck
– Use genomes from more populations to fine-tune demographic models
Thank you!
TAGC consortium members:
Columbia University Computer Science:
Itsik Pe’er
Fillan Grady, Ethan Kochav, James Xue
Shlomo Hershkop
Long-Island Jewish Medical Center:
Todd Lencz, Semanti Mukherjee, Saurav Guha
Columbia University Medical Center:
Lorraine Clark, Xinmin Liu
Albert Einstein College of Medicine:
Gil Atzmon, Harry Ostrer, Nir Barzilai,
Kinnari Upadhyay, Danny Ben-Avraham
Mount Sinai School of Medicine:
Inga Peter, Laurie Ozelius
Memorial Sloan Kettering Cancer Center:
Ken Offit, Joseph Vijai
Yale School of Medicine:
Judy Cho, Ken Hui, Monica Bowen
The Hebrew University of Jerusalem:
Ariel Darvasi
VIB, Gent, Belgium
Herwig Van Marck, Stephane Plaisance
Complete Genomics
Omicia
Funding:
Human Frontiers Science program
AJ Genetics
[Figure: AJ effective size N over time t (years ago, up to the present), Palamara et al., AJHG 2012. Labeled values: 2,300; 270; 800; 45,000; 4,300,000]
Power of imputation by IBD
[Figure: % additional information potential (0% to 100%) vs. # of sequenced individuals (0 to 500). Series: WTCCC (UK), AJ_SCZ (AJ)]
Complete Genomics WGS
Quality Control
• 128 samples from two labs were sequenced in 3 batches
• Minimal batch effects
• Some results are for the first batch of 57 genomes
Property                        Genome (exome)
Coverage                        ≈56x
Fraction called                 96.7±0.3% (98.1%)
Fraction with coverage > 20x    92.7±1.6% (94.9%)
Concordance with SNP array      99.67±0.25%
Ti/Tv ratio                     2.14±0.004 (3.05)
Quality Control
• False positive rate assessment
— Counting (the few) hets inside long runs of homozygosity
— A duplicate sample
[Figure: hets inside a run of homozygosity (roh)]
• Genome wide extrapolation:
– SNVs: ≈10-40k FP per genome (FDR: 0.3-1.3%)
– Indels: ≈10-30k FP per genome (FDR: 2-6%)
• QC:
– Remove indels and poly-allelic variants
– Remove HWE violations, low call rate
• FP after QC: ≈5k per genome.
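The false discovery rates above follow from dividing the extrapolated false positives by the number of calls per genome. The sketch below assumes the denominators are the per-genome totals reported in the Variant Statistics table (3.4M SNPs; 220k insertions plus 235k deletions); how the slide's percentages were actually computed is not stated, so the result only approximately matches the quoted ranges.

```python
def false_discovery_rate(false_positives, total_calls):
    """Fraction of calls that are expected to be false positives."""
    return false_positives / total_calls


SNVS_PER_GENOME = 3_400_000            # "Total SNPs" per genome
INDELS_PER_GENOME = 220_000 + 235_000  # insertions + deletions per genome

snv_fdr = [false_discovery_rate(fp, SNVS_PER_GENOME) for fp in (10_000, 40_000)]
indel_fdr = [false_discovery_rate(fp, INDELS_PER_GENOME) for fp in (10_000, 30_000)]
```

Both ranges land close to the quoted 0.3-1.3% for SNVs and 2-6% for indels, which is why indels were removed in QC while SNVs were kept.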
Concordance with Arrays
Asymptotic discordance: 0.05%
Processing and Cleaning Pipeline
[Flowchart; steps listed by stage]
Inputs:
• AJ: 58 Complete Genomics masterVar (hg19)
• Flemish: 26 Complete Genomics masterVar (hg18)
• AJ complete project: 128 Complete Genomics masterVar (hg19)
Conversion (CGA tools):
• mkvcf → VCF file
• testvariants file
• Ti/Tv statistics
Local cleaning (custom script; Plink/Seq):
• Liftover hg18 => hg19 (Flemish)
• Remove low-quality, half-called, or non-SNVs
• Remove variants not fully called in at least one individual
• Remove inbred individual
Cohort-based cleaning (→ Plink file):
• Remove poly-allelic variants
• Remove variants with high no-call rate or that are not in Hardy-Weinberg equilibrium
Analyses:
• Summary stats, array concordance, and duplicates analyses
• Monomorphic non-ref and runs-of-homozygosity analyses
Phasing:
• seqphase: phase using molecular phasing information
• SHAPEIT: phase and impute sporadically missing values
AJ-Flemish merge, initial filtering:
• Variant in both cleaned files? Keep
• Variant in one cleaned file and in the VCF of the other? Discard
• Variant in one cleaned file and not at all in other? Keep and set other as hom-ref
• Remove coordinates with reference mapping problem
• Remove variants with AJ-Flemish incompatible alleles
• Merge AJ-Flemish genotypes
• Remove variants incompatible with 1000 Genomes
• SHAPEIT, using 1000 Genomes panel: phase and impute sporadically missing genotypes
Validation:
• Validate AJ ancestry
• Validate no cryptic relatedness
Mobile Element Insertions (MEIs) &
Copy Number Variants (CNVs)
Initial validation efforts suggested high false
discovery rate, at least for novel events.
1000 Genomes MEIs
Novel MEIs:
• 3/11 validated
• Strong batch effect
Variant Statistics
Statistic              Per genome (exome)
Total SNPs             3.4M (22k)
Novel SNPs             3.8% (4.1%)
Het/hom ratio          1.65 (1.67)
Insertions count       220k (242)
Deletions count        235k (223)
Substitutions count    83k (374)
Synonymous SNPs        10,536
Non-synonymous SNPs    9706
Nonsense SNPs          72
Other disrupting       255
CNV count              302
SV count               1480
MEI count              4090
Imputing AJ Arrays
Compare imputation accuracy of AJ SNP arrays when using either AJ or European reference panels.
[Flowchart; steps listed in order]
• Inputs: 1000 Genomes CEU (87); Phased AJ Sequences (57); AJ Arrays (1000)
• Split the 57 AJ sequences: 50 for reference panels, 7 reduced to unphased arrays
• Study Panel (1007): the 1000 AJ arrays plus the 7 reduced sequences, phased with ShapeIT
• Reference Panel 1 (50): 50 AJ sequences
• Reference Panel 2 (87): 87 CEU
• Reference Panel 3 (137): 50 AJ sequences + 87 CEU
• Impute (Impute2) the study panel with each reference panel → Imputed Study Panels 1, 2, 3
Mutation Burden in AJ
• Theoretically, a narrow bottleneck should increase the load of deleterious variants (e.g., Lohmueller, Nature, 2008)
o Or not? (Simons et al., arXiv, 2013)
o Expect higher load in AJ.
• Define deleterious:
o Derived? Minor? Non-reference? Rare?
o How to weight each variant?
o Account for demography, sequencing errors?
o Define significance?
• Compare 26 AJ and 26 Flemish.
• AJ have between 1-10% more deleterious variants than expected (using Flemish as baseline). P-values between 0.2 and 10^−60.
Mutation Burden in Disease Categories
• Many diseases have been suggested to be more prevalent in AJ (Goodman 1979)
o Several Mendelian disorders
o Some cancers
o Inflammatory bowel diseases
o Diabetes, obesity
o Some psychiatric diseases, myopia
• Annotate genes according to disease category (Omicia Inc).
• Compare non-synonymous variant load between AJ and Flemish.

Disease category     #genes    AJ/FL ratio
Aging                106       1.07
Infectious           70        1.03
Neonatal             956       1.02
Gastrointestinal     254       1.02
Dental               86        1.01
Immunological        474       1.01
Hemic                202       1.01
Cardiovascular       502       1.01
Endocrinological     750       1.01
Oncological          471       1.01
Women’s              39        1.00
Drug                 82        1.00
Neurological         980       1.00
Nutrition            29        0.99
Respiratory          187       0.99
Kidney               285       0.96
Psychiatric          21        0.93

• No category comes out significant in Gene Set Enrichment Analysis.
Het/Hom Ratio
[Figure: schematic of AJ and EU history along a time axis t (years ago, up to the present); label: IBD observed]