The Ashkenazi Genome Project
Shai Carmi
Pe’er lab, Columbia University
and
The Ashkenazi Genome Consortium (TAGC)
Personal Genomes & Medical Genomics
Cold Spring Harbor, NY
November 2012
Recent History of Ashkenazi Jews
• Mediterranean origin (?)
• Ca. 1000: Small communities in N. France, Rhineland
• Migration east
• Expansion
• ~10M today, mostly in US and Israel
• Relative isolation
Ashkenazi Jewish Genetics
• Recently, AJ shown to be a genetically distinct group
• Close to Middle-Eastern & South-European populations
Figure: 300 Jewish individuals; SNP arrays. Groups shown: Jewish non-AJ, Europeans, AJ, Middle Eastern.
Price et al., PLoS Genetics 2008.
Olshen et al., BMC Genetics 2008.
Need et al., Genome Biology 2009.
Kopelman et al., BMC Genetics 2009.
Atzmon et al., AJHG 2010.
Behar et al., Nature 2010.
Bray et al., PNAS 2010.
Guha et al., Genome Biology 2012.
Recent Demography & IBD
In small populations, common ancestors are likely recent.
⟹ Many long haplotypes identical-by-descent (IBD)
Figure: a segment shared between individuals A and B.
• IBD is highly informative on recent history!
• IBD common in AJ (Gusev et al., MBE 2011).
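The link between population size, recency of common ancestors, and segment length can be illustrated with a rough calculation. The sketch below assumes a constant-size, randomly mating population of N haploid genomes, so that a pair's common ancestor at a site lies N generations back on average. It also uses the standard approximation that the shared segment spanning that site has mean length 100/g cM when the ancestor lived g generations ago. The function names and example sizes are illustrative and do not come from the talk.

```python
def mean_tmrca_generations(n_haploid):
    """Mean pairwise coalescence time (generations) in a constant-size
    population of n_haploid genomes: the chance to coalesce is 1/n per generation."""
    return float(n_haploid)

def mean_segment_cm(generations):
    """Mean length (cM) of the IBD segment spanning a fixed site when the
    common ancestor lived `generations` ago. On each side of the site the
    distance to the nearest crossover over 2g meioses averages 1/(2g) Morgans,
    so the two sides together average 1/g Morgans = 100/g cM."""
    return 100.0 / generations

for n in (500, 50_000):  # illustrative sizes only
    g = mean_tmrca_generations(n)
    print(n, g, mean_segment_cm(g))
```

Evaluating the segment length at the mean ancestor age is a simplification; it is meant only to show that smaller populations imply more recent ancestors and therefore longer shared segments.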
AJ Genetic History
Figure: effective size N against time t (years ago, up to the present). Labeled values: 2,300; 45,000; 270; 800; 4,300,000.
Expansion rate ≈34% per generation
Palamara et al., AJHG 2012
High potential for genetic studies!
Figure: Power of imputation by IBD. % Additional Information Potential (0% to 100%) against # of Sequenced Individuals (0 to 500); curves: WTCCC (UK) and AJ_SCZ (AJ).
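As a consistency check on the quoted expansion rate, the short sketch below computes how many generations of constant 34% growth separate two population sizes. It assumes, as a reading of the figure labels rather than something the slide states explicitly, that 270 is the bottleneck effective size, 4,300,000 the present effective size, and 800 the bottleneck age in years. The function name is hypothetical.

```python
import math

def generations_to_grow(n_start, n_end, rate_per_generation):
    """Generations needed to grow from n_start to n_end at a constant
    per-generation growth rate (0.34 means +34% each generation)."""
    return math.log(n_end / n_start) / math.log(1.0 + rate_per_generation)

# Assumed reading of the figure labels: bottleneck 270, present 4,300,000.
gens = generations_to_grow(270, 4_300_000, 0.34)
print(round(gens, 1))        # number of generations of growth
print(round(800 / gens, 1))  # implied years per generation if the bottleneck was 800 years ago
```

Under these assumptions the growth spans roughly 33 generations, which would put a generation at about 24 years.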
The Ashkenazi Genome Consortium
Goal:
• Sequence to high coverage hundreds of healthy AJ
o Use as a reference panel for association studies, imputation, and clinical interpretation
o Understand population history and functional genetic variation in AJ
Phase I:
• 58 AJ personal genomes (86 under way)
• ~60yo, healthy controls
• Unrelated, PCA-validated AJ
• Technology: Complete Genomics
Quality Control
Property                        Genome (exome)
Coverage                        ~55x
Fraction called                 96.5±0.003% (98%)
Fraction with coverage > 20x    92.4±0.018% (94.9%)
Concordance with SNP array      99.87±0.1%
Ti/Tv ratio                     2.14±0.003 (3.05)
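For readers unfamiliar with the Ti/Tv metric, here is a minimal sketch of how the ratio is computed from a list of single-base substitutions. It is not the consortium's QC code; the function names and the toy input are illustrative, and the input is assumed to contain only SNPs with ref different from alt.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)

def ti_tv_ratio(snps):
    """Ti/Tv ratio for an iterable of (ref, alt) single-base pairs."""
    snps = list(snps)  # allow generators; the data is scanned twice
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv

# Toy input, not TAGC data: 4 transitions and 2 transversions.
toy = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(toy))
```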
Variant Statistics & Comparison to Europeans
Figure: TAGC vs. 14 Flemish genomes (Belgium). Panels: All SNPs (M; axis 3 to 3.5), Het/hom (axis 1.4 to 1.6), Insertions, Deletions, MNPs (k; axis 0 to 200).
Similar results in 13 CG European public genomes.
Comparison to Europeans
• Allele frequency spectrum:
– No excess singletons.
– Slight excess of doubletons.
• More novel SNPs in AJ (3.8% vs. 3.1%).
Figure panels: singletons; doubletons.
Quality Control (2)
False positive rate assessment by runs of homozygosity (ROH):
• Assume hets in high-confidence ROH are FP.
Figure: paternal and maternal haplotypes within an ROH, with hets marked.
• Genome-wide extrapolation: ~20,000 per genome.
• QC:
– Discard putatively low-quality variants
– Discard HWE violations, low call rate
⇒ FP after QC: ~5,000 per genome.
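The genome-wide extrapolation can be sketched as a simple rescaling: the density of heterozygous calls inside high-confidence ROH is treated as the per-base false positive rate and multiplied by the genome length. This assumes a uniform error rate across the genome. The function name and the input numbers are hypothetical and are not taken from the talk.

```python
def extrapolate_false_positives(hets_in_roh, roh_length_bp, genome_length_bp):
    """Scale the het count seen inside high-confidence ROH (all assumed to
    be false positives) to the whole genome, assuming a uniform error rate."""
    fp_per_bp = hets_in_roh / roh_length_bp
    return fp_per_bp * genome_length_bp

# Hypothetical inputs chosen only to illustrate the arithmetic.
estimate = extrapolate_false_positives(
    hets_in_roh=200, roh_length_bp=30e6, genome_length_bp=3e9
)
print(round(estimate))
```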
Applicability to Clinical Genomics
• Variants of unknown significance
– Technical false positives
– True variants without health impact
Figure: Novel variants per sample, in two panels: Total (axis 0 to 140,000) and Non-synonymous (axis 0 to 600). Categories in each panel: All, After QC, Not in TAGC panel.
Demographic Inference
• Use allele frequency spectrum and coalescent simulations.
• Assume the demographic model previously mentioned.
Figure: allele frequency spectrum, % sites on a logarithmic axis (0.1 to 100).
• Parameters qualitatively similar to those inferred from IBD
• Bottleneck 35 generations before present (gbp) of size 500; pre-bottleneck size 90,000
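The observed quantity behind this inference is the allele frequency spectrum. The sketch below only tabulates a spectrum from genotypes and prints the constant-size neutral expectation (proportional to 1/i) as a baseline. It does not perform the coalescent simulations or the parameter fitting described above. Function names and the toy genotypes are illustrative.

```python
from collections import Counter

def site_frequency_spectrum(genotypes, n_chromosomes):
    """Percent of segregating sites at each alternate-allele count.

    genotypes: one list per site holding the alternate-allele count (0, 1, 2)
    of each diploid individual; n_chromosomes = 2 * number of individuals."""
    counts = Counter(sum(site) for site in genotypes)
    segregating = {k: v for k, v in counts.items() if 0 < k < n_chromosomes}
    total = sum(segregating.values())
    return {k: 100.0 * v / total for k, v in sorted(segregating.items())}

def neutral_expectation(n_chromosomes, max_count):
    """Constant-size neutral baseline: sites with count i occur in proportion to 1/i."""
    norm = sum(1.0 / i for i in range(1, n_chromosomes))
    return {i: 100.0 * (1.0 / i) / norm for i in range(1, max_count + 1)}

# Toy data, 3 individuals x 4 sites; not TAGC genotypes.
toy = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [2, 2, 2]]
print(site_frequency_spectrum(toy, n_chromosomes=6))
print(neutral_expectation(6, 2))
```

Departures of the observed spectrum from such a baseline, for example in the singleton and doubleton classes, are what carry information about bottlenecks and expansions.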
Summary
• IBD reveals AJ population bottleneck and expansion and potential for genetic studies.
• High-quality genomes sequenced by TAGC indicate utility in clinical setting.
• Confirm demography and demonstrate subtle differences from Europeans.
• Ongoing analysis:
– Imputation power using TAGC vs. 1kG as ref panels
– Local ancestry inference
– Functional variants; AJ disease genes
– Mobile element insertions
Thank you!
TAGC consortium members:
Columbia University Computer Science:
Itsik Pe’er, Pier Francesco Palamara
Undergrads: Fillan Grady, Ethan Kochav, James Xue
IT: Shlomo Hershkop
Long-Island Jewish Medical Center:
Todd Lencz, Semanti Mukherjee, Saurav Guha
Columbia University Medical Center:
Lorraine Clark, Xinmin Liu
Albert Einstein College of Medicine:
Gil Atzmon, Harry Ostrer
Mount Sinai School of Medicine:
Inga Peter, Laurie Ozelius
Memorial Sloan Kettering Cancer Center:
Ken Offit, Vijai Joseph
Yale School of Medicine:
Judy Cho, Ken Hui, Monica Bowen
The Hebrew University of Jerusalem:
Ariel Darvasi
VIB, Gent, Belgium
Herwig Van Marck, Stephane Plaisance
Complete Genomics
Jason Laramie
Funding:
Human Frontier Science Program.
Formal Inference Using IBD
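• Assume a population of historical size $N(t) = N_0 \lambda(t)$.
• Total shared segments of length $\ell_1 < \ell < \ell_2$:
$$\int_0^\infty \frac{e^{-\int_0^t \frac{dt'}{\lambda(t')}}}{\lambda(t)} \left[ e^{-2\ell_1 N_0 t}\,(1 + 2\ell_1 N_0 t) - e^{-2\ell_2 N_0 t}\,(1 + 2\ell_2 N_0 t) \right] dt$$
• Detect IBD in sample ⟹ Infer history $N(t)$.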
Palamara et al., AJHG 2012
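To make the formula concrete, the sketch below evaluates the integral numerically with a midpoint rule and compares it with the closed form that follows for constant size, λ(t) = 1. It assumes segment lengths are in Morgans, t is in units of N0 generations, and the infinite upper limit can be truncated at a finite t_max. The function names, defaults, and example values are illustrative and are not the published inference code.

```python
import math

def shared_fraction(l1, l2, n0, lam=lambda t: 1.0, t_max=60.0, steps=60_000):
    """Numerically evaluate the slide's integral (midpoint rule).

    l1, l2: segment length bounds in Morgans; n0: present size N0;
    lam: relative size lambda(t), with t in units of N0 generations.
    The infinite upper limit is truncated at t_max (assumed large enough)."""
    dt = t_max / steps
    cum = 0.0    # running value of the inner integral of dt'/lambda(t')
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        cum_mid = cum + 0.5 * dt / lam(t)
        density = math.exp(-cum_mid) / lam(t)
        a1, a2 = 2 * l1 * n0 * t, 2 * l2 * n0 * t
        length_term = math.exp(-a1) * (1 + a1) - math.exp(-a2) * (1 + a2)
        total += density * length_term * dt
        cum += dt / lam(t)
    return total

def constant_size_closed_form(l1, l2, n0):
    """Closed form of the same integral when lambda(t) = 1."""
    f = lambda a: (1 + 2 * a) / (1 + a) ** 2
    return f(2 * l1 * n0) - f(2 * l2 * n0)

# Example: segments of 1 to 5 cM with N0 = 1000 (illustrative values).
print(shared_fraction(0.01, 0.05, 1000), constant_size_closed_form(0.01, 0.05, 1000))
```

A different history can be explored by passing another `lam`, for example `lambda t: math.exp(-5 * t)` for a population that was smaller in the past.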
Data processing
• CGA tools VCF generator: called sites only.
• Correct multi-nucleotide substitution bug.
• Compress, index, and distribute.
• Generate high-quality genotype set for population genetic analyses.
– Remove indels and multi-nucleotide substitutions.
– Remove low-quality SNPs.
– Remove multi-allelic SNPs.
– Remove half-calls.
– Remove SNPs with high no-call rate.
– Remove SNPs not in Hardy-Weinberg equilibrium.
– Remove monomorphic reference SNPs.
– Remove an inbred individual.
– Format as PLINK file.
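Two of the filters above, the no-call rate and the Hardy-Weinberg check, can be sketched per SNP from genotype counts. This is a minimal illustration using a chi-square test with one degree of freedom. The thresholds are placeholders, the function names are hypothetical, and the talk does not state which test or cutoffs were actually used. Exact tests are often preferred in practice, especially for rare alleles.

```python
import math

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 d.f.) test of Hardy-Weinberg proportions from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 d.f.

def keep_snp(n_aa, n_ab, n_bb, n_nocall, max_nocall=0.05, min_hwe_p=1e-6):
    """Apply the no-call and HWE filters; both thresholds are placeholders."""
    called = n_aa + n_ab + n_bb
    if called == 0 or n_nocall / (called + n_nocall) > max_nocall:
        return False
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p

print(keep_snp(25, 50, 25, 0))   # genotype counts in exact HWE proportions
print(keep_snp(50, 0, 50, 0))    # no heterozygotes at all
print(keep_snp(25, 50, 25, 20))  # many no-calls
```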
Variant statistics
Statistic              Per genome (exome)
Total SNPs             3.4M (22k)
Novel SNPs             3.7% (4%)
Het/hom ratio          1.64 (1.67)
Insertions count       223k (246)
Deletions count        237k (218)
Substitutions count    83k (374)
Synonymous SNPs        10525
Non-synonymous SNPs    9695
Nonsense SNPs          71
Other disrupting       241
CNV count              336
SV count               1486
MEI count              3475