The Ashkenazi Genome Project
Shai Carmi
Pe’er lab, Columbia University, and The Ashkenazi Genome Consortium (TAGC)
Boston, September 2013

Outline
• Ashkenazi Jewish (AJ) Genetics and TAGC
• Basic Variant Statistics
• Utility in AJ Medical Genetics
• Demographic History of AJ and Europeans
• Summary

Ashkenazi Jewish (AJ) Genetics & TAGC

Recent History of Ashkenazi Jews (AJ)
• Mediterranean origin (?)
• Ca. 1000: Small communities in Northern France, Rhineland
• Migration east
• Expansion
• Migration to US and Israel
• ≈10M today
• Relative isolation

Ashkenazi Jewish Genetics
• Recently, AJ shown to be genetically distinct
• Close to Middle-Easterners & South-Europeans
[Figure: 300 Jewish individuals; SNP arrays. Group labels: Jewish non-AJ, Europeans, AJ, Middle-Eastern]
Price et al., PLoS Genet., 2008; Olshen et al., BMC Genet., 2008; Need et al., Genome Biol., 2009; Kopelman et al., BMC Genet., 2009; Atzmon et al., AJHG, 2010; Behar et al., Nature, 2010; Bray et al., PNAS, 2010; Guha et al., Genome Biol., 2012

Recent Demography & IBD
• Recent, strong genetic drift leads to long identical-by-descent (IBD) haplotypes.
• IBD sharing is common in AJ (Gusev et al., MBE, 2011 and others)
• Inferred bottleneck of just ≈300 individuals ≈800 years ago (Palamara et al., AJHG, 2012)
[Figure: individuals A and B with a shared segment]

Ashkenazi-Jewish (AJ) Genetic Risk Factors
• Multitude of Mendelian disorders
  – Carrier screening: A success story
• Breast and ovarian cancer: BRCA1, BRCA2
• Parkinson’s disease: LRRK2, GBA
[Figure: Tay-Sachs births (Gravel et al., 2001)]

AJ Genetics: Summary & Prospects
• Large population (≈10M)
• Narrow bottleneck (≈300)
• Mostly isolated
• Recruitable
• Well studied
• Insight on both European and Middle-Eastern past
× No genealogies
× Mobile
× Some recent admixture
× Significant ancient admixture

The Ashkenazi Genome Consortium
• 11+5 labs, mostly from the NY area
Goal:
• Sequence hundreds of healthy AJ to high coverage
  o Use as a reference panel for imputation and clinical interpretation
  o Improve understanding of population history and functional genetic variation in AJ
Phase I:
• 128 AJ personal genomes
• Healthy controls
• Unrelated, PCA-validated AJ
• Technology: Complete Genomics

Basic Variant Statistics

Variant Statistics & Comparison to Europeans
• Comparison panels:
  o 1000 Genomes Europeans
  o 26 Flemish from Belgium, sequenced by Complete Genomics
Projection method: Gravel et al., PNAS, 2011

Allele Frequency Spectrum

Utility in AJ Medical Genetics

Screening AJ Genomes
An ancestry-matched reference panel is expected to filter more benign variants in clinical genomes.

A Catalog of Mutations in Known AJ Disease Genes
• Tens of genes harbor known mutations for AJ-prevalent Mendelian disorders or risk factors for multifactorial diseases.
  o Tay-Sachs disease, Gaucher disease, Familial dysautonomia, Niemann-Pick disease, Torsion dystonia, Canavan disease, Bloom syndrome, etc.
  o Breast cancer (BRCA1/2), Colon cancer (APC), Parkinson’s (LRRK2), etc.
• We mapped 73 mutations in 48 genes.
• Detected carriers of 35 known disease mutations.
• Detected 184 missense and 18 loss-of-function novel (relative to dbSNP135) variants.
  o Catalog will be made available.

Imputing AJ Arrays
• An AJ reference panel outperforms a CEU panel, even when the CEU panel is larger
• Accuracy improved across all frequencies and by all measures
  – Discordance rate, $r^2$, false negatives/positives, Impute2 metrics

Imputation by IBD
• Impute by copying long IBD segments from a fully sequenced genome into a sparsely genotyped one.
  – Only 1-2 recent mutations per segment are expected
• IBD detected using Germline with additional filtering.
[Figure: IBD segments >3 cM]
Fit to: $c = 1 - \left[1 - c_{\max}\left(1 - e^{-n/n_0}\right)\right]^2$

A Short Detour: A Model for the Expected Coverage

Coverage by IBD: Theory
• Problem statement:
  – Reference panel (say, fully sequenced) of size $n_r$
  – Study panel (say, sparsely genotyped) of size $n_s$
  – Detect all IBD segments of length $> m$ (Morgan) between study and reference panels
  – What is the average fraction of a study genome covered by IBD segments to the reference panel?
• Assumptions:
  – Haploid (phased), infinite genomes
  – All segments can be detected
  – Coalescent with recombination
  – Recombination breaks a shared segment ($B \gg 1$)
[Figure: model schematic. Axis: time (generations). Labels: $N \to \infty$ at present, bottleneck of size $B$ at generation $g$, $N \to \infty$ from generation $g+1$ back, Prob. $1-\alpha$]

Coverage by IBD: Theory
• Exact solution:
  – Define $x \equiv \alpha n_r / B$ and $G \equiv g m$
  – Denote the average coverage as $c$
  – $c = \alpha \cdot (\ldots)$: a closed-form expression in $G$, $x$, $e^{\pm G}$, and $e^{-x e^{-G}}$ [expression garbled in the slide export]
• Limits:
  – For $x \to 0$ (small reference panel, wide bottleneck), $c \to \alpha e^{-2G}(1 + 2G)\,x$
  – For $x \to \infty$, $c \to \alpha e^{-G}(1 + G)$
  – For $G \to 0$ (short length cutoff, recent bottleneck), $c \to \alpha\left(1 - e^{-x}\right)$
  – For $G \to \infty$, $c \to 0$
• Approximation:
  – $c \approx c_{\max}\left(1 - e^{-x/x_0}\right)$
  – $c_{\max} = \alpha e^{-G}(1 + G)$, $x_0 = e^{G}\,\frac{1+G}{1+2G}$
  – Fits very well numerically
• Diploids:
  – $c_{\mathrm{dip}} \approx 1 - \left(1 - c_{\mathrm{hap}}\right)^2$

Demographic History of AJ & Europeans

Recent AJ History Using IBD
• Assume a population of historical size $N(t) = N_0 \lambda(t)$ diploids
  – Time scaled by $2N_0$
• Fraction of the genome in segments of length $\ell_1 < \ell < \ell_2$:
  $\int_0^\infty \frac{e^{-\int_0^t dt'/\lambda(t')}}{\lambda(t)} \left[ e^{-2\ell_1 N_0 t}\left(1 + 2\ell_1 N_0 t\right) - e^{-2\ell_2 N_0 t}\left(1 + 2\ell_2 N_0 t\right) \right] dt$
• Detect IBD in sample ⟹ Infer history $N(t)$
Palamara et al., AJHG 2012

Ancient History, One Population at a Time
• Fit the allele frequency spectrum, computed using the diffusion approximation (∂a∂i, Gutenkunst et al., PLoS Genetics, 2009)

A Consequence
• Number of segregating sites $S_n(t)$
  – Zivkovic and Stephan, Theor. Pop. Biol. 2011
• $S_n(t) = \theta \sum_{k=1}^{n/2} (4k-1)\, \frac{\binom{n}{2k}}{\binom{n+2k-1}{2k}} \int_{-\infty}^{t} \exp\left[-\binom{2k}{2} \int_s^t \frac{du}{\lambda(u)}\right] ds$
  n: #diploid samples; $\theta = 4N_0\mu$; μ: mutation rate per generation

Principal Component Analysis

Ancient History
What we know/learned so far:
• AJ are a Middle-Eastern:European mix
• Slightly higher heterozygosity (+2.4%)
  – Larger ancient population size
  – Admixture
  – Recent explosive growth
• Many more AJ-specific variants
  – +14% for 25x25 genomes
• Out-of-Africa (Henn et al., PNAS, 2012)
  – ≈50-60 kya
  – Serial founder model: Africa → Middle-East → Europe
  – Hunter-gatherers in Europe at ≈40-45 kya (Higham et al., Nature, 2011)
  – Bottleneck and expansion at each step

The Joint AFS
• Allele frequencies correlated but substructure exists.
• Experimenting with inference using the joint AFS
  – For our sample size, can infer at most ≈10 parameters
  – Hard to infer very recent history
  – Hard to infer migration rates

A Proposed Model
[Figure: proposed demographic model on a time axis ending at the present. Parameters: N_0, N_b,OOA, T_b,OOA, N_b,EU, T_b,EU, N_f,EU (Flemish), T_a, f_a, N_f,AJ (AJ)]

The Inferred Model
[Figure: the inferred model on a time axis (years ago). Values shown: 6500, 2300, 52,000, 1800, 10,800, 58,000 (Flemish), 1700, 55%, 7500 (AJ). Annotations: Out-of-Africa? Middle-East/Levant? Early Neolithic migrants? Jewish diaspora?]

European Origins
• Farming began in Europe ≈5-8 kya (“the Neolithic revolution”)
  – Human migration (“demic diffusion”)
  – Spread of ideas (“cultural diffusion”)
• For cultural diffusion, split from Middle-Easterners at ≈40-45 kya.
• We estimate ≈11 kya
• Earlier than ≈5-8 kya perhaps due to
  – Early substructure before actual migration
  – Incomplete replacement of hunter-gatherers
  – Traces of recovery from the Last Glacial Maximum

Confidence Intervals
• Parametric bootstrap:
  o Simulate whole genomes with the maximum likelihood parameters
  o MaCS, Chen et al., Genome Res., 2009
  o Infer using the simulated datasets

Parameter   Maximum likelihood   Bias-corrected mean±SD   95% confidence interval
N_0         6543                 6523±25                  [6475, 6572]
N_b,OOA     2256                 2314±47                  [2223, 2406]
T_b,OOA     53,050               52,007±1561              [48,947, 55,067]
N_f,AJ      7632                 7494±193                 [7116, 7872]
N_b,EU      1556                 1802±28                  [1748, 1857]
T_b,EU      10,600               10,835±188               [10,467, 11,202]
N_f,EU      56,519               57,977±2912              [52,270, 63,685]
T_a         1940                 1686±98                  [1495, 1878]
f_a         55%                  55%±1%                   [53%, 57%]

Hmmm…
• Model specification
• Mutation rate

Mutation Rate
• We used $\mu = 2.35 \cdot 10^{-8}$ per bp per generation: the “phylogenetic rate”.
• The “de-novo rate” is $\approx (1\text{-}1.5) \cdot 10^{-8}$, and would double all population sizes and times.
• We preferred the phylogenetic rate for a few (weak) reasons
  – False negatives may exist in some de-novo studies
  – The de-novo rate does not account for selection
  – With the de-novo rate, the Out-of-Africa time would be >100 kya
• A decrease of 50% in the mutation rate will bring the split time to ≈16 kya
  – Supports the LGM recovery hypothesis
  – Identifies the Middle-East as the source of the recovery (Haber et al., PLoS Genetics, 2013; Pala et al., AJHG 2012)
  – Still suggests genetic discontinuity from the first hunter-gatherers who colonized Europe
• Debate is still open

Model Specification
We tried several alternative models
• All models support >50% European ancestry in AJ and a European-Middle-Eastern split 10-15 kya.
• For example, a two-wave model for the peopling of Europe supports LGM recovery + Neolithic replacement:
[Figure: two-wave model]

Summary & Outlook
• We sequenced 128 healthy AJ genomes to high coverage.
• Our reference panel will improve:
  – Screening of AJ clinical genomes or known disease genes
  – Imputation of AJ SNP arrays
• IBD sharing indicates a very recent bottleneck and expansion.
• The AJ-European joint allele frequency spectrum suggests:
  – Over 50% European ancestry in AJ
  – Europeans diverged from Middle-Easterners only ≈10-15 kya
  – Made possible by sequencing a population with partly Middle-Eastern ancestry
• In the future:
  – Sequence ≈200 more genomes to cover the entire bottleneck
  – Use genomes from more populations to fine-tune demographic models

Thank you!
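
The mutation-rate caveat from the Mutation Rate slide can be made concrete with a short calculation. This is a minimal sketch, assuming (as the "would double" remark implies) that inferred population sizes and times scale in inverse proportion to μ, while the admixture fraction is unaffected. The function name, the dictionary keys, and the value 1.2·10⁻⁸ (one point inside the quoted de-novo range) are illustrative choices, not part of the original analysis.

```python
# Assumption: sizes and times scale as 1/mu; the admixture fraction f_a does not.
MU_PHYLOGENETIC = 2.35e-8  # per bp per generation, the rate used in the talk


def rescale_for_mutation_rate(estimates, mu_new, mu_used=MU_PHYLOGENETIC,
                              unscaled=("f_a",)):
    """Return a copy of `estimates` rescaled to a different mutation rate."""
    factor = mu_used / mu_new
    return {
        name: (value if name in unscaled else value * factor)
        for name, value in estimates.items()
    }


# Bias-corrected means from the Confidence Intervals slide (subset).
estimates = {"N_0": 6523, "T_b,OOA": 52007, "f_a": 0.55}

# Hypothetical de-novo rate inside the quoted (1-1.5)e-8 range.
rescaled = rescale_for_mutation_rate(estimates, mu_new=1.2e-8)

for name, value in rescaled.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the scale factor is close to 2 and the Out-of-Africa time moves past 100 kya, in line with the third reason listed on the slide.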
TAGC consortium members:
Columbia University Computer Science: Itsik Pe’er, Fillan Grady, Ethan Kochav, James Xue, Shlomo Hershkop
Long-Island Jewish Medical Center: Todd Lencz, Semanti Mukherjee, Saurav Guha
Columbia University Medical Center: Lorraine Clark, Xinmin Liu
Albert Einstein College of Medicine: Gil Atzmon, Harry Ostrer, Nir Barzilai, Kinnari Upadhyay, Danny Ben-Avraham
Mount Sinai School of Medicine: Inga Peter, Laurie Ozelius
Memorial Sloan Kettering Cancer Center: Ken Offit, Joseph Vijai
Yale School of Medicine: Judy Cho, Ken Hui, Monica Bowen
The Hebrew University of Jerusalem: Ariel Darvasi
VIB, Gent, Belgium: Herwig Van Marck, Stephane Plaisance
Complete Genomics
Omicia
Funding: Human Frontiers Science program

AJ Genetics
[Figure: AJ effective population size N versus time t (years ago to present), Palamara et al., AJHG 2012. Values shown: 2,300; 270; 45,000; 4,300,000; 800]

Power of imputation by IBD
[Figure: % additional information potential (0% to 100%) versus # of sequenced individuals (0 to 500). Series: WTCCC (UK), AJ_SCZ (AJ)]

Complete Genomics WGS Quality Control
• 128 samples from two labs were sequenced in 3 batches
• Minimal batch effects
• Some results are for the first batch of 57 genomes

Property                        Genome (exome)
Coverage                        ≈56x
Fraction called                 96.7±0.3% (98.1%)
Fraction with coverage > 20x    92.7±1.6% (94.9%)
Concordance with SNP array      99.67±0.25%
Ti/Tv ratio                     2.14±0.004 (3.05)

Quality Control
• False positive rate assessment
  – Counting (the few) hets inside long runs of homozygosity
  – A duplicate sample
[Figure: hets inside a run of homozygosity (ROH)]
• Genome wide extrapolation:
  – SNVs: ≈10-40k FP per genome (FDR: 0.3-1.3%)
  – Indels: ≈10-30k FP per genome (FDR: 2-6%)
• QC:
  – Remove indels and poly-allelic variants
  – Remove HWE violations, low call rate
• FP after QC: ≈5k per genome.
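
The extrapolation on the Quality Control slide can be sketched as follows. This is a minimal illustration, assuming that every heterozygous call inside a long run of homozygosity is an error and that errors are spread uniformly along the genome. The function names and all input numbers are hypothetical, except the 3.4M total SNPs per genome, which is taken from the Variant Statistics slide.

```python
def extrapolate_false_positives(hets_in_roh, roh_bp, genome_bp):
    """Scale the het count seen inside ROH up to the whole genome.

    Assumes hets inside ROH are all false positives and that the
    per-base error rate is the same inside and outside ROH.
    """
    error_rate_per_bp = hets_in_roh / roh_bp
    return error_rate_per_bp * genome_bp


def false_discovery_rate(false_positives, total_calls):
    """Fraction of all calls that are expected to be false."""
    return false_positives / total_calls


# Hypothetical inputs: 120 hets in 30 Mb of ROH, 3.0 Gb callable genome.
fp_per_genome = extrapolate_false_positives(120, 30_000_000, 3_000_000_000)
fdr = false_discovery_rate(fp_per_genome, 3_400_000)

print(f"Estimated false positives per genome: {fp_per_genome:,.0f}")
print(f"Estimated FDR: {fdr:.2%}")
```

The same two steps apply to indels, with the indel count per genome as the denominator.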
Concordance with Arrays
[Figure: discordance with SNP arrays; asymptotic discordance 0.05%]

Processing and Cleaning Pipeline
[Flowchart; boxes grouped by stage]
Inputs:
• AJ: 58 Complete Genomics masterVar (hg19)
• Flemish: 26 Complete Genomics masterVar (hg18)
• AJ complete project: 128 Complete Genomics masterVar (hg19)
Per-cohort processing:
• CGA tools mkvcf → VCF file
• CGA tools → testvariants file; Ti/Tv statistics
• Local cleaning (custom script; Plink/Seq)
  – Remove low-quality, half-called, or non-SNVs
  – Remove variants not fully called in at least one individual
  – Remove inbred individual
  – Liftover hg18 => hg19 (Flemish)
• Cohort-based cleaning → Plink file
  – Remove poly-allelic variants
  – Remove variants with high no-call rate or that are not in Hardy-Weinberg equilibrium
• Summary stats, array concordance, and duplicates analyses
• Monomorphic non-ref and runs-of-homozygosity analyses
• seqphase: Phase using molecular phasing information
• SHAPEIT: Phase and impute sporadically missing values
Merging AJ and Flemish:
• Initial filtering
  – Variant in both cleaned files? Keep
  – Variant in one cleaned file and in the VCF of the other? Discard
  – Variant in one cleaned file and not at all in other? Keep and set other as hom-ref
• Remove coordinates with reference mapping problem
• Remove variants with AJ-Flemish incompatible alleles
• Merge AJ-Flemish genotypes
• Remove variants incompatible with 1000 Genomes
• SHAPEIT, using 1000 Genomes panel: Phase and impute sporadically missing genotypes
• Validate AJ ancestry
• Validate no cryptic relatedness

Mobile Element Insertions (MEIs) & Copy Number Variants (CNVs)
Initial validation efforts suggested a high false discovery rate, at least for novel events.
1000 Genomes MEIs
Novel MEIs:
• 3/11 validated
• Strong batch effect

Variant Statistics

Statistic              Per genome (exome)
Total SNPs             3.4M (22k)
Novel SNPs             3.8% (4.1%)
Het/hom ratio          1.65 (1.67)
Insertions count       220k (242)
Deletions count        235k (223)
Substitutions count    83k (374)
Synonymous SNPs        10,536
Non-synonymous SNPs    9706
Nonsense SNPs          72
Other disrupting       255
CNV count              302
SV count               1480
MEI count              4090

Imputing AJ Arrays
Compare imputation accuracy of AJ SNP arrays when using either AJ or European reference panels.
[Flowchart: experimental design]
• Inputs: 1000 Genomes CEU (87); Phased AJ Sequences (57); AJ Arrays (1000)
• 7 of the 57 AJ sequences are reduced to unphased arrays and added to the 1000 AJ arrays (1007), then phased (ShapeIT) to form the Study Panel (1007)
• Reference Panel 1 (50): the remaining 50 AJ sequences
• Reference Panel 2 (87): CEU
• Reference Panel 3 (137): 50 AJ + 87 CEU
• Impute (Impute2) the study panel with each reference panel → Imputed Study Panels 1, 2, 3

Mutation Burden in AJ
• Theoretically, a narrow bottleneck should increase the load of deleterious variants (e.g., Lohmueller, Nature, 2008)
  o Or not? (Simons et al., arXiv, 2013)
  o Expect higher load in AJ.
• Define deleterious:
  o Derived? Minor? Non-reference? Rare?
  o How to weight each variant?
  o Account for demography, sequencing errors?
  o Define significance?
• Compare 26 AJ and 26 Flemish.
• AJ have between 1-10% more deleterious variants than expected (using Flemish as baseline). P-values between 0.2 and $10^{-60}$.
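
One way to set up the AJ versus Flemish burden comparison is sketched below. The slides do not say which test produced the quoted P-values, so the one-sided permutation test here is an assumption chosen for illustration, and a permutation test with a few thousand shuffles cannot resolve P-values anywhere near $10^{-60}$. The function names and the per-genome counts are hypothetical.

```python
import random
from statistics import mean


def burden_ratio(study_counts, baseline_counts):
    """Ratio of mean per-genome deleterious-variant counts (study / baseline)."""
    return mean(study_counts) / mean(baseline_counts)


def permutation_p_value(study_counts, baseline_counts, n_perm=2000, seed=0):
    """One-sided permutation P-value for a higher mean in the study group."""
    rng = random.Random(seed)
    observed = mean(study_counts) - mean(baseline_counts)
    pooled = list(study_counts) + list(baseline_counts)
    n_study = len(study_counts)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = mean(pooled[:n_study]) - mean(pooled[n_study:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical counts for 26 AJ and 26 Flemish genomes.
aj_counts = [420 + (i % 5) for i in range(26)]
flemish_counts = [400 + (i % 5) for i in range(26)]

ratio = burden_ratio(aj_counts, flemish_counts)
p_value = permutation_p_value(aj_counts, flemish_counts)
print(f"AJ/Flemish ratio: {ratio:.3f}, permutation P-value: {p_value:.4f}")
```

How a variant is classed as deleterious (derived, minor, non-reference, rare) changes the counts fed into this comparison, which is one reason the reported excess spans a range rather than a single value.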

Mutation Burden in Disease Categories
• Many diseases have been suggested to be more prevalent in AJ (Goodman 1979)
  o Several Mendelian disorders
  o Some cancers
  o Inflammatory bowel diseases
  o Diabetes, obesity
  o Some psychiatric diseases, myopia
• Annotate genes according to disease category (Omicia Inc).
• Compare non-synonymous variant load between AJ and Flemish.

Disease category     #genes   AJ/FL ratio
Aging                106      1.07
Infectious           70       1.03
Neonatal             956      1.02
Gastrointestinal     254      1.02
Dental               86       1.01
Immunological        474      1.01
Hemic                202      1.01
Cardiovascular       502      1.01
Endocrinological     750      1.01
Oncological          471      1.01
Women’s              39       1.00
Drug                 82       1.00
Neurological         980      1.00
Nutrition            29       0.99
Respiratory          187      0.99
Kidney               285      0.96
Psychiatric          21       0.93

• No category comes out significant in Gene Set Enrichment Analysis.

Het/Hom Ratio
[Figure: schematic with a time axis t (years ago to present). Labels: AJ, EU, IBD observed]
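
As a numerical companion to the "Coverage by IBD: Theory" slide, the approximation for the expected coverage can be evaluated directly. This is a minimal sketch of the approximate formulas only ($c \approx c_{\max}(1 - e^{-x/x_0})$ and the diploid correction), not of the exact solution. The function names and the parameter values in the usage lines are hypothetical and were not taken from the talk.

```python
import math


def coverage_haploid(x, G, alpha=1.0):
    """Approximate haploid coverage: c_max * (1 - exp(-x / x0)).

    x = alpha * n_r / B  (scaled reference panel size)
    G = g * m            (bottleneck time in generations times length cutoff in Morgan)
    """
    c_max = alpha * math.exp(-G) * (1 + G)
    x0 = math.exp(G) * (1 + G) / (1 + 2 * G)
    return c_max * (1 - math.exp(-x / x0))


def coverage_diploid(x, G, alpha=1.0):
    """Diploid coverage, treating the two haplotypes as independent."""
    c_hap = coverage_haploid(x, G, alpha)
    return 1 - (1 - c_hap) ** 2


# Hypothetical inputs: 256 reference haplotypes, bottleneck of 600 haplotypes
# 30 generations ago, 3 cM cutoff, half of the lineages passing the bottleneck.
alpha, n_r, B, g, m = 0.5, 256, 600, 30, 0.03
x, G = alpha * n_r / B, g * m

print(f"haploid coverage: {coverage_haploid(x, G, alpha):.3f}")
print(f"diploid coverage: {coverage_diploid(x, G, alpha):.3f}")
```

The approximation reproduces the limits listed on the slide: it is linear in $x$ with slope $\alpha e^{-2G}(1+2G)$ for small panels, saturates at $\alpha e^{-G}(1+G)$ for large ones, and reduces to $\alpha(1 - e^{-x})$ as $G \to 0$.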