Sequencing 128 Ashkenazi Genomes: Implications for Medical Genetics and History
Shai Carmi
Department of Computer Science, Columbia University
Itsik Pe’er’s lab
UCLA, October 2014

Outline
• Ashkenazi Jewish Genetics: Background
• The Ashkenazi Genome Sequencing Project
• Segment Sharing and Population History
• Opportunities and Future Directions

Ashkenazi Jewish (AJ) Genetics: Significance
Medical genetics
• Large founder population
• Mendelian disorders
• Complex diseases
  o Breast cancer, Parkinson’s, Crohn’s
Population genetics
• Debated origins
• Genetics of a founder event
mtDNA: Behar et al., 2004; Behar et al., 2006
Y chr: Behar et al., 2003; Behar et al., 2004
Disease genes: Risch et al., 2003; Slatkin, 2004
SNP arrays: Gusev et al., 2012; Palamara et al., 2012
Review: Ostrer and Skorecki, 2013

Founder Populations: Opportunities
[Figure: population size and disease alleles over time, from the bottleneck to the present, in a non-founder population and a founder population]
Recent successes
• Greece
  o Tachmazidou et al., 2013; HDL
• Finland
  o Kurki et al., 2014; aneurysm
• Iceland
  o Many papers; most recently Steinthorsdottir et al., 2014; T2D
• Ashkenazi Jews
  o Hui et al., in preparation; Crohn’s
See also:
• Hatzikotoulas et al., 2014
• Zuk et al., 2014
Problem: Common genotyping platforms do not include alleles that are rare outside the founder population

Opportunities: Reduced Haplotype Diversity
[Figure: imputation across the chromosomes in the sample. Observed data: full sequence and partial sequence (SNP array, low-coverage sequence). Inferred sequence: nearly-complete inferred sequence]
Problem: The Ashkenazi population is missing a reference panel of complete sequences

Opportunities: Personal Genomics in AJ
• Personal clinical genomics is here
• But genomes are hard to interpret
Problem: The Ashkenazi population is missing a reference panel of complete sequences

The Documented Ashkenazi History
• Ca. 1000: Small communities in Northern France, Rhineland
• Migration east
• Expansion
• Migration to US and Israel

• Origin?
• Founder event?
• European gene flow:
  o Where?
  o When?
  o How much?
• Relation to other Jews?
Whole-genomes?

Outline: The Ashkenazi Genome Sequencing Project

The Ashkenazi Genome Consortium
• NY area labs interested in specific diseases
• Phase I: 128 whole genomes (Completed*)
• Phase II: ≈500 whole genomes (NYGC; under way)
• Impute large cohorts of AJ cases
• Quantify utility in medical genetics
• Learn about population history
* Carmi et al., Nat Commun, 2014

Technical Details
• Ashkenazi ancestry verified
• Some phenotypes exist
• Sequencing by Complete Genomics in three batches
  o Uniform QC measures

Property                   Genome (exome)
Coverage                   ≈56x
Fraction called            96.7±0.3% (98.1%)
Concordance with arrays    99.67±0.25%
Ti/Tv ratio                2.14±0.004 (3.05)

[Figure: heterozygous calls (hets) within runs of homozygosity (roh)]
• Error rate estimates
  o Using runs-of-homozygosity and a duplicate
  o SNVs: ≈10-40k errors per genome (FDR: 0.3-1.3%)
  o Indels: ≈10-30k errors per genome (FDR: 2-6%)
• QC: Remove indels, poly-allelic variants, Hardy-Weinberg violations, low call rate
• Errors after QC: ≈5k per genome

Comparison to Europeans
Comparison panels:
• 26 Flemish from Belgium (platform-matched)
• 87 North-West Europeans [CEU (1000 Genomes)]
[Figure: fraction novel (%) relative to dbSNP135; population-specific variants (25x25 genomes)]

AJ Clinical Genomics
An Ashkenazi reference panel filters more benign variants than a European panel.

AJ Medical Genetics: Imputation
An Ashkenazi reference panel improves imputation accuracy of AJ SNP arrays compared to the standard European panel.
[Figure: correlation between imputed and real data, using Impute2]
Rare variants (≤1%) accuracy: 87% vs 65%

AJ Medical Genetics: Applications
• Our consortium:
  o An expanded carrier screening panel
  o Pharmacogenetically-important alleles
  o Low-frequency deletions in tumors
  o Association studies: schizophrenia, Parkinson’s, Crohn’s, longevity, cancer
• Others:
  o Frequency lookups (clinical/pedigrees)
  o Association studies: Epilepsy, Autism, …

Principal Component Analysis (PCA)
[Figure: PCA plot. Labels: Middle-East, Ashkenazi Jews, Europe; Druze, French, Tuscans, Palestinians, Flemish, Italians, Bedouins, Sephardi Jews (Italy, Turkey), Sardinians, Basque]
Price et al., 2008; Olshen et al., 2008; Need et al., 2009; Kopelman et al., 2009; Atzmon et al., 2010; Behar et al., 2010; Bray et al., 2010; Guha et al., 2012; Behar et al., 2014

The Documented Ashkenazi History
• Origin?
• Founder event?
• European gene flow:
  o Where?
  o When?
  o How much?
• Relation to other Jews?

Variant Discovery Rate
[Figure: number of variants; predicted number of new variants]
Heterozygosity paradox?

A Model for Ancient History
[Figure: demographic model. Labels: Out-of-Africa, Middle East, European gene flow into AJ; 25x25 genomes]

The Documented Ashkenazi History
• Origin?
• Founder event?
• European gene flow:
  o Where?
  o When?
  o How much?
• Relation to other Jews?

Outline: Segment Sharing and Population History

Identical-by-Descent (IBD) Shared Segment
[Figure: an IBD segment shared through a common ancestor g generations back. After Browning & Browning, 2012]
Formal definition: A contiguous segment inherited from a single, recent common ancestor.
What’s “recent”?
Practical definition: A contiguous segment nearly identical over a sequence length longer than a cutoff.
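
The practical definition can be illustrated with a minimal sketch. It assumes two already-phased haplotypes coded as 0/1 alleles at the same sites, plus the genetic position of each site in cM. The function name `shared_segments` and its parameters are hypothetical, and the sketch requires exact identity, whereas real IBD detectors tolerate some mismatches ("nearly identical") and handle phasing errors.

```python
from typing import List, Sequence, Tuple


def shared_segments(hap_a: Sequence[int],
                    hap_b: Sequence[int],
                    pos_cm: Sequence[float],
                    min_len_cm: float = 1.0) -> List[Tuple[float, float]]:
    """Return (start_cM, end_cM) of maximal runs of identical alleles
    whose genetic length is at least min_len_cm.

    Simplification (assumption): a single mismatch ends a run.
    """
    if not (len(hap_a) == len(hap_b) == len(pos_cm)):
        raise ValueError("haplotypes and positions must have equal length")

    segments: List[Tuple[float, float]] = []
    run_start = None  # index of the first site of the current matching run
    n = len(pos_cm)
    # Iterate one step past the end so a run reaching the last site is closed.
    for i in range(n + 1):
        match = i < n and hap_a[i] == hap_b[i]
        if match and run_start is None:
            run_start = i
        elif not match and run_start is not None:
            start, end = pos_cm[run_start], pos_cm[i - 1]
            if end - start >= min_len_cm:
                segments.append((start, end))
            run_start = None
    return segments


# Toy data: one mismatch at the fifth site splits the region into two runs.
hap_a = [0, 1, 1, 0, 1, 0, 0, 1]
hap_b = [0, 1, 1, 0, 0, 0, 0, 1]
pos_cm = [0.0, 0.5, 1.0, 1.5, 2.0, 2.2, 2.4, 2.6]
print(shared_segments(hap_a, hap_b, pos_cm, min_len_cm=1.0))
```

Only the first run passes the 1 cM cutoff in this toy input; lowering `min_len_cm` also reports the shorter run, which is the cutoff dependence the practical definition refers to.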
• Requires strong genetic drift
• Segments are rare but long
  o Probability of a site to be shared ~ $2^{-2g}$
  o Segment length ~ $g^{-1}$
• Current methods can detect segments ≳1cM

Applications
• A segment indicates recent co-ancestry:
  o Disease mapping
  o Pedigree reconstruction
  o Detecting natural selection
  o Demographic (historical) inference
  o Estimating mutation rates
• Identical sequence across individuals:
  o Resolving haplotypes (phasing)
  o Imputation
  o Estimating heritability
  o Estimating genotyping error rate
Eskin’s lab

IBD Sharing Theory
• Model:
  o A population with a constant effective size N
  o Two chromosomes of length L (Morgans)
  o A minimal segment length m (Morgans)
• The number of shared segments $n_m$?
• The fraction of the chromosome in shared segments $f_m$?
[Figure: a chromosome of length L carrying three shared segments of lengths ℓ1, ℓ2, ℓ3, each longer than m]
Example: $n_m = 3$; $f_m = (\ell_1 + \ell_2 + \ell_3)/L$

Results overview
• Under the Sequentially Markov Coalescent (SMC):
• The number of shared segments:
  $\langle n_m \rangle = \frac{2NL}{(1+2mN)^2}$; $\mathrm{Var}[n_m] \approx \frac{L}{2m^2N}$
• The fraction of the chromosome in shared segments:
  $\langle f_m \rangle = \frac{1+4mN}{(1+2mN)^2}$; $\mathrm{Var}[f_m] \approx \frac{\log[L/m]}{NL}$
• Results for a more realistic coalescent model (SMC’)
• Implicit expressions for the distributions
• All results generalizable to variable population size
Palamara et al., 2012; Carmi et al., Genetics, 2013; Carmi et al., Theor Popul Biol, 2014

Demographic Inference: Maximum Likelihood
Use the distribution of the number of shared segments
Carmi et al., Theor Popul Biol, 2014

Demographic Inference: A Practical Approach
• Historical size $N(t) = N_0 \nu(t)$.
• Mean fraction of the genome in segments of length $\ell_1 < \ell < \ell_2$:
$$P(\ell_1, \ell_2) = \int_0^\infty \frac{e^{-\int_0^t dt'/\nu(t')}}{\nu(t)} \left[ e^{-2\ell_1 N_0 t}\,(1 + 2\ell_1 N_0 t) - e^{-2\ell_2 N_0 t}\,(1 + 2\ell_2 N_0 t) \right] dt \tag{1}$$
[Figure: hypothetical example]
Method:
• Record IBD segments in each length bin
• Using Eq. (1), find the history N(t) that fits best
Palamara et al., 2012

IBD Sharing in Ashkenazi Jews
[Figure: IBD sharing in AJ and EU. Atzmon et al., 2010; Bray et al., 2010; Gusev et al., 2012]
A pair of AJ individuals shares ≈50cM in ≈15 long segments (>3cM)

Inferring the Bottleneck Size and Time
[Figure: inferred population size over time (years)]
Carmi et al., Nat. Commun., 2014; Palamara et al., 2012

Caveats
• Phasing and sequencing errors; IBD detection errors
• Reasonable power only for 10-50 generations ago
• Model specification (e.g. prolonged bottleneck, admixture)

Parameter                       95% confidence interval
Ancestral size                  3654-5856
Bottleneck size                 249-419
Growth rate (per generation)    16-53%
Bottleneck time (years)         625-800

• A bottleneck 700 years ago confirmed by an independent method: lengths of haplotypes around rare variants
  o Mathieson and McVean, 2014

The Documented Ashkenazi History
• Origin?
• Founder event?
• European gene flow:
  o Where?
  o When?
  o How much?
• Relation to other Jews?

Outline: Opportunities and Future Directions

Coverage by Shared Segments
A sequenced reference panel
What fraction of the genome can we cover with shared segments?
[Figure: imputing a partly sequenced genome from the reference panel. Full sequence; partial sequence; nearly-complete inferred sequence]

The Era of Near-Complete Coverage
[Figure: genome coverage by shared segments. Markers: Now, Phase II. Mine public data? Other studies?]
Opportunities:
• Interpret personal genomes
  o Time-stamp rare mutations
• Cost-effective large-scale association studies
  o Resolve haplotypes
  o Impute SNP arrays or low-coverage sequences
  o Map rare variants/haplotypes
See Carmi et al., Genetics, 2013 for a theoretical analysis
Time-stamp rare mutations: new algorithms needed!

Ashkenazi History
• Origin?
• Founder event?
• European gene flow:
  o Where?
  o When?
  o How much?
• Relation to other Jews?

The Place of European Gene Flow
“Most of these theories … are myths or speculation … based on some vague or misunderstood references. … It will probably be impossible to say definitely where the hundreds or thousands of Jews in Poland in the 13th to 14th centuries came from.”
B. Weinryb, The Jews of Poland, 1972

Approach
[Figure: an Ashkenazi genome divided into segments labeled ME or EU; PCA plots (PC1 vs PC2) placing the segments relative to ME, EU, and AJ samples]
Johnson et al., 2011; Moreno-Estrada et al., 2013

Preliminary Results
• Origin in the Levant
• Gene flow mostly from West-Europe, about 30 generations ago
• Sex-imbalanced history?

Summary
• It is important to study Ashkenazi genetics
• We sequenced 128 whole-genomes
• Useful for personal clinical genomics and imputation
• Segment sharing reveals a founder event and suggests opportunities
My research statement

Acknowledgements
Itsik Pe’er’s lab: James Xue, Ethan Kochav, Shuo Yang, Pier Palamara, Vladimir Vacic
Harvard University: Peter Wilton, John Wakeley
Sheba Medical Center: Eitan Friedman
TAGC consortium members:
• Todd Lencz, Semanti Mukherjee (LIJMC)
• Lorraine Clark, Xinmin Liu (CUMC)
• Gil Atzmon, Harry Ostrer, Danny Ben-Avraham (AECOM)
• Inga Peter, Judy Cho (ISMMS)
• Ariel Darvasi (HUJI)
• Joseph Vijai (MSKCC)
• Ken Hui (Yale)
VIB Ghent, Belgium
Funding: Human Frontier Science Program

Thank you for your attention!
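
Backup: a numerical check of the IBD sharing formulas

As a companion to the "Results overview" and "Demographic Inference: A Practical Approach" slides, the sketch below evaluates Eq. (1) numerically and compares it, for a constant population size (ν(t) = 1), with the closed-form mean fraction (1 + 4mN)/(1 + 2mN)². It is an illustration under stated assumptions, not the authors' implementation. The function names, the trapezoid grid, and the cutoff `t_max` are all hypothetical choices, and Eq. (1) is used as read from the slide, with time t in the scaled units that make the constant-size coalescence density e^(-t).

```python
import math
from typing import Callable


def mean_fraction_constant(m: float, N: float) -> float:
    """Closed-form mean fraction in segments longer than m (constant size N)."""
    x = 2.0 * m * N
    return (1.0 + 2.0 * x) / (1.0 + x) ** 2


def mean_number_constant(m: float, N: float, L: float) -> float:
    """Closed-form mean number of segments longer than m on a chromosome of length L."""
    return 2.0 * N * L / (1.0 + 2.0 * m * N) ** 2


def _tail(length: float, N0: float, t: float) -> float:
    """exp(-2*l*N0*t) * (1 + 2*l*N0*t); taken as 0 for an infinite length."""
    if math.isinf(length):
        return 0.0
    x = 2.0 * length * N0 * t
    return math.exp(-x) * (1.0 + x)


def fraction_in_bin(l1: float, l2: float, N0: float,
                    nu: Callable[[float], float],
                    t_max: float = 40.0, steps: int = 100_000) -> float:
    """Trapezoid-rule evaluation of Eq. (1) for segment lengths in (l1, l2).

    nu(t) is the relative population size N(t)/N0 at scaled time t.
    t_max truncates the infinite integral (assumption: the tail is negligible).
    """
    dt = t_max / steps
    cum_hazard = 0.0                      # integral of 1/nu(t') from 0 to t
    prev_rate = 1.0 / nu(0.0)
    prev_val = prev_rate * (_tail(l1, N0, 0.0) - _tail(l2, N0, 0.0))
    total = 0.0
    for k in range(1, steps + 1):
        t = k * dt
        rate = 1.0 / nu(t)
        cum_hazard += 0.5 * (prev_rate + rate) * dt
        val = math.exp(-cum_hazard) * rate * (_tail(l1, N0, t) - _tail(l2, N0, t))
        total += 0.5 * (prev_val + val) * dt
        prev_rate, prev_val = rate, val
    return total


if __name__ == "__main__":
    m, N = 0.01, 1000.0                   # 1 cM cutoff, illustrative size
    numeric = fraction_in_bin(m, math.inf, N, lambda t: 1.0)
    print(numeric, mean_fraction_constant(m, N))
```

Passing a different `nu` (for example a step function with a temporary drop in size) gives the expected sharing per length bin under that history, which is the quantity the "Method" bullets compare with the recorded IBD segments.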