C. elegans population diversity

Justin Gerke, Joshua Shapiro, Erik Andersen, Marie-Anne Felix, Leonid Kruglyak
Princeton University; Institut Jacques-Monod

C. elegans
• 1 mm long as adults
• Found worldwide in rotting vegetation/compost
• Hermaphroditic with rare males
• Low recombination rate
• 100 Mb genome, 6 chromosomes

C. elegans
• Biomedical Research
• Evolutionary Biology

C. elegans population diversity
• Studies of a small number of loci
  – MtDNA, PCR amplicon sequencing, SSRs, AFLPs
• One genome-wide study of biased SNPs
  – SNPs identified from a single strain pair
• Goal: genome-wide, unbiased sample

203 Wild Isolates

RAD-Seq Library Construction
[Diagram comparing "Our Oligos" with "PLoS One 2008"; labels: Barcode, Tailed Amplification]
• Faster Protocol
• More Mappable Reads
• Lower Cost for Barcoded Oligos

RAD seq data management goals
• Create a central repository for reads and associated metadata
• Easily share data among lab members
• Make it easy to operate on subsets of data
  – split multiplexed runs
  – combine multiple runs
  – select mapped data by genomic location
• Integrate with analysis tools
  – scripts to link database to external software
  – ability to pull data directly into R, Python, Perl, etc.

RAD seq data workflow
[Flowchart; box labels as extracted]
• MySQL Database
• Raw reads & run info: normalize quality scores, check barcodes, check restriction site
• SELECT RAW READS by barcode, restriction site, run, quality, length
• CREATE SUMMARY TABLES by restriction site location, strain, run, mapping parameters
• Remove barcodes; reads for mapping
• SELECT MAPPED READS by location, strain
• Remove restriction sites
• bwa, samtools
• Pileup files

Reads per strain
[Figure: histograms in two panels (reads, mapped); x-axis: read count (millions), 0 to 10; y-axis: number of strains.]

Mapping rate by strain
[Figure: histogram; x-axis: fraction of reads mapped, 0.0 to 1.0; y-axis: number of strains; one annotation reads "C. briggsae".]
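The barcode and restriction-site checks in the workflow can be sketched in a few lines of Python. This is a minimal sketch, not the lab's pipeline: the function name `demultiplex`, the barcode table, and the default site remnant `AATTC` (taken from the example reads shown on the base-calling slides) are assumptions made for illustration, and the real workflow runs these selections against the MySQL database.

```python
from collections import defaultdict


def demultiplex(reads, barcodes, site="AATTC"):
    """Split multiplexed reads by leading barcode.

    reads    -- iterable of read sequences (strings)
    barcodes -- dict mapping barcode sequence -> strain name (hypothetical)
    site     -- bases expected right after the barcode (assumed remnant
                of the restriction site)

    Returns (reads_by_strain, n_rejected). Kept reads have the barcode
    removed but still start with the restriction-site remnant.
    """
    by_strain = defaultdict(list)
    rejected = 0
    for seq in reads:
        for barcode, strain in barcodes.items():
            # Check barcode, then check the restriction site behind it.
            if seq.startswith(barcode) and seq.startswith(site, len(barcode)):
                by_strain[strain].append(seq[len(barcode):])
                break
        else:
            # No barcode matched, or the site check failed.
            rejected += 1
    return dict(by_strain), rejected


example_reads = ["ACGTAATTCGGG", "TTTTAATTCAAA", "ACGTGGGGG", "CCCCAATTC"]
example_barcodes = {"ACGT": "N2", "TTTT": "CB4856"}
kept, n_rejected = demultiplex(example_reads, example_barcodes)
```

Removing the site remnant itself happens later in the workflow (after mapping), so the sketch leaves it on the read.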
Genomic distribution of mapped restriction sites
[Figure: one panel per chromosome (I, II, III, IV, V, X); x-axis: location (Mb); y-axis: number of mapped restriction sites.]

Restriction Site Coverage
[Venn diagram; values in parentheses are for the reference strain only]
• Observed and present in the reference sequence: 38,628 (38,541)
• Observed restriction sites only: 2,097 (69)
• Reference sequence restriction sites only: 985 (1,072)

Genomic locations of 'missing' restriction sites
[Figure: one panel per chromosome (I, II, III, IV, V, X); x-axis: location (Mb); y-axis: number of 'missing' restriction sites.]

Restriction site proximity results in unsampled segments

Missing restriction sites tend to be near another site
[Figure: histograms in two panels (Present, Missing); x-axis: distance to nearest restriction site, 0 to 5000; y-axis: number of restriction sites.]

High variability between read counts at the same restriction site
[Figure annotation: r² = 0.15]

Distribution of bias in read depth
[Figure: histogram; x-axis: log2 ratio of read counts, 0 to 7; y-axis: number of restriction sites.]

Base calling procedures
• Pull sequences from the database that start at defined locations
  – seen in at least 2 strains
  – average reads per strain > 5
  – do not require restriction site in reference
• Compile reads in pileup format
• Only call SNPs in the 91 bp after the restriction site

SNPs outside expected segments are error-prone
[Alignment figure; bracket label: 91 bp]
  AATTCCATGGTTAGTG...GACCGGGTAGCTTCTACACATGACTA
  AATTCCATGGTTAGTG...GACCGGGTAGCTTC
  AATTCCATGGTTAGTG...GACCGGGTAACTTC
  AATTCCATGGTTAGTG...GTCCGGGTAGCTTC
  AATTCCATGGTTAGTG...GACCGG-TAGCTTCT
  AATTCCATGGTTAGTG...GACCGG-TAGCTTCA
  AATTCCATGGTTAGTG...GACCGG-TTGCTTCA

Base calling continued
• 7.85 Mb from 86,211 segments
• Mean coverage depth is 29 reads/site, median 18
• Calls generated by samtools are quite good
  – homozygosity of individuals makes this much easier
  – only 51 SNPs in reference strain (one of the lowest coverage strains) with quality score ≥ 60
• Tested by comparison of duplicate libraries
  – prepared separately, run separately
  – conservative estimator, as coverage per run tends to be lower than in full data set

Errors per run by quality cutoff
[Figure: scatter plot; x-axis: quality score cutoff; y-axis: errors per run.]

Called sites by quality cutoff
[Figure: scatter plot; x-axis: quality score cutoff; y-axis: number of called sites.]

Error rate per called site
[Figure: scatter plot; x-axis: quality score cutoff; y-axis: errors per called site.]

Error rate per called SNP (FDR)
[Figure: scatter plot; x-axis: quality score cutoff; y-axis: errors per called SNP.]

Parameters for SNP calling
• Optimized for analysis of structure & association mapping
• SNPs identified as sites with at least one sample of each allele having quality score ≥ 120
• Remaining bases called where quality ≥ 60
• ~61,000 total SNPs
• ~40,000 with frequency ≥ 2, with good calls in at least 95% of strains

Number of SNPs per strain (vs. reference)
[Figure: bar chart; x-axis: strain; y-axis: number of SNPs (vs. reference), 0 to 15,000.]
Strains (x-axis order): N2 LSJ1 JU1616 JU1615 JU1200 JU1586 JU1566 JU1568 ED3024 ED3028 ED3010 ED3015 ED3014 ED3012 ED3023 JU693 JU694 JU361 JU367 EG4346 EG4347 JU394 JU399 DR1344 CB4857 CB4851 JU406 PX178 PX179 JU1563 CB4852 JU1516 JU440 JU1395 EG4945 AB1 EG4957 EG4680 EG4948 EG4689 EG4951 JU318 EG4946 CB4932 JU316 JU313 JU321 JU311 JU317 JU314 PB303 JU561 JU563 JU1440 JT11362 JT11398 JT11399 JU393 JU1207 JU1214 JU1212 PX174 RC301 RW7000 ED3048 ED3042 CX11259 CX11292 JU1088 JU262 AB4 JU315 CB4854 JU1401 JU1896 JU1530 JU1026 CX11278 JU1410 JU1411 CX11319 CB3198 JU1409 JU1037 CX11317 AB2 DR1350 CB4855 CX11305 JU1172 CB4858 JU299 ED3075 JU1040 JU1204 JU438 ED3017 JU1206 JU1039 JU310 CB4853 CX11271 ED3020 ED3019 JU263 ED3073 CX11258 JU848 JU774 ED3043 JU342 LKC34 JU347 CX11254 JU322 ED3040 JU323 ED3052 CX11316 CX11276 JU642 JU362 JU363 JU398 JU397 JU401 ED3077 JU801 JU368 JU360 ED3011 JU847 CX11315 DL226 JU1652 JU345 JU1230 JU346 MY1 JU830 JU829 ED3005 JU1491 JU1484 CX11268 JU1218 ED3021 JU1243 JU1242 KR314 ED3049 CX11307 CX11314 JU1246 DL200 ED3046 JU778 PB306 MY6 MY18 JU531 QX1218 JU533 QX1233 JU751 JU1213 JU1482 MY7 EG4724 JU622 MY10 JU792 JU396 JU395 JU1400 CX11262 CX11294 JU1511 JU1581 JU1582 JU1522 CX11264 CX11285 CX11321 EG4348 EG4349 MY16 EG4725 PS2025 JU782 JU799 JU1580 DL238 CB4856 JU258 JU775 JU1171 MY2 MY14 MY23 MY15 QX1216 QX1211
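The two-threshold rule on the "Parameters for SNP calling" slide can be illustrated with a short Python sketch. It is an interpretation under stated assumptions, not the authors' code: the function name `call_site` and the input layout (one consensus base and quality per strain, which relies on the strains being homozygous) are hypothetical, and leaving a base uncalled when its allele has no sample at quality ≥ 120 is one reading of "each allele", not something the slides spell out.

```python
def call_site(calls, discover_q=120, call_q=60):
    """Apply the two-threshold rule at a single site.

    calls -- dict mapping strain -> (base, quality); hypothetical layout.

    A site counts as a SNP when at least two different alleles are each
    supported by at least one strain with quality >= discover_q.
    Individual strains are then called where quality >= call_q and the
    base is one of those supported alleles (assumption); everything
    else is left uncalled (None).
    """
    supported = {base for base, q in calls.values() if q >= discover_q}
    is_snp = len(supported) >= 2
    genotypes = {
        strain: base if (q >= call_q and base in supported) else None
        for strain, (base, q) in calls.items()
    }
    return is_snp, genotypes


site_calls = {
    "strain_a": ("G", 150),
    "strain_b": ("T", 130),
    "strain_c": ("G", 70),
    "strain_d": ("T", 40),
}
is_snp, genotypes = call_site(site_calls)
```

In this example the site is a SNP because both G and T have a strain above 120; `strain_c` is still called at quality 70, while `strain_d` falls below 60 and stays uncalled.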
Overall SNP Frequency Spectrum
[Figure: series "observed" and "expected"; x-axis: allele frequency, 0.1 to 0.5; y-axis: number of sites, 0 to 20,000.]

Polymorphism is highest on chromosome arms
[Figure: one panel per chromosome (I, II, III, IV, V, X); x-axis: location (Mb); y-axis: π or θw (per kb).]

Tajima's D is usually negative, sometimes for extended regions
[Figure: one panel per chromosome (I, II, III, IV, V, X); x-axis: location (Mb); y-axis: Tajima's D.]

[Figure: the 203 strain labels in a different order from the per-strain SNP plot; tick-label residue omitted.]

Many Clonal Isolates
• 203 strains, grouped by R > 0.98 plus hand-curation
  – 43 unique strains
  – 50 clonal sets
• Most sets are location-specific
• Some found across continents
  – Germany/Chile
  – Australia/USA
  – USA/France

SNP Pruning
1. Imputation (NPUTE), 99.7% accuracy
2. Pruning (PLINK): 40-SNP windows, r² > 0.5
• 40,000 SNPs before pruning, 8,000 SNPs after

Analysis by STRUCTURE
• k = 1

PCA Separates France and USA

C. elegans population sample
• Mild population structure detected by PCA
  – no distinct populations
• 203 strains = 93 unique haplotypes
• Segment sharing?
  – GERMLINE

Most Strains are Related

Most Strains Share 1/3 of Genome

IBD Segments are Large
[Figure: example pair ED3010 - DL226]

Three Unrelated Strains
• 93 genome-wide haplotypes
• All but 255 pairwise comparisons share an IBD segment (~94% of comparisons)
• 250 of the 255 are due to three strains: CB4856, DL238, QX1211/1216

Multiple Highly Shared IBD Segments

Chr V Segment is Widespread

Chr V Segment in France

What could cause this pattern?
• Recent Migration
• Selection
  – Background Selection
  – Positive Sweep
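The pruning step on the "SNP Pruning" slide (40-SNP windows, r² > 0.5) can be illustrated with a simplified greedy pass in Python. This is a sketch of the general idea only, not PLINK's implementation: PLINK slides its window with a step size that the slides do not give, and the function names `r_squared` and `prune_ld`, the 0/1 allele coding, and the keep-the-first-SNP rule are all assumptions made here.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two allele vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic column: treat as uncorrelated
    return cov * cov / (var_x * var_y)


def prune_ld(snps, window=40, r2_max=0.5):
    """Greedy LD pruning (simplified, hypothetical variant).

    snps -- list of SNP columns in map order; each column is a list of
            0/1 allele codes, one entry per strain.

    Walk along the map and keep a SNP unless its r^2 with an already
    kept SNP fewer than `window` positions back exceeds r2_max.
    Returns the indices of the kept SNPs.
    """
    kept = []
    for j, column in enumerate(snps):
        in_ld = False
        for i in reversed(kept):
            if j - i >= window:
                break  # earlier kept SNPs are outside the window
            if r_squared(snps[i], column) > r2_max:
                in_ld = True
                break
        if not in_ld:
            kept.append(j)
    return kept


toy_snps = [
    [0, 0, 1, 1, 0, 1],  # SNP 0
    [0, 0, 1, 1, 0, 1],  # SNP 1: identical to SNP 0, r^2 = 1
    [0, 1, 0, 1, 0, 1],  # SNP 2: weakly correlated with SNP 0
]
kept_indices = prune_ld(toy_snps)
```

On the toy data SNP 1 is dropped as a perfect proxy for SNP 0, while SNP 2 is retained.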
Main Conclusions
• Technique extremely successful
• Direct sequencing is crucial to accurately depict strain relationships
  – rare variants
  – diverged populations with private alleles
[Figure: number of SNPs per strain (vs. reference) in two panels, "Ascertained" and "All"; x-axis: strain, in the same order as the earlier per-strain SNP plot.]

Main Conclusions
• Many worldwide isolates are closely related
• Large IBD segments common worldwide
  – Extensive migration + selection
• Some population structure masked by segment sharing
• Three strains on the Pacific Rim are the most diverse (~12% of SNPs)

Acknowledgements
• Princeton Microarray facility
• Waksman Genomics Core facility (Rutgers)
• Jonathan Crissman
• Dee Denver, University of Oregon
• Matt Rockman, NYU
• Michael Ailion, Utah
• Suzanne Estes, University of Oregon
• Patrick McGrath, Cori Bargmann, Rockefeller University
• Asher Cutter, University of Toronto
• Caenorhabditis Genetics Center