C. elegans

C. elegans population diversity
Justin Gerke
Joshua Shapiro
Erik Andersen
Marie-Anne Felix
Leonid Kruglyak
Princeton University
Institut Jacques-Monod
C. elegans
• 1 mm long as adults
• Found worldwide in rotting vegetation/compost
• Hermaphroditic with rare males
• Low recombination rate
• 100 Mb genome, 6 Chromosomes
C. elegans
• Biomedical Research
• Evolutionary Biology
C. elegans population diversity
• Studies of Small # of Loci
–MtDNA, PCR amplicon sequencing, SSRs,
AFLPs
• One Genome-Wide Study of Biased SNPs
–SNPs identified from a single strain pair
• Goal: Genome-wide, unbiased sample
203 Wild Isolates
RAD-Seq Library Construction
Our Oligos
PLoS One 2008:
Barcode
Tailed Amplification
Faster Protocol
More Mappable Reads
Lower Cost for Barcoded Oligos
RAD seq data management goals
• Create a central repository for reads and associated metadata
  • Easily share data among lab members
• Make it easy to operate on subsets of data
  • split multiplexed runs
  • combine multiple runs
  • select mapped data by genomic location
• Integrate with analysis tools
  • scripts to link database to external software
  • ability to pull data directly into R, python, perl, etc.
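A minimal sketch of selecting mapped reads by genomic location and pulling the result straight into Python. The slides name MySQL; the standard-library sqlite3 module stands in here so the example is self-contained. The table name, column names, and rows (mapped_reads, strain, run, chrom, pos) are hypothetical and not the lab's actual schema.

```python
import sqlite3

# In-memory stand-in for the MySQL database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mapped_reads (strain TEXT, run TEXT, chrom TEXT, pos INTEGER)"
)
conn.executemany(
    "INSERT INTO mapped_reads VALUES (?, ?, ?, ?)",
    [
        ("N2", "run1", "V", 1200345),
        ("N2", "run2", "V", 1200345),
        ("CB4856", "run1", "V", 1200345),
        ("CB4856", "run1", "X", 500),
    ],
)


def reads_in_region(conn, chrom, start, end):
    """Count mapped reads per strain in a region, combining all runs."""
    query = (
        "SELECT strain, COUNT(*) FROM mapped_reads "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? "
        "GROUP BY strain ORDER BY strain"
    )
    return conn.execute(query, (chrom, start, end)).fetchall()


counts = reads_in_region(conn, "V", 1_000_000, 2_000_000)
print(counts)
```

Keeping the run as a column rather than a separate file is what makes "combine multiple runs" a plain GROUP BY in this sketch.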
RAD seq data workflow
• Raw reads & run info go into the MySQL Database
• Normalize quality scores, check barcodes, check restriction site
• SELECT RAW READS by barcode, restriction site, run, quality, length
• Remove barcodes, remove restriction sites: reads for mapping
• bwa, then samtools, giving pileup files
• SELECT MAPPED READS by location, strain
• CREATE SUMMARY TABLES by restriction site location, strain, run, mapping parameters
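The barcode and restriction-site checks in this workflow can be sketched as follows. This is an illustration under stated assumptions, not the lab's scripts: the barcodes and the function name split_reads are hypothetical, and the site remnant "AATTC" is simply taken from the example reads shown later in the slides.

```python
from collections import defaultdict

# Hypothetical barcodes. The site remnant "AATTC" is an assumption based on
# the example reads shown later in the slides.
BARCODES = {"ACGT": "N2", "TGCA": "CB4856"}
SITE = "AATTC"


def split_reads(reads, barcodes=BARCODES, site=SITE):
    """Assign reads to strains by barcode; keep only reads with the site.

    Returns (per_strain, rejected). Kept reads have the barcode removed and
    still begin with the restriction-site remnant; trimming the site itself
    is left as a separate step.
    """
    per_strain = defaultdict(list)
    rejected = []
    for read in reads:
        for barcode, strain in barcodes.items():
            rest = read[len(barcode):]
            if read.startswith(barcode) and rest.startswith(site):
                per_strain[strain].append(rest)
                break
        else:
            # No barcode matched, or the site was missing after the barcode.
            rejected.append(read)
    return dict(per_strain), rejected


reads = [
    "ACGTAATTCCATGGTTAGTG",  # N2 barcode followed by the site
    "TGCAAATTCCATGGTTAGTG",  # CB4856 barcode followed by the site
    "ACGTGGGGGCATGGTTAGTG",  # barcode matches, site missing
    "NNNNAATTCCATGGTTAGTG",  # unreadable barcode
]
per_strain, rejected = split_reads(reads)
print(per_strain, len(rejected))
```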
Reads per strain
[Figure: histograms of number of strains by read count (millions); panels: reads, mapped]
Mapping rate by strain
[Figure: histogram of number of strains by fraction of reads mapped (0.0 to 1.0); one bar is labeled C. briggsae]
Genomic distribution of mapped restriction sites
[Figure: number of mapped restriction sites by location (Mb), one panel per chromosome (I, II, III, IV, V, X)]
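Restriction Site Coverage
[Venn diagram: observed restriction sites vs. reference sequence restriction sites; values in parentheses are for the reference strain only]
• In both sets: 38,628 (38,541)
• Observed only: 2,097 (69)
• Reference sequence only: 985 (1,072)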
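A quick arithmetic check on these counts, assuming the three numbers are the overlap, the observed-only region, and the reference-only region of the Venn diagram. Under that reading, the overlap plus the reference-only count gives the same reference total for both the full sample and the reference strain alone.

```python
# Counts as read from the slide (reference-strain-only values have a ref_ prefix).
both, observed_only, reference_only = 38_628, 2_097, 985
ref_both, ref_observed_only, ref_reference_only = 38_541, 69, 1_072

# Both rows describe the same reference genome, so the totals should agree.
reference_total = both + reference_only
assert reference_total == ref_both + ref_reference_only

# Fraction of reference restriction sites that were actually observed.
frac_all = both / reference_total
frac_ref = ref_both / reference_total
print(reference_total, round(frac_all, 3), round(frac_ref, 3))
```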
Genomic locations of 'missing' restriction sites
[Figure: number of 'missing' restriction sites by location (Mb), one panel per chromosome (I, II, III, IV, V, X)]
Restriction site proximity results in unsampled segments
Missing restriction sites tend to be near another site
[Figure: number of restriction sites by distance to nearest restriction site (0 to 5000); panels: Present, Missing]
High variability between read counts at the same restriction site
r² = 0.15
Distribution of bias in read depth
[Figure: histogram of number of restriction sites by log2 ratio of read counts (0 to 7)]
Base calling procedures
• Pull sequences from the database that start at defined locations
  • seen in at least 2 strains
  • average reads per strain > 5
  • do not require restriction site in reference
• Compile reads in pileup format
• Only call SNPs in the 91 bp after the restriction site
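The selection of segment start locations can be sketched as below. The function name and data layout are hypothetical, and one detail is an assumption: "average reads per strain" is computed here over the strains that have at least one read at the site, which the slide does not specify.

```python
def select_segments(read_counts, min_strains=2, min_mean_reads=5):
    """Pick segment start sites that pass the coverage filters.

    read_counts maps a site key, e.g. (chrom, pos, strand), to a dict of
    {strain: number of reads starting there}. No reference restriction site
    is required: only observed read starts are used.
    """
    kept = []
    for site, by_strain in read_counts.items():
        covered = {s: n for s, n in by_strain.items() if n > 0}
        if len(covered) < min_strains:
            continue  # seen in fewer than min_strains strains
        mean_reads = sum(covered.values()) / len(covered)
        if mean_reads > min_mean_reads:
            kept.append(site)
    return sorted(kept)


example_counts = {
    ("I", 1000, "+"): {"N2": 12, "CB4856": 9},             # passes both filters
    ("I", 5000, "+"): {"N2": 30},                          # only one strain
    ("II", 200, "-"): {"N2": 2, "CB4856": 3, "DL238": 1},  # too shallow
}
selected = select_segments(example_counts)
print(selected)
```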
SNPs outside expected segments are
error-prone
91bp
AATTCCATGGTTAGTG...GACCGGGTAGCTTCTACACATGACTA
AATTCCATGGTTAGTG...GACCGGGTAGCTTC
AATTCCATGGTTAGTG...GACCGGGTAACTTC
AATTCCATGGTTAGTG...GTCCGGGTAGCTTC
AATTCCATGGTTAGTG...GACCGG-TAGCTTCT
AATTCCATGGTTAGTG...GACCGG-TAGCTTCA
AATTCCATGGTTAGTG...GACCGG-TTGCTTCA
Base calling continued
• 7.85 Mb from 86,211 segments
  • Mean coverage depth is 29 reads/site, median 18
• Calls generated by samtools are quite good
  • homozygosity of individuals makes this much easier
  • only 51 SNPs in reference strain (one of the lowest coverage strains) with quality score ≥ 60
• Tested by comparison of duplicate libraries
  • prepared separately, run separately
  • conservative estimator, as coverage per run tends to be lower than in full data set
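One way to picture the duplicate-library test: call bases independently in two libraries of the same strain and count disagreements among sites that pass a quality cutoff in both. The sketch below is illustrative only; the function name, the data layout, and the example calls are hypothetical, and the slides do not say exactly how errors were counted.

```python
def discordant_calls(calls_a, calls_b, min_quality=60):
    """Count compared sites and disagreements between two replicate libraries.

    Each input maps a site to (base, quality). A site is compared only if it
    is called in both replicates with quality >= min_quality.
    """
    compared = errors = 0
    for site, (base_a, qual_a) in calls_a.items():
        if site not in calls_b:
            continue
        base_b, qual_b = calls_b[site]
        if qual_a < min_quality or qual_b < min_quality:
            continue
        compared += 1
        if base_a != base_b:
            errors += 1
    return compared, errors


# Hypothetical calls from two libraries of the same strain.
lib1 = {1: ("A", 150), 2: ("C", 80), 3: ("G", 200), 4: ("T", 40)}
lib2 = {1: ("A", 140), 2: ("T", 90), 3: ("G", 190), 4: ("T", 200), 5: ("A", 99)}

at_60 = discordant_calls(lib1, lib2, min_quality=60)
at_120 = discordant_calls(lib1, lib2, min_quality=120)
print(at_60, at_120)
```

Raising the cutoff removes the discordant site in this toy data but also shrinks the number of compared sites, which is the trade-off the quality-cutoff plots explore.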
Errors per run by quality cutoff
[Figure: errors per run (0 to 800) vs. quality score cutoff (0 to 200)]
Called sites by quality cutoff
[Figure: number of called sites (10,000 to 40,000) vs. quality score cutoff (60 to 200)]
Error rate per called site
[Figure: errors per called site (0.000 to 0.007) vs. quality score cutoff (60 to 200)]
Error rate per called SNP (FDR)
[Figure: errors per called SNP (0.000 to 0.020) vs. quality score cutoff (60 to 200)]
Parameters for SNP calling
• Optimized for analysis of structure & association mapping
• SNPs identified as sites with at least one sample of each allele having quality score ≥ 120
  • remaining bases called where quality ≥ 60
• ~61,000 total SNPs
  • ~40,000 with frequency ≥ 2, with good calls in at least 95% of strains
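The two-threshold rule can be sketched as follows. Function names and the example calls are hypothetical, and "frequency ≥ 2" is read here as a minor-allele count of at least 2 strains, which is an assumption.

```python
from collections import Counter


def call_site(calls, snp_quality=120, base_quality=60):
    """Apply the two-threshold rule to one site.

    calls maps strain -> (base, quality). The site is a SNP if at least two
    different alleles each have a call with quality >= snp_quality. Genotypes
    are then kept for every strain with quality >= base_quality.
    """
    confident_alleles = {base for base, q in calls.values() if q >= snp_quality}
    is_snp = len(confident_alleles) >= 2
    genotypes = {s: b for s, (b, q) in calls.items() if q >= base_quality}
    return is_snp, genotypes


def passes_filters(genotypes, n_strains, min_minor_count=2, min_call_rate=0.95):
    """Keep SNPs with enough minor-allele copies and enough called strains."""
    counts = Counter(genotypes.values())
    if len(counts) < 2:
        return False
    minor = sorted(counts.values())[-2]  # count of the second most common allele
    return minor >= min_minor_count and len(genotypes) / n_strains >= min_call_rate


site = {"N2": ("G", 150), "CB4856": ("A", 130), "DL238": ("A", 70), "JU258": ("G", 30)}
is_snp, genotypes = call_site(site)
print(is_snp, genotypes, passes_filters(genotypes, n_strains=4))
```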
N2
LSJ1
JU1616
JU1615
JU1200
JU1586
JU1566
JU1568
ED3024
ED3028
ED3010
ED3015
ED3014
ED3012
ED3023
JU693
JU694
JU361
JU367
EG4346
EG4347
JU394
JU399
DR1344
CB4857
CB4851
JU406
PX178
PX179
JU1563
CB4852
JU1516
JU440
JU1395
EG4945
AB1
EG4957
EG4680
EG4948
EG4689
EG4951
JU318
EG4946
CB4932
JU316
JU313
JU321
JU311
JU317
JU314
PB303
JU561
JU563
JU1440
JT11362
JT11398
JT11399
JU393
JU1207
JU1214
JU1212
PX174
RC301
RW7000
ED3048
ED3042
CX11259
CX11292
JU1088
JU262
AB4
JU315
CB4854
JU1401
JU1896
JU1530
JU1026
CX11278
JU1410
JU1411
CX11319
CB3198
JU1409
JU1037
CX11317
AB2
DR1350
CB4855
CX11305
JU1172
CB4858
JU299
ED3075
JU1040
JU1204
JU438
ED3017
JU1206
JU1039
JU310
CB4853
CX11271
ED3020
ED3019
JU263
ED3073
CX11258
JU848
JU774
ED3043
JU342
LKC34
JU347
CX11254
JU322
ED3040
JU323
ED3052
CX11316
CX11276
JU642
JU362
JU363
JU398
JU397
JU401
ED3077
JU801
JU368
JU360
ED3011
JU847
CX11315
DL226
JU1652
JU345
JU1230
JU346
MY1
JU830
JU829
ED3005
JU1491
JU1484
CX11268
JU1218
ED3021
JU1243
JU1242
KR314
ED3049
CX11307
CX11314
JU1246
DL200
ED3046
JU778
PB306
MY6
MY18
JU531
QX1218
JU533
QX1233
JU751
JU1213
JU1482
MY7
EG4724
JU622
MY10
JU792
JU396
JU395
JU1400
CX11262
CX11294
JU1511
JU1581
JU1582
JU1522
CX11264
CX11285
CX11321
EG4348
EG4349
MY16
EG4725
PS2025
JU782
JU799
JU1580
DL238
CB4856
JU258
JU775
JU1171
MY2
MY14
MY23
MY15
QX1216
QX1211
Number of SNPs per strain (vs. reference)
[Figure: number of SNPs vs. reference (0 to 15,000) for each strain; the x-axis strains are those listed above]
Overall SNP Frequency Spectrum
[Figure: number of sites (0 to 20,000) by allele frequency (0.1 to 0.5); series: observed, expected]
Polymorphism is highest on chromosome arms
[Figure: Π or Θw (per kb, 1 to 4) by location (Mb), one panel per chromosome (I, II, III, IV, V, X)]
Tajima's D is usually negative, sometimes for extended regions
[Figure: Tajima's D (−4 to 2) by location (Mb), one panel per chromosome (I, II, III, IV, V, X)]
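For reference, the three statistics plotted in these two figures can be computed per window as sketched below, using the standard definitions of nucleotide diversity (Π), Watterson's estimator (Θw), and Tajima's D. The function name and input layout are illustrative and say nothing about the software actually used for the talk.

```python
from math import comb, sqrt


def tajimas_d(minor_counts, n):
    """Return (pi, theta_w, D) for one window.

    minor_counts holds, for each segregating site, how many of the n
    haplotypes carry the minor allele. pi and theta_w are window totals;
    divide by the number of called bases for per-site (or per-kb) values.
    """
    s = len(minor_counts)
    if s == 0 or n < 4:
        raise ValueError("need at least one segregating site and n >= 4")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    # Mean number of pairwise differences.
    pi = sum(k * (n - k) for k in minor_counts) / comb(n, 2)
    theta_w = s / a1
    # Variance terms from Tajima (1989).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    d = (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))
    return pi, theta_w, d


singletons = tajimas_d([1, 1, 1], n=4)  # excess of rare variants
balanced = tajimas_d([2, 2, 2], n=4)    # intermediate-frequency variants
print(singletons, balanced)
```

An excess of rare variants pulls Π below Θw and makes D negative, which is the direction reported on the slide.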
[Figure: plot labeled by strain name; the labels repeat the 203 strains listed above in a different order, and one stray axis value (30) survives]
Many Clonal Isolates
203 Strains
R > 0.98, Hand-curation
43 Unique Strains
50 clonal sets
• Most Sets are Location-Specific
• Some Found Across Continents
– Germany/Chile
– Australia/USA
– USA/France
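A minimal sketch of grouping isolates into clonal sets, assuming "R > 0.98" refers to the pairwise correlation of genotype vectors and that a set is a connected group of strains linked by such pairs. The strain names and genotypes below are made up, and the hand-curation step is not modeled.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant vector
    return sxy / sqrt(sxx * syy)


def clonal_sets(genotypes, threshold=0.98):
    """Group strains whose genotype correlation exceeds the threshold."""
    parent = {s: s for s in genotypes}

    def find(s):
        # Union-find lookup with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(genotypes, 2):
        if pearson(genotypes[a], genotypes[b]) > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in genotypes:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


# Made-up 0/1 genotypes at six SNPs for four hypothetical isolates.
toy = {
    "A1": [0, 1, 1, 0, 1, 0],
    "A2": [0, 1, 1, 0, 1, 0],
    "B1": [1, 0, 0, 1, 1, 0],
    "C1": [1, 1, 0, 0, 0, 1],
}
sets_found = clonal_sets(toy)
print(sets_found)
```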
SNP Pruning
1. Imputation (NPUTE), 99.7% accuracy
2. Pruning (PLINK), 40-SNP windows, r² > 0.5
40,000 SNPs → 8,000 SNPs
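To illustrate what window-based LD pruning does, here is a simplified greedy version. It is not PLINK's procedure and the genotypes are made up: a SNP is kept unless its r² with an already-kept SNP among the preceding 40 SNPs exceeds 0.5.

```python
from math import sqrt


def r_squared(x, y):
    """Squared Pearson correlation of two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNPs carry no LD information
    return (sxy / sqrt(sxx * syy)) ** 2


def prune(snps, window=40, max_r2=0.5):
    """Return indices of SNPs kept after greedy pruning.

    snps is a list of genotype vectors in genomic order. Each SNP is compared
    only with kept SNPs less than `window` positions before it.
    """
    kept = []
    for i, genotype in enumerate(snps):
        recent = [j for j in kept if i - j < window]
        if all(r_squared(genotype, snps[j]) <= max_r2 for j in recent):
            kept.append(i)
    return kept


toy_snps = [
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 1],  # identical to SNP 0, r2 = 1
    [1, 1, 0, 0, 1, 0],  # complement of SNP 0, r2 = 1
    [0, 1, 0, 1, 1, 0],  # weakly correlated with SNP 0
]
kept = prune(toy_snps)
print(kept)
```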
Analysis by STRUCTURE
• k = 1
PCA Separates France and USA
C. elegans population sample
• Mild population structure detected by PCA
–no distinct populations
• 203 strains = 93 unique haplotypes
• Segment Sharing?
–GERMLINE
Most Strains are Related
Most Strains Share 1/3 of Genome
IBD Segments are Large
ED3010 - DL226
Three Unrelated Strains
• 93 Genome-wide Haplotypes
• All but 255 comparisons (~94%) Share One IBD Segment
• 250/255 Due to three strains: CB4856, DL238, QX1211/1216
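The ~94% figure follows from the number of pairwise comparisons among 93 haplotypes, as this short check shows. It assumes every unordered pair of haplotypes counts as one comparison.

```python
from math import comb

n_haplotypes = 93
pairs = comb(n_haplotypes, 2)  # unordered pairs of haplotypes
no_shared_segment = 255        # comparisons without a shared IBD segment

sharing_fraction = (pairs - no_shared_segment) / pairs
three_strain_share = 250 / no_shared_segment
print(pairs, round(sharing_fraction, 3), round(three_strain_share, 2))
```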
Multiple Highly Shared IBD Segments
Chr V Segment is Widespread
Chr V Segment in France
What could cause this
pattern?
• Recent Migration
• Selection
–Background Selection
–Positive Sweep
Main Conclusions
• Technique extremely successful
• Direct sequencing is crucial to accurately depict strain relationships
  – rare variants
  – diverged populations with private alleles
[Figure: number of SNPs (vs. reference) per strain, in two panels (Ascertained, All); the x-axis repeats the strain list of the earlier per-strain SNP figure]
Main Conclusions
• Many Worldwide Isolates are Closely Related
• Large IBD Segments Common Worldwide
–Extensive Migration + Selection
• Some population structure masked by
segment sharing
• Three strains on Pacific Rim are the Most
Diverse (~12% of SNPs)
Acknowledgements
Princeton Microarray facility
Waksman Genomics Core facility (Rutgers)
Jonathan Crissman
Dee Denver, University of Oregon
Matt Rockman, NYU
Michael Ailion, Utah
Suzanne Estes, University of Oregon
Patrick McGrath, Cori Bargmann, Rockefeller University
Asher Cutter, University of Toronto
Caenorhabditis Genetics Center