etc3186-sup-0001-SuppData-S1

Sequencing and De novo Draft Assemblies of the Fathead Minnow (Pimephales promelas) Genome
Burns FR†, Cogburn LA†, Ankley GT‡, Villeneuve DL‡, Waits E§, Chang YJǁ, Llaca V#, Deschamps S#,
Jackson R††, Hoke RA*†
*†E.I. du Pont de Nemours, Haskell Global Centers for Health and Environmental Sciences, Newark, DE,
USA; ‡U.S. Environmental Protection Agency, Mid-Continent Ecology Division, Duluth, MN, USA;
§U.S. Environmental Protection Agency, Cincinnati, Ohio, USA; ǁHigh-Performance Biological
Computing, University of Illinois at Urbana-Champaign, Urbana, IL, USA; #E.I. du Pont de Nemours,
Agricultural Biotechnology, Wilmington, DE; ††E.I. du Pont de Nemours, Central Research and
Development Biotechnology, Wilmington, DE.
Supplementary Information
Supplementary Methods:
Library Construction and sequence generation on HiSeq 2000
Four next-generation DNA sequencing (NGS) libraries with varying average DNA fragment sizes were
made: 1) a paired-end NGS library with an average fragment size of 180 bases; 2) two “mate-pair” NGS
libraries with average fragment sizes of 3 Kb and 6 Kb; and 3) a “fosmid” library with an average
fragment size of 40 Kb.
The paired-end 180-base and mate-pair 3 Kb and 6 Kb NGS libraries were made following standard
Illumina protocol recommendations. Approximately 2 µg of genomic DNA was sheared on a Covaris
system (Woburn, MA) to an average of 180 bases. End repair, A-base addition, and ligation of adapters
were performed according to the protocol developed by Illumina for preparing DNA samples for paired-end sequencing. Size-selection of adapter-ligated DNA fragments was performed on a LabChip XT
fractionation system (Caliper). Two consecutive size selection steps were performed with a resulting
DNA fragment size ranging between approximately 140 bases and 200 bases, and peaking at 170 bases.
For PCR amplification, 1 µL of size-selected DNA was incubated in 50 µL of 1X Phusion HF master mix
(Finnzymes) containing 10 µM PCR primers PE 1.0 and PE 2.0 (Illumina). After PCR amplification (30 s
at 98 °C, followed by 14 rounds of 40 s at 98 °C, 30 s at 65 °C, and 30 s at 72 °C, and a final extension step of
5 min at 72 °C), DNA was purified with a MinElute PCR purification kit (Qiagen) and resuspended in 10
µL EB buffer (Qiagen).
The mate-pair libraries were made after shearing high-molecular-weight DNA
with a HydroShear system (Genomic Solutions). The resulting fragments were subjected to end repair,
biotin labeling, and gel-based size selection, according to a protocol developed by Illumina for preparing
mate-pair libraries. Buffer and enzymatic reagents were obtained directly from Illumina’s 2-5 Kb mate-pair
library preparation kit. The size-selected DNA was circularized and sheared on a Covaris system
(Woburn, MA). Following selection, the purified biotinylated DNA fragments were subjected to end
repair, A-base addition, and ligation of adapters, and a final PCR amplification step was performed using
conditions similar to those used to create the 180-base library. After amplification, DNA was
resuspended in 10 µL EB buffer (Qiagen).
To create the 40 Kb library, approximately 30 µg of high-molecular-weight DNA was initially subjected to
hydrodynamic shearing on a HydroShear system (Genomic Solutions) using a large shearing assembly
set. DNA was sheared to an average of approximately 40 Kb, as confirmed by CHEF gel electrophoresis.
Fragments were treated with the DNA Terminator End Repair kit (Lucigen) to create phosphorylated
blunt ends and then size-selected by CHEF gel electrophoresis in 1.2% low-melting-point agarose
on a BioRad CHEF Mapper system. Fragments with lengths of 30 to 100 Kb were excised from the gel and
recovered by digesting the gel slice with Gelase (Epicentre), following the manufacturer’s conditions.
Recovered DNA was ligated to linearized, dephosphorylated pFOS NGS vector using the CopyRight®
pNGS Fosmid Cloning Kit (Lucigen). Ligated DNA was then lambda-packaged using Gigapack III
Gold (Agilent) and used to transfect FOS replicator E. coli cells. Transfected cells were frozen at -80 °C in
LB + 25% glycerol. The library was titered and amplified at high density (~50,000 cfu/in²) in 22 x 22 cm
bioassay trays with YT agar + chloramphenicol. Bioassay trays were incubated for 20 hours and placed at 4 °C
for 3 hours; the clones were then scraped, combined into a single 250-mL LB + chloramphenicol + 1X
arabinose solution, and incubated at 37 °C for 1 hour. DNA from the pooled clones was purified as a
single sample using a large-insert DNA extraction kit (Qiagen). Two 5-µg aliquots were completely
digested separately with BfaI and CviQI. The vector fragments from each sample were purified from a
1.2% SYBR Safe gel (Invitrogen) and self-ligated at low concentration (0.8 ng/µL) using CloneSmart®
DNA Ligase (Lucigen). The self-ligated samples were PCR-amplified using Phusion HF master mix and
the following primers: 1) Illum-A: AAG CAG AAG ACG GCA TAC GAG; 2) Illum-B: AAT GAT ACG
GCG ACC ACC GAG. Samples were purified twice using Ampure XT magnetic beads and quantified on
a Bioanalyzer (Agilent).
Library construction and Sequence generation on MiSeq.
DNA was fragmented using a Covaris S220 Ultrasonicator (Covaris, Woburn, MA 18081). The settings
used for fragmentation are in the table below:
Parameter                  Setting
Duty Factor                5%
Peak Incident Power (W)    175
Cycles per Burst           200
Time (Seconds)             25
The size distribution of the fragmented DNA was characterized using a BioAnalyzer 2100 (Agilent, Santa Clara, CA,
95051). Fragment sizes were determined to be 600-1000 bp. The fragmented DNA was then purified,
and adapters for sequencing were ligated as per the instruction manual for the TruSeq DNA PCR-Free
Sample Prep kit (Part #FC-121-3001, Illumina, San Diego, CA, 92122). The resultant sample library was
sequenced in two runs on the Illumina MiSeq using 2 x 250 base, paired-end settings.
SOAPdenovo assembly parameters:
SOAPdenovo-63mer pregraph -s config.file -K 25 -p 24 -o dupont_25 -R 1> pregraph.log 2> pregraph.err
SOAPdenovo-63mer contig -s config.file -g dupont_25 -m 43 -p 24 -R 1> contig.log 2> contig.err
SOAPdenovo-63mer map -s config.file -g dupont_25 -k 33 -p 12 -R 1> map.log 2> map.err
SOAPdenovo-63mer scaff -p 12 -F -L 100 -g dupont_25 1> scaff.log 2> scaff.err
Reads from the two paired-end libraries were used for gap closing.
Scaffolds below 1 kb were excluded from the final assembly file.
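The length filter described above can be illustrated with a minimal Python sketch. This is not the script used for the assembly; the function names and the demo records are hypothetical, and the cutoff is assumed to mean that scaffolds of exactly 1000 bases are retained (consistent with a shortest reported SOAP scaffold of 1000).

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_scaffolds(handle: TextIO, min_len: int = 1000) -> Iterator[Tuple[str, str]]:
    """Keep only records whose sequence is at least min_len bases long."""
    for header, seq in read_fasta(handle):
        if len(seq) >= min_len:
            yield header, seq


# Hypothetical two-record input: one scaffold at the cutoff, one just below it.
demo = io.StringIO(">s1\n" + "A" * 1000 + "\n>s2\n" + "C" * 999 + "\n")
kept = [header for header, _ in filter_scaffolds(demo)]
print(kept)
```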
Config.file: (The following is the content of the config file)
max_rd_len=240
[LIB]
avg_ins=180
reverse_seq=0
asm_flags=3
rank=1
q1=A6730001_paired_r1.fastq.cor.pair_1.fq
q2=A6730001_paired_r2.fastq.cor.pair_2.fq
q1=A6730005_paired_r1.fastq.cor.pair_1.fq
q2=A6730005_paired_r2.fastq.cor.pair_2.fq
[LIB]
avg_ins=600
reverse_seq=0
asm_flags=3
rank=2
q1=FatHeadMinnow_S1_L001_R1_001-A_pairedOut.trimR.fastq.cor.pair_1.fq
q2=FatHeadMinnow_S1_L001_R2_001-A_pairedOut.trimR.fastq.cor.pair_2.fq
q1=FatHeadMinnow_S1_L001_R1_001_pairedOut.trimR.fastq.cor.pair_1.fq
q2=FatHeadMinnow_S1_L001_R2_001_pairedOut.trimR.fastq.cor.pair_2.fq
[LIB]
avg_ins=3000
reverse_seq=1
asm_flags=2
rank=3
q1=A6730002_paired_r1.fastq
q2=A6730002_paired_r2.fastq
[LIB]
avg_ins=6000
reverse_seq=1
asm_flags=2
rank=4
pair_num_cutoff=5
q1=A6730003_paired_r1.fastq
q2=A6730003_paired_r2.fastq
[LIB]
avg_ins=40000
reverse_seq=1
asm_flags=2
rank=5
q1=A6730004_paired_r1.fastq
q2=A6730004_paired_r2.fastq
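To make the structure of the config file easier to follow, the sketch below parses an excerpt of it into one record per [LIB] block. The parser and its names are hypothetical and not part of SOAPdenovo. The asm_flags descriptions reflect the SOAPdenovo manual as commonly documented and should be treated as an assumption here.

```python
# Assumed meanings of asm_flags, taken from the SOAPdenovo manual.
ASM_FLAGS = {
    "1": "contig assembly only",
    "2": "scaffolding only",
    "3": "contig assembly and scaffolding",
    "4": "gap closure only",
}


def parse_soap_config(text):
    """Split config text into global settings and a list of library records."""
    settings, libs, current = {}, [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "[LIB]":
            # Each [LIB] block starts a new library; q1/q2 may repeat within it.
            current = {"q1": [], "q2": []}
            libs.append(current)
            continue
        key, _, value = line.partition("=")
        if current is None:
            settings[key] = value
        elif key in ("q1", "q2"):
            current[key].append(value)
        else:
            current[key] = value
    return settings, libs


# Excerpt of the config above (first and third libraries only).
EXCERPT = """max_rd_len=240
[LIB]
avg_ins=180
reverse_seq=0
asm_flags=3
rank=1
q1=A6730001_paired_r1.fastq.cor.pair_1.fq
q2=A6730001_paired_r2.fastq.cor.pair_2.fq
q1=A6730005_paired_r1.fastq.cor.pair_1.fq
q2=A6730005_paired_r2.fastq.cor.pair_2.fq
[LIB]
avg_ins=3000
reverse_seq=1
asm_flags=2
rank=3
q1=A6730002_paired_r1.fastq
q2=A6730002_paired_r2.fastq
"""

settings, libs = parse_soap_config(EXCERPT)
for lib in sorted(libs, key=lambda d: int(d["rank"])):
    print(lib["avg_ins"], ASM_FLAGS[lib["asm_flags"]], len(lib["q1"]), "paired file set(s)")
```

Read this way, the 180-base and 600-base libraries (asm_flags=3) contribute to both contig building and scaffolding, while the 3 Kb, 6 Kb, and 40 Kb libraries (asm_flags=2) are used for scaffolding only, in increasing rank order.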
SGA assembly parameters:
sga preprocess --pe-mode 1 -o FHM_180.fastq 180_r1.fastq 180_r2.fastq
sga index -a ropebwt --no-reverse -t 16 FHM_180.fastq
sga correct -k 31 --discard --learn -t 16 -o reads.ec.k31.fastq FHM_180.fastq
sga index -a ropebwt -t 16 reads.ec.k31.fastq
sga filter -x 2 -t 16 --homopolymer-check --low-complexity-check reads.ec.k31.fastq
sga fm-merge -m 55 -t 16 -o merged.k31.fa reads.ec.k31.filter.pass.fa
sga index -d 1000000 -t 16 merged.k31.fa
sga rmdup -t 16 merged.k31.fa
sga overlap -m 55 -t 16 merged.k31.rmdup.fa
sga assemble -m 75 -g 0 -r 10 -o assemble.m75 merged.k31.rmdup.asqg.gz
bwa index assemble.m75.contigs.fa
bwa aln -t 16 assemble.m75.contigs.fa 180_r1.fastq > 180_r1.fastq.sai
bwa aln -t 16 assemble.m75.contigs.fa 180_r2.fastq > 180_r2.fastq.sai
bwa sampe assemble.m75.contigs.fa 180_r1.fastq.sai 180_r2.fastq.sai 180_r1.fastq 180_r2.fastq |
samtools view -Sb - > 180bp.bam
*repeat bwa alignment steps above for each library used for scaffolding
sga-bam2de.pl --prefix lib.fragment.180bp -n 5 -m 200 lib.fragment.180bp.bam
sga-bam2de.pl --prefix lib.fragment.250bp -n 5 -m 200 lib.fragment.250bp.bam
sga-bam2de.pl --prefix lib.matepair.3kbp -n 5 -m 200 lib.matepair.3kbp.bam
sga-bam2de.pl --prefix lib.matepair.6kbp -n 5 -m 200 lib.matepair.6kbp.bam
sga-bam2de.pl --prefix lib.matepair.40kbp -n 5 -m 200 lib.matepair.40kbp.bam
samtools sort lib.fragment.180bp.bam lib.fragment.180bp.refsort
sga-astat.py -m 200 lib.fragment.180bp.refsort.bam > contigs.astat
sga scaffold -m 200 -a contigs.astat --pe lib.fragment.180bp.de --pe lib.fragment.250bp.de --mate lib.matepair.3kb.de --mate lib.matepair.6kb.de --mate lib.matepair.40kb.de -o multiple.libs.scaf contigs.fa
sga scaffold2fasta --write-unplaced -m 200 -o FHM_scaffolds.fa --use-overlap -a assemble.m75graph.asqg.gz multiple.libs.scaf
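The note above asks for the bwa alignment steps to be repeated for each scaffolding library. A minimal sketch of that repetition follows. It only builds the command strings and does not run them. The helper name is hypothetical, the 3 Kb read file names are placeholders (the source gives file names for the 180-base library only), and the BAM names are assumed to follow the prefixes passed to sga-bam2de.pl.

```python
def bwa_commands(contigs, r1, r2, bam, threads=16):
    """Return the three shell commands that align one paired library to the contigs."""
    return [
        f"bwa aln -t {threads} {contigs} {r1} > {r1}.sai",
        f"bwa aln -t {threads} {contigs} {r2} > {r2}.sai",
        f"bwa sampe {contigs} {r1}.sai {r2}.sai {r1} {r2} | samtools view -Sb - > {bam}",
    ]


CONTIGS = "assemble.m75.contigs.fa"

# Prefix -> (read 1, read 2). Only the 180-base file names appear in the source;
# the 3 Kb names are hypothetical placeholders.
LIBRARIES = {
    "lib.fragment.180bp": ("180_r1.fastq", "180_r2.fastq"),
    "lib.matepair.3kbp": ("3kb_r1.fastq", "3kb_r2.fastq"),
}

all_commands = []
for prefix, (r1, r2) in LIBRARIES.items():
    all_commands.extend(bwa_commands(CONTIGS, r1, r2, prefix + ".bam"))

print("\n".join(all_commands))
```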
Statistic                                          SGA                SOAP
Number of scaffolds                                810921             73057
Total size of scaffolds                            957809772          1219326373
Longest scaffold                                   811454             580801
Shortest scaffold                                  200                1000
Number of scaffolds > 500 nt                       177416 (21.9%)     73057 (100.0%)
Number of scaffolds > 1K nt                        66983 (8.3%)       73011 (99.9%)
Number of scaffolds > 10K nt                       16868 (2.1%)       19302 (26.4%)
Number of scaffolds > 100K nt                      746 (0.1%)         2265 (3.1%)
Number of scaffolds > 1M nt                        0 (0.0%)           0 (0.0%)
Mean scaffold size                                 1181               16690
Median scaffold size                               307                3513
N50 scaffold length                                15414              60380
L50 scaffold count                                 11491              5505
NG50 scaffold length                               147007             240449
LG50 scaffold count                                263                178
N50 scaffold - NG50 scaffold length difference     131593             180069
scaffold %A                                        26.29              20.67
scaffold %C                                        16.19              12.62
scaffold %G                                        16.15              12.59
scaffold %T                                        26.31              20.63
scaffold %N                                        15.07              33.48
scaffold %non-ACGTN                                0.00               0.00
Number of scaffold non-ACGTN nt                    0                  0
Percentage of assembly in scaffolded contigs       67.90%             94.9%
Percentage of assembly in unscaffolded contigs     32.10%             5.1%
Average number of contigs per scaffold             1.2                2.9
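For readers unfamiliar with the N50, L50, NG50, and LG50 statistics reported above, the following sketch shows how they are conventionally computed from a list of scaffold lengths. It is not the tool used to produce the table, the function name and toy lengths are hypothetical, and the genome size estimate used for the reported NG50 values is not stated in this supplement.

```python
def nx_stats(lengths, fraction=0.5, reference_size=None):
    """Return (Nx length, Lx count) for the given scaffold lengths.

    With reference_size=None the threshold is a fraction of the assembly total
    (N50/L50). With an estimated genome size supplied, the same walk gives
    NG50/LG50. Returns None if the assembly never reaches the threshold.
    """
    total = reference_size if reference_size is not None else sum(lengths)
    threshold = total * fraction
    running = 0
    # Walk scaffolds from longest to shortest, accumulating their lengths.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            return length, count
    return None


# Toy scaffold lengths (hypothetical), total = 300.
toy = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx_stats(toy)
# Hypothetical genome size estimate of 400 for the NG50/LG50 variant.
ng50, lg50 = nx_stats(toy, reference_size=400)
print(n50, l50, ng50, lg50)
```

In the toy case the two longest scaffolds (80 + 70) reach half of the 300-base assembly, so N50 is 70 and L50 is 2. Against a 400-base reference three scaffolds are needed, so NG50 is 50 and LG50 is 3.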
CEGMA analysis of 248 ultra-conserved genes
SGA Assembly
              # prots   % Completeness   # Total   Average   % Ortho
Complete      126       50.81            156       1.24      18.25
  Group 1     26        39.39            32        1.23      11.54
  Group 2     32        57.14            41        1.28      21.88
  Group 3     30        49.18            39        1.3       26.67
  Group 4     38        58.46            44        1.16      13.16
Partial       166       66.94            237       1.43      31.33
  Group 1     41        62.12            54        1.32      21.95
  Group 2     40        71.43            56        1.4       30
  Group 3     40        65.57            63        1.57      40
  Group 4     45        69.23            64        1.42      33.33
SOAPdenovo Assembly
              # prots   % Completeness   # Total   Average   % Ortho
Complete      183       73.79            268       1.46      28.42
  Group 1     43        65.15            60        1.4       32.56
  Group 2     44        78.57            59        1.34      20.45
  Group 3     48        78.69            76        1.58      29.17
  Group 4     48        73.85            73        1.52      31.25
Partial       226       91.13            413       1.83      48.23
  Group 1     61        92.42            105       1.72      49.18
  Group 2     52        92.86            94        1.81      46.15
  Group 3     55        90.16            111       2.02      49.09
  Group 4     58        89.23            103       1.78      48.28
These results are based on the set of genes selected by Genis Parra
# prots = number of the 248 ultra-conserved CEGs present in the genome
% Completeness = percentage of the 248 ultra-conserved CEGs present
# Total = total number of CEGs present, including putative orthologs
Average = average number of orthologs per CEG
% Ortho = percentage of detected CEGs that have more than 1 ortholog
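The definitions above imply simple arithmetic that can be checked against the reported rows. The sketch below recomputes % Completeness and Average for the "Complete" totals of both assemblies. The function names are hypothetical, and the assumption that values are rounded to two decimal places is inferred from the tables.

```python
N_CEGS = 248  # size of the ultra-conserved CEG set


def completeness(prots, n_cegs=N_CEGS):
    """Percentage of the CEG set detected, rounded to two decimals."""
    return round(100 * prots / n_cegs, 2)


def average_orthologs(total, prots):
    """Mean number of hits (including putative orthologs) per detected CEG."""
    return round(total / prots, 2)


# (# prots, # Total) for the "Complete" rows of each assembly, from the tables above.
complete_rows = {"SGA": (126, 156), "SOAPdenovo": (183, 268)}

summary = {
    name: (completeness(prots), average_orthologs(total, prots))
    for name, (prots, total) in complete_rows.items()
}
print(summary)
```

Both recomputed pairs agree with the tabulated values (50.81 and 1.24 for SGA, 73.79 and 1.46 for SOAPdenovo).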
Figure 1. Reduction in genetic diversity (Hz, MNA) following six generations of inbreeding
(Error bars represent ±1 S.D.).