Sequencing and De novo Draft Assemblies of the Fathead Minnow (Pimephales promelas) Genome

Burns FR†, Cogburn LA†, Ankley GT‡, Villeneuve DL‡, Waits E§, Chang YJǁ, Llaca V#, Deschamps S#, Jackson R††, Hoke RA*†

†E.I. du Pont de Nemours, Haskell Global Centers for Health and Environmental Sciences, Newark, DE, USA; ‡U.S. Environmental Protection Agency, Mid-Continent Ecology Division, Duluth, MN, USA; §U.S. Environmental Protection Agency, Cincinnati, Ohio, USA; ǁHigh-Performance Biological Computing, University of Illinois at Urbana-Champaign, Urbana, IL, USA; #E.I. du Pont de Nemours, Agricultural Biotechnology, Wilmington, DE; ††E.I. du Pont de Nemours, Central Research and Development Biotechnology, Wilmington, DE.

Supplementary Information

Supplementary Methods:

Library Construction and sequence generation on HiSeq 2000

Four next-generation DNA sequencing (NGS) libraries with varying average DNA fragment sizes were made: 1) a paired-end NGS library with an average fragment size of 180 bases; 2) two “mate-pair” NGS libraries with average fragment sizes of 3 Kb and 6 Kb; and 3) a “fosmid” library with an average fragment size of 40 Kb. The paired-end 180 bp and mate-pair 3 Kb and 6 Kb NGS libraries were made following standard Illumina protocol recommendations.

Approximately 2 µg genomic DNA were sheared on a Covaris system (Woburn, MA) to an average of 180 bases. End repair, A base addition and ligation of adapters were performed according to the protocol developed by Illumina for preparing DNA samples for paired-end sequencing. Size selection of adapter-ligated DNA fragments was performed on a LabChip XT fractionation system (Caliper). Two consecutive size-selection steps were performed, yielding DNA fragments ranging from approximately 140 to 200 bases and peaking at 170 bases.
For PCR amplification, 1 µL of size-selected DNA was incubated in 50 µL of 1X Phusion HF master mix (Finnzymes) containing 10 µM PCR primers PE 1.0 and PE 2.0 (Illumina). After PCR amplification (30 s at 98°C, followed by 14 cycles of 40 s at 98°C, 30 s at 65°C, 30 s at 72°C, and a final extension step of 5 min at 72°C), DNA was purified with a MinElute PCR purification kit (Qiagen) and resuspended in 10 µl EB buffer (Qiagen).

The mate-pair libraries were made after shearing high-molecular-weight DNA with a HydroShear system (Genomic Solutions). The resulting fragments were subjected to end repair, biotin labeling and gel-based size selection, according to a protocol developed by Illumina for preparing mate-pair libraries. Buffer and enzymatic reagents were obtained directly from Illumina’s 2-5 Kb mate pair library preparation kit. The size-selected DNA was circularized and sheared on a Covaris system (Woburn, MA). Following selection, the purified biotinylated DNA fragments were subjected to end repair, A base addition and ligation of adapters, and a final PCR amplification step was performed using conditions similar to those used to create the 180-base library. After amplification, DNA was resuspended in 10 µl EB buffer (Qiagen).

To create the 40 Kb library, approximately 30 µg high-molecular-weight DNA were initially subjected to hydrodynamic shearing on a HydroShear system (Genomic Solutions) using a large shearing assembly set. DNA was sheared to an average of approximately 40 Kb, as confirmed by CHEF gel electrophoresis. Fragments were treated with the DNA Terminator End Repair kit (Lucigen) to create phosphorylated blunt ends, and then size-selected by CHEF gel electrophoresis in 1.2% low-melting-point agarose on a BioRad CHEF Mapper system. Fragments with lengths of 30 to 100 Kb were excised from the gel and recovered by digesting the gel slice with Gelase (Epicentre), following the manufacturer’s conditions.
Recovered DNA was ligated to linearized, de-phosphorylated pFOS NGS vector using the CopyRight® pNGS Fosmid Cloning Kit (Lucigen). Ligated DNA was then lambda-packaged using Gigapack III Gold (Agilent) and used to transfect FOS replicator E. coli cells. Transfected cells were frozen at -80°C in LB + 25% glycerol. The library was titered and amplified at high density (~50,000 cfu/in²) in 22 x 22 cm bioassay trays with YT-agar + chloramphenicol. Bioassay trays were incubated for 20 hrs and placed at 4°C for 3 hours; the clones were then scraped, combined into a single 250-ml LB + chloramphenicol + 1X arabinose solution, and incubated at 37°C for 1 hour. DNA from the pooled clones was purified as a single sample using a large-insert DNA extraction kit (Qiagen). Two 5-µg aliquots were completely digested separately with BfaI and CviQI. The vector fragments from each sample were purified from a 1.2% SYBR Safe gel (Invitrogen) and self-ligated at low concentration (0.8 ng/µl) using CloneSmart® DNA Ligase (Lucigen). The self-ligated samples were PCR-amplified using Phusion HF master mix and the following primers: 1) Illum-A: AAG CAG AAG ACG GCA TAC GAG; 2) Illum-B: AAT GAT ACG GCG ACC ACC GAG. Samples were purified twice using Ampure XT magnetic beads and quantified in a Bioanalyzer (Agilent).

Library construction and sequence generation on MiSeq

DNA was fragmented using a Covaris S220 Ultrasonicator (Covaris, Woburn, MA 18081). The settings used for fragmentation are in the table below:

Parameter                  Setting
Duty Factor                5%
Peak Incident Power (W)    175
Cycles per Burst           200
Time (Seconds)             25

The fragmented DNA was characterized for size distribution using a BioAnalyzer 2100 (Agilent, Santa Clara, CA 95051). Fragment sizes were determined to be 600-1000 bp. The fragmented DNA was then purified and adapters for sequencing were ligated as per the instruction manual for the TruSeq DNA PCR-Free Sample Prep kit (Part #FC-121-3001; Illumina, San Diego, CA 92122).
The resultant sample library was sequenced in two runs on the Illumina MiSeq using 2 x 250 base, paired-end settings.

SOAPdenovo assembly parameters:

    SOAPdenovo-63mer pregraph -s config.file -K 25 -p 24 -o dupont_25 -R 1> pregraph.log 2> pregraph.err
    SOAPdenovo-63mer contig -s config.file -g dupont_25 -m 43 -p 24 -R 1> contig.log 2> contig.err
    SOAPdenovo-63mer map -s config.file -g dupont_25 -k 33 -p 12 -R 1> map.log 2> map.err
    SOAPdenovo-63mer scaff -p 12 -F -L 100 -g dupont_25 1> scaff.log 2> scaff.err

The two paired-end libraries were used for gap closing. Scaffolds below 1 kb were excluded from the final assembly file.

Config.file (the following is the content of the config file):

    max_rd_len=240
    [LIB]
    avg_ins=180
    reverse_seq=0
    asm_flags=3
    rank=1
    q1=A6730001_paired_r1.fastq.cor.pair_1.fq
    q2=A6730001_paired_r2.fastq.cor.pair_2.fq
    q1=A6730005_paired_r1.fastq.cor.pair_1.fq
    q2=A6730005_paired_r2.fastq.cor.pair_2.fq
    [LIB]
    avg_ins=600
    reverse_seq=0
    asm_flags=3
    rank=2
    q1=FatHeadMinnow_S1_L001_R1_001-A_pairedOut.trimR.fastq.cor.pair_1.fq
    q2=FatHeadMinnow_S1_L001_R2_001-A_pairedOut.trimR.fastq.cor.pair_2.fq
    q1=FatHeadMinnow_S1_L001_R1_001_pairedOut.trimR.fastq.cor.pair_1.fq
    q2=FatHeadMinnow_S1_L001_R2_001_pairedOut.trimR.fastq.cor.pair_2.fq
    [LIB]
    avg_ins=3000
    reverse_seq=1
    asm_flags=2
    rank=3
    q1=A6730002_paired_r1.fastq
    q2=A6730002_paired_r2.fastq
    [LIB]
    avg_ins=6000
    reverse_seq=1
    asm_flags=2
    rank=4
    pair_num_cutoff=5
    q1=A6730003_paired_r1.fastq
    q2=A6730003_paired_r2.fastq
    [LIB]
    avg_ins=40000
    reverse_seq=1
    asm_flags=2
    rank=5
    q1=A6730004_paired_r1.fastq
    q2=A6730004_paired_r2.fastq

SGA assembly parameters:

    sga preprocess --pe-mode 1 -o FHM_180.fastq 180_r1.fastq 180_r2.fastq
    sga index -a ropebwt --no-reverse -t 16 FHM_180.fastq
    sga correct -k 31 --discard --learn -t 16 -o reads.ec.k31.fastq FHM_180.fastq
    sga index -a ropebwt -t 16 reads.ec.k31.fastq
    sga filter -x 2 -t 16 --homopolymer-check --low-complexity-check reads.ec.k31.fastq
    sga fm-merge -m 55 -t 16 -o merged.k31.fa reads.ec.k31.filter.pass.fa
    sga index -d 1000000 -t 16 merged.k31.fa
    sga rmdup -t 16 merged.k31.fa
    sga overlap -m 55 -t 16 merged.k31.rmdup.fa
    sga assemble -m 75 -g 0 -r 10 -o assemble.m75 merged.k31.rmdup.asqg.gz
    bwa index assemble.m75.contigs.fa
    bwa aln -t 16 assemble.m75.contigs.fa 180_r1.fastq > 180_r1.fastq.sai
    bwa aln -t 16 assemble.m75.contigs.fa 180_r2.fastq > 180_r2.fastq.sai
    bwa sampe assemble.m75.contigs.fa 180_r1.fastq.sai 180_r2.fastq.sai 180_r1.fastq 180_r2.fastq | samtools view -Sb - > 180bp.bam

*Repeat the bwa alignment steps above for each library used for scaffolding.

    sga-bam2de.pl --prefix lib.fragment.180bp -n 5 -m 200 lib.fragment.180bp.bam
    sga-bam2de.pl --prefix lib.fragment.250bp -n 5 -m 200 lib.fragment.250bp.bam
    sga-bam2de.pl --prefix lib.matepair.3kbp -n 5 -m 200 lib.matepair.3kbp.bam
    sga-bam2de.pl --prefix lib.matepair.6kbp -n 5 -m 200 lib.matepair.6kbp.bam
    sga-bam2de.pl --prefix lib.matepair.40kbp -n 5 -m 200 lib.matepair.40kbp.bam
    samtools sort lib.fragment.180bp.bam lib.fragment.180bp.refsort
    sga-astat.py -m 200 lib.fragment.180bp.refsort.bam > contigs.astat
    sga scaffold -m 200 -a contigs.astat --pe lib.fragment.180bp.de --pe lib.fragment.250bp.de --mate lib.matepair.3kb.de --mate lib.matepair.6kb.de --mate lib.matepair.40kb.de -o multiple.libs.scaf contigs.fa
    sga scaffold2fasta --write-unplaced -m 200 -o FHM_scaffolds.fa --use-overlap -a assemble.m75graph.asqg.gz multiple.libs.scaf

Assembly statistics:

Statistic                                         SGA                 SOAP
Number of scaffolds                               810921              73057
Total size of scaffolds                           957809772           1219326373
Longest scaffold                                  811454              580801
Shortest scaffold                                 200                 1000
Number of scaffolds > 500 nt                      177416 (21.9%)      73057 (100.0%)
Number of scaffolds > 1K nt                       66983 (8.3%)        73011 (99.9%)
Number of scaffolds > 10K nt                      16868 (2.1%)        19302 (26.4%)
Number of scaffolds > 100K nt                     746 (0.1%)          2265 (3.1%)
Number of scaffolds > 1M nt                       0 (0.0%)            0 (0.0%)
Mean scaffold size                                1181                16690
Median scaffold size                              307                 3513
N50 scaffold length                               15414               60380
L50 scaffold count                                11491               5505
NG50 scaffold length                              147007              240449
LG50 scaffold count                               263                 178
N50 scaffold - NG50 scaffold length difference    131593              180069
scaffold %A                                       26.29               20.67
scaffold %C                                       16.19               12.62
scaffold %G                                       16.15               12.59
scaffold %T                                       26.31               20.63
scaffold %N                                       15.07               33.48
scaffold %non-ACGTN                               0.00                0.00
Number of scaffold non-ACGTN nt                   0                   0
Percentage of assembly in scaffolded contigs      67.90%              94.9%
Percentage of assembly in unscaffolded contigs    32.10%              5.1%
Average number of contigs per scaffold            1.2                 2.9

CEGMA analysis of 248 ultra-conserved genes

SGA Assembly
            # Prots   % Completeness   # Total   Average   % Ortho
Complete    126       50.81            156       1.24      18.25
  Group 1   26        39.39            32        1.23      11.54
  Group 2   32        57.14            41        1.28      21.88
  Group 3   30        49.18            39        1.3       26.67
  Group 4   38        58.46            44        1.16      13.16
Partial     166       66.94            237       1.43      31.33
  Group 1   41        62.12            54        1.32      21.95
  Group 2   40        71.43            56        1.4       30
  Group 3   40        65.57            63        1.57      40
  Group 4   45        69.23            64        1.42      33.33

SOAPdenovo Assembly
            # Prots   % Completeness   # Total   Average   % Ortho
Complete    183       73.79            268       1.46      28.42
  Group 1   43        65.15            60        1.4       32.56
  Group 2   44        78.57            59        1.34      20.45
  Group 3   48        78.69            76        1.58      29.17
  Group 4   48        73.85            73        1.52      31.25
Partial     226       91.13            413       1.83      48.23
  Group 1   61        92.42            105       1.72      49.18
  Group 2   52        92.86            94        1.81      46.15
  Group 3   55        90.16            111       2.02      49.09
  Group 4   58        89.23            103       1.78      48.28

These results are based on the set of genes selected by Genis Parra.
Prots = number of 248 ultra-conserved CEGs present in genome
% Completeness = percentage of 248 ultra-conserved CEGs present
Total = total number of CEGs present including putative orthologs
Average = average number of orthologs per CEG
% Ortho = percentage of detected CEGs that have more than 1 ortholog

Figure 1. Reduction in genetic diversity (Hz, MNA) following six generations of inbreeding (Error bars represent ±1 S.D.).
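
For readers who want to recompute the summary values, the N50/L50 and NG50/LG50 figures in the assembly statistics and the CEGMA "% Completeness" column can be derived as in the following minimal Python sketch. It is not the script used to produce the tables above. The function names nx_lx and cegma_completeness are hypothetical, the toy scaffold lengths are invented for illustration, and the genome_size argument stands in for an estimated genome size that is not stated in this supplement.

```python
def nx_lx(lengths, fraction=0.5, genome_size=None):
    """Return (Nx length, Lx count) for a list of scaffold lengths.

    With genome_size=None the reference is the total assembly size (N50/L50).
    With an estimated genome size (an assumption supplied by the caller) the
    reference is that size instead (NG50/LG50).
    Returns None if the scaffolds never reach the target.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    reference = genome_size if genome_size is not None else sum(lengths)
    target = fraction * reference
    running = 0
    # Walk scaffolds from longest to shortest until the target is covered.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return None


def cegma_completeness(n_found, n_cegs=248):
    """Percentage of the n_cegs core genes that were detected."""
    return round(100.0 * n_found / n_cegs, 2)


# Toy scaffold lengths, invented for illustration only.
toy = [80, 70, 50, 40, 30, 20, 10]
print(nx_lx(toy))                   # (70, 2): N50 and L50
print(nx_lx(toy, genome_size=400))  # (50, 3): NG50 and LG50
print(cegma_completeness(183))      # 73.79, the SOAPdenovo "Complete" row
```

When the assumed genome size is smaller than the total assembly size, NG50 comes out larger than N50, because fewer of the longest scaffolds are needed to reach half of the reference length.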