Automating genome assembly: from contigs to annotated

Supplementary Methods. PAGIT: two worked examples Here we give a synopsis of the work-flow used for the two examples discussed in the “Anticipated Results” section. E. coli data and initial assembly. The reads for Escherichia coli K-12 strain MG1655 were obtained from the Short Read Archive at the NCBI (run ERR008613 under study ERP000092). They were prepared by Illumina as a test data set for development of de novo sequence assembly algorithms and related tools. They consisted of 101 bp paired-end Illumina reads. The NCBI was searched for “Escherichia coli” in order to find the identifier of a suitable reference genome: “U00096” was found for strain K-12 substr. MG1655, and the sequence file U00096.fna was downloaded. The EBI was then searched with “U00096” and the annotation in EMBL format was downloaded. Before applying the PAGIT protocol, we performed a de novo assembly. Velvet version 1.1.06 compiled to support k-mers of up to 99 bp was used to generate an initial assembly. The reads came in two files (“s_6_1.fastq” and “s_6_2.fastq”) which we first “shuffled” together using the appropriate Velvet perl script. We then ran “velveth”, followed by “velvetg” using a number of different k-mers. The best results were obtained with a k-mer of 81. A synopsis of the commands we used is as follows (for more detail please see the Velvet manual): /path/to/velvet/shuffleSequences_fastq.pl s_6_1.fastq s_6_2.fastq shuffled.fastq /path/to/velvet/velveth K81 81 -shortPaired -fastq shuffled.fastq /path/to/velvet/velvetg K81 -exp_cov auto The result of this assembly was 182 scaffolds with a combined sum of 4.58 Mb, a scaffold N50 of 70.3 Kb and an average size of 25.2 Kb. The largest scaffold was 209.8 Kb. Applying PAGIT to the E. coli example. ABACAS was executed with the following command: abacas.pl -b -m -r U00096.fna -q contigs.fa -p nucmer -o U96mapped After joining the results (ordered contigs and bin), IMAGE was initially executed with the following command: image.pl -scaffolds ABACAS.fasta -prefix s_6 -iteration 1 all_iteration 1 -dir_prefix ite -kmer 81 The initial IMAGE run took about 4 hours to complete. The following commands were then executed one after another (they were pasted into a file, the file was made executable and then executed). For these runs, each iteration took about 20 minutes: restartIMAGE.pl restartIMAGE.pl restartIMAGE.pl restartIMAGE.pl restartIMAGE.pl ite3 71 3 partitioned ite6 61 3 partitioned ite9 51 3 partitioned ite12 41 3 partitioned ite15 31 3 partitioned The following command was used to execute ICORN: icorn.start.sanger.sh scaffolds18.fa 1 6 s_6_1.fastq s_6_2.fastq 400,800 600 The following command was used to execute RATT: $RATT_HOME/start.ratt.sh ./EMBL scaffolds18.fa.7 ratted Strain.Global To generate Figure 6 the following commands were used to first create a BLAST comparison file, and then to load the files into ACT: formatdb -p F -i U00096.fna blastall -p blastn -m 8 -e 1e-40 -d U00096.fna -i Sequences/ ordered_gi.48994873 -o comp.blast act EMBL/ U00096.embl comp.blast ratted.ordered_gi.48994873.gb.U00096.2..final.embl Initially ACT displayed the reference genome “U00096.embl” on top with the newly annotated genome below, and syntenic blocks between them indicated in red. Subsequently, the original entry “U00096.embl” was disabled and the file with non-transferred annotations, “ratt.U00096.NOTTransfered.embl”, was loaded in its place (this was performed via the following ACT menus: Entries -> U00096.embl -> disable U00096.embl). C. Trachomatis data The data used here (EMBL files, 454 assembly and the Illumina reads) can be found on the PAGIT website: ftp://ftp.sanger.ac.uk/pub4/resources/software/pagit/PAGIT.ChlamydiaExample.tgz We took 15% of the reads to get coverage of around 100x from the library ERS001402 (available at the NCBI in the Short Read Archive), which is a 54 bp Illumina paired-end library with around 40% of mouse contamination. The reference genome (AM884177) and its plasmid (AM886279) were downloaded from the EBI. We saved the two EMBL files into two different directories “dir1” and “dir2”. Next we extracted the FASTA sequence for each EMBL file with: perl $PAGIT_HOME/RATT/main.ratt.pl Embl2Fasta dir1 AM884177.fasta perl $PAGIT_HOME/RATT/main.ratt.pl Embl2Fasta dir2 AM886279.fasta Applying PAGIT to the C. Trachomatis example We generated a working directory for ABACAS and linked the FASTA files into it. As we have two reference sequences, we first mapped the contigs against the nuclear reference genome, and then ordered the contigs in the bin against the plasmid genome: ln -s ../*.fasta . ln -s ../454LargeContigs.fna perl $PAGIT_HOME/ABACAS/abacas.pl -r AM884177.fasta -q 454LargeContigs.fna -p nucmer -c -m -b -o Res.Nuc perl $PAGIT_HOME/ABACAS/abacas.pl -r AM886279.fasta -q Res.Nuc.contigsInbin.fas -p nucmer -c -m -b -o Res.Plasmid Most prokaryotic genomes consist of a single circular chromosome. Circular genomes are represented linearly in sequence files with the origin of replication at the start: this can cause problems when using them in mapping or alignment tasks due to the artificially created ends of the linear sequence. When initially assembling a circular genome it is usually the case that the origin of replication is located in the middle of a contig: hence there may be problems when aligning that contig to the reference genome. For this use-case, based on the assumption that the origin of replication is annotated correctly in the reference genome, we used ACT to identify the origin of replication in our query genome. On examining the synteny between the origin of replication in the reference to our new sequence, we found the following: “ATGACAAGGCTTCCATTACT”. We then loaded the ABACAS output file “Res.Nuc.fasta” into an editor and searched for those bases. On finding the pattern, we inserted a new line before it with a FASTA header (for example “>manual Cut”). In this way we manually divided the sequence into two according to where we suspect the origin of replication to be. Next we reran the first ABACAS command, changing the input to use the new split sequence of the ABACAS pseudomolecule (just 2 sequences to map). This new mapping will have gained an additional gap, but the advantage is that the gaps can be more accurately closed by IMAGE. The seven unmapped contigs from the first ABACAS execution were mapped against the plasmid sequence, giving a plasmid pseudomolecule. We joined the two pseudo contigs (from the genome and the plasmid) and the plasmid bin contigs. Now we have a pseudo sequence for the nucleus genome and the plasmid, rather than a joined sequence. Note that we could use the plasmid bin contigs for further analysis. Here is a synopsis of the commands used: formatdb -p F -i AM884177.fasta blastall -p blastn -m 8 -e 1e-40 -d AM884177.fasta -i Res.Nuc.fasta -o comp.Nuc.crunch act AM884177.fasta comp.Nuc.crunch Res.Nuc.fasta perl $PAGIT_HOME/ABACAS/abacas.pl -r AM884177.fasta -q Res.Nuc.fasta -p nucmer -c -m -b -o Res.NucOrigin cat Res.NucOrigin.fasta Res.Plasmid.fasta Res.Plasmid.contigsInbin.fas > Res.abacas.fasta IMAGE was executed with two different k-mers. A synopsis of the commands used is as follows: perl $PAGIT_HOME/IMAGE/image.pl -scaffolds Res.abacas.fasta -prefix 3807_partial -iteration 1 -all_iteration 3 -dir_prefix ite -kmer 49 -vel_ins_len 300 -smalt_minScore 45 > output.txt perl $PAGIT_HOME/IMAGE/restartIMAGE.pl ite3 31 12 partitioned image_run_summary.pl ite perl $PAGIT_HOME/IMAGE/contigs2scaffolds.pl ite9/new.fa ite9/new.read.placed 300 10 Res.image After IMAGE, ICORN was executed for 4 iterations. A synopsis of the commands used is as follows: icorn.start.sh Res.image.fa 1 4 3807_partial_1.fastq 3807_partial_2.fastq 80,600 300 icorn.CollectResults.sh Res.image.fa 3807 We ran RATT using the sequences output from the 4th (and final) iteration of ICORN using the following command: start.ratt.sh embl/ Res.image.fa.4 TransferPAGIT Strain > ratt.output.TransferPAGIT.txt For transferring onto the original 454 assembly we ran the following command: start.ratt.sh embl/ 454LargeContigs.fna Transfer454 Strain > ratt.output.Transfer454.txt To gather the general transfer statistics we used the following command: grep -iB1 -A1 "Gene model" ratt.out*txt The command used for summarizing statistics on the incorrect models was the following. It prints all the lines from the report files of the 454 transfer. The grep -v Gene_ID will delete the header line of each file. The wc -l counts the amount of lines, therefore the amount of gene models that need correction after the transfer. cat Transfer454*.Report.txt | grep -v Gene_ID | wc -l And similarly for the PAGIT sequence: cat TransferPAGIT*.Report.txt | grep -v Gene_ID | wc -l To gather statistics on the transferred models with frameshifts on the 454 contigs we used the following command. The difference to the previous command is just to count where the fifth column has more than two counts. The fifth column is the number of stop codons in the codon regions: therefore it is likely to be a frameshift if the number of stop codons is higher than 1. cat Transfer454*.Report.txt | awk '$5>2' | grep -v Gene_ID | wc And similarly for the PAGIT sequence: cat TransferPAGIT*.Report.txt | awk '$5>2' | grep -v Gene_ID | wc

Automating genome assembly: from contigs to annotated

Related documents

Products

Support

Automating genome assembly: from contigs to annotated

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib