Automating genome assembly: from contigs to annotated

advertisement
Supplementary Methods.
PAGIT: two worked examples
Here we give a synopsis of the work-flow used for the two examples discussed in the “Anticipated
Results” section.
E. coli data and initial assembly.
The reads for Escherichia coli K-12 strain MG1655 were obtained from the Short Read Archive at the
NCBI (run ERR008613 under study ERP000092). They were prepared by Illumina as a test data set for
development of de novo sequence assembly algorithms and related tools. They consisted of 101 bp
paired-end Illumina reads.
The NCBI was searched for “Escherichia coli” in order to find the identifier of a suitable reference
genome: “U00096” was found for strain K-12 substr. MG1655, and the sequence file U00096.fna was
downloaded. The EBI was then searched with “U00096” and the annotation in EMBL format was
downloaded.
Before applying the PAGIT protocol, we performed a de novo assembly. Velvet version 1.1.06
compiled to support k-mers of up to 99 bp was used to generate an initial assembly. The reads came
in two files (“s_6_1.fastq” and “s_6_2.fastq”) which we first “shuffled” together using the
appropriate Velvet perl script. We then ran “velveth”, followed by “velvetg” using a number of
different k-mers. The best results were obtained with a k-mer of 81. A synopsis of the commands we
used is as follows (for more detail please see the Velvet manual):
/path/to/velvet/shuffleSequences_fastq.pl s_6_1.fastq s_6_2.fastq
shuffled.fastq
/path/to/velvet/velveth K81 81 -shortPaired -fastq shuffled.fastq
/path/to/velvet/velvetg K81 -exp_cov auto
The result of this assembly was 182 scaffolds with a combined sum of 4.58 Mb, a scaffold N50 of
70.3 Kb and an average size of 25.2 Kb. The largest scaffold was 209.8 Kb.
Applying PAGIT to the E. coli example.
ABACAS was executed with the following command:
abacas.pl -b -m -r U00096.fna -q contigs.fa -p nucmer -o U96mapped
After joining the results (ordered contigs and bin), IMAGE was initially executed with the following
command:
image.pl -scaffolds ABACAS.fasta -prefix s_6 -iteration 1 all_iteration 1 -dir_prefix ite -kmer 81
The initial IMAGE run took about 4 hours to complete. The following commands were then executed
one after another (they were pasted into a file, the file was made executable and then executed).
For these runs, each iteration took about 20 minutes:
restartIMAGE.pl
restartIMAGE.pl
restartIMAGE.pl
restartIMAGE.pl
restartIMAGE.pl
ite3 71 3 partitioned
ite6 61 3 partitioned
ite9 51 3 partitioned
ite12 41 3 partitioned
ite15 31 3 partitioned
The following command was used to execute ICORN:
icorn.start.sanger.sh scaffolds18.fa 1 6 s_6_1.fastq s_6_2.fastq
400,800 600
The following command was used to execute RATT:
$RATT_HOME/start.ratt.sh ./EMBL scaffolds18.fa.7 ratted
Strain.Global
To generate Figure 6 the following commands were used to first create a BLAST comparison file, and
then to load the files into ACT:
formatdb -p F -i U00096.fna
blastall -p blastn -m 8 -e 1e-40 -d U00096.fna -i Sequences/
ordered_gi.48994873 -o comp.blast
act EMBL/ U00096.embl comp.blast
ratted.ordered_gi.48994873.gb.U00096.2..final.embl
Initially ACT displayed the reference genome “U00096.embl” on top with the newly annotated
genome below, and syntenic blocks between them indicated in red. Subsequently, the original entry
“U00096.embl” was disabled and the file with non-transferred annotations,
“ratt.U00096.NOTTransfered.embl”, was loaded in its place (this was performed via the following
ACT menus: Entries -> U00096.embl -> disable U00096.embl).
C. Trachomatis data
The data used here (EMBL files, 454 assembly and the Illumina reads) can be found on the PAGIT
website: ftp://ftp.sanger.ac.uk/pub4/resources/software/pagit/PAGIT.ChlamydiaExample.tgz
We took 15% of the reads to get coverage of around 100x from the library ERS001402 (available at
the NCBI in the Short Read Archive), which is a 54 bp Illumina paired-end library with around 40% of
mouse contamination.
The reference genome (AM884177) and its plasmid (AM886279) were downloaded from the EBI. We
saved the two EMBL files into two different directories “dir1” and “dir2”. Next we extracted the
FASTA sequence for each EMBL file with:
perl $PAGIT_HOME/RATT/main.ratt.pl Embl2Fasta dir1 AM884177.fasta
perl $PAGIT_HOME/RATT/main.ratt.pl Embl2Fasta dir2 AM886279.fasta
Applying PAGIT to the C. Trachomatis example
We generated a working directory for ABACAS and linked the FASTA files into it. As we have two
reference sequences, we first mapped the contigs against the nuclear reference genome, and then
ordered the contigs in the bin against the plasmid genome:
ln -s ../*.fasta .
ln -s ../454LargeContigs.fna
perl $PAGIT_HOME/ABACAS/abacas.pl -r AM884177.fasta -q
454LargeContigs.fna -p nucmer -c -m -b -o Res.Nuc
perl $PAGIT_HOME/ABACAS/abacas.pl -r AM886279.fasta -q
Res.Nuc.contigsInbin.fas -p nucmer -c -m -b -o Res.Plasmid
Most prokaryotic genomes consist of a single circular chromosome. Circular genomes are
represented linearly in sequence files with the origin of replication at the start: this can cause
problems when using them in mapping or alignment tasks due to the artificially created ends of the
linear sequence. When initially assembling a circular genome it is usually the case that the origin of
replication is located in the middle of a contig: hence there may be problems when aligning that
contig to the reference genome.
For this use-case, based on the assumption that the origin of replication is annotated correctly in the
reference genome, we used ACT to identify the origin of replication in our query genome. On
examining the synteny between the origin of replication in the reference to our new sequence, we
found the following: “ATGACAAGGCTTCCATTACT”. We then loaded the ABACAS output file
“Res.Nuc.fasta” into an editor and searched for those bases. On finding the pattern, we inserted a
new line before it with a FASTA header (for example “>manual Cut”). In this way we manually
divided the sequence into two according to where we suspect the origin of replication to be.
Next we reran the first ABACAS command, changing the input to use the new split sequence of the
ABACAS pseudomolecule (just 2 sequences to map). This new mapping will have gained an
additional gap, but the advantage is that the gaps can be more accurately closed by IMAGE.
The seven unmapped contigs from the first ABACAS execution were mapped against the plasmid
sequence, giving a plasmid pseudomolecule. We joined the two pseudo contigs (from the genome
and the plasmid) and the plasmid bin contigs. Now we have a pseudo sequence for the nucleus
genome and the plasmid, rather than a joined sequence. Note that we could use the plasmid bin
contigs for further analysis.
Here is a synopsis of the commands used:
formatdb -p F -i AM884177.fasta
blastall -p blastn -m 8 -e 1e-40 -d AM884177.fasta -i Res.Nuc.fasta
-o comp.Nuc.crunch
act AM884177.fasta comp.Nuc.crunch Res.Nuc.fasta
perl $PAGIT_HOME/ABACAS/abacas.pl -r AM884177.fasta -q Res.Nuc.fasta
-p nucmer -c -m -b -o Res.NucOrigin
cat Res.NucOrigin.fasta Res.Plasmid.fasta
Res.Plasmid.contigsInbin.fas > Res.abacas.fasta
IMAGE was executed with two different k-mers. A synopsis of the commands used is as follows:
perl $PAGIT_HOME/IMAGE/image.pl -scaffolds Res.abacas.fasta -prefix
3807_partial -iteration 1 -all_iteration 3 -dir_prefix ite -kmer 49
-vel_ins_len 300 -smalt_minScore 45 > output.txt
perl $PAGIT_HOME/IMAGE/restartIMAGE.pl ite3 31 12 partitioned
image_run_summary.pl ite
perl $PAGIT_HOME/IMAGE/contigs2scaffolds.pl ite9/new.fa
ite9/new.read.placed 300 10 Res.image
After IMAGE, ICORN was executed for 4 iterations. A synopsis of the commands used is as follows:
icorn.start.sh Res.image.fa 1 4 3807_partial_1.fastq
3807_partial_2.fastq 80,600 300
icorn.CollectResults.sh Res.image.fa 3807
We ran RATT using the sequences output from the 4th (and final) iteration of ICORN using the
following command:
start.ratt.sh embl/ Res.image.fa.4 TransferPAGIT Strain >
ratt.output.TransferPAGIT.txt
For transferring onto the original 454 assembly we ran the following command:
start.ratt.sh embl/ 454LargeContigs.fna Transfer454 Strain >
ratt.output.Transfer454.txt
To gather the general transfer statistics we used the following command:
grep -iB1 -A1 "Gene model" ratt.out*txt
The command used for summarizing statistics on the incorrect models was the following. It prints all
the lines from the report files of the 454 transfer. The grep -v Gene_ID will delete the header
line of each file. The wc -l counts the amount of lines, therefore the amount of gene models that
need correction after the transfer.
cat Transfer454*.Report.txt | grep -v Gene_ID | wc -l
And similarly for the PAGIT sequence:
cat TransferPAGIT*.Report.txt | grep -v Gene_ID | wc -l
To gather statistics on the transferred models with frameshifts on the 454 contigs we used the
following command. The difference to the previous command is just to count where the fifth column
has more than two counts. The fifth column is the number of stop codons in the codon regions:
therefore it is likely to be a frameshift if the number of stop codons is higher than 1.
cat Transfer454*.Report.txt | awk '$5>2' | grep -v Gene_ID | wc
And similarly for the PAGIT sequence:
cat TransferPAGIT*.Report.txt | awk '$5>2' | grep -v Gene_ID | wc
Download