Lab Exercise: de novo assembly and genome alignment

Software used:
ABySS
perl script (fac.pl)
Mauve
Overview: You will assemble a single bacterial genome several times using ABySS. The genome
sequence data is from Enterobacter cloacae subsp. cloacae strain ATCC 13047 (Ecl13047), and was
collected as part of a Perna lab NSF Tree of Life project using a Roche 454 GS FLX instrument and
Titanium chemistry. We have single end random shotgun data (1/8 plate) and paired end data (1/8
plate). You will try assembling the data with and without taking advantage of the paired end
information. ABySS is a de Bruijn graph-based assembler, and you will explore several values of the key
k-mer parameter. To evaluate the assembly results, you will generate summary statistics using a perl
script. The genome of this same strain has been sequenced to completion and fully assembled by another
group. Using the genome alignment software Mauve, you will compare your assemblies to this complete
genome (chromosome and 2 plasmids) and to another assembly of the same data generated with the
Roche 454 assembler, Newbler. Mauve optimizes collinearity of the assemblies by reordering the
contigs.
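To make the role of the k-mer parameter concrete, here is a toy sketch of how reads are broken into k-mers and linked into a de Bruijn graph. It is only an illustration, not how ABySS is implemented: the reads are made up, the function names are hypothetical, and it ignores reverse complements and sequencing errors.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def de_bruijn_edges(reads, k):
    """Link each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two made-up reads that overlap by five bases (GGCGT).
reads = ["ATGGCGT", "GGCGTGC"]
graph = de_bruijn_edges(reads, k=4)
for prefix in sorted(graph):
    print(prefix, "->", sorted(graph[prefix]))
```

With a small k, short repeats make several paths share the same nodes; with a large k, two reads must overlap by at least k bases to be connected. This trade-off is what you will explore in Part 4.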
In addition to this document, the lab exercise folder contains the following files:
all the data, with paired end treated as unpaired: 454_All_Reads.fas
paired end data: 454_PE_Reads_1.fas and 454_PE_Reads_2.fas
single end data: 454_WGS_Reads.fas
the contigs from a Newbler assembly of the same data: Ecl13047_Newbler.fas
a concatenated GenBank file of the completed Ecl13047 genome: Ecl13047.gbk
a Mauve alignment of the Newbler contigs to the GenBank entries: Newbler-vs-GenBank
the genome alignment software, Mauve
We will assume you have downloaded the Genetics875_Assembly folder to your computer’s Desktop for
these exercises. ABySS and the perl script are command line programs, so you will be working in the
Terminal application (Applications:Utilities:Terminal). Launch Terminal; at the prompt you will type
commands and hit enter (return) to execute them. All commands are case-sensitive; you can copy-and-paste
from a text or Word document (such as this one) to the Terminal window if desired.
At the prompt type the following to navigate to the folder:
cd Desktop/Genetics875_Assembly
Part 1 – Assemble all data without paired-end information, using a k-mer length of 30.
Type the following command to start the assembly:
ABYSS -k30 454_All_Reads.fas -o Ecl13047.SE.30.fa > Ecl13047.SE.30.out
Initially nothing will seem to be happening, while the program reads in all the data. Then information
will be displayed on the screen as the program runs. When the assembly is complete, the command
prompt will return. We will use a perl script to quickly extract some information from the assembly, by
typing the following command:
fac.pl Ecl13047.SE.30.fa
The script will report the following information to the screen, which you will want to copy to your notes
(copy-and-paste works):
n = the total number of contigs
n:100 = the number of contigs at least 100 bp long
n:N50 = the number of contigs at least as long as the N50 value
min, median, mean, N50, and max = various contig length information
sum = the sum of the lengths of the contigs of at least 100 bp
note: except for n, all of the statistics are reported based only on the contigs of at least 100 bp
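The statistics listed above can be sketched in a few lines of Python. This is not the fac.pl script itself. It assumes the common definition of N50 (the contig length at which the running total, counted from the longest contig down, first reaches half of the summed length), and the function name and example lengths are hypothetical.

```python
from statistics import mean, median


def contig_stats(lengths, min_len=100):
    """Summarize contig lengths; all values except n use contigs >= min_len."""
    kept = sorted((x for x in lengths if x >= min_len), reverse=True)
    if not kept:
        return {"n": len(lengths), "n:100": 0}
    total = sum(kept)
    running = 0
    n50 = kept[-1]
    for length in kept:  # walk from the longest contig down
        running += length
        if running * 2 >= total:  # reached half of the total length
            n50 = length
            break
    return {
        "n": len(lengths),
        "n:100": len(kept),
        "n:N50": sum(1 for x in kept if x >= n50),
        "min": kept[-1],
        "median": median(kept),
        "mean": mean(kept),
        "N50": n50,
        "max": kept[0],
        "sum": total,
    }


# Made-up contig lengths in bp; the 80 bp contig falls below the cutoff.
stats = contig_stats([80, 150, 200, 300, 500, 850])
for key, value in stats.items():
    print(key, "=", value)
```

Because N50 weights contigs by their length, it is less sensitive than the mean to a large number of very short contigs.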
Q: How many contigs are in the assembly?
Q: What is the mean contig size? N50 contig size?
Part 2 – Re-assemble the data using the paired-end information, still using a k-mer length of 30
Type the following command to start the assembly; it should be a single line, despite the wrapping in
Word! Note that both the name of the ABySS program (abyss-pe) and the syntax are different.
abyss-pe k=30 n=5 name=Ecl13047.PE.30 in='454_PE_Reads_1.fas 454_PE_Reads_2.fas' se='454_WGS_Reads.fas' > Ecl13047.PE.30.out
This assembly will write more information to the screen as the program runs, and will take longer to
finish. Again, when the assembly is complete the command prompt will return. Run the perl script on
the results of the new assembly:
fac.pl Ecl13047.PE.30-contigs.fa
Q: How many contigs are in the assembly?
Q: What is the mean contig size? N50 contig size?
Q: What was the effect of utilizing paired end information?
Part 3 – Comparing assemblies using Mauve (can start while Part 2 is running)
Launch Mauve and, under the File menu, select Open alignment…. Navigate to the Desktop, then to
Genetics875_Assembly, and select Newbler-vs-GenBank.
Mauve will load the sequences and the alignment, which we will go over in class. Note that you can also
use the perl script to calculate some statistics on the Newbler assembly, back in the Terminal window:
fac.pl Ecl13047_Newbler.fas
Once your ABySS assemblies are done you will align them to the GenBank entry as well. Under the File
menu in Mauve, you can select Align with progressiveMauve… and load the GenBank and ABySS
sequences. However, given the larger number of contigs, we will use Mauve’s ability to re-order the
contigs. Under the Tools menu, select Move Contigs. You will be prompted to select a place to save files;
you can navigate to the Genetics875_Assembly folder again, and then in the File Name: space, append a
new folder name, e.g., /alignments, to the path for the output files. When you add the sequences, add
the GenBank file first, and then the ABySS contigs file. Mauve will run multiple iterations, attempting to
find the best ordering of the contigs relative to the complete genome.
Q: What can you say about each of the assemblies compared to the completed genome? What about
to each other?
Part 4 – Try to improve the ABySS assembly by changing the k-mer parameter.
Run ABySS on the complete dataset (paired end information included) using two different k-mer values.
You can type in the command line, or edit the following, changing the x to whatever value of k you wish
to use, and then copy-and-paste. ABySS only requires you to specify the k-mer value, but we are also
editing the output file name so different assemblies don’t overwrite each other’s results.
abyss-pe k=x n=5 name=Ecl13047.PE.x in='454_PE_Reads_1.fas 454_PE_Reads_2.fas' se='454_WGS_Reads.fas' > Ecl13047.PE.x.out
To use the perl script, be sure to specify which file you want to use:
fac.pl Ecl13047.PE.x-contigs.fa
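If you would rather not edit the command by hand for each value, a small helper can print the two command lines for every k you want to try. This is a convenience sketch only: the function name is hypothetical, the k values 25 and 35 are arbitrary examples, and the script prints the commands for you to paste into Terminal rather than running them.

```python
def abyss_pe_command(k):
    """Build the paired-end assembly command from this exercise for one k."""
    name = f"Ecl13047.PE.{k}"
    return (
        f"abyss-pe k={k} n=5 name={name} "
        "in='454_PE_Reads_1.fas 454_PE_Reads_2.fas' "
        "se='454_WGS_Reads.fas' "
        f"> {name}.out"
    )


def fac_command(k):
    """Build the matching summary-statistics command for one k."""
    return f"fac.pl Ecl13047.PE.{k}-contigs.fa"


# Arbitrary example values; substitute the ones your group agreed on.
for k in (25, 35):
    print(abyss_pe_command(k))
    print(fac_command(k))
```

Putting k in the name= value keeps each assembly's output files separate, for the same reason given above.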
Team up with your neighbors to explore a larger number of k-mer values. Discuss this in advance and
predict what you expect to happen:
Q: What criteria will you use to decide which assembly is better?
Q: Will increasing or decreasing the k-mer length improve the assembly? Why?
Q: Will the best ABySS assembly be better or worse than the Newbler assembly? Why?
Run the assemblies.
Q: Which k-mer value was best?
Q: Were your predictions correct?