Lab Exercise: de novo assembly and genome alignment

Software used:
ABySS
perl script (fac.pl)
Mauve

Overview:
You will assemble a single bacterial genome several times using ABySS. The genome sequence data are from Enterobacter cloacae subsp. cloacae strain ATCC 13047 (Ecl13047), and were collected as part of a Perna lab NSF Tree of Life project using a Roche 454 GS FLX instrument and Titanium chemistry. We have single-end random shotgun data (1/8 plate) and paired-end data (1/8 plate). You will try assembling the data with and without taking advantage of the paired-end information. ABySS is a de Bruijn graph-based assembler, and you will explore several values of its key parameter, the k-mer length. To evaluate the assembly results, you will generate summary statistics using a perl script.

The genome of this same strain has been sequenced to completion and fully assembled by another group. Using the genome alignment software Mauve, you will compare your assemblies to this complete genome (chromosome and 2 plasmids) and to another assembly of the same data generated with the Roche 454 assembler, Newbler. Mauve optimizes collinearity of the assemblies by reordering the contigs.

In addition to this document, the lab exercise folder contains the following files:
all the data, with paired end treated as unpaired: 454_All_Reads.fas
paired end data: 454_PE_Reads_1.fas and 454_PE_Reads_2.fas
single end data: 454_WGS_Reads.fas
the contigs from a Newbler assembly of the same data: Ecl13047_Newbler.fas
a concatenated GenBank file of the completed Ecl13047 genome: Ecl13047.gbk
a Mauve alignment of the Newbler contigs to the GenBank entries: Newbler-vs-GenBank
the genome alignment software, Mauve

We will assume you have downloaded the Genetics875_Assembly folder to your computer’s Desktop for these exercises.

ABySS and the perl script are command line programs, so you will be working in the Terminal application (Applications:Utilities:Terminal).
Launch Terminal. At the prompt, type each command and hit enter (return) to execute it. All commands are case-sensitive; you can copy-and-paste from a text or Word document (such as this one) into the Terminal window if desired.

At the prompt, type the following to navigate to the folder:

cd Desktop/Genetics875_Assembly

Part 1 – Assemble all data without paired-end information, using a k-mer length of 30.

Type the following command to start the assembly:

ABYSS -k30 454_All_Reads.fas -o Ecl13047.SE.30.fa > Ecl13047.SE.30.out

Initially nothing will seem to be happening, while the program reads in all the data. Then information will be displayed on the screen as the program runs. When the assembly is complete, the command prompt will return.

We will use a perl script to quickly extract some information from the assembly, by typing the following command:

fac.pl Ecl13047.SE.30.fa

The script will report the following information to the screen, which you will want to copy to your notes (copy-and-paste works):
n = the total number of contigs
n:100 = the number of contigs at least 100 bp long
n:N50 = the number of contigs at least as long as the N50 value
min, median, mean, N50, and max = the minimum, median, mean, N50, and maximum contig lengths
sum = the sum of the lengths of the contigs of at least 100 bp
note: except for n, all of the statistics are reported based only on the contigs of at least 100 bp

Q: How many contigs are in the assembly?
Q: What is the mean contig size? N50 contig size?

Part 2 – Re-assemble the data using the paired-end information, still using a k-mer length of 30

Type the following command to start the assembly; it should be a single line, despite any wrapping in this document! Note that the name of the particular ABySS routine is different, as well as the syntax.

abyss-pe k=30 n=5 name=Ecl13047.PE.30 in='454_PE_Reads_1.fas 454_PE_Reads_2.fas' se='454_WGS_Reads.fas' > Ecl13047.PE.30.out

This assembly will write more information to the screen as the program runs, and will take longer to finish. Again, when the assembly is complete the command prompt will return.

Run the perl script on the results of the new assembly:

fac.pl Ecl13047.PE.30-contigs.fa

Q: How many contigs are in the assembly?
Q: What is the mean contig size? N50 contig size?
Q: What was the effect of utilizing paired end information?

Part 3 – Comparing assemblies using Mauve (can start while Part 2 is running)

Launch Mauve and, under the File menu, select Open alignment… Navigate to the Desktop, then to Genetics875_Assembly, and select Newbler-vs-GenBank. Mauve will load the sequences and the alignment, which we will go over in class.

Note that you can also use the perl script to calculate some statistics on the Newbler assembly, back in the Terminal window:

fac.pl Ecl13047_Newbler.fas

Once your ABySS assemblies are done, you will align them to the GenBank entry as well. Under the File menu in Mauve, you can select Align with progressiveMauve… and load the GenBank and ABySS sequences. However, given the larger number of contigs, we will use Mauve’s ability to re-order the contigs instead. Under the Tools menu, select Move Contigs. You will be prompted to select a place to save files; navigate to the Genetics875_Assembly folder again, and then in the File Name: space, append a new folder name, e.g., /alignments, to the path for the output files. When you add the sequences, add the GenBank file first, and then the ABySS contigs file. Mauve will run multiple iterations, attempting to find the best ordering of the contigs relative to the complete genome.

Q: What can you say about each of the assemblies compared to the completed genome? What about to each other?

Part 4 – Try to improve the ABySS assembly by changing the k-mer parameter.

Run ABySS on the complete dataset (paired end information included) using two different k-mer values. You can type in the command line, or edit the following, changing each x to whatever value of k you wish to use, and then copy-and-paste. Only the k-mer value needs to change for ABySS itself, but we also change the assembly name and output file name so that different assemblies don’t overwrite each other’s results.

abyss-pe k=x n=5 name=Ecl13047.PE.x in='454_PE_Reads_1.fas 454_PE_Reads_2.fas' se='454_WGS_Reads.fas' > Ecl13047.PE.x.out

To use the perl script, be sure to specify which file you want to use:

fac.pl Ecl13047.PE.x-contigs.fa

Team up with your neighbors to explore a larger number of k-mer values. Discuss this in advance and predict what you expect to happen:
Q: What criteria will you use to decide which assembly is better?
Q: Will increasing or decreasing the k-mer length improve the assembly? Why?
Q: Will the best ABySS assembly be better or worse than the Newbler assembly? Why?

Run the assemblies.
Q: Which k-mer value was best?
Q: Were your predictions correct?
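
Appendix (optional): how the summary statistics are computed

If you want to see how the numbers reported by fac.pl in Part 1 relate to the contig lengths, the following is a minimal Python sketch. It is not the fac.pl source code. It assumes the conventional definition of N50 (the contig length at which the running sum of lengths, longest contig first, reaches at least half of the total), and fac.pl may differ in edge cases. The function names read_fasta_lengths and contig_stats and the dictionary keys are made up for this illustration; min_len=100 mirrors the 100 bp threshold described in Part 1.

```python
from statistics import mean, median


def read_fasta_lengths(path):
    """Return the length of every sequence in a FASTA file (hypothetical helper)."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def contig_stats(lengths, min_len=100):
    """Summarize contig lengths; everything except "n" uses contigs >= min_len."""
    kept = sorted((x for x in lengths if x >= min_len), reverse=True)
    stats = {"n": len(lengths), "n_min_len": len(kept)}
    if not kept:
        return stats

    # Assumed N50 definition: walk from the longest contig down until the
    # running sum reaches at least half of the total assembled length.
    total = sum(kept)
    running = 0
    n50 = kept[-1]
    for length in kept:
        running += length
        if 2 * running >= total:
            n50 = length
            break

    stats.update({
        "n_N50": sum(1 for x in kept if x >= n50),  # contigs at least as long as N50
        "min": kept[-1],
        "median": median(kept),
        "mean": mean(kept),
        "N50": n50,
        "max": kept[0],
        "sum": total,
    })
    return stats
```

For example, contig_stats(read_fasta_lengths("Ecl13047.SE.30.fa")) would summarize the Part 1 assembly. Comparing its output with what fac.pl prints is a quick way to check whether the assumed N50 definition matches the script.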