ESSENTIALS OF NEXT GENERATION SEQUENCING WORKSHOP 2015
UNIVERSITY OF KENTUCKY AGTC

Class 3 Genome Assembly

Most assembly programs are run in a similar manner to one another. We will use the Newbler and velvet assemblers for this exercise. Newbler uses an overlap-layout-consensus strategy and was designed for assembling the longer NGS reads achievable with the Roche 454 sequencing machines. In our experience, Newbler assemblies for bacterial/fungal genomes are typically far superior to those produced by most “short read” assemblers. However, Newbler struggles with large datasets (>4 M paired-end reads), and runs can take days or weeks to complete, if they ever complete at all. This is where the de Bruijn graph-based assemblers come into their own. Velvet is a short-read assembler and easily handles the assembly of large datasets comprising reads ranging from 50 to 300 bp in length.

Velvet 1.2.10

Zerbino DR, Birney E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18:821–829.
https://github.com/dzerbino/velvet/tree/master

The Velvet package contains two main programs that perform separate functions: Velveth creates a table of k-mers (essentially a list of all the different “words” spelled out by k-mers within the sequence reads, and the number of times each “word” occurs); and Velvetg builds and traverses the de Bruijn graph. However, before we start running either program, we must get the sequence reads into a suitable format.

3.1 Prepare the sequence reads

Change to the assembly directory.

List the directory contents. You will see that it contains two sequence files: Bcereus_S1_L001_R1_001.fastq.gz and Bcereus_S1_L001_R2_001.fastq.gz.
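These file names follow a fixed pattern, so their fields can be pulled apart programmatically. The following is a minimal Python sketch, assuming the layout `<sample>_S<n>_L<lane>_R<read>_<set>.fastq[.gz]` seen in the two names above. The function name `parse_illumina_name` and the regular expression are illustrative and are not part of the workshop scripts.

```python
import re

# Assumed layout: <sample>_S<n>_L<lane>_R<read>_<set>.fastq[.gz]
# (pattern and function name are hypothetical, written for this example only)
ILLUMINA_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d+)"
    r"_R(?P<read>[12])_(?P<set>\d+)\.fastq(?P<gz>\.gz)?$"
)


def parse_illumina_name(filename):
    """Split an Illumina-style FASTQ file name into labelled fields."""
    match = ILLUMINA_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognised file name: {filename}")
    fields = match.groupdict()
    return {
        "sample": fields["sample"],
        "sample_number": int(fields["sample_number"]),
        "lane": int(fields["lane"]),
        "direction": "forward" if fields["read"] == "1" else "reverse",
        "set": fields["set"],
        "compressed": fields["gz"] is not None,
    }


for name in ("Bcereus_S1_L001_R1_001.fastq.gz", "Bcereus_S1_L001_R2_001.fastq.gz"):
    print(parse_illumina_name(name))
```

A check like this can be handy before an assembly to confirm that the forward and reverse files really come from the same sample and lane.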
The filenames contain useful information about the sequences: “S1” means that this was the first barcode on the flow cell; “L001” indicates the data came from flow cell lane 1; “R1” means forward read (“R2” means reverse); “001” means this is dataset 1 from this lane of the flow cell; “fastq” indicates the sequence format; and “gz” tells us that the data were compressed with gzip.

Let’s begin by decompressing the data (note the * wild card, which tells gunzip to work on all files whose names begin with the preceding string):

    gunzip Bcereus_S1_L001_R*

Before we start the assembly, let’s examine the data quality. To simplify this step, the data are duplicated in the Class_3_ASSEMBLY directory on your USB thumb drive. If necessary, use Trimmomatic to trim/filter the data to remove poor-quality sequence. If you decide to trim, you can use SCP/WinSCP to transfer the trimmed files from the Mac/PC to your virtual machine.

Only perform the steps that follow, up to the start of section 3.2, if you trimmed the data:

On a Mac: In the Terminal program, type the command:

    scp -P <port_number> Bcereus_S1_L001_R*paired.fastq \
        <username>@csurs11.csr.uky.edu:assembly/

Enter your password when prompted.

On a PC: Open the WinSCP program. For hostname, type csurs11.csr.uky.edu; for port, type your port number; enter your username and password, then press “Login.” In the left pane, browse to your USB drive. In the right pane, browse to your assembly directory. Drag the Bcereus_S1_L001_R1_paired.fastq and Bcereus_S1_L001_R2_paired.fastq files from the left pane to the right pane to transfer them to the remote computer.

Make sure you change the input file names when you run the following script.

3.2 Create an interleaved, paired-end dataset

Velvet requires paired-end reads to be in a single file in which the paired reads are interleaved (i.e.
each forward read is immediately followed by its corresponding reverse read). To accomplish this, we will use a python script named interleave-fastq.py:

Make sure you are still in the top-level assembly directory.

Run the python script:

    python interleave-fastq.py Bcereus_S1_L001_R1_001.fastq \
        Bcereus_S1_L001_R2_001.fastq

Whoa! As you can see, the script writes its output to the screen. Let’s re-run it but, this time, redirect the output to a file named Bcereus_interleaved.fastq.

3.3 Identify a suitable k-mer length

As you may remember, velvet breaks the sequencing reads down into subsequences known as k-mers. The choice of k-mer length is very important and will affect the quality of the assembly. A k-mer that is too short will yield a large number of very short contigs, while a k-mer that is too long will produce an incomplete assembly. To assist us in the choice of a suitable k-mer length, we will use an online tool called Velvet Advisor:

http://dna.med.monash.edu.au/~torsten/velvet_advisor/

Fill in the relevant fields with information on the sequencing reads and estimated genome size (5.5 Mb). Use a target k-mer coverage of 20 (note: assemblies typically improve as k-mer coverage increases to 20 and then reach a plateau beyond this value). Inspect the highlighted “Answer” field to find the recommended k-mer length. Note that this is only a guideline; the best length is determined empirically by running velvet iteratively, using a range of k-mer values around the one suggested by Velvet Advisor.

3.4 Run velvet using a range of k-mer values

To obtain an optimal assembly, we could run velvet multiple times using different k-mer lengths surrounding our suggested value. Fortunately, the nice folks at the Victorian Bioinformatics Consortium have written a script that automates this process.
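To see why a given k-mer range is worth testing, it helps to write down the relationship between k-mer length and k-mer coverage. The Velvet manual relates k-mer coverage Ck to nucleotide coverage C by Ck = C × (L − k + 1) / L, where L is the read length and C = N × L / G for N reads and a genome of G bases. The Python sketch below solves that expression for k. Whether Velvet Advisor applies exactly this calculation and rounding is an assumption, the function names are hypothetical, and the read count and read length are made-up illustration values rather than the workshop dataset.

```python
import math


def kmer_coverage(num_reads, read_length, genome_size, k):
    """Predicted k-mer coverage: Ck = C * (L - k + 1) / L, with C = N * L / G."""
    nucleotide_coverage = num_reads * read_length / genome_size
    return nucleotide_coverage * (read_length - k + 1) / read_length


def suggest_k(num_reads, read_length, genome_size, target_ck=20):
    """Largest odd k whose predicted k-mer coverage still reaches target_ck."""
    # Rearranging Ck = N * (L - k + 1) / G gives k = L + 1 - Ck * G / N.
    k = math.floor(read_length + 1 - target_ck * genome_size / num_reads)
    if k % 2 == 0:  # velvet works with odd k-mer lengths
        k -= 1
    if k < 1:
        raise ValueError("coverage too low to reach the target k-mer coverage")
    return k


# Made-up example: 2 million reads of 250 bp, 5.5 Mb genome, target Ck of 20.
k = suggest_k(2_000_000, 250, 5_500_000, target_ck=20)
print(k, round(kmer_coverage(2_000_000, 250, 5_500_000, k), 2))
```

A value obtained this way is only a starting point; the range actually tested should bracket it on both sides.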
All we have to do is provide the script with a range of k-mer values to test and the step size between each value:

Usage:

    velvetoptimiser [options] -f 'velveth input line'

Run velvetoptimiser using the k-mer range 121 to 201:

    velvetoptimiser -s 121 -e 201 -x 10 -d Bcereus_velvet \
        -f '-shortPaired -fastq Bcereus_interleaved.fastq'

Options:

    -s             The starting (lower) k-mer value (default '19')
    -e             The end (higher) k-mer value (default '201')
    -x             The step in k-mer search, min 2, no odd numbers (default '2')
    -d             The name of the directory to put the final output into
    -f             The file section of the velveth command line (default '0')
    -shortPaired   Type of reads in assembly
    -fastq         Sequence read format

When the run has completed, examine the output directory. There should be seven output files:

    Log         describes the run parameters for each iteration and summarizes the assembly metrics
    Stats.txt   summarizes information about each contig
    Sequences   contains the input reads
    Roadmaps    contains k-mer information required to build the graphs
    PreGraph    graph file, not intended to be read by end user
    Graph       graph file, not intended to be read by end user
    Graph2      graph file, not intended to be read by end user

Interrogate the Log file to determine the following metrics for the BEST assembly:

    Genome size:
    Number of scaffolds (nodes):
    Largest scaffold (node):
    N50:

Newbler 2.9

Roche provides Newbler at no cost to the academic community. The preferred data format is .sff (standard flowgram format, a proprietary Roche format), although it also accepts .fasta (with quality) and, since version 2.6, .fastq files. Note also that, even though we use a command-line version, a GUI-based copy of the software is also available.
http://www.454.com/products/analysis-software/
see also: http://contig.wordpress.com

3.5 Prepare the sequence reads

The Newbler assembler requires the sequence header lines to be in a specific format before it can recognize the existence of paired-end reads. The required format is generated using the Paired_end_headers.sh shell script located in the assembly directory.

Make sure you are still in the assembly directory.

Create new sequence files that contain the appropriate sequence header lines (note: if you trimmed the reads for the Velvet assembly and you decide to use the trimmed data, make sure you change the input file names accordingly):

    Paired_end_headers.sh Bcereus_S1_L001_R1_001.fastq Bcereus_R1_PE.fastq

Repeat for the “R2” reads.

3.6 Start a Newbler genome assembly project

To start our Newbler assembly project, we will first create a project directory using the newAssembly tool:

Usage:

    newAssembly [options] <project_name>

Note: the newAssembly command creates a genome assembly project. For a transcriptome assembly, you need to use the -cdna option (newAssembly -cdna <project_name>). Or, if we wanted to assemble a genome/transcriptome using a reference genome as a guide, we would use the newMapping tool:

Usage:

    newMapping -cdna <project_name>

Let’s tell Newbler that we’d like to start a new assembly project (you can change the name if desired):

    newAssembly My_Genome_Assembly

This created a project directory named My_Genome_Assembly. Change to the newly created directory and list its contents. You will find that it contains two subdirectories, assembly and sff. The assembly (or mapping) directory contains the 454AssemblyProject.xml project configuration file and will house the assembly results. The sff subdirectory contains symbolic links to any .sff files that may be included in the project.

Let’s examine the list of parameters that govern the assembly process.
This can be found in the 454AssemblyProject.xml file (see APPENDIX below for interpretation of the parameter settings).

Note: There is a 454Project.xml file inside the new project directory. It is not the same as the 454AssemblyProject.xml file inside the My_Genome_Assembly/assembly directory.

3.7 Define the input files

Now we need to tell the assembler where to find the input files. To do this, we will use the addRun tool:

Usage:

    addRun [options] <file_path>

Change to the project directory that was just created with the newAssembly command; the following step needs to be completed in this directory.

Use the addRun tool to specify the path to the .fastq files. Remember, these are in the original assembly directory (the one you cd’d to when you started this exercise, not the one created by newAssembly), so you will need to use “..” to specify the path to the parent directory.

    addRun -p ../Bcereus_R1_PE.fastq ../Bcereus_R2_PE.fastq

Options:

    -p   reads are paired ends

3.8 Start the assembly

Finally, you can start the assembly from within the current directory:

    runProject

Or you can cd up a level and run it from the parent assembly directory. In this case, you must specify the path to the project directory:

    runProject My_Genome_Assembly

Assembling the reads should take approximately 20 minutes.

3.9 Read the output files

Examine the output files in the My_Genome_Assembly/assembly directory.
Newbler output files:

    454AssemblyProject.xml    reports assembly parameters
    454NewblerProgress.txt    this is the run log
    454AlignmentInfo.tsv      contains position-by-position summary information about the consensus sequence for the contigs generated by the assembler application, listed one nucleotide per line
    .fna and .qual files      454AllContigs, 454LargeContigs, 454Scaffolds
    454ReadStatus.txt         contains the status identifiers for all the reads used in the assembly computation, plus the 3’ and 5’ positions for each assembled read’s alignment within the contig results. Reads are listed one per line
    454TrimStatus.txt         information on read trimming
    454PairStatus.txt         provides information on pair status for each read
    454NewblerMetrics.txt     reports the key input, algorithmic and output metrics for the data analysis software. Contains input information, operation metrics, consensus results, information about the scaffolds, large contigs and all contigs
    454Contigs.ace            “ace” file for visualization of assembly in a program like Consed
    454ContigGraph.txt        description of the graphical layout of contigs

Interrogate the 454NewblerMetrics.txt file to find out the total genome size (not estimated), how many scaffolds/contigs were produced, the largest scaffold/contig size and the N50 values:

    Genome size:
    Number of scaffolds:
    Largest scaffold:
    Number of contigs:
    Largest contig:
    N50 scaffolds:
    N50 contigs:

3.10 Optional Exercise

If you are already comfortable with the command line, there is a good chance you have already completed your assemblies with plenty of time to spare. In this case, you may want to see if you can improve the velvet assembly by further optimizing the k-mer length. Try running velvetoptimiser again, but this time use a step size of 2 between each k-mer length tested (remember to narrow your start/end range to match the best range identified in the first iteration series).
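When comparing velvet runs with one another, or with the Newbler assembly, N50 is the metric you will quote most often, so it is worth being clear about what it measures. The Python sketch below uses the common definition: sort the contig (or scaffold) lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The assemblers may apply slightly different conventions (for example, minimum contig length cut-offs), so treat this as an illustration rather than a reimplementation of either program. The function name and the lengths are made up.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


contig_lengths = [100, 200, 300, 400, 500]  # made-up lengths in bp
print(n50(contig_lengths))
```

A higher N50 for a similar total assembly size generally indicates a more contiguous assembly, which is why it is a convenient single number for ranking the k-mer values you test.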
APPENDIX

Newbler configuration file

The Newbler configuration file starts with the header, <FourFiveFourProject>, and ends with the footer, </FourFiveFourProject>.

Part 1 contains project information: project directory, type (assembly or mapping), date of creation, software version (2.9).

Part 2 contains configuration specifications. These are various parameters related to the assembly process:

    xml tag                     brief description                                     default   option in runProject
    <minimumReadLength>         all reads shorter are ignored                         50        -minlen #
    <overlapSeedStep>           assembly algorithm parameter                          12        -ss #
    <overlapSeedLength>         assembly algorithm parameter                          16        -sl #
    <overlapMinSeedCount>       assembly algorithm parameter                          1         -sc #
    <overlapMinMatchLength>     min length to be considered a sequence overlap        40        -ml #
    <overlapMinMatchIdentity>   percent of identity                                   90        -mi #
    <overlapMatchIdentScore>    match score                                           2         -ais #
    <overlapMatchDiffScore>     mismatch score (penalty)                              -3        -ads #
    <allContigThresh>           shortest contig length                                100       -a #
    <largeContigThresh>         min length of a large contig                          500       -l #
    <expectedDepth>             expected depth of coverage, 0 means any               0         -e #
    <cDNAMode>                  if transcriptome assembly                             No        -cdna
    <largeGenome>               if large or complicated genome                        No        -large
    <ripMode>                   if read can belong to one contig only                 No        -rip
    <numCPU>                    number of cores used; 0 for all available             1         -cpu #
    <finishMode>                gaps to be filled with repeats and short reads        No        -finish
    <autoTrimming>              trim low quality bases                                Yes       -trim/notrim
    <UseSerialIO>               preserve order of input files                         No        -sio
    <SerialIOMinimizeDisk>      preserve order of input while minimizing disk space   No        -sid
    <SerialIOMemLimit>          preserve order of input, limit memory use             No        -sim #

Part 3 contains information on reference files, vector sequences, primer sequences, adaptor sequences and input files.
Each reference is placed between the <File> </File> tags:

    <File>
        <Path>             full path
        <Type>             fastq or fasta
        <FastqScoreType>   Standard
        <FastqAccnoType>   Illumina/general
        <IsPairedEnd>      true/false
        <Rescore>          true/false
        <removeFlag>       true if a sequence file has been removed from the project.
    </File>
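If you would rather not read the configuration file by eye, the parameters and input file entries can be pulled out with a few lines of Python. This is a minimal sketch using the standard library XML parser. The tag names are taken from the appendix above, but the sample document, its nesting, the file path and the function name `summarize_project` are all assumptions made for illustration; a real 454AssemblyProject.xml contains more elements and may be organised differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: tag names come from the appendix, but the nesting,
# the values and the file path are invented for this example.
SAMPLE = """
<FourFiveFourProject>
  <minimumReadLength>50</minimumReadLength>
  <overlapMinMatchLength>40</overlapMinMatchLength>
  <overlapMinMatchIdentity>90</overlapMinMatchIdentity>
  <File>
    <Path>/home/user/assembly/Bcereus_R1_PE.fastq</Path>
    <Type>fastq</Type>
    <IsPairedEnd>true</IsPairedEnd>
  </File>
</FourFiveFourProject>
"""

NUMERIC_TAGS = ("minimumReadLength", "overlapMinMatchLength", "overlapMinMatchIdentity")


def summarize_project(xml_text):
    """Return (numeric parameters, list of input file records) from project XML."""
    root = ET.fromstring(xml_text.strip())
    params = {}
    for tag in NUMERIC_TAGS:
        node = root.find(f".//{tag}")  # search at any depth below the root
        if node is not None and node.text:
            params[tag] = int(node.text)
    files = [
        {child.tag: (child.text or "").strip() for child in file_node}
        for file_node in root.iter("File")
    ]
    return params, files


params, files = summarize_project(SAMPLE)
print(params)
print(files)
```

To try it on a real project, read the file contents into a string and pass that to `summarize_project` in place of `SAMPLE`.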