Class_3_Assembly_Velvet1st_final

Essentials of Next Generation Sequencing Workshop 2015
University of Kentucky AGTC
Class 3
Genome Assembly
Most assembly programs are run in a similar manner to one another. We will use the Newbler and Velvet
assemblers for this exercise. Newbler uses an overlap-layout-consensus strategy and was designed for
assembling the longer NGS reads achievable with the Roche 454 sequencing machines. In our experience,
Newbler assemblies for bacterial/fungal genomes are typically far superior to those produced by most
“short read” assemblers. However, Newbler struggles with large datasets (>4 M paired-end reads), and runs
can take days or weeks to complete, if they complete at all. This is where the de Bruijn graph-based
assemblers come into their own. Velvet is a short read assembler and easily handles the assembly of large
datasets comprising reads ranging from 50 to 300 bp in length.
Velvet 1.2.10
Zerbino DR, Birney E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs.
Genome Res. 18:821–829.
https://github.com/dzerbino/velvet/tree/master
The Velvet package contains two main programs that perform separate functions: Velveth creates a table
of k-mers (essentially a list of all the different “words” spelled out by k-mers within the sequence reads,
and the number of times each “word” occurs); and Velvetg builds and traverses the de Bruijn graph.
However, before we start running either program, we must get the sequence reads into a suitable format.
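To make the idea of a k-mer table concrete, here is a minimal Python sketch that counts the “words” in two invented toy reads. It is only an illustration of the concept, not how Velveth is implemented; a real assembler also has to account for reverse-complement strands and uses far more compact data structures. The function name and the reads are made up for this example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        # A read of length L contains L - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Two toy reads, invented for illustration only.
toy_reads = ["ATGGCGT", "GGCGTGC"]
kmer_table = count_kmers(toy_reads, k=5)
for kmer, n in sorted(kmer_table.items()):
    print(kmer, n)
```

The 5-mer shared by both reads is counted twice, which is the kind of overlap evidence the graph-building step later relies on.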
3.1 Prepare the sequence reads
• Change to the assembly directory
• List the directory contents. You will see that it contains two sequence files:
Bcereus_S1_L001_R1_001.fastq.gz and Bcereus_S1_L001_R2_001.fastq.gz. The filenames contain
useful information about the sequences: “S1” means that this was the first barcode on the flow
cell; “L001” indicates the data came from flow cell lane 1; “R1” means forward read (“R2” means
reverse); the “001” means this is dataset 1 from this lane of the flow cell; “fastq” indicates the
sequence format; and “gz” tells us that the data were compressed with gzip.
• Let’s begin by decompressing the data (note the * wildcard, which makes the command act on all files
whose names begin with the preceding string):

gunzip Bcereus_S1_L001_R*
• Before we start the assembly, let’s examine the data quality. To simplify this step, the data are
duplicated in the Class_3_ASSEMBLY directory on your USB thumb drive.
If necessary, use Trimmomatic to trim/filter the data to remove poor quality sequence. If you decide to
trim, you can use SCP/WinSCP to transfer the trimmed files from the Mac/PC to your virtual machine.
• Only perform the remaining steps in this section (3.1) if you trimmed the data:
On a Mac:
• In the Terminal program, type the command (note that scp takes the port number with an uppercase -P):

scp -P <port_number> Bcereus_S1_L001_R*paired.fastq \
<username>@csurs11.csr.uky.edu:assembly/
• Enter your password when prompted.
On a PC:
• Open the WinSCP program.
• For hostname, type csurs11.csr.uky.edu; for port, type your port number; enter your username and
password, then press “Login.”
• In the left pane, browse to your USB drive.
• In the right pane, browse to your assembly directory.
• Drag the Bcereus_S1_L001_R1_paired.fastq and Bcereus_S1_L001_R2_paired.fastq files from the left
pane to the right pane to transfer them to the remote computer.
• Make sure you change the input file names when you run the following script.
3.2 Create an interleaved, paired-end dataset
Velvet requires paired-end reads to be in a single file in which the paired reads are interleaved (i.e. each
forward read is immediately followed by its corresponding reverse read). To accomplish this, we will use a
python script named interleave-fastq.py:
• Make sure you are still in the top-level assembly directory
• Run the python script:

python interleave-fastq.py Bcereus_S1_L001_R1_001.fastq \
Bcereus_S1_L001_R2_001.fastq
• Woah! As you can see, the script writes the output to the screen. Let’s re-run it but, this time,
redirect the output to a file named Bcereus_interleaved.fastq.
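For readers curious about what interleaving actually does to the records, the following is a minimal Python sketch of the idea. It is not the workshop's interleave-fastq.py script (whose contents are not shown here); the function names and the tiny in-memory records are invented, and it assumes standard four-line FASTQ records with reads in the same order in both inputs.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield one FASTQ record at a time as a list of its four lines."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(forward, reverse, out):
    """Write records alternately: forward 1, reverse 1, forward 2, reverse 2, ..."""
    pairs = 0
    # zip stops at the shorter input, so unequal files would be silently truncated.
    for fwd, rev in zip(read_fastq(forward), read_fastq(reverse)):
        out.writelines(fwd)
        out.writelines(rev)
        pairs += 1
    return pairs


# Tiny in-memory stand-ins for the R1 and R2 files (invented records).
r1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGA\n+\nIIII\n")
r2 = io.StringIO("@read1/2\nGGCA\n+\nIIII\n@read2/2\nCCAT\n+\nIIII\n")
merged = io.StringIO()
n_pairs = interleave(r1, r2, merged)
print(merged.getvalue(), end="")
```

With real data, the same function would be given file handles from open() instead of the in-memory strings.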
3.3 Identify a suitable k-mer length
As you may remember, velvet breaks the sequencing reads down into subsequences known as k-mers.
The choice of k-mer length is very important and will affect the quality of the assembly. A k-mer that is
too short will yield a large number of very short contigs, while a lengthy k-mer will produce an incomplete
assembly. To assist us in the choice of a suitable k-mer length, we will use an online tool called Velvet
Advisor:
http://dna.med.monash.edu.au/~torsten/velvet_advisor/
Fill in the relevant fields with information on the sequencing reads and estimated genome size (5.5 Mb).
Use a target k-mer coverage of 20 (note: assemblies typically improve as k-mer coverage increases to 20
and then reach a plateau beyond this value). Inspect the highlighted “Answer” field to find the
recommended k-mer length. Note that this is only a guideline; the best length is determined empirically by
running velvet iteratively, using a range of k-mer values around the one suggested by Velvet Advisor.
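The relationship between k-mer length and k-mer coverage can also be worked out by hand. The Velvet manual gives k-mer coverage as Ck = C * (L - k + 1) / L, where C is the nucleotide coverage and L is the read length. The Python sketch below applies that formula; the read count and read length are placeholders chosen purely for illustration (they are not taken from the workshop dataset), so substitute your own values. Only the 5.5 Mb genome size comes from the text.

```python
def kmer_coverage(num_reads, read_length, genome_size, k):
    """k-mer coverage Ck = C * (L - k + 1) / L, with nucleotide coverage C = N * L / G."""
    nucleotide_coverage = num_reads * read_length / genome_size
    return nucleotide_coverage * (read_length - k + 1) / read_length


GENOME_SIZE = 5_500_000   # estimated genome size from the text (5.5 Mb)
NUM_READS = 1_000_000     # placeholder: substitute the real read count
READ_LENGTH = 250         # placeholder: substitute the real read length

for k in range(121, 202, 10):  # odd k values from 121 to 201 in steps of 10
    ck = kmer_coverage(NUM_READS, READ_LENGTH, GENOME_SIZE, k)
    print(f"k={k:3d}  k-mer coverage={ck:6.1f}")
```

Because each read contributes fewer k-mers as k grows, k-mer coverage falls steadily with increasing k, which is why a target coverage constrains the usable k-mer length.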
3.4 Run velvet using a range of k-mer values
To obtain an optimal assembly, we could run velvet multiple times using different k-mer lengths
surrounding our suggested value. Fortunately, the nice folks at the Victorian Bioinformatics Institute have
written a script that automates this process. All we have to do is provide it a range of k-mer values to test
and the step size between each value:
Usage: velvetoptimiser [options] -f 'velveth input line'
• Run velvetoptimiser using the k-mer range 121 to 201:

velvetoptimiser -s 121 -e 201 -x 10 -d Bcereus_velvet \
-f '-shortPaired -fastq Bcereus_interleaved.fastq'
Options:
-s            The starting (lower) k-mer value (default '19')
-e            The end (higher) k-mer value (default '201')
-x            The step in k-mer search, min 2, no odd numbers (default '2')
-d            The name of the directory to put the final output into
-f            The file section of the velveth command line (default '0').
-shortPaired  Type of reads in assembly
-fastq        Sequence read format
• When the run has completed, examine the output directory. There should be seven output files:
Log        describes the run parameters for each iteration and summarizes the assembly metrics
Stats.txt  summarizes information about each contig
Sequences  contains the input reads
Roadmaps   contains k-mer information required to build the graphs
PreGraph   graph file, not intended to be read by end user
Graph      graph file, not intended to be read by end user
Graph2     graph file, not intended to be read by end user
• Interrogate the Log file to determine the following metrics for the BEST assembly:
Genome size:
Number of scaffolds (nodes):
Largest scaffold (node):
N50:
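As a reminder of what the N50 value means, here is a minimal Python sketch using the common definition: sort the contigs (or scaffolds) from longest to shortest and report the length of the one at which the running total first reaches half of the total assembly size. The function name and the toy lengths are invented for illustration; this is not how the assemblers compute the figure internally.

```python
def n50(lengths):
    """Return the length at which the running total, counted from the longest
    sequence down, first reaches half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths supplied")


toy_lengths = [80, 70, 50, 40, 30, 20, 10]  # invented contig lengths
print(n50(toy_lengths))
```

A higher N50 for the same total size indicates that more of the assembly sits in long sequences, which is why it is a convenient single number for comparing runs.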
Newbler 2.9
Roche provides Newbler at no cost to the academic community. The preferred data format is .sff
(standard flowgram format, a proprietary Roche format), although it also accepts .fasta files (with quality)
and, since version 2.6, .fastq files. Note also that, even though we use a command-line version, a
GUI-based copy of the software is also available.
http://www.454.com/products/analysis-software/
see also: http://contig.wordpress.com
3.5 Prepare the sequence reads
The Newbler assembler requires the sequence header lines to be in a specific format before it can
recognize the existence of paired-end reads. The required format is generated using the
Paired_end_headers.sh shell script located in the assembly directory.
• Make sure you are still in the assembly directory
• Create new sequence files that contain the appropriate sequence header lines (Note: if you
trimmed the reads for the Velvet assembly and you decide to use the trimmed data, make sure
you change the input file names accordingly):

Paired_end_headers.sh Bcereus_S1_L001_R1_001.fastq Bcereus_R1_PE.fastq
• Repeat for the “R2” reads.
3.6 Start a Newbler genome assembly project
• To start our Newbler assembly project, we will first create a project directory using the
newAssembly tool:
Usage: newAssembly [options] <project_name>
Note: the newAssembly command creates a genome assembly project. For a transcriptome assembly you
need to use the -cdna option (newAssembly -cdna <project_name>). Or, if we want to assemble a
genome/transcriptome using a reference genome as a guide, we would use the newMapping tool:
Usage: newMapping -cdna <project_name>
• Let’s tell Newbler that we’d like to start a new assembly project (you can change the name if
desired):

newAssembly My_Genome_Assembly
• This created an assembly directory named My_Genome_Assembly.
• Change to the newly created directory and list its contents. You will find that it contains two
subdirectories, assembly and sff. The assembly (or mapping) directory contains the
454AssemblyProject.xml project configuration file and will house the assembly results. The sff
subdirectory contains symbolic links to any .sff files that may be included in the project.
• Let’s examine the list of parameters that govern the assembly process. This can be found in the
454AssemblyProject.xml file (see APPENDIX below for interpretation of the parameter
settings). Note: There is a 454Project.xml file inside the new project directory. It is not the
same as the 454AssemblyProject.xml file inside the My_Genome_Assembly/assembly
directory.
3.7 Define the input files
Now we need to tell the assembler where to find the input files. To do this, we will use the addRun tool:
Usage: addRun [options] <file_path>
• Change to the project directory that was just created with the newAssembly command; the following
step needs to be completed in this directory.
• Use the addRun tool to specify the path to the .fastq files. Remember, these are in the original
assembly directory (the one you cd’d to when you started this exercise, not the one created by
newAssembly), so you will need to use “..” to specify the path to the parent directory.

addRun -p ../Bcereus_R1_PE.fastq ../Bcereus_R2_PE.fastq
Options:
-p   reads are paired ends
3.8 Start the assembly
• Finally, you can start the assembly from within the current directory:

runProject
Or you can cd up a level and run it from the parent assembly directory. In this case you must specify
the path to the run directory:

runProject My_Genome_Assembly
Assembling the reads should take approximately 20 minutes.
3.9 Read the output files
• Examine the output files in the My_Genome_Assembly/assembly directory.
Newbler output files:
454AssemblyProject.xml   reports assembly parameters
454NewblerProgress.txt   this is the run log
454AlignmentInfo.tsv     contains position-by-position summary information about the
                         consensus sequence for the contigs generated by the assembler
                         application, listed one nucleotide per line
.fna and .qual files:    454AllContigs, 454LargeContigs, 454Scaffolds
454ReadStatus.txt        contains the status identifiers for all the reads used in the assembly
                         computation, plus the 3’ and 5’ positions for each assembled read’s
                         alignment within the contig results. Reads are listed one per line
454TrimStatus.txt        information on read trimming
454PairStatus.txt        provides information on pair status for each read
454NewblerMetrics.txt    reports the key input, algorithmic and output metrics for the data
                         analysis software. Contains input information, operation metrics,
                         consensus results, information about the scaffolds, large contigs and
                         all contigs
454Contigs.ace           “ace” file for visualization of assembly in a program like Consed
454ContigGraph.txt       description of the graphical layout of contigs
• Interrogate the 454NewblerMetrics.txt file to find out the total genome size (not estimated), how
many scaffolds/contigs were produced, the largest scaffold/contig size and the N50 values:
Genome size:
Number of scaffolds:
Largest scaffold:
Number of contigs:
Largest contig:
N50 scaffolds:
N50 contigs:
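If you would like to cross-check some of these figures yourself, the contig count, total size and largest contig can be tallied directly from a FASTA file such as 454AllContigs.fna. The Python sketch below shows one way to do that, under the assumption that the file is ordinary FASTA; the function name, the contig names and the toy sequences are invented, and the numbers it produces need not match the metrics file exactly if Newbler applies its own thresholds.

```python
import io


def contig_lengths(handle):
    """Return the length of every sequence in a FASTA-formatted file handle."""
    lengths = []
    current = None  # length of the sequence being read; None before the first header
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


# Invented two-contig FASTA standing in for a real contig file.
toy_fasta = io.StringIO(">contig00001\nACGTACGTAC\nGGTT\n>contig00002\nTTGACA\n")
lengths = contig_lengths(toy_fasta)
summary = {"contigs": len(lengths), "total_bases": sum(lengths), "largest": max(lengths)}
print(summary)
```

For a real file, replace the in-memory string with an open() call on the contig file.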
3.10 Optional Exercise
If you are already comfortable with the command line, there is a good chance you have already completed
your assemblies with plenty of time to spare. In this case, you may want to see if you can improve the
velvet assembly by further optimizing the k-mer length. Try running velvetoptimiser again but this time
use a step size of 2 between each k-mer length tested (remember to narrow your start/end range to match
the best range identified in the first iteration series).
APPENDIX
Newbler configuration file
The Newbler configuration file starts with the header, <FourFiveFourProject> and ends with the footer,
</FourFiveFourProject>.
Part 1 contains project information – project directory, type (assembly or mapping), date of creation,
software version (2.9)
Part 2 contains configuration specifications – these are various parameters related to the assembly process:
xml tag                    brief description                                    default  option in runProject
<minimumReadLength>        all reads shorter are ignored                        50       -minlen #
<overlapSeedStep>          assembly algorithm parameter                         12       -ss #
<overlapSeedLength>        assembly algorithm parameter                         16       -sl #
<overlapMinSeedCount>      assembly algorithm parameter                         1        -sc #
<overlapMinMatchLength>    min length to be considered a sequence overlap       40       -ml #
<overlapMinMatchIdentity>  percent of identity                                  90       -mi #
<overlapMatchIdentScore>   match score                                          2        -ais #
<overlapMatchDiffScore>    mismatch score (penalty)                             -3       -ads #
<allContigThresh>          shortest contig length                               100      -a #
<largeContigThresh>        min length of a large contig                         500      -l #
<expectedDepth>            expected depth of coverage, 0 means any              0        -e #
<cDNAMode>                 if transcriptome assembly                            No       -cdna
<largeGenome>              if large or complicated genome                       No       -large
<ripMode>                  if read can belong to one contig only                No       -rip
<numCPU>                   number of cores used; 0 for all available            1        -cpu #
<finishMode>               gaps to be filled with repeats and short reads       No       -finish
<autoTrimming>             trim low quality bases                               Yes      -trim/notrim
<UseSerialIO>              preserve order of input files                        No       -sio
<SerialIOMinimizeDisk>     preserve order of input while minimizing disk space  No       -sid
<SerialIOMemLimit>         preserve order of input limit memory use             No       -sim #
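Rather than reading the configuration file by eye, the settings can be pulled out with a few lines of Python. The sketch below is only an illustration: the tag names come from the table above, but the XML fragment itself is made up, and the <Config> wrapper in particular is an assumption rather than something copied from a real 454AssemblyProject.xml. Because the lookup walks every element regardless of depth, it does not depend on that assumed nesting.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: tag names come from the parameter table, but the nesting
# under <Config> is an assumption, not copied from a real project file.
toy_xml = """
<FourFiveFourProject>
  <Config>
    <minimumReadLength>50</minimumReadLength>
    <overlapMinMatchLength>40</overlapMinMatchLength>
    <overlapMinMatchIdentity>90</overlapMinMatchIdentity>
    <numCPU>1</numCPU>
  </Config>
</FourFiveFourProject>
"""

wanted = {"minimumReadLength", "overlapMinMatchLength", "overlapMinMatchIdentity", "numCPU"}
root = ET.fromstring(toy_xml.strip())
# iter() visits every element at any depth, so the lookup does not rely on the nesting.
settings = {el.tag: el.text.strip() for el in root.iter() if el.tag in wanted}
for tag in sorted(settings):
    print(tag, "=", settings[tag])
```

For a real project, ET.parse() on the path to the configuration file would replace the in-memory string.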
Part 3 contains information on reference files, vector sequences, primer sequences, adaptor sequences
and input files. Each reference is placed between the <File> </File> tags:
<File>
<Path>             full path
<Type>             fastq or fasta
<FastqScoreType>   Standard
<FastqAccnoType>   Illumina/general
<IsPairedEnd>      true/false
<Rescore>          true/false
<removeFlag>       true if a sequence file has been removed from the project.
</File>