A Quick Guide to Velvet and Oases

VELVET is a simple yet efficient de novo genome assembler based on
de Bruijn graphs that can use both short and long reads. This package,
widely used in Next Generation Sequencing applications, can
even tackle transcriptome assemblies in combination with the Oases
software. Because of the nature of its de Bruijn graph algorithm, probably the
most important parameter in Velvet is the Kmer length, which directly
influences the specificity and sensitivity of the assembly. This quick guide
assumes knowledge of UNIX operating systems and is not a substitute
for the complete Velvet (v1.2.10) manual. Command line features are
underlined.
INSTALLATION
Download https://github.com/dzerbino/velvet/archive/master.zip,
unzip the velvet-master.zip archive and compile with make,
followed by make install for a basic installation. To benefit from
multithreading and longer Kmer lengths (default max. Kmer = 31),
compile using 'OPENMP=1' and 'MAXKMERLENGTH=X', where X
has to be an odd integer. The number of CPUs used by Velvet is then
the value of the OMP_NUM_THREADS environment variable incremented
by one (OMP_NUM_THREADS + 1).
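The compile-time options above can be combined on a single make command line. The following is a minimal sketch rather than an official build script: MAXK=63 is an arbitrary illustrative value, velvet_build_cmd is a helper name introduced here, and the command is only printed (a dry run) so the snippet can be tried outside a Velvet source tree.

```shell
#!/bin/sh
# Dry-run sketch: print the Velvet build command instead of running it.
# MAXK is an illustrative choice, not a value prescribed by this guide.
MAXK=63

velvet_build_cmd() {
    # $1 = maximum K-mer length to compile in (the guide asks for an odd integer)
    if [ $(( $1 % 2 )) -eq 0 ]; then
        echo "MAXKMERLENGTH should be odd, got $1" >&2
        return 1
    fi
    echo "make 'OPENMP=1' 'MAXKMERLENGTH=$1'"
}

velvet_build_cmd "$MAXK"
```

Run from the unzipped source directory, the printed command would build Velvet with multithreading and Kmer lengths up to 63. At run time, a setting such as export OMP_NUM_THREADS=7 would then correspond to 8 threads under the OMP_NUM_THREADS + 1 rule.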
GENERAL WORKFLOW
velveth first builds hash tables of all the Kmer
sub-sequences found in the sequencing read dataset. This is where the read
type and Kmer length(s) are defined.
velvetg then builds de Bruijn graphs from the velveth output, removes
errors and resolves repeats to eventually yield contig sequences.
oases may finally be used to process contigs into loci and their
associated transcripts when dealing with RNA-Seq experiments.
VELVETH PARAMETERS
Kmer length:
- Unique Kmer length:
velveth <out_dir> K <input_files> [opts]
- Range k ≤ Kmers < K, with a step of s:
velveth <out_dir> k,K,s <input_files> [opts]
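To make the range notation concrete, the sketch below lists the Kmer lengths covered by a k,K,s specification, with the upper bound K excluded. It assumes that velveth names its output directories <out_dir>_<k> (for example dir_23), which should be checked against your installation; list_kmers is a helper name introduced here for illustration only.

```shell
#!/bin/sh
# list_kmers K_MIN K_MAX STEP
# Prints every Kmer length k with K_MIN <= k < K_MAX, increasing by STEP.
list_kmers() {
    k=$1
    while [ "$k" -lt "$2" ]; do
        echo "$k"
        k=$(( k + $3 ))
    done
}

# Dry run for "velveth dir 23,55,8 ...": show the velvetg call expected per directory.
for k in $(list_kmers 23 55 8); do
    echo "velvetg dir_$k -read_trkg yes"
done
```

For the range 23,55,8 this prints one line each for Kmer lengths 23, 31, 39 and 47; 55 itself is not included because the upper bound is exclusive.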
Input format:
-fasta or -fasta.gz  Classical FASTA format or its compressed version.
-fastq or -fastq.gz  FASTQ format (sequences with embedded base-calling
                     qualities) or its compressed version.
-sam or -bam         Sequence Alignment/Map format or its binary version.
Read categories:
Depending on the sequencing platform, the read category may be -short
(default), -shortPaired, -long or -longPaired. Append the suffix
2, and so forth, to distinguish libraries of the same read category.
For example: -short lib1reads.fq -short2 lib2reads.fq
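The suffix rule can be illustrated with a small helper that assembles the category flags for several single-end libraries. This is only a sketch: short_lib_args is a name introduced here, the file names are placeholders, and the velveth command is printed rather than executed.

```shell
#!/bin/sh
# short_lib_args FILE...
# Emits "-short f1 -short2 f2 -short3 f3 ..." so each library gets its own category.
short_lib_args() {
    n=1
    args=""
    for f in "$@"; do
        if [ "$n" -eq 1 ]; then
            flag="-short"        # first library: no suffix
        else
            flag="-short$n"      # later libraries: numeric suffix
        fi
        args="$args $flag $f"
        n=$(( n + 1 ))
    done
    echo "${args# }"             # drop the leading space
}

# Dry run: show the resulting velveth call for two libraries.
echo "velveth dir 31 -fastq $(short_lib_args lib1reads.fq lib2reads.fq)"
```

The same pattern would apply to the other categories by swapping the -short prefix for -shortPaired, -long or -longPaired.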
Velveth options:
-strand_specific   For strand-specific sequencing reads.
-noHash            Only prepare sequences for hashing.
-reuse_Sequences   Use the sequences preprocessed by -noHash.
-create_binary     Binary output of velveth to speed up velvetg.
-interleaved       Paired-end reads are interleaved in one file (default).
-separate          Paired-end reads are in two different files.
VELVETG PARAMETERS (default in bold)
Velvetg must be run separately on each velveth directory having a different
K-mer length. A simple bash loop, for i in <dir>*; do velvetg
$i <options>; done, is a way to accomplish this sequentially.
-cov_cutoff        Exclude low coverage nodes (float | auto).
-max_coverage      Exclude too highly covered contigs from
                   the assembly (float).
-exp_cov           Expected coverage, for standard genomic
                   data only (integer | auto). Providing an expected
                   coverage value greatly improves the assembly, so
                   it is highly advised to do so.
-min_contig_lgth   Minimum contig length (int | Kmer-len.*2).
-ins_length        Paired-end total insert size in base pairs, i.e.
                   insert and both reads included (integer).
-ins_length_sd     Insert size standard deviation (integer).
-scaffolding       Scaffolding with N's of contigs that do not
                   overlap (yes | no).
-read_trkg         Produces a more detailed assembly (yes | no),
                   usually required.
-unused_reads      Outputs unused reads in the assembly
                   (yes | no).
-conserveLong      Keep contigs with long reads in them
                   (yes | no).
Note that Velvet expects all coverage values to be given in K-mer
coverage. The relation between common nucleotide coverage (C) and Kmer coverage (Ck) is:
Ck = C * ( L - k + 1 ) / L, where k is the hash length and L the read length.
TYPICAL COMMANDS
Short reads, single-end, Kmer=31:
velveth dir 31 -fastq -short reads.fq
velvetg dir/ -read_trkg yes
Short paired-end reads, Kmer=53:
velveth dir 53 -fastq -separate -shortPaired F.fq R.fq
velvetg dir/ -read_trkg yes -ins_length 100 -exp_cov 10
Long single-end & short paired-end, multiple Kmers:
velveth dir 23,55,8 -fastq -shortPaired -separate \
  shortR1.fq shortR2.fq -long 454reads.fq
for i in dir*; do velvetg $i -read_trkg yes \
  -ins_length 200; done
Assembly of assemblies, short reads with multiple Kmers:
velveth dir 37,63,4 -fastq -short reads.fastq
for i in dir*; do velvetg $i -read_trkg yes; done
velveth MergeAssembly 27 -long dir*/contigs.fa
velvetg MergeAssembly/ -read_trkg yes -conserveLong yes
NB: A Kmer length of 27 works well for most organisms when
assembling several assemblies. Also note that contigs from the initial
assemblies are provided as -long for the merging assembly.
OASES software is the continuation of the Velvet assembler to process
de novo transcriptome assemblies (RNA-Seq). A Velvet installation is
therefore required, as well as a final output directory from a Velvet run.
This Quick Guide is intended for mainstream applications of Oases
v0.2.08 along with most of its parameters, and is not a substitute for the
complete manual.
INSTALLATION
Get the code either by downloading the oases_x.x.x.tgz archive or via
git (git clone git://github.com/dzerbino/oases.git). Prior to make install,
the make compilation step must be run with the same parameters as for
the Velvet compilation (cf. Velvet Installation).
OASES PARAMETERS
-ins_length        Specify paired-end insert size in base pairs.
-cov_cutoff        Minimum coverage for a transcript (int | 3).
-min_pair_count    Minimum number of bridging reads to
                   confirm the distance between two long contigs
                   (int | 4).
-min_trans_lgth    Minimum length for a transcript to be
                   output (int | 100).
-merge             To merge and process a Velvet assembly-of-assemblies.
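Coverage-related settings are easier to choose once nucleotide coverage has been converted to K-mer coverage. As a worked illustration of the relation quoted in this guide, Ck = C * (L - k + 1) / L, the sketch below does the arithmetic with awk. The function name kmer_cov and the example values (30x coverage, 100 bp reads, k = 31) are assumptions made for the example, not recommendations.

```shell
#!/bin/sh
# kmer_cov C L K
# C = nucleotide coverage, L = read length (bp), K = hash (Kmer) length.
# Prints the K-mer coverage Ck = C * (L - K + 1) / L with two decimals.
kmer_cov() {
    awk -v c="$1" -v l="$2" -v k="$3" \
        'BEGIN { printf "%.2f\n", c * (l - k + 1) / l }'
}

# 30x nucleotide coverage, 100 bp reads, k = 31:
# Ck = 30 * (100 - 31 + 1) / 100 = 30 * 70 / 100 = 21.00
kmer_cov 30 100 31
```

The example shows that a longer Kmer lowers the K-mer coverage for the same dataset: with k = 51 the same reads give 30 * 50 / 100 = 15.00.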
TYPICAL COMMANDS
oases dir/
Here dir/ is the output folder of a velveth run followed by a velvetg
de novo assembly. Oases creates a
transcripts.fa file containing all the loci and their related
transcripts. According to the official manual, it is advised to run Oases
on an array of single-K assemblies (cf. Velvet with k ≤ Kmers < K)
that have then been merged.
Oases example with short reads:
1 - Array of single-Kmer lengths:
velveth dir 23,35,2 -fastq -short reads.fastq
for i in dir*; do velvetg $i -read_trkg yes; done
2 - Merging of the single-K assemblies:
velveth Merged 27 -long dir*/contigs.fa
velvetg Merged/ -read_trkg yes -conserveLong yes
3 - Assembly of transcripts with Oases:
oases Merged/ -merge -min_trans_lgth 200
OASES PYTHON PIPELINE
For convenience, Oases comes with a Python script designed to
accomplish all tasks from the single-K assemblies to the final merged
transcriptome assembly.
OASES PYTHON OPTIONS
-d 'FILES'     Velveth file descriptors (string).
-p 'OPTIONS'   Oases options passed to the command
               line (string).
-m KMIN        Minimum K-mer length (odd int).
-M KMAX        Maximum K-mer length (odd int).
-s KSTEP       Steps in K-mer length (even int).
-g KMERGE      K-mer length for the merging of
               assemblies.
-o NAME        Output directory prefix (string).
-r             Only do the merging.
-c             Clean temporary files after all steps.
Example with Paired-End reads:
python oases_pipeline.py \
  -m 21 -M 35 -s 2 -o PE_assembly \
  -d '-fastq -shortPaired -separate reads1.fq reads2.fq' \
  -p '-ins_length 200 -min_trans_lgth 100'
COPYRIGHTS
VELVET: was developed and is maintained by Daniel R. Zerbino
(zerbino@ebi.ac.uk) and Ewan Birney (birney@ebi.ac.uk). Please visit the
www.ebi.ac.uk/~zerbino/velvet website for further details on Velvet.
Mailing list at: http://listserver.ebi.ac.uk/mailman/listinfo/velvet-users.
OASES: was developed and is maintained by Daniel R. Zerbino
(zerbino@ebi.ac.uk) and Marcel Schulz (marcel.schulz@mogen.mpg.de).
Please visit the www.ebi.ac.uk/~zerbino/oases website for further details on
Oases. Mailing list at: http://listserver.ebi.ac.uk/mailman/listinfo/oases-users.
CITATIONS:
- D. R. Zerbino and E. Birney. 2008. Velvet: algorithms for de novo short
read assembly using de Bruijn graphs. Genome Research, 18:821-829
- M. H. Schulz, D. R. Zerbino, M. Vingron and E. Birney. 2012. Oases: Robust
de novo RNA-seq assembly across the dynamic range of expression levels.
Bioinformatics. DOI:10.1093/bioinformatics/bts094
EMBnet - European Molecular Biology Network - is a worldwide
bioinformatics support network. Most countries have a national node
which can provide training courses and other forms of help for users of
bioinformatics software.
You can find information about your national node from the EMBnet site:
http://www.embnet.org/
THIS DOCUMENT was written and designed by Axel Thieffry and is
being distributed by the P&PR Publications Committee of EMBnet.
A Quick Guide To Velvet v1.2.10
First edition © 2014 Axel Thieffry
LICENSE: CC-BY-NC 3.0 http://creativecommons.org/licenses/by-nc/
THANKS to Jose Valerde for the help, Daniel Zerbino for review and
validation, and the EMBnet community for helpful advice and review.