ppt

advertisement
NGS Bioinformatics Workshop
2.3 Tutorial – Transcriptome Assembly
May 16th, 2012
IRMACS 10900
Facilitator: Richard Bruskiewich
Adjunct Professor, MBB
Workflow for Today
Erratum about last week
Questions from last week
Galaxy @ Westgrid now available
Transcriptome assembly
Mapped-based assembly:
Bowtie, TopHat,Cufflinks
de novo assembly:
Velvet + Oases
Trans-ABySS
Erratum: Running Velvet with
two paired end data files
Run velveth:
velveth outputdir k_mer –fastq 
-shortPaired paired_data_file_1 
-shortPaired2 paired_data_file_2
Run velvetg:
velvetg outputdir 
-ins_length 200 -exp_cov 20
1st Question from last week…
What are the limits in read lengths (e.g. Sanger ~1000 - 1500)
to NGS assemblers?
From P.4 of ALLPATHS-LG manual:
Capabilities and limitations
ALLPATHS-LG is a short-read assembler. It has been designed
to use reads produced by new sequencing technology
machines such as the Illumina Genome Analyser. the version
described here has been optimized for, but not necessarily
limited to, reads of length 100 bases.
ALLPATHS is not designed to assemble Sanger or 454 FLX
reads, or a mix of these with short reads.
1st Question from last week…
On p5 of the Velvet manual:
Read lengths are stored on signed 16bit integers,
meaning that if you are assembling contigs longer than
32kb long, then more memory is required to store the
coordinates. To do so, simply add the following option to
the make command:
make 'LONGSEQUENCES=1‘
(Note the single quotes and absence of spacing.)
This will cost more memory overhead.
2nd Question from last week…
What are the limits to insert sizes of libraries?
From P.10 of ALLPATHS-LG manual:
Supported library constructions
…any input dataset should include as least one fragment library and one
jumping library... A jumping library has a longer separation, typically in the
3kbp-10kbp range...
…Additionally, ALLPATHS also supports long jumping libraries. A jumping
library is considered to be long if the insert size is larger than 20 kbp.
In Velvet Manual, P.10
Shows a command line switch example of –ins_length_long=40000
Now available: Galaxy @ WestGrid
https://joffre.westgrid.ca/galaxy/
Accessing the Westgrid Galaxy instance
Use your Westgrid ID (email name without @part)
to log into Joffre, e.g. if your email is
‘rbruskie@sfu.ca’, your server access id is
‘rbruskie’, and use your WestGrid password
Logging into the Galaxy instance
Once into Galaxy, you need to register (initially) or
log in (if already registered) using your username
(your full email, e.g. ‘rbruskie@sfu.ca’) and
(important!) use your WestGrid password as the
Galaxy password
Transcriptome Assembly - Overview
 As in whole genome, one can have a reference
based (‘map based’) assembly, based on read
alignment, and a ‘de novo’ assembly, based on De
Bruijn graph construction.
 In some respects, transcriptome assembly can be
more challenging due to splice isoforms and
overlapping transcripts, and other issues.
 For a detailed review of the issues and available
software, see
Martin JA and Wang Z. 2011.
Next-generation transcriptome assembly.
Nature Reviews Genetics 12:671-682
Assembly by Mapping:
Bowtie/TopHat/Cufflinks Suite
 Bowtie2: Ultrafast short read alignment
http://bowtie-bio.sourceforge.net/bowtie2
 TopHat: is a fast splice junction mapper for RNA-Seq reads.
It aligns RNA-Seq reads to large genomes using the ultra
high-throughput short read aligner Bowtie, and then
analyzes the mapping results to identify splice junctions
between exons.
http://tophat.cbcb.umd.edu
 Cufflinks: Isoform assembly and quantitation for RNA-Seq.
http://cufflinks.cbcb.umd.edu/
 It is non-trivial to install this software suite…
 Fortunately, the software is installed under Galaxy and
some useful tutorials are available (see
https://main.g2.bx.psu.edu/u/jeremy/p/galaxy-rna-seqanalysis-exercise)
de novo Assembly: Velvet (last week) + Oases
Obtain version of oases compatible with velvet
http://www.ebi.ac.uk/~zerbino/oases/
wget …oases_latest.tgz
tar –zxvf oases_latest.tgz
make VELVET_DIR=/path/to/velvet
Put on your $PATH
Velvet + Oases with (BAM) paired end read data
 Running velveth:
velveth outputdir k_mer –bam 
-shortPaired read_data.bam
 Running velvetg:
velvetg outputdir 
-ins_length 250
-exp_cov auto
 Run oases:
oases outputdir -scaffolding yes 
-min_trans_lgth 100 -ins_length2 250
-unused_reads yes
Sort, Filter and Cluster your Transcripts
 Sorting and clustering transcripts. Can use the
‘usearch’ tool (http://www.drive5.com/usearch/)
usearch --sort transcripts.fa 
--output transcripts.sorted.fa 
--minlen min# --maxlen max# --log sorted.log
usearch --cluster transcripts.sorted.fa 
--id 0.95 
--seedsout $@ --uc results.uc 
--minlen min# --maxlen max# 
--log clustered.log
trans-Abyss
 Obtain software:
 Download
http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss
 tar –zxvf …/trans-ABySS-v1.3.2.tar.gz
 Need to look under the release web page for the manual link.
http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss/ 
releases/1.3.2
 Consult this file for full details about how to set up and run
trans-ABySS (non-trivial to set up, many dependencies)
 To execute, first need to run ABySS (abyss-pe) over a series of kmer values, then run the pipeline.
 Unfortunately, NOT installed (yet) under Galaxy…
Download