NGS Bioinformatics Workshop 2.3
Tutorial – Transcriptome Assembly
May 16th, 2012
IRMACS 10900
Facilitator: Richard Bruskiewich, Adjunct Professor, MBB

Workflow for Today
  Erratum about last week
  Questions from last week
  Galaxy @ WestGrid now available
  Transcriptome assembly
    Map-based assembly: Bowtie, TopHat, Cufflinks
    de novo assembly: Velvet + Oases, Trans-ABySS

Erratum: Running Velvet with two paired end data files
  Run velveth:
      velveth outputdir k_mer -fastq -shortPaired paired_data_file_1 -shortPaired2 paired_data_file_2
  Run velvetg:
      velvetg outputdir -ins_length 200 -exp_cov 20

1st Question from last week…
  What are the limits in read lengths (e.g. Sanger ~1000 - 1500) to NGS assemblers?
  From p. 4 of the ALLPATHS-LG manual:
    Capabilities and limitations
    ALLPATHS-LG is a short-read assembler. It has been designed to use reads produced by new sequencing technology machines such as the Illumina Genome Analyser. The version described here has been optimized for, but not necessarily limited to, reads of length 100 bases. ALLPATHS is not designed to assemble Sanger or 454 FLX reads, or a mix of these with short reads.

1st Question from last week… (continued)
  On p. 5 of the Velvet manual:
    Read lengths are stored on signed 16-bit integers, meaning that if you are assembling contigs longer than 32 kb, then more memory is required to store the coordinates. To do so, simply add the following option to the make command:
      make 'LONGSEQUENCES=1'
    (Note the single quotes and absence of spacing.) This will cost more memory overhead.

2nd Question from last week…
  What are the limits to insert sizes of libraries?
  From p. 10 of the ALLPATHS-LG manual:
    Supported library constructions
    …any input dataset should include at least one fragment library and one jumping library...
    A jumping library has a longer separation, typically in the 3kbp-10kbp range...
    …Additionally, ALLPATHS also supports long jumping libraries. A jumping library is considered to be long if the insert size is larger than 20 kbp.
  The Velvet manual, p. 10, shows a command line switch example of
      -ins_length_long 40000

Now available: Galaxy @ WestGrid
  https://joffre.westgrid.ca/galaxy/

Accessing the WestGrid Galaxy instance
  Use your WestGrid ID (email name without the @ part) to log into Joffre, e.g. if your email is ‘rbruskie@sfu.ca’, your server access ID is ‘rbruskie’, and use your WestGrid password.

Logging into the Galaxy instance
  Once in Galaxy, you need to register (initially) or log in (if already registered) using your username (your full email, e.g. ‘rbruskie@sfu.ca’) and (important!) use your WestGrid password as the Galaxy password.

Transcriptome Assembly - Overview
  As in whole genome assembly, one can have a reference-based (‘map based’) assembly, based on read alignment, or a ‘de novo’ assembly, based on de Bruijn graph construction.
  In some respects, transcriptome assembly can be more challenging, due to splice isoforms, overlapping transcripts, and other issues.
  For a detailed review of the issues and available software, see Martin JA and Wang Z. 2011. Next-generation transcriptome assembly. Nature Reviews Genetics 12:671-682

Assembly by Mapping: Bowtie/TopHat/Cufflinks Suite
  Bowtie2: Ultrafast short read alignment
    http://bowtie-bio.sourceforge.net/bowtie2
  TopHat: a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to large genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
    http://tophat.cbcb.umd.edu
  Cufflinks: Isoform assembly and quantitation for RNA-Seq.
    http://cufflinks.cbcb.umd.edu/
  It is non-trivial to install this software suite…
  Fortunately, the software is installed under Galaxy and some useful tutorials are available (see https://main.g2.bx.psu.edu/u/jeremy/p/galaxy-rna-seq-analysis-exercise)

de novo Assembly: Velvet (last week) + Oases
  Obtain a version of Oases compatible with your Velvet:
    http://www.ebi.ac.uk/~zerbino/oases/
      wget …oases_latest.tgz
      tar -zxvf oases_latest.tgz
      make VELVET_DIR=/path/to/velvet
  Put the oases executable on your $PATH

Velvet + Oases with (BAM) paired end read data
  Running velveth:
      velveth outputdir k_mer -bam -shortPaired read_data.bam
  Running velvetg:
      velvetg outputdir -ins_length 250 -exp_cov auto
  Run oases:
      oases outputdir -scaffolding yes -min_trans_lgth 100 -ins_length2 250 -unused_reads yes

Sort, Filter and Cluster your Transcripts
  Sorting and clustering transcripts: can use the ‘usearch’ tool (http://www.drive5.com/usearch/)
      usearch --sort transcripts.fa --output transcripts.sorted.fa --minlen min# --maxlen max# --log sorted.log
      usearch --cluster transcripts.sorted.fa --id 0.95 --seedsout $@ --uc results.uc --minlen min# --maxlen max# --log clustered.log

trans-ABySS
  Obtain software:
    Download from http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss
      tar -zxvf …/trans-ABySS-v1.3.2.tar.gz
  Need to look under the release web page for the manual link:
    http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss/releases/1.3.2
  Consult the manual for full details about how to set up and run trans-ABySS (non-trivial to set up, many dependencies)
  To execute, first run ABySS (abyss-pe) over a series of k-mer values, then run the pipeline.
  Unfortunately, NOT installed (yet) under Galaxy…
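
Sketch: running abyss-pe over a series of k-mer values
  The "run ABySS over a series of k-mer values" step can be sketched as a small shell loop, one assembly per k, each in its own directory. This is only an illustration under stated assumptions: the sample name, the read file names and the k range (26 to 50 in steps of 4) are hypothetical placeholders, and the directory layout that trans-ABySS itself expects is described in its manual and is not reproduced here. The script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: one abyss-pe assembly per k-mer value, each in directory k<k>.
# NAME, READS and the k range below are hypothetical placeholders.
# DRY_RUN=1 (the default) prints the commands instead of running them.

NAME=${NAME:-sample}
# abyss-pe -C changes into the per-k directory first, so read paths
# are given relative to that directory (hence the ../ prefix).
READS=${READS:-../reads_1.fastq ../reads_2.fastq}
DRY_RUN=${DRY_RUN:-1}

# Print the abyss-pe command line for the k value given as $1.
abyss_cmd() {
    printf "abyss-pe -C k%s k=%s name=%s in='%s'\n" "$1" "$1" "$NAME" "$READS"
}

# Print k values from $1 up to and including $2, in steps of $3.
kmer_series() {
    k=$1
    while [ "$k" -le "$2" ]; do
        echo "$k"
        k=$((k + $3))
    done
}

for k in $(kmer_series 26 50 4); do
    cmd=$(abyss_cmd "$k")
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"
    else
        # Create the per-k directory, then run the assembly inside it.
        mkdir -p "k$k" && eval "$cmd"
    fi
done
```

  To actually launch the assemblies, set DRY_RUN=0 and point READS at your own files. Keeping each k in a separate directory avoids output files from different runs overwriting each other.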