Transcriptome analysis

BIT 815: Analysis of Deep Sequencing Data
Transcriptome analysis
• With a reference
Challenging due to size and complexity of datasets
Many tools available, driven by biomedical research
Galaxy and R/Bioconductor offer many options
Start by mapping reads to the reference genome with a mapping/alignment tool that can deal with exon-intron junctions
Reconstruct transcripts from mapped reads, dealing with alternative splicing products
Calculate relative abundance of different transcripts
Estimate biological significance based on annotation
Example tools: Bowtie/TopHat, Cufflinks, Myrna
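The relative-abundance step can be illustrated with a small calculation. This is a minimal sketch, not the method of any tool named above: it converts per-transcript read counts to transcripts per million (TPM), one common length-normalized measure (Cufflinks itself reports FPKM). The transcript names, counts and lengths are made-up values for illustration.

```python
def tpm(counts, lengths):
    """Convert read counts to transcripts per million (TPM).

    counts  : dict of transcript name -> number of mapped reads
    lengths : dict of transcript name -> transcript length in bases
    """
    # Reads per kilobase: longer transcripts yield more reads at equal abundance
    rates = {t: counts[t] / (lengths[t] / 1000.0) for t in counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    # Rescale so the values sum to one million
    return {t: rate / total * 1e6 for t, rate in rates.items()}


# Hypothetical example data
counts = {"tx_A": 100, "tx_B": 100, "tx_C": 300}
lengths = {"tx_A": 1000, "tx_B": 2000, "tx_C": 3000}
abund = tpm(counts, lengths)
for name, value in abund.items():
    print(name, round(value))
```

Note that tx_C has three times the reads of tx_A but the same TPM, because it is three times as long.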
Workflow summary from a review, “From RNA-seq reads to differential expression results”, by Oshlack et al, Genome Biol 11:220, 2010.
Note emphasis on statistical analysis methods; an equal emphasis should be placed on experimental design.
The ‘Tuxedo’ suite of programs: Bowtie, TopHat, Cufflinks and CummeRbund
See Trapnell et al, Nature Protocols 7:562-578, 2012.
• TopHat maps reads
• Cufflinks assembles transcripts
• Cuffmerge merges transcript data detected in different samples
• Cuffdiff evaluates differential expression
• CummeRbund provides visualization tools
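The order of the steps above can be sketched as a series of command lines. The following is a rough illustration, assuming single-end reads, a prebuilt Bowtie index called genome, an annotation file genes.gtf and a reference genome.fa; all file names, sample names and the function tuxedo_commands are hypothetical. The script only builds and prints the commands; nothing is run. Check the options against the manuals for the installed versions before use.

```python
def tuxedo_commands(samples, index="genome", annotation="genes.gtf",
                    fasta="genome.fa", threads=8):
    """Build Tuxedo-style command lines as argument lists (nothing is run).

    samples maps a condition label to a list of (replicate_name, fastq_path).
    """
    cmds = []
    bams = {}  # condition -> list of alignment files
    for condition, replicates in samples.items():
        for name, fastq in replicates:
            th_out = f"{name}_thout"
            cl_out = f"{name}_clout"
            bam = f"{th_out}/accepted_hits.bam"
            # TopHat maps reads, guided by known annotation (-G)
            cmds.append(["tophat", "-p", str(threads), "-G", annotation,
                         "-o", th_out, index, fastq])
            # Cufflinks assembles transcripts for each sample
            cmds.append(["cufflinks", "-p", str(threads), "-o", cl_out, bam])
            bams.setdefault(condition, []).append(bam)
    # Cuffmerge combines the per-sample assemblies; assemblies.txt is assumed
    # to list each <name>_clout/transcripts.gtf, one per line
    cmds.append(["cuffmerge", "-g", annotation, "-s", fasta,
                 "-p", str(threads), "assemblies.txt"])
    # Cuffdiff compares conditions; replicates are comma-separated
    cuffdiff = ["cuffdiff", "-o", "diff_out", "-b", fasta, "-p", str(threads),
                "-L", ",".join(bams), "-u", "merged_asm/merged.gtf"]
    cuffdiff += [",".join(paths) for paths in bams.values()]
    cmds.append(cuffdiff)
    return cmds


# Hypothetical experiment: two conditions, C1 with two replicates
samples = {
    "C1": [("C1_R1", "C1_R1.fq"), ("C1_R2", "C1_R2.fq")],
    "C2": [("C2_R1", "C2_R1.fq")],
}
commands = tuxedo_commands(samples)
for cmd in commands:
    print(" ".join(cmd))
```

CummeRbund is an R package, so the visualization step would follow in R on the Cuffdiff output directory.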
Why merge data across treatments?
Differential transcript abundance mechanisms
Transcriptome analysis
• Without a reference
– First step is assembly
– Transcriptome assembly pipelines
Velvet/Oases – Oases is a post-assembly processor for Velvet
Trans-ABySS (BCGSC) – based on ABySS parallel assembler
Rnnotator – based on Velvet
Trinity (Broad Institute) – a set of three programs
– Common strategy: assembly at multiple k values, then merging of the resulting contigs, followed by refinement
– Once an assembly is available, continue with analysis as in the reference-based case
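The merging step of the multiple-k strategy can be illustrated with a toy example. This is a simplified sketch, not the procedure used by any of the pipelines listed above: it pools contigs from several assemblies and drops any contig that is contained, in either orientation, within a longer one. Real pipelines also join overlapping contigs and do further refinement. The sequences and k labels are invented for illustration.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def merge_contigs(assemblies):
    """Pool contigs from several assemblies and remove redundant ones.

    assemblies : dict of k value -> list of contig sequences
    Returns contigs sorted longest first, with exact duplicates and contigs
    contained in a longer contig (on either strand) removed.
    """
    pooled = {c for contigs in assemblies.values() for c in contigs}
    ordered = sorted(pooled, key=lambda s: (-len(s), s))
    kept = []
    for contig in ordered:
        rc = revcomp(contig)
        # Skip contigs already represented within a longer kept contig
        if any(contig in longer or rc in longer for longer in kept):
            continue
        kept.append(contig)
    return kept


# Hypothetical contigs from three assemblies (labels stand in for k values)
assemblies = {
    21: ["ACGTACGTTAGC", "GGGCCC"],
    25: ["TGGCTAAC"],
    31: ["ACGTACGTTAGCCA", "TTTAAACC"],
}
merged = merge_contigs(assemblies)
print(merged)
```

Here the 12-base contig and the reverse-complement fragment TGGCTAAC are both absorbed by the longest contig, leaving three non-redundant sequences.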
After Transcriptome Assembly…
• Some amount of analysis of differential splicing versus
differential promoter activity is possible, but conclusions
may be less robust in the absence of a reference
• The fraction of the total number of genes that can be
discovered by RNA-seq depends on the diversity of tissue
types and developmental stages analyzed, as well as the
depth of sequencing
330 million SOLiD reads from a human cell line detect only about 67% of all annotated transcripts in the human genome.
Source: “Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling”, Labaj et al, Bioinformatics 27:i383-91, 2011
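The dependence of transcript discovery on sequencing depth can be explored with a simple model. The sketch below is not taken from Labaj et al; it assumes reads are sampled independently, so that the number of reads from a transcript is approximately Poisson distributed, and it counts a transcript as detected if it receives at least one read. The abundance values are invented and span several orders of magnitude, as expression levels typically do.

```python
import math


def expected_fraction_detected(weights, n_reads):
    """Expected fraction of transcripts receiving at least one read.

    weights : relative abundances (any non-negative scale, not all zero)
    n_reads : total number of reads sequenced
    Assumes Poisson sampling: P(detected) = 1 - exp(-expected read count).
    """
    total = sum(weights)
    detected = 0.0
    for w in weights:
        expected_reads = n_reads * w / total
        detected += 1.0 - math.exp(-expected_reads)
    return detected / len(weights)


# Hypothetical abundances covering eight orders of magnitude
weights = [10.0 ** -i for i in range(8)]
for depth in (1e4, 1e6, 1e8):
    frac = expected_fraction_detected(weights, depth)
    print(f"{depth:.0e} reads: {frac:.2f} of transcripts expected to be detected")
```

Under this model, transcripts that are not expressed in the sampled tissue are never detected at any depth, which is why tissue and developmental-stage diversity matters as well as depth.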
Transcriptome analysis with RSEM
RNA-Seq by Expectation-Maximization
Li & Dewey, BMC Bioinformatics 12:323, 2011
(a). Allows estimation of transcript abundance without a reference
genome, based on alignments to assembled transcripts, although the
transcripts can be taken from a reference genome sequence if it
is available
(b). Uses the Bowtie aligner by default, but considers reads that map to
multiple locations in the reference transcript collection
(c). For each sample, files of estimated gene and isoform
abundance are produced, along with SAM files of alignments.
(d). The files of gene and isoform abundance can be used to
evaluate differential expression using tools from R and Bioconductor
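How an expectation-maximization approach handles reads that map to multiple transcripts, as in point (b), can be shown with a toy example. This is a heavily simplified sketch and not RSEM's actual model, which also accounts for factors such as transcript length and read quality. Here each read is described only by the set of transcripts it aligns to, and the function name and data are hypothetical.

```python
def em_abundance(read_hits, transcripts, n_iter=200):
    """Estimate the fraction of reads from each transcript by EM.

    read_hits   : list of sets; each set holds the transcripts a read maps to
    transcripts : list of all transcript names
    """
    # Start from equal abundances
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        # E-step: split each read among its hits in proportion to abundance
        expected = {t: 0.0 for t in transcripts}
        for hits in read_hits:
            total = sum(theta[t] for t in hits)
            for t in hits:
                expected[t] += theta[t] / total
        # M-step: new abundance is the share of reads assigned
        theta = {t: expected[t] / len(read_hits) for t in transcripts}
    return theta


# Hypothetical data: 6 reads unique to tx_A, 2 unique to tx_B,
# and 4 reads that map equally well to both
read_hits = [{"tx_A"}] * 6 + [{"tx_B"}] * 2 + [{"tx_A", "tx_B"}] * 4
theta = em_abundance(read_hits, ["tx_A", "tx_B"])
print({t: round(v, 3) for t, v in theta.items()})
```

The ambiguous reads end up divided 3:1 in favor of tx_A, following the ratio of uniquely mapped reads, rather than being discarded or split evenly.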