RNA-seq: Quantifying the
Transcriptome
Alisha Holloway, PhD
Gladstone Bioinformatics Core Director
What is RNA-seq?
Use of high-throughput sequencing technologies
to assess the RNA content of a sample.
Why do an RNA-seq experiment?
• Detect differential expression
• Assess allele-specific expression
• Quantify alternative transcript
usage
• Discover novel genes/transcripts,
gene fusions
• Profile transcriptome
• Ribosome profiling to measure
translation
Skelly et al. 2011
[Figure: differentiation time course: Pluripotent Stem Cell → Cardiogenic Mesoderm → Cardiac Precursors → Cardiomyocytes]
More tomorrow!
Ingolia et al. 2009, Weissman Lab
RNA-seq vs. Microarray
RNA-seq:
• ID novel genes, transcripts, & exons
• Greater dynamic range
• Less bias due to genetic variation
• Repeatable
• No species-specific primer/probe design
• More accurate relative to qPCR
• Many more applications
Microarray:
• Well vetted QC and analysis methods
• Well characterized biases
• Quick turnaround from established core facilities
• Currently less expensive
RNA-seq vs. Affy
Marioni et al. 2008
RNA-seq vs. Taqman
© 2010 NuGen
                Illumina                              Pac-Bio
Read length     100 bp paired end                     2500 bp avg
Throughput      200 million read pairs/lane           1 million reads/SMRT cell
Error rate      <1%                                   15% total, most are indels, 4% SNP
Cost            $600/sample                           $7-8k/sample
Accessibility   UCSF, UC-Davis, BGI                   No commercially available protocols
Uses            DE, ASE, alternative transcript usage Characterize transcriptome
When to use Pac-Bio
Plan it well.
• Experimental design
– Biological replicates
– Reference genome?
– Good gene annotation?
• Read depth
• Barcoding
• Read length
• Paired vs. single-end
Biological variation
Technical variation
How much data do we need?
• ~15-20K genes are expressed in a given tissue or cell line.
• Genes average ~3 kb.
• For 1x coverage using 100 bp reads, we would need 600K sequence reads (20K genes x 3 kb / 100 bp).
• In reality, we need MUCH higher coverage to accurately estimate gene expression levels.
• Target: ~50 million reads per sample
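The back-of-envelope arithmetic on this slide can be checked with a short sketch. The function name is illustrative, and the 20,000-gene figure is simply the upper end of the slide's 15-20K range.

```python
def reads_for_coverage(n_genes, mean_gene_len_bp, read_len_bp, coverage=1.0):
    """Reads needed to cover the expressed transcriptome at the given average depth."""
    transcriptome_bp = n_genes * mean_gene_len_bp
    return coverage * transcriptome_bp / read_len_bp

# Slide's numbers: ~20K expressed genes, 3 kb average length, 100 bp reads.
one_x = reads_for_coverage(20_000, 3_000, 100)
print(f"1x coverage: {one_x:,.0f} reads")

# Average depth implied by 50 million reads, assuming (unrealistically)
# that every gene is expressed at the same level.
implied = 50_000_000 * 100 / (20_000 * 3_000)
print(f"50M reads ~ {implied:.0f}x average coverage")
```

Expression levels span several orders of magnitude, so the average depth greatly overstates the coverage of weakly expressed genes. That is why the practical target is far above the 1x figure.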
• 200 million reads / lane
• At ~50 million reads per sample, barcode and run 4 samples / lane
Unique sequences = 4^(read length)
Read length    Unique sequences
25             1.1 x 10^15
50             1.3 x 10^30
100            1.6 x 10^60
~60 million coding bases in vertebrate genome
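The table values follow directly from four possible bases at each read position. A minimal sketch that reproduces them and compares each count with the ~60 million coding bases quoted on the slide (the function and constant names are illustrative):

```python
CODING_BASES = 60_000_000  # ~60 million coding bases, as quoted on the slide

def unique_sequences(read_len):
    """Number of distinct DNA sequences of a given length (4 bases per position)."""
    return 4 ** read_len

for read_len in (25, 50, 100):
    n_unique = unique_sequences(read_len)
    print(f"{read_len:>3} bp: {n_unique:.1e} possible sequences, "
          f"{n_unique / CODING_BASES:.1e} per coding base")
```

This is a theoretical upper bound. Real genomes contain repeats and paralogous genes, so even long reads do not always map uniquely.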
Paired-end!
• Effectively doubles read length –
huge impact on read mapping
• Increases number of splice junction
spanning reads
• Critical for estimating transcript-level abundance
The wet lab side…briefly
How do you make sense of this pile of
data?
• QC
• Alignment
• Expt: Compare two groups
– Transcript Assignment & Abundance
– Differential Expression
• Expt: Allele-specific expression
QC
• FastQC - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• Proportion of reads that mapped uniquely
– Remove duplicates; likely due to PCR amp.
• Assess ribosomal RNA content
• Assess content of possible contaminants –
human RNA (if not human samples),
Mycoplasma (if cell lines)
Then what?
• Align reads to the genome
– Easy(ish) for genomic sequence
– Difficult for transcripts with splice junctions
Alignment Algorithms
• Burrows-Wheeler Transform
– Bowtie (Langmead et al 2009)
– BWA (Li and Durbin 2009)
– SOAP2 (Li et al. 2009)
• Smith-Waterman
– BFAST (Homer et al. 2009, based on BLAT) – multiple indexes, finds
candidate alignment locations using seed and extend, followed by a
gapped Smith-Waterman local alignment for each candidate
http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
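To make the Burrows-Wheeler idea concrete, here is a toy sketch of the transform and of exact-match counting by backward search. It is not how Bowtie, BWA or SOAP2 are implemented: those tools use compressed FM-indexes and tolerate mismatches. The function names are illustrative.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive; fine for toy inputs)."""
    text += "$"  # sentinel, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(bwt_str, pattern):
    """Count exact occurrences of pattern in the original text using backward search."""
    first_col = sorted(bwt_str)
    # c_table[c]: number of characters in the text that sort strictly before c
    c_table = {c: first_col.index(c) for c in set(bwt_str)}
    lo, hi = 0, len(bwt_str)  # half-open range of sorted rotations matching the suffix so far
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + bwt_str[:lo].count(ch)
        hi = c_table[ch] + bwt_str[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo

genome = "ACGTACGTTAC"
index = bwt(genome)
print(index, count_occurrences(index, "ACG"), count_occurrences(index, "GG"))
```

The pattern is consumed from its last base to its first, and each step narrows a range of sorted rotations. Production aligners replace the repeated `count` calls with precomputed occurrence tables.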
Alignment tools for splice junction mapping
• TopHat
• MapSplice
• SpliceMap
• HMMsplicer
TopHat
• Map reads to transcriptome using Bowtie
• Map to genome to discover novel exons
– or start here if no annotation available
• Split reads to smaller segments; map to
genome to discover novel splice junctions
• Report best alignment for each read
Trapnell et al. Bioinformatics 2009; Trapnell et al. Nature Protocols 2012
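The "split reads into smaller segments" step can be illustrated with a toy sketch. This is not TopHat's implementation. It assumes exact matching, a made-up genome, and a junction that falls exactly on a segment boundary, and all names and sequences here are hypothetical.

```python
def split_read(read, seg_len):
    """Split a read into consecutive segments of seg_len (the last may be shorter)."""
    return [read[i:i + seg_len] for i in range(0, len(read), seg_len)]

def map_segments(read, genome, seg_len):
    """Exact-match each segment to the genome; position is -1 if unmapped."""
    return [(seg, genome.find(seg)) for seg in split_read(read, seg_len)]

def candidate_junctions(hits):
    """Gaps between adjacent mapped segments that are not contiguous in the genome."""
    junctions = []
    for (seg_a, pos_a), (_, pos_b) in zip(hits, hits[1:]):
        end_a = pos_a + len(seg_a)
        if pos_a >= 0 and pos_b > end_a:
            junctions.append((end_a, pos_b))  # 0-based, half-open intron interval
    return junctions

# Invented toy sequences: two 10 bp exons separated by a 16 bp intron.
exon1, intron, exon2 = "ACGTTGCAAG", "GTAAGTTTTTCTCTAG", "CCATGGATCC"
genome = exon1 + intron + exon2
read = exon1 + exon2  # a spliced read that will not map contiguously

print(candidate_junctions(map_segments(read, genome, seg_len=10)))
```

In real data a junction usually falls inside a segment, so the segment spanning it fails to map and its position is inferred from its mapped neighbours. The sketch skips that case.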
MapSplice & SpliceMap
• Tag alignment (user chooses aligner)
– Break reads into segments
– Map reads
– Unmapped segments considered for splice
junction mapping based on location of partner
segment
– Merge segments from read for final alignment
• Assess splice junction quality
Wang et al. NAR 2010, Au et al. NAR 2010
HMMsplicer
• Remove reads that map contiguously
• Hidden Markov model to detect exon
boundary of remaining reads
• Compute intensive
• Reference annotation not used
• Best for compact genomes
• User sets threshold for accepting splice
junction.
Dimon et al. PLoS One 2010
Transcript Assignment/Abundance
Martin & Wang, Nature Reviews Genetics 2011
Transcript Assignment & Abundance Tools
• For DE:
– Cufflinks
– MISO
– Scripture – not maintained
• De novo assembly
– Cufflinks
– Trans-ABySS
– Trinity
– Maker
Cufflinks
• Constructs the parsimonious set of transcripts
that explain the reads observed. Basically,
finds a minimum path cover on the DAG.
• Derives a likelihood for the abundances of a
set of transcripts given a set of fragments.
• FPKM – fragments per kb of exon per million
fragments mapped.
Trapnell, Pachter
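The FPKM unit itself is simple arithmetic. The sketch below shows only the unit conversion. It is not Cufflinks' estimator, which assigns fragments to transcripts by maximum likelihood. The function name and example numbers are illustrative.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped."""
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)

# Made-up example: 600 fragments on a 3 kb transcript in a 50-million-fragment library.
print(fpkm(600, 3_000, 50_000_000))
```

Dividing by transcript length and library size makes values comparable between long and short transcripts and between deeply and shallowly sequenced samples.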
MISO
• Mixture of Isoforms
• Bayesian – treats expression level of set of
isoforms as random variable and estimates a
distribution over the values of this variable.
• Gives confidence intervals for expression
estimates and measures of DE as Bayes factors
Burge Lab @ MIT
Bias Correction and Normalization
• Random hexamer bias
(Hansen et al. 2010)
– From PCR or RT primers
– Reestimate FPKM or read
counts based on bias
• Upper quartile normalization
(Bullard et al. 2010)
– excellent resource for
comparison to qPCR and
microarray as well as
methods of normalization of
RNA-seq data
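A minimal sketch of upper quartile normalization, under stated assumptions: each sample is scaled by the 75th percentile of its own nonzero gene counts, then rescaled to the mean upper quartile across samples. Bullard et al. describe the method in more detail, and their zero-count filter differs from this simplification. The function names, interpolation rule and example counts are illustrative.

```python
def upper_quartile(values):
    """75th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    rank = 0.75 * (len(ordered) - 1)
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (rank - low) * (ordered[high] - ordered[low])

def upper_quartile_normalize(counts_by_sample):
    """Scale each sample so the upper quartile of its nonzero counts matches across samples."""
    uq = {s: upper_quartile([c for c in counts if c > 0])
          for s, counts in counts_by_sample.items()}
    target = sum(uq.values()) / len(uq)  # common scale: mean upper quartile (an arbitrary choice)
    return {s: [c * target / uq[s] for c in counts]
            for s, counts in counts_by_sample.items()}

# Invented counts: sample B was sequenced twice as deeply as sample A.
raw = {"A": [0, 10, 20, 30, 40], "B": [0, 20, 40, 60, 80]}
print(upper_quartile_normalize(raw))
```

Using the upper quartile rather than the total count makes the scaling factor less sensitive to a handful of very highly expressed genes.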
Differential Expression
• Goal: determine whether observed difference
in read counts is greater than would be
expected due to random variation.
• If reads were independently sampled from the
population, read counts would follow a multinomial
distribution, approximated by the Poisson:
\[ \Pr(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!} \]
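The Poisson probability Pr(X = k) with rate λ can be evaluated directly. This short sketch also confirms numerically that the mean and variance both equal λ, which is the property that real count data violate. The rate of 10 is an arbitrary illustrative value.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k reads when the expected count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 10.0
ks = range(100)  # the tail beyond 100 is negligible for lam = 10
total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
print(f"sum={total:.6f} mean={mean:.6f} variance={var:.6f}")
```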
Differential Expression
• BUT! We know that the count data show more
variance than expected
• Overdispersion problem mitigated by using the
negative binomial distribution, which is
determined by mean and variance
\[ K_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma_{ij}^{2}) \]
where K_ij is the read count for gene i in sample j
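One way to see overdispersion is to compare the sample variance of replicate counts with their mean. The sketch below assumes the common parametrization variance = μ + α·μ², where α is a dispersion parameter. The slide does not state this form, and the estimator here is a plain method-of-moments calculation, not what DESeq, edgeR or Cuffdiff actually do. The replicate counts are invented.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments alpha in var = mu + alpha * mu**2; 0.0 if no overdispersion."""
    mu = mean(counts)
    var = variance(counts)  # sample variance, n - 1 denominator
    if mu == 0:
        return 0.0
    return max(0.0, (var - mu) / mu ** 2)

# Invented replicate counts for one gene across five samples.
replicates = [80, 100, 120, 95, 105]
print(mean(replicates), variance(replicates), moment_dispersion(replicates))
```

With only a few replicates per gene this per-gene estimate is very noisy. That is the motivation for sharing information across genes, as the tools on the following slide do.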
Differential Expression
• Binomial test
– Old Cuffdiff
• Negative binomial
– DESeq – estimate variance using all genes with
similar expression levels
– Cuffdiff – similar to DESeq, but incorporates fragment
assignment uncertainty simultaneously
– edgeR – moderates variance over all genes
• T-test
Differential Expression
Old cuffdiff
Some biology, finally?
• How have gene expression patterns
changed during the course of differentiation?
• Which genes are specific to certain cell types?
• What can we learn about what those co-expressed genes do?
Clusters of co-expressed genes
• Use unsupervised
clustering to group genes
by expression pattern
• Use gene ontology
information to determine
which kinds of genes are in
each group
• Reveal novel associations
and gene types
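As a rough illustration of grouping genes by expression pattern, the sketch below links genes whose profiles across stages are highly correlated (single-linkage clustering at a correlation cutoff). The talk does not say which clustering method was used. The cutoff, the function names and all expression values are invented for illustration. Only the gene names come from the slides.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined for constant profiles)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_by_correlation(profiles, min_r=0.9):
    """Single-linkage clusters: genes joined whenever their correlation is >= min_r."""
    genes = list(profiles)
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= min_r:
                parent[find(a)] = find(b)

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())

# Invented expression values across four differentiation stages.
profiles = {
    "Nanog": [90, 40, 10, 5],
    "Oct4":  [80, 35, 12, 4],
    "Actc1": [2, 10, 60, 95],
    "Ryr2":  [1, 8, 50, 90],
}
print(cluster_by_correlation(profiles))
```

Each resulting group can then be tested for enriched gene ontology terms to see which kinds of genes it contains.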
Clusters of co-expressed genes
Pluripotency/stem cell: Nanog, Oct4
Mesoderm/cell fate commitment: Mesp1, Eomes
Cardiac precursors: Isl1, Mef2c, Wnt2
Cardiac structure/function: Actc1, Ryr2, Tnni3
Thanks for listening!
Alisha Holloway
Gladstone Institutes
Bioinformatics Core
alisha.holloway@gladstone.ucsf.edu