CMSC702_Cufflinks_Final - Center for Bioinformatics and

advertisement
Cufflinks
Matt Paisner, Hua He, Steve Smith and
Brian Lovett
The Vision
• RNAseq can be used for transcript discovery and
abundance estimation
• What’s missing: algorithms which aren’t restricted by
prior gene annotations (which are often
incomplete) and account for alternative
transcription and splicing.
• Hence, Cufflinks.
The Need
• Evidence of ambiguous assignment of isoforms. TSS
site/promoter changes and splice site changes
were found previously by the authors
• Longer reads and pair end reads do not do enough
The Biology
•
•
•
•
•
General assumption of randomization of reads
Central Dogma
Transcription Start Site (TSS)
Splice site
Isoform
Central Dogma and
Regulation
Splicing
• Thisblahblahblahblahblahblahisblahblahimportant
• Thisblahblahblahblahblahblahisblahblahimportant
• “This” “is” “important” - Exons
• “blah” - Introns (Intrusions)
Major Change 1
Major Change 2
Why it matters: Isoforms
• Not only different sizes, but different shapes
• Shape determines function
• Isoforms would map to the same section of the
genome: undetected without Cufflinks
• Separating transcripts into isoforms elucidates a
more realistic representation of what is happening
TopHat
Mapping short reads
Trapnell et. al, Bioinformatics, 2009
TopHat
• No genome reference annotations are needed
• The output of TopHat is the input of Cufflinks.
• Input: Reads and genome
• Output:
Read mappings
• Short reads present computational challenges
o BOWTIE
How does
TopHat Work?!
Big Idea: “Exon Inference”!!
Step 1: Initial Mapping
via Bowtie
Reference
Mapped Reads
• Group 1: Mapped Reads (Segments)
• Group 2: Initially Unmapped (IUM) Reads
o possibly intron-spanning read
• Based on Group 1, we want to get intron-spanning
reads from Group 2
Step 2: Generate Putative
Exons
Step 3: Look for Potential
Splice Signals
Putative Exons
Step 4: Seed-and-Extend
Cufflinks
Isoform/Transcript Detection and
Quantification
Trapnell et al, Nature Biotech, 2010
Step 5: Identify
Compatible Reads
Two reads are compatible if their overlap contains the
exact same implied introns (or none). If two reads are
not compatible they are incompatible.
Step 6:
Less BIOLOGY, and
NOW it is the time for
some GRAPH
THEORIES…….
“We emphasize that the definition of a transcription locus
is not biological……”
Authors
Step 6: Create Overlap
Graph
•
Connect compatible reads in order
Create a DAG
A path in this graph corresponds to a
transcript isoform
Theory
1. Solving minimum path cover (isoforms) in the
overlap graph implies the fewest transcripts
necessary to explain the reads.
2. Solve minimum path cover by finding largest set of
individual reads such that no two are compatible.
3. According to Dilworth Thereom, find a maximum
matching in a bipartite graph
Step 8: Convert a DAG
into a Bipartite Graph
Step 9: Looking for Maximum Matching inside a
bipartite graph via Bipartite Matching Algorithm
BIPARTITE-MATCHING Algorithm: Add augmenting path via BFS, repeatedly adding the
paths into the matching until none can be added.
A path in this graph corresponds to a
transcript isoform
Projective normalization
underestimates expression
isoform a
isoform b
project all isoforms into
genome coordinates
R reads total, r reads for the gene:
- ra for isoform a
- rb for isoform b
but
so
28
How should expression
levels be estimated?
(a)
(b)
• A-B are distinguished by the presence of splice
junction (a) or (b).
• A-C are distinguished by the presence of splice
junction (a) and change in UTR
• B-C are distinguished by the presence of splice
junction (b) and change in UTR
29
How should expression
levels be estimated?
(a)
(b)
• Longer transcripts contain more reads.
• Reads that could have originated from multiple
transcripts are informative.
• Relative abundance estimation requires
“discriminatory reads”.
30
A model for RNA-Seq
• r = Transcript proportions for
assignment of reads to transcripts
• L = Likelihood of this assignment
• R = all reads
31
A model for RNA-Seq
• T = All transcripts
32
A model for RNA-Seq
Define:
• Expected possible positions for an
arbitrary fragment in Transcript t
• F(i) = pr(random fragment has length
i)
33
A model for RNA-Seq
34
A model for RNA-Seq
• It (r) = Implied length of r’s fragment if
r is assigned to transcript t
• Recall: F(i) = pr(fragment length = i)
35
Projective normalization
underestimates expression
isoform a
isoform b
project all isoforms into
genome coordinates
R reads total, r reads for the gene:
- ra for isoform a
- rb for isoform b
but
so
36
A model for RNA-Seq
• Now we have a maximum likelihood
function in terms of r, the distribution of
reads among transcripts.
• Non-negative linear model
37
Inference with the
sequencing model
•
•
•
•
•
Maximum likelihood function is concave - optimization
using the EM algorithm.
Asymptotic MLE theory leads to a covariance matrix for
the estimator in the form of the inverse of the observed
Fisher information matrix
Importance sampling from the posterior distribution used
for estimating the abundances from the posterior
expectation, and 95% confidence intervals for the
estimates.
This approach extends the log linear model of H. Jiang
and W. Wong, Bioinformatics 2009 to a linear model for
paired end reads.
For more background see Li et al., Bioinformatics, 2010
and Bullard et al., BMC Bioinformatics, 2010.
38
Utility of Cufflinks
• mRNA as proxy for gene expression & action
• Control points
o transcriptional vs
o post transcriptional
• Does isoform-level discovery & quantification
matter?
o Apparently, yes
o Putatively discovered about 12K new isoforms while recovering about 13K
known
o Plus other stuff…
The skeletal myogenesis transcriptome
RNA-Seq (2x75bp GAIIx) along time course of mouse C2C12 differentiation
myotube
myoctyte
•84,369,078
•140,384,062
• 82,138,212
•123,575,666
reads
reads
reads
reads
•66,541,668
•103,681,081
•47,431,271
•89,162,512
alignments
alignments
alignments
alignments
•10,754,363
•19,194,697
•9,015,806
•17,449,848
to junctions
to junctions
to junctions
to junctions
•58,008
•69,716
•55,241
•63,664
transfrags
transfrags
transfrags
transfrags
-24
hours
120
hours
60
hours
differentiation
(starting at 0 hours)
168
hours
fusion
Illustration based on: Ohtake et al, J. Cell Sci., 2006; 119:3822-3832
40
Slide courtesy of Hector Corrada Brav
Validation
Validation
Projective normalization
underestimates expression
isoform a
isoform b
project all isoforms into
genome coordinates
R reads total, r reads for the gene:
- ra for isoform a
- rb for isoform b
but
so
43
Slide courtesy of Hector Corrada Brav
Discovery is necessary for accurate abundance estimates
44
Slide courtesy of Hector Corrada Brav
Some Questions…
• Do isoforms of a given gene have interesting
temporal patterns?
o Increasing, decreasing, more complex…
• What does this mean biologically?
• What about transcriptional versus post
transcriptional regulation?
o Differential transcription
o Differential splicing
Dynamics of Myc expression
46
Slide courtesy of Hector Corrada Brav
Overloading Metric using
Jensen-Shannon Divergence
Entropy of Average
Average Entropy
Metric:
One-sided t-test under the null hypothesis that there is no difference in
abundance;
Type I errors controlled with Benjamini-Hotchberg correction (FDR)
Regulatory Overloading
# Genes (FDR < 0.05)
Differential splicing
231
Fibronectin
Tropomyosin 1
Mef2d
…
17
101
Differential TSS preference
Fhl3
Fhl1
Myl1
…
Dynamics of Myc
expression
d(
,
)
49
Slide courtesy of Hector Corrada Brav
New TSS = New Points of
Regulation
What would a “collapsed” RNA-seq alignment look like? Microarray?
TSS=Transcription Start Site
Questions?
I am the DNA, and I
want a protein!
Transcription
Protein
mRNA
Translation
The DNA wants a protein.
Download