ppt - University of Connecticut

advertisement
RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes
Serghei Mangul*, Adrian Caciula*, Ion Mandoiu** and Alexander Zelikovsky*
*Georgia State University, **University of Connecticut
1
3
Introduction
Genes, Exons, Introns, and Splicing
INITIALIZATION: Uniform transcript frequencies f(j) ‘s
E STEP:
Compute the expected number n(j) of reads sampled from transcript
j (assuming current transcript frequencys f(j) )
M STEP:
For each transcript j, set of f(j) = portion of reads emitted by
transcript j among all reads in the sample
Gene - a segment of DNA or RNA
that carries genetic information.
Exon - a region of a gene which is
translated into protein
Intron - a region of a gene which is not
translated into protein
Splicing – a process in which the
introns are removed and exons are joined
to be translated into a single protein
Alternative Splicing
the process in which exons can
be spliced out in different
combinations named transcripts
to generate the mature RNA.
5
Expectation Maximization (EM)
Discovery and Reconstruction of Unannotated Transcripts
DRUT (Discovery and Reconstruction of Unannotated Transcripts):
GIVEN: A set of transcripts and frequencies for the reads.
FIND : Transcripts missing from the set.
DRUT
Quality of ML Model
Fig. 1. Chromosome with its DNA
Alternative splicing is a common
mode of gene regulation within
cells, being used by 90–95% of
human genes.
It can drastically alter the
Fig. 2. Alternative Splicing Process
function of a gene in different
tissue types or environmental
conditions, or even inactivate the gene completely.
The possible gaps in the ML model include:
erroneous reads caused by genotyping errors
missing and/or chimerical candidate transcripts
an inaccurate read to transcript match (caused by genotyping errors)
non-uniform emitting of reads by transcripts

j:hi , j 0
Unspliced
reads
Annotated
transcript
a) Map reads to annotated
transcripts (using Bowtie)
Measure the quality of ML model by deviation D of observed reads from
expected reads (ej)
| oj  ej |

j
|R| is the number of reads
D
|R|
Expected read frequencies (ej) are calculated based on
weighted match between reads and strings
ML
f
maximum likelihood frequencies estimations of transcripts ( j )
Fig4 shows the relation between transcripts, exons and
hTj ,i ML
reads
ej 
fj
Spliced
reads
b) VTEM: Identify “overexpressed”
exons (possibly from unannotated
transcripts)
Overexpressed exons
c) Assemble Transcripts (e.g., Cufflinks)
using reads from “overexpressed”
Novel
exons and unmapped reads
transcript
h
Tj ,l
l
d) Output: annotated transcripts + novel
transcripts
Alternative splicing is implicated in many diseases.
Fig. 4. Transcripts – Exons –Reads Relation.
4
Virtual Transcript Expectation Maximization (VTEM)
-> Observed frequencies
- EDGES: weights ~ probability of the read to be emitted
by the transcript
Fig. 9(a) shows that in genes with more transcripts is more difficult to
correctly reconstruct all transcripts. As a result Cufflinks performs better on
genes with few transcripts since annotations are not used in it standard settings.
DRUT has higher sensitivity on genes with 2 and 3 transcripts, but RABT is
better on genes with 4 transcripts.
For genes with more than 4 transcripts performance of annotation-guided
methods is equal to ”existing annotations ratio”, which mean what these
methods are unable to reconstruct unannotated transcript..
ML Problem:
GIVEN: Annotations (transcripts) and
frequencies of the reads.
FIND: ML estimate of transcript frequencies
Fig 3. Panel: Bipartite Graph - consisting
of transcripts with unknown frequencies
and reads with observed frequency (oj)
SUBPROBLEMS:
Decide if the panel is likely to be incomplete
Estimate total frequency of missing transcripts
Identify read spectrum emitted by missing transcripts
Assemble missing transcripts from read spectrum emitted by
missing transcripts
Input data of EM is a panel: a bipartite graph
a set of candidate transcripts that are believed to emit the set of reads
weighted match based on mapping of the read i to the transcripts j (hTj, i)
ML Estimates of Transcripts Frequencies
Probability that a read is sampled from transcript j is proportional with f(j)
f(j) transcript (unknown) frequency
ML estimates for f(j) is given by n(j)/(n(1) + . . . + n(N))
n(j) denotes the number of reads sampled from transcript j
Experimental Results
Simulation Setup: human genome data (UCSC hg18)
UCSC database - 66, 803 isoforms
19, 372 genes, Single error-free reads: 60M of length 100bp
for partially annotated genome -> remove from every gene exactly one isoform
- LEFT: transcripts -> unknown frequencies
- RIGHT: reads
6
Fig. 7. VTEM
1.0
1.0
0.9
0.8
0.8
Partial Annotations
(T3 missing)
Complete Annotations
exons
ML
transcripts
.25
T1
.5
T2
.25
E1
E2
E3
O
.25
.25
.25
E
.25
.25
.25
1st
.25
2nd
ML ML ML
.34
.66
.33
.65
.02
1st | 2nd | 3rd
Run | Run | Run
3rd
|
|
Run | Run | Run
0
T3
O<E
Decrease VT
weights (to 0)
.20
.6
.2
.25
exons
transcripts
T1
VT
VT frequency stays 0
No false positives
E – Expected exon frequencies
VT – Virtual Transcripts with hTi, j = 0
ML – Estimatied transcript frequency
E1
E2
VT frequency increases!
Deviation of expected
from observed decreases!
Fig 8. An example of VTEM estimation
E2
E3
.25
.32
.3
.25
.25
.32
.3
.25
2
3
4
5
0.6
0.5
DRUT
RABT
Cufflinks
Existing annotations
0.0
0.7
E3
VT
.25
.16
.15
.25
.25
.16
.15
.25
O>E
Increase VT
weights
Overexpressed exons
VT frequency (.2) ≈ T3 frequency (.25)
VT’s exons (E3,E4)= T3’s exons (E3,E4)
DRUT
RABT
Cufflinks
0.4
6
7
8
>8
a) Number of transcripts per gene
T2
E4
Observed = Expected
Nothing to update
E1
0.4
0.2
E4
0
O
0.6
PPV
Maximum Likelihood (ML) Model
Sensitivity
2
2
3
4
5
6
7
8
>8
b) Number of transcripts per gene
Fig. 9. a) Sensitivity and PPV of the methods grouped by the number of transcripts per gene. Here, 60M
single reads of length 100bp are simulated
* Cufflinks is a well known tool for transcriptome reconstruction [2].
References
1. S. Mangul, I. Astrovskaya, M. Nicolae, B. Tork, I. Mandoiu, and A. Zelikovsky, “Maximum likelihood estimation of
incomplete genomic spectrum from hts data,” in Proc. 11th Workshop on Algorithms in Bioinformatics, 2011.
2. C. Trapnell, B. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. van Baren, S. Salzberg, B. Wold, and L. Pachter,
“Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell
differentiation.” Nature biotechnology, vol. 28, no. 5, pp. 511–515, 2010.
Download