Estimation of alternative splicing isoform frequencies from RNA

Marius Nicolae
Computer Science and Engineering Department
University of Connecticut
Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky




Introduction
EM Algorithm
Results
Conclusions and future work
Make cDNA & shatter into fragments
Sequence fragment ends
Map reads
Gene Expression (GE)
Isoform Discovery (ID)
Isoform Expression (IE)

Read ambiguity (multireads)
What is the gene length?


Ignore multireads
[Mortazavi et al. 08]
◦ Fractionally allocate multireads based on unique
read estimates

[Pasaniuc et al. 10]
◦ EM algorithm for solving ambiguities

Gene length: sum of lengths of exons that
appear in at least one isoform
◦ Underestimate expression levels for genes with 2
or more isoforms [Trapnell et al. 10]
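
The gene length definition above can be sketched as follows. This is a minimal illustration, assuming exons are given as half-open (start, end) intervals per isoform; the function name `gene_length` and the toy coordinates are hypothetical and not taken from the slides.

```python
def gene_length(isoforms):
    """Sum of lengths of exonic regions that appear in at least one isoform.

    `isoforms` maps an isoform name to a list of half-open (start, end)
    exon intervals (an assumed input format). Overlapping exons are merged
    so that shared bases are counted once.
    """
    intervals = sorted(iv for exons in isoforms.values() for iv in exons)
    total = 0
    cur_start, cur_end = None, None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # Close the previous merged region before opening a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching interval: extend the current region.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Hypothetical toy gene with two isoforms.
toy = {
    "iso1": [(0, 100), (200, 300), (400, 500)],
    "iso2": [(0, 100), (150, 300)],
}
print(gene_length(toy))
```

In this toy case the union length is 350, while the two isoforms are 300 and 250 bases long, so normalizing read counts by the union divides by more than the length of either transcript.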

[Jiang&Wong 09]
◦ Poisson model, single reads only

[Li et al. 10]
◦ EM Algorithm, single reads only

[Feng et al. 10]
◦ Convex quadratic program, pairs used only for ID

[Trapnell et al. 10]
◦ Extends Jiang’s model to paired reads
◦ Fragment length distribution

EM Algorithm for IE
◦ Single and paired reads
◦ Fragment length distribution
◦ Strand information
◦ Base quality scores

Solving GE by adding isoform levels
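
Solving GE by adding isoform levels amounts to a grouped sum. A minimal sketch, assuming isoform frequencies and an isoform-to-gene mapping are available as dictionaries (the names `gene_levels`, `isoform_freq`, and `isoform_to_gene` are hypothetical):

```python
from collections import defaultdict


def gene_levels(isoform_freq, isoform_to_gene):
    """Gene expression level = sum of the levels of the gene's isoforms."""
    totals = defaultdict(float)
    for isoform, freq in isoform_freq.items():
        totals[isoform_to_gene[isoform]] += freq
    return dict(totals)


# Hypothetical example: gene g1 has two isoforms, gene g2 has one.
isoform_freq = {"g1.a": 0.25, "g1.b": 0.25, "g2.a": 0.5}
isoform_to_gene = {"g1.a": "g1", "g1.b": "g1", "g2.a": "g2"}
print(gene_levels(isoform_freq, isoform_to_gene))
```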

EM Algorithm


Paired reads
Single reads
[Figure: schematic with exons A, B, C for each read type]
E-step
M-step
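
The E-step and M-step can be illustrated with a generic EM loop for ambiguous reads. This is a simplified sketch and not the authors' implementation: it assumes each read comes with a precomputed weight per compatible isoform (for example, a probability derived from the fragment length distribution), it estimates the fraction of reads originating from each isoform, and it omits isoform length normalization. The function name `em_isoform_fractions` and the input format are hypothetical.

```python
def em_isoform_fractions(reads, isoforms, n_iter=100):
    """Generic EM sketch for read-to-isoform assignment.

    reads    : list of dicts {isoform: weight}, one dict per read, where the
               weight is an assumed precomputed compatibility score.
    isoforms : iterable of isoform names.
    Returns the estimated fraction of reads coming from each isoform.
    """
    isoforms = list(isoforms)
    freq = {j: 1.0 / len(isoforms) for j in isoforms}  # uniform start
    for _ in range(n_iter):
        counts = {j: 0.0 for j in isoforms}
        # E-step: split each read among its compatible isoforms in
        # proportion to (current frequency) x (read weight).
        for compat in reads:
            z = sum(freq[j] * w for j, w in compat.items())
            if z == 0.0:
                continue  # read is incompatible with every isoform
            for j, w in compat.items():
                counts[j] += freq[j] * w / z
        # M-step: new frequencies are the normalized expected read counts.
        total = sum(counts.values())
        if total == 0.0:
            break
        freq = {j: c / total for j, c in counts.items()}
    return freq


# Hypothetical toy data: 6 reads unique to A, 2 unique to B, 4 multireads.
reads = [{"A": 1.0}] * 6 + [{"B": 1.0}] * 2 + [{"A": 1.0, "B": 1.0}] * 4
print(em_isoform_fractions(reads, ["A", "B"]))
```

With equal weights on the multireads, the estimate is driven by the unique reads, so the loop converges to fractions of about 0.75 for A and 0.25 for B.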

Results

Human genome UCSC known isoforms
[Figure: histograms of number of isoforms by isoform length, and number of genes by number of isoforms]

GNFAtlas2 gene expression levels
◦ Uniform/geometric expression of gene isoforms

Normally distributed fragment lengths
◦ Mean 250, std. dev. 25

Error Fraction (EF)
◦ Percentage of isoforms (or genes) with relative error
larger than given threshold t

Median Percent Error (MPE)
◦ Threshold t for which EF is 50%

r²
◦ Coefficient of determination
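
The three accuracy measures above can be computed as in the following sketch. It assumes all true expression levels are nonzero, takes MPE as the median relative error expressed in percent, and computes r² as the squared Pearson correlation; these choices and the function names are assumptions, not details given in the slides.

```python
import statistics


def relative_errors(true, est):
    """Per-isoform relative error |est - true| / true (true must be nonzero)."""
    return [abs(e - t) / t for t, e in zip(true, est)]


def error_fraction(true, est, threshold):
    """EF: percentage of isoforms with relative error above the threshold."""
    errs = relative_errors(true, est)
    return 100.0 * sum(e > threshold for e in errs) / len(errs)


def median_percent_error(true, est):
    """MPE: the threshold (in percent) at which EF is 50%, i.e. the median."""
    return 100.0 * statistics.median(relative_errors(true, est))


def r_squared(true, est):
    """r^2, computed here as the squared Pearson correlation (an assumption)."""
    mt, me = statistics.fmean(true), statistics.fmean(est)
    cov = sum((t - mt) * (e - me) for t, e in zip(true, est))
    var_t = sum((t - mt) ** 2 for t in true)
    var_e = sum((e - me) ** 2 for e in est)
    return cov * cov / (var_t * var_e)


# Hypothetical true and estimated levels for four isoforms.
true = [10.0, 20.0, 40.0, 80.0]
est = [11.0, 15.0, 40.0, 100.0]
print(error_fraction(true, est, 0.2))
print(median_percent_error(true, est))
print(r_squared(true, est))
```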

30M single reads of length 25
[Figure: % of isoforms over threshold vs. relative error threshold (0 to 1), for Uniq, Rescue, UniqLN, RSEM, and IsoEM]

Main difference between IsoEM and RSEM is fragment length
modeling
30M single reads of length 25
[Figure: % of genes over threshold vs. relative error threshold (0 to 1), for Uniq, Rescue, GeneEM, RSEM, and IsoEM]

Fixed sequencing throughput (750Mb)
[Figure: r² vs. read length and Median Percent Error vs. read length (25 to 95), for paired reads and single reads]

50bp reads better than 100bp!

1-60M 75bp reads
[Figure: r² vs. # reads, for RandomStrand-Pairs-PerfectMapping, RandomStrand-Pairs, CodingStrand-pairs, RandomStrand-Single, and CodingStrand-single]


Pairs help, strand info doesn’t
[Trapnell et al. 10] r² = 0.95 for 13M PE reads

Conclusions and future work



Presented EM algorithm for isoform frequency
estimation that exploits fragment length
distribution for both single and paired reads
◦ Significant accuracy improvement over existing
methods
◦ Code and datasets to be released publicly soon

Ongoing extensions
◦ Confidence intervals
◦ Allele-specific isoform expression
◦ Testing for novel isoforms
◦ Integration with isoform discovery