An Integer Programming Approach to Novel
Transcript Reconstruction from Paired-End
RNA-Seq Reads
Serghei Mangul
Department of Computer Science
Georgia State University
Joint work with Adrian Caciula, Sahar Al Seesi, Dumitru Brinza,
Abdul Rouf Banday, Rahul N. Kanadia, Ion Mandoiu and Alex Zelikovsky
Advances in Next Generation Sequencing
Illumina HiSeq 2000
Up to 6 billion PE reads/run
35-100bp read length
Roche/454 FLX Titanium
400-600 million reads/run
400bp avg. length
http://www.economist.com/node/16349358
Ion Proton Sequencer
SOLiD 4/5500
1.4-2.4 billion PE reads/run
35-50bp read length
2
RNA-Seq
Make cDNA & shatter into fragments
Sequence fragment ends
Map reads
[Figure: mapped reads feed three analyses: Gene Expression, Transcriptome Reconstruction, Isoform Expression]
3
Transcriptome Reconstruction
• Given partial or incomplete information about
something, use that information to make an
informed guess about the missing or unknown
data.
4
Transcriptome Reconstruction Types
• Genome-independent reconstruction (de novo)
– de Bruijn k-mer graph
• Genome-guided reconstruction (ab initio)
– Spliced read mapping
– Exon identification
– Splice graph
• Annotation-guided reconstruction
– Use existing annotation (known transcripts)
– Focus on discovering novel transcripts
5
Previous approaches
• Genome-independent reconstruction
– Trinity(2011), Velvet(2008), TransABySS(2008)
• Genome-guided reconstruction
– Scripture(2010): reports “all” transcripts
– Cufflinks(2010), IsoLasso(2011), SLIDE(2012): minimize the set of transcripts explaining the reads
• Annotation-guided reconstruction
– RABT(2011), DRUT(2011)
6
Challenges and Solutions
• Read length is currently much shorter
than transcript length
• Paired-end reads
• Fragment length distribution
7
[Figure: a gene with exons 1-7 and four candidate transcripts]
t1 : 1 2 3 4 5 6 7
t2 : 1 3 4 5 6 7
t3 : 1 3 4 5 7
t4 : 1 2 3 4 5 7
Exon 2 and 6 are “distant” exons : how to phase them?
TRIP
Transcriptome Reconstruction using Integer Programming
• Map the RNA-Seq reads to the genome
• Construct Splice Graph - G(V,E)
– V: exons
– E: splicing events
• Candidate transcripts
– depth-first-search (DFS)
• Filter candidate transcripts
– fragment length distribution (FLD)
– integer programming
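The candidate-generation step can be sketched as follows. This is a minimal illustration, not the TRIP implementation: the function names, the toy junction list, and the choice of exon 1 as source and exon 7 as sink are assumptions made for the example.

```python
from collections import defaultdict


def build_splice_graph(junctions):
    """Adjacency list: exon -> sorted exons reachable by one splicing event."""
    graph = defaultdict(set)
    for src, dst in junctions:
        graph[src].add(dst)
    return {v: sorted(ws) for v, ws in graph.items()}


def enumerate_paths(graph, source, sink):
    """Depth-first enumeration of all source-to-sink paths.

    Each path is one candidate transcript. The graph is assumed acyclic,
    which holds when edges always go from an upstream to a downstream exon.
    """
    paths = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            stack.append((nxt, path + [nxt]))
    return sorted(paths)


# Hypothetical gene with 7 exons in which exons 2 and 6 can be skipped
junctions = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
graph = build_splice_graph(junctions)
candidates = enumerate_paths(graph, 1, 7)
for path in candidates:
    print(path)
```

On this toy graph the enumeration yields four candidates, one for each combination of including or skipping exons 2 and 6. The number of paths can grow quickly with the number of alternative events, which is why a filtering step follows.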
9
Gene representation
[Figure: transcripts Tr1, Tr2 and Tr3 built from exons e1-e6; the gene is partitioned into pseudo-exons pse1-pse7, each with a start (Spse) and end (Epse) point]
• Pseudo-exons - regions of a gene between
consecutive transcriptional or splicing events
• Gene - set of non-overlapping pseudo-exons
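The pseudo-exon definition above can be illustrated with a short sketch. It assumes exons are given as half-open (start, end) intervals; the function name and the toy coordinates are hypothetical.

```python
def pseudo_exons(transcripts):
    """Split a gene into pseudo-exons at every exon start or end.

    transcripts: list of transcripts, each a list of (start, end) exon
    intervals, half-open [start, end).
    """
    exons = [exon for tr in transcripts for exon in tr]
    boundaries = sorted({pos for exon in exons for pos in exon})
    result = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Keep a segment only if some exon covers it; otherwise it is intronic
        if any(s <= left and right <= e for s, e in exons):
            result.append((left, right))
    return result


# Three hypothetical transcripts of one gene
tr1 = [(0, 100), (200, 300)]
tr2 = [(0, 100), (250, 300)]  # second exon starts later
tr3 = [(50, 100), (200, 300)]  # first exon starts later
print(pseudo_exons([tr1, tr2, tr3]))
```

Each returned interval lies between two consecutive boundary events, and the intervals do not overlap, matching the two bullets above.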
10
Splice Graph
[Figure: splice graph whose nodes are pseudo-exons 1-9 laid out along the genome, running from TSS to TES]
11
How to select?
• Select the smallest set of candidate transcripts
that yields a good statistical fit between
– the fragment length distribution empirically determined
during library preparation
– fragment lengths implied by mapped read pairs
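[Figure: the same read pair implies a fragment length of 500 on a transcript with exons 1, 2, 3 (length 200 each) and 300 on a transcript with exons 1, 3 (length 200 each); fragment length distribution: mean 500, std. dev. 50]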
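The comparison between implied fragment lengths and the fragment length distribution can be worked through numerically. The sketch below uses the exon lengths, mean and standard deviation from the slide; the 50-base offsets of the read pair from the transcript ends are an assumption chosen to reproduce the lengths 500 and 300, and the function names are hypothetical.

```python
def implied_fragment_length(exon_lengths, left_offset, right_offset):
    """Fragment length implied by a read pair on one candidate transcript.

    exon_lengths: lengths of the transcript's exons, in order.
    left_offset: fragment start, measured from the transcript start.
    right_offset: distance from the fragment end to the transcript end.
    """
    return sum(exon_lengths) - left_offset - right_offset


def deviation_in_sd(length, mean, sd):
    """Distance of a fragment length from the mean, in standard deviations."""
    return abs(length - mean) / sd


MEAN, SD = 500, 50

# Same read pair, two candidate transcripts (offsets are assumed values)
with_exon2 = implied_fragment_length([200, 200, 200], 50, 50)
without_exon2 = implied_fragment_length([200, 200], 50, 50)

print(with_exon2, deviation_in_sd(with_exon2, MEAN, SD))
print(without_exon2, deviation_in_sd(without_exon2, MEAN, SD))
```

Under these assumptions the transcript containing exon 2 implies a fragment at the mean of the distribution, while the transcript skipping exon 2 implies one four standard deviations away, so the first is the better statistical fit for this read pair.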
12
Simplified IP Formulation
• Objective
\[ \min \sum_{t \in T} y(t) \]
• Constraints
\[ (1) \quad \sum_{t \in T(p)} y(t) \ge x(p), \quad \forall p \]
(for each pe read at least one transcript is selected)
\[ (2) \quad \sum_{p} x(p) \ge N_{reads} \]
T(p) - set of candidate transcripts on which paired-end read p can be mapped
y(t) - 1 if a candidate transcript t is selected, 0 otherwise
x(p) - 1 if the pe read p is selected to be mapped
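For intuition, the simplified formulation can be solved on a toy instance by exhaustive search. This sketch swaps in brute-force enumeration for an integer programming solver, so it is only practical for a handful of candidates; the function name, transcript ids and read-to-transcript compatibility sets are hypothetical.

```python
from itertools import combinations


def min_transcript_set(candidates, compatible, n_reads):
    """Smallest transcript set under which at least n_reads read pairs map.

    candidates: list of transcript ids (the set T).
    compatible: dict mapping read id p to the set of transcript ids T(p).
    Returns a sorted list of selected transcripts, or None if infeasible.
    """
    for size in range(len(candidates) + 1):
        for chosen in combinations(candidates, size):
            selected = set(chosen)
            # x(p) can be 1 only if some transcript in T(p) has y(t) = 1
            covered = sum(1 for ts in compatible.values() if ts & selected)
            if covered >= n_reads:
                return sorted(selected)
    return None


candidates = ["t1", "t2", "t3", "t4"]
compatible = {
    "p1": {"t1"},
    "p2": {"t1", "t2"},
    "p3": {"t3"},
    "p4": {"t3", "t4"},
    "p5": {"t2"},
}
print(min_transcript_set(candidates, compatible, n_reads=4))
```

Because sets are tried in order of increasing size, the first feasible set found minimizes the objective. Ties between equally small sets are broken by enumeration order, which a real solver would not necessarily reproduce.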
13
IP Formulation
• Objective
\[ \min \sum_{t \in T} y(t) \]
• Constraints
– for each pe read from every category of std. dev., at least one transcript is selected
– the number of pe reads mapped within each std. dev. category is restricted
– each pe read is mapped with no more than one category of std. dev.
– every splice junction must be covered
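The standard deviation categories used by these constraints can be sketched as a grouping step. The exact category scheme is not given in the text, so the sketch assumes category k means the implied fragment length lies within k standard deviations of the mean (and not within k-1), with pairings beyond three standard deviations discarded. The function names and example lengths are hypothetical.

```python
import math


def sd_category(length, mean, sd, max_category=3):
    """Smallest k >= 1 with |length - mean| <= k * sd, or None if k > max_category."""
    k = max(1, math.ceil(abs(length - mean) / sd))
    return k if k <= max_category else None


def transcripts_by_category(implied_lengths, mean, sd):
    """Group the candidate transcripts of one read pair by std. dev. category.

    implied_lengths: dict mapping transcript id to the fragment length
    that the read pair implies on that transcript.
    """
    groups = {}
    for transcript, length in implied_lengths.items():
        k = sd_category(length, mean, sd)
        if k is not None:
            groups.setdefault(k, []).append(transcript)
    return groups


# One read pair, three candidate transcripts (assumed implied lengths)
implied = {"t1": 500, "t2": 430, "t3": 300}
print(transcripts_by_category(implied, mean=500, sd=50))
```

In this example t1 falls in category 1, t2 in category 2, and t3 is dropped because it lies four standard deviations from the mean.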
14
Comparison on Simulated Data
\[ Sens = \frac{TP}{TP + FN}, \quad PPV = \frac{TP}{TP + FP}, \quad FScore = 2 \cdot \frac{PPV \cdot Sens}{PPV + Sens} \]
100x coverage, 2x100bp pe reads, 500 mean fragment length, 10% sd
15
Influence of Sequencing Parameters
\[ Sens = \frac{TP}{TP + FN}, \quad PPV = \frac{TP}{TP + FP}, \quad FScore = 2 \cdot \frac{PPV \cdot Sens}{PPV + Sens} \]
TRIP-L : individual fragment lengths estimates
100x coverage, 2x100bp pe reads, 500 mean fragment length, 10%-100% sd
16
Results on Real RNA-Seq Data
• CD1 mouse retina RNA samples
• A specific gene with 33 annotated
transcripts in Ensembl
• TRIP : 5 out of 10 transcripts, confirmed using
qPCR.
• Cufflinks : 3 out of 10 transcripts
6906 alignments for 22346 read pairs with read length of 68
17
Conclusions
• We introduced a novel method for
transcriptome reconstruction from paired-end
RNA-Seq reads.
• Our method :
– exploits the distribution of fragment lengths
– can incorporate additional experimental data:
• TSS/TES (TRIP with TSS/TES)
• individual fragment lengths estimates (TRIP-L)
18