transcripts

RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes

Adrian Caciula

Department of Computer Science

Georgia State University

Joint work with

Serghei Mangul (GSU)

Ion Mandoiu (UCONN)

Alex Zelikovsky (GSU)

CAME 2011, Atlanta, GA

Genome-Guided RNA-Seq Protocol

Make cDNA & shatter into fragments

Sequence fragment ends

Map reads

A

A

D

A

Isoform Discovery (ID)

C B

C

E

B C D E

Isoform Expression (IE) Gene Expression (GE)


Using Partial Annotation

• Existing tools for genome-guided RNA-seq protocol

— Cufflinks [Trapnell et al. 2010] reports minimal set of transcripts

— Scripture [Guttman et al. 2010] reports all possible transcripts

• Besides genomes, partial annotation exists for many species

— Almost all human genes/exons and majority of transcripts are already known (genomes libraries - UCSC)

• How can one incorporate it into genome-guided RNA-seq protocol?

— Cufflinks – RABT [Roberts et al. 2011] – added fake reads uniformly distributed among transcripts

• This talk:

— Subtract reads (exon counts) explained by known transcripts

— Use Virtual String EM (VSEM) [Mangul et al. 2011]

— Reconstruct transcripts containing unexplained exons


Outline

• EM for Isoform Expression Estimation

• V irtual T ranscript EM Algorithm

• DRUT: D etection and R econstruction of U nannotated

T ranscripts

• Experimental Results

• Conclusions


Max Likelihood Model

— LEFT : transcripts

> unknown frequencies

— RIGHT : reads

> observed frequencies

— EDGES : weight ~ probability of the read transcripts

T1 to be emitted by the transcript

> weights are calculated based on the mapping of the reads to the transcripts

T2

T3 reads

R1

R2

R3

R4

• Given: annotations (transcripts) and frequencies of the reads

• Find: ML estimate of transcript frequencies

• Input data of EM  a panel: bipartite graph .

Generic EM algorithm

• Initialization: uniform transcript frequencies f(j) ’s

• E-step : Compute the expected number n(j) of reads sampled from transcript j

— assuming current transcript frequencies f(j)

• M-step : For each transcript j , set f(j) = portion of reads emitted by transcript j among all reads in the sample

ML estimates for f(j) = n(j)/ ( n(1) + . . . + n(N) )


ML Model Quality

• How well ML model explain the reads.

— deviation between expected (e j

) and observed read frequency (o j

):

D





| o

 j j e j

|

| R |

|R| - number of reads

— expected read frequency: h si,j

– weighted match based on mapping read r j e j

 i : h i ,

  j



0 l h ti , h ti , j l f j

ML f i

ML

- maximum-likelihood frequency of transcript t i

• If the deviation is high we expect that some transcripts are missing from the panel.


Outline


• Virtual Transcript EM Algorithm

• DRUT: Detection and Reconstruction of

Unannotated Transcripts


• Conclusions


V irtual T ranscript E xpectation M aximization ( VTEM)

• VTEM is based on a modification of Virtual String

Expectation Maximization (VSEM) Algorithm [Mangul et al.

2011]).

— the difference is that we consider in the panel exons instead of reads

— Calculate observed exon counts based on read mapping transcripts

T1

> each read contribute to count of either one exon or two exons

(depending if it is a unspliced spliced read or spliced read ) reads

R1 exon counts reads

R1 exons

1

E1

R2 R2 transcripts

T1 E2

3

R3 R3

T2 T2 E3

3

R4 R4

Input: Complete vs Partial Annotations

Complete Annotations Partial Annotations

O O exons

E1 exons

E1 transcripts

T1

0.25

transcripts

T1

0.25

E2 0.25

E2

0.25

T2

T2

0.25

E3 0.25

E3

T3 0.25

E4

0.25

E4

Observed exon frequencies

Transcript T3 is missing from annotations


Complete Annotations

Adding Virtual Transcript

Partial Annotations

O exons

E1 exons

E1 transcripts

T1

0.25

transcripts

T1

E2 0.25

E2

T2

T2

0.25

E3

E3

VT

T3 0.25

E4

E4

VT

O

0.25

0.25

0.25

0.25

Virtual transcript, with weighted probability of VT to emit exon j equals 0 (i.e., dashed edges)


11


ML

.25

transcripts

T1 exons

E1

E2

.5

T2

E3

.25

T3

E4

0

VT

After 1

st

EM Run

Partial Annotations

O E exons

E1

.25

.25

ML

.34

transcripts

T1

E2

.25

.25

.66

T2

.25

.25

E3

0

VT

.25

.25

E4

O E

.25

.32

.25

.32

.25

.16

.25

.16

ML-estimated transcript frequencies

Expected exon frequencies


Updating Weights From Virtual Transcript

Complete Annotations Partial Annotations

O E O

ML

.25

transcripts

T1 exons

E1

E2

.5

T2

E3

.25

T3

E4

0

VT

.25

.25

.25

.25

.25

.25

.25

.25

ML

.34

transcripts

T1

.66

0

T2

VT

O > E



Increase VT weights exons

E1

E2

E3

E4

E

.25

.32

.25

.32

.25

.16

.25

.16

Observed = Expected

Nothing to update

O < E



Decrease VT weights (to 0)


After 2

nd

EM Run

Partial Annotations Complete Annotations

ML

.25

transcripts

T1 exons

E1

E2

.5

.25

0

O E

.25

.25

.25

.25

T2

E3

.25

.25

T3 .25

.25

E4

VT

VT frequency stays 0

ML

.32

transcripts

T1

.65

.03

T2

VT exons

E1

E2

E3

E4

O E

.25

.3

.25

.3

.25

.15

.25

.15

VT frequency increases!

Deviation of expected from observed decreases!


After the Last EM Run


ML

.25

.50

.25

0

Partial Annotations reads

O E reads O E transcripts

T1

E1 .25

.25

ML

.20

transcripts

T1

E1

.25

.25

E2 .25

.25

E2

.25

.25

.60

T2

T2

E3

.25

.25

E3 .25

.25

.20

T3 .25

.25

E4

VT

VT frequency is 0

No false positives

VT

E4

.25

Overexpressed exons

VT frequency (.2) ≈ T3 frequency (.25)

VT’s exons (E3,E4)= T3’s exons (E3,E4)


.25

V irtual T ranscript E xpectation M aximization ( VTEM)

(Partially) Annotated

Genome

+ Virtual Transcript with 0-weights in virtual transcript

EM

ML estimates of transcript frequencies

EM

Update weights of reads in virtual transcript

Output overexpressed exons

(expressed by virtual transcripts)

NO Virtual

Transcript frequency change> ε?

YES

Compute expected exons frequencies

• Overexpressed exons belongs to unknown transcripts


Outline






• Conclusions


D etection and R econstruction of U nannotated T ranscripts

DRUT a) Map reads to annotated transcripts (using Bowtie)

Spliced reads

Unspliced reads

Annotated transcript b) VTEM: Identify overexpressed exons (possibly from unannotated transcripts)

Overexpressed exons c) Assemble Transcripts (e.g., Cufflinks) using reads from “overexpressed” exons and unmapped reads

Novel transcript d) Output: annotated transcripts + novel transcripts


Outline






• Conclusions


Simulation Setup

• human genome data (UCSC hg18)

— UCSC database - 66, 803 isoforms

— 19, 372 genes.

— Single error-free reads: 60M of length 100bp

— for partially annotated genome -> remove from every gene exactly one isoform


Distribution of isoforms length and gene cluster sizes in UCSC dataset


Comparison Between Methods


Conclusions

• We proposed DRUT a novel annotation-guided method for transcriptome discovery and reconstruction in partially annotated genomes.

DRUT overperforms existing genome-guided transcriptome assemblers (i.e., Cufflinks)

DRUT shows similar or better performance with existing annotation-guided assemblers (Cufflinks - RABT).


Thanks!


transcripts

RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes

Genome-Guided RNA-Seq Protocol

Using Partial Annotation

Outline

Max Likelihood Model

Generic EM algorithm

ML Model Quality

Outline

Input: Complete vs Partial Annotations

Adding Virtual Transcript

After 1

EM Run

Updating Weights From Virtual Transcript

After 2

EM Run

After the Last EM Run

Outline

Outline

Simulation Setup

Distribution of isoforms length and gene cluster sizes in UCSC dataset

Conclusions

Thanks!

Related documents

Products

Support

transcripts

RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes

Genome-Guided RNA-Seq Protocol

Using Partial Annotation

Outline

Max Likelihood Model

Generic EM algorithm

ML Model Quality

Outline

Input: Complete vs Partial Annotations

Adding Virtual Transcript

After 1

EM Run

Updating Weights From Virtual Transcript

After 2

EM Run

After the Last EM Run

Outline

Outline

Simulation Setup

Distribution of isoforms length and gene cluster sizes in UCSC dataset

Conclusions

Thanks!

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib