RNA Sequence Assembly
WEI Xueliang
Overview
• Sequence Assembly
• Current Method
• My Method
• RNA Assembly
• To Do
Sequence Assembly
• Goal: obtain the DNA/RNA sequence.
• A sequencing machine cannot read a whole genome in one go; it reads small pieces of between 20 and 1000 bases.
• Definition: Read = Tag = Fragment
De novo sequence assembly
Calculating the overlaps between Tags takes a huge amount of time.
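To see why overlap computation is expensive, here is a minimal sketch of a naive approach: every ordered pair of Tags is compared, so the work grows quadratically with the number of Tags. The function names, the example Tags, and the `min_len` threshold are assumptions made for illustration, not part of the original slides.

```python
from itertools import permutations

def longest_overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads: n * (n - 1) comparisons in total."""
    result = {}
    for a, b in permutations(reads, 2):
        n = longest_overlap(a, b, min_len)
        if n:
            result[(a, b)] = n
    return result

# Hypothetical Tags of length 6.
reads = ["AAGACT", "GACTCC", "CTCCGA"]
overlaps = all_overlaps(reads)
```

With millions of Tags, the number of pairs alone makes this approach impractical, which motivates a k-mer based graph.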
DE BRUIJN GRAPH
K-Mer: a length-k substring of the Tag.
• Each node has at most 4 outgoing edges.
• Hash the node as a base-4 number:
“CTG” => (132)_4 = (30)_10
“CTG” => “TGG”
(132)_4 shifted left = (1320)_4
(1320)_4 modulo (1000)_4 = (320)_4
(320)_4 + (2)_4 ‘G’ = (322)_4
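The hashing step can be sketched as a rolling base-4 code. This is only an illustration: the digits C=1, G=2, T=3 follow from “CTG” = (132)_4 on the slide, while A=0 and all function names are assumptions.

```python
# Base-4 digit for each base; A=0 is an assumption, the rest follow "CTG" = (132)_4.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    """Interpret a k-mer as a base-4 number."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value

def roll(value, next_base, k):
    """Hash of the next k-mer: shift left, drop the leading digit, append the new base."""
    return (value * 4) % (4 ** k) + CODE[next_base]

def to_base4(value, k):
    """Render a hash as a k-digit base-4 string, for comparison with the slide."""
    digits = []
    for _ in range(k):
        digits.append(str(value % 4))
        value //= 4
    return "".join(reversed(digits))

ctg = encode("CTG")          # (132)_4
tgg = roll(ctg, "G", k=3)    # hash of "TGG", computed without re-reading the k-mer
```

Each step to a neighbouring node costs a constant number of arithmetic operations, whatever the value of k.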
DE BRUIJN GRAPH (CONT’)
If there are repeats, like “GACT”:
• A 3-Mer De Bruijn graph cannot tell which way is the correct way; a 6-Mer graph can recover the correct sequence.
• Larger K, better result.
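The effect of K on repeats can be sketched with a tiny graph builder. The sequence below is hypothetical (it simply contains “GACT” three times), and the functions are illustrative assumptions rather than the presenter's code. Nodes are k-mers, and an edge joins two k-mers that appear consecutively in a read.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

def branching_nodes(graph):
    """Nodes with more than one successor, i.e. places where the path is ambiguous."""
    return sorted(node for node, nxt in graph.items() if len(nxt) > 1)

# Hypothetical sequence in which the repeat "GACT" occurs three times.
sequence = "AAGACTCCGACTGGGACTTT"

branches_k3 = branching_nodes(build_graph([sequence], 3))
branches_k6 = branching_nodes(build_graph([sequence], 6))
```

With K = 3 the node “ACT” has three possible successors, whereas with K = 6 every node has a single successor and the path is unambiguous.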
De novo sequence assembly
Suppose K = Length of Tag (20-Mer):
TGACGTAGCTATGTATTTTG
 GACGTAGCTATGTATTTTGT (no shared 20-Mer)
Coverage is not enough to support a large K.
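The two Tags above can be checked directly. The following sketch (function names are assumptions) collects the k-mers of each Tag and intersects them, showing that the Tags share a 19-mer but no 20-mer.

```python
def kmers(read, k):
    """Set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def shared_kmers(a, b, k):
    """k-mers that occur in both reads."""
    return kmers(a, k) & kmers(b, k)

tag_a = "TGACGTAGCTATGTATTTTG"
tag_b = "GACGTAGCTATGTATTTTGT"

shared_20 = shared_kmers(tag_a, tag_b, 20)
shared_19 = shared_kmers(tag_a, tag_b, 19)
```

When K equals the Tag length, two Tags only share a k-mer if they are identical, so linking neighbouring positions would need a Tag starting at nearly every base.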
MY METHOD.
Tag length = 6, K = 3
• When we have AAGACT?, try all the ways:
AAGACTC
AAGACTT
AAGACTG
• Check the Tags: AGACTC is found.
• The correct way should be AAGACTC.
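The branch check above can be sketched as follows. This is a minimal illustration of the idea, not the presenter's implementation; `resolve_branch`, its parameters, and the Tag set are assumptions. Each candidate extension is kept only if the last Tag-length bases of the extended sequence occur as a Tag.

```python
def resolve_branch(contig, candidates, tags, tag_len):
    """Return the candidate bases whose extension is supported by an observed Tag."""
    supported = []
    for base in candidates:
        extended = contig + base
        # The last tag_len bases must appear as a Tag for this way to be valid.
        if extended[-tag_len:] in tags:
            supported.append(base)
    return supported

# Hypothetical Tag set containing the Tag from the slide.
tags = {"AAGACT", "AGACTC"}
supported = resolve_branch("AAGACT", "CTG", tags, tag_len=6)
```

If exactly one base is returned, the branch is resolved; several supported bases would mean the Tags alone cannot decide.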
RNA ASSEMBLY
ALTERNATIVE SPLICING
The graph
All cDNA sequences.
RNA ASSEMBLY’S PROBLEM
• Merge?
• Index the sequence.
RNA ASSEMBLY’S PROBLEM(CONT’)

Solution?
RNA ASSEMBLY’S PROBLEM(CONT’)

Index Tags
RNA ASSEMBLY’S PROBLEM(CONT’)

Solution?

Speed?
SINGLE TAG’S LIMITATION
• |Yellow Sequence| >= Length of Tag
• Tag length is 25-100 bp.
• A single Tag is not enough!
DATASET - PAIRED END TAGS
• Fragment length is usually > 1 kb.
• Some RNA sequences are shorter than 1 kb.
TO DO
• Handle large data-sets (10G).
• Improve accuracy.
• Use PET data.
Thanks!!