RNA Sequence Assembly
WEI Xueliang
Overview
• Sequence Assembly
• Current Method
• My Method
• RNA Assembly
• To Do
Sequence Assembly
• Goal: obtain the DNA/RNA sequence.
• Sequencing machines cannot read a whole genome in one
go; they read small pieces of between 20 and
1000 bases.
• Definition: Read = Tag = Fragment
De novo sequence assembly
Calculating the overlaps between reads takes a huge amount of time.
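To illustrate why overlap computation is costly, here is a minimal sketch of a naive approach that compares every ordered pair of reads. The function names and the `min_len` threshold are assumptions made for illustration, not taken from any particular assembler.

```python
from itertools import permutations


def overlap_length(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads: n * (n - 1) comparisons."""
    result = {}
    for a, b in permutations(reads, 2):
        n = overlap_length(a, b, min_len)
        if n > 0:
            result[(a, b)] = n
    return result


if __name__ == "__main__":
    # Hypothetical reads, chosen only to show the pairwise comparison.
    print(all_overlaps(["AAGACT", "GACTCC", "CTCCAA"]))
```

The number of pairs grows quadratically with the number of reads, which is the cost the slide refers to.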
DE BRUIJN GRAPH
K-Mer: a length-K substring of a Tag.
Each node has at most 4
outgoing edges.
Hashing the nodes (A=0, C=1, G=2, T=3):
"CTG" => (132)_4 = (30)_10
"CTG" => "TGG":
(132)_4 shifted left: (1320)_4
(1320)_4 modulo (1000)_4 = (320)_4
(320)_4 + (2)_4 'G'
= (322)_4
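The hashing step can be sketched as a base-4 rolling hash. The encoding A=0, C=1, G=2, T=3 is inferred from the "CTG" => (132)_4 example on the slide; the function names are hypothetical.

```python
# Base-4 digit for each nucleotide (assumed from the "CTG" example).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hash(kmer: str) -> int:
    """Interpret a k-mer as a base-4 number."""
    h = 0
    for base in kmer:
        h = h * 4 + CODE[base]
    return h


def roll(h: int, next_base: str, k: int) -> int:
    """Hash of the next k-mer: shift left, drop the leading digit, append."""
    return (h * 4) % (4 ** k) + CODE[next_base]


if __name__ == "__main__":
    h = kmer_hash("CTG")
    print(h, roll(h, "G", 3), kmer_hash("TGG"))
```

Because each step needs only one multiplication, one modulo and one addition, the hash of a successor node is obtained in constant time without rescanning the k-mer.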
DE BRUIJN GRAPH (CONT’)
If there are repeats,
like "GACT",
a 3-Mer De Bruijn graph cannot
tell which path is the
correct one. A 6-Mer graph can recover
the correct sequence.
Larger K, better result.
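A small sketch of the repeat problem: build a De Bruijn graph whose nodes are k-mers and list the nodes with more than one successor. The example sequence is hypothetical (it contains the repeat "GACT" twice with different continuations), and treating it as a single long read is a simplifying assumption.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. ambiguous extensions."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


if __name__ == "__main__":
    seq = "AAGACTCCGACTGG"  # hypothetical sequence with repeat "GACT"
    for k in (3, 6):
        print(k, branching_nodes(de_bruijn([seq], k)))
```

With K = 3 the node "ACT" has two successors, so the path is ambiguous; with K = 6 every node in this example has a single successor.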
De novo sequence assembly
Suppose we use K = length of Tag (20-Mer):
TGACGTAGCTATGTATTTTG
GACGTAGCTATGTATTTTGT (no 20-Mer)
Coverage is not high enough to support a large K.
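One way to read the example above: the two Tags overlap by 19 bases, yet with K = 20 each Tag yields exactly one K-Mer and the two Tags have no K-Mer in common. The sketch below checks this on the two Tags from the slide; the helper names are hypothetical.

```python
def kmers(read: str, k: int) -> set:
    """All distinct k-mers of a read (a read of length L has L - k + 1 windows)."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def shared_kmers(a: str, b: str, k: int) -> set:
    """k-mers that occur in both reads."""
    return kmers(a, k) & kmers(b, k)


if __name__ == "__main__":
    r1 = "TGACGTAGCTATGTATTTTG"
    r2 = "GACGTAGCTATGTATTTTGT"
    for k in (19, 20):
        print(k, len(kmers(r1, k)), shared_kmers(r1, r2, k))
```

The larger K is relative to the Tag length, the fewer K-Mers each Tag contributes, so more Tags (higher coverage) are needed to connect neighbouring positions.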
MY METHOD.
Tag length = 6, K = 3
When we have
AAGACT?
Try every possible extension:
AAGACTC
AAGACTT
AAGACTG
Check against the Tags:
AGACTC
So the correct extension is
AAGACTC
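The example above can be sketched as follows. This is only an interpretation of the slide, not the author's implementation: candidate bases come from the successors of the last K-Mer in a small-K graph, and a candidate is kept only if the last Tag-length window of the extended sequence is an observed Tag. The graph, the Tag set and the function name are hypothetical.

```python
def extend(contig, tags, graph, k, tag_len):
    """Return the one-base extensions of `contig` supported by a Tag.

    graph: maps a k-mer to the set of k-mers that may follow it.
    tags:  set of observed Tags, all of length `tag_len`.
    """
    supported = []
    for successor in sorted(graph.get(contig[-k:], ())):
        candidate = contig + successor[-1]
        # Keep the candidate only if its last window is a real Tag.
        if candidate[-tag_len:] in tags:
            supported.append(candidate)
    return supported


if __name__ == "__main__":
    # Hypothetical 3-mer graph: "ACT" is followed by C, T or G.
    graph = {"ACT": {"CTC", "CTT", "CTG"}}
    # Hypothetical Tag set containing the Tag from the slide.
    tags = {"AAGACT", "AGACTC", "CGACTT", "TGACTG"}
    print(extend("AAGACT", tags, graph, k=3, tag_len=6))
```

The small K keeps the graph well connected at low coverage, while the Tag check restores the context that a larger K would have provided.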
RNA ASSEMBLY
ALTERNATIVE SPLICING
The graph
All cDNA sequences.
RNA ASSEMBLY’S PROBLEM
Merge?
Index the sequence.
RNA ASSEMBLY’S PROBLEM(CONT’)
Solution?
RNA ASSEMBLY’S PROBLEM(CONT’)
Index Tags
RNA ASSEMBLY’S PROBLEM(CONT’)
Solution?
Speed?
SINGLE TAG’S LIMITATION
|Yellow Sequence| >= Length of Tag
Length of Tag: 25-100 bp.
A single Tag is not enough!
DATASET - PAIRED END TAGS
Fragment length is usually > 1k.
Some RNA sequences are shorter than 1k.
TO DO
Handle large data sets (10G).
Improve accuracy.
Use PET data.
Thanks!!