Document 13354444

Delivering the genomics revolution
GenePool & NGBug
Transcriptome
Assembly Workshop
NeSC November 2010
RNA QC
cDNA Synthesis
Normalisation
Roche 454 Titanium Sequencing
Sequence QC/QA
Assembly
Annotation

Statistics for read lengths:
Min read length: 40
Max read length: 933
Mean read length: 311.33
Standard deviation of read length: 139.31
Median read length: 344
N50 read length: 395
Statistics for numbers of reads:
Number of reads: 35585
Number of reads >=1kb: 0
Number of reads in N50: 12195
Statistics for bases in the reads:
Number of bases in all reads: 11078599
Number of bases in reads >=1kb: 0
GC Content of reads: 42.38 %
Statistics for reads >= 150 bp in length:
Number of reads >= 150 bp: 29060
Number of bases in reads >= 150 bp: 10508272
Mean read length for reads over 150 bp: 361.61
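The read statistics above can be reproduced from a plain list of read lengths. The following is a minimal sketch, not the tool that produced the slide: the function names are hypothetical, the sample standard deviation is an assumption (the slide does not say which form was used), and "reads in N50" is assumed to mean the number of longest reads needed to reach half of all bases.

```python
from statistics import mean, median, stdev


def n50(lengths):
    """Return (N50 length, number of sequences in the N50 set).

    Sequences are taken longest first until they hold at least half of
    all bases; the length of the last one taken is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("n50() needs at least one length")


def read_length_report(lengths, min_len=150):
    """Summarise read lengths; min_len mirrors the slide's 150 bp cut-off."""
    n50_length, reads_in_n50 = n50(lengths)
    kept = [n for n in lengths if n >= min_len]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "stdev": stdev(lengths),  # assumption: sample standard deviation
        "median": median(lengths),
        "n50": n50_length,
        "reads_in_n50": reads_in_n50,
        "reads": len(lengths),
        "bases": sum(lengths),
        "reads_kept": len(kept),
        "bases_kept": sum(kept),
        "mean_kept": mean(kept) if kept else None,
    }


if __name__ == "__main__":
    # Toy lengths for illustration only, not data from the workshop.
    for key, value in read_length_report([40, 100, 150, 300, 410]).items():
        print(key, value)
```

As a consistency check on the slide itself, 11078599 bases over 35585 reads gives a mean of 311.33, and 10508272 bases over 29060 reads gives 361.61, matching the reported means.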
Optimising Transcriptome Assembly
the truth is out there....
Optimising transcriptomics
ribominus versus poly(A) selection
oligo(dT) versus random 6-mers or 9-mers
cap capture vs 5’ end capture vs none
gene expression is very skewed
[Figure: number of mRNAs (axis marked 1 and 10,000) plotted against expression level in mRNA per cell (axis marked 10e-2 and 10e5)]
normalisation
[Figure, panel 1: a library dominated by one transcript (many copies of aaa, a few of bbb, one ccc) becomes, after normalisation, one copy each of aaa, bbb and ccc]
[Figure, panel 2: the same skewed library plus single copies of the rare transcripts d to z becomes, after normalisation, one copy each of aaa, bbb, ccc and d to z]
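Why normalisation helps can be put in numbers: for a fixed number of reads, a flatter library is expected to reveal more distinct transcripts. The sketch below uses the standard expectation for sampling with replacement, sum over transcripts of 1 - (1 - p)^n. The copy numbers are hypothetical values that only loosely mirror the cartoon, and the function name is illustrative.

```python
def expected_distinct(counts, n_reads):
    """Expected number of distinct transcripts seen in n_reads random reads.

    counts maps transcript name to copy number in the library; reads are
    assumed to be drawn independently and in proportion to copy number.
    """
    total = sum(counts.values())
    return sum(1 - (1 - c / total) ** n_reads for c in counts.values())


# Hypothetical skewed library: one abundant transcript, many rare ones.
skewed = {"aaa": 19, "bbb": 3, "ccc": 1}
skewed.update({chr(code): 1 for code in range(ord("d"), ord("z") + 1)})

# Idealised normalised library: every transcript at equal abundance.
normalised = {name: 1 for name in skewed}

if __name__ == "__main__":
    for n_reads in (10, 26, 100):
        print(
            n_reads,
            round(expected_distinct(skewed, n_reads), 2),
            round(expected_distinct(normalised, n_reads), 2),
        )
```

Real normalisation is never this complete, so the equal-abundance library is an upper bound on the benefit rather than a prediction.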
subtraction with driver
[Figure: a pool of aba sequences with occasional apa variants, shown before and after treatment; labelled steps: normalisation, ds nuclease]
Optimising Transcriptome Assembly
if the genome is known, the transcriptome can be estimated by mapping transcriptome reads to this reference
if the genome is well annotated, the transcriptome is ‘known’
Assembler ‘philosophy’: reads
every read is sacred
versus
reads can be fragmented
Assembler ‘philosophy’: graphs
overlap - layout - consensus (OLC)
versus
De Bruijn graph
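The two graph philosophies differ in what they compare. The toy sketch below is illustrative only and is not how any of the assemblers named in this talk is implemented: `overlap` is the pairwise suffix-prefix test at the heart of the overlap step in OLC, while `de_bruijn` breaks reads into k-mers and links (k-1)-mers, so reads are never compared with each other directly. Both function names and the example reads are hypothetical.

```python
from collections import defaultdict


def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b.

    Only exact overlaps of at least min_len are reported; otherwise 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


if __name__ == "__main__":
    reads = ["ATGGCG", "GGCGTT"]  # toy reads sharing the sequence GGCG
    print(overlap(reads[0], reads[1], min_len=3))
    for node, successors in sorted(de_bruijn(reads, k=4).items()):
        print(node, "->", sorted(successors))
```

In the de Bruijn form the shared sequence appears once in the graph however many reads cover it, which is one sense in which reads are "fragmented" rather than kept whole.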
Assembler ‘philosophy’: preclustering
precluster before assembly
versus
all-against-all assembly
Assembler ‘philosophy’
programme     reads      graphs      preclusters
TiCL          sacred     OLC         yes
CLOBB         sacred     OLC         yes
MIRA          sacred     OLC         no
CAP3          sacred     OLC         no
Newbler       fragment   OLC         no
Velvet        fragment   de Bruijn   no
ABySS         fragment   de Bruijn   no
SOAPdenovo    fragment   de Bruijn   no
CLC NGCell    fragment   de Bruijn   no
Optimising de novo Transcriptome Assembly
or How do we know when we have arrived?
Assembly
Assembly Optimality Criteria
Read use: use most or all reads
Number of contigs: should approach transcriptome size (15k-35k)
Span of contigs: should approach transcriptome size (30-50 Mb)
# unique bases: more may be better
N50 of contigs: should approach transcriptome N50 (~1.3 kb)
# contigs >1 kb: more may be better
comparison to reference datasets (more is better):
previously sequenced ESTs
related reference transcriptome/proteome
KOGs/CEGMA data
comparison to other assemblies:
should have the most unique bases
[Figure: cumulative contig length (0.0e+00 to 2.0e+07) against contigs ranked by size (0 to 35000) for the CAP3, CLC, MIRA, Newbler 2.3, Newbler 2.5 and SeqMan assemblies]
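Several of the optimality criteria, and the cumulative curve in the figure, follow directly from a list of contig lengths. The sketch below is a hypothetical helper rather than the script used for the talk. It covers number of contigs, span, N50, contigs over 1 kb and the cumulative length series; counting unique bases needs alignments between contigs and is left out.

```python
from itertools import accumulate


def contig_summary(lengths):
    """Summarise an assembly from its contig lengths (in bases)."""
    ranked = sorted(lengths, reverse=True)   # contigs ranked by size
    cumulative = list(accumulate(ranked))    # y-values of the cumulative curve
    span = cumulative[-1] if cumulative else 0
    # N50: length of the first ranked contig at which the curve
    # reaches half of the total span.
    n50 = next(
        (length for length, total in zip(ranked, cumulative) if 2 * total >= span),
        0,
    )
    return {
        "contigs": len(ranked),
        "span": span,
        "n50": n50,
        "over_1kb": sum(1 for length in ranked if length > 1000),
        "cumulative": cumulative,
    }


if __name__ == "__main__":
    # Toy contig lengths for illustration only.
    summary = contig_summary([500, 1500, 1000, 3000])
    for key, value in summary.items():
        print(key, value)
```

Plotting `cumulative` against rank for each assembler on shared axes gives a curve of the kind shown in the figure: a curve that climbs higher holds more bases, and one that climbs faster holds them in fewer, longer contigs.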
Assembly
Improving the credibility of assemblies
by merging pairs of initial assemblies
[Figure: Venn diagram of read use by two assemblers, A and B, with these regions]
reads NOT used by either assembler
reads used only by assembler A
reads used only by assembler B
reads used by both assemblers BUT contigs not co-assembled
reads used by either assembler contributing to co-assembled contigs
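The read categories in the diagram can be expressed with set operations. This is a minimal sketch under stated assumptions, not the merging method used in the talk: it assumes the IDs of reads placed by each assembler are available as sets, and that a separate merging step (not shown) has already produced the set of reads sitting in co-assembled contigs. All names are hypothetical.

```python
def classify_reads(all_reads, used_a, used_b, coassembled):
    """Partition read IDs into the five regions of the diagram.

    all_reads   -- every read ID in the data set
    used_a      -- read IDs placed in contigs by assembler A
    used_b      -- read IDs placed in contigs by assembler B
    coassembled -- read IDs in contigs that co-assembled when the two
                   assemblies were merged (assumed to be given)
    """
    all_reads, used_a, used_b = set(all_reads), set(used_a), set(used_b)
    either = used_a | used_b
    credible = set(coassembled) & either
    return {
        "unused": all_reads - either,
        "coassembled": credible,
        "both_not_coassembled": (used_a & used_b) - credible,
        "only_a": (used_a - used_b) - credible,
        "only_b": (used_b - used_a) - credible,
    }


if __name__ == "__main__":
    # Toy read IDs for illustration only.
    regions = classify_reads(
        all_reads={"r1", "r2", "r3", "r4", "r5", "r6"},
        used_a={"r1", "r2", "r3", "r4"},
        used_b={"r1", "r2", "r5"},
        coassembled={"r1", "r3"},
    )
    for name, reads in regions.items():
        print(name, sorted(reads))
```

The five sets are disjoint and together cover every read, so their sizes sum to the total number of reads.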
Assembly
Improving the credibility of assemblies
by merging pairs of initial assemblies
Assembly 1              Assembly 2              Contigs in    Contigs in    “Credible”     Summed length
                                                Assembly 1    Assembly 2    C+C contigs    of C+C contigs
MIRA                    SeqMan                  35827         29969         18068          16293192
MIRA                    Newbler (Version 2.5)   35827         21734         15951          15866051
CLC                     Newbler (Version 2.5)   22746         21734         15778          15825663
Newbler (Version 2.5)   SeqMan                  21734         29969         15783          15701053
CAP3                    Newbler (Version 2.5)   24727         21734         14275          14830304
CAP3                    SeqMan                  24727         29969         15387          14824287
CLC                     SeqMan                  22746         29969         15504          14679975
CLC                     MIRA                    22746         35827         15334          14357031
CAP3                    MIRA                    24727         35827         15688          14243534
CAP3                    CLC                     24727         22746         14149          13753398
Newbler (Version 2.3)   Newbler (Version 2.5)   12019         21734         9733           13252303
CLC                     Newbler (Version 2.3)   22746         12019         8884           12318589
MIRA                    Newbler (Version 2.3)   35827         12019         9380           11731374
Newbler (Version 2.3)   SeqMan                  12019         29969         8274           11452990
CAP3                    Newbler (Version 2.3)   24727         12019         8484           11426423