Additional tables
Table S1. Raw sequencing statistics from the Illumina platform.
Platform      Insert size (bp)   Read length (bp)   No. of reads     Raw reads (Gb)   Clean reads (Gb)
Miseq         400                PE300              36,779,442       11.00            10.24
Miseq         550                PE300              52,336,798       15.70            14.60
Hiseq2500     350                PE100              411,335,450      41.13            32.97
Hiseq2500     350                PE100              453,018,138      45.30            34.29
Hiseq2500     550                PE100              456,429,074      45.64            31.76
Hiseq2500     900                PE100              367,837,280      36.78            14.27
Hiseq2500     5,000              PE90               337,573,672      30.38            4.01
Hiseq2500     10,000             PE90               324,020,242      29.16            5.06
Total Summary -                  -                  1,878,528,688    255.09           147.20
Total raw reads represent approximately 395× coverage of the danshen genome.
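The totals and the coverage figure in Table S1 can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis: it sums the per-library yields and back-calculates the genome size implied by the stated 395× coverage. The variable names are illustrative, and the implied genome size (about 0.65 Gb) is an inference from the table, not a value reported here.

```python
# Per-library yields in Gb, copied from Table S1 (same row order).
raw_gb = [11.00, 15.70, 41.13, 45.30, 45.64, 36.78, 30.38, 29.16]
clean_gb = [10.24, 14.60, 32.97, 34.29, 31.76, 14.27, 4.01, 5.06]

# Coverage stated in the table footnote.
STATED_RAW_COVERAGE = 395

total_raw_gb = round(sum(raw_gb), 2)      # should match the "Total Summary" row
total_clean_gb = round(sum(clean_gb), 2)  # should match the "Total Summary" row

# Assumption: coverage = total bases / genome size, so the genome size
# implied by the footnote is total raw bases / stated coverage.
implied_genome_gb = total_raw_gb / STATED_RAW_COVERAGE

# Derived estimate only: clean-read coverage under the same implied genome size.
clean_coverage = total_clean_gb / implied_genome_gb

print(f"Total raw reads:     {total_raw_gb:.2f} Gb")
print(f"Total clean reads:   {total_clean_gb:.2f} Gb")
print(f"Implied genome size: {implied_genome_gb:.3f} Gb")
print(f"Clean-read coverage: {clean_coverage:.0f}x (estimate)")
```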
Table S2. Evaluation of the completeness of the danshen genome based on 248 core eukaryotic genes.
              Number of   Completeness   Number of CEGs   Orthologs   % CEGs with
              CEGs        (%)            and orthologs    per CEG     > 1 ortholog
Complete      221         89.11          443              2.00        55.66
  Group 1     57          86.36          97               1.70        43.86
  Group 2     49          87.50          86               1.76        40.82
  Group 3     53          86.89          111              2.09        60.38
  Group 4     62          95.38          149              2.40        74.19
Partial       238         95.97          531              2.23        62.61
  Group 1     62          93.94          118              1.90        48.39
  Group 2     52          92.86          102              1.96        50.00
  Group 3     60          98.36          143              2.38        71.67
  Group 4     64          98.46          168              2.62        78.12
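The summary rows of Table S2 follow from simple ratios. As a minimal sketch (the function name and layout are illustrative, not taken from the original pipeline), completeness is the number of CEGs found divided by the 248 CEGs in the set, and orthologs per CEG is the total number of CEGs and orthologs divided by the number of CEGs found:

```python
TOTAL_CEGS = 248  # size of the core eukaryotic gene set named in the caption


def summarize(n_cegs: int, n_cegs_and_orthologs: int) -> tuple[float, float]:
    """Return (completeness in %, orthologs per CEG), rounded to two decimals."""
    completeness = 100 * n_cegs / TOTAL_CEGS
    orthologs_per_ceg = n_cegs_and_orthologs / n_cegs
    return round(completeness, 2), round(orthologs_per_ceg, 2)


# Counts copied from the "Complete" and "Partial" rows of Table S2.
complete = summarize(221, 443)
partial = summarize(238, 531)

print("Complete:", complete)
print("Partial: ", partial)
```

The per-group rows cannot be checked the same way from this table alone, because the number of CEGs in each group is not listed.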
Table S3. Transposable element annotation statistics for the danshen genome.
Methods                 Repeat size (bp)   Percent of genome (%)
Tandem Repeat Finder    33,102,154         5.02
RepeatMasker            409,776            0.06
RepeatProteinMasker     83,864,539         12.71
De novo                 335,698,178        50.88
Merged data             353,513,348        53.58
Table S4. Gene annotation statistics for the danshen genome.
                          Number of     Average transcript   Average CDS   Average exons   Average exon   Average intron
Methods                   transcripts   length (bp)          length (bp)   per gene        length (bp)    length (bp)
RNA-seq                   40,700        2,606                1,163         4               288            474
EST                       3,974         1,596                467           2               188            759
De novo
  AUGUSTUS                27,753        4,316                1,181         6               207            665
  GenScan                 32,305        2,791                551           3               157            896
Homolog *
  Arabidopsis thaliana    15,915        2,520                1,247         5               227            338
  Eucalyptus grandis      17,187        2,712                1,290         6               225            354
  Sesamum indicum         28,395        2,115                1,123         4               252            348
  Solanum lycopersicum    26,846        1,966                1,056         4               245            339
  Vitis vinifera          17,565        2,604                1,213         6               213            345
  Oryza sativa            13,423        2,891                1,384         6               250            392
  Populus trichocarpa     20,423        2,332                1,185         5               232            337
  Solanum tuberosum       29,158        1,603                976           4               275            326
  Ricinus communis        19,109        2,266                1,132         5               224            334
  33 other plants         20,945        2,183                1,103         4               273            425
EVidenceModeler           34,598        4,166                1,078         5               200            597
* All 39 species in the Ensembl Plants database (release 29) were used. E. grandis, S. indicum, and R. communis were obtained from Phytozome.
Table S5. Statistics for gene family clustering analysis.
                        Total gene   No. of genes   Unclustered   No. of gene   No. of unique   Average genes
Species                 number       in families    genes         families      gene families   per family
Arabidopsis thaliana    35,395       31,704         3,691         13,517        1,184           2.35
Salvia miltiorrhiza     34,598       27,989         6,609         13,176        1,644           2.12
Eucalyptus grandis      36,368       28,929         7,439         13,717        815             2.11
Oryza sativa            42,132       29,472         12,660        13,553        2,474           2.17
Populus trichocarpa     45,787       37,739         8,048         15,334        1,150           2.46
Ricinus communis        31,221       20,783         10,438        14,595        781             1.42
Sesamum indicum         27,161       23,663         3,498         13,027        401             1.82
Solanum lycopersicum    34,730       26,421         8,309         16,487        561             1.60
Solanum tuberosum       35,119       28,885         6,234         15,540        628             1.86
Vitis vinifera          29,936       22,535         7,401         13,992        716             1.61
Additional Figures
Figure S1. Frequency counts of all PacBio reads per read length.
[Figure: x-axis "Read Lengths", in 2 kb bins from 0 ~ 2kb to > 38kb; y-axis "Frequency Count", logarithmic scale from 10^0 to 10^7.]
Figure S2. Frequency distribution of the 23-mer graph.
Figure S3. Assembly pipeline for the danshen genome combining Illumina data
and PacBio data.
Figure S4. Ortholog clustering analysis of the protein-coding genes among
Arabidopsis thaliana, Salvia miltiorrhiza, Eucalyptus grandis, Oryza sativa,
Populus trichocarpa, Ricinus communis, Sesamum indicum, Solanum
lycopersicum, Solanum tuberosum, and Vitis vinifera.
[Figure: x-axis species A. thaliana, S. miltiorrhiza, E. grandis, O. sativa, P. trichocarpa, R. communis, S. indicum, S. lycopersicum, S. tuberosum, V. vinifera; y-axis "Number of genes", 0 to 54000 in steps of 9000; legend: Single-copy orthologs, Multiple-copy orthologs, Unique paralogs, Other orthologs, Unclustered genes.]