Additional tables

Table S1. Raw sequencing statistics from the Illumina platform.

Platform   Insert size (bp)  Read length (bp)  No. of reads   Raw reads (Gb)  Clean reads (Gb)
Miseq      400               PE300             36,779,442     11.00           10.24
Miseq      550               PE300             52,336,798     15.70           14.60
Hiseq2500  350               PE100             411,335,450    41.13           32.97
Hiseq2500  350               PE100             453,018,138    45.30           34.29
Hiseq2500  550               PE100             456,429,074    45.64           31.76
Hiseq2500  900               PE100             367,837,280    36.78           14.27
Hiseq2500  5,000             PE90              337,573,672    30.38           4.01
Hiseq2500  10,000            PE90              324,020,242    29.16           5.06
Total                                          1,878,528,688  255.09          147.20

The total raw reads represent approximately 395× coverage of the danshen genome.

Table S2. Evaluation of the completeness of the danshen genome based on 248 core eukaryotic genes (CEGs).

Category   Number of CEGs  Completeness (%)  Number of CEGs and orthologs  Orthologs per CEG  % CEGs with 1 ortholog
Complete   221             89.11             443                           2.00               55.66
  Group 1  57              86.36             97                            1.70               43.86
  Group 2  49              87.50             86                            1.76               40.82
  Group 3  53              86.89             111                           2.09               60.38
  Group 4  62              95.38             149                           2.40               74.19
Partial    238             95.97             531                           2.23               62.61
  Group 1  62              93.94             118                           1.90               48.39
  Group 2  52              92.86             102                           1.96               50.00
  Group 3  60              98.36             143                           2.38               71.67
  Group 4  64              98.46             168                           2.62               78.12

Table S3. Transposable element annotation statistics for the danshen genome.

Method                Repeat size (bp)  Percent of genome (%)
Tandem Repeat Finder  33,102,154        5.02
RepeatMasker          409,776           0.06
RepeatProteinMasker   83,864,539        12.71
De novo               335,698,178       50.88
Merged data           353,513,348       53.58

Table S4. Gene annotation statistics for the danshen genome.
Method                  Number of    Average transcript  Average CDS  Average exons  Average exon  Average intron
                        transcripts  length (bp)         length (bp)  per gene       length (bp)   length (bp)
RNA-seq                 40,700       2,606               1,163        4              288           474
EST                     3,974        1,596               467          2              188           759
De novo
  AUGUSTUS              27,753       4,316               1,181        6              207           665
  GenScan               32,305       2,791               551          3              157           896
Homolog *
  Arabidopsis thaliana  15,915       2,520               1,247        5              227           338
  Eucalyptus grandis    17,187       2,712               1,290        6              225           354
  Sesamum indicum       28,395       2,115               1,123        4              252           348
  Solanum lycopersicum  26,846       1,966               1,056        4              245           339
  Vitis vinifera        17,565       2,604               1,213        6              213           345
  Oryza sativa          13,423       2,891               1,384        6              250           392
  Populus trichocarpa   20,423       2,332               1,185        5              232           337
  Solanum tuberosum     29,158       1,603               976          4              275           326
  Ricinus communis      19,109       2,266               1,132        5              224           334
  33 other plants       20,945       2,183               1,103        4              273           425
EVidenceModeler         34,598       4,166               1,078        5              200           597

* All 39 species in the Ensembl Plants database (release 29) were used. E. grandis, S. indicum, and R. communis were obtained from Phytozome.

Table S5. Statistics for gene family clustering analysis.

Species               Total gene  No. of genes  Unclustered  No. of gene  No. of unique  Average genes
                      number      in families   genes        families     gene families  per family
Arabidopsis thaliana  35,395      31,704        3,691        13,517       1,184          2.35
Salvia miltiorrhiza   34,598      27,989        6,609        13,176       1,644          2.12
Eucalyptus grandis    36,368      28,929        7,439        13,717       815            2.11
Oryza sativa          42,132      29,472        12,660       13,553       2,474          2.17
Populus trichocarpa   45,787      37,739        8,048        15,334       1,150          2.46
Ricinus communis      31,221      20,783        10,438       14,595       781            1.42
Sesamum indicum       27,161      23,663        3,498        13,027       401            1.82
Solanum lycopersicum  34,730      26,421        8,309        16,487       561            1.60
Solanum tuberosum     35,119      28,885        6,234        15,540       628            1.86
Vitis vinifera        29,936      22,535        7,401        13,992       716            1.61

Additional Figures

Figure S1. Frequency counts of all PacBio reads per read length.
[Plot not reproduced. x-axis: Read Lengths, in 2 kb bins from 0 ~ 2kb to > 38kb; y-axis: Frequency Count, with ticks from 10^0 to 10^7.]

Figure S2. Frequency distribution of the 23-mer graph.

Figure S3. Assembly pipeline for the danshen genome combining Illumina data and PacBio data.

Figure S4. Ortholog clustering analysis of the protein-coding genes among Arabidopsis thaliana, Salvia miltiorrhiza, Eucalyptus grandis, Oryza sativa, Populus trichocarpa, Ricinus communis, Sesamum indicum, Solanum lycopersicum, Solanum tuberosum, and Vitis vinifera.
[Plot not reproduced. Legend categories: Single-copy orthologs, Multiple-copy orthologs, Unique paralogs, Other orthologs, Unclustered genes; x-axis: the ten species named in the caption; y-axis: Number of genes, with ticks from 0 to 54000.]
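
The columns of Table S5, which Figure S4 visualizes, appear to be related by simple arithmetic: unclustered genes equal the total gene number minus the genes in families, and the average genes per family equals the genes in families divided by the number of gene families. The following is a minimal sketch of that consistency check, using the values transcribed from Table S5. The names `TABLE_S5`, `check_row`, and `inconsistent_species`, and the rounding tolerance of 0.005, are illustrative assumptions and not part of the original analysis.

```python
# Rows transcribed from Table S5:
# (species, total genes, genes in families, unclustered genes,
#  gene families, reported average genes per family)
TABLE_S5 = [
    ("Arabidopsis thaliana", 35395, 31704, 3691, 13517, 2.35),
    ("Salvia miltiorrhiza", 34598, 27989, 6609, 13176, 2.12),
    ("Eucalyptus grandis", 36368, 28929, 7439, 13717, 2.11),
    ("Oryza sativa", 42132, 29472, 12660, 13553, 2.17),
    ("Populus trichocarpa", 45787, 37739, 8048, 15334, 2.46),
    ("Ricinus communis", 31221, 20783, 10438, 14595, 1.42),
    ("Sesamum indicum", 27161, 23663, 3498, 13027, 1.82),
    ("Solanum lycopersicum", 34730, 26421, 8309, 16487, 1.60),
    ("Solanum tuberosum", 35119, 28885, 6234, 15540, 1.86),
    ("Vitis vinifera", 29936, 22535, 7401, 13992, 1.61),
]


def check_row(total, in_families, unclustered, families, reported_avg, tol=0.005):
    """Return True if one table row is internally consistent.

    tol is an assumed allowance for two-decimal rounding of the average.
    """
    counts_ok = (total - in_families) == unclustered
    avg_ok = abs(in_families / families - reported_avg) <= tol + 1e-9
    return counts_ok and avg_ok


def inconsistent_species(table=TABLE_S5):
    """List the species whose row fails either consistency check."""
    return [name for name, *values in table if not check_row(*values)]


if __name__ == "__main__":
    # An empty list means every row agrees with the two relations above.
    print(inconsistent_species())
```

With the values as tabulated, the check reports no inconsistent rows, which also supports the column assignment used when reading the table.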