tpj12886-sup-0025-Legends

Supporting Information Legends
Supporting Figure S1. Evidence supporting MAKER-P gene annotations in PG29.
The MAKER-derived annotation edit distance (AED) is a measure of the annotation to
supporting evidence goodness of fit and was obtained after running the MAKER-P gene
predictor on the PG29 V3 draft assembly (Campbell, et al. 2014). AED values range
from 0 (complete concordance) to 1 (lack of supporting evidence). Over three quarters of
the high-confidence PG29 V3 genes have an AED <0.2, consistent with the selection of
this gene subset based on transcript presence in the PG29 transcriptome assembly.
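As an illustration of how an AED value of this kind can be read, the sketch below uses the commonly described formulation AED = 1 - (SN + SP)/2, where SN is the fraction of evidence bases covered by the annotation and SP is the fraction of annotation bases covered by the evidence. It is a minimal sketch, not MAKER-P's own code: the function names, the half-open interval convention, and the example coordinates are assumptions made for illustration.

```python
def bases(intervals):
    """Expand half-open (start, end) exon intervals into a set of base positions.

    Hypothetical helper; the coordinate convention is an assumption.
    """
    return {pos for start, end in intervals for pos in range(start, end)}


def annotation_edit_distance(annotation_exons, evidence_exons):
    """Return AED = 1 - (SN + SP) / 2 computed at the base level.

    0.0 means the annotation and the evidence coincide exactly;
    1.0 means there is no supporting overlap at all.
    """
    annotation = bases(annotation_exons)
    evidence = bases(evidence_exons)
    if not annotation or not evidence:
        return 1.0  # nothing to compare, so treat as unsupported
    overlap = len(annotation & evidence)
    sensitivity = overlap / len(evidence)      # SN: evidence bases recovered
    specificity = overlap / len(annotation)    # SP: annotation bases supported
    return 1.0 - (sensitivity + specificity) / 2.0


# Made-up coordinates: the annotation and the evidence share half of their bases.
example_aed = annotation_edit_distance([(0, 100)], [(50, 150)])
```

Under this formulation, a gene model with an AED below 0.2 has, on average, more than 80% base-level agreement with its supporting evidence.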
Supporting Figure S2. Intron lengths derived from MAKER-P transcript models
and experimental transcripts. We used the intron size information from MAKER-P
(green line) and intron lengths derived from sequence alignments of 42,440 white spruce
cDNA clones (blue line; Rigault, et al. 2011 and GCAT v3.3,
https://web.gydle.com/smartforests/gcat) onto the draft PG29 V3 genome to plot the
cumulative distribution function (CDF). Intron lengths differed between MAKER-P (n =
124,951 introns; Mean length = 646.70 ± 1,218 bp; Median length = 278 bp; Maximum
length = 44,115 bp) and the measurements derived from cDNAs (n = 60,265 introns; Mean
length = 4,934 ± 14,311 bp; Median length = 202 bp; Maximum length = 377,332 bp).
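An empirical CDF such as the one plotted in this figure can be sketched as follows. This is an illustrative example only, not the authors' plotting code: the function name and the intron lengths are invented for the example.

```python
from bisect import bisect_right


def empirical_cdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    count = len(ordered)
    if count == 0:
        raise ValueError("at least one value is required")

    def cdf(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / count

    return cdf


# Made-up intron lengths in bp, for illustration only.
toy_intron_lengths = [100, 200, 200, 400]
toy_cdf = empirical_cdf(toy_intron_lengths)
fraction_up_to_200 = toy_cdf(200)  # three of the four introns are <= 200 bp
```

A CDF is a convenient summary for strongly right-skewed data such as the cDNA-derived introns, where a few very long introns pull the mean far above the median.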
Supporting Figure S3. NMF consensus clustering of high-confidence genes. a) From
the NMF rank survey (Gaujoux and Seoighe 2010), both cophenetic correlation
coefficient and average silhouette width suggest a four-cluster solution. b) The consensus
membership heatmap indicates that the 8 tissue samples were unambiguously grouped
into 4 categories. c) Sixty-five high-confidence P450s were subjected to hierarchical
clustering and fit into the NMF consensus groups; expression (RPKM) was plotted with the
pheatmap R package. The range of RNA-seq measured expression is wide and varies across
tissue types. d) A subset of only ten P450s is within the top 5% of genes whose expression
was discriminatory for at least one sample group.
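For readers unfamiliar with the RPKM unit used in panel c, the sketch below applies the standard definition (reads per kilobase of transcript per million mapped reads). The function name and the read counts are made up for illustration and do not come from the study.

```python
def rpkm(reads_on_gene, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    # 1e9 combines the per-kilobase (1e3) and per-million-reads (1e6) scaling
    return reads_on_gene * 1e9 / (gene_length_bp * total_mapped_reads)


# Made-up example: 1,000 reads on a 2 kbp gene in a library of 10 million mapped reads.
toy_rpkm = rpkm(1_000, 2_000, 10_000_000)
```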
Supporting Figure S4. Phylogeny of gymnosperm terpene synthases. Phylogeny of
gymnosperm terpene synthase proteins, with Physcomitrella patens ent-kaurene synthase
as the root. Phylogeny created with FastTree 2 after protein alignment with MAFFT,
visualized with FigTree. Picea glauca proteins are shown in red, and those names ending
with P are putative pseudogenes. Bootstrap values are indicated at nodes.
Supporting Figure S5. Phylogeny of the white spruce cytochrome P450 family. A
phylogeny of 307 white spruce P450s is shown with CYP51G used as the root. CYP IDs
ending in P are putative pseudogenes. The phylogeny was created with FastTree 2 after
protein alignment with MAFFT, and visualized with FigTree. Labels indicate the ten plant
P450 clans. Areas of conifer- or gymnosperm-specific expansion are colored in: CYP76AAs,
blue; CYP736s, red; CYP750s, orange; CYP720Bs, green; CYP86s, olive; and CYP716Bs,
purple.
Supporting Figure S6. Distribution of white spruce P450s across the eleven plant
P450 clans. Plant P450s have been annotated into eleven major clans based on common
ancestry, each containing one or more families (Nelson and Werck-Reichhart 2011). The
number of white spruce members found in each family is shown, with the portion of
putative pseudogenes shown in white. No white spruce P450s were identified for clan
CYP746, which to date has only been identified in Chlamydomonas reinhardtii (Nelson
and Werck-Reichhart 2011).
Supporting Figure S7. Gymnosperm- and conifer-specific cytochrome P450
subfamilies. Phylogeny of white spruce P450s with orthologues available on NCBI.
Phylogeny created with FastTree 2 after protein alignment with MAFFT, and visualized with
FigTree. CYP IDs ending in P are putative pseudogenes. Bootstrap values are indicated at
nodes.
Supporting Figure S8. Phylogeny of HMG-R proteins in plants.
Phylogeny created with FastTree 2 after protein alignment of the Picea glauca and
PlantCyc proteins with MAFFT, visualized with CLC Main Workbench. Picea glauca
protein names ending with P are putative pseudogenes. Bootstrap values are indicated at
nodes. Labels for Acrogymnospermae are shown in red, Magnoliophyta are shown in
black, and other Streptophyta are shown in blue.
Supporting Figure S9. Phylogeny of isoprenyl diphosphate synthase proteins in
plants. Phylogeny created with FastTree 2 after protein alignment of the Picea glauca
and PlantCyc proteins with MAFFT, visualized with CLC Main Workbench. Phylogeny
was rooted between the FPPS and the GPPS/GGPPS clades. Picea glauca protein names
ending with P are putative pseudogenes. Bootstrap values are indicated at nodes. Labels
for Acrogymnospermae are shown in red, Magnoliophyta are shown in black, other
Streptophyta are shown in blue, and Chlorophyta are shown in green.
Supporting Figure S10. N20 and N50 length statistics for WS77111 pre-unitig
assemblies across a range of k values. In separate experiments, white spruce genome
WS77111 sequencing data was assembled up to the k-mer graph stage (pre-unitig, k-1
overlapping k-mers) with ABySS v1.3.7 using a range of k-values (k=96, 104, 112, 116,
120 and 128). We determined the optimal parameter k for assembly (k=116) as the
value of k giving the most contiguous pre-unitig assembly as measured by the N50 length
contiguity statistic (blue line, N50=1,426 bp) and the N20 length (red line, N20=3,819
bp).
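The N50 and N20 length statistics used here can be computed as in the sketch below, which follows the usual definition: Nx is the length L such that sequences of length L or longer account for at least x% of the total assembled bases. The function name and the toy lengths are illustrative assumptions; this is not the ABySS implementation, which may also apply a minimum sequence length cutoff.

```python
def nx_length(lengths, fraction):
    """Return the Nx length for a list of sequence lengths.

    fraction is x / 100, e.g. 0.5 for N50 and 0.2 for N20.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    threshold = fraction * sum(lengths)
    running_total = 0
    # Walk from the longest sequence down until the threshold is reached.
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length
    return min(lengths)  # guards against floating-point edge cases


# Made-up sequence lengths in bp, for illustration only.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = nx_length(toy_lengths, 0.5)
toy_n20 = nx_length(toy_lengths, 0.2)
```

Because N20 is reached after fewer of the longest sequences have been summed, it is always at least as large as N50 for the same assembly, as in the values reported for k=116.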
Supporting Figure S11. Derivation of a high-confidence gene set based on detected
gene expression in eight white spruce tissue and organ samples. A reference P. glauca
transcriptome was assembled from RNA-seq reads sequenced from 8 PG29 tissues. Only
putative transcript sequences containing complete open reading frames were aligned
(BLAST expect value < 1e-20) against MAKER-P gene predictions. Choosing the
best hit for each gene model (shown above with a thick black line), we classified 16.4K
gene models as high-confidence. Expression profiles were further obtained by aligning
RNA-seq reads from each independent tissue library against those models.
Supporting Table S1. White spruce WS77111 sequence data.
Supporting Table S2. ABySS resources required at various assembly stages of building
the 20 Gbp white spruce WS77111 V1 draft genome.
Supporting Table S3. Statistics for WS77111 V1 pre-unitig assemblies across a range of
k values.
Supporting Table S4. Execution, runtime and resources required at different stages of
the DIDA read alignment framework.
Supporting Table S5. White spruce WS77111 V1 ABySS v1.3.7 assembly statistics.
Supporting Table S6. N50 length (kbp) comparisons between white spruce individuals
PG29 V3 and WS77111 V1 at various stages of the ABySS assembly.
Supporting Table S7. White spruce PG29 RNA-Seq transcriptome sequence reads.
Supporting Table S8. White spruce assembly completeness, as measured by CEGMA
analyses.
Supporting Table S9. Assembly quality control (QC) using cDNA and sequence capture
resources.
Supporting Table S10. Exact repeat content in the V3 PG29 genome assembly.
Supporting Table S11. ABySS-Bloom sequence identity calculations between various
draft genome assemblies.
Supporting Table S12. Structural variation (S.V.) in WS77111 V1 relative to PG29 V3
by PAVfinder (Chiu, et al. in preparation) analysis of whole-scaffold alignments.
Supporting Table S13. White spruce genome re-scaffolding – scaffold assembly
statistics.