tpj12679-sup-0027-Legends

Figure S1. Coverage of TAIR10 annotations in the assembled transcriptome. Transcripts are assembled using Cufflinks and Cuffmerge. Most of the coding transcripts and annotated lncRNAs are recovered. After four strict filtering steps (Figure 1), 955 predicted lncRNA transcripts are obtained, none of which overlap coding genes or known ncRNAs.

Figure S2. Structural and epigenetic features of annotated ncRNAs and coding genes. Structural and epigenetic features can help to distinguish ncRNAs from coding genes. (a) The average stability and conservation of RNA secondary structure for genomic elements. (b) Different epigenetic signatures of coding genes and known ncRNAs are shown. Averaged signal intensities in the 1.5 kb region around the TSS are shown. Randomly selected unexpressed intergenic regions are used as controls (see Supplemental Methods). TSS, transcription start site; CpG, DNA methylation on CpG islands; CHH and CHG, CHH and CHG sites where H is A, T, or C.

Figure S3. incRNA model performance. (a) incRNA performance for classifying known ncRNAs (canonical ncRNAs and annotated lncRNAs). For the ROC curve, we use bins that are lncRNAs or canonical ncRNAs as positives, and the union of the other classes as negatives. As we did previously in C. elegans (Lu et al., 2011), we test the model on four genomic elements: canonical ncRNAs, CDSs, UTRs and unexpressed intergenic regions. The model also performs very well in Arabidopsis (green line, AUC = 0.984). However, in this paper, we focus on long ncRNAs (lncRNAs). When we add lncRNAs to the canonical ncRNAs, the model's performance decreases (blue line, AUC = 0.861). The major reason is that some lncRNAs cannot be thoroughly separated from UTRs. First, the definition of our training set may cause these errors. Many annotated lncRNAs are newly assembled transcripts that are not well validated by solid experiments. They could be mis-annotated UTRs or expression noise.
Second, many known ncRNAs were also found to be derived from UTRs (Nie et al., 2012). Third, lncRNAs are very different from canonical ncRNAs. They may be polyadenylated and have no structure, making them similar to the untranslated regions (UTRs) of mRNAs. Probably because of such concerns, previous methods did not use UTRs as controls for lncRNA identification (Lv et al., 2013; Sun et al., 2013). Therefore, in this study, we prefer not to use UTRs in the negative set when training the lncRNA model. The AUC is 0.901 (red line) for classifying lncRNAs (5,593 bins in total) and canonical ncRNAs (1,629 bins in total) against CDSs (303,098 bins in total) and unexpressed intergenic regions (79,488 bins in total). Moreover, we filter the predictions with a distance criterion: a prediction should be > 500 bp away from known genes. For level 3 lncRNAs, we removed the ambiguous predictions that fail this criterion. For level 1 and level 2 lncRNAs, we kept them and annotated them as cis lncRNAs because of their potential interest. They will need further experimental validation. (b) The model is used to score bins across the whole genome; the averaged bin score is assigned to each overlapping annotated lncRNA, and performance is assessed by positive predictive value (PPV) and sensitivity. (c-f) Model performance for lncRNAs from different genomic locations using different feature sets.

Figure S4. Saturation plots of lncRNAs and coding genes. Twenty-one datasets, including poly(A)+ RNA-seq and poly(A)- RNA-seq, are used to calculate the expression ratios of level 3 lncRNA loci, coding genes and annotated lncRNAs. An aligned read count >= 5 is set as the expression cutoff. Each point at a given number of considered datasets corresponds to a different combination of datasets.

Figure S5. Comparison of feature scores among different levels of predicted lncRNAs. The normalized feature scores are shown for different levels of lncRNAs, CDSs and controls (unexpressed intergenic regions). All 25 features are plotted.

Figure S6.
Comparison of feature scores of predicted lncRNAs from different genomic locations. (a) Genomic locations of level 1 and level 2 predicted lncRNAs and annotated lncRNAs (see definitions in Supplemental methods in Additional file 1). (b) The normalized feature scores are shown for lncRNAs from different genomic locations, CDSs and controls (unexpressed intergenic regions). All 25 features are plotted.

Figure S7. Exon numbers in poly(A)+ and poly(A)- lncRNAs. The classification is the same as in Figure 3, but only level 1 predicted lncRNAs are shown.

Figure S8. Classification of level 1 poly(A)+ and poly(A)- lncRNAs. The classification is the same as in Figure 3, but only level 1 predicted lncRNAs are shown.

Figure S9. Comparison of feature scores between poly(A)+ and poly(A)- lncRNAs. The normalized feature scores are shown for poly(A)+ and poly(A)- lncRNAs, CDSs and controls (unexpressed intergenic regions). All 25 features are plotted. (a) All predicted lncRNAs are plotted. (b) Only level 1 predicted lncRNAs are plotted.

Figure S10. DNA conservation of lncRNAs in multiple plant species. LncRNAs show moderate DNA conservation across multiple plant species. Overall, closely related species tend to have higher DNA conservation scores. Distances between species are shown in Mya on the phylogenetic tree. We calculated the DNA conservation scores for each genomic bin (100 nt, Supplemental Methods). In brief, we used BLASTn to search for targets in each of the 31 plant species. We kept only the best match and recorded its bit score. If there is no match for a bin, we assign a bit score of 0. Here, we selected 16 closely related species for the evolutionary conservation comparison of lncRNAs. Each predicted lncRNA can overlap multiple bins (100 nt). Thus, we used the maximum bit score of the overlapping bins as the lncRNA's conservation score. Then, we averaged the lncRNA scores and plotted them according to the evolutionary branch.
We used the Mann-Whitney test to compare the conservation scores between lncRNAs and the controls.

Figure S11. Expression patterns of lncRNAs in different stress conditions. We analyzed the differential expression patterns of lncRNAs under different stress conditions. The significance of differential expression is measured by fold change. We then clustered the fold-change scores (k-means, k = 7). Poly(A)+ lncRNAs are assessed by poly(A)+ RNA-seq data (left panels). Poly(A)- lncRNAs are assessed by poly(A)- RNA-seq data (right panel).

Figure S12. Comparison of stress specificity scores between mRNAs and lncRNAs within a fixed expression level. (a) The maximum expression levels of mRNAs and lncRNAs in either poly(A)+ RNA-seq or poly(A)- RNA-seq data are calculated, and their densities are plotted. (b) Transcripts with expression levels (log(RPKM)) between -1 and 3 are used to compare stress-specific expression, which is measured by stress specificity scores (see Methods).

Figure S13. qRT-PCR validation of differentially expressed lncRNAs. (a) and (b) qRT-PCR results for predicted differentially expressed lncRNAs in drought stress and salt stress. The expression values of lncRNAs are normalized to the expression values of the housekeeping gene ACT2. Numbers in boxes denote expression fold changes, which are measured either by RNA-seq or by qRT-PCR in drought and salt stress. A result is considered consistent when the expression change measured by qRT-PCR (at either time point) is in the same direction as in RNA-seq and the fold change is above 1.5. Most of the qRT-PCR results are consistent with the RNA-seq data (✔ symbol); several are not consistent (✖ symbol). lncRNAs from level 1: lnc-20, lnc-23, lnc-27, lnc-104, lnc-168, lnc-225, lnc-278, lnc-316, and lnc-324; lncRNAs from level 2: lnc-68, lnc-127, lnc-164, lnc-287, lnc-311, lnc-334, lnc-369, lnc-410, lnc-525, and lnc-498. (c) and (d) Validation of differentially expressed lncRNA candidates in cold and heat stress. The method is the same as for (a) and (b).
lncRNAs from level 1: lnc-61, lnc-67, lnc-93, lnc-100, lnc-141, lnc-156, lnc-162, lnc-190, lnc-220, lnc-225, lnc-234, lnc-259, lnc-278, lnc-283, lnc-289, lnc-334; lncRNAs from level 2: lnc-102, lnc-168, lnc-211, lnc-293, lnc-326, lnc-420, lnc-508, lnc-606.

Figure S14. Validation of pif mutants and the high-light responsiveness of PIF genes. The pif4 and pifq mutants are validated as null alleles, in which the mutated genes cannot be detected, and the PIF genes are significantly down-regulated by high light. (a) PIF4 amplification in Col-0 and the pif4 mutant under normal and high-light conditions. (b) PIF1, PIF3, PIF5 and PIF4 amplification in Col-0 and the pifq mutant under normal and high-light conditions. +RT, the normal reverse transcription procedure; -RT, without reverse transcriptase. ACT2 is used as a positive control.

Figure S15. Stress-related conserved sequence motifs. The conserved motifs are searched for by MEME in the 25 groups of lncRNAs. Sixteen groups have at least one enriched conserved motif (E-value 0.05).

Figure S16. Verification of enriched motifs. (a) We use MAST to search for sequence motifs (detected by MEME) in lncRNAs, mRNA regions and simulated background sequences (see Supplemental Methods), using the default cutoff (p-value < 0.0001) for motif matches. We calculated the enrichments by comparing the number of matched motifs in lncRNAs with that in the background sequences, and performed Fisher's exact test to assess whether the enrichments are significant. (b) For structural motifs, we use the RNApromo program to search in the same way. We summed the log-likelihood scores of matched motifs (log-likelihood score > 0) and calculated the enrichment, and we performed the Mann-Whitney test to assess whether the enrichments are significant.

Figure S17. Characterization of poly(A)- mRNAs. (a) The same clustering analysis is performed for poly(A)- mRNAs as in Figure 4a.
(b) The maximum expression levels of mRNAs and lncRNAs in either poly(A)+ RNA-seq or poly(A)- RNA-seq data are calculated, and their densities are plotted. (c) Transcripts with expression levels (log(RPKM)) between -1 and 3 are used to compare stress-specific expression, which is measured by stress specificity scores (see Methods). We used the same method as in Figure S12. (d) GO enrichment analysis is performed using all the differentially expressed poly(A)- mRNAs. Several GO terms under the parent annotation "macromolecule metabolic process" (GO:0043170, FDR = 7.0E-86) are shown.

Table S1. Poly(A)- and RiboMinus RNA-seq data statistical summary.

Table S2. Features and public data used in this study.

Table S3. Overlap between lincRNAs predicted in this study and by Liu et al., 2012.

Table S4. Statistical test of differentially expressed coding genes or lncRNAs under stress.

Table S5. RT-PCR and qRT-PCR primer sequences used in this study.

Supporting experimental procedures. Supplemental methods.

Appendix S1. Full matrix of lncRNAs in Arabidopsis, including annotated lncRNAs and three levels of predicted lncRNAs.

Appendix S2. List of differentially expressed lncRNAs under stress conditions, including annotated lncRNAs and level 1 and level 2 predicted lncRNAs.

Appendix S3. Functional annotation of co-expressed lncRNAs, including annotated lncRNAs and level 1 and level 2 predicted lncRNAs.
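
Illustrative note on the Figure S10 scoring. The per-lncRNA conservation score described in the Figure S10 legend (maximum bit score over the 100 nt bins that a lncRNA overlaps, with 0 for bins without a BLASTn match, then averaged over lncRNAs) can be sketched as follows. This is a minimal sketch and not the authors' code: the function names, the dictionary layout of bin scores, and the 0-based half-open coordinates are assumptions made for illustration, and the toy numbers are invented.

```python
BIN_SIZE = 100  # nt, the bin width stated in the Figure S10 legend


def lncrna_conservation(start, end, bin_scores, bin_size=BIN_SIZE):
    """Return the maximum bit score over the bins overlapped by [start, end).

    bin_scores maps a bin index to the bit score of its best BLASTn hit.
    Bins with no hit are absent from the mapping and count as 0
    (hypothetical data layout, assumed for this sketch).
    """
    first_bin = start // bin_size
    last_bin = (end - 1) // bin_size
    return max(bin_scores.get(i, 0.0) for i in range(first_bin, last_bin + 1))


def mean_conservation(lncrnas, bin_scores):
    """Average the per-lncRNA conservation scores for one species."""
    scores = [lncrna_conservation(s, e, bin_scores) for s, e in lncrnas]
    return sum(scores) / len(scores)


# Toy data: bins 0 and 2 have hits; bins 1 and 3 have none.
toy_bin_scores = {0: 42.0, 2: 87.5}
toy_lncrnas = [(50, 250), (300, 400)]  # overlap bins 0-2 and bin 3

toy_mean = mean_conservation(toy_lncrnas, toy_bin_scores)
```

Taking the maximum rather than the mean over bins means a single well-conserved bin is enough to give a lncRNA a high score, which matches the wording of the legend.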
