tpj12679-sup-0027-Legends

Figure S1. Coverage of TAIR10 annotations in assembled transcriptome.
Transcripts are assembled using Cufflinks and Cuffmerge. Most of the coding
transcripts and annotated lncRNAs are recovered. After four strict filter steps
(Figure 1), 955 predicted lncRNA transcripts are obtained that do not overlap
with coding genes or known ncRNAs.
Figure S2. Structural and epigenetic features of annotated ncRNAs and
coding genes.
Structural and epigenetic features can help to distinguish ncRNAs
from coding genes. (a) Average stability and conservation of RNA
secondary structure for genomic elements. (b) Different epigenetic signatures
of coding genes and known ncRNAs. Averaged signal intensities in the
1.5 kb region around the TSS are shown. Randomly selected unexpressed
intergenic regions are used as controls (see Supplemental Methods). TSS,
transcription start site; CpG, DNA methylation on CpG islands; CHH and CHG,
DNA methylation at CHH and CHG sites, where H is A, T, or C.
Figure S3. incRNA model performance.
(a) incRNA performance for classifying known ncRNAs (canonical ncRNAs
and annotated lncRNAs). For the ROC curve, bins that are lncRNA or canonical
ncRNA are used as positives, and the union of the other classes as negatives.
As we did previously in C. elegans (Lu et al., 2011), we test the model on four
genomic elements: canonical ncRNAs, CDSs, UTRs and unexpressed
intergenic regions. The model also performs very well in Arabidopsis (green
line, AUC = 0.984). However, in this paper, we focus on long ncRNAs
(lncRNAs). When we add lncRNAs to the canonical ncRNAs, the model’s
performance decreases (blue line, AUC = 0.861). The major reason is that
some lncRNAs cannot be cleanly separated from UTRs. First, the
definition of our training set may cause these errors: many annotated
lncRNAs are newly assembled transcripts that are not well validated
experimentally, and they could be mis-annotated UTRs or expression noise.
Second, many known ncRNAs have also been found to be derived from UTRs (Nie
et al., 2012). Third, lncRNAs are very different from canonical ncRNAs:
they may be polyadenylated and lack structure, making them similar to
untranslated regions of mRNAs (UTRs). Probably due to such concerns,
previous methods did not use UTRs as controls for lncRNA identification
(Lv et al., 2013; Sun et al., 2013). Therefore, in this study, we prefer not to
use UTRs in the negative set when training the lncRNA model. The AUC is
0.901 (red line) for classifying lncRNAs (5,593 bins in total) and canonical
ncRNAs (1,629 bins in total) from CDSs (303,098 bins in total) and
unexpressed intergenic regions (79,488 bins in total). Moreover, we filter the
predictions by a distance criterion, requiring them to be > 500 bp away from
known genes. For level 3 lncRNAs, we removed the ambiguous predictions that
fail this criterion. For level 1 and 2 lncRNAs, we kept them and annotated them
as cis lncRNAs because of their potential interest; they will need further
experimental validation. (b) The model is used to score bins across the whole
genome; the averaged bin score is assigned to each overlapping annotated
lncRNA, and performance is assessed by positive predictive value (PPV) and
sensitivity. (c-f) Model performance for lncRNAs from different genomic
locations using different feature sets.
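The performance measures named in this legend (AUC, PPV and sensitivity) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy scores and the score threshold are hypothetical, and the AUC is computed with the standard rank-based (pairwise) definition rather than any specific plotting package.

```python
def auc(scores, labels):
    """Rank-based AUC: probability that a positive bin outscores a negative bin."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ppv_and_sensitivity(scores, labels, threshold):
    """PPV and sensitivity when bins scoring >= threshold are called ncRNA."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    return ppv, sensitivity


# Toy example (hypothetical scores): label 1 = ncRNA bin, 0 = negative bin.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
toy_ppv, toy_sens = ppv_and_sensitivity(toy_scores, toy_labels, threshold=0.5)
```

Sweeping the threshold over the observed scores would trace out the PPV and sensitivity trade-off of the kind shown in panel (b).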
Figure S4. Saturation plots of lncRNAs and coding genes.
Twenty-one datasets, including poly(A)+ RNA-seq and poly(A)- RNA-seq, are
used to calculate the expression ratios of level 3 lncRNA loci, coding genes
and annotated lncRNAs. An aligned read count >= 5 is set as the expression
cutoff. Each point at a given number of considered datasets
corresponds to a different combination of datasets.
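The saturation calculation can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: it assumes a locus counts as expressed for a combination of datasets if it reaches the read cutoff in at least one of them, and the function name and toy counts are hypothetical. Enumerating every combination of 21 datasets is expensive, so a real analysis may well sample combinations instead.

```python
from itertools import combinations


def saturation_points(counts, cutoff=5):
    """Expressed fraction of loci for every combination of datasets.

    counts: dict mapping locus name -> list of aligned read counts,
            one entry per dataset (same order for every locus).
    Returns a dict mapping k (number of datasets considered) to a list of
    expressed fractions, one per combination of k datasets.
    """
    n_datasets = len(next(iter(counts.values())))
    points = {}
    for k in range(1, n_datasets + 1):
        fractions = []
        for combo in combinations(range(n_datasets), k):
            # Assumption: expressed = cutoff reached in at least one dataset.
            expressed = sum(
                1 for reads in counts.values()
                if any(reads[i] >= cutoff for i in combo)
            )
            fractions.append(expressed / len(counts))
        points[k] = fractions
    return points


# Toy example with three loci and three datasets (hypothetical counts).
toy_counts = {"locus_a": [5, 0, 0], "locus_b": [0, 0, 7], "locus_c": [1, 2, 3]}
toy_points = saturation_points(toy_counts)
```

Plotting each list in `toy_points` against its key `k` gives one point per combination, as described in the legend.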
Figure S5. Comparison of feature scores among different levels of
predicted lncRNAs.
The normalized feature scores are shown for different levels of
lncRNAs, CDSs and controls (unexpressed intergenic regions). All 25
features are plotted.
Figure S6. Comparison of feature scores of predicted lncRNAs from
different genomic locations.
(a) Genomic locations of level 1 and level 2 predicted lncRNAs and annotated
lncRNAs (see definitions in Supplemental methods in Additional file 1). (b)
The normalized feature scores are shown for lncRNAs from different genomic
locations, CDSs and controls (unexpressed intergenic regions). All 25 features
are plotted.
Figure S7. Exon numbers in poly(A)+ and poly(A)- lncRNAs.
The classification is the same as in Figure 3, but only level 1 predicted
lncRNAs are shown.
Figure S8. Classification of level 1 poly(A)+ and poly(A)- lncRNAs.
The classification is the same as in Figure 3, but only level 1 predicted
lncRNAs are shown.
Figure S9. Comparison of feature scores between poly(A)+ and poly(A)-
lncRNAs.
The normalized feature scores are shown for poly(A)+ and poly(A)-
lncRNAs, CDSs and controls (unexpressed intergenic regions). All 25
features are plotted. (a) All predicted lncRNAs are plotted. (b) Only level 1
predicted lncRNAs are plotted.
Figure S10. DNA conservation of lncRNAs in multiple plant species.
LncRNAs show moderate DNA conservation across multiple plant species.
Overall, closely related species tend to have higher DNA conservation scores.
Distances between species are shown in Mya on the phylogenetic tree. We
calculated the DNA conservation scores for each genomic bin (100 nt; see
Supplemental Methods). In brief, we used BLASTn to search for targets in each
of the 31 plant species. We kept only the best match and recorded its bit
score. If there is no match for a bin, we assign a bit score of 0. Here, we
selected 16 closely related species for the evolutionary conservation
comparison of lncRNAs. Each predicted lncRNA can overlap with multiple bins
(100 nt). Thus, we used the maximum bit score of the overlapping bins as the
lncRNA’s conservation score. Then, we averaged the lncRNA scores and plotted
them according to the evolutionary branch. We employed the Mann-Whitney test
to compare the conservation scores between lncRNAs and the controls.
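The per-lncRNA conservation score described above can be sketched in Python. This is a simplified illustration, not the authors' pipeline: it assumes non-overlapping 100 nt bins tiled from coordinate 0 on a single chromosome, half-open lncRNA coordinates, and a hypothetical dictionary of best-hit bit scores per bin for one target species.

```python
BIN_SIZE = 100  # bin width in nt, as stated in the legend


def overlapping_bins(start, end, bin_size=BIN_SIZE):
    """Indices of the bins overlapped by the half-open interval [start, end)."""
    return range(start // bin_size, (end - 1) // bin_size + 1)


def lncrna_conservation(start, end, bin_bitscores):
    """Maximum bit score over the overlapping bins; bins without a hit score 0."""
    return max(bin_bitscores.get(b, 0.0) for b in overlapping_bins(start, end))


def mean_conservation(lncrnas, bin_bitscores):
    """Average conservation score over a list of (start, end) lncRNA intervals."""
    scores = [lncrna_conservation(s, e, bin_bitscores) for s, e in lncrnas]
    return sum(scores) / len(scores)


# Toy example (hypothetical values): best-hit bit scores for bins 0 and 2 only.
toy_bitscores = {0: 40.0, 2: 90.0}
toy_lncrnas = [(50, 250), (300, 400)]
toy_mean = mean_conservation(toy_lncrnas, toy_bitscores)
```

Repeating the calculation once per species would give the per-species averages that are plotted along the phylogenetic tree.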
Figure S11. Expression patterns of lncRNAs in different stress
conditions.
We analyzed the differential expression patterns of lncRNAs under different
stress conditions. The significance of differential expression is measured by
fold-changes. Then we clustered the fold-change scores (k-means, k=7).
Poly(A)+ lncRNAs are assessed by poly(A)+ RNA-seq data (left panels).
Poly(A)- lncRNAs are assessed by poly(A)- RNA-seq data (right panel).
Figure S12. Comparison of stress specificity scores between mRNAs and
lncRNAs within a fixed expression level.
(a) Maximum expression levels of mRNAs and lncRNAs in either poly(A)+
RNA-seq or poly(A)- RNA-seq data are calculated, and their densities are plotted.
(b) mRNAs and lncRNAs with expression levels (log(RPKM)) between -1 and 3 are
used to compare stress-specific expression, which is measured by stress
specificity scores (see Methods).
Figure S13. qRT-PCR validations for differentially expressed lncRNAs.
(a) and (b) qRT-PCR results for predicted differentially expressed lncRNAs
under drought stress and salt stress. The expression values of lncRNAs are
normalized to those of the housekeeping gene ACT2.
Numbers in boxes denote expression fold changes, which are measured
either by RNA-seq or by qRT-PCR under drought and salt stress. A qRT-PCR
result is considered consistent with RNA-seq if the expression change (at
either time point) is in the same direction as in RNA-seq and the fold-change
is above 1.5. Most of the qRT-PCR results are consistent with the RNA-seq
data (✔ symbol); several are not (✖ symbol). lncRNAs from level 1: lnc-20,
lnc-23, lnc-27, lnc-104, lnc-168, lnc-225, lnc-278, lnc-316, and lnc-324;
lncRNAs from level 2: lnc-68, lnc-127, lnc-164, lnc-287, lnc-311, lnc-334,
lnc-369, lnc-410, lnc-525, and lnc-498. (c) and (d) Validation of
differentially expressed lncRNA candidates under cold and heat stress. The
method is the same as in (a) and (b). lncRNAs from level 1: lnc-61, lnc-67,
lnc-93, lnc-100, lnc-141, lnc-156, lnc-162, lnc-190, lnc-220, lnc-225,
lnc-234, lnc-259, lnc-278, lnc-283, lnc-289, lnc-334; lncRNAs from level 2:
lnc-102, lnc-168, lnc-211, lnc-293, lnc-326, lnc-420, lnc-508, lnc-606.
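The consistency rule used for the ✔ and ✖ symbols can be written as a small Python check. This is a sketch under assumptions that the legend does not spell out: fold changes are taken as positive treated/control ratios, and for down-regulation "above 1.5" is interpreted as a reduction of more than 1.5-fold (ratio below 1/1.5). The function names and example values are hypothetical.

```python
def direction(fold_change):
    """+1 for up-regulation, -1 for down-regulation, 0 for no change."""
    if fold_change > 1:
        return 1
    if fold_change < 1:
        return -1
    return 0


def magnitude(fold_change):
    """Fold change expressed as a value >= 1 regardless of direction."""
    return fold_change if fold_change >= 1 else 1 / fold_change


def is_consistent(rnaseq_fc, qpcr_fcs, min_fc=1.5):
    """True if any qRT-PCR time point agrees in direction with RNA-seq
    and exceeds the minimum fold change (assumed symmetric for up and down)."""
    target = direction(rnaseq_fc)
    return any(
        target != 0 and direction(fc) == target and magnitude(fc) > min_fc
        for fc in qpcr_fcs
    )


# Hypothetical example: RNA-seq shows 2-fold induction; qRT-PCR at two time points.
example = is_consistent(2.0, [1.2, 1.8])
```

With this rule, a single qualifying time point is enough to mark a candidate as validated, matching the "either time point" wording in the legend.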
Figure S14. Validation of pif mutants and the high light responsiveness
of PIF genes.
The pif4 and pifq mutants are validated as null alleles: the mutated genes
cannot be detected in the mutants. The PIF genes are
significantly down-regulated by high light. (a) PIF4 amplification in Col-0
and the pif4 mutant under normal and high light conditions. (b) PIF1, PIF3,
PIF5 and PIF4 amplification in Col-0 and the pifq mutant under normal and
high light conditions. +RT, the normal
reverse transcription procedure; -RT, without reverse transcriptase. ACT2 is
used as a positive control.
Figure S15. Stress-related conserved sequence motifs.
Conserved motifs are searched for with MEME in the 25 groups of lncRNAs.
Sixteen groups have at least one enriched conserved motif (E-value cutoff 0.05).
Figure S16. Verification of enriched motifs.
(a) We use MAST to search for sequence motifs (detected by MEME) in
lncRNAs, mRNA regions and simulated background sequences (see
Supplemental Methods), using the default cutoff (p-value < 0.0001) for a motif
match. We calculate the enrichment by comparing the number of matched
motifs in lncRNAs to that in the background sequences and perform
Fisher’s exact test to assess whether the enrichment is significant. (b) For
structural motifs, we use the RNApromo program to search for motifs in
the same way. We sum the log-likelihood scores of matched motifs
(log-likelihood score > 0), calculate the enrichment, and perform the
Mann-Whitney test to assess whether the enrichment is significant.
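The enrichment test in (a) can be illustrated with a pure-Python, one-sided Fisher's exact test. This is a sketch, not the authors' procedure: the layout of the 2x2 table (sequences with and without a motif match, in lncRNAs versus background) is an assumption, and the function names and counts are hypothetical. It needs Python 3.8 or later for `math.comb`.

```python
from math import comb


def enrichment_ratio(a, b, c, d):
    """Fraction of matched lncRNAs divided by fraction of matched background."""
    return (a / (a + b)) / (c / (c + d))


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for enrichment in the first row.

    Assumed table layout:
                     matched   not matched
        lncRNAs         a           b
        background      c           d
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    tail = sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1))
    return tail / total


# Hypothetical counts: 3 of 4 lncRNAs matched versus 1 of 4 background sequences.
ratio = enrichment_ratio(3, 1, 1, 3)
p_value = fisher_exact_greater(3, 1, 1, 3)
```

For the structural motifs in (b), the legend instead compares summed log-likelihood scores with the Mann-Whitney test, which is not shown here.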
Figure S17. Characterization of poly(A)- mRNAs.
(a) The same clustering analysis as in Figure 4a is performed for poly(A)-
mRNAs. (b) Maximum expression levels of mRNAs and lncRNAs in either
poly(A)+ RNA-seq or poly(A)- RNA-seq data are calculated, and their
densities are plotted. (c) Expression levels (log(RPKM)) between -1 and 3 are
used to compare stress-specific expression, which is measured by stress
specificity scores (see Methods). We used the same method as in Figure S12.
(d) GO enrichment analysis is performed using all the differentially expressed
poly(A)- mRNAs. Several GO terms under the parental annotation
macromolecule metabolic process (GO:0043170, FDR = 7.0E-86) are shown.
Table S1. Poly(A)- and RiboMinus RNA-seq data statistical summary.
Table S2. Features and public data used in this study.
Table S3. Overlaps of predicted lincRNAs by this study and Liu et al., 2012.
Table S4. Statistical test of differentially expressed coding genes or lncRNAs
under stress.
Table S5. RT-PCR and qRT-PCR primer sequences used in this study.
Supporting experimental procedures
Supplemental methods
Appendix S1. Full matrix of lncRNAs in Arabidopsis, including annotated
lncRNAs and three levels of predicted lncRNAs.
Appendix S2. List of differentially expressed lncRNAs under stress
conditions, including annotated lncRNAs and level 1 and level 2 predicted
lncRNAs.
Appendix S3. Functional annotation of co-expressed lncRNAs, including
annotated lncRNAs and level 1 and level 2 predicted lncRNAs.