Reverse transcriptases are the most abundant genes in

SUPPLEMENTARY INFORMATION
Supplementary Table S1. Tara Oceans assembly data used in this study.
Supplementary Table S2. Tara Oceans ORF data used in this study.
Supplementary Table S3. Set of 56 CDD profiles related to retrotransposons or retroviruses.
Supplementary Table S4. Abiotic variables with the sample codes in PANGAEA.
Supplementary Table S5. Biotic composition of samples derived from the 18S-V9 rDNA barcodes described in de Vargas et al. (2015).
Supplementary Table S6. CDD profiles of the most highly represented sequences (top 100) in the metagenomic ORF set.
Supplementary Table S7. Output of the metatranscriptomic RT gene abundance DCA and the envfit analysis.
Supplementary Table S8. Output of the metagenomic RT gene abundance DCA and the envfit analysis.
Supplementary Figure S1. Distribution of pairwise amino acid sequence identities of known RT domains. When two sequences had the same taxonomic annotation at the second level of the NCBI taxonomic hierarchy, their comparison was classified as an “inside taxon” comparison; all other comparisons were classified as “outside taxon”. (A) Taxonomically characterized Copia RTs. (B) Taxonomically characterized Gypsy RTs.
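The inside/outside taxon rule in the Figure S1 caption can be illustrated with a minimal sketch. It assumes each sequence carries a lineage list ordered from the top of the NCBI hierarchy, so that index 1 is the "second level"; the function names, the example lineages, and the choice to treat sequences without a second-level annotation as "outside taxon" are all hypothetical and are not taken from the study.

```python
from itertools import combinations


def second_level(lineage):
    """Return the second-level taxon of a lineage, or None if it is too short."""
    return lineage[1] if len(lineage) > 1 else None


def classify_pairs(lineages):
    """Label every unordered pair of sequences as inside- or outside-taxon.

    `lineages` maps a sequence ID to its lineage list (assumed layout).
    """
    labels = {}
    for (id_a, lin_a), (id_b, lin_b) in combinations(sorted(lineages.items()), 2):
        tax_a, tax_b = second_level(lin_a), second_level(lin_b)
        # Assumption: a missing second-level annotation counts as "outside taxon".
        same = tax_a is not None and tax_a == tax_b
        labels[(id_a, id_b)] = "inside taxon" if same else "outside taxon"
    return labels


# Hypothetical lineages, for illustration only.
example = {
    "rt1": ["Eukaryota", "Viridiplantae", "Streptophyta"],
    "rt2": ["Eukaryota", "Viridiplantae", "Chlorophyta"],
    "rt3": ["Eukaryota", "Opisthokonta", "Metazoa"],
}
labels = classify_pairs(example)
```

The pairwise identities plotted in the figure would then be split into two distributions according to these labels.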
Supplementary Figure S2. Highly abundant retrotransposon/retrovirus-related ORFs in the metagenomic data from large size fractions. ORFs predicted from the metagenomic assemblies were pooled in this analysis if they originated from the same size fraction, irrespective of sampling depth (i.e., SUR or DCM). The five hundred CDD profiles with the highest numbers of ORF assignments are shown in the left panels, and the top forty CDD profiles are shown in the right panels (with their CDD profile accession numbers). Red bars correspond to profiles related to retrotransposons/retroviruses, while black bars represent other CDD profiles. Sequence data from two whole genome amplified (WGA) samples (St TARA_007/SUR or DCM/5-20 μm) are not included in this analysis.
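The pooling and ranking described in the Figure S2 caption can be sketched as follows. This is only an illustration under stated assumptions: each ORF assignment is represented as a hypothetical (size fraction, depth, CDD profile) tuple, the profile names are placeholders rather than real CDD accessions, and the function is not the authors' pipeline.

```python
from collections import Counter, defaultdict


def rank_profiles(assignments, top_n):
    """Count ORF assignments per CDD profile within each size fraction.

    Depth (SUR or DCM) is ignored, so ORFs from both depths are pooled.
    Returns, per size fraction, the `top_n` profiles by assignment count.
    """
    pooled = defaultdict(Counter)
    for size_fraction, _depth, profile in assignments:
        pooled[size_fraction][profile] += 1
    return {frac: counts.most_common(top_n) for frac, counts in pooled.items()}


# Hypothetical assignments, for illustration only.
records = [
    ("20-180", "SUR", "profile_A"),
    ("20-180", "DCM", "profile_A"),
    ("20-180", "DCM", "profile_A"),
    ("20-180", "SUR", "profile_B"),
    ("180-2000", "SUR", "profile_B"),
    ("180-2000", "DCM", "profile_B"),
    ("180-2000", "DCM", "profile_C"),
]
ranked = rank_profiles(records, top_n=2)
```

Calling the function with `top_n=500` and `top_n=40` would correspond to the left and right panels, respectively.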
Supplementary Figure S3. ML-trees of known and environmental RT sequences. (A) ML-tree of known RT sequences with 124 RT-like sequences identified in the metagenomic data. (B) ML-tree of known RT sequences with 100 Copia RT-like sequences identified in the metagenomic data. (C) ML-tree of known RT sequences with the 100 longest BEL RT-like sequences identified in the metagenomic data.
Supplementary Figure S4. Number of identified RT-like metagenomic sequences and their taxonomic classification. The stringent taxonomy annotation method allowed us to classify 656 metagenomic sequences.
Supplementary Figure S5. Number of RT-like metagenomic sequences and their taxonomic and RT classifications.
Supplementary Figure S6. The longest ORF represents a group of Dualen non-LTR retrotransposons distributed among cryptophytes. APE: apurinic-like endonuclease; RT: reverse transcriptase; RNH: ribonuclease H; RLE: restriction-like endonuclease; SET: SET methyltransferase; C48: C48/Ulp1 peptidase; JCP: Josephin-like cysteine protease.
Supplementary Figure S7. Number of identified RT-like metatranscriptomic sequences and their taxonomic classification. The stringent taxonomy annotation method allowed us to classify 2050 metatranscriptomic sequences.
Supplementary Figure S8. Detrended Correspondence Analysis (DCA) ordinations of 21 samples for metagenomic RT gene abundance. DCA ordinations are shown with significant (p ≤ 0.05) environmental vectors fitted using envfit. Arrows indicate the direction of the (increasing) environmental gradient, and their lengths are proportional to their correlations with the ordination. X stands for size fraction samples of 180-2000 µm, L for 20-180 µm, M for 5-20 µm and S for 0.8-5 µm. Samples from St TARA_007 are coloured in pink, those from St TARA_023 in blue, and those from St TARA_030 in green.