SUPPLEMENTARY INFORMATION

Supplementary Table S1. Tara Oceans assembly data used in this study.

Supplementary Table S2. Tara Oceans ORF data used in this study.

Supplementary Table S3. Set of 56 CDD profiles related to retrotransposons or retroviruses.

Supplementary Table S4. Abiotic variables with the sample codes in PANGAEA.

Supplementary Table S5. Biotic composition of samples derived from the 18S-V9 rDNA barcodes described in (de Vargas et al. 2015).

Supplementary Table S6. CDD profiles of most highly represented sequences (top 100) in the metagenomic ORF set.

Supplementary Table S7. Output of the metatranscriptomic RT gene abundance DCA and the envfit analysis.

Supplementary Table S8. Output of the metagenomic RT gene abundance DCA and the envfit analysis.

Supplementary Figure S1. Distribution of pair-wise amino acid sequence identities of known RT domains. When two sequences have the same taxonomic annotation at the second level of the hierarchy of the NCBI taxonomic classification, their comparison was classified as an “inside taxon” comparison. Other comparisons were classified as “outside taxon”. (A) Taxonomically characterized Copia RTs. (B) Taxonomically characterized Gypsy RTs.

Supplementary Figure S2. Highly abundant retrotransposon/retrovirus-related ORFs in the metagenomic data from large size fractions. ORFs predicted from the metagenomic assemblies were pooled together in this analysis if they originated from the same size fraction, irrespective of sampling depth (i.e., SUR or DCM). Five hundred CDD profiles that acquired the highest number of ORF assignments are shown on the left panels, while the forty best CDD profiles are shown on the right panels (with their CDD profile accession numbers). Red bars correspond to profiles related to retrotransposons/retroviruses, while black bars represent other CDD profiles.
Sequence data from two whole genome amplified (WGA) samples (St TARA_007/SUR or DCM/5-20 μm) are not included in this analysis.

Supplementary Figure S3. ML-trees of known and environmental RT sequences. (A) ML-tree of known RT sequences with 124 RT-like sequences identified in the metagenomic data. (B) ML-tree of known RT sequences with 100 Copia RT-like sequences identified in the metagenomic data. (C) ML-tree of known RT sequences with 100 longest BEL RT-like sequences identified in the metagenomic data.

Supplementary Figure S4. Number of identified RT-like metagenomic sequences and their taxonomic classification. The stringent taxonomy annotation method allowed us to classify 656 metagenomic sequences.

Supplementary Figure S5. Number of RT-like metagenomic sequences and their taxonomic classification and RT classification.

Supplementary Figure S6. The longest ORF represents a group of Dualen non-LTR retrotransposons distributed among cryptophytes. APE: apurinic-like endonuclease; RT: reverse transcriptase; RNH: ribonuclease H; RLE: restriction-like endonuclease; SET: SET methyltransferase; C48: C48/Ulp1 peptidase; JCP: Josephin-like cysteine protease.

Supplementary Figure S7. Number of identified RT-like metatranscriptomic sequences and their taxonomic classification. The stringent taxonomy annotation method allowed us to classify 2050 metatranscriptomic sequences.

Supplementary Figure S8. Detrended Correspondence Analysis (DCA) ordinations of 21 samples for metagenomic RT gene abundance. DCA ordinations are shown with significant (p ≤ 0.05) environmental vectors fitted using envfit. Arrows indicate the direction of the (increasing) environmental gradient, and their lengths are proportional to their correlations with the ordination. X stands for size fraction samples of 180-2000 µm, L for 20-180 µm, M for 5-20 µm and S for 0.8-5 µm.
Samples from St TARA_007 are coloured in pink, St TARA_023 in blue, and St TARA_030 in green.