Supplementary Methods
Marker Genes for Phage Taxa
Phage Orthologous Groups (POGs) were constructed using the proteins
contained in over 1000 phage genomes, including ssDNA, dsDNA, ssRNA, and dsRNA phages as well as archaeal viruses (Kristensen et al. 2013). Briefly, proteins were
clustered into orthologous groups with the standard COG construction algorithm,
which builds groups from triangles of 3-way reciprocal best matches (Kristensen, Kannan, et
al. 2010). Subsequently, for each POG a Viral Quotient (VQ) was calculated
indicating its specificity to phage genomes, relative to non-prophage regions of
bacterial chromosomes. Then, taxon-specific marker genes were identified that
are never found in other viral taxa (i.e. 100% precision) and are largely absent from non-prophage regions of bacterial chromosomes (i.e. VQ greater than 85%). These
markers are therefore suitable for detecting phage taxa in mixed samples of prokaryotic
cells and viruses. Note that some marker genes have recall < 100% (i.e. fewer than 100% of the
genomes in that taxon contain the marker gene), so the absence of a detectable
marker gene does not reliably indicate the absence of that type of phage in the
sample. For the abundance calculations, therefore, markers with recall < 85% were
not used. The presence of a phage in a given sample was determined by the
detection of at least one of these marker genes, and the abundance of that phage was
calculated as the average number of matches among the set of all marker
genes that are specific to that taxon. Markers with recall >= 85% that are
present in at most a single copy per virus genome (such that one match to the
gene is approximately equal to one match to the virus) were used for abundance
calculations and are labelled as "Quantitative" in the list of marker genes in
Supplementary Table S1.
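The selection rules and the abundance calculation described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the `Marker` record, its field names, and the POG identifiers in the usage note are hypothetical, and precision, VQ, and recall are assumed to be stored as fractions between 0 and 1.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Marker:
    """Hypothetical record for one candidate marker POG (field names are assumptions)."""
    pog_id: str
    taxon: str
    precision: float  # fraction of occurrences that fall within this taxon
    vq: float         # Viral Quotient, as a fraction
    recall: float     # fraction of the taxon's genomes that carry the marker
    max_copies: int   # maximum copy number observed per virus genome


def is_taxon_marker(m):
    # Taxon-specific: 100% precision and VQ greater than 85%.
    return m.precision == 1.0 and m.vq > 0.85


def is_quantitative(m):
    # "Quantitative": recall >= 85% and at most a single copy per genome.
    return is_taxon_marker(m) and m.recall >= 0.85 and m.max_copies <= 1


def taxon_abundance(markers, matches, taxon, quantitative_only=True):
    """Average number of matches over the markers selected for `taxon`.

    `matches` maps a POG identifier to its match count in one sample.
    Returns None when no marker of the taxon passes the selection.
    """
    keep = is_quantitative if quantitative_only else is_taxon_marker
    counts = [matches.get(m.pog_id, 0)
              for m in markers if m.taxon == taxon and keep(m)]
    return mean(counts) if counts else None
```

With three hypothetical P2-like markers, of which one has a recall of only 50%, `taxon_abundance` averages the two Quantitative markers by default and all three when `quantitative_only=False`.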
To obtain the relative rank information in Figure 1(a), abundance was calculated
similarly, but for all markers (rather than just those that are Quantitative), and
then the precise values were discarded, keeping only the qualitative information
about the rank of each phage's abundance level relative to the other phages
present in the same sample.
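The conversion from abundance values to ranks might look like the sketch below. The treatment of ties (equal abundances share a rank) and the exclusion of taxa with zero abundance are assumptions made for illustration; the text does not specify either.

```python
def abundance_ranks(abundances):
    """Rank the phage taxa detected in one sample (1 = most abundant).

    `abundances` maps taxon name to abundance. Taxa with zero abundance
    are treated as absent, and tied taxa share a rank (both are assumptions).
    """
    detected = {taxon: value for taxon, value in abundances.items() if value > 0}
    distinct_values = sorted(set(detected.values()), reverse=True)
    rank_of = {value: i + 1 for i, value in enumerate(distinct_values)}
    return {taxon: rank_of[value] for taxon, value in detected.items()}
```

Only the returned ranks would be carried forward; the abundance values themselves are discarded.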
Note:
The abundance calculated using the Picovirinae subfamily marker gene is much
higher than the abundance of the only Picovirinae genus with an available
taxonomic marker gene (Phi29-like viruses); for most samples, the Phi29-like
abundance is less than 1% of the abundance of the Picovirinae. Most likely, the
abundance of the Picovirinae taxon in excess of that of Phi29-like viruses
represents other Picovirinae such as Ahjd-like viruses and unclassified or
unidentified members.
Calculation of Bacterial Taxon Abundances
The MOCAT pipeline was used to calculate the abundance of each bacterial taxon.
Briefly, metagenomic reads were mapped to a set of 10 universal marker genes
from 3,496 prokaryotic reference genomes at 97% nucleotide identity.
Initially, only uniquely mapped reads were retained. Reads that mapped to
multiple marker genes were then distributed among their respective genomes
according to the proportions determined from the unique mappers. The abundance
was calculated as the length-normalized coverage per base pair.
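The proportional distribution of multiply mapped reads can be illustrated as follows. This is not MOCAT code: the function names and input shapes are hypothetical, counts are kept per genome for simplicity, and the equal split used when no candidate genome has unique support is an assumption.

```python
from collections import defaultdict


def distribute_multi_mappers(unique_counts, multi_mapped):
    """Add multiply mapped reads to genomes in proportion to unique support.

    `unique_counts` maps genome to its number of uniquely mapped reads.
    `multi_mapped` holds one tuple of candidate genomes per ambiguous read.
    """
    totals = defaultdict(float)
    for genome, count in unique_counts.items():
        totals[genome] += count
    for candidates in multi_mapped:
        weights = [unique_counts.get(genome, 0) for genome in candidates]
        total_weight = sum(weights)
        for genome, weight in zip(candidates, weights):
            if total_weight > 0:
                share = weight / total_weight
            else:
                # Assumption: split equally when no candidate has unique reads.
                share = 1 / len(candidates)
            totals[genome] += share
    return dict(totals)


def coverage_per_bp(read_count, read_length, marker_length):
    """Length-normalized coverage: mapped bases divided by marker length."""
    return read_count * read_length / marker_length
```

For example, with 30 unique reads on genome A and 10 on genome B, each read that maps to both contributes 0.75 to A and 0.25 to B.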
Identification of Prophage Host Interaction Network
In order to avoid detecting false relationships, the criteria for classifying the
bacterial host were rather strict. Firstly, the nucleotide sequences of the scaftigs
up- and down-stream of the prophage regions were extracted and classified
using PhyloPythia (McHardy et al. 2007). The non-prophage bacterial
chromosome was then assigned to a taxon at the level at which the up- and
down-stream classifications agreed with each other. Secondly, blastN was
performed using the up- and down-stream sequences against a database of
bacterial reference genomes, requiring a bit score greater than 60 and
nucleotide identity greater than 85%. The last common ancestor of all hits was
determined for each sequence, and the last common ancestor of the up- and down-stream
sequences was then used as the identity of the bacterial host. If a scaffold was
classified using both methods, the most specific classification was used. None of the
scaffolds that were classified using both methods had conflicting classifications.
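The last-common-ancestor logic of the second method can be sketched as below. Representing each hit as a root-to-leaf list of taxon names is an assumption made for illustration, as are all function names; parsing of the blastN output is not shown.

```python
def passes_blast_filter(bit_score, percent_identity):
    # Thresholds stated in the text: bit score > 60 and identity > 85%.
    return bit_score > 60 and percent_identity > 85


def last_common_ancestor(lineages):
    """Longest shared prefix of root-to-leaf lineages (lists of taxon names)."""
    if not lineages:
        return []
    shared = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        shared.append(names[0])
    return shared


def classify_host(upstream_hits, downstream_hits):
    """LCA of each flank's hits, then the LCA of the two flanks."""
    return last_common_ancestor([
        last_common_ancestor(upstream_hits),
        last_common_ancestor(downstream_hits),
    ])


def most_specific(lineage_a, lineage_b):
    """Keep the deeper of two classifications, assuming they do not conflict."""
    return lineage_a if len(lineage_a) >= len(lineage_b) else lineage_b
```

If the upstream hits disagree at the species level while the downstream hits all point to one species, `classify_host` returns the shared genus-level lineage. An empty list means that the flanks could not be assigned to a common taxon.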
Of the 463 refG-prophages that occur in the gut, 47 (10%) contained at least
one taxon-specific marker gene. From these we constructed a network of phage
taxa that infect specific bacterial genomes (Table S3). To simplify the network
diagram, phage-bacterial associations derived from
the reference genomes were summarized at the level of genus for the bacteria.
For example, the 2 associations of P2-like phages with Escherichia coli HS and of
P2-like phages with Escherichia coli 53638 are summarized as 1 connection
between P2-like phages and the genus Escherichia.
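A minimal sketch of this genus-level summarization is shown below. It assumes that each host is given by its organism name and that the genus is the first word of that name, which is a simplification that would not hold for every organism name.

```python
def summarize_by_genus(associations):
    """Collapse (phage taxon, host organism) pairs to unique (phage taxon, genus) edges.

    Assumption: the genus is the first whitespace-separated word of the host name.
    """
    return sorted({(phage, host.split()[0]) for phage, host in associations})


edges = summarize_by_genus([
    ("P2-like", "Escherichia coli HS"),
    ("P2-like", "Escherichia coli 53638"),
])
```

Here `edges` contains the single connection `("P2-like", "Escherichia")`, matching the example above.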
Of the 2518 unique scaftig-prophages, 683 (27%) contained at least one marker POG.
However, only 363 scaffolds (14%) had a reliable taxonomic classification of the
surrounding host genomic region, resulting in an intersection of 79 for
which both the prophage region and the host chromosome had a taxonomic
classification.
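The percentages reported in this section can be reproduced from the stated counts with a short check. The helper name is hypothetical, and rounding to the nearest integer is assumed to match how the values were reported.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to the nearest integer."""
    return round(100 * part / whole)


# Counts taken from the text above.
marker_fraction = percent(683, 2518)  # scaftig-prophages with a marker POG
host_fraction = percent(363, 2518)    # scaffolds with a classified host region
refg_fraction = percent(47, 463)      # refG-prophages with a marker gene
```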