Supplementary Methods

1. Construction of gene- and species-trees
Protein sequences were aligned using MUSCLE (Edgar, 2004), and all gaps were removed,
including one flanking residue on either side. Nucleotide alignments were also performed using
MUSCLE, ensuring that codons were consistent with the protein alignment. Gene tree topologies
and branch lengths were estimated using PhyML 2.4.5 (Guindon & Gascuel, 2003) with a JTT
substitution model and 4 γ-distributed rate categories. As we previously noted, use of the WAG
substitution model instead of JTT did not significantly influence the branch lengths or topologies
of these proteins (Shapiro & Alm, 2008). The consensus species tree topology was estimated as
previously described, using a concatenation of 80 COGs present in single copy in each of the 30
genomes (Shapiro & Alm, 2008). Divergence times in the species tree were estimated using a
relaxed molecular clock model implemented in multidivtime (Thorne et al, 1998) with
parameters rttm = 1.75 By, rttmsd = 1, rtrate = 3.9, rtratesd = 3.9, brownmean = 0.6, brownsd =
0.6, bigtime = 5.0, and the constraint that E. coli and Salmonella diverged between 0.057 and
0.176 billion years ago (Ochman & Wilson, 1987; Battistuzzi et al, 2004).
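The gap-removal step described above (dropping every gapped column together with one flanking residue on either side) can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `strip_gaps`, the `-` gap character, and the rule that a column counts as gapped when any sequence has a gap there are all assumptions.

```python
def strip_gaps(alignment, flank=1):
    """Drop every column containing a gap, plus `flank` columns on each side.

    `alignment` is a list of equal-length strings (hypothetical input format).
    A column is treated as gapped if ANY sequence has '-' there (assumption).
    """
    if not alignment:
        return []
    ncol = len(alignment[0])
    gap_cols = [j for j in range(ncol) if any(seq[j] == "-" for seq in alignment)]
    drop = set()
    for j in gap_cols:
        # Mark the gapped column and its flanking columns, staying in bounds.
        for c in range(j - flank, j + flank + 1):
            if 0 <= c < ncol:
                drop.add(c)
    keep = [j for j in range(ncol) if j not in drop]
    return ["".join(seq[j] for j in keep) for seq in alignment]


# Toy alignment: column 3 is gapped, so columns 2-4 are removed.
aln = ["MKT-LLAV", "MKTALLAV", "MRT-LIAV"]
stripped = strip_gaps(aln)
```

For a run of adjacent gapped columns, this removes the run plus one column on each side, which is one reading of "one flanking residue on either side".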
2. Detailed procedures for choosing k
For DNA data, k is the (approximate) proportion of sites in the multiple sequence alignment
classified as 'slow'. k is 'approximate' because multiple sites might be tied at the same rate
(substitutions/site), so the cutoff is drawn to minimize the deviation from the desired k (e.g. if
the desired k were 0.67, falling between sites with rank 0.60 and 0.70, the cutoff would be
chosen at 0.70). For AA data, k is the rank order of the site at which the cutoff is drawn, counted
over distinct rates rather than over individual sites. For example, in an alignment of 10 sites with
1, 2, 3, and 4 substitutions in 4, 3, 1, and 2 sites, respectively, k would be 0.25 (1/4) if the cutoff
were drawn between 1 and 2 substitutions/site, even though this would result in a proportion of
0.40 (4/10) of all sites being classed as 'slow'.
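The two definitions of k can be illustrated with the 10-site example above. The sketch below is one reading of the text, not the authors' code: the helper names (`rate_classes`, `dna_cutoff`, `aa_k`) are hypothetical, and the AA rank is assumed to be counted over distinct rate classes, as the 1/4 in the worked example suggests.

```python
from collections import Counter


def rate_classes(subs_per_site):
    """Return the sorted distinct rates and, for each, the cumulative
    proportion of sites with a rate at or below it."""
    counts = Counter(subs_per_site)
    n = len(subs_per_site)
    rates = sorted(counts)
    cumulative, running = [], 0
    for r in rates:
        running += counts[r]
        cumulative.append(running / n)
    return rates, cumulative


def dna_cutoff(subs_per_site, desired_k):
    """DNA-style k: pick the cutoff whose slow-site proportion is closest
    to desired_k (ties between sites make exact matches impossible)."""
    rates, cumulative = rate_classes(subs_per_site)
    best = min(range(len(rates)), key=lambda i: abs(cumulative[i] - desired_k))
    return rates[best], cumulative[best]


def aa_k(subs_per_site, cutoff_rate):
    """AA-style k: rank of the cutoff among distinct rate classes (assumed
    reading), returned with the resulting proportion of slow sites."""
    rates, cumulative = rate_classes(subs_per_site)
    n_slow = sum(1 for r in rates if r <= cutoff_rate)
    slow_proportion = cumulative[n_slow - 1] if n_slow else 0.0
    return n_slow / len(rates), slow_proportion


# 10 sites: 1, 2, 3, 4 substitutions in 4, 3, 1, 2 sites respectively.
sites = [1] * 4 + [2] * 3 + [3] * 1 + [4] * 2
k_aa, slow_prop = aa_k(sites, cutoff_rate=1)        # cutoff between 1 and 2
cut_rate, achieved_k = dna_cutoff(sites, desired_k=0.67)
```

With these inputs the AA-style k is 1/4 while 4/10 of sites are classed as slow, and the DNA-style cutoff for a desired k of 0.67 lands on the achievable proportion 0.70.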
minFDR
In this method, we computed S:F ratios for all genes in our dataset to yield a distribution of
p-values for each possible choice of k. We then chose the value of k producing the
distribution with the lowest false discovery rate (FDR). This was done by performing a one-sided
Kolmogorov-Smirnov (KS) test of the observed p-value distribution against a uniform distribution
of p-values. The higher the value of the KS test D statistic, the more favorable the ratio of
true:false positives. To verify that the D statistic is indeed a good indicator of FDR, we plotted
the observed p-value distribution (ranging from 0 to 1, binned in increments of 0.05) for each
value of k, and calculated FDR by dividing the average number of counts in bins with p > 0.5 by
the number of counts in the p < 0.05 bin. We then chose the value of k that minimized FDR. The
two methods always yielded the same choice of k.
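Both selection criteria can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the one-sided statistic is taken to be D+ = max_i(i/n - p_(i)) (the direction that detects an excess of small p-values), and the p-values are toy numbers, not data from the study.

```python
def ks_d_plus(pvalues):
    """One-sided KS statistic against Uniform(0, 1):
    D+ = max over sorted p-values of (i/n - p_(i))."""
    p = sorted(pvalues)
    n = len(p)
    return max((i + 1) / n - x for i, x in enumerate(p))


def binned_fdr(pvalues, nbins=20):
    """FDR estimate from a histogram with nbins equal bins on [0, 1]:
    mean count in the bins above p = 0.5, divided by the count in the
    first bin (p < 0.05 when nbins = 20)."""
    counts = [0] * nbins
    for x in pvalues:
        counts[min(int(x * nbins), nbins - 1)] += 1
    upper = counts[nbins // 2:]
    if counts[0] == 0:
        return float("inf")
    return (sum(upper) / len(upper)) / counts[0]


def choose_k(pvalues_by_k):
    """Return (k maximizing D+, k minimizing the binned FDR)."""
    by_d = max(pvalues_by_k, key=lambda k: ks_d_plus(pvalues_by_k[k]))
    by_fdr = min(pvalues_by_k, key=lambda k: binned_fdr(pvalues_by_k[k]))
    return by_d, by_fdr


# Toy p-value sets for two candidate values of k (made-up numbers).
flat = [0.025 + 0.1 * i for i in range(10)]            # roughly uniform
enriched = [0.01] * 10 + [0.55, 0.65, 0.75, 0.85, 0.95]  # excess of small p
best_by_d, best_by_fdr = choose_k({0.30: flat, 0.50: enriched})
```

In this toy case both criteria pick the same k, mirroring the agreement reported in the text.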
minSD
As in the minFDR method, we varied k from 0.05 to 0.95 in increments of 0.05. For each k, the
sequence was split into slow and fast sites according to their rank order. Branch lengths were then
estimated twice independently with PhyML, once using only the slow sites and once
using only the fast sites. The result is two gene trees: one estimated from slow sites,
the other from fast sites. If, for example, some fast sites are miscategorized as slow, most of the
slow sites will support short branch lengths, but a few (miscategorized) sites will support long
branch lengths. A branch in the gene tree could then be 'stretched' (inferred to have undergone
more substitutions per site) without significantly reducing the likelihood. Such 'stretching' can
be used to estimate the standard deviation of each branch in the gene tree,
which is our goal. Starting from the maximum-likelihood (ML) branch length, the branch length was
stretched to 300% of its ML length in increments of 5%, and then compressed to zero, also in
increments of 5%. The likelihood of the gene tree was re-estimated with PhyML for each
incremental change to produce a distribution of N branch lengths and their
associated likelihoods for each branch. For branches with non-zero ML lengths, the likelihood
of a particular branch length could be well approximated by a Normal distribution, transformed
for simplicity such that the mean (the ML branch length) is zero (Equation 4). The log-likelihood
can therefore be approximated by a parabola whose intercept on the ln L axis is
$\ln\left(1/(\sigma\sqrt{2\pi})\right)$ (Equation 5).
Equation 4.
$$L = p_x(x) \sim N(0, \sigma^2)$$

Equation 5.
$$\ln L = \ln\frac{1}{\sigma\sqrt{2\pi}} - \frac{x^2}{2\sigma^2}$$

where the likelihood (L) is a function of log branch length (x), given by a Normal probability
density function with mean set to zero (by subtracting the PhyML-estimated ML branch length)
and variance $\sigma^2$. Letting $y = \ln\left(1/(\sigma\sqrt{2\pi})\right) - \ln L$ and
$m = 1/(2\sigma^2)$, and substituting the estimated maximum likelihood for the term
$1/(\sigma\sqrt{2\pi})$, yields the parabola $y = mx^2$. We then estimated the
variance and standard deviation by linearizing the parabola and solving for the best-fit slope (m),
from which $\sigma^2 = 1/(2m)$.
For each branch, we combined slow and fast sites to calculate a pooled standard deviation. We
repeated this procedure for a range of values of k, and chose the k that minimized the mean
pooled standard deviation over all branches in the gene tree. Note that with this method, each
gene family can take a different value of k.
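The parabola fit described above can be sketched as a least-squares regression of y on x^2 through the origin. This is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the log-likelihoods are synthetic rather than PhyML output, and the pooled standard deviations are taken as given inputs because the pooling formula is not stated in the text.

```python
import math

# Stretch factors from 0% to 300% of the ML branch length in 5% steps.
STRETCH_FACTORS = [pct / 100 for pct in range(0, 301, 5)]


def fit_sigma(x, log_lik, max_log_lik):
    """Fit y = m * x^2 through the origin, where y = max_log_lik - log_lik
    and x is the offset from the ML value, then return sigma = sqrt(1/(2m)).

    Linearizing with u = x^2 gives y = m * u, whose least-squares slope
    through the origin is sum(u*y) / sum(u*u).
    """
    u = [xi * xi for xi in x]
    y = [max_log_lik - l for l in log_lik]
    m = sum(ui * yi for ui, yi in zip(u, y)) / sum(ui * ui for ui in u)
    if m <= 0:
        raise ValueError("likelihood surface is not concave around the ML value")
    return math.sqrt(1.0 / (2.0 * m))


def choose_k(pooled_sd_by_k):
    """Pick the k with the smallest mean pooled SD over branches.
    Values are lists of per-branch pooled SDs (assumed precomputed)."""
    return min(pooled_sd_by_k,
               key=lambda k: sum(pooled_sd_by_k[k]) / len(pooled_sd_by_k[k]))


# Synthetic check: an exactly Normal log-likelihood with sigma = 0.2.
true_sigma, max_ll = 0.2, -100.0
offsets = [-0.3, -0.2, -0.1, 0.1, 0.2, 0.3]
log_liks = [max_ll - xi * xi / (2 * true_sigma ** 2) for xi in offsets]
sigma_hat = fit_sigma(offsets, log_liks, max_ll)

best_k = choose_k({0.30: [0.2, 0.4], 0.50: [0.1, 0.2]})
```

On real data the likelihood curve is only approximately parabolic, so the recovered sigma depends on how far from the ML value the fit is allowed to extend.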
3. Simulated models of evolution
We used the evolver program from the PAML package (Yang, 1997) to generate sequences of
300 codons for 30 species, using the γ-proteobacteria species tree. All simulations had 3 site
classes, as described in the main text. S:F was computed for each branch as described for the real
data. The slow:fast cutoff of k = 0.33 was chosen using the minSD method (see Section 2).
4. Estimation of synonymous and nonsynonymous substitution rates (dS and dN)
We used two methods to estimate dN and dS for each branch in each gene tree: (i) ML
estimation, using the codeml program from the PAML 4.0 package (Yang, 1997), allowing
dN/dS to vary freely among branches ('free-ratio' model), and (ii) using the NG86 'counting'
method of Nei and Gojobori (1986) to estimate dN and dS between each branch and its ancestor
sequence (from the joint reconstruction). Because ancestral reconstructions were performed
using PAML's likelihood model, transition/transversion biases, nonuniform codon usage, and
variable rate categories are accounted for in the NG86 estimates of dN and dS.
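The final step of the NG86 counting method converts the proportions of synonymous and nonsynonymous differences into dS and dN with the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). The sketch below covers only that step; counting sites and differences from codon pairs is assumed to be done elsewhere, and the function names and input numbers are illustrative, not taken from the study.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance for a proportion p of differing sites.
    The correction is undefined for p >= 0.75 (saturation)."""
    if p >= 0.75:
        return float("inf")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ng86_rates(syn_sites, nonsyn_sites, syn_diffs, nonsyn_diffs):
    """Return (dN, dS) from counts of sites and observed differences
    between a sequence and its reconstructed ancestor."""
    p_s = syn_diffs / syn_sites
    p_n = nonsyn_diffs / nonsyn_sites
    return jukes_cantor(p_n), jukes_cantor(p_s)


# Made-up counts for a 300-codon gene: 200 synonymous and 700
# nonsynonymous sites, with 20 and 7 observed differences.
dN, dS = ng86_rates(syn_sites=200.0, nonsyn_sites=700.0,
                    syn_diffs=20.0, nonsyn_diffs=7.0)
```

The corrected distances are always at least as large as the raw proportions, since the correction accounts for multiple substitutions at the same site.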
References:
Battistuzzi F, Feijao A and Hedges SB (2004) A genomic timescale of prokaryote evolution: insights into the origin of methanogenesis, phototrophy, and the colonization of land. BMC Evolutionary Biology 4:44
Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792-1797
Guindon S and Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696-704
Ochman H and Wilson AC (1987) Evolution in bacteria: evidence for a universal substitution rate in cellular genomes. J. Mol. Evol. 26:74-86
Shapiro BJ and Alm EJ (2008) Comparing Patterns of Natural Selection across Species Using Selective Signatures. PLoS Genet. 4:e23
Storey JD and Tibshirani R (2003) Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U.S.A. 100:9440-9445
Thorne JL, Kishino H and Painter IS (1998) Estimating the rate of evolution of the rate of molecular evolution. Mol. Biol. Evol. 15:1647-1657