Supplementary Methods (doc 42K)

1. Construction of gene- and species-trees

Protein sequences were aligned using MUSCLE (Edgar, 2004), and all gaps were removed, including one flanking residue on either side. Nucleotide alignments were also performed using MUSCLE, ensuring that codons were consistent with the protein alignment. Gene tree topologies and branch lengths were estimated using PhyML 2.4.5 (Guindon & Gascuel, 2003) with a JTT substitution model and 4 γ-distributed rate categories. As we previously noted, use of the WAG substitution model instead of JTT did not significantly influence the branch lengths or topologies of these proteins (Shapiro & Alm, 2008). The consensus species tree topology was estimated as previously described, using a concatenation of 80 COGs present in single copy in each of the 30 genomes (Shapiro & Alm, 2008).

Divergence times in the species tree were estimated using a relaxed molecular clock model implemented in multidivtime (Thorne et al., 1998) with parameters rttm = 1.75 By, rttmsd = 1, rtrate = 3.9, rtratesd = 3.9, brownmean = 0.6, brownsd = 0.6, bigtime = 5.0, and the constraint that E. coli and Salmonella diverged between 0.057 and 0.176 billion years ago (Ochman & Wilson, 1987; Battistuzzi et al., 2004).

2. Detailed procedures for choosing k

For DNA data, k is the (approximate) proportion of sites in the multiple sequence alignment classified as 'slow'. k is 'approximate' because multiple sites might be tied at the same rate (substitutions/site), so the cutoff is drawn to minimize the deviation from the desired k (e.g. if the desired k were 0.67, falling between sites with rank 0.60 and 0.70, the cutoff would be chosen at 0.70).

For AA data, k is the rank-order of the site where the cutoff is drawn.
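The two definitions of k can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the function names, the input format (one substitution count per site), and the tie-breaking rule when two boundaries are equally close are all assumptions. The second data set is the 10-site alignment used as the rank-order example in this section.

```python
from itertools import groupby


def class_boundaries(rates):
    """Cumulative proportion of sites at or below each distinct rate value."""
    ordered = sorted(rates)
    n = len(ordered)
    boundaries, seen = [], 0
    for _, group in groupby(ordered):
        seen += sum(1 for _ in group)
        boundaries.append(seen / n)
    return boundaries


def dna_cutoff(rates, desired_k):
    """Achievable slow-site proportion closest to desired_k (DNA definition).

    Assumptions: if two boundaries are equally close, the lower one is kept,
    and the final boundary (1.0, every site slow) is not excluded.
    """
    return min(class_boundaries(rates), key=lambda b: abs(b - desired_k))


def rank_order_k(rates, max_slow_rate):
    """Rank-order k (AA definition): fraction of distinct rate classes called slow."""
    classes = sorted(set(rates))
    n_slow = sum(1 for r in classes if r <= max_slow_rate)
    return n_slow / len(classes)


# Hypothetical per-site rates: 6, 1 and 3 sites at rates 0, 1 and 2,
# so the achievable slow proportions are 0.6, 0.7 and 1.0.
dna_rates = [0] * 6 + [1] * 1 + [2] * 3
print(dna_cutoff(dna_rates, 0.67))

# 10 sites with 1, 2, 3 and 4 substitutions in 4, 3, 1 and 2 sites.
aa_rates = [1] * 4 + [2] * 3 + [3] * 1 + [4] * 2
print(rank_order_k(aa_rates, 1))
```

Working with the boundaries between tied rate classes, rather than with individual sites, guarantees that sites with identical rates are never split between the slow and fast categories.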
As an example of the rank-order definition, in an alignment of 10 sites with 1, 2, 3, and 4 substitutions in 4, 3, 1, and 2 sites, respectively, k would be 0.25 (1/4) if the cutoff were drawn between 1 and 2 substitutions/site, even though this would result in a proportion of 0.40 (4/10) of all sites being classed as 'slow'.

minFDR

In this method, we compute S:F ratios for all genes in our dataset to yield a distribution of p-values for each possible choice of k. We then choose the value of k producing the distribution with the lowest FDR. This is done by performing a 1-sided Kolmogorov-Smirnov (KS) test of the observed p-value distribution against a uniform distribution of p-values. The higher the value of the KS test D statistic, the more favorable the ratio of true:false positives. As verification that the D statistic is indeed a good indicator of FDR, we plotted the observed p-value distribution (ranging from 0 to 1, binned in increments of 0.05) for each value of k, and calculated FDR by dividing the average number of counts in bins with p > 0.5 by the number of counts in the p < 0.05 bin. We then chose the value of k that minimized FDR. The two approaches (maximizing D and minimizing the binned FDR estimate) always yielded the same choice of k.

minSD

As in the minFDR method, we range k from 0.05 to 0.95 in increments of 0.05. For each k, the sequence is split into slow- and fast-sites, according to their rank-order. Branch lengths are then estimated twice independently with PhyML, once using only the slow-site distribution and once using only the fast-site distribution. The result is two gene trees: one estimated using slow-sites, the other using fast-sites. If, for example, some fast-sites are miscategorized as slow, most of the slow-sites will support short branch lengths, but a few (miscategorized) sites will support long branch lengths. A branch in the gene tree could then be 'stretched' (inferred to have undergone more substitutions per site) without significantly reducing the likelihood.
Such 'stretching' of each branch can be used to estimate the standard deviation of each branch length in the gene tree, which is our goal. Starting from the maximum-likelihood (ML) branch length, the branch length is stretched to 300% of its ML length, in increments of 5%, and then compressed to zero, also in increments of 5%. The likelihood of the gene tree is re-estimated with PhyML for each incremental stretch of the branch to produce a distribution of N branch lengths and their associated likelihoods, for each branch. For branches with non-zero ML lengths, the likelihood of a particular branch length could be well-approximated by a Normal distribution, transformed for simplicity such that the mean (ML branch length) is zero (Equation 4). The log-likelihood distribution can therefore be approximated by a parabola whose intercept on the ln L axis equals $\ln\left[1/(\sigma\sqrt{2\pi})\right]$ (Equation 5).

Equation 4. $L = p_x(x) \sim N(0, \sigma^2)$

Equation 5. $\ln L = \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{x^2}{2\sigma^2}$

where the likelihood (L) is a function of log branch length (x), given by a Normal probability density function with mean set to zero (by subtracting the PhyML-estimated ML branch length) and variance $\sigma^2$. Letting $y = \ln\left[1/(\sigma\sqrt{2\pi})\right] - \ln L$ and $m = 1/(2\sigma^2)$, and substituting the estimated maximum likelihood for the term $1/(\sigma\sqrt{2\pi})$, yields the parabola $y = mx^2$. We then estimated the variance and standard deviation by linearizing the parabola and solving for the best-fit slope (m).

For each branch, we combined slow- and fast-sites to calculate a pooled standard deviation. We repeated this procedure for a range of values of k, and chose the k that minimized the mean pooled standard deviation over all branches in the gene tree. Note that with this method, each gene family can take a different value of k.

3. Simulated models of evolution

We used the evolver program from the PAML package (Yang, 1997) to generate sequences of 300 codons for 30 species, using the γ-proteobacteria species tree.
All simulations had 3 site classes, as described in the main text. S:F was computed for each branch as described for the real data. The slow:fast cutoff of k = 0.33 was chosen using the minSD method (see Section 2).

4. Estimation of synonymous and nonsynonymous substitution rates (dS and dN)

We used two methods to estimate dN and dS for each branch in each gene tree: (i) ML estimation, using the codeml program from the PAML 4.0 package (Yang, 1997), allowing dN/dS to vary freely among branches ('free-ratio' model), and (ii) the NG86 'counting' method of Nei and Gojobori (1986), used to estimate dN and dS between each branch and its ancestor sequence (from the joint reconstruction). Because ancestral reconstructions were performed using PAML's likelihood model, transition/transversion biases, nonuniform codon usage, and variable rate categories are accounted for in the NG86 estimates of dN and dS.

References:

Battistuzzi F, Feijao A and Hedges SB (2004) A genomic timescale of prokaryote evolution: insights into the origin of methanogenesis, phototrophy, and the colonization of land. BMC Evolutionary Biology 4:44

Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792-1797

Guindon S and Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696-704

Ochman H and Wilson AC (1987) Evolution in bacteria: evidence for a universal substitution rate in cellular genomes. J. Mol. Evol. 26:74-86

Shapiro BJ and Alm EJ (2008) Comparing Patterns of Natural Selection across Species Using Selective Signatures. PLoS Genet. 4:e23

Storey JD and Tibshirani R (2003) Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U.S.A. 100:9440-9445

Thorne JL, Kishino H and Painter IS (1998) Estimating the rate of evolution of the rate of molecular evolution. Mol. Biol. Evol. 15:1647-1657
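As an illustration of the variance estimate in the minSD procedure (Section 2, Equations 4 and 5), the parabola fit can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name and inputs are hypothetical, x is assumed to be already centred on the ML estimate, the largest profile log-likelihood stands in for the intercept term, and the slope m is obtained by ordinary least squares through the origin after the substitution u = x^2.

```python
import math


def sigma_from_profile(x, log_lik):
    """Estimate sigma from a likelihood profile centred on the ML estimate.

    x       : centred values, zero at the ML branch length (hypothetical input)
    log_lik : ln L evaluated at each value of x
    """
    # The largest ln L stands in for the intercept ln[1 / (sigma * sqrt(2*pi))].
    max_log_lik = max(log_lik)
    y = [max_log_lik - ll for ll in log_lik]
    # Linearize the parabola y = m * x^2 by regressing y on u = x^2.
    u = [xi * xi for xi in x]
    # Least-squares slope for a line through the origin.
    m = sum(ui * yi for ui, yi in zip(u, y)) / sum(ui * ui for ui in u)
    # m = 1 / (2 * sigma^2), so sigma = sqrt(1 / (2 * m)).
    return math.sqrt(1.0 / (2.0 * m))


# Synthetic profile generated from Equation 5 with a known sigma.
true_sigma = 0.5
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
profile = [
    -math.log(true_sigma * math.sqrt(2.0 * math.pi)) - xi ** 2 / (2.0 * true_sigma ** 2)
    for xi in xs
]
estimated_sigma = sigma_from_profile(xs, profile)
print(estimated_sigma)
```

With real PhyML output the profile is only approximately parabolic, so the recovered sigma depends on how far from the ML length the stretched and compressed values extend.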
