Supplementary Material

1. Study selection
2. The "deep sequencing" approach introduces nucleotide biases
3. Analyses for the complete set of biochemical predictors in AAIndex
4. How to reproduce figures and tables
References

1. Study selection

Published studies were identified for a meta-analysis of the fitness effects of amino acid replacements caused by single transitions or transversions, i.e., the so-called "singlet" replacements. Studies chosen for inclusion meet a size threshold of 20 singlet replacements for studies of random mutations, and 10 for studies of beneficial mutations (which tend to be smaller). A random set of mutations is a set designated by a random choice, whether by deliberately engineering a set of randomly assigned mutations (Sanjuan, et al. 2004), by using a procedure such as error-prone PCR, or by exhaustively synthesizing all possible versions of a sequence. Beneficial mutations are chosen by selection or screening under some condition.

We restricted our attention to studies that measure fitness, as distinct from binding or activity. Such studies measure fitness in various ways.
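The singlet criterion and the size thresholds stated at the start of this section can be illustrated with a minimal Python sketch. It is not taken from the authors' scripts: the function names, the use of the DNA alphabet, and the assumption of the standard genetic code are illustrative choices, and only the thresholds of 20 and 10 come from the text.

```python
from itertools import product
from typing import Optional

# Assumption: standard genetic code, DNA alphabet, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

NUCLEOTIDES = set(BASES)
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Minimum number of singlet replacements per study, as stated in the text.
MIN_SINGLETS = {"random": 20, "beneficial": 10}


def mutation_class(ref: str, alt: str) -> str:
    """Classify a single-nucleotide change as a transition or a transversion."""
    if ref not in NUCLEOTIDES or alt not in NUCLEOTIDES or ref == alt:
        raise ValueError(f"not a valid nucleotide change: {ref!r} -> {alt!r}")
    pair = {ref, alt}
    # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def singlet_replacement(codon_from: str, codon_to: str) -> Optional[str]:
    """Return the mutation class if the codon change is a singlet replacement.

    A singlet replacement is taken here to mean exactly one nucleotide
    difference that changes one amino acid into another (no stop codons).
    Returns None for any other codon pair.
    """
    if codon_from not in CODON_TABLE or codon_to not in CODON_TABLE:
        raise ValueError("codons must be three letters from A, C, G, T")
    diffs = [(a, b) for a, b in zip(codon_from, codon_to) if a != b]
    aa_from, aa_to = CODON_TABLE[codon_from], CODON_TABLE[codon_to]
    if len(diffs) != 1 or aa_from == aa_to or "*" in (aa_from, aa_to):
        return None
    return mutation_class(*diffs[0])


def meets_size_threshold(study_type: str, n_singlets: int) -> bool:
    """Apply the inclusion threshold for a 'random' or 'beneficial' study."""
    return n_singlets >= MIN_SINGLETS[study_type]
```

For example, singlet_replacement("GAA", "AAA") classifies the Glu to Lys change as a transition, whereas the synonymous change GAA to GAG is not counted as a replacement.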
We ignore differences in precisely how fitness is defined and measured (Chevin 2011), and accept any study that uses a measure of growth, including the measurement of intrinsic growth rate in a pure culture, pairwise competitive growth assays (mutant against wild-type) and growth in mixed culture (as in some high-throughput studies). Thus, we use the growth-rate results from Jacquier, et al. (2013) rather than any activity or resistance measurements. However, upon discovering extreme and idiosyncratic biases in studies that use deep sequencing to identify and quantify mutants, we rejected these studies as inappropriate (see next section).

To understand the nature of these criteria, it is helpful to discuss the kinds of studies that are excluded. There are probably dozens of systematic studies of singlet replacements that report on other phenotypes (e.g., antibiotic resistance, biochemical activity) without reporting on fitness (e.g., a dozen cited in Yampolsky and Stoltzfus 2005). Studies that measure fitness may be excluded for several reasons, most often because they do not have enough singlet replacements (e.g., Betancourt 2009; Bataillon, et al. 2011; McDonald, et al. 2011). For instance, McDonald, et al. (2011) isolated 100 beneficial mutants, genotyped 20, and found 13 different mutations, 11 of which were deletions, and only 2 of which were singlet replacements. Some studies look at the distribution of fitness effects without determining genotypes (e.g., Barrett, et al. 2006; Kassen and Bataillon 2006), or they assign fitnesses to lineages with multiple changes, rather than to individual replacements (Holder and Bull 2001; Rokyta, et al. 2009).

In two cases, a relevant study was included wholly or largely in a later study. Most of the data reported by MacLean and Buckling (2009) are reported in a later paper from the same laboratory (MacLean, et al. 2010), which we used instead. Mutants reported by Rokyta, et al. (2005) recur in Miller, et al. (2011).
Finally, the data from Lind, et al. (2010) are excluded because the authors themselves report that mutant fitness effects are dominated by effects on expression and fail to show predictable protein-level effects.

2. The "deep sequencing" approach introduces nucleotide biases

Some recent studies of mutant fitnesses use the methodology of "deep mutational scanning", in which the fitnesses of thousands of mutants growing in a mixed culture are measured simultaneously using deep sequencing. Though this is a promising technology, its application to the measurement of fitness is subject to extreme nucleotide-level biases. Out of just 13 high-throughput studies that measure fitness using deep sequencing, we found 2 that exhibit extreme nucleotide-level biases, and the biases differ between the 2 studies. Perhaps other studies do not have such biases, but we cannot be sure. In both cases, the authors completely excluded certain pathways on the grounds that they did not feel confident controlling for effects of mutation bias. Yet, the remaining pathways still show extreme biases.

The biases are illustrated in the 2 figures below. Each figure is a matrix of distributions for mutation from one nucleotide to another, where row = from and column = to, e.g., the upper right distribution is for T→G. Each distribution is a histogram of fitness quantiles for mutants of that particular type. If there were no differences in fitness distributions, each histogram would be flat. In the data from Acevedo, et al. (2014), the C→T and G→A transitions are excluded, and among the remaining pathways, the zero values are assigned overwhelmingly to T→R transversions or G→Y transversions (figure at right). In the study by Wu, et al (next page), the G→T and C→A transversions are excluded, and there are disproportionate numbers of zero values assigned to C→G and G→C transversions.

3. Analyses for the complete set of biochemical predictors in AAIndex

The purpose of this section is merely to show that the small samples of 25 indices presented in the main text in Fig 2 and Fig 4 do not present a misleading picture of the entire distribution of biochemical indices in AAIndex. As noted earlier, we discard over half of the indices in AAIndex because they are not genuine biochemical properties, but properties of the evolved distribution of amino acids in natural sequences (e.g., the frequency with which a particular amino acid is found in helices). The two figures below use the entire set of genuine biochemical factors.

The figure at left, which is provided for comparison with Fig 2, shows the power of a binary predictor (based on the named factor) to predict fitness effects. The names are evenly distributed horizontally, so halfway down the list is the median, which corresponds to an AUC of about 0.57. That is, most biochemical predictors have more power than the ti:tv distinction, which has AUC = 0.53 ± 0.03.

The figure at right, for comparison with Fig 4, shows the support of biochemical predictors for the idea that transitions are conservative. About 3/5 of the predictors are above 0.5, and so could be used to rationalize the conservative transitions hypothesis. The other 2/5 of predictors could be used to rationalize the opposite idea. The fact that the entire distribution is weakly biased toward transitions (median AUC = 0.53) is not necessarily evidence of their conservativeness, because scientists clearly have directed their attention to developing predictors that are effective in accounting for observed evolutionary tendencies, which are strongly biased toward transitions.

4. How to reproduce figures and tables

All of the figures and tables are generated by scripts. Please contact Arlin Stoltzfus if you wish to use these scripts. Currently the scripts and the data are in a GitHub archive (http://github.com/arlin/qsme).
The instructions specific for this manuscript are in the file "meta-analyses/SN2015/README.md".

References

Acevedo A, Brodsky L, Andino R. 2014. Mutational and fitness landscapes of an RNA virus revealed through population sequencing. Nature 505:686-690.
Barrett RD, MacLean RC, Bell G. 2006. Mutations of intermediate effect are responsible for adaptation in evolving Pseudomonas fluorescens populations. Biol Lett 2:236-238.
Bataillon T, Zhang T, Kassen R. 2011. Cost of adaptation and fitness effects of beneficial mutations in Pseudomonas fluorescens. Genetics 189:939-949.
Betancourt AJ. 2009. Genomewide patterns of substitution in adaptively evolving populations of the RNA bacteriophage MS2. Genetics 181:1535-1544.
Chevin LM. 2011. On measuring selection in experimental evolution. Biol Lett 7:210-213.
Holder KK, Bull JJ. 2001. Profiles of adaptation in two similar viruses. Genetics 159:1393-1404.
Jacquier H, Birgy A, Le Nagard H, Mechulam Y, Schmitt E, Glodt J, Bercot B, Petit E, Poulain J, Barnaud G, et al. 2013. Capturing the mutational landscape of the beta-lactamase TEM-1. Proc Natl Acad Sci U S A 110:13067-13072.
Kassen R, Bataillon T. 2006. Distribution of fitness effects among beneficial mutations before selection in experimental populations of bacteria. Nat Genet 38:484-488.
Lind PA, Berg OG, Andersson DI. 2010. Mutational robustness of ribosomal protein genes. Science 330:825-827.
MacLean RC, Buckling A. 2009. The distribution of fitness effects of beneficial mutations in Pseudomonas aeruginosa. PLoS Genet 5:e1000406.
MacLean RC, Perron GG, Gardner A. 2010. Diminishing returns from beneficial mutations and pervasive epistasis shape the fitness landscape for rifampicin resistance in Pseudomonas aeruginosa. Genetics 186:1345-1354.
McDonald MJ, Cooper TF, Beaumont HJ, Rainey PB. 2011. The distribution of fitness effects of new beneficial mutations in Pseudomonas fluorescens. Biol Lett 7:98-100.
Miller CR, Joyce P, Wichman HA. 2011. Mutational effects and population dynamics during viral adaptation challenge current models. Genetics 187:185-202.
Rokyta DR, Abdo Z, Wichman HA. 2009. The genetics of adaptation for eight microvirid bacteriophages. J Mol Evol 69:229-239.
Rokyta DR, Joyce P, Caudle SB, Wichman HA. 2005. An empirical test of the mutational landscape model of adaptation using a single-stranded DNA virus. Nat Genet 37:441-444.
Sanjuan R, Moya A, Elena SF. 2004. The distribution of fitness effects caused by single-nucleotide substitutions in an RNA virus. Proc Natl Acad Sci U S A 101:8396-8401.
Yampolsky LY, Stoltzfus A. 2005. The exchangeability of amino acids in proteins. Genetics 170:1459-1472.