Supplementary Methods

Hybridization efficiency was calculated from fundamental thermodynamic principles. The hybridization efficiency (Effhyb), defined as the fraction of template bound by a primer at a given temperature (Mathews et al., 1999), was calculated as follows. First, the program UNAFold (Markham and Zuker, 2008) was used to determine the Gibbs free energy change (ΔG°) of hybridization between primer and template (pt), as well as the free energies of primer folding (pf), target folding (tf) at the target site, and the formation of primer-dimer (pd) products. Since UNAFold allows for a correction factor based on the concentration of ions present in the reaction, the molar concentrations of monovalent cations ([Mon+]; e.g., [K+]) and divalent cations ([Div2+]; e.g., [Mg2+]) were converted to a sodium equivalent concentration [Naeq+] according to Eq. 1 (Owczarzy et al., 2008), which also takes into account the molar concentration of deoxyribonucleotide triphosphates [dNTPs].

$$[\mathrm{Na}_{eq}^{+}] = [\mathrm{Mon}^{+}] + 3.33\sqrt{[\mathrm{Div}^{2+}] - [\mathrm{dNTPs}]} \tag{1}$$

Gibbs free energies were then used to calculate equilibrium constants for all reactions using the fundamental relationship shown in Eq. 2, where R is the ideal gas constant (0.00199 kcal/(mol·K)) and T is the annealing temperature in Kelvin.

$$K = \exp\left(-\frac{\Delta G^{\circ}}{RT}\right) \tag{2}$$

An effective equilibrium constant (Keff) was calculated, according to Eq. 3 (Mathews et al., 1999), as a function of the side-reaction equilibrium constants (Kpf, Ktf, Kpd), the equilibrium constant governing primer-template duplex formation (Kpt), and the molar concentration of primers (P0) present in the PCR reaction.

$$K_{eff} = \frac{4 \, K_{pt} \, K_{pd} \, P_0}{\left(-1 - K_{pf} + \sqrt{(1 + K_{pf})^{2} + 8 \, P_0 \, K_{pd}}\right)\left(1 + K_{tf}\right)} \tag{3}$$

Finally, Effhyb was determined directly from the effective equilibrium constant (Keff) using Eq.
4, which assumes that the primer concentration is much greater than the template concentration (Yilmaz and Noguera, 2004).

$$Eff_{hyb} = \frac{P_0 \, K_{eff}}{1 + P_0 \, K_{eff}} \tag{4}$$

Quantitative PCR (qPCR) was used to measure elongation efficiency of mismatched primers. A strand of template DNA is copied during a PCR cycle when a primer hybridizes to the strand and is subsequently elongated by a DNA polymerase. In this way amplification can be broken into two successive events: hybridization and elongation. We chose to represent elongation efficiency in a similar manner to hybridization efficiency, as the fraction of primers hybridized to the template that are elongated during a PCR cycle. Hence, the overall amplification efficiency (Effamp) can be described as the product of hybridization (Effhyb) and elongation efficiency (Effelong) according to Eq. 5.

$$Eff_{amp} = Eff_{hyb} \cdot Eff_{elong} \tag{5}$$

Using qPCR it is only possible to measure amplification efficiency; therefore, elongation efficiency can only be directly measured when hybridization efficiency is 100%. Thus, in order to measure elongation efficiency with qPCR, we chose to perform experiments at 50°C, and we used primers, either perfectly matched or mismatched, that had greater than 99.9% hybridization efficiency at that temperature. These elongation efficiency measurements were later used to make predictions at higher annealing temperatures because the extension temperature remained constant at 72°C, and it has been previously shown that the rate of mismatched elongation is largely temperature invariant (Innis et al., 1995).

By defining elongation efficiency of perfect match primers to be 100%, we measured elongation efficiency of mismatched primers relative to their perfect match counterpart.
To obtain this measurement, the same initial concentration of template was separately amplified, in triplicate, with either perfect match primers or a primer set containing one mismatched primer (Fig. S2). A delay of threshold cycle (Ct) of the amplification with the mismatched primer relative to the perfect match primer set (ΔCt) was obtained from these experiments. Elongation efficiency was then calculated according to Eq. 6, which empirically approximates the exponential nature of PCR amplifications and takes into account the fact that each PCR cycle performed with a mismatched primer creates a new template that will be perfectly matched in subsequent cycles. This equation also reflects the assumption that Effhyb is equal to 100% in the experiments.

$$Eff_{elong} = 2^{-\Delta C_t} \tag{6}$$

The melting temperatures of primer/template duplexes were calculated from Effhyb equations. A primer’s Effhyb as a function of temperature follows a profile with a 100% Effhyb plateau at low temperature, 0% efficiency plateau at high temperatures, and a sigmoidal efficiency decrease as temperature increases between the two plateaus. The melting temperature (Tm) is the point at which Effhyb is 50%. Accordingly, the melting point of primer/template duplexes was numerically calculated during primer design by finding the temperature at which Effhyb in Eq. 4 becomes 50%.

Perfect match primers were designed to have an average predicted Tm of 65°C (Table S6), and as discussed above, experiments to estimate elongation efficiency were carried out using an annealing temperature of 50°C in order to be at or near the 100% Effhyb plateau. In order to experimentally assess whether this requirement was met, qPCR experiments were conducted using the perfect match primers with annealing temperatures varying from 50°C to 75°C (Fig. S2).
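The numerical melting-point search described above (finding the temperature at which Effhyb in Eq. 4 reaches 50%) can be illustrated with a short sketch. This is not the authors' implementation. It assumes a two-state model in which ΔG° = ΔH − TΔS, neglects the side reactions of Eq. 3 so that Keff is taken equal to Kpt, and uses hypothetical values for ΔH, ΔS and the primer concentration. The function names are likewise illustrative only.

```python
import math

R_KCAL = 0.00199  # gas constant in kcal/(mol*K), the value quoted in the text


def hybridization_efficiency(delta_g, temp_c, primer_conc):
    """Eq. 2 followed by Eq. 4, taking K_eff = K_pt (side reactions ignored)."""
    temp_k = temp_c + 273.15
    k_pt = math.exp(-delta_g / (R_KCAL * temp_k))  # Eq. 2
    bound = primer_conc * k_pt
    return bound / (1.0 + bound)  # Eq. 4


def melting_temperature(delta_h, delta_s, primer_conc, lo=0.0, hi=100.0, tol=1e-6):
    """Bisect for the temperature (deg C) at which Eff_hyb equals 50%."""

    def eff(temp_c):
        # Assumed two-state model: dG(T) = dH - T * dS.
        delta_g = delta_h - (temp_c + 273.15) * delta_s
        return hybridization_efficiency(delta_g, temp_c, primer_conc)

    if not eff(lo) > 0.5 > eff(hi):
        raise ValueError("Tm is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eff(mid) > 0.5:
            lo = mid  # still above 50%: Tm lies at a higher temperature
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical duplex: dH = -160 kcal/mol, dS = -0.44 kcal/(mol*K), 400 nM primer.
tm = melting_temperature(-160.0, -0.44, 4e-7)
print(f"Predicted Tm: {tm:.1f} deg C")
```

Bisection is used here only because Effhyb decreases monotonically with temperature under these assumptions; any bracketing root finder would serve equally well.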
Table S5 reports the observed temperature at which amplification was delayed by no more than two cycles with respect to the 100% efficiency plateau. These temperatures were between 59°C and 70°C, confirming that using 50°C as the annealing temperature for the rest of the study was sufficient to reach the 100% Effhyb plateau.

Primers specific to bacterial and archaeal genera were designed to take advantage of mismatches at or near the primer's 3' end. The primer design algorithm begins with a set of aligned DNA sequences for designing primers. We chose to use the RDP (Cole et al., 2009) database (version 10.30) of “good quality” 16S rRNA gene sequences, which included 2,342,448 Bacteria and 117,236 Archaea. We removed 30,085 possibly chimeric sequences with DECIPHER (Wright et al., 2012) using full-length sequence parameters. The full set of DNA sequences was grouped into categories, in our case according to phylogenetic similarity where each genus formed its own group. Unclassified groups were omitted from the analysis, which resulted in a dataset of 1,640,974 bacterial sequences classified into 1,834 genera, and 47,091 archaeal sequences classified into 109 genera.

Next, a set of overlapping k-mers, termed “tiles,” of 27 nucleotides in length was formed from each group of aligned sequences. For each group, up to 10 tiles representing the most common permutations at each position within the sequence were selected for further analysis. The aim was to select the smallest possible number of tiles that represented at least 90% of the sequences within each group (genus).
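As an illustration of the tile selection just described, the following sketch picks the most common 27-mers at one alignment position until at least 90% of a group's sequences are represented or 10 tiles have been chosen. It is a simplified reading of the text, not the authors' code. The function name, the rule of skipping windows that are truncated or contain gap characters, and the choice to measure coverage against only the sequences spanning the site are all assumptions.

```python
from collections import Counter


def select_tiles(sequences, start, tile_len=27, max_tiles=10, min_coverage=0.9):
    """Return (tiles, coverage) for one alignment position within one group."""
    windows = (seq[start:start + tile_len] for seq in sequences)
    # Assumption: ignore sequences that do not fully span the site or have gaps.
    complete = [w for w in windows if len(w) == tile_len and "-" not in w]
    if not complete:
        return [], 0.0

    selected, covered = [], 0
    # Each sequence contributes exactly one window, so taking tiles in order of
    # decreasing frequency yields the smallest set reaching the coverage target.
    for tile, count in Counter(complete).most_common():
        if len(selected) >= max_tiles or covered / len(complete) >= min_coverage:
            break
        selected.append(tile)
        covered += count
    return selected, covered / len(complete)


# Toy group with 4-nucleotide tiles for readability.
group = ["ACGT"] * 6 + ["ACGA"] * 3 + ["TTTT"]
tiles, coverage = select_tiles(group, start=0, tile_len=4)
print(tiles, coverage)
```

In the toy group the rare variant is left out once the two most common tiles reach the coverage target, which mirrors the trade-off between ambiguity and minor sequence variants.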
This tiling approach is preferable to using consensus sequences for each group, because it results in less sequence ambiguity and minimizes the possibility of designs that target minor sequence variations that may be the result of either sequencing errors or chimeras not detectable with the chimera-removal tool (Wright et al., 2012). The compromise is that this approach may eliminate from the design real sequence variations that are only minor components of the targeted group. The complete set of tiles selected above became the set of subsequences from which primers were designed to target a specific group while avoiding other non-target groups.

In the next step, a set of potential forward and reverse primers meeting the following constraints was designed for each target site. First, a minimum set of perfect match primers must cover a specified fraction of the group, while requiring a minimum percentage of the group’s sequences to include that region. For example, we chose to design forward and reverse primers that covered at least 90% of their genus with up to 4 permutations, while requiring at least 20% of the group’s sequences to overlap the target site (not all sequences span the entire gene). Second, since about 97% of sequence information in the RDP database is contained between Escherichia coli positions 27 and 1406, we only considered tiles in this range in order to obtain a more holistic representation of target and non-target diversity. Third, tiles were also constrained to have no runs of a single base longer than 4 (e.g., AAAA) and no more than 4 di-nucleotide repeats (e.g., ACACACAC), as these may result in false priming due to entropic effects.

Finally, to create primers from each selected tile, the primer’s length was adjusted, base-by-base, starting from 17 nucleotides long and adding one nucleotide to the 5’-end until a defined minimum hybridization efficiency was met.
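The length-adjustment step can be sketched as a simple loop. The sketch assumes that the tile is written 5' to 3' in the primer's orientation and that the primer's 3' end coincides with the tile's 3' end, so that extension proceeds toward the 5' end. The function name and the `efficiency_fn` callback are hypothetical stand-ins for the thermodynamic calculation of Effhyb at the annealing temperature.

```python
def size_primer(tile, efficiency_fn, min_efficiency=0.8, min_len=17):
    """Grow a primer from the 3' end of a tile until it is efficient enough.

    efficiency_fn is a caller-supplied function (hypothetical here) returning
    the predicted hybridization efficiency of a candidate primer sequence.
    Returns the shortest qualifying primer, or None if even the full tile
    falls short of min_efficiency.
    """
    for length in range(min_len, len(tile) + 1):
        primer = tile[-length:]  # 3' end fixed; one more base at the 5' end
        if efficiency_fn(primer) >= min_efficiency:
            return primer
    return None


# Toy efficiency model for demonstration only: primers of 20 nt or more pass.
def toy_efficiency(primer):
    return 1.0 if len(primer) >= 20 else 0.0


tile = "ACGTACGTACGTACGTACGTACGTACG"  # 27 nucleotides
primer = size_primer(tile, toy_efficiency)
print(primer, len(primer))
```

Returning None when no length qualifies corresponds to discarding that position as a candidate target site.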
In this manner, primer permutations are given different lengths such that each permutation has a similar efficiency at the annealing temperature in order to minimize bias. For this study we chose to have primers with at least 80% hybridization efficiency at an annealing temperature of 64°C. PCR primers are often designed for maximum (100%) efficiency at the annealing temperature; however, such a high binding strength is disadvantageous for discriminating between target and non-target templates. After these steps, any positions within the sequences for which the selected tiles could not meet these initial primer design constraints were eliminated from the set of possible target sites. Typically, about 25-75% of 16S positions were viable as candidate target sites after applying the above constraints.

The amplification efficiency of a primer hybridized to a perfectly matched template was calculated assuming an elongation efficiency of 100%, while a systematic set of mismatch rules was created to predict the elongation efficiency of non-targets. Mismatched terminal nucleotides were evaluated using the averaged penultimate base pairing information derived from Fig. 2 (Table S1). Furthermore, elongation efficiencies of single mismatches and indels in positions 2-4 were based on average efficiencies measured in those positions (Fig. 3). Double mismatches and double indels in positions 1-6 were assigned zero efficiency, as multiple mismatches cause both decreased elongation and hybridization efficiencies at the high annealing temperatures typically used in PCR.

Next, each viable set of primers was scored with the objective of minimizing potential cross-amplification with non-target groups. To accomplish this, the primers were hybridized in silico to all other non-target tiles in the same alignment position and elongated in silico to determine their predicted amplification efficiency.
The potential amplification efficiency for each primer at the annealing temperature (64°C) was estimated from hybridization and elongation efficiencies according to Eq. 5. Then, the overall in silico amplification efficiency for a set of forward and reverse primers was approximated by the geometric mean of the forward and reverse efficiencies according to Eq. 7.

$$Eff_{amp} = \sqrt{Eff_{amp\_forward} \cdot Eff_{amp\_reverse}} \tag{7}$$

After performing all in silico amplifications of non-target groups, amplifications with at least 0.1% predicted efficiency were recorded. A specificity score for each potential primer set was then calculated as the sum of the predicted amplification efficiencies for all non-target groups. Potential forward and reverse primer sets were ranked according to their ability to discriminate between their target group and all other non-target groups. This ranked set of primers was screened to eliminate primer combinations that would produce amplicons of undesirable length. In our case, we chose amplicons to be between 300 and 1,200 base pairs so that the PCR product would be long enough for downstream sequencing. Furthermore, a check for potential primer-dimer artifacts was conducted with every combination of forward and reverse primers in the ranked set.

As a final step, a more thorough search was conducted using the top ranked primer sets. The sequence of each primer was again checked against every non-target tile, but this time for every position upstream and downstream of the target site. This search ensured the primer would not unexpectedly cross-hybridize to a position other than the target site on a non-target DNA template. While it may seem unlikely that cross-hybridizations would occur away from the target site, we found examples in which this occurred, and thus, we included this rule for thoroughness.
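The scoring of a primer set against non-target groups (Eqs. 5 and 7 followed by the summed specificity score) might be sketched as below. The function names and the input layout are hypothetical. The sketch also assumes that predicted amplifications under the 0.1% recording threshold are left out of the sum, which the text does not state explicitly, and the efficiency values in the example are invented.

```python
import math


def amplification_efficiency(eff_hyb, eff_elong):
    """Eq. 5: amplification efficiency of a single primer."""
    return eff_hyb * eff_elong


def primer_set_efficiency(eff_amp_forward, eff_amp_reverse):
    """Eq. 7: geometric mean of forward and reverse amplification efficiency."""
    return math.sqrt(eff_amp_forward * eff_amp_reverse)


def specificity_score(non_targets, min_recorded=0.001):
    """Sum predicted amplification efficiencies over non-target groups.

    non_targets maps a group name to (forward, reverse) amplification
    efficiencies. Assumption: values below min_recorded (0.1%) are dropped.
    A lower score indicates a more specific primer set.
    """
    recorded = {}
    for group, (forward, reverse) in non_targets.items():
        eff = primer_set_efficiency(forward, reverse)
        if eff >= min_recorded:
            recorded[group] = eff
    return sum(recorded.values()), recorded


# Invented example: forward primer efficiencies come from Eq. 5.
non_targets = {
    "genus_A": (amplification_efficiency(0.8, 0.05), 0.01),
    "genus_B": (0.0001, 0.0001),
}
score, recorded = specificity_score(non_targets)
print(score, sorted(recorded))
```

Because the geometric mean is zero whenever either primer fails to amplify, a single strongly discriminating primer is enough to remove a non-target group from the score.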
Scores for each primer set were updated based on this exhaustive search, and the optimal primer set was chosen as the one that would amplify the target group while minimizing the in silico amplification of non-targets. In practice, it was not always possible to find an ideal primer set with no predicted amplification of any non-target. In such cases, the predicted amplification efficiency of each potential cross-amplification was recorded for further consideration.

The genus-specific primers developed in this study have been made available online (http://DECIPHER.cee.wisc.edu). For modest primer design tasks we have also made our program accessible online as part of a web tool. In our experience the program’s outputs are adequate for immediate application, but we recommend that PCR conditions be optimized with a temperature gradient using the target template, as is typically performed in PCR. This is especially necessary if using primers designed by the program with an induced mismatch, because they may have a lower melting temperature with the target template. The primer design program can be used for the identification of primer sets (forward and reverse) that minimize potential non-target hybridizations for any DNA sequence and any level of taxonomic or group clustering. Help documentation available with the program assists the user in setting the appropriate input parameters for their PCR experiment and interpreting the program’s output.

REFERENCES

Cole, J.R., Wang, Q., Cardenas, E., Fish, J., Chai, B., Farris, R.J. et al. (2009) The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Research 37: D141-D145.

Innis, M.A., Gelfand, D.H., and Sninsky, J.J. (eds) (1995) PCR Strategies. San Diego, CA: Academic Press, Inc.

Markham, N.R., and Zuker, M. (2008) UNAFold: software for nucleic acid folding and hybridization.
In Methods in Molecular Biology: Humana Press, Totowa, NJ, pp. 3-31.

Mathews, D., Burkard, M., Freier, S., Wyatt, J., and Turner, D. (1999) Predicting Oligonucleotide Affinity to Nucleic Acid Targets. RNA 5: 1458-1469.

Owczarzy, R., Moreira, B.G., You, Y., Behlke, M.A., and Walder, J.A. (2008) Predicting stability of DNA duplexes in solutions containing magnesium and monovalent cations. Biochemistry 47: 5336-5353.

Wright, E.S., Yilmaz, L.S., and Noguera, D.R. (2012) DECIPHER, a Search-Based Approach to Chimera Identification for 16S rRNA Sequences. Applied and Environmental Microbiology 78: 717-725.

Yilmaz, L.S., and Noguera, D.R. (2004) Mechanistic approach to the problem of hybridization efficiency in fluorescent in situ hybridization. Applied and Environmental Microbiology 70: 7126-7139.