Supplementary Materials
1) The relationship between sequence variability and probability of site patterns
To investigate the quantitative relationship between sequence variability and the probability of a
site pattern, we consider a three-taxon star tree (1, 2, 3) in which every branch has the same
length b. On this tree we simulated 10000 amino acid sites under the WAG model for a range of
branch lengths and calculated the probabilities of occurrence of the three site patterns AAA, AAB
and ABC (here A, B and C stand for any of the 20 amino acid residues, with different letters
denoting different residues). Several quantities measuring the differences in probability between
the site patterns are plotted against branch length in Figure S1. For branch lengths less than 2,
the probability of an identical site AAA is on average higher than that of AAB, which is in turn
on average higher than that of the most variable site ABC. As the branches get longer, the
probability differences among the site patterns gradually diminish to 0.
Fig. S1 The differences in the probabilities of occurrence of the site patterns AAA, AAB
and ABC under the WAG model for a star tree (1:b, 2:b, 3:b), as a function of the branch
length b.
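The ordering of the three site patterns can be illustrated with a simplified calculation. The sketch below is not the WAG simulation behind Figure S1: it assumes an equal-rates (Poisson) 20-state model with uniform stationary frequencies, for which the transition probabilities have a closed form, and it computes the exact probability of one specific site of each pattern on the star tree by summing over the unobserved root state. The function names are illustrative only.

```python
import math

N_STATES = 20  # amino acids


def transition_probs(b):
    """Return (P(same residue), P(one specific different residue)) after a
    branch of length b (expected substitutions per site).

    Assumption: equal-rates Poisson model, not WAG.
    """
    decay = math.exp(-N_STATES * b / (N_STATES - 1))
    p_same = 1.0 / N_STATES + (N_STATES - 1) / N_STATES * decay
    p_diff = 1.0 / N_STATES - 1.0 / N_STATES * decay
    return p_same, p_diff


def pattern_probs(b):
    """Probability of one specific AAA, AAB and ABC site on the star tree
    (1:b, 2:b, 3:b), summing over the 20 possible root states."""
    p_s, p_d = transition_probs(b)
    pi = 1.0 / N_STATES  # uniform root frequency (assumption)
    # AAA: the root matches the tips (1 state) or differs from them (19 states).
    aaa = pi * (p_s**3 + (N_STATES - 1) * p_d**3)
    # AAB: the root is A, is B, or is one of the 18 other residues.
    aab = pi * (p_s**2 * p_d + p_s * p_d**2 + (N_STATES - 2) * p_d**3)
    # ABC: the root is A, B or C, or is one of the 17 other residues.
    abc = pi * (3 * p_s * p_d**2 + (N_STATES - 3) * p_d**3)
    return aaa, aab, abc


for b in (0.1, 0.5, 1.0, 2.0, 5.0):
    aaa, aab, abc = pattern_probs(b)
    print(f"b={b:<4} AAA-AAB={aaa - aab:.3e}  AAB-ABC={aab - abc:.3e}")
```

Under this simplified model the differences reduce to (p_same + p_diff)(p_same - p_diff)^2 / 20 and p_diff (p_same - p_diff)^2 / 20, so both are positive for every finite b and shrink toward 0 as b grows. The WAG results in Figure S1 are averages over residues with unequal frequencies and exchangeabilities, so they need not follow these formulas exactly.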
2) The correlation between the minimum number of differing nucleotides in the codons for a
pair of amino acids and the pair's amino acid exchangeability score
Figure S2 shows box plots of the WAG exchangeabilities (Whelan S, Goldman N: Mol Biol Evol
18:691-699, 2001) as a function of the minimum number of differing nucleotides in the codons.
The 75 amino acid pairs whose codons can differ by a single nucleotide have significantly higher
WAG scores than the 101 pairs whose codons differ by at least two nucleotides, which in turn
have higher WAG scores than the 14 pairs whose codons differ at all three nucleotide
positions.
Fig. S2 Distribution of the WAG exchangeabilities for pairs of amino acids. Pairs of amino acids
are binned according to the minimum number of changes required to get from one to the other.
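The binning used in Figure S2 can be sketched as follows. This is a minimal illustration, assuming the standard genetic code (written here in the common TCAG-ordered compact form); the function names are hypothetical and the WAG exchangeabilities themselves are not included. For each of the 190 amino acid pairs it takes the minimum, over all codon pairs, of the number of differing nucleotide positions.

```python
from collections import Counter
from itertools import combinations, product

BASES = "TCAG"
# Standard genetic code, codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
# ('*' marks a stop codon). Assumption: the analysis uses the standard code.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codons_by_amino_acid():
    """Map each amino acid (one-letter code) to its list of sense codons."""
    table = {}
    for codon, aa in zip(product(BASES, repeat=3), AMINO):
        if aa != "*":
            table.setdefault(aa, []).append("".join(codon))
    return table


def min_nucleotide_differences(codons_a, codons_b):
    """Smallest number of differing positions over all codon pairs."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_a
        for c2 in codons_b
    )


def pair_distances():
    """Minimum codon distance for every unordered pair of amino acids."""
    table = codons_by_amino_acid()
    return {
        (a, b): min_nucleotide_differences(table[a], table[b])
        for a, b in combinations(sorted(table), 2)
    }


distances = pair_distances()
counts = Counter(distances.values())
for n_diff in sorted(counts):
    print(n_diff, counts[n_diff])
```

The printed bin sizes can be compared with the counts of amino acid pairs quoted in the text. Grouping the WAG exchangeabilities by `distances[(a, b)]` would then give the three groups plotted in the figure.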
3) Comparing power of the six simulation cases for SLL using different codon frequencies in
analyzing the data under M8A.
To determine thresholds for the SLL test, a null distribution is obtained from a large number
of codons (e.g., 20,000) simulated under neutral evolution with the M8A model, using the
parameter values estimated from the original data under the M8A model.
Two commonly used codon frequency models can be used in M8A:
(1) F0 model: equal codon frequencies (all 1/61);
(2) F3x4 model: codon frequencies expected from the nucleotide frequencies at the three
codon positions.
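The two frequency models listed above can be written out as a short sketch. It assumes the standard genetic code (61 sense codons) and the usual F3x4 construction, in which each codon's frequency is the product of the nucleotide frequencies at its three positions, renormalized over the sense codons. The function names and the example nucleotide frequencies are hypothetical, not values from the Lysin data.

```python
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code

SENSE_CODONS = [
    "".join(c) for c in product(BASES, repeat=3)
    if "".join(c) not in STOP_CODONS
]


def f0_frequencies():
    """F0: every sense codon has frequency 1/61."""
    return {codon: 1.0 / len(SENSE_CODONS) for codon in SENSE_CODONS}


def f3x4_frequencies(position_freqs):
    """F3x4: product of position-specific nucleotide frequencies,
    renormalized so the 61 sense codons sum to 1.

    position_freqs is a list of three dicts mapping base -> frequency,
    one dict per codon position.
    """
    raw = {
        codon: (position_freqs[0][codon[0]]
                * position_freqs[1][codon[1]]
                * position_freqs[2][codon[2]])
        for codon in SENSE_CODONS
    }
    total = sum(raw.values())
    return {codon: value / total for codon, value in raw.items()}


# Hypothetical nucleotide frequencies, for illustration only.
example_freqs = [
    {"T": 0.20, "C": 0.20, "A": 0.35, "G": 0.25},
    {"T": 0.30, "C": 0.20, "A": 0.30, "G": 0.20},
    {"T": 0.25, "C": 0.30, "A": 0.15, "G": 0.30},
]
f3x4 = f3x4_frequencies(example_freqs)
print(len(f3x4), round(sum(f3x4.values()), 6))
```

With uniform nucleotide frequencies at all three positions the F3x4 frequencies reduce to the F0 frequencies, which is a convenient sanity check.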
The following table shows the power of SLL under the two codon frequency models for the
six cases of positive selection introduced in the main text.
Table S1 Power of the SLL tests using different codon frequencies in the M8A simulations.
# of positive sets predicted when simulating under M8A with the following codon frequency
models (F0, F3x4)

                                           F0                                F3x4
Positive selection case^a                  Binomial   Simes  Combined and    Binomial   Simes  Combined and
                                           test*      test   unique sets*    test*      test   unique sets*
1) Original conditions from Lysin data     100 (100)  73     100 (100)       100 (100)  63     100 (100)
2) Branch lengths increased 10 fold        72 (97)    13     73 (97)         74 (98)    9      74 (98)
3) Branch lengths decreased 10 fold        67 (96)    31     79 (96)         58 (96)    28     71 (96)
4) Weak conditions 1 (ω3=1.5, p3 = 0.05)   16 (76)    13     27 (80)         4 (77)     12     16 (80)
5) Weak conditions 2 (ω3=1.5, p3 = 0.269)  71 (98)    16     75 (98)         44 (89)    16     50 (90)
6) Concatenated 3 M0 datasets              10 (57)    5      15 (60)         3 (32)     2      5 (32)

^a For each case, 100 datasets each of 200 codon sites were simulated and analyzed.
* The first number was based on θc = 0.075 corresponding to the standard site-wise test α = 0.05
and the second number in brackets was based on θc = 0.05 corresponding to a site-wise test α =
0.03 (see main text for details).