A New Method for Estimating Nonsynonymous Substitutions and Its

A New Method for Estimating Nonsynonymous Substitutions and
Its Applications to Detecting Positive Selection
Hua Tang and Chung-I Wu
Department of Ecology and Evolution, University of Chicago
The standard methods for computing the number of nonsynonymous substitutions (Ka) lump all amino acid changes
into one single class, even though their rates of substitution vary by at least 10-fold (Tang et al., 2004). Classifying
these changes by their physicochemical properties has not been suitably effective in isolating the fastest evolving classes
of changes. We now propose to use the Universal index U of Tang et al. (2004) to classify the 75 elementary amino
acid changes (codons differing by 1 bp) by their evolutionary exchangeability. Let Ki denote the Ka value of each class
(i 5 1, ., 75 from the most to the least exchangeable). The cumulative Ki for the top 10 classes, denoted Kh (for highexchangeability types), has two important properties: (1) Kh usually accounts for 25%–30% of total amino acid changes and
(2) when the observed number of amino acid substitutions is large, Kh is predictably twice the value of Ka. This shall be
referred to as the twofold approximation. The new method for estimating Kh is applied to the comparisons between human
and macaque and between mouse and rat. The twofold approximation holds well in these data sets, and the signature of
positive selection can be more easily discerned using the Kh statistic than using Ka. Many genes with Ka/Ks . 0.5 can now be
shown to have Kh/Ks . 1 and to have evolved adaptively, at least for the high-exchangeability group of amino acid changes.
Introduction
The estimation of nucleotide changes is fundamental to
molecular evolutionary studies (Li, 1997). For coding
regions, a simple approach is to estimate the Ka (number
of nonsynonymous changes per nonsynonymous site) and
Ks (number of synonymous changes per synonymous site)
values separately (Li, Wu, and Luo, 1985; Nei and Gojobori,
1986; Yang and Bielawski, 2000). While synonymous
changes are weakly, and presumably more or less uniformly,
constrained, the strength of selective constraint varies
greatly among different types of amino acid mutations.
For example, both Ser to Thr and Cys to Tyr are nonsynonymous changes, but their rates of substitutions probably differ by more than 10-fold (Tang et al., 2004). Grouping
nonsynonymous changes that have similar evolutionary dynamics into separate categories is, therefore, likely to be
a useful strategy in discerning hidden patterns of molecular
evolution.
There are many ways to decompose nonsynonymous
changes into individual classes. For example, there have
been attempts at classifying the 20 amino acids into groups
according to their charge, polarity, volume, and so on.
Amino acid substitutions within groups are considered conservative, whereas those between groups are radical
(Zhang, 2000). Several other measures such as Grantham’s
distance have also been proposed to quantify the differences
between amino acids (Grantham, 1974). However, it does
not appear that amino acid changes so classified would have
comparable evolutionary dynamics (Rand, Weinreich, and
Cezairliyan, 2000; Zhang, 2000). Many amino acid
changes classified as conservative in fact evolved slowly,
whereas others so classified evolved much more rapidly
(Yang, Nielsen, and Hasegawa, 1998; Rand, Weinreich,
and Cezairliyan, 2000). The lack of consistency arose from
the nonempirical nature of amino acid classifications.
Key words: amino acid substitution, evolutionary index, positive
selection.
E-mail: ciwu@uchicago.edu.
Mol. Biol. Evol. 23(2):372–379. 2006
doi:10.1093/molbev/msj043
Advance Access publication October 19, 2005
Ó The Author 2005. Published by Oxford University Press on behalf of
the Society for Molecular Biology and Evolution. All rights reserved.
For permissions, please e-mail: journals.permissions@oxfordjournals.org
An empirical system of classifying amino acid
changes by their evolutionary exchangeability has recently
been developed by Tang et al. (2004). Among the 20 amino
acids, there are 190 possible changes, of which only 75
kinds can be substituted with a 1-bp change in the codons.
Each of these 75 kinds of amino acid mutations is referred
to as an elementary amino acid change. The remaining 115
kinds are composites of two or three elementary changes. In
Tang et al. (2004), we estimated the evolutionary index
(EI(i), i 5 1–75) for each of the 75 kinds of changes. EI
is the equivalent of the Ka/Ks ratio for each kind. Between
closely related species, EIs can be accurately computed
when a large number of DNA sequences are available.
This method of EI for amino acid changes differs from
earlier systems in three important ways. First, EI is codon
based, whereas earlier methods such as the PAM matrix
(Dayhoff, Schwartz, and Orcutt, 1978) were based on amino
acid sequences. Second, EI is computed between closely related species, hence requiring a large number of DNA
sequences. Third, EIs among different gene sets from diverse taxa have been shown to be highly correlated. This
has led to the proposal of a universal measure of amino acid
exchangeability, U. For any large data set, we only need to
a/ K
s in order to compute the expected EIs,
know the mean K
which are linearly correlated with the constant scale, U.
One of the many reasons for separating synonymous
and nonsynonymous changes (and for classifying amino
acid substitutions) is to detect positive selection. If the
Ka/Ks ratio is significantly greater than 1, the sequence evolution is usually interpreted to be driven by positive selection, on the assumption that synonymous changes are the
proxy of neutral changes (Li, 1997). Although synonymous
changes are generally not neutral (Akashi, 1995; Hellmann
et al., 2003; Lu and Wu, 2005; discussed later), the Ka/Ks
ratio remains relatively free of assumptions for inferring the
action of natural selection.
The Ka/Ks . 1 test is perhaps overly stringent as it requires the acceleration in amino acid substitutions, due to
positive selection at some sites, to overcompensate for
the retardation at other sites due to negative selection. In this
study, we demonstrate how grouping amino acid changes by
The 2,369 pairs of orthologs between human and the
macaque monkey were aligned using the DNASTAR Megalign program. The 1,306 orthologs between mouse and rat
were taken from those used in Tang et al. (2004). To calculate the number of substitutions for each of the 75 classes
of elementary changes (Ki), large number of changes have
to be scored. In those cases, we use the mean weighted by
the gene length. It is equivalent to stringing together all
coding sequences to create a ‘‘supersequence.’’
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
Evaluation of ðKi* Ka Þ= VarðKi* Þ in U Data Sets
In any U data set, Ki is assumed to equal exactly
cUi ; i51; 2; .; 75; where c is a scaling factor equal to the av*
erage Ka of the
Pdata set. The
Pcumulative
PKi for the first
P i classes is Ki* 5 ij51 Kj Lj = ij51 Lj 5 ij51 cUj Lj = ij51 Lj ,
where Lj is the estimated length for jth-type amino acid
changes in the data set. VarðKi Þ can be approximated as
VarðKi Þ 5 pi ð1 pi Þ=½Li ð1 4pi =3Þ2 ; where pi 5 4=3
ð1 e4=3Ki Þ (Li, 1997). Between closely related species,
Ki 1 for i 5 1, 2, ., 75. Thus, VarðKi Þ can be further approximated as P
VarðKi Þ5KP
i =Li whichPthen gives
*
Þ 5 Varð ij51 Kj Lj ij51 Lj Þ5 ij51 cUj Lj =
rise
i
Pi to VarðK
ð j51 Lj Þ2 :
An optimal i (number of classes)pcan
be found
by
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ffi
maximizing the quantity ðKi* Ka Þ= VarðKi* Þ (where
*
Þ
Ka [ K75
Ki* Ka
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
*
VarðKi Þ
.P
.P
Pi
P75
i
75
j 5 1 cUj Lj
j 5 1 Lj j 5 1 cUj Lj
j 5 1 Lj
rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
5
.
Pi
Pi
2
j 5 1 cUj Lj ð
j 5 1 Lj Þ
.P
P75
Pi
pffiffiffiPi
75
c
j 5 1 Uj Lj j 5 1 Uj Lj 3
j 5 1 Lj
j 5 1 Lj
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
5
:
Pi
j 5 1 Uj Lj
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
The optimal i at which ðKi* Ka Þ= VarðKi* Þ is
maximized does not depend on thep
exact
values
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ffi of Li and
c. In figure 1, we plot ðKi* Ka Þ= VarðKi* Þ versus i by
choosing c 5 P
0.05 (equivalently, the average Ka is equal
75
to 0.05) and
i51 Li 510; 000 for convenience because
the general shapes are not affected by their values. Also
8
4
2
0.5
(K*i −Ka)/SD(K*i )
K*i /Ks
1.0
6
(K*i −Ka)/SD(K*i )
0
Materials and Methods
DNA Sequences
K*i /Ks
0.0
their U-index and computing the associated Ka/Ks values can
substantially augment the power of detecting the signature
of positive selection. Moreover, we establish the statistical
criteria for isolating the high-exchangeability amino acid
pairs from the rest. The average Ka value of this group will
be referred to as Kh, which generally accounts for 25%–30%
of total amino acid changes. We show that Kh on average is
about twice the value of Ka at the genomic level. There are
many cases where Kh/Ks . 1, indicating positive selection,
whereas the traditional Ka/Ks does not reveal adaptive
evolution.
1.5
Estimating Nonsynonymous Substitutions and Its Applications to Detecting Positive Selection 373
0
10
20
30
40
50
60
70
i
FIG. 1.—Ki* =Ks (thick solid line) and ðKi* KaÞ =SDðKi* Þ (open
circles) from a generalized U data set with a weighted average of Ka 5
0.05 and Ks 5 0.1 are evaluated. i denotes each class of elementary amino
acid change, ranging from 1 (most exchangeable) to 75 (least exchangeable). The more exchangeable classes with i 10 are separately analyzed.
*
=Ks (5Kh =Ks ) is slightly greater than 1 or twice the value of
Note that K10
the average Ka =Ks .
P
P
Ki* =Ks 5 ij51 cUj Lj = ij51 Lj ð1=Ks Þ versus i is plotted in
the same figure by choosing c/Ks 5 0.5 (i.e., the average
Ka/Ks of the data set is set to be 0.5). In our calculations,
we used the genome-wide frequencies of sites for the 75
kinds of elementary nonsynonymous changes for rodents
which are included in the Table S1 in the Supplementary
Material online. Because mammalian genome codon frequencies are highly correlated and genome-wide ts/tv biases
are similar, the exact genomes used for this purpose do not
matter much.
Results
Computing Nonsynonymous Substitutions of
the ith Kind, Ki (and Ki* )
The Ka value of a gene represents the average rate of
amino acid substitution for that gene. Nonsynonymous substitutions between some types of amino acid changes may
take place much faster than the average rate. Our objective
is to isolate the highly exchangeable types of amino acid
changes from the rest, based on our previous analysis of
two species of yeast and two species of rodents (Tang
et al., 2004). By the criteria of Tang et al. (2004), we classify the 75 kinds of elementary amino acid changes (those
that differ by 1 bp) by a universal ranking, Ui (see Table S1,
Supplementary Material online). In the Ui classification, i 5
1 denotes the most exchangeable (Ser 4 Thr in this case)
and i 5 75 the least exchangeable (Asp 4 Tyr) kind. Generally, the high-exchangeability pairs are more conservative
in their physicochemical properties. For comparison, we
also ranked the changes by (1) the increasing order of Grantham’s distance and (2) rankings by random permutations.
We show below how the Ka value for the ith class of
amino acid changes, referred to as Ki, is calculated. Ki/Ks is
the EI of the ith class, EI(i), as defined by Tang et al. (2004).
374 Tang and Wu
We also define Ki* as the cumulative Ki value of the first i
classes. Ki* is thus the weighted average of Kj for j from 1 to
*
is equivalent to Ka in the conventional cali. Note that K75
culation. The estimation we will introduce here is a simple
counting method, as a first attempt. More elaborate statistical methods will have to be developed and evaluated at
a later time.
Counting the Differences in the Synonymous Class and
Each of the 75 Elementary Amino Acid Change Classes
We first obtain the observed differences for the ithtype elementary amino acid change, Ni (i51, 2, ., 75),
as well as the total number of synonymous differences
Ns. For those codons differing by 2 bp, there are two pathways; for example, TTT–TTC–TCC or TTT–TCT–TCC.
We use the genome-wide EI (Tang et al., 2004) to assign
the probability for each of the two possible pathways. We
disregard the codons differing by 3 bp because they are very
rare between closely related species.
Counting the Synonymous and Nonsynonymous Sites
For genes with length L, we count the total number of
synonymous sites, Ls, and sites for each type of elementary
amino acid changes, Li (i 5 1, 2, ., 75). We assume mutations happen randomly on the sequences and obtain the
frequency of changes from amino acid j to amino acid k
( j 5 j or j 6¼ k);Pmutations to the stop codons are excluded.
Given L5Ls 1 75
i51 Li ; we can then calculate Ls and Li. The
counting method is similar to that in (Wyckoff, Wang,
and Wu, 2000) or Zhang (2000), except that we count
the number of sites for each of the 75 elementary changes
separately.
In enumerating the changes, the mutation pattern is
important. We consider only the difference between transition (ts) and transversion (tv). Dagan, Talmor, and Graur
(2002) showed that the ts/tv ratio has a profound effect
on the estimates of radical versus conservative amino acid
substitutions. Fortunately, Rosenberg, Subramanian, and
Kumar (2003) recently showed the ts/tv ratio to be relatively constant in each genome. Therefore, the genomewide ts/tv ratios (2.4 for primates and 1.7 for rodents)
estimated from the fourfold-degenerate sites were used in
the estimation.
Estimation of Ks, Ki, and Ki*
Given the observed changes (Ns and Ni, i 5 1, 75) and
the estimated number of sites (Ls and Li), we can estimate Ks
and Ki as Ns/Ls and Ni/Li, respectively. To account for multiple hits, we use the method of Jukes and Cantor (1969).
Note that the method developed in this paper is intended for
closely related species with less than 20% sequence divergence; thus, the following formulas were used:
4N s
Ks 5 0:75 ln 1 3 Ls
and
4Ni
:
Ki 5 0:75 ln 1 3 Li
Ki* ; the cumulative rate for the first i kinds of amino acid
changes, is
"
#
Pi
4 j 5 1 Nj
*
;
Ki 5 0:75 ln 1 Pi
3 j 5 1 Lj
where i 5 1, 2, ., 75.
Variance of Ks, Ki, and Ki*
Between species with a sequence divergence of 20%
or less, we suggest using the variance formulas of the
method of Jukes and Cantor (1969).
VarðKs Þ 5
ps ð1 ps Þ
2 ;
Ls 1 4p3s
pi ð1 pi Þ
VarðKi Þ 5 2 ;
Li 1 4p3 i
Pi ð1 Pi Þ
*
i;
VarðKi Þ 5 h
2 Pi
1 4P3 i
j 5 1 Lj
P
P
where ps 5Ns =Ls ; pi 5Ni =Li and Pi 5 ij51 Nj = ij51 Lj :
For larger data sets, we recommend bootstrapping as
an empirical means for estimating the variance of Ki, using
codons as the resampling units.
Application of the Ki Method to the Generalized
U Data Sets
One may have expected Ki to vary from one data set to
another (say, yeast vs. mammals), and the application of
this estimation method to different data sets would yield
different patterns. Fortunately, there is a strong correlation
among Ki’s. The salient finding of Tang et al. (2004) is that
EI(i)’s (5Ki/Ks) for different sets of genes from different
taxa are highly correlated. Hence, a Universal index, U,
was proposed (Table S1, Supplementary Material online)
such that EIs from any data set would be highly correlated
with U. For any data set with more than 20,000 amino acid
a =K
s Þ;
changes, EI(i) can be approximated by Ui 3 ðK
where Ka and Ks are, respectively, the (weighted) mean
nonsynonymous and mean synonymous substitution rates
of the entire data set. Ui’s are given in Table S1 (Supplementary Material online). For such data sets, the correlation
coefficient (r) between the observed EI(i) and the expected
a =K
s Þ is usually greater than 0.95. However, even
U i 3 ðK
smaller data sets with only 2,500 amino acid changes would
still yield an r value of .0.85 (Tang et al., 2004).
We shall refer to a collection of generalized data sets
as U sets, for which
a =K
s Þ for i 5 1 to 75: ð1Þ
EIðiÞ 5 Ki =Ks 5 Ui 3 ðK
a/K
s may differ from one data set to another. In any U set
K
(Tang et al., 2004), K1 =Ks is about 2.5 times, and K5 =Ks is
s. In figure 1, we
a/K
slightly more than twice, as high as K
show the decrease in Ki* as i increases; for example, K5* =Ks
a/ K
s(50.5 in
(.1.1) is more than 2.2 times the value of K
fig. 1). The result suggests, not surprisingly, that the highexchangeability classes of amino acid changes should be
more revealing of positive selection than the entire set of
Estimation of High-Exchangeability Substitutions,
*
Kh (5K10
), in Primates and Rodents
We shall now apply the Ki (or Kh) method to the genomic sequences of primates (human vs. macaque monkey)
and rodents (mouse vs. rat).
The Distribution of Ka/Ks in Primates and Rodents
In figures 2a and 2b, we show the distributions of the
Ka/Ks ratios for primates and rodents. The weighted means
are 0.176 and 0.148 for the primate and rodent comparisons, respectively. Only 21 out of 2,369 genes in the primate data show a Ka/Ks ratio greater than 1 (including 7
genes with Ks 5 0). In rodents, 7 out of 1,306 genes have
Ka/Ks . 1. In none of these comparisons is Ka . Ks significant. In other words, the conventional Ka/Ks analysis reveals little signature of positive selection in these data sets.
Ka versus Kh Among the Fastest Evolving Genes in
Primates and Rodents
In either data set, we first analyzed the top 100 genes
with the highest Ka values as shown in figure 3. The fastest
evolving 100 genes in primates and rodents have a mean Ka/
Ks ratio of 0.72 and 0.60, respectively. In figure 3, the cumulative Ki* =Ks values for the concatenated sequences are
plotted against i (thick black line). The curve in figure 3
decreases monotonically as i increases.
Among the conservative changes in primates (i , 20)
(fig. 3a), the cumulative Ki* =Ks ratios are greater than 1,
300
200
150
number of genes
100
0
0
50
200
number of genes
250
600
Defining Nonsynonymous Substitutions for the
High-Exchangeability Class (Kh)
For small i’s, Ki* (such as K5* ) in a generalized data set
is much higher than the standard Ka but the standard deviation (SD) associated with the estimate of K5* is also relatively large. If we include more classes of amino acid
changes (say, i . 30), the estimation error would be smaller
but Ki* =Ks would also decrease. There is a trade-off in the
demarcation of high- and low-exchangeability classes. An
optimal number of classes in this trade-off is about 10–12,
as shown below.
In figure 1, we plot ðKi* Ka Þ=SDðKi* Þ against i where
SDðKi* Þ is the SD of Ki* Note that ðKi* Ka Þ=SDðKi* Þ increases and reaches a plateau between i 5 10 and i 5 25.
(The general shape and position of the plateau in figure 1 are
not affected much by the total length of genes because we
are comparing the relative values among different Ki* ’s.)
One may wish to maximize both ðKi* Ka Þ=SDðKi* Þ and
ðKi* Ka Þ: To do so, we recommend the i value to be between 10 and 12, corresponding to where the curve in figure
1 reaches the plateau.
In general, the top 10–12 classes account for 25%–
30% of the total number of amino acid differences ob*
*
=Ks or K12
=Ks is slightly above or below twice
served. K10
the conventional Ka =Ks : We shall refer to this pattern as the
‘‘twofold approximation,’’ which will be tested further by
*
will be designated Kh (h
using published genomic data. K10
for high exchangeability) from now on.
(b)
(a)
400
changes. The question ‘‘what demarcates high- and lowexchangeability classes?’’ is addressed in the next section.
350
Estimating Nonsynonymous Substitutions and Its Applications to Detecting Positive Selection 375
0.0
0.4
Ka/Ks
0.8 >1
0.0
0.4
0.8 >1
Ka/Ks
FIG. 2.—(a) The distribution of Ka/Ks for 2,369 orthologous genes
between human and the macaque monkey. (b) The distribution of Ka/Ks
for 1,306 genes between mouse and rat.
hinting the action of positive selection. (Note that the ranking by i was independent of the data of figure 3a; it was
determined in Tang et al. [2004] using different data sets.)
As i approaches 75, the cumulative Ki* =Ks value approaches
0.72, which is the mean Ka/Ks. The inclusion of more radical classes of amino acid changes, on which negative selection operates strongly, masks the signature of positive
*
as noted; Kh/Ks is
selection. We use Kh to designate K10
1.494, about twice the Ka/Ks ratio of 0.72.
To show the significance of the difference between Kh/
Ks and Ka/Ks, we randomly shuffled the i ranking and determined that the differences observed after the ranking is shuffled 1,000 times. For each i value, the highest 5% in Ki* =Ks as
well as the means are plotted in figure 3a. The dashed line also
decreases monotonically because the SD of Ki* decreases
when i becomes larger with more samples. At each i, the observed Ki* =Ks is always bigger than the highest 5% value in
the randomized ranking scenarios. The observed excess is
thus statistically significant for every i rank.
We also tested other ranking methods in the literature.
The long-dashed line presents Ki* =Ks calculations based on
the ranking by Grantham’s distance (Grantham, 1974). The
line moves up and down but never goes above 1. This irregular pattern confirms what has been reported—that
amino acid changes determined to be conservative by Grantham’s distance are often pairs of low exchangeability
(Yang, Nielsen, and Hasegawa, 1998). Grantham’s distance
thus does not provide a suitable ranking of the substitution
rate between amino acids. We obtained the same pattern for
the top 100 genes in rodents (fig. 3b).
Ka versus Kh Among All Genes in Primates and Rodents
Although the Ki method is most useful when large
number of substitutions can be analyzed, Kh can be applied
2.0
376 Tang and Wu
K*i /Ks
1.5
amino acid ranking based on U index(conservtive−−>radical)
amino acid ranking based on Grantham Distance(increasing order)
top 5% from simulation based on the random ranking
mean from simulation based on the random ranking
0.5
1.0
(a): primate sequences
0
10
30
20
40
50
70
60
1.4
i
1.0
(b): rodent sequences
0.6
0.8
K*i /Ks
1.2
amino acid ranking based on U index(conservtive−−>radical)
amino acid ranking based on Grantham Distance(increasing order)
top 5% from simulation based on the random ranking
mean from simulation based on the random ranking
0
10
20
30
40
50
70
60
i
FIG. 3.—(a) The Ki* =Ks values for the concatenated 100 fastest evolving genes from primates are plotted against i (1–75) using three different
rankings: (1) the decreasing order by the U-index (thick black line); (2) the increasing order of Grantham’s distance (long-dashed line); (3) random
order (the highest 5% and the mean are plotted as dashed and dotted lines, respectively). (b) Same as (a) but using rodent sequences.
to individual genes as well. For each individual gene, we
calculated the Kh, Ka, and Ks values. In Table 1, we present
20 such examples, 10 from each data set. These are examples of longer genes and hence smaller stochastic fluctuations. In general, Kh is indeed larger than Ka for longer
genes, but the variance for each individual gene is substan-
tial. The new method is more informative when many genes
are analyzed simultaneously, as shown below.
The scatter plot of Kh/Ks versus Ka/Ks for the 1,948
genes between human and macaque (excluding those with
Ka 5 0 or Ks 5 0) is shown in figure 4a. Among those
genes, only 14 have a Ka/Ks ratio greater than 1. In contrast,
Table 1
Examples for the Estimation of Ks, Ka, and Kh
Data Set
Accession
Gene Name
Number of Codons
Ks
SE(Ks)
Ka
SE(Ka)
Kh
SE(Kh)
Ka/Ks
Kh/Ks
Primate
NM_133259
AF103801
NM_000313
NM_001063
NM_002087
NM_017646
NM_006059
BC010570
NM_015392
NM_002615
NM_013016
NM_012705
NM_032082
NM_130409
NM_057193
NM_012830
NM_017300
NM_053781
NM_012503
NM_130424
LRPPRC
Unknown
PROS1
TF
GRN
IPT
LAMC3
HMGCL
NPDC1
SERPINF1
Ptpns1
Cd4
Hao2
Cfh
Il10ra
Cd2
Baat
Akr1b7
Asgr1
Tmprss2
714
486
327
690
588
326
501
325
320
416
508
455
353
1,233
569
342
419
316
284
490
0.0616
0.0672
0.0389
0.0743
0.0635
0.0504
0.0779
0.0817
0.1021
0.1264
0.1055
0.1630
0.1297
0.1619
0.1698
0.1960
0.1638
0.1693
0.1765
0.1722
0.0101
0.0126
0.0118
0.0114
0.0114
0.0136
0.0132
0.0173
0.0190
0.0193
0.0162
0.0220
0.0220
0.0137
0.0201
0.0285
0.0231
0.0274
0.0295
0.0223
0.0575
0.0493
0.0265
0.0427
0.0339
0.0220
0.0334
0.0209
0.0236
0.0237
0.1152
0.1452
0.0925
0.1062
0.0991
0.1104
0.0782
0.0746
0.0556
0.0501
0.0064
0.0072
0.0063
0.0055
0.0053
0.0057
0.0058
0.0056
0.0061
0.0053
0.0110
0.0132
0.0116
0.0067
0.0095
0.0130
0.0097
0.0109
0.0098
0.0071
0.1146
0.1540
0.0834
0.1142
0.0665
0.0481
0.0543
0.0187
0.0846
0.0379
0.2188
0.1958
0.1971
0.2204
0.1139
0.2308
0.1962
0.1636
0.0541
0.1147
0.0251
0.0402
0.0299
0.0284
0.0264
0.0242
0.0235
0.0150
0.0430
0.0191
0.0423
0.0432
0.0480
0.0285
0.0280
0.0572
0.0451
0.0466
0.0278
0.0308
0.9325
0.7336
0.6829
0.5747
0.5331
0.4375
0.4290
0.2562
0.2313
0.1877
1.0925
0.8907
0.7133
0.6561
0.5834
0.5633
0.4773
0.4407
0.3152
0.2911
1.8584
2.2923
2.1471
1.5367
1.0463
0.9539
0.6973
0.2293
0.8294
0.2999
2.0750
1.2009
1.5191
1.3617
0.6706
1.1771
1.1976
0.9659
0.3062
0.6659
Rodent
These genes were chosen from the primate and rodent data sets mainly on account of their lengths. Examples are given in the descending order of Ka/Ks in each data set.
1.4
1.2
4
1.2
y =2.005x + 0.0122
y=1.9428x + 0.02724
R2= 0.9856
R2= 0.9926
0.8
1.2
Ka/Ks
FIG. 4.—The scatter plots of Kh/Ks versus Ka/Ks for (a) 1,948 orthologs between human and macaque and (b) 1,241 orthologs between mouse
and rat. The dashed lines are fitted regression lines. Those points with Kh/
Ks . 4.0 are plotted on the upper edge. Note that the slopes of the regression lines in both panels are greater than 2.
Kh/Ks is larger than 1 in 174 genes. Most data points in
figure 4a are well above the 45° line; data points below the
45° line are generally genes with small Ka. We also analyzed 1,241 orthologs between rat and mouse, as shown
in figure 4b. In this comparison, 7 genes show Ka/Ks greater
than 1 but for Kh/Ks 92 genes do. In both panels, the slopes
of the regression lines are above 2, indicating that Kh/Ks
is at least twice the value of Ka/Ks on average.
It should be noted that Kh/Ks may often be larger than 1
due to the larger standard error (SE) of Kh, in addition to its
larger mean. To remove, or at least reduce, the contribution
of the larger SE, we examine the correlation between Kh SE(Kh) and Ka SE(Ka). Both data sets show that they are
highly positively correlated with the correlation coefficient
larger than 0.75, and on average the Kh SE(Kh) is approximately 1.7 times the value of Ka SE(Ka). Moreover, for
those genes with Kh/Ks . 1, Kh 2SE(Kh) is on average
1.29 times higher than Ka 2SE(Ka) for rodents and
1.13 times higher for primates. These results suggest that
the larger Kh/Ks ratios in most cases are indeed due to the
larger mean in Kh.
The ‘‘Twofold Approximation’’ of Kh versus Ka
In the generalized data sets, we show Kh ’ 2Ka. To test
the idea that genes of sufficient size will reliably yield the
twofold relationship, Kh ’ 2Ka, we concatenated genes
with similar Ka values. Of the 2,369 orthologs between human and macaque, we sorted the 1,955 genes with Ka .
0 by the descending order of Ka and concatenated every
50 orthologs into a supersequence. A total of 39 supersequences were obtained. In figure 5a, we plot Kh/Ks against
Ka/Ks for these supersequences. The correlation coefficient
r is 0.993, and the slope of the regression line is 2.005.
Therefore, the Kh/Ks ratio is very close to twice the value
of Ka/Ks, as shown in figure 1 for the generalized data set.
We applied the same procedure to 1,200 orthologs between
0.8
Kh/Ks
0.4
0.2
(a)
(b)
0.0
0.4
0.2
0.4
Ka/Ks
0.0
0.0
0.0 0.5 1.0 1.5 2.0
(b)
0
0
(a)
0.6
0.8
0.6
Kh/Ks
2
Kh/Ks
y = 2.47x − 0.067
2
R = 0.52
1
2
1
Kh/Ks
1.0
3
3
y = 2.06x − 0.016
R2 = 0.74
1.0
4
Estimating Nonsynonymous Substitutions and Its Applications to Detecting Positive Selection 377
0.0
0.2
0.4
Ka/Ks
0.6
0.0
0.2
0.4
0.6
Ka/Ks
FIG. 5.—(a) The scatter plot of Kh/Ks versus Ka/Ks for 39 supersequences between human and macaque. Each supersequence is the concatenation of 50 genes with similar Ka values. (b) The same plot for 24
supersequences between mouse and rat. Note that the slopes of both regression lines are around 2, indicating Kh/Ks is usually twice as large
as Ka/Ks.
mouse and rat, creating 24 supersequences (each again being a 50 gene contig). In figure 5b, the correlation coefficient between Kh/Ks and Ka/Ks is 0.996, and the slope is
1.943. Again, the twofold approximation holds quite well.
In summary, Kh’s in any sequence comparisons would
fluctuate, but generally hover about two times of the corresponding Ka value. For longer sequences (or collection
of sequences), the Kh ’ 2Ka approximation holds well.
Discussion
The study of coding sequence evolution generally
relies on the distinction between synonymous and nonsynonymous changes. The latter is a heterogeneous class
driven by forces that both accelerate and retard the rate
of molecular evolution. The signals of positive and negative
selection can be better resolved when nonsynonymous
changes are properly classified. The power of the method
lies in the empirical means of finding the right set of conservative amino acid changes. The relative ranking of most
to least exchangeable replacements applies equally well to
yeasts, Drosophila, plants, and mammals (Tang et al.,
2004). This consistency permits a universal delineation
of the optimal set of the top 10 classes of amino acid
changes. The ranking of amino acid properties is crucial
as other existing indices, such as the most widely used
Grantham’s distance, cannot delineate a subset of changes
that evolve substantially faster than the rest (fig. 3).
The method proposed here requires genomic sequences
from only two closely related species; hence, the influx of
genomic data should make the method widely applicable.
The downside of calculating Kh is the larger SE associated
with only a subset of changes. (Although it is defined by 10
of the 75 types of elementary changes, Kh usually accounts
378 Tang and Wu
for 25%–30% of the amino acid substitutions.) In this study,
a simple counting method is introduced. While it may be
sufficient to accurately estimate the divergence between
closely related species, the potential for more sophisticated
methods, such as maximum likelihood (ML) estimation under the Bayesian framework, should not be overlooked.
By examining the top 10 classes (or about 25%) of
amino acid changes, it becomes much easier to find cases
where nonsynonymous changes outpace synonymous ones
both at the genomic and genic levels. Because Ka . Ks is
taken to indicate the action of positive selection in most
methods (discussed below), Kh . Ks should be interpreted
in the same manner. Although Ks has been known to be
lower than the neutral substitution rate, by as much as
20% in some taxa (Akashi, 1995; Hellmann et al., 2003;
Lu and Wu, 2005), Ka (or Kh) . Ks remains a reasonable
criterion for inferring positive selection. Without invoking
positive selection on amino acid changes, the alternative interpretation would have to be that amino acid changes are
subjected to weaker selective constraints than synonymous
changes are. It seems more reasonable to assume that, on
average, selective constraints on nonsynonymous changes
are at least as strong as those on synonymous changes.
In figure 4 on mammalian genes, Kh fluctuates greatly
among individual genes, but the regression corroborates the
relationship of Kh ’ 2Ka. A source of the fluctuation is codon composition. Different codon compositions yield different estimations due to the weight for each type of change.
There are many other sources of variation, such as neighboring effect, gene structure, and selection constraints for
different types of amino acid changes in different genes.
Nevertheless, if each gene is sufficiently large, the codon
composition will be close to the genome average and the
relationship, Kh ’ 2Ka will be approached.
There are several other approaches to detecting positive selection. A most widely used method is the site-by-site
analysis of a set of DNA sequences (Nielsen and Yang,
1998; Suzuki and Gojobori, 1999). In this approach, the
proportion of sites in the sequences under positive selection
is estimated, often in the ML framework (Yang, 1998; Yang
and Nielsen, 2002). Although the method has been applied
to genomic sequences from as few as three species (Clark
et al., 2003), it is probably more suited to cases where
a much larger number of taxa can be used. Some recent
studies have also raised the issue of possible high rate of
false positives in the more elaborate ML models (Zhang,
2004). More recently, Massingham and Goldman (2005)
demonstrated an improved ML method, that is, sitewise
likelihood ratio (SLR), for detecting nonneutral evolution.
They showed that the SLR method can be more powerful,
especially in difficult cases where the strength of selection
is low. Most other approaches require additional functional
(Hughes and Nei, 1988; Wyckoff, Wang, and Wu, 2000),
chromosomal (Betancourt, Presgraves, and Swanson, 2002;
Lu and Wu, 2005), or polymorphism data (McDonald and
Kreitman, 1991; Fay, Wyckoff, and Wu, 2002) for inferring
positive selection. A recent method that needs less information than our proposed method is the so-called ‘‘volatility
measure’’ (Plotkin, Dushoff, and Fraser, 2004) which uses
only sequences from one single genome. The utility of this
method in detecting selection, however, remains to be
demonstrated (Chen, Emerson, and Martin, 2005; Dagan
and Graur, 2005; Hahn et al., 2005).
In conclusion, in addition to the conventional Ka and
Ks estimates, the calculation of Kh may often yield additional information. This is especially true when one attempts to compare two genomic sequences. Although the
statistical variation associated with Kh is larger than the conventional Ka, genes with high Kh/Ks ratios may reveal more
evolutionary signature and can then be subjected to additional analysis (McDonald and Kreitman, 1991; Yang,
1998; Fay and Wu, 2000) or experimentation (Greenberg
et al., 2003; Sun, Ting, and Wu, 2004).
Supplementary Material
Supplementary Table S1 is available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.
org/).
Acknowledgments
The authors wish to thank Hurng-Yi Wang for providing the alignments of primate orthologs used in this study.
The authors also thank Alex Kondrashov, Wen-Hsiung Li,
Martin Kreitman, and Jian Lu for comments and discussions. The work is supported by National Institutes of
Health grants to C.-I.W.
Literature Cited
Akashi, H. 1995. Inferring weak selection from patterns of polymorphism and divergence at ‘‘silent’’ sites in Drosophila DNA.
Genetics 139:1067–1076.
Betancourt, A. J., D. C. Presgraves, and W. J. Swanson. 2002. A
test for faster X evolution in Drosophila. Mol. Biol. Evol.
19:1816–1819.
Chen, Y., J. J. Emerson, and T. M. Martin. 2005. Evolutionary
genomics: codon volatility does not detect selection. Nature
433:E6–E7[discussion E7–E8].
Clark, A. G., S. Glanowski, R. Nielsen et al. (17 co-authors). 2003.
Inferring nonneutral evolution from human-chimp-mouse
orthologous gene trios. Science 302:1960–1963.
Dagan, T., and D. Graur. 2005. The comparative method rules! Codon volatility cannot detect positive Darwinian selection using
a single genome sequence. Mol. Biol. Evol. 22:496–500.
Dagan, T., Y. Talmor, and D. Graur. 2002. Ratios of radical to
conservative amino acid replacement are affected by mutational and compositional factors and may not be indicative
of positive Darwinian selection. Mol. Biol. Evol.
19:1022–1025.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt 1978. A model
of evolutionary change in proteins. National Biomedical Research Foundation, Washington.
Fay, J. C., and C. I. Wu. 2000. Hitchhiking under positive Darwinian selection. Genetics 155:1405–1413.
Fay, J. C., G. J. Wyckoff, and C. I. Wu. 2002. Testing the neutral
theory of molecular evolution with genomic data from Drosophila. Nature 415:1024–1026.
Grantham, R. 1974. Amino acid difference formula to help explain
protein evolution. Science 185:862–864.
Greenberg, A. J., J. R. Moran, J. A. Coyne, and C. I. Wu. 2003.
Ecological adaptation during incipient speciation revealed by
precise gene replacement. Science 302:1754–1757.
Hahn, M. W., J. G. Mezey, D. J. Begun, J. H. Gillespie, A. D.
Kern, C. H. Langley, and L. C. Moyle. 2005. Evolutionary
Estimating Nonsynonymous Substitutions and Its Applications to Detecting Positive Selection 379
genomics: codon bias and selection on single genomes. Nature
433:E5–E6[discussion E7–E8].
Hellmann, I., S. Zollner, W. Enard, I. Ebersberger, B. Nickel, and
S. Paabo. 2003. Selection on human genes as revealed by comparisons to chimpanzee cDNA. Genome Res. 13:831–837.
Hughes, A. L., and M. Nei 1988. Pattern of nucleotide substitution
at major histocompatibility complex class I loci reveals overdominant selection. Nature 335:167–170.
Jukes, T. H., C. R. Cantor. 1969. Evolution of protein molecules.
P. 21 in H. N. Munroe, ed. Mammalian protein metabolism.
Academic Press, New York.
Li, W.-H. 1997. Molecular evolution. Sinauer Associates, Inc.
Sunderland, Mass.
Li, W. H., C. I. Wu, and C. C. Luo. 1985. A new method for estimating synonymous and nonsynonymous rates of nucleotide
substitution considering the relative likelihood of nucleotide
and codon changes. Mol. Biol. Evol. 2:150–174.
Lu, J., and C. I. Wu. 2005. Weak selection revealed by the wholegenome comparison of the X chromosome and autosomes of
human and chimpanzee. Proc. Natl. Acad. Sci. USA
102:4063–4067.
Massingham, T., and N. Goldman. 2005. Detecting amino acid
sites under positive selection and purifying selection. Genetics
169:1753–1762.
McDonald, J. H., and M. Kreitman. 1991. Adaptive protein
evolution at the Adh locus in Drosophila. Nature 351:
652–654.
Nei, M., and T. Gojobori. 1986. Simple methods for estimating
the numbers of synonymous and nonsynonymous nucleotide
substitutions. Mol. Biol. Evol. 3:418–426.
Nielsen, R., and Z. Yang. 1998. Likelihood models for detecting
positively selected amino acid sites and applications to the
HIV-1 envelope gene. Genetics 148:929–936.
Plotkin, J. B., J. Dushoff, and H. B. Fraser. 2004. Detecting selection using a single genome sequence of M. tuberculosis
and P. falciparum. Nature 428:942–945.
Rand, D. M., D. M. Weinreich, and B. O. Cezairliyan. 2000.
Neutrality tests of conservative-radical amino acid changes
in nuclear- and mitochondrially-encoded proteins. Gene
261:115–125.
Rosenberg, M. S., S. Subramanian, and S. Kumar. 2003. Patterns
of transitional mutation biases within and among mammalian
genomes. Mol. Biol. Evol. 20:988–993.
Sun, S., C. T. Ting, and C. I. Wu. 2004. The normal function
of a speciation gene, Odysseus, and its hybrid sterility effect.
Science 305:81–83.
Suzuki, Y., and T. Gojobori. 1999. A method for detecting positive selection at single amino acid sites. Mol. Biol. Evol.
16:1315–1328.
Tang, H., G. J. Wyckoff, J. Lu, and C. I. Wu. 2004. A universal
evolutionary index for amino acid changes. Mol. Biol. Evol.
21:1548–1556.
Wyckoff, G. J., W. Wang, and C. I. Wu. 2000. Rapid evolution of
male reproductive genes in the descent of man. Nature
403:304–309.
Yang, Z. 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol.
Evol. 15:568–573.
Yang, Z., and J. P. Bielawski. 2000. Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. 15:496–503.
Yang, Z., and R. Nielsen. 2002. Codon-substitution models for
detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908–917.
Yang, Z., R. Nielsen, and M. Hasegawa. 1998. Models of amino
acid substitution and applications to mitochondrial protein
evolution. Mol. Biol. Evol. 15:1600–1611.
Zhang, J. 2000. Rates of conservative and radical nonsynonymous
nucleotide substitutions in mammalian nuclear genes. J. Mol.
Evol. 50:56–68.
———. 2004. Frequent false detection of positive selection by the
likelihood method with branch-site models. Mol. Biol. Evol.
21:1332–1339.
Jianzhi Zhang, Associate Editor
Accepted October 10, 2005