Supplemental methods

Methods
Data
We used genome-wide multiple alignments of the genomes of 8 vertebrates with that of
H. sapiens (hg19 assembly), and of 7 insects with that of D. melanogaster (dm3), obtained from the UCSC
genome database [11]. These data sets were chosen to represent the highest number of clades that
branch off the lineage of H. sapiens (D. melanogaster), in order to maximize our resolution for
mapping the initial (A → B) replacements onto this lineage (Figure 1); a single species with the
highest coverage was chosen from each such clade, and only species with coverage > 6 were used.
Using the canonical splicing variants of 21,018 UCSC hg19 KnownGenes [12] for vertebrates, or of
13,300 FlyBase genes (BDGP release 5) [13] for insects, we extracted the alignment slices of
protein-coding regions for the orthologous genes. Single-nucleotide polymorphism data were obtained
from dbSNP (release 134) [14] for human, and from the Drosophila Genetic Reference Panel website
[15] for D. melanogaster. Codon sites with gaps or missing data in any of the species were excluded
from analysis. The total numbers of genes and codons analyzed are given in Table 1. Lengths of
segments of phylogenetic trees were taken from the UCSC Genome Bioinformatics Site. All lengths
are measured in the genome-average numbers of nucleotide replacements per site.
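The exclusion of codon sites with gaps or missing data can be illustrated with a minimal Python sketch. It is not the pipeline used here; the function name, the species labels, and the toy sequences are hypothetical, and the input is assumed to be an in-frame coding alignment with one sequence of equal length per species.

```python
VALID_BASES = set("ACGT")

def complete_codon_columns(alignment):
    """Return (codon_index, {species: codon}) for codon columns without gaps or missing data.

    `alignment` maps a species label to its aligned coding sequence; all
    sequences are assumed to be in frame and of equal length.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must have equal length")
    n = lengths.pop()
    kept = []
    for start in range(0, n - n % 3, 3):
        codons = {sp: seq[start:start + 3].upper() for sp, seq in alignment.items()}
        # Drop the column if any species has a gap ('-'), 'N', or another non-ACGT symbol.
        if all(set(codon) <= VALID_BASES for codon in codons.values()):
            kept.append((start // 3, codons))
    return kept

# Toy alignment (made-up sequences): only the first codon is complete in both species.
toy = {"human": "ATGGCTAAA", "mouse": "ATG-CTAAN"}
print([index for index, _ in complete_codon_columns(toy)])  # prints [0]
```

A codon column is dropped as soon as a single species has an incomplete codon, which mirrors the requirement that a site be present in all species.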
Analysis
The nucleotides in the internal nodes of the phylogenies were reconstructed using maximum
likelihood as implemented in the codeml program of the PAML package [16]. The results obtained using
maximum parsimony were similar (data not shown). We mapped the nucleotide replacements to the
phylogenetic trees as follows: whenever the nucleotides ascribed to the neighboring nodes differed,
a nucleotide replacement was inferred to have occurred at the edge that connected these two
nodes. No additional nonsynonymous replacements, beyond the initial replacement (A → B) at one
of the five (four) internal segments of the H. sapiens (D. melanogaster) lineage (Figure 1), and (for
calculation of Fd) the reversal (B → A) or orthogonal replacement (B → C) at the terminal segment
of this lineage, were allowed at the codon along the lineage of H. sapiens (D. melanogaster).
Throughout the paper, A and B refer to the two amino acids separated by a single nucleotide
substitution in the second position of a codon, and C is either of the one or two remaining amino
acids different from both A and B and separated from them by a single nucleotide substitution in the
second position of the codon. Codons with nonsynonymous replacements beyond those two, i.e.
with nonsynonymous replacements in the first or the third position of the codon, were not analyzed.
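The mapping rule described above (a replacement is placed on an edge whenever the states at its two end nodes differ) can be sketched as follows. This is an illustrative sketch only: the node labels, the edge list, and the function name are hypothetical, and the codon states at the internal nodes are assumed to have been reconstructed beforehand (for example by codeml).

```python
def map_replacements(edges, states):
    """List nucleotide replacements implied by reconstructed codon states.

    `edges` is a list of (parent, child) node labels; `states` maps each node
    to its codon. Each returned tuple is
    (parent, child, codon_position, parent_base, child_base), with positions 1..3.
    """
    events = []
    for parent, child in edges:
        for pos, (x, y) in enumerate(zip(states[parent], states[child]), start=1):
            if x != y:
                # The two ends of the edge differ, so a replacement is inferred on it.
                events.append((parent, child, pos, x, y))
    return events

# Toy lineage with made-up labels: GCT (Ala) changes to GTT (Val) on the first edge.
edges = [("n1", "n2"), ("n2", "human")]
states = {"n1": "GCT", "n2": "GTT", "human": "GTT"}
print(map_replacements(edges, states))  # prints [('n1', 'n2', 2, 'C', 'T')]
```

Recording the codon position with each event makes it straightforward to discard codons that also changed at the first or third position.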
For the analysis of polymorphisms, we then counted the numbers of second-position
nucleotides that experienced a replacement in one of the internal segments of the H. sapiens
(D. melanogaster) lineage, and are currently polymorphic in H. sapiens (D. melanogaster). More
precisely, for each of the considered ages of the initial replacement, i.e. for each of the considered
internal segments where such a replacement has occurred (Figure 1), we calculated the frequency of
polymorphisms that restore the ancestral amino acid as follows:
\[
F_p(B \to A) = \frac{\mathrm{poly}(B \to A) \mid \mathrm{initial}(A \to B)}{\mathrm{initial}(A \to B)},
\]
and the frequency of polymorphisms that give rise to an amino acid different from the ancestral one
as follows:
\[
F_p(B \to C) = \frac{\mathrm{poly}(B \to C) \mid \mathrm{initial}(A \to B)}{\mathrm{initial}(A \to B)}.
\]
Here, initial(A B) is the number of A B replacements inferred to have happened at this
internal segment; poly(B A) and poly(B C) is the number of codons with B A (B C)
polymorphism, such that the ancestral state for polymorphism is B; and the vertical line means that
the frequencies of such polymorphisms are measured at the sites of the inferred A B replacement.
The ancestral state for polymorphism was determined as the allele carried by the reference genome
(hg19 or dm3); for insects, it was additionally required that at least 50% of the D. melanogaster
genotypes carry the ancestral allele B.
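As a minimal sketch of how these two frequencies could be tallied per internal segment, assume each site of an inferred initial replacement is summarized as a pair (segment, polymorphism type). The record layout, the labels "B>A" and "B>C", and the function name are assumptions made for illustration, not the format used in this study.

```python
from collections import Counter

def polymorphism_frequencies(sites):
    """Compute Fp(B->A) and Fp(B->C) for each internal segment.

    `sites` is an iterable of (segment, kind): one record per site with an
    inferred initial A->B replacement on `segment`; `kind` is "B>A", "B>C",
    or None when the site is not polymorphic.
    """
    sites = list(sites)
    initial = Counter(segment for segment, _ in sites)
    poly = Counter((segment, kind) for segment, kind in sites if kind is not None)
    return {
        segment: {
            "Fp(B>A)": poly[(segment, "B>A")] / n,
            "Fp(B>C)": poly[(segment, "B>C")] / n,
        }
        for segment, n in initial.items()
    }

# Made-up records: four initial replacements on segment 1, two on segment 2.
toy = [(1, "B>A"), (1, None), (1, "B>C"), (1, None), (2, None), (2, "B>A")]
print(polymorphism_frequencies(toy))
```

Both frequencies share the same denominator, the number of initial replacements on the segment, so they are directly comparable within a segment.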
The replacements were analyzed similarly, except that instead of the polymorphism in H. sapiens
(D. melanogaster), we used the number of replacements in the H. sapiens (D. melanogaster) lineage
after its divergence from M. musculus (D. sechellia), as inferred by codeml. More precisely, for each
of the five (four) internal segments of the H. sapiens (D. melanogaster) lineage, the following
statistics were defined:
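\[
F_d(B \to A) = \frac{\mathrm{terminal}(B \to A) \mid \mathrm{initial}(A \to B)}{\mathrm{initial}(A \to B)},
\]
\[
F_d(B \to C) = \frac{\mathrm{terminal}(B \to C) \mid \mathrm{initial}(A \to B)}{\mathrm{initial}(A \to B)}.
\]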
Here, terminal(B A) and terminal(B C) are the numbers of B A and B C replacements in
the H. sapiens (D. melanogaster) lineage after its divergence from M. musculus (D. sechellia).
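To make the distinction between a reversal (B → A) and an orthogonal replacement (B → C) concrete, the sketch below classifies a terminal-segment codon relative to the ancestral and derived codons. It assumes the standard genetic code; the function names and the handling of stop codons (excluded here) are illustrative choices, not taken from the original analysis.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def second_position_only(c1, c2):
    """True if the two codons differ at the second position and nowhere else."""
    return c1[0] == c2[0] and c1[2] == c2[2] and c1[1] != c2[1]

def classify_terminal(ancestral, derived, terminal):
    """Classify the terminal codon as 'reversal', 'orthogonal', 'excluded', or None (unchanged)."""
    if not second_position_only(ancestral, derived):
        raise ValueError("initial replacement must be a second-position change")
    if terminal == derived:
        return None  # no replacement on the terminal segment
    if not second_position_only(derived, terminal):
        return "excluded"  # change at the first or third position
    a, b, t = CODE[ancestral], CODE[derived], CODE[terminal]
    if "*" in (a, b, t):
        return "excluded"  # stop codons are set aside in this sketch
    if t == a:
        return "reversal"  # B -> A
    return "orthogonal" if t != b else None  # B -> C

print(classify_terminal("GCT", "GTT", "GCT"))  # prints reversal
print(classify_terminal("GCT", "GTT", "GAT"))  # prints orthogonal
```

Counting the "reversal" and "orthogonal" outcomes per internal segment and dividing by the number of initial replacements on that segment gives the two F_d statistics.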