The response of amino acid frequencies to directional mutation pressure in mitochondrial genomes is related to the physical properties of the amino acids and to the structure of the genetic code.

Daniel Urbina, Bin Tang, Paul G Higgs
Department of Physics, McMaster University, Hamilton, Ontario L8S 4M1, Canada.

Aims of this project – Here we study the variation in frequency of DNA bases in the protein-coding regions of mitochondrial genomes and the corresponding variation in frequency of amino acids in the proteins.

Directional mutation pressure in DNA – The rates of mutation between the four bases are usually not equal. This causes a mutational pressure that drives the base frequencies away from 25%. If no selection acts on the DNA, base frequencies will reach an equilibrium determined by mutation. The frequencies of bases at synonymous sites vary enormously in mitochondrial genomes, indicating that mutation pressure varies in direction among species.

Response of amino acid frequencies – Mutation pressure will alter the frequency of usage of codons in gene sequences. This will cause amino acid substitutions in the proteins that will often be deleterious. Selection will therefore oppose variation in the frequencies of bases and amino acids. In mitochondrial sequences, it is observed that amino acid frequencies vary considerably in response to base frequency changes. Mutation pressure is thus strong enough to drive amino acid frequencies away from their optimal values.

Influence of physical properties – Most observed amino acid substitutions are between amino acids with similar physical properties. Selection acts less strongly against these changes because they have a smaller effect on protein structure and function. Here we will show that the physical properties of the amino acids determine the degree to which amino acid frequencies can respond to mutation pressure.
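To make the idea of an equilibrium determined by mutation concrete, the following is a minimal sketch in Python. It assumes a simple model in which each base mutates to each other base at a fixed rate; the function name and the rate values are hypothetical and are not taken from the poster. It steps the base frequencies forward in time until they settle.

```python
# Sketch: equilibrium base frequencies under mutation alone (no selection).
# The rate values below are hypothetical, chosen only to show a bias towards T.

BASES = "ACGT"

def equilibrium_frequencies(rates, steps=20000, dt=0.01):
    """Return the base frequencies reached when only mutation acts.

    rates[i][j] is the mutation rate from BASES[i] to BASES[j]
    (diagonal entries are ignored). Frequencies are advanced in
    small time steps until they settle.
    """
    freqs = [0.25] * 4  # start from equal base frequencies
    for _ in range(steps):
        updated = []
        for j in range(4):
            # Flow into base j from the other bases, and out of base j.
            gain = sum(freqs[i] * rates[i][j] for i in range(4) if i != j)
            loss = freqs[j] * sum(rates[j][k] for k in range(4) if k != j)
            updated.append(freqs[j] + dt * (gain - loss))
        freqs = updated
    return dict(zip(BASES, freqs))

# Equal rates leave the frequencies at 25%.
equal = [[1.0] * 4 for _ in range(4)]

# Hypothetical biased case: every base mutates to T five times faster
# than to any other base.
biased = [[1.0, 1.0, 1.0, 5.0] for _ in range(4)]

print(equilibrium_frequencies(equal))
print(equilibrium_frequencies(biased))
```

In the biased case the rate into each base depends only on the target base, so the equilibrium frequency of T is its share of the total rate, 5/8, and each of the other bases settles at 1/8. Real mitochondrial mutation spectra are more complex than this toy matrix.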
Image reproduced from http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/A/AnimalCells.html

Mitochondria are organelles inside eukaryotic cells. They possess their own genomes that are distinct from the main genome in the nucleus. Typical animal mitochondrial genomes contain 13 protein-coding genes, 2 rRNAs and 22 tRNAs.

Strand asymmetry – There is an asymmetry in replication of the two DNA strands in mitochondrial genomes. The strands are subject to different mutational pressures and the base frequencies are not equal on the two strands. All the data in this study refer to frequencies on the plus strand of the genome, which codes for the majority of genes.

Figure: front page of OGRe, our relational database for the comparative analysis of mitochondrial genomes. OGRe contains information on gene sequences, gene order and genome rearrangements. Please visit OGRe on-line at http://ogre.mcmaster.ca

Genetic code used in vertebrate mitochondrial DNA. Numbers in parentheses label the amino acid blocks 1 to 21.

First      Second position                                          Third
position   T             C             A              G             position

T          TTT F (1)     TCT S (6)     TAT Y (10)     TGT C (17)    T
           TTC F         TCC S         TAC Y          TGC C         C
           TTA L (2)     TCA S         TAA Stop       TGA W (18)    A
           TTG L         TCG S         TAG Stop       TGG W         G

C          CTT L         CCT P (7)     CAT H (11)     CGT R (19)    T
           CTC L         CCC P         CAC H          CGC R         C
           CTA L         CCA P         CAA Q (12)     CGA R         A
           CTG L         CCG P         CAG Q          CGG R         G

A          ATT I (3)     ACT T (8)     AAT N (13)     AGT S (20)    T
           ATC I         ACC T         AAC N          AGC S         C
           ATA M (4)     ACA T         AAA K (14)     AGA Stop      A
           ATG M         ACG T         AAG K          AGG Stop      G

G          GTT V (5)     GCT A (9)     GAT D (15)     GGT G (21)    T
           GTC V         GCC A         GAC D          GGC G         C
           GTA V         GCA A         GAA E (16)     GGA G         A
           GTG V         GCG A         GAG E          GGG G         G

This is the genetic code used in vertebrate mitochondrial DNA. It shows the mapping between the 64 possible codons and the 20 possible amino acids. The four-codon families (shaded boxes in the original figure) are CTN (L), GTN (V), TCN (S), CCN (P), ACN (T), GCN (A), CGN (R) and GGN (G). Third-position sites in four-codon families are synonymous (fourfold degenerate): base changes may occur at these sites without changing the amino acid. Hence, selection at these sites should be negligible (or at least very weak). In contrast, most first- and second-position changes are nonsynonymous.
Therefore selection should be significant at these sites.

Amino acid   Volume   Bulk    Polarity   pI      Hyd.1   Hyd.2   Surface Area   Fract. Area
Ala  A       67       11.50   0.00       6.00    1.8     1.6     113            0.74
Arg  R       148      14.28   52.00      10.76   -4.5    -12.3   241            0.64
Asn  N       96       12.28   3.38       5.41    -3.5    -4.8    158            0.63
Asp  D       91       11.68   49.70      2.77    -3.5    -9.2    151            0.62
Cys  C       86       13.46   1.48       5.05    2.5     2.0     140            0.91
Gln  Q       114      14.45   3.53       5.65    -3.5    -4.1    189            0.62
Glu  E       109      13.57   49.90      3.22    -3.5    -8.2    183            0.62
Gly  G       48       3.40    0.00       5.97    -0.4    1.0     85             0.72
His  H       118      13.69   51.60      7.59    -3.2    -3.0    194            0.78

This is a table of 8 physical properties of amino acids that are thought to influence protein folding and function (Volume, Polarity, Hydrophobicity etc.). Using Principal Component Analysis (PCA), we projected this 8-dimensional space into two dimensions, so that the similarities between the amino acids can be clearly visualized.

The PCA plot shows that the amino acids F, L, I, M and V in the first column of the genetic code table form a tight cluster with very similar physical properties. This is also true for S, P, T and A in the second column. Most of the third-column amino acids are fairly similar to one another. Surprisingly, the fourth-column amino acids are all very different. There is also no particular similarity between amino acids in the same row of the genetic code.

For each of the 473 species in OGRe, we measured T4 (the frequency of T at the fourfold-degenerate sites), and T1 and T2 (the frequency of T at first and second positions). T4 varies enormously due to mutational pressure, from less than 10% to more than 90%. T1 and T2 vary almost linearly with T4, but over a narrower range. This shows that both mutation and selection influence T1 and T2. By fitting a mutation-selection model to the data, we can estimate the relative strength of mutation and selection. The slope for T2 is less than that for T1, which shows that selection against second-position substitutions is stronger than that against first-position substitutions. Similar plots are also seen for A, C and G.
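As a rough illustration of how T1, T2 and T4 could be measured for a single protein-coding sequence, here is a minimal sketch in Python. The function name, the handling of a trailing partial codon and the toy sequence are assumptions made for this example; this is not the OGRe pipeline. The fourfold-degenerate sites are taken to be the third positions of the eight four-codon families of the vertebrate mitochondrial code.

```python
# Sketch: measuring T1, T2 and T4 for one protein-coding sequence.
# Function and variable names are hypothetical, not taken from OGRe.

# First two bases of the eight four-codon families in the vertebrate
# mitochondrial code (Leu CTN, Val GTN, Ser TCN, Pro CCN, Thr ACN,
# Ala GCN, Arg CGN, Gly GGN).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}

def t_frequencies(cds):
    """Return the frequency of T at first, second and fourfold-degenerate sites."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]

    first = [c[0] for c in codons]
    second = [c[1] for c in codons]
    # Third positions count only when the codon is in a four-codon family.
    fourfold = [c[2] for c in codons if c[:2] in FOURFOLD_PREFIXES]

    def fraction_t(sites):
        return sites.count("T") / len(sites) if sites else float("nan")

    return {
        "T1": fraction_t(first),
        "T2": fraction_t(second),
        "T4": fraction_t(fourfold),
    }

# Toy sequence (not real data): codons CTT, GTC, TTT, ACA.
print(t_frequencies("CTTGTCTTTACA"))
```

For the toy sequence, one of four first positions and three of four second positions are T. TTT is not in a four-codon family, so only three third positions count as fourfold-degenerate, and one of them is T. For a whole genome the same counts would be pooled over all genes on the plus strand before dividing.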
On the left, we show the variation in the frequencies of three amino acids in response to the variation in T4. Serine shows a significant increase; threonine shows a significant decrease; and alanine shows no trend. The direction and magnitude of these trends are influenced by mutations at all three codon positions.

This explains what we saw in part 5 (the smaller slope for T2 than for T1) – The similarity between amino acids within each of columns 1, 2 and 3 means that many first-position substitutions are only weakly selected against, whereas the dissimilarity between amino acids in the same row means that second-position substitutions are more strongly selected against.

Key point – The amino acids in the first two columns (numbers 1-9) have large slopes that may be either positive or negative, i.e. they are responsive to mutational pressure. The amino acids in the third and fourth columns (numbers 10-21) have slopes close to zero, i.e. they are non-responsive.

Hypothesis – An amino acid will respond significantly to mutational pressure at the DNA level if there are neighbouring amino acids in the genetic code to which it can mutate that have similar physical properties. If the neighbouring amino acids are very different in properties, selection will oppose these mutations, and the amino acid will not be responsive to mutation pressure.

Responsiveness – We measured the slope for each amino acid against each of the four base frequencies for two independent data sets (fish and mammals). We define the responsiveness of an amino acid as the root mean square value of these 8 slopes.

Proximity – We define the distance d_ij between any pair of amino acids i and j as the Euclidean distance between them in the 8-dimensional physical property space (after normalizing each property to have unit variance). We then define the proximity of an amino acid i as the mean of 1/d_ij over all its neighbouring amino acids j (i.e. those accessible by a single mutation in the DNA).
A high-proximity amino acid is one whose neighbours have similar physical properties.

On the right, we show the slope of the linear regression of each of the amino acid frequencies against each of the four base frequencies. The amino acids are numbered in the order they appear in the genetic code diagram (see part 4). Note that serine has two separate blocks and is thus numbered twice. Filled symbols are data points from fish genomes. Open symbols are derived from a mutation-selection model. The solid and dashed lines are linear regressions through the data and theory points, respectively.

Result – The graph shows that there is a strong correlation between Proximity and Responsiveness (R = 0.86, p < 10^-6). This confirms the hypothesis in part 8, and means that physical properties have a direct influence on evolutionary properties. Squares are data points from fish genomes. Triangles are derived from a mutation-selection model.

Summary – The frequencies of bases and amino acids in mitochondrial genomes vary in a complex way due to the action of directional mutation pressure on the DNA and stabilizing selection pressure on the protein sequences. Our model of mutation-selection balance explains the trends seen in these frequencies (see 5, 7 and 8). We developed a measure of similarity between amino acids that enabled us to make quantitative predictions about the responsiveness of the different amino acids to mutation pressure (see 6 and 9). This work also reveals non-random patterns of similarity between neighbouring amino acids in the genetic code that are of interest from the point of view of the evolution of the genetic code itself.
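The definitions of Responsiveness and Proximity given earlier can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code. All function names are hypothetical; neighbours are taken as the distinct amino acids reachable by one base change, without weighting; serine is treated as a single amino acid rather than two blocks; and only the nine rows of the property table reproduced on this poster are used, so the printed proximity is illustrative and will differ from a full 20-amino-acid analysis.

```python
import math
import statistics

BASES = "TCAG"
# Vertebrate mitochondrial code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def neighbours(aa):
    """Amino acids reachable from aa by a single base change (stops excluded)."""
    found = set()
    for codon, residue in CODE.items():
        if residue != aa:
            continue
        for pos in range(3):
            for base in BASES:
                other = CODE[codon[:pos] + base + codon[pos + 1:]]
                if other not in (aa, "*"):
                    found.add(other)
    return found

def unit_variance(props):
    """Scale each property to unit variance across the amino acids supplied.

    Assumption: population standard deviation is used for the scaling.
    """
    columns = list(zip(*props.values()))
    sds = [statistics.pstdev(col) for col in columns]
    return {aa: tuple(v / sd for v, sd in zip(vals, sds))
            for aa, vals in props.items()}

def proximity(aa, scaled, neighbour_set):
    """Mean of 1/d over the neighbours that have property data."""
    dists = [math.dist(scaled[aa], scaled[b])
             for b in neighbour_set if b in scaled]
    return sum(1.0 / d for d in dists) / len(dists) if dists else float("nan")

def responsiveness(slopes):
    """Root mean square of the regression slopes for one amino acid."""
    return math.sqrt(sum(s * s for s in slopes) / len(slopes))

# The nine property-table rows shown on the poster, in table column order.
PROPS = {
    "A": (67, 11.50, 0.00, 6.00, 1.8, 1.6, 113, 0.74),
    "R": (148, 14.28, 52.00, 10.76, -4.5, -12.3, 241, 0.64),
    "N": (96, 12.28, 3.38, 5.41, -3.5, -4.8, 158, 0.63),
    "D": (91, 11.68, 49.70, 2.77, -3.5, -9.2, 151, 0.62),
    "C": (86, 13.46, 1.48, 5.05, 2.5, 2.0, 140, 0.91),
    "Q": (114, 14.45, 3.53, 5.65, -3.5, -4.1, 189, 0.62),
    "E": (109, 13.57, 49.90, 3.22, -3.5, -8.2, 183, 0.62),
    "G": (48, 3.40, 0.00, 5.97, -0.4, 1.0, 85, 0.72),
    "H": (118, 13.69, 51.60, 7.59, -3.2, -3.0, 194, 0.78),
}

scaled = unit_variance(PROPS)
print(sorted(neighbours("A")))
print(proximity("A", scaled, neighbours("A")))
```

Alanine (GCN) has seven neighbours in the code: S, P and T by first-position changes, and V, D, E and G by second-position changes. Only D, E and G appear in the nine-row table, so the printed proximity averages over those three alone. Supplying all 20 amino acids in `PROPS` would remove that restriction.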