Nat’s Deep Thoughts (on genome amino-acid composition)

Statistical Methods. We have primarily relied on two methods for assessing the characteristics of a group of composition statistics, presumably all sharing some common heritage. The first measures the similarity in residue composition between any two features by treating each set of statistics as a 21-dimensional vector (20 amino acids plus the stop codon) and calculating the “distance” between the two vectors A and B:

$$ d(A, B) = \sqrt{\sum_{i=1}^{21} \left( A[i] - B[i] \right)^{2}} $$

From repeated pairwise comparisons across a larger set, a matrix of all distances can be built. Thus, for example, the genes, pseudogenes, and chromosomes in the human, fly, and worm genomes can be compared within each organism to try to assess the relative age of the pseudogenes. Though dimensionless, the distance is on a scale similar to that of an individual residue frequency: at the high extreme, human genes and M. jannaschii genes have a distance of 0.12, while human genes and fly genes are only 0.03 apart; a distance of 0.01 is virtually negligible. Where genes are compared to chromosomes or pseudogenes we have ignored stops, since stop frequencies tend to be the most widely separated and distort the distances.

The second method treats amino acids individually. In a large enough set of features, the standard deviation of the composition of each amino acid can be calculated; we further scaled this figure by dividing by the mean composition (that is, we used the coefficient of variation). This allows the variability or spread of a residue within a feature set to be evaluated quantitatively, as well as qualitatively from a plot. Furthermore, plotting the spread of residues across a number of feature sets shows clear trends in variability for the majority of residues.

For consistency, we have evaluated chromosome composition (best viewed as a function of trinucleotide frequencies) across all three frames; as expected, there is virtually no change depending on start point.
Therefore, all of our chromosomal amino acid and hypothetical codon usage statistics use the first possible frame. Simple di- and trinucleotide frequencies are taken over all three frames.

In the human, we have generally ignored chromosomes other than 21 and 22, due to size and lack of a reasonably final sequence. Gene predictions were made using GenomeScan; comparison with the latest Ensembl release shows very little difference in composition between the two sets (distance = 0.012). Furthermore, in either set the chromosomes studied have gene predictions close to representative of all of Ensembl (though in either case chromosome 22 is more divergent).

Though we refer here to “pseudogenes”, these features do not make use of a comprehensive algorithm or prediction method, instead being refined from BLAST and FASTA (more specifically, FASTX) alignments. Use of FASTA facilitates inclusion of a wider region of homology and is a far more realistic way of dealing with expected mutations, but complicates analysis by including large gaps and frameshifts. However, we have not attempted here to draw conclusions about the genomic layout of pseudogenes, aside from identifying potential processed pseudogenes, so they are more properly termed “pseudoexons”. Compositional statistics are based on FASTA-generated alignment sequences and thus deal with synonymous or conservative translations of the region, i.e. they ignore frameshifts completely. Comparison with older BLAST-only predictions (the basis for the later FASTA alignments) shows very little substantive change in overall results with this method, even with reduction of the set.

Pseudogenes. Our previous analysis of worm pseudogenes [1] revealed an easily visible trend in their implied amino-acid composition: 13 residues (out of 21, including the stop codon) fall between the chromosomes and genes, while another 4 almost exactly match one of the other features; none are vastly out of line.
We believe that mutations in nonviable genes will tend to yield a composition similar to that of the chromosome as a whole, so that the overall degree of separation between sets is an indication of the average age of the pseudogenes. [More specifically, those pseudogenic regions readily detectable by homology will be either recently mutated or ones whose implied composition is not significantly changed by non-frameshift mutations, even as the sequence approaches the (non-random) genomic level. Therefore sequences whose translation is not greatly disrupted by nucleotide substitutions will be more likely to “survive” as identifiable pseudoexons. I NEED PROOF OF THIS…] This separation may also be influenced by the type of genes that have fallen out of use, and by the nature of their disablement.

The analysis has been continued across the remaining complete eukaryotic genomes. Similar methods were used in each case for initial identification of candidate pseudogenes, but in all cases with fewer final predictions than in worm. In human this is hardly surprising, given the small portion of the overall genome used and the structure of human genes: more than a third as many pseudogenes as genes have been predicted, but the overall density is much lower. Disabled homologies in yeast and pseudoexons in fly number 166 and 114, respectively. [ WHY? ]

                      Chromosomes   Genes
Worm genes            0.0740
Worm pseudogenes      0.0264        0.0601
Human genes           0.0600
Human pseudogenes     0.0479        0.0295
Fly genes             0.0630
Fly pseudogenes       0.0096        0.0667
Yeast genes           0.0609
Yeast pseudogenes     0.0347        0.0356

Table 1. Distances for features in eukaryotic genomes. Column headings apply only to the named feature within the same organism.

Human pseudogene predictions across chromosomes 21 and 22 show a similar pattern: 14 residues with intermediate frequencies and 4 others on the margins. This result is closely followed with Ensembl genes (and is even improved on chromosome 22).
Pseudogenes in the human are closer in composition to human genes than is the case in the worm (Table 1); however, the much smaller set of fly pseudogenes is extremely close to the chromosomes. Yeast also follows the worm pattern. While in all cases some residue frequencies in pseudogenes may fall outside the expected spread, this usually happens only where the gene and chromosome frequencies are close to begin with. Residues with greater expected final drift, in contrast, strongly tend to have intermediate pseudogene frequencies. [ NEED TO SAY THIS LESS AWKWARDLY, AND WITH MATH.]

Inter-organismal. We examined the genes of the four eukaryotes for composition patterns (Fig. whatever). Calculated distances across the set clearly show a trend from more complex to less complex organisms (yeast is most different from human, followed by worm and fly), which is also evident qualitatively from the fact that almost every residue has this pattern. Chromosomes, in contrast, have no clear trend either quantitatively or qualitatively, and indeed the worm has the most contrasting genome. Most tendencies lack a clear explanation: human, for example, has reduced Asn and Ile (even more so if the Ensembl rather than GenomeScan predictions are used) and elevated Pro and Gly. Cysteine levels, however, can most likely be attributed to the numbers of soluble extracellular proteins.

Variability. Whether across related features in different genomes or across different features in the same genome, several amino acids stand out as either especially variable or highly conserved in composition (Table 1). For example, comparisons of all similar features across the eukaryotes show that Ser, Thr, Leu, and Val have very similar frequencies. We believe this is partly a result of the high number of codons for each of these residues; however, Gly and Ala (with 4 codons each) are nonetheless the most variable in chromosomes.
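As a minimal sketch of the two statistics described under Statistical Methods (not the code used to produce the results here), the Python below computes the distance between two composition vectors and the mean-scaled standard deviation of each residue across a feature set. The function names, the one-letter residue alphabet with "*" for the stop codon, and the choice of population rather than sample standard deviation are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev

# Assumed alphabet: 20 amino acids plus "*" for the stop codon (21 dimensions).
RESIDUES = "ACDEFGHIKLMNPQRSTVWY*"


def composition(seq, include_stop=True):
    """Return residue frequencies for a translated sequence as a dict."""
    alphabet = RESIDUES if include_stop else RESIDUES[:-1]
    counts = {r: 0 for r in alphabet}
    for ch in seq:
        if ch in counts:  # characters outside the alphabet are skipped
            counts[ch] += 1
    total = sum(counts.values())
    return {r: (counts[r] / total if total else 0.0) for r in alphabet}


def distance(a, b):
    """Euclidean distance between two compositions over their shared residues."""
    shared = a.keys() & b.keys()
    return math.sqrt(sum((a[r] - b[r]) ** 2 for r in shared))


def relative_spread(compositions):
    """Per-residue standard deviation divided by the mean, across a feature set.

    Uses the population standard deviation (an assumption; the text does not
    say which estimator was used).
    """
    spread = {}
    for r in compositions[0]:
        values = [c[r] for c in compositions]
        m = mean(values)
        spread[r] = pstdev(values) / m if m else float("nan")
    return spread
```

Calling `composition(seq, include_stop=False)` on both features before `distance` would correspond to ignoring stops when genes are compared to chromosomes or pseudogenes; a full distance matrix follows from applying `distance` to every pair in a set.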
Genome signature and codon bias. Gentles and Karlin [2] have demonstrated a consistent dinucleotide bias or “genome signature” in the model eukaryotes. There is certainly a correlation between the implied composition of chromosomes and the dinucleotide signature, but this does not always translate into a similar pattern for genes. For example, C. elegans has very high TT/AA levels and the highest amounts of Phe, Lys, and Asn in its chromosomes, but has less of the latter two than yeast in its genes. Human genes do however have …

In the human, CG is extremely uncommon, about 1.6% of all dimers in chromosome 22, and Arg (four of whose six codons start with CG) is somewhat less frequent than in other organisms in the translated chromosomes, but more frequent in the genes. As expected, therefore, trinucleotides containing CG are uniformly uncommon, no more than 0.31% in chromosome 21 and 0.51% in chromosome 22. There is some corresponding codon bias in residues having CG codons: AGA and AGG account for 50% of Arg in genes on chromosome 21, and 40% on chromosome 22. Trimer frequencies taken across all frames show a similar distribution but with slightly more AGA and AGG; more surprisingly, the CG dimer is far more common in exons than elsewhere in the genome.

In either case, although the high frequency of Arg does depend partially on use of more “favorable” trinucleotides, codons and indeed all 2- or 3-nucleotide combinations are more evenly distributed in genes than might be expected from the genome: the non-CG codons (AGA and AGG) account for nearly 80% of Arg-coding trinucleotides in either chromosome, yet genes on chromosome 22 have less AGA than CGC or CGG. In all other residues in whose codons CG occurs, the codon containing this dinucleotide is the least common of any; however, it is always roughly two to three times more common in genes than as a trinucleotide in the genome.
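The Arg comparison above can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the pipeline used for the figures quoted here: trimers are counted on the given strand only, "all three frames" is taken to mean every overlapping trinucleotide, and the function names are hypothetical.

```python
from collections import Counter

# The six Arg codons; four of them start with the CG dinucleotide.
ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")


def trimer_frequencies(dna):
    """Frequencies of all overlapping trinucleotides, i.e. over all three frames."""
    dna = dna.upper()
    counts = Counter(dna[i:i + 3] for i in range(len(dna) - 2))
    # Drop trimers containing ambiguous bases such as N.
    counts = {k: v for k, v in counts.items() if set(k) <= set("ACGT")}
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def codon_counts(cds):
    """Count codons in a single reading frame starting at the first base."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


def non_cg_arg_share(counts):
    """Fraction of Arg codons (or Arg-coding trimers) that are AGA or AGG."""
    arg_total = sum(counts.get(c, 0) for c in ARG_CODONS)
    if arg_total == 0:
        return float("nan")
    return (counts.get("AGA", 0) + counts.get("AGG", 0)) / arg_total
```

Applying `non_cg_arg_share` to `codon_counts` of each gene set and to `trimer_frequencies` of the whole chromosome would give the two quantities being contrasted: the share of AGA and AGG among Arg codons in genes versus among Arg-coding trinucleotides in raw DNA.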
Eukaryotic genes thus have a tendency towards less bias in favor of specific trinucleotides within a group of codons than their chromosomes, lessening the impact of di- and trinucleotide frequencies in raw DNA.

1. PM Harrison, N Echols, MB Gerstein (2001). Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome. Nucleic Acids Res 29: 818-830.
2. A Gentles, S Karlin (2001). Genome-scale compositional comparisons in eukaryotes. Genome Res 11: 540-546.