Nat’s Deep Thoughts (on genome amino-acid composition)
Statistical Methods.
We have primarily relied on two methods for assessing the characteristics of a
group of composition statistics, presumably all sharing some common heritage. The
similarity in residue composition between any two features was determined by treating
each set of statistics as a 21-dimensional vector, and calculating the “distance” between
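these two vectors:
\[ d(A, B) = \sqrt{\sum_{i=1}^{21} \left( A[i] - B[i] \right)^2} \]
where A and B are the two composition vectors and i runs over the 20 amino acids plus the stop codon.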
From repeated pairwise comparisons across a larger set, a matrix showing all distances
can be built. Thus, for example, the genes, pseudogenes, and chromosomes in the
human, fly, and worm genomes can be compared within the organism to try to assess the
relative age of the pseudogenes. Though dimensionless, the distance is on a scale
similar to that of an individual residue frequency: at the high extreme, human genes
and M. jannaschii genes have a distance of 0.12, while human genes and fly genes are
only 0.03 apart; a distance of 0.01 is virtually negligible. Where genes are being
compared to chromosomes or pseudogenes we have ignored stops, since their frequencies
tend to be the most widely separated and distort the distances.
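The distance calculation and the pairwise matrix can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: it assumes each feature's composition is stored as a mapping from one-letter residue symbol to frequency, with `*` standing for the stop codon, and the function names are hypothetical. It also assumes that frequencies are not renormalized when stops are ignored, which the text does not specify.

```python
import math
from itertools import combinations

# 20 amino acids plus "*" for the stop codon (21 dimensions in total).
RESIDUES = "ACDEFGHIKLMNPQRSTVWY*"


def composition_distance(a, b, ignore_stops=False):
    """Euclidean distance between two residue-frequency mappings."""
    residues = [r for r in RESIDUES if not (ignore_stops and r == "*")]
    return math.sqrt(sum((a.get(r, 0.0) - b.get(r, 0.0)) ** 2 for r in residues))


def distance_matrix(features, ignore_stops=False):
    """Pairwise distances for a dict of {feature name: composition}."""
    matrix = {name: {name: 0.0} for name in features}
    for x, y in combinations(features, 2):
        d = composition_distance(features[x], features[y], ignore_stops)
        matrix[x][y] = matrix[y][x] = d  # the distance is symmetric
    return matrix
```

Calling `distance_matrix({"genes": g, "pseudogenes": p, "chromosomes": c}, ignore_stops=True)` would give the within-organism comparison described above, with the stop frequency left out of the sum.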
The second method treats amino acids individually: in a large enough set of
features, the standard deviation of composition of each amino acid can be calculated; we
further scaled this figure by dividing by the mean composition. This allows the
variability or spread of a residue within a feature set to be evaluated quantitatively, as
well as qualitatively from a plot. Furthermore, plotting the spread of residues across a
number of feature sets shows clear trends in variability for the majority of residues.
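As a rough sketch of this second measure, the scaled spread of each residue is its standard deviation across the feature set divided by its mean. The function name is hypothetical, and the use of the population (rather than sample) standard deviation is an assumption, since the text does not say which was used.

```python
from statistics import mean, pstdev


def residue_spread(compositions):
    """Return {residue: standard deviation / mean} across a list of
    residue-frequency mappings, one mapping per feature in the set."""
    residues = sorted(set().union(*compositions))
    spread = {}
    for r in residues:
        values = [c.get(r, 0.0) for c in compositions]
        m = mean(values)
        # Assumption: population standard deviation; undefined if the mean is 0.
        spread[r] = pstdev(values) / m if m > 0 else float("nan")
    return spread
```

Because the result is scaled by the mean, rare residues such as Trp and common ones such as Leu can be compared on the same axis.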
For consistency, we have evaluated chromosome composition—best viewed as a
function of trinucleotide frequencies—across all three frames; as expected, there is
virtually no change depending on start point. Therefore, all of our chromosomal amino
acid and hypothetical codon usage statistics use the first possible frame. Simple di- and
trinucleotide frequencies are taken over all three frames. In the human, we have
generally ignored chromosomes other than 21 and 22, due to size and lack of a
reasonably final sequence. Gene predictions were made using GenomeScan; comparison
with the latest Ensembl release shows very little difference in composition between the
two sets (a distance of 0.012); furthermore, in either set the chromosomes studied have gene
predictions close to representative of all of Ensembl (though in either case chromosome 22 is more divergent).
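The implied amino-acid composition of a chromosome in a given frame can be sketched as below. This is an illustrative assumption about the procedure rather than the original code: it uses the standard genetic code, translates the forward strand only, and skips any codon containing a character other than A, C, G, or T. The name `implied_composition` is hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def implied_composition(sequence, frame=0):
    """Residue frequencies from translating `sequence` in one forward frame."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(frame, len(seq) - 2, 3):
        residue = CODON_TABLE.get(seq[i:i + 3])  # None for codons with N etc.
        if residue is not None:
            counts[residue] += 1
    total = sum(counts.values())
    return {r: n / total for r, n in counts.items()} if total else {}
```

Running this with `frame=0`, `1`, and `2` on the same sequence and comparing the three results is one way to check the observation that the start point makes virtually no difference.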
Though we refer here to “pseudogenes”, these features do not make use of a
comprehensive algorithm or prediction method, instead being refined from BLAST and
FASTA (more specifically, FASTX) alignments. Use of FASTA facilitates inclusion of a
wider region of homology and is a far more realistic way of dealing with expected
mutations, but complicates analysis by inclusion of large gaps and frameshifts. However,
we have not attempted here to draw conclusions about genomic layout of pseudogenes,
aside from identifying potential processed pseudogenes, so they are more properly termed
“pseudoexons”. Compositional statistics are based on FASTA-generated alignment
sequences and thus deal with synonymous or conservative translations of the region, i.e.
ignoring frameshifts completely. Comparison with older BLAST-only predictions (the
basis for the later FASTA alignments) shows very little substantive change in overall
results with this method, even with reduction of the set.
Pseudogenes.
Our previous analysis of worm pseudogenes [1] revealed an easily visible trend in
their implied amino-acid composition: 13 residues (out of 21, including the stop codon)
fall between the chromosomes and genes, while another 4 almost exactly match
one of the other features; none are vastly out of line. We believe that mutations in
non-viable genes will tend to yield a composition similar to that of the chromosome as a
whole, so that the exact degree of separation between sets overall is an indication of the
average age of the pseudogenes. [ More specifically, those pseudogenic regions readily
detectable by homology will either be recently mutated, or those whose implied
composition will not be significantly changed by non-frameshift mutations even as the
sequence approaches the (non-random) genomic level. Therefore sequences whose
translation is not greatly disrupted by nucleotide substitutions will be more likely to
“survive” as identifiable pseudoexons. I NEED PROOF OF THIS…] Obviously it may
also be influenced by the type of genes that have fallen out of use, and the nature of their
disablement.
The analysis has been continued across the remaining complete eukaryotic
genomes. Similar methods were used in each case for initial identification of candidate
pseudogenes, but in all cases with fewer final predictions than in worm. In human this is
hardly surprising, given the small portion of the overall genome used and the structure of
human genes: more than a third as many pseudogenes as genes have been predicted, but
the overall density is much lower. Disabled homologies in yeast and pseudoexons in fly
number 166 and 114, respectively. [ WHY? ]
Feature               Chromosomes   Genes
Worm genes            0.0740        -
Worm pseudogenes      0.0264        0.0601
Human genes           0.0600        -
Human pseudogenes     0.0479        0.0295
Fly genes             0.0630        -
Fly pseudogenes       0.0096        0.0667
Yeast genes           0.0609        -
Yeast pseudogenes     0.0347        0.0356
Table 1. Distances for features in eukaryotic genomes. Column
headings apply only to the named feature within the same organism.
Human pseudogene predictions across chromosomes 21 and 22 show a similar
pattern: 14 residues with intermediate frequencies and 4 others on the margins. This result
holds closely when Ensembl genes are used (and even improves on chr. 22). Pseudogenes in
the human are closer in composition to human genes than is the case in the worm (Table
1); however, the much smaller set of fly pseudogenes is extremely close to the
chromosomes. Yeast also follows the worm pattern. While in all cases some residue
frequencies in pseudogenes may fall outside the spread expected, this is usually only the
case where gene and chromosome frequencies are close to begin with. Residues with
greater expected final drift, in contrast, strongly tend to have intermediate pseudogene
frequencies. [ NEED TO SAY THIS LESS AWKWARDLY, AND WITH MATH.]
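One way to make the "intermediate frequency" criterion explicit is to count, for each residue, whether the pseudogene frequency lies between the chromosome and gene frequencies. The sketch below is only an illustration of that criterion: the function name and the optional `tolerance` parameter (to treat near-matches as intermediate) are assumptions, not part of the original analysis.

```python
def classify_residues(chrom, genes, pseudo, tolerance=0.0):
    """Split residues by whether the pseudogene frequency lies between
    the chromosome and gene frequencies (inclusive, within `tolerance`)."""
    intermediate, outside = [], []
    for r in sorted(set(chrom) & set(genes) & set(pseudo)):
        low, high = sorted((chrom[r], genes[r]))
        if low - tolerance <= pseudo[r] <= high + tolerance:
            intermediate.append(r)
        else:
            outside.append(r)
    return intermediate, outside
```

The length of the first returned list corresponds to counts such as the 13 (worm) or 14 (human) intermediate residues quoted in the text.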
Inter-organismal.
We examined the genes of the four eukaryotes for composition patterns (Fig.
whatever). Calculated distances across the set clearly show a trend from more complex
to less complex organisms (yeast is most different from human, followed by worm and
fly), also evident qualitatively from the fact that almost every residue follows this pattern.
Chromosomes, in contrast, have no clear trend either quantitatively or qualitatively, and
indeed the worm has the most contrasting genome. Most tendencies lack a clear
explanation: human has reduced Asn and Ile (even more so if the Ensembl rather than
GenomeScan predictions are used) and elevated Pro and Gly. Cysteine levels, for
instance, can however most likely be attributed to numbers of soluble extracellular proteins.
Variability
Either across related features in different genomes, or different features in the
same genome, several amino acids stand out as either especially variable or having highly
conserved composition (Table 1). For example, comparisons of all similar features
across the eukaryotes show that Ser, Thr, Leu, and Val have very similar frequencies.
We believe this is partly a result of the high number of codons for each of these residues;
however, Gly and Ala (with 4 codons each) are nonetheless the most variable in
chromosomes.
Genome signature and codon bias
Gentles and Karlin [2] have demonstrated a consistent dinucleotide bias or “genome
signature” in the model eukaryotes. There is certainly a correlation between the implicit
composition of chromosomes and the dinucleotide signature, but this does not always
translate into a similar pattern for genes. For example, C. elegans has very high TT/AA
levels and the highest amount of Phe, Lys, and Asn, but has less of the latter two than
yeast in its genes. Human genes do however have
In the human, CG is extremely uncommon, about 1.6% of all dimers in
chromosome 22, and Arg (with four out of six codons starting with CG) is somewhat less
frequent than in other organisms in the translated chromosomes, but more frequent in the genes. As
expected, therefore, trinucleotides containing CG are uniformly uncommon, no more than
0.31% in Chr. 21 and 0.51% in Chr. 22. There is some corresponding codon bias in
residues having CG codons: AGA and AGG account for 50% of Arg in genes on
chromosome 21, and 40% in 22. Trimer frequencies taken across all frames show a
similar distribution but with slightly more AGA and AGG; more surprisingly, the CG
dimer is far more common in exons than elsewhere in the genome.
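The two kinds of count used in this paragraph can be sketched as follows: overlapping dimer or trimer frequencies taken over all frames, and the share of each Arg codon among all Arg codons in an in-frame coding sequence. The function names are hypothetical and the code is a minimal illustration under the standard genetic code, not the original analysis.

```python
from collections import Counter

# The six Arg codons of the standard genetic code; four start with CG.
ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")


def kmer_frequencies(sequence, k):
    """Frequencies of overlapping k-mers, i.e. taken over all frames."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def arg_codon_shares(coding_sequence):
    """Share of each Arg codon among all Arg codons in an in-frame sequence."""
    seq = coding_sequence.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(c for c in codons if c in ARG_CODONS)
    total = sum(counts.values())
    return {c: counts[c] / total for c in ARG_CODONS} if total else {}
```

For example, `kmer_frequencies(chromosome, 2).get("CG", 0.0)` gives the CG dimer frequency, and summing the `"AGA"` and `"AGG"` entries of `arg_codon_shares(gene)` gives the non-CG share of Arg codons discussed above.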
In either case, although the high frequency of Arg does depend partially on use of
more “favorable” trinucleotides, codons and indeed all 2- or 3-nucleotide combinations
are more evenly distributed in genes than might be expected from the genome: the
non-CG codons (AGA and AGG) account for nearly 80% of Arg-coding trinucleotides in
either chromosome, yet genes on 22 have less AGA than CGC or CGG. In all other residues
in whose codons CG occurs, the codon containing this dinucleotide is the least common
of any; however, it is always almost two to three times more common in genes than as a
trinucleotide in the genome. Eukaryotic genes thus have a tendency towards less bias in
favor of specific trinucleotides within a group of codons than their chromosomes,
lessening the impact of di- and trinucleotide frequencies in raw DNA.
[1] PM Harrison, N Echols, MB Gerstein (2001). Digging for dead genes: an analysis of the characteristics of
the pseudogene population in the Caenorhabditis elegans genome. Nucleic Acids Res. 29: 818-830.
[2] A Gentles, S Karlin (2001). Genome-scale compositional comparisons in eukaryotes. Genome Res. 11: 540-546.