HMG Advance Access published September 30, 2003

TITLE PAGE

Polymorphism, shared functions and convergent evolution of genes with sequences coding for polyalanine domains

Hugo Lavoie 1, François Debeane 1, Quoc-Dien Trinh 1, Jean-François Turcotte 3, Louis-Philippe Corbeil-Girard 1, Marie-Josée Dicaire 1, Anik Saint-Denis 1, Martin Pagé 1, Guy A. Rouleau 2 and Bernard Brais 1

1 Laboratoire de neurogénétique, Centre de recherche du Centre Hospitalier de l’Université de Montréal; 2 Montreal General Hospital, McGill Health Center, McGill University, Montréal, Québec, Canada; 3 Département de Biochimie, Université de Montréal

Corresponding author: Bernard Brais, Laboratoire de neurogénétique, M4211-L5, Hôpital Notre-Dame - CHUM, 1560 Sherbrooke est, Montréal, Québec, Canada H2L 4M1. Tel: 1-514-890-8000 (ext. 25560). Fax: 1-514-412-7525. Bernard.Brais@umontreal.ca

Copyright © 2003 Oxford University Press

Downloaded from http://hmg.oxfordjournals.org/ at Pennsylvania State University on February 21, 2013

ABSTRACT

Mutations causing expansions of polyalanine domains are responsible for nine hereditary diseases. Other GC-rich sequences coding for some polyalanine domains were found to be polymorphic in humans. These observations prompted us to identify all sequences in the human genome coding for polyalanine stretches longer than four alanines and to establish their degree of polymorphism. We identified 494 annotated human proteins containing 604 polyalanine domains. Of the tested sequences coding for more than 7 alanines, 32% (31/98) were polymorphic. The length of the polyalanine-coding sequence and its GCG or GCC repeat content are the major predictors of polymorphism. GCG codons are over-represented in human polyalanine-coding sequences. Our data suggest that GCG and GCC codons play a key role in polyalanine-coding sequence appearance and polymorphism. The grouping by shared function of polyalanine-containing proteins in Homo sapiens, Drosophila melanogaster and Caenorhabditis elegans shows that the majority are involved in transcriptional regulation. Phylogenetic analyses of HOX, GATA and EVX protein families demonstrate that polyalanine domains arose independently in different members of these families, suggesting that convergent molecular evolution may have played a role. Finally, polyalanine domains in vertebrates are conserved between mammals and are rarer and shorter in Gallus gallus and Danio rerio. Together our results show that the polymorphic nature of sequences coding for polyalanine domains makes them prime candidates for mutations in hereditary diseases and suggest that they have appeared in many different protein families through convergent evolution.

INTRODUCTION

Since 1996, mutations causing expansions of polyalanine tracts have been documented in nine human diseases and in one mouse strain (1-10). Polyalanine expansions have been found in: synpolydactyly (from 15 to 22-29 alanines in HOXD13) (3, 11), cleidocranial dysplasia (from 17 to 27 alanines in RUNX2) (5), oculopharyngeal muscular dystrophy (OPMD; from 10 to 11-17 alanines in PABPN1) (1), familial holoprosencephaly (from 15 to 25 alanines in ZIC2) (2), hand-foot-genital syndrome (from 18 to 26 alanines in HOXA13) (4), blepharophimosis/ptosis/epicanthus inversus syndrome type II (from 15 to 30 alanines in FOXL2) (6), X-linked mental retardation and epilepsy (from 12-16 to 20-23 alanines in ARX) (8), X-linked mental retardation with growth hormone deficiency (from 15 to 26 alanines in SOX3) (9) and congenital central hypoventilation syndrome (from 25 to 29 alanines in PHOX2B) (10). The shortest disease-causing expansion of a polyalanine domain was observed in recessive OPMD, in which the addition of a single alanine residue caused a recessive late-onset disease (1). Interestingly, this polymorphism is present in 2% of the general population and, when inherited in a compound heterozygote fashion with a dominant expansion, causes a more severe phenotype (1). Non-pathological polyalanine polymorphisms were reported in polyalanine-coding sequences in seven other human genes prior to this study: MICA, GPX1, TGFBR1, RPL14, MSH3, and FOXE1 (12-17). Two mechanisms leading to lengthening of polyalanine-coding sequences have been proposed: expansions of stretches of GCN codons in PABPN1, ARX, MICA, GPX1, TGFBR1, RPL14 and FOXE1, and duplications of mixed (GCN)n polyalanine-coding sequences in PABPN1, HOXD13, HOXA13, RUNX2, ZIC2, FOXL2, ARX, SOX3, PHOX2B and MSH3 (1-6, 12-17). These observations prompted us to identify human proteins with domains of 5 alanines or more and to investigate whether variations could be found in their polyalanine-coding sequences.

Knowledge of the functional roles of polyalanine domains in proteins is very limited. Studies have suggested that polyalanine may play a role in transcriptional repression in Drosophila kruppel, engrailed and even-skipped, and in human glucocorticoid receptor and octamer binding protein 1 (18-23). In contrast, the deletion of the polyalanine domains in SCIP did not alter its transcriptional activity (24). Together, these observations suggest that polyalanine stretches in these proteins play a secondary role in transcription. Polyalanine domains in vivo are believed to form β-pleated sheets in dragline spider silk, moth silk and mollusk shell and tendons (25-27). The interaction between numerous polyalanine domains is known to play a major role in ensuring fiber tensile strength in spider silk (25). The biophysical characteristics of homopolymeric alanine stretches in vitro and in silico have been extensively studied but are still somewhat controversial, since polyalanine polymers have served as models in protein folding research for both β-sheet and α-helical conformations (28-30). Polyalanine peptides of 10 residues or more form stable fibrillar macromolecules resistant to chemical denaturation and enzymatic degradation in vitro (29, 31). Though the structure of polyalanine peptides in solution is still debated, a recent study shows that β-sheet structures are favored in hydrophilic physiological conditions and α-helical conformations in hydrophobic medium (28).

The appearance and conservation of polyalanine domains have not been extensively studied. The polyalanine domains of HOXA13 and class III POU transcription factors appeared only in mammals (32, 33). In the case of HOXA13, it has been shown that the human protein is on average 35% longer than its zebrafish orthologs. These differences in size are primarily due to the accumulation of polyalanine and of flanking regions rich in proline and glycine (33). Similarly, the Drosophila ortholog of PABPN1 lacks a polyalanine domain and a large portion of the protein’s N-terminus (34). It was suggested that polyalanine domains in arthropod silk and mollusk shell and tendon have appeared by convergent evolution (35), implying that these domains appeared independently in orthologs and were conserved for functional reasons.

The complete sequence of the human genome was a prerequisite for this study’s identification by data mining of all human proteins with polyalanine domains and allowed the screening for polymorphisms of a large set of sequences coding for polyalanine. The increasing number of sequenced eukaryotic genomes further permitted comparative analyses of species differences in the size and differential appearance of polyalanine domains through evolution. This study establishes that polyalanine-coding sequences are frequently polymorphic in the human population, that polyalanine domains are more frequent in specific functional categories, especially in transcription regulators, that they likely appeared by convergent molecular evolution in some protein families and that they are longer in mammals than in other vertebrates.

RESULTS

Sequences coding for polyalanine domains in human are frequently polymorphic.

We identified 494 known human proteins with polyalanine domains longer than 4 alanines in the human RefSeq protein database (Supplementary material) (36). These 494 proteins contained a total of 604 polyalanine domains. We selected the 124 proteins with domains longer than 7 alanines to investigate whether the polyalanine-coding sequences are polymorphic in humans. We successfully retrieved genomic coding sequences for 135 polyalanine domains (117 genes) and 98 were amplified by PCR on 42 unrelated DNA samples. Amplification products were run on denaturing gels and samples showing two or more alleles were sequenced. Of the 98 sequenced regions, 31 (32%) were found to have at least two sequence-proven alleles. Table I lists the genes with polymorphic sequences sorted by the frequency of the most common allele observed. Four of these polymorphisms were previously reported in the literature (Table I). Triplet repeat expansions and contractions are responsible for 77% (24/31) of the observed polymorphisms (Table I). Most are due to changes in the number of GCG or GCC codons (20/24). Changes involving tracts of GCA and GCT codons are observed in only four genes. In seven genes, the polymorphisms are due to duplications or deletions of mixed (GCN)n sequences (Table I). In 48% of the polymorphic loci, the minor alleles have a frequency below 5% (Table I). Codon repeat expansions/contractions and deletions/insertions of larger sequences are both responsible for rare and more frequent alleles. It is noteworthy that we also identified new polymorphisms in two genes containing small polyalanine domains of 7 alanines (GDF1 and CHD4, data not shown). This raises the possibility that sequences coding for even shorter polyalanine domains could also be polymorphic. Polymorphisms with high frequencies could be predicted by aligning Unigene EST sequences for 9 genes (Table I). We found that four of the nine genes with known pathogenic lengthening of a polyalanine tract are polymorphic in the general population: PABPN1, RUNX2, ZIC2 and HOXA13 (1, 2, 4, 5) (Table I). This further underlines that sequences encoding a polymorphic polyalanine domain are prime candidates for diseases mapped to the same chromosomal region.

The size of the polyalanine-coding sequence and its repeat content appear to be the major predictors of polymorphism. Most of the polymorphic sequences (62%) and very few non-polymorphic sequences (11%) code for more than 13 alanines or contain repeats of more than 5 codons (Fig. 1A). Polymorphism of polyalanine domains coded by mixed GCN codons seems to be mainly influenced by the length of the sequence. In fact, the mixed (GCN)n polyalanine-coding sequences are more frequently polymorphic when they code for more than 13 alanines (Fig. 1A, arrows).

GCG codons are over-represented in polyalanine-coding sequences.
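As a rough illustration of how the codon composition of a polyalanine-coding sequence can be tallied, the sketch below computes the relative usage of the four GCN codons, the longest run of identical codons, and whether the sequence meets the two predictors of polymorphism stated above (more than 13 alanines, or a repeat of more than 5 identical codons). The function names, the dictionary layout and the toy sequence are assumptions made for this example; they are not the software or data used in the study.

```python
from collections import Counter
from itertools import groupby

ALANINE_CODONS = ("GCA", "GCC", "GCG", "GCT")


def codon_profile(dna):
    """Summarize an in-frame DNA sequence made only of alanine (GCN) codons."""
    dna = dna.upper()
    if not dna or len(dna) % 3 != 0:
        raise ValueError("expected a non-empty, in-frame sequence")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    if any(codon not in ALANINE_CODONS for codon in codons):
        raise ValueError("sequence contains a non-alanine codon")
    counts = Counter(codons)
    # Fraction of the tract contributed by each GCN codon.
    usage = {codon: counts[codon] / len(codons) for codon in ALANINE_CODONS}
    # Length of the longest uninterrupted run of one identical codon.
    longest_run = max(len(list(group)) for _, group in groupby(codons))
    return {"alanines": len(codons), "usage": usage, "longest_run": longest_run}


def meets_predictor(profile, alanine_cutoff=13, run_cutoff=5):
    """True if the tract codes for >13 alanines or has a run of >5 identical codons."""
    return profile["alanines"] > alanine_cutoff or profile["longest_run"] > run_cutoff


# Hypothetical toy sequence: six GCG codons followed by GCA and GCC.
toy = codon_profile("GCG" * 6 + "GCA" + "GCC")
print(toy["alanines"], toy["longest_run"], meets_predictor(toy))
```

The cutoffs are exposed as parameters so that other thresholds can be explored; a mixed (GCN)n tract with no long run of identical codons is flagged only by its length.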
There is a statistically significant over-representation of GCG codons in polyalanine-coding sequences compared to their average relative occurrence in a control set of open reading frames (57% ± 33% in polyalanine compared to 11% ± 12% of GCG usage in 597 control sequences) and according to the codon usage database (11% of GCG codon usage) (37) (Fig. 1B). This contrasts with GCC codons, which are as frequently used in polyalanine-coding sequences as in control sequences (Fig. 1B). GCA and GCT codons are overall under-represented in polyalanine-coding sequences (Fig. 1B). A noticeable increase in GCG and GCC codon usage is observed as the size of the polyalanine domains increases, while GCA and GCT codon usage tends to decrease with length (Fig. 1B). The analysis of the contribution of codon repeats in polyalanine-coding sequences demonstrates that GCG and GCC repeats become predominant at a length of three codons or more (Fig. 1C). Most of the sequences are not coded by identical repeated codons. Of the 154 sequences coding for eight or more alanines, only 12% are entirely or in part coded by tracts of eight or more identical codons (Fig. 1C). Therefore, long polyalanine domains are usually coded by mixed (GCN)n sequences. Since GCGGCG and GCCGCC are complementary, their over-representation in polyalanine-coding sequences points toward a particular involvement of these GC-rich sequences in the appearance and lengthening of sequences coding for polyalanine domains.

The longest polyalanine domains are found in eukaryotes.

To establish whether polyalanine domain-containing proteins share functional roles, we identified all such proteins in three near-complete eukaryotic genomes: Homo sapiens (H. sapiens), Drosophila melanogaster (D. melanogaster) and Caenorhabditis elegans (C. elegans) (Supplementary material). Figures 2A, C and E present the distributions of polyalanine domains according to their size in each species. The number of polyalanine domains encoded in each genome is variable (Fig. 2A, C and E). Based on previous estimates of gene number in these species, polyalanine-containing proteins account for approximately 1.4%, 3.9% and 0.80% of genes in H. sapiens, D. melanogaster and C. elegans respectively (38-42). Thus polyalanine domains appear to be relatively more frequent in D. melanogaster (Fig. 2C). Interestingly, no domain longer than 20 alanines was uncovered in the analyzed genomes. This maximum length was observed in D. melanogaster ovo and asx and in H. sapiens PHOX2B. In C. elegans, the length of polyalanine domains never exceeded 14 alanines (Fig. 2E). We also established that the genome of S. cerevisiae contains only 39 polyalanine-containing proteins (about 0.65% of S. cerevisiae genes) and none had a domain of more than 10 alanines (data not shown). The number of polyalanine domains was even smaller in prokaryotes. Polyalanine-coding genes account for a mean of 0.74% of archaeal genomes and 0.84% of eubacterial genomes. In fact, 10 completely sequenced archaea and 31 completely sequenced eubacteria contained respectively 162 and 746 polyalanine-coding genes, and the size of the domains never exceeds nine alanines. Therefore, domains longer than nine alanines appear to be exclusively found in eukaryotes. Though 85% of human polyalanine-containing proteins contain a single polyalanine domain, some contain more: 48 proteins have two, 15 have three, 4 have four and 5 contain five polyalanine domains. Proteins bearing more than one domain are also found in D. melanogaster (563 have one, 108 have two, 49 have three, 18 have four, 18 have five, 1 has seven and 1 contains ten polyalanine domains) and in C. elegans (122 have one, 12 have two, 2 have three and 2 contain five polyalanine domains).

Proteins with polyalanine domains are often involved in transcription regulation.

The polyalanine-containing proteins from C. elegans, D. melanogaster and H. sapiens were grouped by high-level molecular function categories according to the Gene Ontology Consortium functional classification (http://www.geneontology.org/) (Fig. 2B, D and F). Human, Drosophila and worm proteins were divided for analysis based on a size threshold of 8 alanines (Fig. 2B, D and F). For all three organisms the “binding” molecular function is the most represented in polyalanine-containing proteins, followed by the “transcription regulator” and “enzyme” annotations. The patterns of function distribution obtained for H. sapiens, D. melanogaster and C. elegans polyalanine-containing proteins are very similar (Fig. 2B, D and F). The significant enrichment of polyalanine-containing proteins in the “binding” molecular function is mostly explained by the presence of DNA-binding transcription regulators (Fig. 2B, D and F). The organization of the ontology tree leads to this observation: the ontologies are organized so that one low-level function can be linked to more than one high-level annotation. Transcription regulators account for 7%, 9% and 10% of the complete annotated proteomes of H. sapiens, D. melanogaster and C. elegans respectively (Fig. 2B, D and F). Strikingly, 36% of H. sapiens, 35% of D. melanogaster and 25% of C. elegans polyalanine-containing proteins bearing a domain of 5-7 alanines are annotated as “transcription regulators”. The latter functional group is significantly enriched in both the 5-7 alanines and >7 alanines polyalanine-containing proteins compared to whole proteomes for the three organisms (Fig. 2B, D and F). The groups of proteins with domains of >7 alanines contain an even greater proportion of transcription regulators (Fig. 2B, D and F), but this increased proportion is only significant for D. melanogaster polyalanine-containing proteins. It is also noteworthy that polyalanine-containing proteins having a protein binding function often bind transcription factors (data not shown). Finally, the number of genes annotated as “enzyme” is significantly smaller in 5-7 alanines and >7 alanines polyalanine-containing proteins from H. sapiens and D. melanogaster (Fig. 2B and D).

Polyalanine domains appeared by convergent evolution in different protein families.

Polyalanine domains are more frequent in certain protein families. For example, polyalanine domains are present in 17% of proteins from the homeobox superfamily (35 out of 212 homeobox proteins retrieved with PSI-BLAST in the human RefSeq database) (43, 44). To investigate the appearance and conservation of polyalanine domains within protein lineages, we first established which families had many members in our set of proteins. There are 21 protein families represented by three or more members in the human data set (CLCN, DMRT, GATA, FOX, GPR, HOX, IRX, KCN, MAPK, NRG, PBX, POU, PPP1R, PRDM, RPL, SLC, SMARC, SOX, TBX, TLE and ZIC), thirteen of which are transcription factor families. The over-representation of polyalanine domains in certain protein families raises the possibility that they either appeared in a common ancestral protein or were acquired independently. Polyalanine domains have been studied mostly in proteins of the HOX family (33). It has been suggested that all homeobox DNA-binding proteins may have a common origin and form a single superfamily (43, 45, 46). In vertebrates, 4 HOX gene clusters (A to D) are believed to have arisen by duplications of a single ancestral cluster followed by divergence (47-50). Phylogenetic analysis suggests that polyalanine domains appeared recently and in only some orthologs within paralogous groups of the HOX family (Fig. 3A). For example, phylogenetic analysis of the paralogous groups HOXA13 and HOXD13 does not support a shared common ancestral origin for their polyalanine domains but an independent appearance specific to the mammalian lineage (Fig. 3A, in red and green). The same phenomenon is observed for the paralogs HOXA11 and HOXD11 (Fig. 3A, yellow and purple). In addition to these paralogs, HOXD8 has also acquired a polyalanine domain. This phenomenon of independent appearance is also observed in the GATA zinc-finger family, where mammalian GATA6, GATA4 and GATA1 have acquired a polyalanine sequence independently (Fig. 3C). Convergence is not only observed at the level of paralogous groups; it is also seen between orthologs, as exemplified by even-skipped proteins from various species (Fig. 3E). In this particular case, a domain appeared in all vertebrate evx2 (Fig. 3E, blue branches) and a second domain is unique to mammalian evx2 (Fig. 3E, red). A third event occurred only in Drosophila eve among all known insect even-skipped proteins (Fig. 3E, green). Finally, a fourth event of polyalanine appearance is specific to nematode VAB-7 (Fig. 3E, yellow). The same phenomenon of evolutionary convergence within orthologous proteins was also observed in runt and engrailed (data not shown). This further supports the independent convergent appearance of polyalanine domains in proteins sharing a similar functional role. It is noteworthy that polyalanine domains appeared only in mammals in the HOXD8, HOXA11, HOXD11, HOXA13, HOXD13, GATA1, GATA4 and GATA6 phylogenetic trees.

The possible convergent evolutionary appearance of polyalanine domains in proteins of the same family is further supported by their different positions relative to functional domains or to other conserved regions. Figure 3B depicts the position of polyalanine insertions (arrowheads) within sequence alignments of the HOX13 and HOX11 paralogous groups. The same was observed in an alignment of GATA4 with GATA6 (Fig. 3D). This phenomenon is even more clearly illustrated by the N-terminal position of the polyalanine domain in vab-7, the ortholog of even-skipped in nematodes, compared to the C-terminal insertions in vertebrate evx2 and Drosophila eve (Fig. 3F). The position of polyalanine domains relative to the DNA-binding domain is clearly variable between transcription factor superfamilies. For example, 89% of polyalanine domains are located in the N-terminus in homeobox-containing proteins, 71% are in the C-terminus in SOX-containing proteins and 73% are in the C-terminus in forkhead box domain-containing proteins.

Polyalanine domains are longer in mammals than in other vertebrates.

The independent appearance of polyalanine domains in related proteins and the observation that polyalanine domains arose mainly in mammals (Fig. 3A and C) led us to assess polyalanine conservation between human and mouse (Mus musculus; M. musculus), human and chicken (Gallus gallus; G. gallus) and human and zebrafish (Danio rerio; D. rerio). Figure 4A depicts the high level of conservation of polyalanine domain length between H. sapiens and M. musculus. Figures 4B and C show that polyalanine domains are poorly conserved between H. sapiens and G. gallus and between H. sapiens and D. rerio. The Pearson correlation coefficient (R) between polyalanine lengths in human and in each of the three species decreases from M. musculus to D. rerio (Fig. 4A-C). The percentage of identity of pairwise BLAST alignments does not seem to correlate with the conservation of polyalanine tracts. Even highly conserved protein segments (80-100% identity; open triangles) do not share the same polyalanine length (Fig. 4A-C).
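For readers who want to reproduce this kind of comparison, the following is a minimal sketch of the Pearson correlation coefficient computed on paired polyalanine lengths in orthologous proteins. The function name and the length values are invented for illustration and are not data from this study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical polyalanine lengths (in alanines) for five ortholog pairs.
human_lengths = [10, 15, 17, 18, 25]
other_lengths = [10, 14, 17, 16, 24]
print(round(pearson_r(human_lengths, other_lengths), 3))
```

A value close to 1 indicates that domain lengths in the two species rise and fall together; it says nothing about whether the lengths are equal, which is why mean length differences are a useful complement.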
Polyalanine size is clearly larger on average in mammals than in other vertebrates. In fact, the mean polyalanine length difference between H. sapiens and M. musculus, H. sapiens and G. gallus, and H. sapiens and D. rerio was respectively -1.06 ± 2.22 alanines, -4.53 ± 3.79 alanines and -6.10 ± 2.22 alanines. Figure 4D illustrates the changes in size of 28 polyalanine domains in 23 proteins sharing orthologs in human, chicken and zebrafish. Polyalanine domains in chicken and zebrafish proteins are overall shorter than their orthologs in human. Only two out of the 28 domains studied are conserved in the three organisms (NRF1 and MAF) and only one is longer in zebrafish than in human and chicken (PBX1) (Fig. 4D).

DISCUSSION

This study identified by data mining 494 known or predicted human proteins containing a total of 604 polyalanine domains longer than 4 alanines. We demonstrate that sequences coding for polyalanine domains are frequently polymorphic in the human genome. In fact, 31 out of 98 tested polyalanine-coding sequences were polymorphic. We established that the presence of a GCN repeat longer than 5 codons or of a sequence coding for a polyalanine stretch longer than 13 alanines is a good predictor of polymorphism. Based on these criteria, 82 polyalanine-coding sequences of our set are likely to be polymorphic in human. Of these, 28 were not tested because either their genomic sequence was not available or they fell under our 8 alanines size threshold for screening. Polymorphism of polyalanine-coding sequences is probably more widespread, because some shorter sequences were also found to have more than one allele and rarer polymorphisms may have been missed because of the limited size of our DNA panel (84 chromosomes). The observation that rare polymorphisms exist in four of the nine genes with known disease-causing mutations further emphasizes the importance of screening similar sequences in diverse pathologies. Furthermore, the finding that the minor allele of PABPN1 carried by 2% of the population can act both as a modulator of severity for dominant OPMD and as a recessive OPMD mutation raises the possibility that the rare polymorphisms observed in this study may play similar roles in other traits (1). The selection of these polyalanine-coding genes as candidates for numerous conditions is facilitated by their known chromosomal location in most cases.

It has been shown that other types of mutations in HOXD13, HOXA13, RUNX2, ZIC2 and FOXL2 cause phenotypes similar to expansions of the polyalanine domain in these proteins (2, 4-6, 51). In this light, of particular interest are 10 genes with polyalanine domains in which other types of mutations have been identified in human: APC, ELN, EMX2, FOXE1, ZIC3, AGPS, AR, IRS2, OFD1 and TBX3 (http://www.ncbi.nlm.nih.gov/omim). Interestingly, mouse knock-outs for Hoxd13, Hoxa13 and Runx2 show phenotypes similar to human polyalanine expansions in these genes (52-55). This phenotypic overlap between knock-out mice and human disease allows at least 34 human genes with sequences coding for domains of more than 7 alanines, and for which knock-out models exist, to be flagged as prime candidate genes for human diseases with phenotypic overlap. For example, this raises the possibility that polyalanine expansions in SLC12A2 (17 alanines in man) may cause deafness in humans, as point mutations have been found in the Shaker-with-syndactylism (sy) deaf mouse strain (56). Interestingly, this strain also has syndactyly, a feature found in two of the already described polyalanine diseases (56). The ages of onset of the diseases caused by expansions of polyalanine domains found to date suggest that such mutations may be involved in traits and diseases that span human life. Further studies will establish the functional importance of these relatively rare polymorphisms, but they already allow the identification of a large set of mapped genes as prime candidates for disease-causing mutations, considering that the chromosomal localizations are available for the majority of human genes with sequences coding for polyalanine domains.

The mechanism responsible for polyalanine-coding sequence polymorphisms or mutations is still debated. Two hypotheses have been proposed: expansions/contractions of repeated sequences through slippage during replication, and deletions/duplications of mixed (GCN)n sequences due to unequal crossovers during meiosis. Though our sequence data demonstrate that most changes in size are due to changes in the number of repeated codons (77%, 24/31), only unequal crossover can explain the presence of both expanded repeat size and mixed (GCN)n sequence mutations at the same locus (57). This work clearly shows that GCG codons are significantly over-represented in polyalanine-coding sequences. In fact, GCG codons are on average three times more frequent in polyalanine-coding sequences than in control coding sequences. Furthermore, GCG and GCC repeats are more represented in polyalanine-coding sequences than repeats of GCA or GCT. These results, together with the observation that most polyalanine polymorphisms are associated with GCG or GCC repeats, suggest that these codons may be particularly important in the appearance and polymorphism of polyalanine domains. GCGGCG and GCCGCC being complementary, they can form stable hairpins or other higher-order structures that may play a key role in this process and explain the relative abundance of GCG/CGC and GCC/CGG repeats in polymorphic sequences coding for polyalanine. It has been shown by many groups that the biophysical properties of the GCC/CGG repeats observed in fragile X mental retardation make them good substrates for replication slippage. In fact, oligonucleotides consisting of CGG repeats (corresponding to GCG polyalanine repeats) fold in vitro either as stable Hoogsteen-bonded tetrahelical structures or as hairpins (58-62). It is also noteworthy that, except for Friedreich ataxia, diverse frames of GCN/CGN triplet reiteration are the substrates for all other pathogenic triplet repeat mutations (58). The observation that CGG, CCG and CGA repeats are in fact prone to proliferation in protein-coding sequences also points to their importance in the appearance and expansion of amino acid reiterations (63).

Using Gene Ontology molecular function classes, we grouped polyalanine-containing proteins from three phylogenetically distant organisms. It is striking that polyalanine proteins in the three eukaryotes studied are mostly involved in transcription regulation. Proteins with polyalanine domains >7 alanines are even more often involved in transcription than proteins with shorter domains. Transcription regulators are at least five times more abundant in polyalanine-containing proteins than in the human proteome. This enrichment of polyalanine domains in proteins sharing the same molecular function in distantly related organisms indicates that polyalanine domains probably participate in protein function. Previous studies have indeed suggested that polyalanine domains play a role in transcriptional repression (18-23, 64, 65). It is however unknown how the hydrophobic polyalanine domains participate in transcription. It is also noteworthy that the relative importance of each molecular function was similar between human and Drosophila.

This study also demonstrates that polyalanine domains have appeared independently in paralogous genes. Homeobox-containing proteins account for the largest single functional group of polyalanine proteins: 49/494 in human and 43/758 in Drosophila. It is believed that all homeoboxes descend from the same ancestor gene (43, 45, 46). We have shown that polyalanine domains have appeared independently in many of these proteins, often in a different position relative to their homeobox domain. The repeated observation of this phenomenon in many other protein families leads us to conclude that evolutionary pressure has led to this convergence. Convergent molecular evolution has been postulated as an explanatory principle for the presence of polyalanine domains in spider silks and in mollusk shell and tendon (35). In these two examples polyalanine domains have appeared in very distant species, suggesting that the acquired functional properties were responsible for the evolutionary pressure and led to convergence at the molecular level. Molecular convergence was also invoked in the appearance of a repetitive sequence in antifreeze glycoproteins shared between phylogenetically distant Antarctic notothenioid fishes and arctic cod (66). Another example of molecular convergence was documented for glycine-rich domains in cyanobacterial and eukaryotic RNA binding proteins (67). These various examples illustrate that repeated codons may be good substrates for convergent molecular evolution. These sequences probably promote their own variation, allowing for the appearance of new domains.

Which evolutionary pressure is responsible for the convergent appearance of polyalanine domains in eukaryotes is unknown. As previously discussed, polyalanine domains are frequent in transcription regulators and they may play a role in protein-protein interactions (3). Our hierarchical classification of polyalanine-containing proteins involved in “binding” uncovered that many were transcription factors, while another important group was involved in transcription factor binding. These observations led us to search the literature for interaction data involving polyalanine-containing proteins. We found that many proteins with polyalanine domains interact with each other. For example, GATA4 and GATA6 interact to regulate cardiac development (68). The most impressive polyalanine protein network uncovered is the gro/TLE transcription corepressor family (TLE1, 3 and 4), which interacts with the polyalanine-containing transcription factors RUNX2, FOXD1/2/3, EN1, EVX2, PAX7, IRX3, NKX6.1, NKX2D, HEY2 and UTX1 to form a regulatory network involved in various aspects of vertebrate development (69-79). It was further found that HEY2 homo- and heterodimerizes with HAND1 and HAND2, which also contain a polyalanine domain (80). This polyalanine protein network can be enlarged with the interaction of EN1 with PBX1, 2 and 3 (81). TLEs interact with the eh1 motif of many of these proteins and the role of the polyalanine domains in these interactions has not been studied. Stable polyalanine hydrophobic protein-protein interactions could play a role, considering the known biophysical properties of polyalanine stretches (28-31). It is possible that polyalanine protein-protein interfaces may stabilize interactions between transcription regulators. Thus, new or more stable protein-protein interactions introduced by the appearance of polyalanine domains contribute to the evolutionary convergence observed in numerous protein families.

In most cases, when convergent evolution is observed, polyalanine domains are exclusively found in the mammalian lineage. Fast evolution of polyalanine domains in mammals has previously been reported for HOXA13 and POU class III genes (32, 33). We thus carried out a study of polyalanine length conservation among vertebrates. We comprehensively assessed the conservation of polyalanine domains between human and three vertebrate species and observed good conservation between human and mouse but poor conservation between human and chicken and between human and zebrafish. We also followed the evolution of 28 individual polyalanine domains and found that, except for three of them, all were shorter in chicken and zebrafish with respect to human domain lengths. The clear trend is one of increase in the polyalanine domain sizes between H. sapiens > G. gallus > D. rerio. Together, these data argue for a lengthening of most polyalanine domains in terrestrial vertebrates and especially in the mammalian lineage.

In conclusion, the roles of polyalanine domains in evolution, development and diseases clearly need to be further investigated. The boundaries between monogenic and multifactorial traits are becoming blurred by the identification of genetic modifiers such as the (GCG)7 recessive PABPN1 allele in OPMD (1, 82, 83). Mutations causing polyalanine expansions in some of the 494 human proteins identified are likely to contribute to monogenic or polygenic traits having onsets at all stages of life. Therefore, gene and protein databases should be annotated to identify the occurrence of such domains and their variations. This study emphasizes the importance of screening sequences coding for polyalanine domains for mutations and functional polymorphisms, and of designing strategies to establish the function and evolutionary role of these domains.
MATERIALS AND METHODS

Identification of proteins with polyalanine domains and alanine codon usage analysis. We used a program consisting of a regular expression search to retrieve the ID and polyalanine length of proteins containing one or more polyalanine domains (ActivePerl 5.6.1; http://www.activeperl.com). The protein databases of H. sapiens, M. musculus, G. gallus, D. rerio, D. melanogaster and C. elegans available through NCBI's taxonomy resource were analyzed (updated the 1st of September 2001; http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/). Sequence redundancy was eliminated by automatic means using cd-hi, and the data set was manually curated by using unique identifiers provided by LocusLink and RefSeq for H. sapiens (http://www.ncbi.nlm.nih.gov/RefSeq/; http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/), FlyBase for D. melanogaster (http://flybase.bio.indiana.edu/) and WormBase for C. elegans (http://www.wormbase.org/) (36, 84-86). The polyalanine size threshold of 5 was based on the observation that the smallest polymorphic sequences previously observed were the 5-alanine domains of MICA and GPX1, and on a study of polyglutamine size evolution (12, 13, 87). Chromosomal localization, Unigene identifiers, and representative protein and mRNA sequences for human loci coding for polyalanine were retrieved by using the Stanford SOURCE retrieval tool (88; http://genome-www5.stanford.edu/cgi-bin/SMD/source/sourceSearch). Contigs of Unigene ESTs for tested genes were constructed by using LASERGENE2000 Seqman software. All data were stored in a Microsoft Access database and most are provided as supplemental information at http://www.genome.org. Alanine codon usage was determined by using a Perl script (ActivePerl 5.6.1; http://www.activeperl.com) that extracted the ORF from each mRNA sequence and counted the alanine codons in the ORF sequence. The control gene set was obtained from the RefSeq database (http://www.ncbi.nlm.nih.gov/RefSeq/; Supplementary material). A one-way ANOVA was used to test the null hypothesis of no difference between polyalanine and control alanine codon usage.

Detection of polymorphisms in human sequences coding for polyalanine domains. Genomic DNA from 42 unrelated participants in our OPMD study was genotyped. All had given signed informed consent. Twenty-six were carriers of a dominant PABPN1 OPMD mutation while the others were healthy family members. The 124 human proteins with polyalanine domains of 8 or more alanines were chosen for the polymorphism search. The genomic sequence was retrieved by tBLASTn for 117 of these genes (135 domains). Primers based on these genomic sequences were designed using LASERGENE2000 PrimerSelect software. Fragments were amplified by PCR with either 7-deaza-dGTP or betaine at a 1 M concentration (1, 89). Amplicons were run on a 5% polyacrylamide denaturing gel. Gels were dried and autoradiograms were obtained (1). Samples that demonstrated two alleles were reamplified and the purified PCR products were sequenced using an ABI Prism automated DNA sequencer (Applied Biosystems).

Functional annotations of proteins, phylogenetic analyses and pairwise comparisons of polyalanine domains. Polyalanine-containing proteins were functionally classified by using molecular function annotations available through the Gene Ontology Consortium web page (http://www.geneontology.org/). These annotations were based on EBI annotations for human, FlyBase for D. melanogaster and WormBase for C. elegans (86, 90, 91). Given the structure of the ontology tree, a single protein could have more than one annotation (http://www.geneontology.org/). Chi-square calculations were used to test for statistically significant differences in molecular function distributions between whole proteomes and proteins with polyalanine domains.

The identification of all known proteins from a specific family was achieved by searching the human RefSeq database with PSI-BLAST (blastpgp), using the consensus sequence for the motifs of interest as a query sequence (BLAST version 2.1.2) (43, 44, 92). Known protein sequences from families of interest were retrieved by running BLASTp against the GenBank database. Sequences obtained were aligned with the CLUSTAL W program and the alignments were manually edited (93). For tree construction, a distance approach was used (PHYLIP 3.6) (93). PROTDIST was used to calculate a distance matrix (94). A gamma distribution was applied to correct for unequal rates of change at different amino acid positions. The tree topology was finally inferred using WEIGHBOR (95). The final unrooted trees were edited using Treetool. Bootstrapping was carried out (1000 replicates).

Pairwise comparisons of polyalanine domains between vertebrate species were done using BLASTp with a threshold E value of 1e-20 and a threshold alignment length of 200 amino acids (BLAST version 2.1.2) (44). We assessed the conservation of the polyalanine tract by blasting a query file containing all human polyalanine proteins against a database of all proteins from the species of interest (mouse, chicken and zebrafish). The reciprocal search was also done by running BLAST with three files containing all polyalanine proteins of the three species against a database of all available human proteins.
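The regular-expression search and the alanine codon tally described under "Identification of proteins with polyalanine domains and alanine codon usage analysis" can be illustrated with a minimal sketch. It is not the authors' Perl program. It is a Python illustration under stated assumptions: a polyalanine domain is taken to be an uninterrupted run of at least 5 alanines (the size threshold given in the text), the ORF is supplied already extracted and in frame, and the function names and toy sequences are hypothetical.

```python
import re
from collections import Counter

# Assumption: a polyalanine domain is an uninterrupted run of >= MIN_RUN
# alanines (MIN_RUN = 5 mirrors the size threshold stated in the Methods).
MIN_RUN = 5
POLY_ALA = re.compile(r"A{%d,}" % MIN_RUN)
ALA_CODONS = ("GCA", "GCC", "GCG", "GCT")


def find_polyalanine(protein_seq):
    """Return (start_index, length) for each alanine run of MIN_RUN or more."""
    return [(m.start(), len(m.group()))
            for m in POLY_ALA.finditer(protein_seq.upper())]


def alanine_codon_usage(orf):
    """Count the four alanine codons in an in-frame ORF (DNA, 5'->3')."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3          # ignore a trailing partial codon
    codons = (orf[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if c in ALA_CODONS)
    return {codon: counts.get(codon, 0) for codon in ALA_CODONS}


# Hypothetical toy sequences, not taken from the study.
toy_protein = "MK" + "A" * 6 + "QL" + "A" * 4 + "P" + "A" * 10 + "G"
toy_orf = "ATG" + "GCG" * 3 + "GCC" + "GCA" + "TAA"

print(find_polyalanine(toy_protein))   # [(2, 6), (15, 10)]
print(alanine_codon_usage(toy_orf))    # {'GCA': 1, 'GCC': 1, 'GCG': 3, 'GCT': 0}
```

In the toy protein the run of four alanines is skipped because it falls below the threshold. Applying the same functions to every record of a protein or mRNA file would give the per-protein domain lengths and the codon counts that feed a comparison against a control gene set.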
The resulting pairwise alignments were manually curated for overall protein percent identity and presence of corresponding polyalanine domains between aligned protein pairs. Polyalanine domains were considered orthologous if protein sequence identity was observed on at least one side of the aligned polyalanine domains. If more than one domain was present in a pairwise BLAST alignment, they were considered separately.

ACKNOWLEDGEMENTS

We especially want to thank Franz Lang for his help and suggestions concerning phylogenies. H.L. received a scholarship from the Association française contre les myopathies (AFM). B.B. is a chercheur-boursier of the Fonds de Recherche en Santé du Québec (FRSQ). This project was supported by grants from the AFM and the Neuromuscular Research Partnership Program of the Canadian Institutes of Health Research (CIHR), the Muscular Dystrophy Association of Canada (MDAC) and the Canadian Amyotrophic Lateral Sclerosis Association.

REFERENCES

1. Brais,B., Bouchard,J.-P., Xie,Y.-G., Rochefort,D.L., Chrétien,N., Tomé,F.M.S., Lafrenière,R.G., Rommens,J.M., Eichiro,U., Osamu,N. et al. (1998) Short GCG expansions in the PABP2 gene cause oculopharyngeal muscular dystrophy. Nat. Genet., 18, 164-167.
2. Brown,S.A., Warburton,D., Brown,L.Y., Yu,C.-Y., Roeder,E.R., Stengel-Rutkowski,S., Hennekam,R.C.M. and Muenke,M. (1998) Holoprosencephaly due to mutation in ZIC2, a homologue of Drosophila odd-paired. Nat. Genet., 20, 180-183.
3. Muragaki,Y., Mundlos,S., Upton,J. and Olsen,B.R. (1996) Altered growth and branching patterns in synpolydactyly caused by mutations in HOXD13. Science, 272, 548-551.
4. Goodman,F.R., Bacchelli,C., Brady,A.F., Brueton,L.A., Fryns,J.P., Mortlock,D.P., Innis,J.W., Holmes,L.B., Donnenfeld,A.E., Feingold,M. et al. (2000) Novel HOXA13 mutations and the phenotypic spectrum of hand-foot-genital syndrome. Am. J. Hum. Genet., 67, 197-202.
5. Mundlos,S., Otto,F., Mundlos,C., Mulliken,J.B., Aylsworth,A.S., Albright,S., Lindhout,D., Cole,W.G., Henn,W., Knoll,J.H.M. et al. (1997) Mutations involving the transcription factor CBFA1 cause cleidocranial dysplasia. Cell, 89, 773-779.
6. Crisponi,L., Deiana,M., Loi,A., Chiappe,F., Uda,M., Amati,P., Bisceglia,L., Zelante,L., Nagaraja,R., Porcu,S. et al. (2001) The putative forkhead transcription factor FOXL2 is mutated in blepharophimosis/ptosis/epicanthus inversus syndrome. Nat. Genet., 27, 159-166.
7. Johnson,K.R., Sweet,H.O., Donahue,L.R., Ward-Bailey,P., Bronson,R.T. and Davisson,M.T. (1998) A new spontaneous mouse mutation of Hoxd13 with a polyalanine expansion and phenotype similar to human synpolydactyly. Hum. Mol. Genet., 7, 1033-1038.
8. Stromme,P., Mangelsdorf,M.E., Shaw,M.A., Lower,K.M., Lewis,S.M., Bruyere,H., Lutcherath,V., Gedeon,A.K., Wallace,R.H., Scheffer,I.E. et al. (2002) Mutations in the human ortholog of Aristaless cause X-linked mental retardation and epilepsy. Nat. Genet., 30, 441-445.
9. Laumonnier,F., Ronce,N., Hamel,B.C., Thomas,P., Lespinasse,J., Raynaud,M., Paringaux,C., Van Bokhoven,H., Kalscheuer,V., Fryns,J.P. et al. (2002) Transcription factor SOX3 is involved in X-linked mental retardation with growth hormone deficiency. Am. J. Hum. Genet., 71, 1450-1455.
10. Amiel,J., Laudier,B., Attie-Bitach,T., Trang,H., De Pontual,L., Gener,B., Trochet,D., Etchevers,H., Ray,P., Simonneau,M. et al. (2003) Polyalanine expansion and frameshift mutations of the paired-like homeobox gene PHOX2B in congenital central hypoventilation syndrome. Nat. Genet., 33, 459-461.
11. Goodman,F.R., Mundlos,S., Muragaki,Y., Donnai,D., Giovannucci-Uzielli,M.L., Lapi,E., Majewski,F., McGaughran,J., McKeown,C., Reardon,W. et al.
(1997) Synpolydactyly phenotypes correlate with size of expansions in HOXD13 polyalanine tract. Proc. Natl. Acad. Sci. U. S. A., 94, 7458-7463.
12. Mizuki,N., Ota,M., Kimura,M., Ohno,S., Ando,H., Katsuyama,Y., Yamazaki,M., Watanabe,K., Goto,K., Nakamura,S. et al. (1997) Triplet repeat polymorphism in the transmembrane region of the MICA gene: a strong association of six GCT repetitions with Behcet disease. Proc. Natl. Acad. Sci. U. S. A., 94, 1298-1303.
13. Shen,Q., Townes,P.L., Padden,C. and Newburger,P.E. (1994) An in-frame trinucleotide repeat in the coding region of the human cellular glutathione peroxidase (GPX1) gene: in vivo polymorphism and in vitro instability. Genomics, 23, 292-294.
14. Pasche,B., Luo,Y., Rao,P.H., Nimer,S.D., Dmitrovsky,E., Caron,P., Luzzatto,L., Offit,K., Cordon-Cardo,C., Renault,B. et al. (1998) Type I transforming growth factor beta receptor maps to 9q22 and exhibits a polymorphism and a rare variant within a polyalanine tract. Cancer Res., 58, 2727-2732.
15. Shriver,S.P., Shriver,M.D., Tirpak,D.L., Bloch,L.M., Hunt,J.D., Ferrell,R.E. and Siegfried,J.M. (1998) Trinucleotide repeat length variation in the human ribosomal protein L14 gene (RPL14): localization to 3p21.3 and loss of heterozygosity in lung and oral cancers. Mutat. Res., 406, 9-23.
16. Acharya,S., Wilson,T., Gradia,S., Kane,M.F., Guerrette,S., Marsischky,G.T., Kolodner,R. and Fishel,R. (1996) hMSH2 forms specific mispair-binding complexes with hMSH3 and hMSH6. Proc. Natl. Acad. Sci. U. S. A., 93, 13629-13634.
17. Hishinuma,A., Ohyama,Y., Kuribayashi,T., Nagakubo,N., Namatame,T., Shibayama,K., Arisaka,O., Matsuura,N. and Ieiri,T. (2001) Polymorphism of the polyalanine tract of thyroid transcription factor-2 gene in patients with thyroid dysgenesis. Eur. J. Endocrinol., 145, 385-389.
18. Kim,M.K., Lesoon-Wood,L.A., Weintraub,B.D. and Chung,J.H. (1996) A soluble transcription factor, Oct-1, is also found in the insoluble nuclear matrix and possesses silencing activity in its alanine-rich domain. Mol. Cell. Biol., 16, 4366-4377.
19. Licht,J.D., Grossel,M.J., Figge,J. and Hansen,U.M. (1990) Drosophila Kruppel protein is a transcriptional repressor. Nature, 346, 76-79.
20. Licht,J.D., Ro,M., English,M.A., Grossel,M. and Hansen,U. (1993) Selective repression of transcriptional activators at a distance by the Drosophila Kruppel protein. Proc. Natl. Acad. Sci. U. S. A., 90, 11361-11365.
21. Licht,J.D., Hanna-Rose,W., Reddy,J.C., English,M.A., Ro,M., Grossel,M., Shaknovich,R. and Hansen,U. (1994) Mapping and mutagenesis of the amino-terminal transcriptional repression domain of the Drosophila Kruppel protein. Mol. Cell. Biol., 14, 4057-4066.
22. Han,K. and Manley,J.L. (1993) Transcriptional repression by the Drosophila even-skipped protein: definition of a minimal repression domain. Genes Dev., 7, 491-503.
23. Han,K. and Manley,J.L. (1993) Functional domains of the Drosophila Engrailed protein. EMBO J., 12, 2723-2733.
24. Monuki,E.S., Kuhn,R. and Lemke,G. (1993) Cell-specific action and mutable structure of a transcription factor effector domain. Proc. Natl. Acad. Sci. U. S. A., 90, 9978-9982.
25. Simmons,A.H., Michal,C.A. and Jelinski,L.W. (1996) Molecular orientation and two-component nature of the crystalline fraction of spider dragline silk. Science, 271, 84-87.
26. Sudo,S., Fujikawa,T., Nagakura,T., Ohkubo,T., Sakaguchi,K., Tanaka,M., Nakashima,K. and Takahashi,T. (1997) Structures of mollusc shell framework proteins. Nature, 387, 563-564.
27. Qin,X.X., Coyne,K.J. and Waite,J.H. (1997) Tough tendons. Mussel byssus has collagen with silk-like domains. J. Biol. Chem., 272, 32623-32627.
28. Levy,Y., Jortner,J. and Becker,O.M. (2001) Solvent effects on the energy landscapes and folding kinetics of polyalanine. Proc. Natl. Acad. Sci. U. S. A., 98, 2188-2193.
29. Forood,B., Pérez-Payá,E., Houghten,R.A. and Blondelle,S.E. (1995) Formation of an extremely stable polyalanine β-sheet macromolecule. Biochem. Biophys. Res. Commun., 211, 7-13.
30. Vila,J.A., Ripoll,D.R. and Scheraga,H.A. (2000) Physical reasons for the unusual alpha-helix stabilization afforded by charged or neutral polar residues in alanine-rich peptides. Proc. Natl. Acad. Sci. U. S. A., 97, 13075-13079.
31. Blondelle,S.E., Forood,B., Houghten,R.A. and Pérez-Payá,E. (1997) Polyalanine-based peptides as models for self-associated β-pleated-sheet complexes. Biochemistry (Mosc)., 36, 8393-8400.
32. Sumiyama,K., Washio-Watanabe,K., Saitou,N., Hayakawa,T. and Ueda,S. (1996) Class III POU genes: generation of homopolymeric amino acid repeats under GC pressure in mammals. J. Mol. Evol., 43, 170-178.
33. Mortlock,D.P., Sateesh,P. and Innis,J.W. (2000) Evolution of N-terminal sequences of the vertebrate HOXA13 protein. Mamm. Genome, 11, 151-158.
34. Benoit,B., Nemeth,A., Aulner,N., Kuhn,U., Simonelig,M., Wahle,E. and Bourbon,H.M. (1999) The Drosophila poly(A)-binding protein II is ubiquitous throughout Drosophila development and has the same function in mRNA polyadenylation as its bovine homolog in vitro. Nucleic Acids Res., 27, 3771-3778.
35. Gatesy,J., Hayashi,C., Motriuk,D., Woods,J. and Lewis,R. (2001) Extreme diversity, conservation, and convergence of spider silk fibroin sequences. Science, 291, 2603-2605.
36. Wheeler,D.L., Church,D.M., Federhen,S., Lash,A.E., Madden,T.L., Pontius,J.U., Schuler,G.D., Schriml,L.M., Sequeira,E., Tatusova,T.A. et al. (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res., 31, 28-33.
37. Nakamura,Y., Gojobori,T. and Ikemura,T. (2000) Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res., 28, 292.
38. Lander,E.S., Linton,L.M., Birren,B., Nusbaum,C., Zody,M.C., Baldwin,J., Devon,K., Dewar,K., Doyle,M., FitzHugh,W. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860-921.
39. Reboul,J., Vaglio,P., Tzellas,N., Thierry-Mieg,N., Moore,T., Jackson,C., Shin-i,T., Kohara,Y., Thierry-Mieg,D., Thierry-Mieg,J. et al. (2001) Open-reading-frame sequence tags (OSTs) support the existence of at least 17,300 genes in C. elegans. Nat. Genet., 27, 332-336.
40. Gopal,S., Schroeder,M., Pieper,U., Sczyrba,A., Aytekin-Kurban,G., Bekiranov,S., Fajardo,J.E., Eswar,N., Sanchez,R., Sali,A. et al. (2001) Homology-based annotation yields 1,042 new candidate genes in the Drosophila melanogaster genome. Nat. Genet., 27, 337-340.
41. Roest Crollius,H., Jaillon,O., Bernot,A., Dasilva,C., Bouneau,L., Fischer,C., Fizames,C., Wincker,P., Brottier,P., Quetier,F. et al. (2000) Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat. Genet., 25, 235-238.
42. Ewing,B. and Green,P. (2000) Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet., 25, 232-234.
43. Banerjee-Basu,S. and Baxevanis,A.D. (2001) Molecular evolution of the homeodomain family of transcription factors. Nucleic Acids Res., 29, 3258-3269.
44. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
45. Bharathan,G., Janssen,B.J., Kellogg,E.A. and Sinha,N. (1997) Did homeodomain proteins duplicate before the origin of angiosperms, fungi, and metazoa? Proc. Natl. Acad. Sci. U. S. A., 94, 13749-13753.
46. Kappen,C., Schughart,K. and Ruddle,F.H. (1993) Early evolutionary origin of major homeodomain sequence classes. Genomics, 18, 54-70.
47. Meyer,A. and Malaga-Trillo,E.
(1999) Vertebrate genomics: More fishy tales about Hox genes. Curr. Biol., 9, R210-213.
48. Ruddle,F.H., Bartels,J.L., Bentley,K.L., Kappen,C., Murtha,M.T. and Pendleton,J.W. (1994) Evolution of Hox genes. Annu. Rev. Genet., 28, 423-442.
49. Finnerty,J.R. and Martindale,M.Q. (1998) The evolution of the Hox cluster: insights from outgroups. Curr. Opin. Genet. Dev., 8, 681-687.
50. Gehring,W.J., Affolter,M. and Burglin,T. (1994) Homeodomain proteins. Annu. Rev. Biochem., 63, 487-526.
51. Goodman,F., Giovannucci-Uzielli,M.L., Hall,C., Reardon,W., Winter,R. and Scambler,P. (1998) Deletions in HOXD13 segregate with an identical, novel foot malformation in two unrelated families. Am. J. Hum. Genet., 63, 992-1000.
52. Davis,A.P. and Capecchi,M.R. (1996) A mutational analysis of the 5' HoxD genes: dissection of genetic interactions during limb development in the mouse. Development, 122, 1175-1185.
53. Fromental-Ramain,C., Warot,X., Messadecq,N., LeMeur,M., Dolle,P. and Chambon,P. (1996) Hoxa-13 and Hoxd-13 play a crucial role in the patterning of the limb autopod. Development, 122, 2997-3011.
54. Otto,F., Thornell,A.P., Crompton,T., Denzel,A., Gilmour,K.C., Rosewell,I.R., Stamp,G.W.H., Beddington,R.S.P., Mundlos,S., Olsen,B.R. et al. (1997) Cbfa1, a candidate gene for cleidocranial dysplasia syndrome, is essential for osteoblast differentiation and bone development. Cell, 89, 765-771.
55. Zakany,J. and Duboule,D. (1996) Synpolydactyly in mice with a targeted deficiency in the HoxD complex. Nature, 384, 69-71.
56. Dixon,M.J., Gazzard,J., Chaudhry,S.S., Sampson,N., Schulte,B.A. and Steel,K.P. (1999) Mutation of the Na-K-Cl co-transporter gene Slc12a2 results in deafness in mice. Hum. Mol. Genet., 8, 1579-1584.
57. Warren,S.T. (1997) Polyalanine expansion in synpolydactyly might result from unequal crossing-over of HOXD13. Science, 275, 408-409.
58. Mitas,M. (1997) Trinucleotide repeats associated with human disease. Nucleic Acids Res., 25, 2245-2254.
59. Usdin,K. and Woodford,K.J. (1995) CGG repeats associated with DNA instability and chromosome fragility form structures that block DNA synthesis in vitro. Nucleic Acids Res., 23, 4202-4209.
60. Fry,M. and Loeb,L.A. (1994) The fragile X syndrome d(CGG)n nucleotide repeats form a stable tetrahelical structure. Proc. Natl. Acad. Sci. U. S. A., 91, 4950-4954.
61. Mariappan,S.V., Catasti,P., Chen,X., Ratliff,R., Moyzis,R.K., Bradbury,E.M. and Gupta,G. (1996) Solution structures of the individual single strands of the fragile X DNA triplets (GCC)n.(GGC)n. Nucleic Acids Res., 24, 784-792.
62. Mitas,M., Yu,A., Dill,J. and Haworth,I.S. (1995) The trinucleotide repeat sequence d(CGG)15 forms a heat-stable hairpin containing Gsyn.Ganti base pairs. Biochemistry (Mosc)., 34, 12803-12811.
63. Borstnik,B. and Pumpernik,D. (2002) Tandem repeats in protein coding regions of primate genes. Genome Res., 12, 909-915.
64. Lanz,R.B., Wieland,S., Hug,M. and Rusconi,S. (1995) A transcriptional repressor obtained by alternative translation of a trinucleotide repeat. Nucleic Acids Res., 23, 138-145.
65. Meijer,D., Graus,A. and Grosveld,G. (1992) Mapping the transactivation domain of the Oct-6 POU transcription factor. Nucleic Acids Res., 20, 2241-2247.
66. Chen,L., DeVries,A.L. and Cheng,C.H. (1997) Convergent evolution of antifreeze glycoproteins in Antarctic notothenioid fish and Arctic cod. Proc. Natl. Acad. Sci. U. S. A., 94, 3817-3822.
67. Maruyama,K., Sato,N. and Ohta,N. (1999) Conservation of structure and cold-regulation of RNA-binding proteins in cyanobacteria: probable convergent evolution with eukaryotic glycine-rich RNA-binding proteins. Nucleic Acids Res., 27, 2029-2036.
68. Charron,F., Paradis,P., Bronchain,O., Nemer,G. and Nemer,M. (1999) Cooperative interaction between GATA-4 and GATA-6 regulates myocardial gene expression. Mol. Cell. Biol., 19, 4355-4365.
69. McLarren,K.W., Theriault,F.M. and Stifani,S. (2001) Association with the nuclear matrix and interaction with Groucho and RUNX proteins regulate the transcription repression activity of the basic helix loop helix factor Hes1. J. Biol. Chem., 276, 1578-1584.
70. Jimenez,G., Paroush,Z. and Ish-Horowicz,D. (1997) Groucho acts as a corepressor for a subset of negative regulators, including Hairy and Engrailed. Genes Dev., 11, 3072-3082.
71. Tolkunova,E.N., Fujioka,M., Kobayashi,M., Deka,D. and Jaynes,J.B. (1998) Two distinct types of repression domain in engrailed: one interacts with the groucho corepressor and is preferentially active on integrated target genes. Mol. Cell. Biol., 18, 2804-2814.
72. Araki,I. and Nakamura,H. (1999) Engrailed defines the position of dorsal di-mesencephalic boundary by repressing diencephalic fate. Development, 126, 5127-5135.
73. Kobayashi,M., Goldstein,R.E., Fujioka,M., Paroush,Z. and Jaynes,J.B. (2001) Groucho augments the repression of multiple Even skipped target genes in establishing parasegment boundaries. Development, 128, 1805-1815.
74. Muhr,J., Andersson,E., Persson,M., Jessell,T.M. and Ericson,J. (2001) Groucho-mediated transcriptional repression establishes progenitor cell pattern and neuronal fate in the ventral neural tube. Cell, 104, 861-873.
75. Briscoe,J., Pierani,A., Jessell,T.M. and Ericson,J. (2000) A homeodomain protein code specifies progenitor cell identity and neuronal fate in the ventral neural tube. Cell, 101, 435-445.
76. Aronson,B.D., Fisher,A.L., Blechman,K., Caudy,M. and Gergen,J.P. (1997) Groucho-dependent and -independent repression activities of Runt domain proteins. Mol. Cell. Biol., 17, 5581-5587.
77. Grbavec,D. and Stifani,S. (1996) Molecular interaction between TLE1 and the carboxyl-terminal domain of HES-1 containing the WRPW motif. Biochem. Biophys. Res. Commun., 223, 701-705.
78. Grbavec,D., Lo,R., Liu,Y., Greenfield,A. and Stifani,S. (1999) Groucho/transducin-like enhancer of split (TLE) family members interact with the yeast transcriptional co-repressor SSN6 and mammalian SSN6-related proteins: implications for evolutionary conservation of transcription repression mechanisms. Biochem. J., 337, 13-17.
79. Grbavec,D., Lo,R., Liu,Y. and Stifani,S. (1998) Transducin-like Enhancer of split 2, a mammalian homologue of Drosophila Groucho, acts as a transcriptional repressor, interacts with Hairy/Enhancer of split proteins, and is expressed during neuronal development. Eur. J. Biochem., 258, 339-349.
80. Firulli,B.A., Hadzic,D.B., McDaid,J.R. and Firulli,A.B. (2000) The basic helix-loop-helix transcription factors dHAND and eHAND exhibit dimerization characteristics that suggest complex regulation of function. J. Biol. Chem., 275, 33567-33573.
81. Peltenburg,L.T. and Murre,C. (1996) Engrailed and Hox homeodomain proteins contain a related Pbx interaction motif that recognizes a common structure present in Pbx. EMBO J., 15, 3385-3393.
82. Rubinsztein,D.C., Leggo,J., Chiano,M., Dodge,A., Norbury,G., Rosser,E. and Craufurd,D. (1997) Genotypes at the GluR6 kainate receptor locus are associated with variation in the age of onset of Huntington disease. Proc. Natl. Acad. Sci. U. S. A., 94, 3872-3876.
83. Hayes,S., Turecki,G., Brisebois,K., Lopes-Cendes,I., Gaspar,C., Riess,O., Ranum,L.P., Pulst,S.M. and Rouleau,G.A. (2000) CAG repeat length in RAI1 is associated with age at onset variability in spinocerebellar ataxia type 2 (SCA2). Hum. Mol. Genet., 9, 1753-1758.
84. Li,W., Jaroszewski,L. and Godzik,A. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282-283.
85. Consortium,T.F.
(2003) The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res., 31, 172-175.
86. Harris,T.W., Lee,R., Schwarz,E., Bradnam,K., Lawson,D., Chen,W., Blasier,D., Kenny,E., Cunningham,F., Kishore,R. et al. (2003) WormBase: a cross-species database for comparative genomics. Nucleic Acids Res., 31, 133-137.
87. Alba,M.M., Santibanez-Koref,M.F. and Hancock,J.M. (1999) Conservation of polyglutamine tract size between mice and humans depends on codon interruption. Mol. Biol. Evol., 16, 1641-1644.
88. Diehn,M., Sherlock,G., Binkley,G., Jin,H., Matese,J.C., Hernandez-Boussard,T., Rees,C.A., Cherry,J.M., Botstein,D., Brown,P.O. et al. (2003) SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res., 31, 219-223.
89. Henke,W., Herdel,K., Jung,K., Schnorr,D. and Loening,S.A. (1997) Betaine improves the PCR amplification of GC-rich DNA sequences. Nucleic Acids Res., 25, 3957-3958.
90. Camon,E., Magrane,M., Barrell,D., Binns,D., Fleischmann,W., Kersey,P., Mulder,N., Oinn,T., Maslen,J., Cox,A. et al. (2003) The Gene Ontology Annotation (GOA) Project: Implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res., 13, 662-672.
91. Consortium,T.G.O. (2001) Creating the gene ontology resource: design and implementation. Genome Res., 11, 1425-1433.
92. Schaffer,A.A., Aravind,L., Madden,T.L., Shavirin,S., Spouge,J.L., Wolf,Y.I., Koonin,E.V. and Altschul,S.F. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res., 29, 2994-3005.
93. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.
94. Felsenstein,J. (2001) Taking Variation of Evolutionary Rates Between Sites into Account in Inferring Phylogenies. J. Mol. Evol., 53, 447-455.
95. Bruno,W.J., Socci,N.D. and Halpern,A.L. (2000) Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol., 17, 189-197.

LEGENDS TO FIGURES

Figure 1: General characteristics of tested (A) and all human polyalanine coding sequences (B and C). Polymorphic (closed symbols) and non-polymorphic (open symbols) polyalanine-coding sequences were plotted for their longest run of a single codon and their overall polyalanine length. The rectangle delimits a region out of which 62% of polymorphic and 11% of non-polymorphic sequences are present. All polyalanine-coding sequences from H. sapiens were retrieved from mRNA sequence and their codon usage (B) and repeat content (C) were analyzed. B) Distribution of the usage of the four alanine codons, in percent, according to the length of the polyalanine domain. The GCG codon is significantly over-represented in polyalanine-coding sequences compared to its normal occurrence, while GCA and GCT codons are sometimes under-represented (asterisk: p<0.005; one-way ANOVA). The grey zone delimits a region defining the mean codon usage in control sequences (black line) ±1 standard deviation. The alanine codon usage given by the codon usage database (Nakamura et al. 2000) is depicted by a dotted line. C) GCG and GCC codons are far more frequently repeated than GCA and GCT codons in polyalanine-coding sequences.

Figure 2: Polyalanine domain-containing proteins are enriched in some Gene Ontology molecular functions. Distribution of length of polyalanine domains retrieved in H. sapiens, D. melanogaster and C. elegans (A, C and E). Bar diagram illustrating the molecular functions of polyalanine-containing proteins and whole proteomes according to Gene Ontology Consortium annotations in H. sapiens, D. melanogaster and C. elegans (B, D and F). Empty bars correspond to whole proteome annotations; gray bars, to proteins containing polyalanine domains 5-7 alanines long; and black bars, to polyalanine domains >7 alanines. All functions shown on the graph contribute to at least 2% of the whole human proteome. Significant over-representation of functional categories in proteins containing polyalanine domains of 5-7 or >7 alanines compared to the whole proteome is denoted by an asterisk, while significant over-representation of functional categories in proteins having a domain >7 alanines compared to the 5-7 alanines group is denoted by a plus sign (p<0.001; Chi-square calculation).

Figure 3: Polyalanine domains appeared independently in many protein phylogenies. The polyalanine domains arose independently in members of the HOX (A) and GATA (C) protein families. Convergence was also observed over longer evolutionary time, as exemplified by the EVX proteins phylogeny (E). Note that vertebrate evx2, Drosophila eve and nematode VAB-7 all contain a polyalanine tract that they acquired independently. For A and B, a core tree was generated using the alignment of the conserved domains only and sub-trees were generated with alignments of the full-length protein sequences. Sites of insertion of polyalanine domains are shown as arrowheads on schematic representations of alignments of HOX, GATA and EVX (B, D and F). All phylogenies were generated by the neighbor-joining method based on the alignment of conserved residues of homeobox, zinc-finger and homeobox domains for HOX, GATA and EVX respectively. Bootstrap values in percent (1000 replicates) are indicated at the nodes.
Species abbreviations are as follows: Agama agama (Aag), Ambystoma mexicanum (Ame), Bombyx mori (Bmo), Branchiostoma floridae (Bfl), Caenorhabditis elegans (Cel), Charina trivirgata (Ctr), Cupiennius salei (Csa), Danio aequipinnatus (Dae), Danio rerio (Dre), Drosophila melanogaster (Dme), Eumeces inexpectatus (Ein), Gallus gallus (Gga), Heterodontus francisci (Hfr), Homo sapiens (Hsa), Latimeria chalumnae (Lch), Monodelphis domestica (Mdo), Mus musculus (Mmu), Notophthalmus viridescens (Nvi), Petromyzon marinus (Pma), Pristionchus pacificus (Ppa), Rattus norvegicus (Rno), Sceloporus undulatus (Sun), Schistocerca americana (Sam), Silurana tropicalis (Str), Tribolium castaneum (Tca), Typhlops richardi (Tri) and Xenopus laevis (Xla).

Figure 4: Conservation of polyalanine domain length between human and three vertebrate species. Polyalanine domain length in H. sapiens proteins was plotted against the corresponding polyalanine domain length in M. musculus (A), G. gallus (B) and D. rerio (C). Each polyalanine domain pair was associated with a protein pair. The polyalanine domains were ranked by the percentage of identity of their associated protein pair in BLAST alignments and were given a distinct symbol corresponding to the category into which they fell (50-59%, 60-69%, 70-79%, 80-89% and 90-100% identity). N corresponds to the number of polyalanine domain pairs between two species and R is the Pearson correlation coefficient between polyalanine sizes in human and in the species of interest. D) depicts the fate of 28 polyalanine domains found in 23 different proteins that share an ortholog in H. sapiens, G. gallus and D. rerio. Series are labelled according to human protein name.

TABLES

Table I: Polymorphisms observed in gene sequences coding for domains of 8 or more alanines.
In bold, the number of repeats or repetition of a mixed (GCN)n sequence present in the more common allele.

Gene name   ID⁴    Polyalanine  Variation  Number of  Frequency of the      Polymorphic sequence      Number of
                   length       type⁵      alleles    most frequent allele                            repeats⁶
SRP14²      6727   9            E          2          99%                   GCA                       9, 10
SOX21       11166  13           E          2          99%                   GCC                       6, 8
FOXF2       2295   9            E          2          99%                   GCC                       9, 10
C14orf4     64207  10           C          2          99%                   GCC                       8, 9
ZIC2        7546   15           d          2          99%                   GCTGCGGCGGCGGCG           1, 0
ATBF1²      463    13           C          2          99%                   GCG                       4, 0
HOXA13      3209   18           d          2          98%                   GCTGCCGCGGCTGCCGCT        1, 0
PABPN1      8106   10           E          2          98%                   GCG                       6, 7
TBL1³       6907   8            C          2          98%                   GCG                       5, 4
FBS1        64319  19           i          2          98%                   GCAGCG                    1, 2
HOXD11³     3237   13           E          2          97%                   GCG                       6, 7
SLC12A2     6558   15           C          2          97%                   GCG                       5, 7
ZIC3        7547   10           E          2          97%                   GCC                       8, 9
PHOX2B³     8929   20           C          2          97%                   GCG                       4, 3
SGCB        6443   8            E          2          97%                   GCG                       4, 5
MAZ²        4150   9            d          3          95%                   GCGGCAGCA                 1, 0
ZNF358      55136  17           d          2          95%                   GCTGCTGCAGCTGCAGCGGCCGCG  0, 1
MAP3K4²     4216   9            E          2          93%                   GCT                       9, 10
SOX1        6656   8            C          2          92%                   GCG                       3, 5
RPL14¹,²    9045   8            E          2          91%                   GCT                       8, 9
HLXB9²      3110   16           C          2          90%                   GCC                       9, 11
RUNX2       860    17           d          2          88%                   GCGGCGGCTGCGGCGGCG        1, 0
FOXD1       2297   8            E          2          87%                   GCC                       4, 5
DMRTA2      63950  9            C          2          69%                   GCG                       6, 8
KCNN2       3781   8            E          2          66%                   GCG                       4, 5
EOMES       8320   14           E          2          65%                   GCC                       7, 9
CC1.3²      9584   12           C          3          63%                   GCA                       5, 6, 8
TGFBR1¹,³   7046   9            E/C        3          63%                   GCG                       6, 9, 10
PRDM12      59335  12           E/C        4          55%                   GCC                       5, 6, 8
FOXE1¹,²    2304   14           E/C        3          55%                   GCC                       7, 9, 11
MSH3¹,²     4437   12           D/d        4          53%                   GCTGCAGCG                 1, 4, 5, 6

¹ Gene sequence previously found to be polymorphic; ² gene sequence predicted to be polymorphic based on EST sequence alignment; ³ region that could not be sequenced; ⁴ LocusLink ID; ⁵ Expansion/Contraction (E/C) and Deletion/duplication (D/d); and ⁶ number of units of repeated sequence responsible for each allele.
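The two quantities plotted in Figure 1A, and underlying the length and repeat columns of Table I (overall polyalanine length and longest run of a single codon), can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `split_codons` and `polyalanine_stats` are hypothetical, the input is assumed to be an in-frame coding sequence, and only the longest alanine tract is reported.

```python
from itertools import groupby

# The four codons that encode alanine (GCN).
ALANINE_CODONS = {"GCA", "GCC", "GCG", "GCT"}


def split_codons(cds):
    """Split an in-frame coding sequence into codons, dropping any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def polyalanine_stats(cds):
    """Describe the longest alanine tract encoded by cds.

    Returns (tract length in alanines, codon forming the longest
    uninterrupted single-codon run, length of that run).
    Hypothetical helper for illustration only.
    """
    best_tract = []
    # Group consecutive codons by whether they encode alanine.
    for is_ala, group in groupby(split_codons(cds), key=lambda c: c in ALANINE_CODONS):
        tract = list(group)
        if is_ala and len(tract) > len(best_tract):
            best_tract = tract
    if not best_tract:
        return 0, None, 0
    # Within the tract, measure each run of one identical codon.
    runs = [(len(list(group)), codon) for codon, group in groupby(best_tract)]
    run_len, run_codon = max(runs, key=lambda r: r[0])
    return len(best_tract), run_codon, run_len


# Made-up example sequence (not taken from any gene in Table I):
# six GCG codons, one interrupting GCA, then three more GCG codons.
example = "ATG" + "GCG" * 6 + "GCA" + "GCG" * 3 + "GGG"
print(polyalanine_stats(example))
```

For the made-up sequence above, the tract is 10 alanines long and its longest single-codon run is six GCG codons; a pure single-codon expansion or contraction would change both values by the same amount.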
FIGURES

ABBREVIATIONS