PHYLOGENETIC ANALYSIS OF THE INOSITOL SYNTHESIS PATHWAYS VIA SUPERMATRIX APPROACH

By Derek Dashti and Matthew Chin

Introduction

Inositols are cyclohexanehexol molecules that serve primarily as second messengers in cellular signaling pathways. In particular, inositol derivatives take part in numerous cellular membrane trafficking pathways. All eukaryotes use inositols and their derivatives for membrane, cytosolic, and other diverse cellular signaling processes [1]. The metabolism of inositols into functionally distinct derivatives is important for eukaryotes. For example, the hydrolysis of Ins(1,3,4,5)P4 could reduce Ras GAP activity and thus induce the Ras pathway for cellular proliferation and differentiation [2]. However, the metabolic pathways that synthesize certain inositol derivatives arose through evolution from those of deep ancestral eukaryotes and prokaryotes. Thus, understanding inositol derivative metabolism and function requires studying the ancestral species that contain inositols.

Archaea, bacteria, and eukarya all contain inositol derivatives, but the distribution of these compounds varies among the three domains. For example, the metabolism of inositol pyrophosphates for vesicular trafficking occurs mostly in plants, amoebozoa, fungi, and animals [3]. Nevertheless, the enzymes that synthesize these inositol pyrophosphates have remained highly conserved throughout evolution. To study the function of specific inositol derivatives, the distribution of these molecules in both primitive and modern eukaryotes and prokaryotes must be addressed. The literature [1] illustrates 10 evolutionarily conserved inositol derivatives among 11 different species ranging from eukaryotes to prokaryotes.
Therefore, to illustrate the evolutionary relationships among organisms with distinct inositol derivatives, we computed a phylogenetic tree showing how earlier species contributed to the evolution of inositol derivative metabolism in more modern, complex eukaryotes.

Methods

Reconstruction of species and gene trees can be divided into two general categories: whole-genome based methods and sequence based methods. Sequence based methods can be further subdivided into the supermatrix and supertree approaches. The supermatrix approach concatenates individually aligned orthologous genes of interest into a single supermatrix, which is then used to construct a species tree. This approach is the most frequently used because it has a high tolerance for missing data, so species with partially sequenced or incomplete genomes can be included in the phylogeny without affecting the accuracy of the results. Including more species in the phylogenetic analysis gives a better representation of species diversity and could lead to a more resolved tree. The supermatrix approach is also widely used because of its relatively low computational cost [4]. When constructing a species tree, only orthologous genes should be used. However, because of our limited knowledge of ortholog identification, we could only verify that the genes were homologous.

The first step was to identify seed sequences. This was accomplished by searching the Gene Ontology web server (http://www.geneontology.org/) for the different parts of the pathway. Genes that were annotated with the correct ontology and existed in humans were chosen as the seed sequences. After all 10 seed sequences were selected from humans, a BLAST [5] search was performed on these sequences to identify a broad range of homologs. The E-value cutoff was set very high to ensure that all possible homologs were identified.
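The effect of a permissive E-value cutoff can be illustrated with a minimal sketch. It assumes the BLAST hits are available in the standard 12-column tab-delimited format (E-value in column 11, bit score in column 12); the text does not state which output format or cutoff value was used, so the names `BlastHit`, `parse_tabular_hits`, and `filter_by_evalue`, the sample hits, and both cutoff values are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class BlastHit(NamedTuple):
    query_id: str
    subject_id: str
    evalue: float
    bitscore: float


def parse_tabular_hits(lines: Iterable[str]) -> List[BlastHit]:
    """Parse 12-column tab-delimited BLAST hits (assumed format)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append(
            BlastHit(fields[0], fields[1], float(fields[10]), float(fields[11]))
        )
    return hits


def filter_by_evalue(hits: List[BlastHit], max_evalue: float) -> List[BlastHit]:
    """Keep hits whose E-value does not exceed the cutoff.

    A larger cutoff is more permissive: it admits weaker matches.
    """
    return [hit for hit in hits if hit.evalue <= max_evalue]


# Hypothetical hits for one human seed sequence.
sample = [
    "seed1\tyeast_prot\t45.0\t200\t110\t3\t1\t200\t5\t204\t1e-50\t180.0",
    "seed1\tplant_prot\t28.0\t150\t108\t6\t10\t160\t20\t170\t2e-05\t48.5",
    "seed1\tbact_prot\t22.0\t90\t70\t4\t30\t120\t15\t105\t3.5\t27.1",
]
hits = parse_tabular_hits(sample)
strict = filter_by_evalue(hits, 1e-10)    # keeps only the strongest match
permissive = filter_by_evalue(hits, 10.0)  # keeps distant candidates too
```

With the strict cutoff only the first hit survives, while the permissive cutoff keeps all three. The permissive set is expected to contain false positives, which is why a second filtering step is needed.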
To retain only the hits that were true homologs, a hidden Markov model (HMM) was built from each seed sequence using the w0.5 program, and the BLAST hits were scored against the model using the hmmscore program. These programs are all installed on the DECF machines. The scores of the homologs for each gene in the pathway were manually inspected, and only the top hits for each organism were included in the multiple sequence alignment (MSA). The sequences were aligned using MUSCLE [6]. After the homologs were aligned, the generated MSAs were examined in the Belvu alignment viewer. Each MSA was processed to remove columns with low information content and any portions of sequences that did not align well.

A gene supermatrix was made by concatenating the MSAs of the genes from the pathway using the geneConcat.pl program (see Appendix). Each MSA was checked to ensure that its sequences were all the same length, with the length of the first sequence arbitrarily taken as the correct length. All the MSAs were read once to extract the taxon IDs, which were written into a dictionary that eventually contained the taxon IDs of all the represented species. The alignments were then reread and each gene was concatenated to the previous gene. If a taxon present in the dictionary did not appear in a single-gene MSA, gaps were inserted in place of the missing gene. The final concatenated MSA was edited to remove any columns that had low confidence levels.

The next step was the conversion of the concatenated FASTA MSA into PHYLIP format using readseq [7] with option 12 (PHYLIP format) for the output file format. The PHYLIP MSA was submitted to protdist [8] using default settings to generate a file of pairwise distances. The neighbor [8] program was then run with default settings to create a neighbor-joining tree in Newick format from the pairwise distance matrix. The trees were all viewed and edited using FigTree [9].
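The concatenation procedure described above can be sketched in a few lines. This is not the geneConcat.pl program itself (which is in Perl and only referenced in the Appendix); it is a minimal Python illustration that assumes each MSA has already been parsed into a mapping from taxon ID to aligned sequence. The names `Alignment`, `alignment_length`, and `concatenate`, along with the toy sequences, are hypothetical.

```python
from typing import Dict, List

Alignment = Dict[str, str]  # taxon ID -> aligned sequence for one gene


def alignment_length(msa: Alignment) -> int:
    """Return the alignment width, using the first sequence as the reference."""
    if not msa:
        raise ValueError("empty alignment")
    lengths = [len(seq) for seq in msa.values()]
    reference = lengths[0]
    if any(length != reference for length in lengths):
        raise ValueError("sequences in one alignment differ in length")
    return reference


def concatenate(msas: List[Alignment], gap: str = "-") -> Alignment:
    """Concatenate single-gene alignments into one supermatrix."""
    # First pass: collect every taxon ID seen in any alignment.
    parts: Dict[str, List[str]] = {}
    for msa in msas:
        for taxon in msa:
            parts.setdefault(taxon, [])

    # Second pass: append each gene, padding absent taxa with gaps.
    for msa in msas:
        width = alignment_length(msa)
        for taxon, pieces in parts.items():
            pieces.append(msa.get(taxon, gap * width))

    return {taxon: "".join(pieces) for taxon, pieces in parts.items()}


# Toy example: two genes, three taxa, each gene missing one taxon.
gene_a = {"human": "MKT", "yeast": "MRT"}
gene_b = {"human": "AC-G", "arabidopsis": "ACTG"}
supermatrix = concatenate([gene_a, gene_b])
# supermatrix["human"]       == "MKTAC-G"
# supermatrix["yeast"]       == "MRT----"
# supermatrix["arabidopsis"] == "---ACTG"
```

Reading the alignments twice keeps the logic simple: the full set of taxa is known before any gene is appended, so every row of the supermatrix ends up the same length regardless of which genes a taxon is missing.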
A script, gene_wrapper.sh (see Appendix), was written to automate and link all of these steps.

To obtain support for the supermatrix tree, bootstrapping was performed on the final concatenated supermatrix. Seqboot [8], also part of the PHYLIP package, was used to conduct the bootstrapping. Seqboot was run with all the default options except for the number of replicates. Protdist and neighbor were then run on the seqboot output. Consense [10] took these neighbor-joining trees and generated a bootstrap consensus tree. In addition to the supermatrix tree, individual trees were also generated for each gene in the pathway.

Results:

Figure 1 shows the tree generated when all the genes in the pathway are concatenated into the supermatrix. Figure 2 shows the bootstrapped consensus tree of all the genes in the pathway. Figure 3 shows the individual gene tree (InsPase) for a part of the pathway that is incongruent with the overall supermatrix tree. Arabidopsis and Cryptosporidium are paired together in Figure 3, whereas Arabidopsis is correctly paired with Toxoplasma in the supermatrix tree of Figure 1.

Discussion:

The supermatrix tree and the consensus tree support each other and are similar to the expected phylogenies of these species. However, when the individual gene trees are examined, some of them are incongruent with the supermatrix tree. The largest inconsistencies appear in the tree for the InsPase gene. As hypothesized in the Michell paper, the InsPase gene might have arisen through some sort of lateral gene transfer event [1]. Analysis of the individual gene tree provides a possible explanation for how the lateral gene transfer event occurred. The main difference between the InsPase tree and the supermatrix tree is that Arabidopsis is now clustered with the apicomplexans. The single gene tree therefore suggests a closer relationship between the apicomplexans and the green plants.
The supermatrix tree, on the other hand, suggests that the pathway as a whole supports a closer relationship between the green plants and the metazoans. This means that different parts of the pathway have different evolutionary histories.

Endosymbiotic gene transfer (EGT) events are one possible explanation for this discrepancy. Two main EGTs are hypothesized to have occurred in the apicomplexan lineage [11], as shown in Figure 4. Over time, lateral gene transfer from endosymbiont to host could lead to differential loss of redundant genes in the genome. Since host and endosymbiont would each have had a copy of the gene, one of the copies could have been lost over the course of evolution. In the case of the apicomplexans, it appears this gene loss was random: sometimes the host’s gene was lost and sometimes the red algal endosymbiont’s gene was lost. This would explain the incongruence of the InsPase gene tree, namely the close relationship between the green plants and the apicomplexans. Under this hypothesis, for this specific gene of the pathway, the apicomplexans lost the host gene and kept the red algal endosymbiont’s gene. Red algae and green algae share a common ancestor, and most plants arose from green algae. Hence, in the phylogenetic analysis, the tree for the InsPase gene placed Arabidopsis right in the middle of the apicomplexans.

In conclusion, the species trees generated from concatenated genes seem to be fairly robust. Both bootstrapping and gene tree clustering supported the concatenated gene topology. One reason for the incongruence between some of the individual gene trees and the species tree is that some genes in the pathway do not share the same evolutionary history as the rest of the genes. Endosymbiotic gene transfer is the most plausible explanation that we could propose for the incongruence we observed in the topology of one of the genes (InsPase) in the inositol pathway.

Sources:

1) Michell, H. R. Inositol derivatives: evolution and functions. Nature Rev. Mol. Cell Biol. 9, 151-161 (2008).
2) Majerus, W. P. Inositols do it all. Rev. Genes & Dev. 10, 1051-1053 (1996).
3) Bennett, M. Inositol pyrophosphates: metabolism and signaling. Rev. Cell. Mol. Life Sci. 63, 552-564 (2006).
4) Delsuc, F., Brinkmann, H., Philippe, H. Phylogenomics and the reconstruction of the tree of life. Nature Reviews Genetics 6(5), 361-375 (2005).
5) Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403-410 (1990).
6) Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32(5), 1792-1797 (2004).
7) Pearson, W. R., Lipman, D. J. Improved tools for biological sequence analysis. PNAS 85, 2444-2448 (1988).
8) Felsenstein, J. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5, 164-166 (1989).
9) Rambaut, A. FigTree (Version 1.2). http://tree.bio.ed.ac.uk/software/figtree/ (2007).
10) Castresana, J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution 17, 540-552 (2000).
11) Keeling, P. Comment on "The Evolution of Modern Eukaryotic Phytoplankton". Science 306, 2191 (2004).