BioE131_FinalPaper

PHYLOGENETIC ANALYSIS OF THE INOSITOL SYNTHESIS PATHWAYS VIA
SUPERMATRIX APPROACH
By Derek Dashti and Matthew Chin
Introduction
Inositols are cyclohexanehexol molecules that serve primarily as second
messengers in cellular signaling pathways. In particular, inositol derivatives participate in
numerous cellular membrane trafficking pathways. All eukaryotes use inositols and
their derivatives for membrane, cytosolic, and other diverse cellular signaling processes [1].
The metabolism of inositols into functionally distinct derivatives is important for
eukaryotes: for example, the hydrolysis of Ins(1,3,4,5)P4 could reduce Ras GAP activity and
thus induce the Ras pathway for cellular proliferation and differentiation [2]. The metabolic
pathways that synthesize certain inositol derivatives, however, were inherited and modified
from deep ancestral eukaryotes and prokaryotes. Thus, in order to understand
inositol derivative metabolism and function, the ancestral species that contain
inositols should be studied.
Archaea, bacteria, and eukarya all contain inositol derivatives, but the distribution of
such compounds varies among these three domains of life. For example, the metabolism of
inositol pyrophosphates for vesicular trafficking occurs mostly in plants, amoebozoa, fungi,
and animals [3]. Nevertheless, the enzymes that synthesize
these inositol pyrophosphates have been highly conserved throughout evolution. In order to
study the function of specific inositol derivatives, the distribution of such molecules in
both primitive and modern eukaryotes and prokaryotes must be addressed. The literature [1]
describes 10 evolutionarily conserved inositol derivatives across 11 different species ranging
from eukaryotes to prokaryotes. Therefore, to illustrate the evolutionary relationships
among organisms with distinct inositol derivatives, we computed a phylogenetic tree
that traces how the metabolism of inositol derivatives in more complex modern eukaryotes
evolved from that of earlier species.
Methods
Reconstruction of species and gene trees can be divided into two general categories:
whole-genome-based methods and sequence-based methods. Sequence-based methods can
be further subdivided into the supermatrix and supertree approaches. The supermatrix
approach concatenates individually aligned orthologous genes
of interest into a single supermatrix, which is then used to construct a species tree. This
approach is the most frequently used because it has a high tolerance for missing data, so
species with partially sequenced or incomplete genomes can be included in the phylogeny
without affecting the accuracy of the results. Including more species in the phylogenetic
analysis allows for a better representation of species diversity and could lead to a more
resolved tree. The supermatrix approach is also widely used because of its relatively low
computational cost [4].
When constructing a species tree, only orthologous genes should be used. However,
because of our limited knowledge of ortholog identification, we could only verify that
the genes were homologous. The first step was to identify seed sequences. This was
accomplished by searching the Gene Ontology web server (http://www.geneontology.org/)
for the different parts of the pathway. The genes that were annotated with the correct
ontology and existed in humans were chosen as the seed sequences. After all 10 seed
sequences had been selected from humans, a BLAST [5] search was performed on these
sequences to identify a broad range of homologs. The E-value cutoff was set very high
(that is, permissively) to ensure that all possible homologs were identified. To retain only
the hits that were truly homologs, a Hidden Markov
Model was created from the seed sequence and scored against the BLAST hits, using w0.5
(software) to build the model and hmmscore (software) to score the hits. These programs are
all installed on the DECF machines. The scores of the homologs for each of the genes in the
pathway were manually inspected, and only the top hits for each organism were included
in the MSA. The sequences were aligned using MUSCLE [6]. After the homologs were aligned,
the resulting MSAs were examined in the Belvu alignment viewer. Each MSA was processed to
remove any columns with low information content, as well as any portions of sequences
that did not align well.
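The column-filtering step can be sketched in Python. This is only an illustration, not the procedure used in this study, where the MSAs were edited by hand in Belvu. The function name drop_gappy_columns and the 50% gap threshold are assumptions made for the example, and gap fraction is used here as a simple stand-in for "low information content".

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Return a copy of the alignment without columns that are mostly gaps.

    alignment maps taxon name -> aligned sequence (all the same length).
    max_gap_fraction is a hypothetical threshold chosen for illustration.
    """
    names = list(alignment)
    if not names:
        return {}
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")

    # Keep a column only if its fraction of gap characters is small enough.
    keep = [
        i for i in range(length)
        if sum(seq[i] == "-" for seq in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


example = {
    "human": "MK-LV",
    "yeast": "MR-LV",
    "arabidopsis": "M--LI",
}
trimmed = drop_gappy_columns(example)
# The all-gap third column is removed; the second column (one gap in three) is kept.
```

A stricter or looser threshold changes how many columns survive, so the value would need to be chosen by inspecting the alignments, as was done manually here.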
A gene supermatrix was made by concatenating the MSAs of the genes from the
pathway using the geneConcat.pl program (see Appendix). Each MSA was first checked to
ensure that its sequences were all the same length, taking the length of the first sequence
(an arbitrary choice) as the reference. All the MSAs were read once to extract the
taxon IDs from the alignments, and the taxon IDs were written into a dictionary that
eventually contained the taxon IDs of all the represented species. The alignments were
then reread, and each gene was concatenated to the previous gene. If a taxon that was
present in the dictionary did not appear in a single-gene MSA, gaps were inserted in
place of the missing gene. The final concatenated MSA was edited to remove any columns
that had low confidence levels. Next, the concatenated FASTA MSA was converted
into Phylip format using readseq [7] with option 12 (Phylip format) for the output file
format. The Phylip MSA was submitted to protdist [8] with default settings to generate a
file of pairwise distances. Neighbor [8] was then run with default settings to create
a neighbor-joining tree in Newick format from the pairwise distance matrix. The trees were
all viewed and edited using FigTree [9]. A script, gene_wrapper.sh (see Appendix), was
written to automate and link all of these steps.
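The two-pass concatenation described above can be sketched in Python as follows. This is a minimal illustration of the logic and not the geneConcat.pl program itself. The function name concatenate_alignments and the toy sequences are assumptions, and each alignment is assumed to have already been parsed into a mapping from taxon ID to aligned sequence.

```python
def concatenate_alignments(gene_alignments):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments is a list of dicts, each mapping taxon ID -> aligned
    sequence for one gene. Taxa missing from a gene are padded with gaps.
    """
    # First pass: collect every taxon ID seen in any alignment, in order.
    taxa = []
    for alignment in gene_alignments:
        for taxon in alignment:
            if taxon not in taxa:
                taxa.append(taxon)

    # Second pass: append each gene block, or a run of gaps of equal width.
    supermatrix = {taxon: "" for taxon in taxa}
    for alignment in gene_alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one MSA must share one length")
        width = lengths.pop()
        for taxon in taxa:
            supermatrix[taxon] += alignment.get(taxon, "-" * width)
    return supermatrix


gene_a = {"human": "MKL", "yeast": "MRL"}
gene_b = {"human": "AC", "arabidopsis": "AG"}
combined = concatenate_alignments([gene_a, gene_b])
# combined == {"human": "MKLAC", "yeast": "MRL--", "arabidopsis": "---AG"}
```

Every row of the result has the same length, which is the property that the downstream distance and tree-building programs rely on.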
To obtain support for the supermatrix tree, bootstrapping was performed on the final
concatenated supermatrix. Seqboot [8], which is part of the PHYLIP package, was used to
conduct the bootstrapping. Seqboot was run with all the default options except for the
number of replicates. Protdist and neighbor were then run on the seqboot
output. Consense [10] took the resulting neighbor-joining trees and generated a bootstrap
consensus tree. In addition to the supermatrix tree, individual gene trees were also
generated for each gene in the pathway.
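The resampling idea behind bootstrapping can be sketched in Python. This is a simplified illustration and not the seqboot implementation: each replicate draws alignment columns with replacement until it has as many columns as the original supermatrix. The function names, the fixed seed, and the replicate count of 100 are assumptions made for the example.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Build one replicate by sampling alignment columns with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][i] for i in columns) for name in names}


def bootstrap_replicates(alignment, n_replicates=100, seed=1):
    """Return a list of resampled alignments (n_replicates is illustrative)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [bootstrap_replicate(alignment, rng) for _ in range(n_replicates)]


toy = {"human": "MKLAC", "yeast": "MRL--", "arabidopsis": "---AG"}
replicates = bootstrap_replicates(toy, n_replicates=10)
```

A distance matrix and neighbor-joining tree would then be computed for each replicate, and the fraction of replicate trees that contain a given clade serves as its bootstrap support.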
Results:
Figure 1 shows the tree generated when all the genes in the pathway are concatenated
into the supermatrix.
Figure 2 shows the bootstrapped consensus tree of all the genes in the pathway.
Figure 3 shows the individual gene tree (InsPase) for a part of the pathway that is
incongruent with the overall supermatrix tree. Arabidopsis and Cryptosporidium are paired
together in Figure 3, whereas Arabidopsis is correctly paired with Toxoplasma in the
supermatrix tree of Figure 1.
Discussion:
The supermatrix tree and the consensus tree support each other and are similar to
the expected phylogenies of these species. However, when the individual gene trees are
examined, some of them are incongruent with the supermatrix tree. The single-gene tree
with the largest inconsistencies is the tree for the InsPase gene. As
hypothesized in the Michell paper, the InsPase gene might have arisen through some sort of
lateral gene transfer event [1]. Analysis of the individual gene tree provides a possible
explanation for how the lateral gene transfer event occurred. The main difference between
the InsPase tree and the supermatrix tree is that Arabidopsis is now clustered with the
apicomplexans. The single-gene tree suggests a closer relationship between the
apicomplexans and the Green Plants. The supermatrix tree, on the other hand, shows that the
overall pathway suggests a closer relationship between the Green Plants and the Metazoans.
This means that the different parts of the pathway have different evolutionary histories.
Endosymbiotic gene transfer (EGT) events are one possible explanation for this discrepancy.
Two main EGTs are hypothesized to have occurred in the apicomplexan lineage [11], as shown
in Figure 4.
Over time, lateral gene transfer from endosymbiont to host could eventually lead to
differential loss of redundant genes in the genome. Since both host and endosymbiont
would each have a copy of the gene, one of the copies could have been lost over the course of
evolution. In the case of the Apicomplexans, it appears this gene loss was random: sometimes
the host’s gene was lost and sometimes the Red Algae endosymbiont’s gene was lost. This
would explain the incongruence of the InsPase gene tree, namely the close
relationship between the Green Plants and the Apicomplexans. For this specific gene of the
pathway, the Apicomplexans would have lost the host gene and kept the endosymbiont Red
Algae’s gene. Red Algae and Green Algae share a common ancestor, and most plants arose
from Green Algae. Hence, in the phylogenetic analysis, the tree for the InsPase gene placed
Arabidopsis right in the middle of the Apicomplexans.
In conclusion, the species trees that are generated from concatenated genes seem to
be fairly robust. Both bootstrapping and gene tree clustering supported the concatenated
gene topology. One reason for the incongruence between some of the individual
gene trees and the species tree is that some genes in the pathway do not share
the same evolutionary history as the rest of the genes. Endosymbiotic gene transfer is the
most plausible explanation that we could propose for the incongruence we observed in the
topology of one of the genes (InsPase) in the inositol pathway.
Sources:
1) Michell H. R. (2008) Inositol derivatives: evolution and functions. Nature Rev. Mol. Cell
Biol. 9: 151-161.
2) Majerus W. P. (1996) Inositols do it all. Rev. Genes & Dev. 10: 1051-1053.
3) Bennett M. (2006) Inositol pyrophosphates: metabolism and signaling. Rev. Cell. Mol. Life
Sci. 63: 552-564.
4) Delsuc F., Brinkmann H., Philippe H. (2005) Phylogenomics and the Reconstruction
of the Tree of Life. Nature Reviews Genetics 6(5): 361-375.
5) Altschul S. F., Gish W., Miller W., Myers E. W., Lipman D. J. (1990) Basic local
alignment search tool. J. Mol. Biol. 215: 403-410.
6) Edgar R. C. (2004) MUSCLE: multiple sequence alignment with high accuracy
and high throughput. Nucleic Acids Research 32(5): 1792-1797.
7) Pearson W. R., Lipman D. J. (1988) Improved Tools for Biological Sequence
Analysis. PNAS 85: 2444-2448.
8) Felsenstein J. (1989) PHYLIP - Phylogeny Inference Package (Version 3.2).
Cladistics 5: 164-166.
9) Rambaut A. (2007) FigTree (Version 1.2). http://tree.bio.ed.ac.uk/software/figtree/
10) Castresana J. (2000) Selection of conserved blocks from multiple alignments for
their use in phylogenetic analysis. Molecular Biology and Evolution 17: 540-552.
11) Keeling P. (2004) Comment on "The Evolution of Modern Eukaryotic
Phytoplankton". Science 306: 2191.