Text S17: Alien Gene Fragments

Lothar Wissler1, Fabian Zimmer1, Olgert Denas2, James Taylor2,3, Nicole M. Gerardo3, and Erich
Bornberg-Baur1
1Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
2Department of Mathematics and Computer Science, Emory University, Atlanta GA, United States of
America
3Department of Biology, Emory University, Atlanta GA, United States of America
Recent work in the pea aphid has demonstrated the horizontal transfer from fungi of
an alien gene involved in carotenoid production [1]. Given the tight obligate
association that Atta cephalotes has with its fungus and other microbes in the fungus
garden, we investigated alien genes in its genome that it may have acquired from its
fungal cultivar. To do this, we screened the publicly available insect genome sequences
to find domains that occur in Atta but no other insect species. The insect dataset
included Aedes aegypti (L1.49) [2], Anopheles gambiae (P3.49) [3], Apis mellifera (OGS
r.2) [4], Bombyx mori (r1.0) [5], Culex pipiens (r1.2), Drosophila melanogaster (r5.11)
[6], D. ananassae (r1.3) [7], D. erecta (r1.3) [7], D. grimshawi (r1.3) [7], D. mojavensis
(r1.3), D. pseudoobscura (r2.3) [7], D. persimilis (r1.3) [7], D. sechellia (r1.3) [7], D.
simulans (r1.3) [7], D. willistoni (r1.3) [7], D. virilis (r1.2) [7], D. yakuba (r1.3) [7],
Nasonia vitripennis (OGS r.1) [8], and Tribolium castaneum (51906) [9]. Using this A.
cephalotes-specific set, we then derived an estimate for the species distribution of each
Pfam domain [10] to find candidates of HGT from fungi, viruses, bacteria, and archaea.
These candidate domains were obtained by searching the full NCBI non-redundant
protein dataset (11,238,375 entries) against Pfam-A. For each protein sequence in
NCBI-nr, we reconstructed the lineage of the host species using NCBI taxonomy data.
This allowed us to compute the frequency distribution of each domain across the nodes
of interest: Bacteria, Archaea, Viruses, Eukaryotes/Fungi, and Eukaryotes/others.
Finally, those domains that are almost exclusive to one of these nodes (frequency >
95%) were compared against the Atta unique domain set obtained in the first part of this
analysis. From this analysis, we identified 11 candidate proteins predicted to originate
from viral, bacterial, archaeal, and fungal sources (Table 1). Because A. cephalotes has
an intimate association with both fungus and bacteria, we investigated in greater detail
those domains identified to have originated from these two sources.
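The two-step screen described above can be sketched as follows. This is a minimal illustration that assumes the Pfam hits have already been parsed into plain Python collections; the function names, the mapping from lineage to node, and the toy accessions and counts are all hypothetical and are not the pipeline used in the study.

```python
from collections import Counter, defaultdict


def classify_lineage(lineage):
    """Map an NCBI-style lineage (list of taxon names) to a node of interest.

    The mapping rules are an assumption made for this sketch.
    """
    names = set(lineage)
    if "Bacteria" in names:
        return "Bacteria"
    if "Archaea" in names:
        return "Archaea"
    if "Viruses" in names:
        return "Viruses"
    if "Eukaryota" in names:
        return "Eukaryotes/Fungi" if "Fungi" in names else "Eukaryotes/others"
    return None  # unclassified sequences are ignored


def species_specific_domains(domains_by_species, focal):
    """Domains found in the focal species but in no other species of the dataset."""
    others = set()
    for species, domains in domains_by_species.items():
        if species != focal:
            others |= set(domains)
    return set(domains_by_species[focal]) - others


def node_exclusive_domains(hits, threshold=0.95):
    """Return {domain: node} for domains with more than `threshold` of hits in one node.

    `hits` is an iterable of (domain_accession, lineage) pairs, one per match.
    """
    counts = defaultdict(Counter)
    for domain, lineage in hits:
        node = classify_lineage(lineage)
        if node is not None:
            counts[domain][node] += 1
    exclusive = {}
    for domain, by_node in counts.items():
        node, n = by_node.most_common(1)[0]
        if n / sum(by_node.values()) > threshold:
            exclusive[domain] = node
    return exclusive


# Toy input (hypothetical accessions and counts, for illustration only).
domains_by_species = {
    "Atta cephalotes": {"PF_A", "PF_B", "PF_C"},
    "Apis mellifera": {"PF_C"},
    "Drosophila melanogaster": {"PF_C", "PF_D"},
}
hits = (
    [("PF_A", ["cellular organisms", "Bacteria", "Proteobacteria"])] * 99
    + [("PF_A", ["cellular organisms", "Eukaryota", "Metazoa"])] * 1
    + [("PF_B", ["cellular organisms", "Eukaryota", "Fungi"])] * 6
    + [("PF_B", ["cellular organisms", "Bacteria"])] * 4
)

atta_unique = species_specific_domains(domains_by_species, "Atta cephalotes")
exclusive = node_exclusive_domains(hits)
# Candidates: Atta-unique domains that are almost exclusive to one node.
candidates = {d: node for d, node in exclusive.items() if d in atta_unique}
```

In this toy data only PF_A survives both filters: it is absent from the other insects and 99% of its hits fall under Bacteria, whereas PF_B is Atta-unique but spread across two nodes.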
The predicted fungal domain PF06011 belongs to a family of transient receptor
potential (TRP) ion channels that is essential for cellular viability and involved in cell
growth and cell wall synthesis. However, we could not obtain any Gene Ontology (GO) terms
for the matching full-length Atta cephalotes protein sequence using Blast2GO [11],
and matching the protein sequence against NCBI’s non-redundant protein database did
not yield any significant hits. A best match analysis of the TRP domain against NCBI
revealed proteins in Neosartorya fischeri, Yarrowia lipolytica, and Aspergillus species.
All these fungal proteins are longer than 700 residues and fully match the domain model
of 575 residues. In contrast, the Atta protein sequence is short (131 residues),
has no methionine at the first position, and matches only about 100 residues of the
middle part of the domain model, suggesting that it is incomplete at both the beginning
and the end.
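The completeness argument above can be made explicit with a small check. This is a sketch only: the helper name is hypothetical, the placeholder sequence is not the real protein, and the match coordinates (240 to 339) are invented to stand for "about 100 residues of the middle part", since the text does not report the actual coordinates. Only the model length (575) and protein length (131) come from the text.

```python
def truncation_report(protein_seq, hmm_from, hmm_to, model_length):
    """Summarise signs that a gene model covers only part of a domain model.

    hmm_from and hmm_to are 1-based, inclusive coordinates of the match on
    the domain model.
    """
    matched = hmm_to - hmm_from + 1
    return {
        "starts_with_met": protein_seq.startswith("M"),
        "model_coverage": matched / model_length,
        "missing_model_start": hmm_from - 1,
        "missing_model_end": model_length - hmm_to,
    }


# Placeholder 131-residue sequence without an initial methionine (not real data).
placeholder_seq = "A" * 131
# Illustrative coordinates: a 100-residue match in the middle of a 575-residue model.
report = truncation_report(placeholder_seq, hmm_from=240, hmm_to=339, model_length=575)
```

A low `model_coverage` combined with large `missing_model_start` and `missing_model_end` values and a missing start methionine is the pattern the paragraph describes for a truncated gene model.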
A second fungal domain, PF11710, was also detected in the A.
cephalotes proteome. The corresponding protein, known as Git3, is one of six proteins required
for glucose-triggered adenylate cyclase activation. It is a G protein-coupled receptor
responsible for the activation of adenylate cyclase through Gpa2, a heterotrimeric G
protein alpha subunit that is part of the glucose-detection pathway. Git3 contains seven
predicted transmembrane domains, a third cytoplasmic loop and a cytoplasmic tail. The
PF11710 model matches the conserved N-terminal part of the protein. BLAST hits of
the Atta protein sequence (ACEP_00010692) in NCBI are all insignificant, and no GO
terms could be obtained. Interestingly, in two other A. cephalotes proteins
(ACEP_00011098, ACEP_00016598), we find very weak matches to the Git3 C-terminal
domain (Git3_C, or PF11970). Although these signals are below the
significance threshold, they could represent a divergent Git3 protein sequence that has
been split within the A. cephalotes genome, where the N-terminal (Git3) and the
C-terminal (Git3_C) parts are now located in different A. cephalotes proteins. The best
matches to PF11710 in NCBI are to proteins in Saccharomyces cerevisiae,
Kluyveromyces lactis, and Debaryomyces hansenii.
The bacterial protein domain PF10138 was detected in the A.
cephalotes proteome. Members of this family confer resistance to the metalloid
element tellurium and its salts. The model of PF10138 is relatively short (99 residues),
and the second half of the model is matched by one A. cephalotes protein
(ACEP_00011691), which again seems to be incomplete. It lacks a methionine at the first
position and is only 67 residues in length. To properly resemble a member of this Pfam
family, the gene model should be about 40 residues longer at the beginning. For the A.
cephalotes protein sequence, we could obtain neither significant hits in the NCBI
non-redundant database nor GO terms. Scanning the Pfam model against NCBI showed best hits to
proteins in Acinetobacter sp., Pseudomonas syringae, Proteus mirabilis, and Yersinia
pestis.
A second bacterial domain, PF02113, was also identified within the A. cephalotes
genome. This domain belongs to the serine peptidases of MEROPS peptidase
family S13 (D-Ala-D-Ala carboxypeptidase C, clan SE), proteolytic enzymes that
use serine in their catalytic activity. The first half of the PF02113 model (383 residues
in total) is matched by one A. cephalotes protein (ACEP_00010620). When compared
across other insect genomes, it became apparent that weak hits can be found in
virtually every other insect species. However, none of the other insects shows as
significant a match (E < 0.001) to this domain as found in Atta. The Pfam model itself
is based on only four eukaryotic sequences: two from plant species, one from the slime mold
Dictyostelium discoideum, and one from the amoeba Paulinella chromatophora. In contrast, this
Pfam domain is detected in almost 700 bacterial species, confirming that this domain
has a strong bias towards bacteria and is not expected to be found in eukaryotic
genomes. The best hits for PF02113 in NCBI were found to match proteins in Shigella
boydii, Shigella flexneri, and Escherichia coli.
Finally, in an additional attempt to identify genes putatively transferred from fungi
to A. cephalotes, we conducted three other analyses. First, we aligned the A.
cephalotes genome with those of D. melanogaster, N. vitripennis, and A. mellifera, as
well as with draft genomes of the ants Linepithema humile and Pogonomyrmex
barbatus using Jim Kent’s chaining and netting tools [12]. We identified all
MAKER-predicted A. cephalotes genes that fell in gaps of the other genomes. Second, we
determined BLAST scores of all predicted A. cephalotes genes when searched against the
genome of A. cephalotes itself, against that of P. barbatus, and against that of L.
humile. We sorted the predicted genes in decreasing order according to the ratio
2 × Atta_score / (Pbar_score + Lhum_score). Hits with a significantly high Atta_score
relative to the other two scores, or with a match against A. cephalotes but not against
the other two ants, could indicate either a gene gain in A. cephalotes or a gene loss in
P. barbatus and L. humile. All genes identified through either search as potential
candidates of lateral gene transfer were then compared via BLASTX to the NCBI nr
database. Few candidates had best matches to fungal genes, and those that did had
only weak matches of a small portion of the total gene. Finally, we searched
exhaustively for MAKER-predicted genes that had a high BLAST score under
the fungal taxonomic unit (NCBI taxid: 4751) relative to the score under the insect
taxonomic unit (NCBI taxid: 6960). Again, a closer look at the most promising candidates
revealed conservation, at least of the intronic regions, across insect taxa.
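The score ratio used in the second analysis above can be sketched as follows. The sketch reads the ratio as twice the Atta score divided by the sum of the other two scores, which is an interpretation of the formula in the text. Treating a missing hit as a score of 0, and ranking genes with no hit in either other ant first, are assumptions made for illustration; the function names and gene IDs are hypothetical.

```python
import math


def atta_specificity_ratio(atta_score, pbar_score, lhum_score):
    """Compute 2 * Atta_score / (Pbar_score + Lhum_score).

    Returns infinity when the gene matches A. cephalotes but neither of the
    other two ants (assumed to be encoded as scores of 0).
    """
    denominator = pbar_score + lhum_score
    if denominator == 0:
        return math.inf if atta_score > 0 else 0.0
    return 2 * atta_score / denominator


def rank_genes(scores):
    """Sort gene IDs in decreasing order of the ratio.

    `scores` maps gene ID -> (Atta_score, Pbar_score, Lhum_score).
    """
    return sorted(scores, key=lambda g: atta_specificity_ratio(*scores[g]), reverse=True)


# Hypothetical BLAST scores, for illustration only.
scores = {
    "gene_a": (500, 480, 520),  # conserved across all three ants
    "gene_b": (300, 0, 0),      # match in A. cephalotes only
    "gene_c": (400, 100, 100),  # much stronger match in A. cephalotes
}
ranking = rank_genes(scores)
```

A ratio near 1 means the gene scores about as well against the other two ants as against A. cephalotes itself, so only genes far above 1 are worth following up as possible gains in A. cephalotes or losses in the other two species.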
References
1. Moran NA, Jarvik T (2010) Lateral transfer of genes from fungi underlies carotenoid
production in aphids. Science 328: 624.
2. Nene V, Wortman JR, Lawson D, Haas B, Kodira C, et al. (2007) Genome sequence
of Aedes aegypti, a major arbovirus vector. Science 316: 1718-1723.
3. Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, et al. (2002) The
genome sequence of the malaria mosquito Anopheles gambiae. Science 298:
129.
4. Honey Bee Genome Sequencing Consortium (2006) Insights into social insects from
the genome of the honeybee Apis mellifera. Nature 443: 931-949.
5. Xia Q, Zhou Z, Lu C, Cheng D, Dai F, et al. (2004) A Draft Sequence for the Genome
of the Domesticated Silkworm (Bombyx mori). Science 306: 1937.
6. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, et al. (2000) The genome
sequence of Drosophila melanogaster. Science 287: 2185-2195.
7. Drosophila 12 Genomes Consortium (2007) Evolution of genes and genomes on the
Drosophila phylogeny. Nature 450: 203.
8. Werren JH, Richards S, Desjardins CA, Niehuis O, Gadau J, et al. (2010) Functional
and evolutionary insights from the genomes of three parasitoid Nasonia species.
Science 327: 343-348.
9. Tribolium Genome Sequencing Consortium (2008) The genome of the model beetle
and pest Tribolium castaneum. Nature 452: 949.
10. Finn RD, Mistry J, Tate J, Coggill P, Heger A, et al. (2010) The Pfam protein families
database. Nucleic Acids Research 38: D211.
11. Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, et al. (2005) Blast2GO: a
universal tool for annotation, visualization and analysis in functional genomics
research. Bioinformatics 21: 3674.
12. Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D (2003) Evolution's cauldron:
Duplication, deletion, and rearrangement in the mouse and human genomes.
Proceedings of the National Academy of Sciences of the United States of
America 100: 11484.
Table 1. Atta cephalotes genes with Pfam domains consistent with a microbial origin. Identification
based on comparison to NCBI’s non-redundant database.

Pfam      Protein            P-value   Description
Fungi
PF06011   ACEP_00005250-RA   0.00048   Transient receptor potential (TRP) ion channel
PF11710   ACEP_00010692-RA   0.00074   G protein-coupled glucose receptor regulating Gpa2
Viruses
PF05393   ACEP_00014403-RA   0.00084   Human adenovirus early E3A glycoprotein
PF05092   ACEP_00013569-RA   0.00016   Protein of unknown function (DUF686)
PF01577   ACEP_00010395-RA   0.00032   Potyvirus P1 protease
PF05381   ACEP_00002128-RA   0.00044   Tymovirus endopeptidase
Bacteria
PF10150   ACEP_00004405-RA   0.00098   Ribonuclease E/G family
PF02113   ACEP_00010620-RA   0.00056   D-Ala-D-Ala carboxypeptidase 3 (S13) family
PF10138   ACEP_00011691-RA   0.00093   Tellurium resistance protein
Archaea
PF04919   ACEP_00002056-RA   0.00036   Protein of unknown function, DUF655
PF01877   ACEP_00009872-RA   0.00054   Protein of unknown function DUF54