TEXT S17: ALIEN GENE FRAGMENTS

Lothar Wissler1, Fabian Zimmer1, Olgert Denas2, James Taylor2,3, Nicole M. Gerardo3, and Erich Bornberg-Bauer1

1Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
2Department of Mathematics and Computer Science, Emory University, Atlanta GA, United States of America
3Department of Biology, Emory University, Atlanta GA, United States of America

Recent work in the pea aphid has demonstrated the horizontal gene transfer (HGT) from fungi of an alien gene involved in carotenoid production [1]. Given the tight obligate association that Atta cephalotes has with its fungus and other microbes in the fungus garden, we investigated whether its genome contains alien genes that it may have acquired from its fungal cultivar. To do this, we screened the publicly available insect genome sequences to find domains that occur in A. cephalotes but in no other insect species. The insect dataset included Aedes aegypti (L1.49) [2], Anopheles gambiae (P3.49) [3], Apis mellifera (OGS r.2) [4], Bombyx mori (r1.0) [5], Culex pipiens (r1.2), Drosophila melanogaster (r5.11) [6], D. ananassae (r1.3) [7], D. erecta (r1.3) [7], D. grimshawi (r1.3) [7], D. mojavensis (r1.3), D. pseudoobscura (r2.3) [7], D. persimilis (r1.3) [7], D. sechellia (r1.3) [7], D. simulans (r1.3) [7], D. willistoni (r1.3) [7], D. virilis (r1.2) [7], D. yakuba (r1.3) [7], Nasonia vitripennis (OGS r.1) [8], and Tribolium castaneum (51906) [9].

Using this A. cephalotes-specific set, we then derived an estimate for the species distribution of each Pfam domain [10] to find candidates for HGT from fungi, viruses, bacteria, and archaea. These candidate domains were obtained by searching the full NCBI non-redundant protein dataset (NCBI-nr; 11,238,375 entries) against Pfam-A. For each protein sequence in NCBI-nr, we reconstructed the lineage of the host species using NCBI taxonomy data.
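
The lineage reconstruction step can be illustrated with a minimal sketch. It assumes the taxonomy is available as lines in the pipe-delimited layout of the NCBI taxonomy dump file nodes.dmp (taxid, then parent taxid). The function names, the group labels, and the heavily abbreviated toy taxonomy (intermediate ranks omitted) are illustrative assumptions, not the code used in this analysis.

```python
def parse_nodes(lines):
    """Build a child -> parent taxid map from nodes.dmp-style lines."""
    parent = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 2 and fields[0]:
            parent[int(fields[0])] = int(fields[1])
    return parent


def lineage(taxid, parent):
    """Return the taxids from `taxid` up to the root (root last)."""
    path = [taxid]
    seen = {taxid}
    # The root is its own parent; unknown taxids also end the walk.
    while parent.get(path[-1], path[-1]) != path[-1]:
        nxt = parent[path[-1]]
        if nxt in seen:
            raise ValueError(f"cycle in taxonomy at taxid {nxt}")
        path.append(nxt)
        seen.add(nxt)
    return path


# Top-level groups, most specific first so that Fungi wins over Eukaryota.
# Labels are hypothetical; taxids are the NCBI identifiers of these nodes.
GROUPS = [
    (4751, "Fungi"),
    (2, "Bacteria"),
    (2157, "Archaea"),
    (10239, "Viruses"),
    (2759, "Eukaryotes/others"),
]


def classify(path):
    """Assign a lineage to the first matching top-level group."""
    members = set(path)
    for taxid, label in GROUPS:
        if taxid in members:
            return label
    return "unclassified"


# Toy, abbreviated taxonomy (assumption: intermediate ranks are left out).
TOY_NODES = [
    "1\t|\t1\t|\tno rank\t|",
    "131567\t|\t1\t|\tno rank\t|",
    "2\t|\t131567\t|\tsuperkingdom\t|",
    "2759\t|\t131567\t|\tsuperkingdom\t|",
    "4751\t|\t2759\t|\tkingdom\t|",
    "4932\t|\t4751\t|\tspecies\t|",   # Saccharomyces cerevisiae
    "562\t|\t2\t|\tspecies\t|",       # Escherichia coli
]

parent = parse_nodes(TOY_NODES)
yeast_path = lineage(4932, parent)
print(yeast_path, classify(yeast_path))
```

Checking the more specific node (Fungi) before its ancestor (Eukaryota) is one simple way to keep fungal proteins separate from all other eukaryotes.
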
These lineages allowed us to compute the frequency distribution of each domain across the nodes of interest: Bacteria, Archaea, Viruses, Eukaryotes/Fungi, and Eukaryotes/others. Finally, domains that are almost exclusive to one of these nodes (frequency > 95%) were compared against the Atta-unique domain set obtained in the first part of this analysis. From this analysis, we identified 11 candidate proteins predicted to originate from viral, bacterial, archaeal, and fungal sources (Table 1).

Because A. cephalotes has an intimate association with both fungus and bacteria, we investigated in greater detail the domains identified as originating from these two sources. The predicted fungal domain PF06011 belongs to a family of transient receptor potential (TRP) ion channels that is essential for cellular viability and involved in cell growth and cell wall synthesis. However, we could not obtain any Gene Ontology (GO) terms for the matching full-length A. cephalotes protein sequence using BLAST2GO [11], and matching the protein sequence against NCBI’s non-redundant protein database did not yield any significant hits. A best-match analysis of the TRP domain against NCBI revealed proteins in Neosartorya fischeri, Yarrowia lipolytica, and Aspergillus species. All of these fungal proteins are longer than 700 residues and fully match the domain model, which is 575 residues long. In contrast, the A. cephalotes protein sequence is short (131 residues), has no methionine at the first position, and matches only about 100 residues in the middle of the domain model, suggesting that it is incomplete at both the beginning and the end.

A second fungal domain, PF11710, was also detected in the A. cephalotes proteome. The corresponding protein, known as Git3, is one of six proteins required for glucose-triggered adenylate cyclase activation.
It is a G protein-coupled receptor responsible for the activation of adenylate cyclase through Gpa2, a heterotrimeric G protein alpha subunit that is part of the glucose-detection pathway. Git3 contains seven predicted transmembrane domains, a third cytoplasmic loop, and a cytoplasmic tail. The PF11710 model matches the conserved N-terminal part of the protein. None of the BLAST hits of the A. cephalotes protein sequence (ACEP_00010692) in NCBI is significant, and no GO terms could be obtained. Interestingly, in two other A. cephalotes proteins (ACEP_00011098, ACEP_00016598), we find very weak matches to the Git3 C-terminal domain (Git3_C, or PF11970). Although these signals are below the significance threshold, they could represent a divergent Git3 protein sequence that has been split within the A. cephalotes genome, such that the N-terminal part (Git3) and the C-terminal part (Git3_C) are now located in different A. cephalotes proteins. The best matches to PF11710 in NCBI are to proteins in Saccharomyces cerevisiae, Kluyveromyces lactis, and Debaryomyces hansenii.

The bacterial protein domain PF10138 was also predicted within the A. cephalotes proteome; members of this family confer resistance to the metalloid element tellurium and its salts. The PF10138 model is relatively short (99 residues), and its second half is matched by one A. cephalotes protein (ACEP_00011691), which again seems to be incomplete: it lacks a methionine at the first position and is only 67 residues long. To properly resemble a member of this Pfam family, the gene model should be about 40 residues longer at the beginning. For the A. cephalotes protein sequence, we could obtain neither significant hits in the NCBI non-redundant database nor GO terms. Scanning the Pfam model against NCBI showed best hits to proteins in Acinetobacter sp., Pseudomonas syringae, Proteus mirabilis, and Yersinia pestis.

A second bacterial domain, PF02113, was also identified within the A. cephalotes genome.
This domain belongs to a serine peptidase of MEROPS peptidase family S13 (D-Ala-D-Ala carboxypeptidase C, clan SE), a proteolytic enzyme that uses serine in its catalytic activity. The first half of the PF02113 model (383 residues in total) is matched by one A. cephalotes protein (ACEP_00010620). A comparison across other insect genomes showed that weak hits can be found in virtually every other insect species. However, none of the other insects shows as significant a match (E < 0.001) to this domain as A. cephalotes does. The Pfam model itself draws on only four eukaryotic sequences: two from plant species, one from the slime mold Dictyostelium discoideum, and one from the amoeba Paulinella chromatophora. In contrast, this Pfam domain is detected in almost 700 bacterial species, confirming that it has a strong bias towards bacteria and is unlikely to occur in eukaryotic genomes. The best hits for PF02113 in NCBI were to proteins in Shigella boydii, Shigella flexneri, and Escherichia coli.

Finally, in an additional attempt to identify genes putatively transferred from fungi to A. cephalotes, we conducted three other analyses. First, we aligned the A. cephalotes genome with those of D. melanogaster, N. vitripennis, and A. mellifera, as well as with draft genomes of the ants Linepithema humile and Pogonomyrmex barbatus, using Jim Kent’s chaining and netting tools [12]. We identified all MAKER-predicted A. cephalotes genes that fell in gaps of the other genomes. Second, we determined the BLAST scores of all predicted A. cephalotes genes when searched against the genome of A. cephalotes itself, against that of P. barbatus, and against that of L. humile. We sorted the predicted genes in decreasing order of the ratio 2 × Atta_score / (Pbar_score + Lhum_score). Hits with a markedly high Atta_score relative to the other two scores, or with a match against A.
cephalotes but not against the other two ants, could indicate either a gene gain in A. cephalotes or a gene loss in P. barbatus and L. humile. All genes identified through either search as potential candidates for lateral gene transfer were then compared via BLASTX to the NCBI nr database. Few candidates had best matches to fungal genes, and those that did had only weak matches covering a small portion of the total gene. Third, we made an exhaustive search for MAKER-predicted genes that had a relatively high BLAST score under the fungal taxonomic unit (NCBI taxid: 4751) as opposed to the score under the insect taxonomic unit (NCBI taxid: 6960). Again, a closer look at the most promising candidates revealed conservation, at least of the intronic regions, across insect taxa.

References

1. Moran NA, Jarvik T (2010) Lateral transfer of genes from fungi underlies carotenoid production in aphids. Science 328: 624.
2. Nene V, Wortman JR, Lawson D, Haas B, Kodira C, et al. (2007) Genome sequence of Aedes aegypti, a major arbovirus vector. Science 316: 1718-1723.
3. Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, et al. (2002) The genome sequence of the malaria mosquito Anopheles gambiae. Science 298: 129.
4. Honey Bee Genome Sequencing Consortium (2006) Insights into social insects from the genome of the honeybee Apis mellifera. Nature 443: 931-949.
5. Xia Q, Zhou Z, Lu C, Cheng D, Dai F, et al. (2004) A draft sequence for the genome of the domesticated silkworm (Bombyx mori). Science 306: 1937.
6. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, et al. (2000) The genome sequence of Drosophila melanogaster. Science 287: 2185-2195.
7. Drosophila 12 Genomes Consortium (2007) Evolution of genes and genomes on the Drosophila phylogeny. Nature 450: 203.
8. Werren JH, Richards S, Desjardins CA, Niehuis O, Gadau J, et al. (2010) Functional and evolutionary insights from the genomes of three parasitoid Nasonia species. Science 327: 343-348.
9. Tribolium Genome Sequencing Consortium (2008) The genome of the model beetle and pest Tribolium castaneum. Nature 452: 949.
10. Finn RD, Mistry J, Tate J, Coggill P, Heger A, et al. (2010) The Pfam protein families database. Nucleic Acids Research 38: D211.
11. Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, et al. (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21: 3674.
12. Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D (2003) Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proceedings of the National Academy of Sciences of the United States of America 100: 11484.

Table 1. Atta cephalotes genes with Pfam domains consistent with a microbial origin. Identification based on comparison to NCBI’s non-redundant database.

Source    Pfam     Protein           P-value  Description
Fungi     PF06011  ACEP_00005250-RA  0.00048  Transient receptor potential (TRP) ion channel
Fungi     PF11710  ACEP_00010692-RA  0.00074  G protein-coupled glucose receptor regulating Gpa2
Viruses   PF05393  ACEP_00014403-RA  0.00084  Human adenovirus early E3A glycoprotein
Viruses   PF05092  ACEP_00013569-RA  0.00016  Protein of unknown function (DUF686)
Viruses   PF01577  ACEP_00010395-RA  0.00032  Potyvirus P1 protease
Viruses   PF05381  ACEP_00002128-RA  0.00044  Tymovirus endopeptidase
Bacteria  PF10150  ACEP_00004405-RA  0.00098  Ribonuclease E/G family
Bacteria  PF02113  ACEP_00010620-RA  0.00056  D-Ala-D-Ala carboxypeptidase 3 (S13) family
Bacteria  PF10138  ACEP_00011691-RA  0.00093  Tellurium resistance protein
Archaea   PF04919  ACEP_00002056-RA  0.00036  Protein of unknown function, DUF655
Archaea   PF01877  ACEP_00009872-RA  0.00054  Protein of unknown function DUF54
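
The selection behind Table 1 (domains with a frequency above 95% in a single taxonomic node, intersected with the Atta-unique domain set) can be sketched as follows. This is a minimal illustration under stated assumptions: frequencies are computed over individual protein hits, and the function names, the domain identifiers (DOM_A, DOM_B, DOM_C), and the counts are hypothetical toy values rather than data from this study.

```python
from collections import Counter, defaultdict


def exclusive_domains(hits, threshold=0.95):
    """Map each domain to its dominant node if that node's share exceeds threshold.

    `hits` is an iterable of (domain_id, node_label) pairs, one per protein
    in which the domain was found (an assumption about the counting unit).
    """
    counts = defaultdict(Counter)
    for domain, node in hits:
        counts[domain][node] += 1

    exclusive = {}
    for domain, per_node in counts.items():
        node, n = per_node.most_common(1)[0]
        if n / sum(per_node.values()) > threshold:
            exclusive[domain] = node
    return exclusive


def hgt_candidates(hits, atta_unique, threshold=0.95):
    """Keep node-exclusive domains that are also in the Atta-unique set."""
    exclusive = exclusive_domains(hits, threshold)
    return {d: node for d, node in exclusive.items() if d in atta_unique}


# Hypothetical toy input: DOM_A is 97% bacterial, DOM_B is split evenly,
# DOM_C is purely fungal but absent from the Atta-unique set.
toy_hits = (
    [("DOM_A", "Bacteria")] * 97
    + [("DOM_A", "Eukaryotes/others")] * 3
    + [("DOM_B", "Eukaryotes/Fungi")] * 50
    + [("DOM_B", "Bacteria")] * 50
    + [("DOM_C", "Eukaryotes/Fungi")] * 20
)
toy_atta_unique = {"DOM_A", "DOM_B"}

candidates = hgt_candidates(toy_hits, toy_atta_unique)
print(candidates)
```

In this toy input only DOM_A passes both filters, since it exceeds the 95% threshold for one node and belongs to the Atta-unique set.
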