Dataset (Supplemental)

The supplemental dataset contains information for the source of each sequence, the major lineage and subgroup designation based on our phylogenetic analysis, the G-protein binding partner, cell type, tissue expression, and counter-ion location, as well as the associated references for each piece of information. All of the sequences acquired from genome data represent manually curated sequences, which can be found on the UCSC genome browser (http://genomewiki.ucsc.edu/index.php/Opsin_evolution:_update_blog) or in the supplemental alignment. For records without an associated GenBank accession number (i.e. 123 sequences), the source has been listed as ‘Genome’ or ‘EST library’. Because genome assemblies are unstable and change as improvements are made, the accessions used at the time of making the gene models are transient. Therefore, these labels indicate that the gene model was obtained by a BLAST of close homologs to GenBank genomic sequences or ESTs rather than to trace reads or transcripts or extracted from published literature. To find the source for these gene models, BLAST the sequence supplied in our supplemental alignment against the raw genomic and EST data at GenBank for the species in question to find the current assembly, scaffold, contig, trace or transcript number. To validate the model for consistency, compare visually to known orthologs in our compilation of manually curated sequences (http://genomewiki.ucsc.edu/index.php/Opsin_evolution:_update_blog) for exon phase and length matching, perform a simple multiple alignment for consistency with orthologs using Multalin (http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_multalin.html), and examine the gene model for conservation at known conserved sequence motifs. For more details on the methods used, see below.

METHODS (Supplemental)

Sequence Acquisition

Opsin sequences were acquired from GenBank using one of two methods. To include sequences where expression data are available, opsin transcript sequences were mined from GenBank for major taxonomic and known opsin groups using queries of either the nucleotide or the ‘transcriptome shotgun assembly’ databases. To generate the dataset of genomically derived opsins, we used conventional transcript-derived entries from GenBank as tBlastn queries against the wgs division of GenBank (where all genome assemblies are stored). Because opsins have a minimum of ~20% protein identity to bovine rhodopsin, as determined by the degree of conservation across all GPCRs, and because other opsins in the collection will outperform bovine rhodopsin as a query in specific cases, this method has more than enough sensitivity to detect even short homologous exons in the bovine rhodopsin K296 region. Thus the complete opsin repertoire, including new homology classes, can be recovered from each species provided it is present in the assembly. Sequences recovered with this methodology often need curation in regions of weak alignment against the original data for a particular species and in cases of misassembly stutter. Missing exons were often locatable at the NCBI trace archive in the form of isolated reads that were omitted from the assembly, taking into account the risk of gathering inappropriate exons from unsuspected gene duplications. Recovered opsin sequences were quality checked for the presence of a lysine in the 7th transmembrane helix (bovine rhodopsin site 296), conservation of invariant opsin residues and motifs, and a better back-blast to opsins than to any annotated non-opsin GPCR at the GenBank nucleotide division.
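The lysine quality check described above can be illustrated with a minimal Python sketch. It assumes the candidate sequences have already been aligned to a reference, with gaps written as `-`. The function names and the toy sequences are hypothetical and are not part of the original pipeline. In real use the reference would be full-length bovine rhodopsin and the position would be 296.

```python
def alignment_column(ref_aligned: str, ref_position: int) -> int:
    """Map a 1-based residue position in the ungapped reference
    to a 0-based column of the alignment."""
    seen = 0
    for col, ch in enumerate(ref_aligned):
        if ch != "-":
            seen += 1
            if seen == ref_position:
                return col
    raise ValueError("reference has fewer residues than ref_position")


def lysine_check(aligned: dict, ref_name: str, ref_position: int) -> dict:
    """For every non-reference sequence, report whether it carries a
    lysine (K) in the column of the chosen reference position."""
    col = alignment_column(aligned[ref_name], ref_position)
    return {
        name: seq[col].upper() == "K"
        for name, seq in aligned.items()
        if name != ref_name
    }


# Toy alignment with made-up sequences (assumption, for illustration only).
# Reference residue 4 is the final K, which sits in alignment column 4.
toy = {
    "reference": "MK-AK",
    "candidate_1": "MKGAK",
    "candidate_2": "M--AR",
}
result = lysine_check(toy, "reference", 4)
```

Mapping the reference position to an alignment column first keeps the check correct when gaps precede the site of interest. The other criteria listed above (invariant motifs, back-blast) are not covered by this sketch.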
With the exception of species too closely related to one already represented in the dataset, all metazoan genomes at GenBank as of 1 Dec 2010 are represented by the dataset used in this study. No species diverging earlier than the ctenophore (notably neither the sponge Amphimedon queenslandica nor the choanoflagellate Monosiga brevicollis) contains a GPCR with a lysine in a position alignable with the K296 motif of bovine rhodopsin. A curated set of recovered opsins from genome trace files, updated monthly, is available at the UCSC genome browser (http://genomewiki.ucsc.edu/index.php/Opsin_evolution:_update_blog).

For each opsin extracted from a genome project, the location and phase of each intron within the coding region were determined by alignment to genomic contigs. All opsins to date utilize standard GT-AG splice donors and acceptors. We parsimoniously resolved each significant sequence change (i.e. intron changes and indels) as a gain or loss by determining its ancestral status via multiple outgroups. In bilaterians, notably the tunicate Ciona and insects including Drosophila, very rapid turnover of introns (both gain and loss) occurs; however, this turnover does not interfere with opsin classification because enough species with conservatively evolving introns are available to reconstruct the evolutionary history.

Phylogenetic Analyses

For phylogenetic analyses, sequences that spanned less than half of the transmembrane regions of the protein were discarded, resulting in 889 transcript plus genome trace opsin sequences.
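As an illustration of the length filter just described, the following Python sketch scores each aligned sequence by the fraction of transmembrane columns that contain a residue rather than a gap, and keeps sequences at or above one half. This reading of "spanned less than half of the transmembrane regions" is an assumption, as are the function names, the column intervals, and the toy sequences. The text does not state how coverage was actually measured.

```python
def tm_coverage(aligned_seq: str, tm_regions: list) -> float:
    """Fraction of transmembrane alignment columns occupied by a residue.
    tm_regions holds half-open (start, end) column intervals."""
    total = sum(end - start for start, end in tm_regions)
    covered = sum(
        1
        for start, end in tm_regions
        for ch in aligned_seq[start:end]
        if ch != "-"
    )
    return covered / total


def filter_by_tm_coverage(aligned: dict, tm_regions: list,
                          min_fraction: float = 0.5) -> dict:
    """Keep sequences covering at least min_fraction of the TM columns."""
    return {
        name: seq
        for name, seq in aligned.items()
        if tm_coverage(seq, tm_regions) >= min_fraction
    }


# Toy data (assumption): two 4-column "helices" in a 10-column alignment.
toy_regions = [(0, 4), (6, 10)]
toy_alignment = {
    "full": "ACDEFGHIKL",      # covers 8 of 8 TM columns
    "half": "ACDE------",      # covers 4 of 8 TM columns
    "fragment": "AC--------",  # covers 2 of 8 TM columns
}
kept = filter_by_tm_coverage(toy_alignment, toy_regions)
```

A helix-by-helix criterion (for example, requiring at least four of the seven helices to be present) would be an equally plausible reading and would need only a different scoring function.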
In order to root our opsin phylogenetic analyses, 22 non-opsin GPCRs from the human genome were used as outgroups: somatostatin receptor, opioid receptor mu 3, galanin receptor, chemokine (C-C motif) receptor, bradykinin receptor 1, uracil/cys-leukotriene dual receptor, cys-leukotriene receptor 1, purinergic receptor, orexin receptor, tachykinin receptor, neuromedin U receptor, pyroglutamylated RFamide peptide receptor, human orphan receptor 19, pancreatic polypeptide receptor, neuropeptide Y receptor, prolactin releasing hormone receptor, human orphan receptor 161, alpha-1D-adrenergic receptor, thyrotropin-releasing hormone receptor, thyrotropin receptor, adenosine A3 receptor, and opiate receptor-like 1. This set of sequences was selected as outgroups based on previous phylogenetic studies of opsin and GPCR evolution (Davies et al. 2010; Fredriksson et al. 2003; Plachetzki et al. 2010; Porter et al. 2007; Suga et al. 2008) as well as on a rigorous procedure of BLASTing human opsins against all other human GPCRs (for a detailed description of the BLAST procedure see: http://genomewiki.ucsc.edu/index.php/Opsin_evolution#GPCR_outgroup_sequences). Furthermore, rerunning the phylogeny reconstruction without outgroup sequences does not significantly change the sequences within each of the major clades nor the relationships among them, with the exception of three sequences (Platynereis dumerilii TMT1 and TMT2; Strongylocentrotus purpuratus encephalopsin) placed at the base of the ‘C-type’ clade when rooted, which are placed at the base of the ‘Cnidops’ clade when unrooted (data not shown).

Amino acid sequences of the 889 opsins mined from GenBank and the 22 human outgroup GPCR sequences were aligned using the online MAFFT v6.0 server (http://mafft.cbrc.jp/alignment/server/) (Katoh et al. 2005a; Katoh et al. 2005b; Katoh et al. 2002).
The aligned dataset was then trimmed to remove the N- and C-terminal sequences, leaving only the transmembrane and loop regions of the protein for further phylogenetic analyses. The resulting alignment has been provided as a supplemental FASTA data file.

The aligned and trimmed dataset was used to reconstruct a maximum likelihood phylogeny using Randomized Axelerated Maximum Likelihood (RAxML) v.7.2.7 with rapid bootstrapping as implemented on the Cyberinfrastructure for Phylogenetic Research (CIPRES) Portal v.3.1 (Miller et al. 2010; Stamatakis 2006; Stamatakis et al. 2008; Stamatakis et al. 2005). Using the resulting phylogeny, character mapping of amino acid residues at particular counterion sites was accomplished in Mesquite v2.72 (Maddison & Maddison 2010) using unordered parsimony reconstruction.

References

Davies, W. L., Hankins, M. W. & Foster, R. G. 2010. Vertebrate ancient opsin and melanopsin: divergent irradiance detectors. Photochemical & Photobiological Sciences 9, 1444-1457.
Fredriksson, R., Lagerstrom, M. C., Lundin, L. G. & Schioth, H. B. 2003. The G-protein-coupled receptors in the human genome form five main families. Phylogenetic analysis, paralogon groups, and fingerprints. Molecular Pharmacology 63, 1256-1272.
Katoh, K., Kuma, K., Miyata, T. & Toh, H. 2005a. Improvement in the accuracy of multiple sequence alignment program MAFFT. Genome Inform 16, 22-33.
Katoh, K., Kuma, K., Toh, H. & Miyata, T. 2005b. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33, 511-8.
Katoh, K., Misawa, K., Kuma, K. & Miyata, T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30, 3059-66.
Maddison, W. P. & Maddison, D. R. 2010. Mesquite: a modular system for evolutionary analysis. Version 2.72.
Miller, M. A., Holder, M. T., Vos, R., Midford, P.
E., Liebowitz, T., Chan, L., Hoover, P. & Warnow, T. The CIPRES Portals. In CIPRES, vol. 2010.
Plachetzki, D. C., Fong, C. R. & Oakley, T. H. 2010. The evolution of phototransduction from an ancestral cyclic nucleotide gated pathway. Proceedings of the Royal Society B 277, 1963-1969.
Porter, M. L., Cronin, T. W., McClellan, D. A. & Crandall, K. A. 2007. Molecular characterization of crustacean visual pigments and the evolution of pancrustacean opsins. Molecular Biology and Evolution 24, 253-268.
Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688-90.
Stamatakis, A., Hoover, P. & Rougemont, J. 2008. A rapid bootstrap algorithm for the RAxML Web servers. Syst Biol 57, 758-71.
Stamatakis, A., Ludwig, T. & Meier, H. 2005. RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21, 456-63.
Suga, H., Schmid, V. & Gehring, W. J. 2008. Evolution and functional diversity of jellyfish opsins. Current Biology 18, 51-55.