Features on Nucleic Acid Sequences, Gene Features and Coding Sequences

Sequences

The most basic task you will need to support in GUS is loading a nucleic acid or amino acid sequence into the database. The simplest way to load a sequence is from a FASTA file, and there is a plugin to do this (loadSequenceFrom Fasta). In this case, all you need to determine is what kind of sequence it is (amino acid or nucleic acid) and which table to put it in (externally loaded sequences are generally loaded into DoTS.ExternalNASequence or DoTS.ExternalAASequence). If the FASTA file has a complicated defline (e.g. “>Cgd7_076|AE10117|heat shock protein”), you may have to specify a regular expression to say which field is the source_id, which is the product, and so on. In every case, though, all you are doing is placing a sequence in a table and specifying which information goes in which fields of that table.

Features

Annotated sequence files are more complicated because, in addition to loading a sequence, you must locate specific features on that sequence. Relating features to sequences via locations requires more than one table. Simple examples include a promoter, a repeat region, or a UTR on an NA sequence. In each case there is a sequence, a feature located on that sequence, and a span on that sequence where the feature is located. Each of these pieces (the sequence, the feature and the location) has sub-properties of its own.

In GUS, location, feature and sequence are all stored in separate tables. The sequence is stored in an NASequence table, usually ExternalNASequence. This table stores the sequence, its primary identifier or accession as source_id, a secondary identifier if one exists, its length, its nucleotide counts, etc. Comments, keywords and secondary accessions can also be linked to the sequence via join tables to the tables that store those kinds of information. The sequence_type is also stored in this table.
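Returning to the FASTA defline example from the Sequences section, the regular expression step can be sketched in Python. This is only an illustration, not the plugin's own code: the pattern, the parse_defline helper and the reading of the middle field as a secondary identifier are all assumptions made for this example.

```python
import re

# Hypothetical pattern for deflines shaped like ">source_id|secondary_id|product".
# Treating the middle field as a secondary identifier is an assumption.
DEFLINE = re.compile(
    r"^>(?P<source_id>[^|]+)\|(?P<secondary_identifier>[^|]+)\|(?P<product>.+)$"
)

def parse_defline(defline):
    """Split a pipe-delimited FASTA defline into named fields."""
    match = DEFLINE.match(defline.strip())
    if match is None:
        raise ValueError(f"defline does not match the expected layout: {defline!r}")
    return match.groupdict()

fields = parse_defline(">Cgd7_076|AE10117|heat shock protein")
print(fields["source_id"], "/", fields["product"])
```

Named groups keep the mapping from defline position to table field explicit, which is the decision the regular expression has to encode.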
The sequence_type field is deprecated, however, so you should also specify a sequence_ontology_id in the sequence table to link back to a sequence ontology entry in DoTS.SequenceOntology.

Features on a sequence, be they repeats, promoters, or genes, are stored in their own custom views of the NAFeature table. Each feature is stored as a record in a view designed to store that kind of feature. Its source_id, description, etc. are stored in fields within that view. The view also holds a foreign key pointing back to the sequence on which the feature is located.

The location of the feature on that sequence is stored in a third table, DoTS.NALocation. This table links, via na_feature_id, back to a specific entry in a feature view. start_min and start_max store the starting location. If the location is exact, both values will be the same. Otherwise, the two values define the range of the sequence within which the start falls. end_min and end_max are used in the same way to store the end location of the feature. The attribute is_reversed defines whether the feature is on the forward (5'-3') strand or on the complementary strand. How strandedness is captured in GUS is described in table 1. A similar set of tables exists to capture amino acid features on amino acid sequences; however, the AA location table lacks a strandedness attribute, since AA sequences are single-stranded.

Table 1. NA strandedness

GFF                            1          -1           0
Direction                      forward    backwards    unknown
dots.nalocation.is_reversed    0          1            null

Illustration 1: Data Model of a Nucleic Acid Feature in GUS

Sequences that contain genes and coding regions complicate this picture by adding hierarchically organized relationships between a set of features coordinated along the same span of a sequence. A gene will contain not only a gene feature, but also an RNA feature, a set of exons and, if it codes for a protein, a coding sequence.
To capture these relationships, GUS feature views can be organized hierarchically through parent_id relationships, where the parent_id of one feature points to the na_feature_id of another, more basic feature. An example of a record that builds such a hierarchically organized set of features is the GenBank record at the end of this paper. This record contains a number of features to locate on the sequence, which appears at the end of the file, as well as a significant amount of information about the sequence in the header.

In GUS, this sequence, its accession, length and nucleotide counts are stored in DoTS.ExternalNASequence. Keywords, comments, references and secondary accessions found in the header are stored in tables for those kinds of information and linked back to the ExternalNASequence entry. Once the ExternalNASequence record is created, we can create NAFeature records to attach to the sequence. The first of these features is the source feature, which contains various information about the source of the nucleotide sequence. This information is all stored in a DoTS.Source entry whose location spans the entire sequence. The next feature is a gene feature with an associated CDS record. These features could also include a divided location, which implies multiple exons. There is also an implied mRNA feature that would serve as the template for the translation in the CDS. GUS organizes these features in the hierarchy assumed by the BioPerl Unflattener.
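Before the hierarchy itself is listed, the parent_id linkage can be pictured with plain records. The Python sketch below is illustrative only: the dictionaries stand in for rows in NAFeature views, the id values and "kind" labels are invented, and ancestors() and children() are hypothetical helpers, not part of GUS.

```python
# Hypothetical in-memory stand-ins for rows in NAFeature views.
# Only na_feature_id and parent_id mirror names used in the text;
# the id values and the "kind" labels are invented for illustration.
features = [
    {"na_feature_id": 1, "parent_id": None, "kind": "gene"},
    {"na_feature_id": 2, "parent_id": 1, "kind": "RNA"},
    {"na_feature_id": 3, "parent_id": 2, "kind": "exon"},
    {"na_feature_id": 4, "parent_id": 2, "kind": "coding sequence"},
]

by_id = {row["na_feature_id"]: row for row in features}

def ancestors(na_feature_id):
    """Follow parent_id -> na_feature_id links from a feature up to its root."""
    chain = []
    parent_id = by_id[na_feature_id]["parent_id"]
    while parent_id is not None:
        chain.append(by_id[parent_id]["kind"])
        parent_id = by_id[parent_id]["parent_id"]
    return chain

def children(na_feature_id):
    """List the kinds of features whose parent_id points at this feature."""
    return [row["kind"] for row in features if row["parent_id"] == na_feature_id]

print(ancestors(3))  # the exon's parents, nearest first
print(children(2))   # everything attached to the RNA record
```

Because each child stores only its parent's na_feature_id, the whole tree can be rebuilt from the feature rows alone, without a separate relationship table.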
The CDS genes should have the following hierarchy:

gene (geneFeature)
    mRNA (RNAType)
        CDS (transcript)
        exon (ExonFeature)

RNA genes should have this hierarchy:

gene
    RNA
        exon

This default hierarchy is summarized in the BioPerl unflattener:

    mRNA     => 'gene',
    tRNA     => 'gene',
    rRNA     => 'gene',
    scRNA    => 'gene',
    snRNA    => 'gene',
    snoRNA   => 'gene',
    misc_RNA => 'gene',
    CDS      => 'mRNA',
    exon     => 'mRNA',
    intron   => 'mRNA',

Each feature table contains a number of attributes which must be populated. The most important of these, the source_id, should hold a unique gene name. In our case, locus_tag is what contains the source_ids for our genes. This mapping of features onto tables, and of feature values onto attributes, is contained in the genbank2sql.xml found in GUS/Supported/config. You may wish to change these mappings for your own projects. All other features, including repeat regions, UTRs, etc., should map an NAFeature onto an NASequence at a specific NALocation without hierarchical organization. A full mapping for the transcript, exon, gene and RNA features can be found at the end of this document in the ISF configuration XML.

All of this logic is captured in the GUS supported plugin InsertSequenceFeatures. This plugin begins by loading the ExternalNASequence and its associated references, keywords and comments. It then unflattens the feature hierarchy for gene features. These are then loaded according to the hierarchy described earlier, with child features pointing back to their parent features via the relationship parent_id -> na_feature_id. If exons are not explicitly defined, they are created from the gene's location description. Both the exons and the transcript feature (the GenBank CDS feature) are attached to an mRNA record, which points back to the gene record. The translation attribute of a CDS record itself requires building a full set of translation entries.
This includes a TranslatedAAFeature, which links the transcript feature to a TranslatedAASequence entry where the translation sequence is stored. The product description and the source_id should be passed along and entered in the TranslatedAAFeature and TranslatedAASequence tables. These TranslatedAASequence records can now themselves serve as base sequences for protein features defined by various applications, including signal peptides and transmembrane domains. Mirroring NAFeatures, AAFeatures have an AALocation on an AASequence. The entire model is summarized in figure 2.

LOCUS       AAEL01000585            3539 bp    DNA     linear   INV 27-OCT-2004
DEFINITION  Cryptosporidium hominis strain TU502 chromosome 5 CHRO015104, whole
            genome shotgun sequence.
ACCESSION   AAEL01000585 AAEL01000000
VERSION     AAEL01000585.1  GI:54655937
KEYWORDS    WGS.
SOURCE      Cryptosporidium hominis
  ORGANISM  Cryptosporidium hominis
            Eukaryota; Alveolata; Apicomplexa; Coccidia; Eimeriida;
            Cryptosporidiidae; Cryptosporidium.
REFERENCE   1  (bases 1 to 3539)
  AUTHORS   Xu,P., Widmer,G., Wang,Y., Ozaki,L.S., Alves,J.M., Serrano,M.G.,
            Puiu,D., Manque,P., Akiyoshi,D., Mackey,A.J., Pearson,W.R.,
            Dear,P.H., Bankier,A.T., Peterson,D.L., Abrahamsen,M.S., Kapur,V.,
            Tzipori,S. and Buck,G.A.
  TITLE     The genome of Cryptosporidium hominis
  JOURNAL   Nature 431, 1107-1112 (2004)
REFERENCE   2  (bases 1 to 3539)
  AUTHORS   Xu,P., Widmer,G., Wang,Y., Ozaki,L.S., Alves,J.M., Serrano,M.G.,
            Puiu,D., Manque,P., Akiyoshi,D., Mackey,A.J., Pearson,W.R.,
            Dear,P.H., Bankier,A.T., Peterson,D.L., Abrahamsen,M.S., Kapur,V.,
            Tzipori,S. and Buck,G.A.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-JUN-2004) Center for the Study of Biological
            Complexity, Virginia Commonwealth University, Trani Center for
            Life Sciences, 1000 W Cary St, Richmond, VA 23298, USA
FEATURES             Location/Qualifiers
     source          1..3539
                     /organism="Cryptosporidium hominis"
                     /mol_type="genomic DNA"
                     /strain="TU502"
                     /db_xref="taxon:237895"
                     /chromosome="5"
     gene            514..1434
                     /locus_tag="Chro.50341"
     CDS             514..1434
                     /locus_tag="Chro.50341"
                     /codon_start=1
                     /product="cancer-associated gene protein like (41.3 kD) (4A872)"
                     /protein_id="EAL35084.1"
                     /db_xref="GI:54655939"
                     /translation="MKYRMFKRGNRVGVCISGGKDSSVLLNVLYELNKRKDYGIELEL
                     IAVDEGIKGYRDDSLEVVKYQQEYYNCPLTILSFKDMFNTTMDEIQSKSSKSNSCTYC
                     GVFRRKALDIGSYKVNADVICTGHSCDDTCETLLLNILRGDFNRLRRCINPITNNEIT
                     KTKDQMQNHDSQNEAFLNIKPRVKPLMYCYEKEIVLYAHYLNLKYFSTECTYSVDAYR
                     GVSREFIRKIQSFDYKYSFNMILAAQELNLEQSNSSPNYIARKCTICGYISSSTICNG
                     CNLVNALKHDNPNLILKNQRQKKKILLQES"
     gene            1723..1798
                     /gene="trnT"
                     /locus_tag="Chro.trn001"
     tRNA            1723..1798
                     /gene="trnT"
                     /locus_tag="Chro.trn001"
                     /product="tRNA-Thr"
     gene            complement(2271..2909)
                     /locus_tag="Chro.50340"
     CDS             complement(2271..2909)
                     /locus_tag="Chro.50340"
                     /codon_start=1
                     /product="vacuolar ATP synthase subunit D"
                     /protein_id="EAL35083.1"
                     /db_xref="GI:54655938"
                     /translation="MLKEIVETKRSIGNDIKEASFALAKATWAAGDFKDRIIESCKRP
                     TVTMEVGTENIAGVRLPIFEMNVDNNSSTETCHIGVASGGQVIQSTREIYMKVLRDLV
                     KLASLQTAFFSLDEEIKMTNRRVNALQNVVLPKLEDGMNYILRELDEIEREEFFRLKK
                     IQEKKKEWAEAELQEKLKKDRNNSKENDSSLYDTIKNSGDSILEQKNEGILF"
ORIGIN
        1 cgaaaaatcc atataatttg acaatttttt ccatttattt taattatttt ttttttttta
       61 aattatatat atatgtaaat atatatatat atattcggta ccgttattgg cgccaatgcc
      121 atgatcacct ttggcgccaa tagctgaatt aaaattaaat gtcaaataaa gattaaaata
181 241 301 361 421 481 541 601 661 721 781 841 901 961 1021 1081 1141 1201 1261 1321 1381 1441 1501 1561 1621 1681 1741 1801 1861 1921 1981 2041 2101 2161 2221 2281 2341 2401 2461 2521 2581 2641 2701 2761 2821 2881 2941 3001 3061 3121 3181 3241 3301 3361 3421 3481 // aagttaaaaa agcaatattg acagggtacc tatataatgg
ttgaataggt attctaactt aatagagttg tatgaactaa ggaattaagg aattgtccct caaagtaaat ttagatattg gatacatgtg tgtattaacc gattctcaaa tatgaaaaag tgtacttatt agctttgatt cagtctaatt tcatctacta ttaattctta tcaattataa tataaagcgt attacatatt tttttttatt caatatttta tggttattgc tttttacaga gcgtgtacaa tttttttaaa atatgcttta aaaagtactt aactattcaa tataactgta gaaatatatt attccttcat gaagaatcgt tctgcccatt atttcatcca ttctgtaatg gcagtctgta gttgattgaa ttattatcaa acctccattg gctgcccatg cgttttgttt tttcttttca aaagccctat acttacattc tttctccttt tcttatttat tttatatgta gaaaatatct tcttttttat attcgatttt tttgaccgcc aaaaaaaaat tgagttgtta taagttgcag aaaataggta ttttaatatg gttttaatta gagtatgtat ataaaagaaa gatataggga taactattct caagtaaatc gaagctataa aaactttatt caataacaaa atgaagcttt aaattgtatt ctgtggatgc acaaatactc ctagtcccaa tttgtaatgg aaaatcaaag gtctgcccgc tgcttacatg tctatcaaat tattaattta atatgaccca gtcggtcttg tttctcaaat gtaaacgtgt gtattaataa aattcctagt acttgtgccg taatatgaat tccatttttt ttatagagaa ttttttgctc tttctttaga ccttcttctt attctctcaa catttacacg aactggctaa ttacttgacc cattcatctc ttactgttgg ttgcctttgc caacaatttc ataaatcata aaattaacaa gacttggtgt aaattgcttt atttaaatat tatatattta tttatacttc ttttattatt ttttccgata atttcactat agttggaatt ttactgtaag gagatgtttt agtaaataat actgcggttc tcttatttta ttcaggagga agattacggt tgactctcta atcctttaaa aaattcgtgc agttaatgct attaaatata taatgagatc cttaaatata atatgcgcat atatagagga tttcaatatg ctatatagct atgcaacctt acagaaaaag tttaacacgt atatatttca tcaatctcac caattctcct atttttacca taaaccgaag tgtttaatag aatttattat atattattaa ttttttttaa ctcaaaaatt ttccttttca cctattaata attaatcgaa caaaattgaa gttatttcta ctcttgaatc aatgtaattc acgatttgtc cttaactaaa tccacttgct aaaaattggt tcttttacat caatgcaaat ttttaacatt tccttgtttt ttaattttca tggtttctga tgttttacta ataaccactt tatatataca ctattattat attatttgtt atttatatta ataggtatcg tatatgaata gaagggaagg attgataact gattgaatgg atatttatta ttaatgaaat aaggattcaa atagaattgg gaagttgtca gatatgttta acatattgtg gatgttattt ctacgtggcg actaaaacta aagcctagag tatcttaacc gtttcacgtg 
atacttgcag agaaaatgta gttaacgctc aaaattttat aatatatttc actcaatatt aaataataaa cactttttcc ctcgttagcg gtcacgagtt agcccccctt ttatgtattt aaatatttat tttactatca ttattcttga tttcacgaac ctctattatt aaaaaaataa tcaccagagt tccttcttta ttcttcaatc ataccatcct atcttaatct tcccttaata acgccaatat aatctaacac gattctatga gatgcttctt cctctaaatt gctcctttgc atattaaaaa tccatttttc ataattaaac tattataaat tatattcaat cctttgataa tgcctccttt aaaaataatt agatctttat tattgagcaa ttgtaatgaa ttgaagagga aactttccat ccttattatt acagaatgtt gtgttttatt aattaattgc aatatcagca atacaacaat gtgtatttag gtacaggtca attttaatag aagatcaaat ttaagccttt tcaaatattt aattcatacg ctcaagaact ctatttgcgg ttaaacatga tacaagaaag ttgataataa tgcaatatta tcatgcttta tcactttttc tcgccttcgt cgaatctcgt tcccgctaaa tttaatatta ttatatgatt tttaactaag aaattaaact ttattactta atgaatatca cgtgaatcat tcttgattgt gtttctcttg taaagaattc ccaattttgg cttcatctaa ctttcatgta ggcatgtttc ctgcaatatt ttctatcttt taatatcatt tgtttgataa tcttcaactt ttattctata tttcttcact tacctatcta actatcctat tactgatata tatttttatt ttttttatga tcggcgggaa aagtaatgaa tcagataatg aaggtgcaaa ggtacatgaa taagaaaaat aatatttaaa taaaagagga aaatgtatta ggtggatgaa ggaatactac ggacgaaatt aagaaaagca tagttgcgat attacgaaga gcaaaatcat aatgtattgt ctcaacagaa taagattcaa taatcttgaa atacatttca taatccaaac ttaatataaa attattttat tcaaatatgt ttattatttt acacatctca agcatagtct cgaaggctca attcaatagc aattttaatt aaactcaact tgaatatttc aaacacgtaa aaaaactatc ataattattt ttaaaataaa atcgtataaa aagttcagct ttctctctca taaaactaca agaaaagaat aatttctctt tgtagatgaa ctcagtgcca gaaatcacca acctattgaa agcatcactc tattgcttgt taaattgtaa aaaacactat tttatcttcc attgttgtta caattaaaga tatttttctc gaaaaaaaaa ctcgtgggga aataattaa
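
The location strings in the feature table of the record above map onto NALocation fields in a regular way. As a minimal sketch (not the InsertSequenceFeatures implementation), the hypothetical Python helper below parses the simple forms seen in this record, start..end and complement(start..end), plus join(...) for divided locations, into dictionaries keyed by the NALocation attribute names used in this document. It assumes exact coordinates, so start_min equals start_max and end_min equals end_max; fuzzy or single-base locations are rejected rather than guessed at.

```python
import re

# One exact span such as "514..1434".
SPAN = re.compile(r"(\d+)\.\.(\d+)")

def parse_location(location):
    """Return one NALocation-like dict per span in a GenBank location string.

    Hypothetical helper: handles "a..b", "complement(a..b)", "join(a..b,c..d)"
    and "complement(join(...))" only.
    """
    text = location.strip()
    is_reversed = 0  # 0 = forward, 1 = backwards, as in table 1
    if text.startswith("complement(") and text.endswith(")"):
        is_reversed = 1
        text = text[len("complement("):-1]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    spans = []
    for part in text.split(","):
        match = SPAN.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"unsupported location: {location!r}")
        start, end = int(match.group(1)), int(match.group(2))
        spans.append({
            "start_min": start, "start_max": start,  # exact start
            "end_min": end, "end_max": end,          # exact end
            "is_reversed": is_reversed,
        })
    return spans

for loc in ("514..1434", "1723..1798", "complement(2271..2909)"):
    print(loc, parse_location(loc))
```

A join(...) location yields one dictionary per span, which corresponds to the case described earlier where a divided location implies multiple exons.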