Sequences, Features on Sequences, Gene Features and Coding

Features on Nucleic Acid Sequences, Gene Features and Coding Sequences
Sequences
One of the most basic tasks you will need to support in GUS is loading a nucleic acid or
amino acid sequence into the database. The simplest way to load a sequence is from
a FASTA file, and there is a plugin to do this (loadSequenceFrom Fasta). In this
case, all you need to determine is what kind of sequence it is (amino acid or nucleic
acid) and which table to put it in (externally loaded sequences generally go
into DoTS.ExternalNASequence or DoTS.ExternalAASequence). If the FASTA
file has a complicated defline (e.g. ">Cgd7_076|AE10117|heat shock protein"), you may
have to specify a regexp to say which field is the source_id, which is the product, etc. In
every case, though, all you are doing is placing a sequence in a table and specifying which
information goes in which fields of that table.
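As an illustration of the kind of regexp involved, here is a minimal Python sketch that splits the example defline into named fields. The field order (source_id, then a secondary identifier, then the product) and the names `DEFLINE_RE` and `parse_defline` are assumptions made for this example. They are not the plugin's actual arguments.

```python
import re

# Assumed layout: >source_id|secondary_id|product
# (the meaning of each pipe-separated field is a guess for illustration)
DEFLINE_RE = re.compile(
    r"^>(?P<source_id>[^|]+)\|(?P<secondary_id>[^|]+)\|(?P<product>.+)$"
)

def parse_defline(defline):
    """Return a dict of named fields extracted from a FASTA defline."""
    match = DEFLINE_RE.match(defline.strip())
    if match is None:
        raise ValueError(f"defline does not match the expected layout: {defline!r}")
    return match.groupdict()

fields = parse_defline(">Cgd7_076|AE10117|heat shock protein")
print(fields["source_id"], "/", fields["product"])
```

Named groups keep the regexp self-documenting: each group corresponds to one target field in the sequence table.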
Features
Annotated sequence files are more complicated because, in addition to loading a
sequence, you must locate specific features on that sequence. The relationship of features
to sequences via locations requires the use of more than one table. Simple examples
include a promoter, a repeat region, or a UTR on an NA sequence. In each case, there
is a sequence, there is a feature located on that sequence, and there is a span on that
sequence where the feature is located. Each of these pieces (the sequence, the feature and
the location) has sub-properties of its own.
In GUS, location, feature and sequence are all stored in separate tables. The sequence
is stored in an NASequence table, usually ExternalNASequence. This table stores the
sequence, its primary id or accession as a source_id, a secondary identifier if one exists,
its length, its nucleotide counts, etc. Comments, keywords and secondary accessions may
also be linked to it via join tables to other tables that store those kinds of
information. Sequence_type is also stored in this table. This field is deprecated,
however, so you should also specify a sequence_ontology_id in the sequence table to link
back to a sequence ontology entry in DoTS.SequenceOntology.
Features on a sequence, be they repeats, promoters, or genes, are stored in their own,
custom views of the NAFeature tables. Each feature is stored as a record in a view
designed to store that kind of feature. Its source_id, description etc. are stored in fields
within that table. In addition, a foreign_key pointing back to the sequence on which it is
located is found in this table.
The location of this feature on that sequence is stored in a third table, DoTS.NALocation.
This table links, via na_feature_id, back to a specific entry in a
feature view. Start_min and start_max store the starting location. If the location is exact,
both of these values will be the same. Otherwise, the two values bound the region of
the sequence in which the feature starts. End_min and end_max are used in a similar manner
to store the end location of the feature. The attribute is_reversed defines whether this
feature is on the forward (5'-3') strand or on the complementary strand. How strandedness is
captured in GUS is described in table 1. A similar set of tables exists to capture amino acid
features on amino acid sequences; however, its location table lacks a strandedness
attribute since AA sequences are single stranded.
Table 1. NA Strandedness:
GFF     Direction     dots.nalocation.is_reversed
1       forward       0
-1      backwards     1
0       unknown       null
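The strandedness mapping above can be expressed as a small lookup. This is only a sketch: the function name is illustrative, and Python's `None` stands in for a database null.

```python
# Mapping taken from the strandedness table: 1 -> 0, -1 -> 1, 0 -> null.
STRAND_TO_IS_REVERSED = {1: 0, -1: 1, 0: None}

def strand_to_is_reversed(strand):
    """Convert a 1 / -1 / 0 strand value into an is_reversed value.

    None represents a database NULL (strand unknown).
    """
    try:
        return STRAND_TO_IS_REVERSED[strand]
    except KeyError:
        raise ValueError(f"unexpected strand value: {strand!r}") from None

print(strand_to_is_reversed(-1))
```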
Illustration 1: Data Model of a Nucleic Acid Feature in GUS
Sequences that contain genes and coding regions complicate this picture by adding
hierarchically organized relationships between a set of features coordinated along the
same span of a sequence. A gene will contain not only a gene feature, but also an RNA
feature, a set of exons and, if it codes for a protein, a coding sequence. To capture these,
GUS feature views can be organized hierarchically through parent_id relationships where
the parent_id of one feature points to the na_feature_id of another, more basic feature.
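The sequence, feature and location tables and the parent_id relationship can be sketched with an in-memory SQLite database. The schema below is a heavily simplified, hypothetical stand-in for the real GUS tables and views: its table names, column subset and the `name` column are assumptions for illustration. The coordinates come from the Chro.50340 gene in the GenBank record, and the mRNA and exon rows are the implied features discussed in this paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified, hypothetical schema (not the real GUS DDL)
CREATE TABLE na_sequence (
    na_sequence_id INTEGER PRIMARY KEY,
    source_id      TEXT NOT NULL,
    length         INTEGER NOT NULL
);
CREATE TABLE na_feature (
    na_feature_id  INTEGER PRIMARY KEY,
    na_sequence_id INTEGER NOT NULL REFERENCES na_sequence(na_sequence_id),
    parent_id      INTEGER REFERENCES na_feature(na_feature_id),
    name           TEXT NOT NULL,
    source_id      TEXT
);
CREATE TABLE na_location (
    na_location_id INTEGER PRIMARY KEY,
    na_feature_id  INTEGER NOT NULL REFERENCES na_feature(na_feature_id),
    start_min INTEGER, start_max INTEGER,
    end_min   INTEGER, end_max   INTEGER,
    is_reversed INTEGER
);
""")

conn.execute("INSERT INTO na_sequence VALUES (1, 'AAEL01000585', 3539)")

# gene -> mRNA -> (CDS, exon), linked through parent_id
features = [
    (1, 1, None, "gene", "Chro.50340"),
    (2, 1, 1,    "mRNA", "Chro.50340"),
    (3, 1, 2,    "CDS",  "Chro.50340"),
    (4, 1, 2,    "exon", "Chro.50340"),
]
conn.executemany("INSERT INTO na_feature VALUES (?, ?, ?, ?, ?)", features)

# Exact location on the complementary strand: min and max coincide
conn.executemany(
    "INSERT INTO na_location "
    "(na_feature_id, start_min, start_max, end_min, end_max, is_reversed) "
    "VALUES (?, 2271, 2271, 2909, 2909, 1)",
    [(f[0],) for f in features],
)

# Walk the hierarchy from the root gene down through parent_id links
rows = conn.execute("""
WITH RECURSIVE subtree(na_feature_id, depth) AS (
    SELECT na_feature_id, 0 FROM na_feature WHERE parent_id IS NULL
    UNION ALL
    SELECT f.na_feature_id, s.depth + 1
    FROM na_feature f JOIN subtree s ON f.parent_id = s.na_feature_id
)
SELECT f.name, s.depth, l.start_min, l.end_max, l.is_reversed
FROM subtree s
JOIN na_feature f  ON f.na_feature_id = s.na_feature_id
JOIN na_location l ON l.na_feature_id = f.na_feature_id
ORDER BY s.depth, f.na_feature_id
""").fetchall()

for name, depth, start, end, is_reversed in rows:
    print("  " * depth + name, start, end, is_reversed)
```

The recursive query is one way to recover the full hierarchy from parent_id alone. In a real GUS instance each feature type would live in its own view rather than in a single table with a `name` column.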
An example of a record which will build such a hierarchically organized set of
features can be found in the GenBank record at the end of this paper. This record
contains a number of features to locate on the sequence at the end of the file as well as a
significant amount of information about the sequence in the header. In GUS, this
sequence, its accession, length and nucleotide counts will be stored in
DoTS.ExternalNASequence. Keywords, comments, references and secondary accessions
found in the header info will be stored in tables for those sorts of information and linked
back to this ExternalNASequence entry.
Once this ExternalNASequence record is created, we can create NAFeature records to
attach to this sequence. The first of these features is the source feature which contains
various information about the source of this nucleotide sequence. This information is all
stored in a DoTS.Source entry which has a location spanning the entire sequence.
The next feature is a gene_feature with an associated CDS record. These features
could also include a divided location, which implies multiple exons. There is also an
implied mRNA feature that would serve as the template for the translation in the CDS.
GUS uses the following hierarchy to organize these features. This hierarchy is the one
assumed by the BioPerl Unflattener. CDS genes should have the following hierarchy:
gene (geneFeature)
    mRNA (RNAType)
        CDS (transcript)
        exon (ExonFeature)
RNA genes should have this hierarchy:
gene
    RNA
        exon
This default hierarchy is summarized in the BioPerl unflattener:
    mRNA => 'gene',
    tRNA => 'gene',
    rRNA => 'gene',
    scRNA => 'gene',
    snRNA => 'gene',
    snoRNA => 'gene',
    misc_RNA => 'gene',
    CDS => 'mRNA',
    exon => 'mRNA',
    intron => 'mRNA',
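To make the child-to-parent reading of this mapping concrete, the following Python sketch copies the pairs listed above into a dictionary and walks from a feature type up to its root. The names `PARENT_TYPE` and `ancestry` are illustrative and are not part of BioPerl or GUS.

```python
# Child feature type -> parent feature type, copied from the mapping above
PARENT_TYPE = {
    "mRNA": "gene",
    "tRNA": "gene",
    "rRNA": "gene",
    "scRNA": "gene",
    "snRNA": "gene",
    "snoRNA": "gene",
    "misc_RNA": "gene",
    "CDS": "mRNA",
    "exon": "mRNA",
    "intron": "mRNA",
}

def ancestry(feature_type):
    """Return the chain of types from feature_type up to its root."""
    chain = [feature_type]
    while chain[-1] in PARENT_TYPE:
        chain.append(PARENT_TYPE[chain[-1]])
    return chain

print(" -> ".join(ancestry("CDS")))
```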
Each feature table contains a number of attributes which must be populated. The
most important of these, the source_id, should hold a unique gene name. In our case,
the locus_tag qualifier contains the source_ids for our genes. This mapping of features onto
tables, and of feature values onto attributes, is contained in the genbank2sql.xml file found in
GUS/Supported/config. You may wish to change these mappings for your own projects.
All other features, including repeat regions, UTRs, etc., should map an NAFeature onto an
NASequence at a specific NALocation without hierarchical organization. A full mapping
for the transcript, exon, gene and RNA features can be found at the end of this document
in the ISF configuration XML.
All of this logic is captured in the GUS-supported plugin InsertSequenceFeatures. This
plugin begins by loading the ExternalNASequence and its associated references,
keywords and comments. It then un-flattens the feature hierarchy for gene features.
These are then loaded according to the hierarchy described earlier, with child features
pointing back to their parent features via the relationship parent_id -> na_feature_id.
If exons are not explicitly defined, they are created from the gene's location description.
Both the exons and the transcript feature (the GenBank CDS feature) are
attached to an mRNA record, which points back to the gene record.
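The idea of deriving exons from a gene's location description can be sketched as follows. This is a simplified illustration and not the plugin's code. It handles only plain spans, `join(...)` and an outer `complement(...)`; it ignores fuzzy boundaries (`<`, `>`), remote references and joins that mix strands. The `join` example string is made up for the demonstration.

```python
import re

SPAN_RE = re.compile(r"(\d+)\.\.(\d+)")

def parse_location(location):
    """Return ([(start, end), ...], is_reversed) for a simple GenBank location."""
    # Only an outer complement(...) is recognised in this sketch
    is_reversed = 1 if location.startswith("complement(") else 0
    spans = [(int(start), int(end)) for start, end in SPAN_RE.findall(location)]
    if not spans:
        raise ValueError(f"no start..end span found in {location!r}")
    return spans, is_reversed

def implied_exons(location):
    """Build one exon record per span of the location."""
    spans, is_reversed = parse_location(location)
    return [
        {"start": start, "end": end, "is_reversed": is_reversed}
        for start, end in spans
    ]

# Single-span gene from the GenBank record: one implied exon
print(implied_exons("complement(2271..2909)"))

# Hypothetical divided location: two implied exons
print(implied_exons("join(10..50,100..200)"))
```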
The translation attribute of a CDS record itself requires building a full set of
translation entries. This includes a TranslatedAAFeature, which links the Transcript
feature to a TranslatedAASequence entry where the translation sequence is stored. The
product description and the source_id should be passed along and entered in the
TranslatedAAFeature and TranslatedAASequence tables. These TranslatedAASequence
records can now, themselves, serve as base sequences for protein features defined by
various applications, including signal peptides and trans-membrane domains. Mirroring
NAFeatures, AAFeatures have an AALocation on an AASequence. The entire model is
summarized in figure 2.
LOCUS       AAEL01000585            3539 bp    DNA     linear   INV 27-OCT-2004
DEFINITION  Cryptosporidium hominis strain TU502 chromosome 5 CHRO015104, whole
            genome shotgun sequence.
ACCESSION   AAEL01000585 AAEL01000000
VERSION     AAEL01000585.1  GI:54655937
KEYWORDS    WGS.
SOURCE      Cryptosporidium hominis
  ORGANISM  Cryptosporidium hominis
            Eukaryota; Alveolata; Apicomplexa; Coccidia; Eimeriida;
            Cryptosporidiidae; Cryptosporidium.
REFERENCE   1  (bases 1 to 3539)
  AUTHORS   Xu,P., Widmer,G., Wang,Y., Ozaki,L.S., Alves,J.M., Serrano,M.G.,
            Puiu,D., Manque,P., Akiyoshi,D., Mackey,A.J., Pearson,W.R.,
            Dear,P.H., Bankier,A.T., Peterson,D.L., Abrahamsen,M.S., Kapur,V.,
            Tzipori,S. and Buck,G.A.
  TITLE     The genome of Cryptosporidium hominis
  JOURNAL   Nature 431, 1107-1112 (2004)
REFERENCE   2  (bases 1 to 3539)
  AUTHORS   Xu,P., Widmer,G., Wang,Y., Ozaki,L.S., Alves,J.M., Serrano,M.G.,
            Puiu,D., Manque,P., Akiyoshi,D., Mackey,A.J., Pearson,W.R.,
            Dear,P.H., Bankier,A.T., Peterson,D.L., Abrahamsen,M.S., Kapur,V.,
            Tzipori,S. and Buck,G.A.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-JUN-2004) Center for the Study of Biological
            Complexity, Virginia Commonwealth University, Trani Center for Life
            Sciences, 1000 W Cary St, Richmond, VA 23298, USA
FEATURES             Location/Qualifiers
     source          1..3539
                     /organism="Cryptosporidium hominis"
                     /mol_type="genomic DNA"
                     /strain="TU502"
                     /db_xref="taxon:237895"
                     /chromosome="5"
     gene            514..1434
                     /locus_tag="Chro.50341"
     CDS             514..1434
                     /locus_tag="Chro.50341"
                     /codon_start=1
                     /product="cancer-associated gene protein like (41.3 kD)
                     (4A872)"
                     /protein_id="EAL35084.1"
                     /db_xref="GI:54655939"
                     /translation="MKYRMFKRGNRVGVCISGGKDSSVLLNVLYELNKRKDYGIELEL
                     IAVDEGIKGYRDDSLEVVKYQQEYYNCPLTILSFKDMFNTTMDEIQSKSSKSNSCTYC
                     GVFRRKALDIGSYKVNADVICTGHSCDDTCETLLLNILRGDFNRLRRCINPITNNEIT
                     KTKDQMQNHDSQNEAFLNIKPRVKPLMYCYEKEIVLYAHYLNLKYFSTECTYSVDAYR
                     GVSREFIRKIQSFDYKYSFNMILAAQELNLEQSNSSPNYIARKCTICGYISSSTICNG
                     CNLVNALKHDNPNLILKNQRQKKKILLQES"
     gene            1723..1798
                     /gene="trnT"
                     /locus_tag="Chro.trn001"
     tRNA            1723..1798
                     /gene="trnT"
                     /locus_tag="Chro.trn001"
                     /product="tRNA-Thr"
     gene            complement(2271..2909)
                     /locus_tag="Chro.50340"
     CDS             complement(2271..2909)
                     /locus_tag="Chro.50340"
                     /codon_start=1
                     /product="vacuolar ATP synthase subunit D"
                     /protein_id="EAL35083.1"
                     /db_xref="GI:54655938"
                     /translation="MLKEIVETKRSIGNDIKEASFALAKATWAAGDFKDRIIESCKRP
                     TVTMEVGTENIAGVRLPIFEMNVDNNSSTETCHIGVASGGQVIQSTREIYMKVLRDLV
                     KLASLQTAFFSLDEEIKMTNRRVNALQNVVLPKLEDGMNYILRELDEIEREEFFRLKK
                     IQEKKKEWAEAELQEKLKKDRNNSKENDSSLYDTIKNSGDSILEQKNEGILF"
ORIGIN
        1 cgaaaaatcc atataatttg acaatttttt ccatttattt taattatttt ttttttttta
       61 aattatatat atatgtaaat atatatatat atattcggta ccgttattgg cgccaatgcc
      121 atgatcacct ttggcgccaa tagctgaatt aaaattaaat gtcaaataaa gattaaaata
      181 aagttaaaaa aaaaaaaaat agttggaatt tatatgaata tattgagcaa tcagataatg
      241 agcaatattg tgagttgtta ttactgtaag gaagggaagg ttgtaatgaa aaggtgcaaa
      301 acagggtacc taagttgcag gagatgtttt attgataact ttgaagagga ggtacatgaa
      361 tatataatgg aaaataggta agtaaataat gattgaatgg aactttccat taagaaaaat
      421 ttgaataggt ttttaatatg actgcggttc atatttatta ccttattatt aatatttaaa
      481 attctaactt gttttaatta tcttatttta ttaatgaaat acagaatgtt taaaagagga
      541 aatagagttg gagtatgtat ttcaggagga aaggattcaa gtgttttatt aaatgtatta
      601 tatgaactaa ataaaagaaa agattacggt atagaattgg aattaattgc ggtggatgaa
      661 ggaattaagg gatataggga tgactctcta gaagttgtca aatatcagca ggaatactac
      721 aattgtccct taactattct atcctttaaa gatatgttta atacaacaat ggacgaaatt
      781 caaagtaaat caagtaaatc aaattcgtgc acatattgtg gtgtatttag aagaaaagca
      841 ttagatattg gaagctataa agttaatgct gatgttattt gtacaggtca tagttgcgat
      901 gatacatgtg aaactttatt attaaatata ctacgtggcg attttaatag attacgaaga
      961 tgtattaacc caataacaaa taatgagatc actaaaacta aagatcaaat gcaaaatcat
     1021 gattctcaaa atgaagcttt cttaaatata aagcctagag ttaagccttt aatgtattgt
     1081 tatgaaaaag aaattgtatt atatgcgcat tatcttaacc tcaaatattt ctcaacagaa
     1141 tgtacttatt ctgtggatgc atatagagga gtttcacgtg aattcatacg taagattcaa
     1201 agctttgatt acaaatactc tttcaatatg atacttgcag ctcaagaact taatcttgaa
     1261 cagtctaatt ctagtcccaa ctatatagct agaaaatgta ctatttgcgg atacatttca
     1321 tcatctacta tttgtaatgg atgcaacctt gttaacgctc ttaaacatga taatccaaac
     1381 ttaattctta aaaatcaaag acagaaaaag aaaattttat tacaagaaag ttaatataaa
     1441 tcaattataa gtctgcccgc tttaacacgt aatatatttc ttgataataa attattttat
     1501 tataaagcgt tgcttacatg atatatttca actcaatatt tgcaatatta tcaaatatgt
     1561 attacatatt tctatcaaat tcaatctcac aaataataaa tcatgcttta ttattatttt
     1621 tttttttatt tattaattta caattctcct cactttttcc tcactttttc acacatctca
     1681 caatatttta atatgaccca atttttacca ctcgttagcg tcgccttcgt agcatagtct
     1741 tggttattgc gtcggtcttg taaaccgaag gtcacgagtt cgaatctcgt cgaaggctca
     1801 tttttacaga tttctcaaat tgtttaatag agcccccctt tcccgctaaa attcaatagc
     1861 gcgtgtacaa gtaaacgtgt aatttattat ttatgtattt tttaatatta aattttaatt
     1921 tttttttaaa gtattaataa atattattaa aaatatttat ttatatgatt aaactcaact
     1981 atatgcttta aattcctagt ttttttttaa tttactatca tttaactaag tgaatatttc
     2041 aaaagtactt acttgtgccg ctcaaaaatt ttattcttga aaattaaact aaacacgtaa
     2101 aactattcaa taatatgaat ttccttttca tttcacgaac ttattactta aaaaactatc
     2161 tataactgta tccatttttt cctattaata ctctattatt atgaatatca ataattattt
     2221 gaaatatatt ttatagagaa attaatcgaa aaaaaaataa cgtgaatcat ttaaaataaa
     2281 attccttcat ttttttgctc caaaattgaa tcaccagagt tcttgattgt atcgtataaa
     2341 gaagaatcgt tttctttaga gttatttcta tccttcttta gtttctcttg aagttcagct
     2401 tctgcccatt ccttcttctt ctcttgaatc ttcttcaatc taaagaattc ttctctctca
     2461 atttcatcca attctctcaa aatgtaattc ataccatcct ccaattttgg taaaactaca
     2521 ttctgtaatg catttacacg acgatttgtc atcttaatct cttcatctaa agaaaagaat
     2581 gcagtctgta aactggctaa cttaactaaa tcccttaata ctttcatgta aatttctctt
     2641 gttgattgaa ttacttgacc tccacttgct acgccaatat ggcatgtttc tgtagatgaa
     2701 ttattatcaa cattcatctc aaaaattggt aatctaacac ctgcaatatt ctcagtgcca
     2761 acctccattg ttactgttgg tcttttacat gattctatga ttctatcttt gaaatcacca
     2821 gctgcccatg ttgcctttgc caatgcaaat gatgcttctt taatatcatt acctattgaa
     2881 cgttttgttt caacaatttc ttttaacatt cctctaaatt tgtttgataa agcatcactc
     2941 tttcttttca ataaatcata tccttgtttt gctcctttgc tcttcaactt tattgcttgt
     3001 aaagccctat aaattaacaa ttaattttca atattaaaaa ttattctata taaattgtaa
     3061 acttacattc gacttggtgt tggtttctga tccatttttc tttcttcact aaaacactat
     3121 tttctccttt aaattgcttt tgttttacta ataattaaac tacctatcta tttatcttcc
     3181 tcttatttat atttaaatat ataaccactt tattataaat actatcctat attgttgtta
     3241 tttatatgta tatatattta tatatataca tatattcaat tactgatata caattaaaga
     3301 gaaaatatct tttatacttc ctattattat cctttgataa tatttttatt tatttttctc
     3361 tcttttttat ttttattatt attatttgtt tgcctccttt ttttttatga gaaaaaaaaa
     3421 attcgatttt ttttccgata atttatatta aaaaataatt tcggcgggaa ctcgtgggga
     3481 tttgaccgcc atttcactat ataggtatcg agatctttat aagtaatgaa aataattaa
//