Amino Acid Sequence Record Description 1 A protein is a large

advertisement
Amino Acid Sequence Record Description
A protein is a large molecule manufactured in the cell of a living organism to carry out essential functions within the cell.
The primary structure of a protein is a sequence of amino acids. There are 20 common amino acids, each of which has a
chemical name (e.g., "Glycine"), a three-letter abbreviation (e.g., "Gly"), and a one-letter code (e.g., "G"). See
http://www.chemie.fu-berlin.de/chemistry/bio/amino-acids_en.html
for a table about the chemistry of the amino acids and
http://courses.cs.vt.edu/~algnbio/genetic_code/index.php
for information about how the amino acids fit into the genetic code. The twenty one-letter codes are:
A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
For the purpose of representing and manipulating the primary structure of a protein, it suffices to use the one-letter codes in a
string. For example,
MLQSIIKNIWIPMKPYYTKVYQEIWIGMGLMGFIVYKIRAADKRSKALKASAPAPGHH
is the amino acid sequence for a human protein called "6.8 kDa mitochondrial proteolipid". In this project, amino acid
sequences will always be upper-case, with no white space inserted. There are many online databases from which protein
sequences can be obtained. One is SWISS-PROT, which can be found at http://ca.expasy.org/sprot/. As of
March 27, 2004, SWISS-PROT contained database entries for 146,720 amino acid sequences. Each database entry has much
more information about a protein than its sequence, as can be seen by going to
http://ca.expasy.org/cgi-bin/get-full-entry?[SWISS_PROT-ID:'68MP_HUMAN']
This is the entry for the protein whose sequence was given above.
A complete protein record may contain a fairly large number of logical fields. These are flagged with two-character
sequences occurring at the beginning of each line. A full listing of the possible fields is given in Table 1 on page 2 of this
specification. It is important to note that some protein records will contain only a proper subset of the possible fields. In
addition the amount of data for each field can vary considerably.
For our purposes in this assignment, we will use a text file of complete SWISS-PROT entries. You do not need to be
concerned with validating the correctness of the database entries.
A full description of the logical significance of the various fields, and any format constraints, is given in the UniProt User
Manual, which is available at http://us.expasy.org/sprot/userman.html.
Note: some of the sequence data files may contain multiple entries corresponding to the same accession code. In such a
case, your implementation should recognize if an accession code is already in the index structure, and if so simply reject the
duplicate entries.
It is also possible that a protein record may list more than one accession code. In that case, you should index the record using
the first code that is listed (the primary accession code) and disregard the rest.
1
Amino Acid Sequence Record Description
Table 1 Protein record field specifications:
Each line begins with a two-character line code, which indicates the type of data contained in the line. The current line types
and line codes and the order in which they appear in an entry, are shown in the table below.
Line code Content
Occurrence in an entry
ID
Identification
Once; starts the entry
AC
Accession number(s)
Once or more
DT
Date
Three times
DE
Description
Once or more
GN
Gene name(s)
Optional
OS
Organism species
Once or more
OG
Organelle
Optional
OC
Organism classification
Once or more
OX
Taxonomy cross-reference(s) Once or more
RN
Reference number
RP
Reference position
Once or more
RC
Reference comment(s)
Optional
RX
Reference cross-reference(s) Optional
RG
Reference group
Once or more (Optional if RA line)
RA
Reference authors
Once or more (Optional if RG line)
RT
Reference title
Optional
RL
Reference location
Once or more
CC
Comments or notes
Optional
DR
Database cross-references
Optional
KW
Keywords
Optional
FT
Feature table data
Optional
SQ
Sequence header
Once
(blanks)
Sequence data
Once or more
//
Termination line
Once; ends the entry
Comments
3
[O,P,Q][0-9][A-Z,0-9] [0-9]
Once or more
As shown in the table, some line types are found in all entries, others are optional. Some line types occur many times in a
single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//).
Note that some formatting details must be inferred from the sample data files provided on the course website, and the detailed
documentation available online in the UniProt User Manual.
2
Amino Acid Sequence Record Description
Figure 1 Sample Protein Record:
ID
AC
DT
DT
DT
DE
GN
OS
OC
OC
OX
RN
RP
RC
RA
RL
CC
CC
CC
CC
CC
CC
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
KW
SQ
143S_MOUSE
STANDARD;
PRT;
248 AA.
O70456;
16-OCT-2001 (Rel. 40, Created)
16-OCT-2001 (Rel. 40, Last sequence update)
28-FEB-2003 (Rel. 41, Last annotation update)
14-3-3 protein sigma (Stratifin).
SFN OR MKRN3.
Mus musculus (Mouse).
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
NCBI_TaxID=10090;
[1]
SEQUENCE FROM N.A.
STRAIN=FVB/N;
Karpitskiy V.V., Shaw A.S.;
Submitted (APR-1998) to the EMBL/GenBank/DDBJ databases.
-!- FUNCTION: P53-regulated inhibitor of G2/M progression (By
similarity).
-!- SUBUNIT: Homodimer (By similarity).
-!- SUBCELLULAR LOCATION: Cytoplasmic or may be secreted by a nonclassical secretory pathway (By similarity).
-!- SIMILARITY: Belongs to the 14-3-3 family.
EMBL; AF058798; AAC14344.1; -.
HSSP; P29312; 1A38.
MGD; MGI:1891831; Sfn.
GO; GO:0005737; C:cytoplasm; IDA.
GO; GO:0000079; P:regulation of CDK activity; IDA.
InterPro; IPR000308; 14-3-3.
Pfam; PF00244; 14-3-3; 1.
PRINTS; PR00305; 1433ZETA.
ProDom; PD000600; 14-3-3; 1.
SMART; SM00101; 14_3_3; 1.
PROSITE; PS00796; 1433_1; 1.
PROSITE; PS00797; 1433_2; 1.
Multigene family.
SEQUENCE
248 AA; 27713 MW; D433390433FB3F48 CRC64;
MERASLIQKA KLAEQAERYE DMAAFMKSAV EKGEELSCEE RNLLSVAYKN VVGGQRAAWR
VLSSIEQKSN EEGSEEKGPE VKEYREKVET ELRGVCDTVL GLLDSHLIKG AGDAESRVFY
LKMKGDYYRY LAEVATGDDK KRIIDSARSA YQEAMDISKK EMPPTNPIRL GLALNFSVFH
YEIANSPEEA ISLAKTTFDE AMADLHTLSE DSYKDSTLIM QLLRDNLTLW TADSAGEEGG
EAPDDPHI
//
Note that there are absolutely no stated limits on the lengths of the strings that occur in the protein records.
3
Download