winword doc

advertisement
Removing Redundancy in SWISS-PROT and TrEMBL
Claire O'Donovan, Maria Jesus Martin, Eric Glemet*, Jean-Jaques Codani* and
Rolf Apweiler.
EMBL Outstation-The European Bioinformatics Institute, Wellcome Trust
Genome Campus, Hinxton, Cambridge, CB10 1SD, UK and
*
INRIA 78153 Le-Chesney Cedex, France
Abstract
Introduction
Motivation: One of the distinguishing
criteria of the SWISS-PROT protein
sequence data bank is minimal
redundancy. Many sequence databases
contain, for a given protein sequence,
separate entries which correspond to
different literature reports. In SWISSPROT, we try as much as possible to
merge all these data so as to minimize
the redundancy of the database. The
introduction of TrEMBL as a
supplementary database ensure the
comprehensiveness of SWISS-PROT and
TrEMBL but introduced some degree of
redundancy because some of the
sequences in TrEMBL are actually
additional reports of sequences already
in SWISS-PROT. We wish to identify and
merge these sequence as quickly as
possible with the already existing
SWISS-PROT entries to make SWISSPROT and TrEMBL completely nonredundant.
Results:We developed a strategy to
identify the redundancy present within
SWISS-PROT and TrEMBL and between
them and for the subsequent removal of
this redundancy from the database by
merging these data.
Availability: The tools mentioned in this
paper are available on request.
Contact: odonovan@ebi.ac.uk
SWISS-PROT
SWISS-PROT, (Bairoch and Apweiler
1998) is an annotated protein sequence
database established in 1986 and
maintained collaboratively, since 1987,
by the Department of Medical
Biochemistry of the University of
Geneva and the EMBL data library (now
the EMBL Outstation- The European
Bioinformatics Institute (EBI) (Stoesser
et al 1997) It is the most widely used
protein sequence database since it
distinguishes itself from other sequence
databases by three essential criteria.
Annotation. One of SWISS-PROT’s
leading concepts from the very
beginning was to provide far more than a
simple collection of protein sequences,
but rather a critical view of what is
known or postulated about each of these
sequences. In SWISS-PROT, each
sequence entry consists of the sequence
data, the citation information
(bibliographical references), the
taxonomic data (description of the
biological source of the protein) and the
annotation which describes the following
items:
 Function(s) of the protein
 Post-translational
modification(s).
For
example
carbohydrates,
phosphorylation, acetylation, GPIanchor, etc
 Domains and sites. For example
calcium binding regions, ATPbinding
sites,
zinc
fingers,
homeobox, kringle, etc
 Secondary structure
 Quaternary structure
 Similarities to other proteins
 Disease(s) associated with
deficiencie(s) in the protein
 Sequence conflicts, variants, etc.
In SWISS-PROT, annotation is mainly
found in the comment lines (CC), in the
feature table (FT) and in the keyword
lines (KW). A controlled vocabulary is
used whenever possible. This approach
permits the easy retrieval of specific
categories of data from the database. As
much annotation as possible is included
in SWISS-PROT. This information is
obtained from publications that report
new sequence data and from review
articles that are used to periodically
update the annotations of families or
groups of proteins. External experts
have also been recruited to send us their
comments and updates concerning
specific groups of proteins.
Integration with other databases. It is
important to provide the users of
biomolecular databases with a degree of
integration between the three types of
sequence-related databases (nucleic acid
sequences, protein sequences and protein
tertiary structures) as well as with
specialized data collections. SWISSPROT is currently cross-referenced with
28 different databases. Cross-references
are provided in the form of pointers to
information related to SWISS-PROT
entries and found in data collections other
than SWISS-PROT.
Minimal redundancy. Many sequence
databases contain, for a given protein
sequence, separate entries which
correspond to different literature reports.
In SWISS-PROT, we try as much as
possible to merge all these data so as to
minimize the redundancy of the
database. If conflicts exist between
various sequencing reports, they are
indicated in the feature table of the
corresponding entry.
TrEMBL
Ongoing genome sequencing projects
have dramatically increased the number
of protein sequences to be incorporated
into SWISS-PROT. Since we do not
want to dilute the quality standards of
SWISS-PROT
by
incorporating
sequences into SWISS-PROT without
proper sequence analysis and annotation,
we cannot accelerate the incorporation
of new incoming data indefinitely.
However as we also want to make
sequences available as quickly as
possible, we introduced TrEMBL
(Translation of EMBL nucleotide
sequence database) in early 1997.
TrEMBL consists of computer-annotated
entries in SWISS-PROT-like format
derived from the translation of all coding
sequences (CDS) in the EMBL database,
except for CDS already present in
SWISS-PROT.
The production of TrEMBL has
emphasized the importance of linking
not only to a whole EMBL nucleotide
sequence entry but to linking within that
entry at the CDS feature level. This
linking has now been achieved by using
the ‘PID’, the Protein Identification
number found in the ‘/db_xref’ qualifier
tagged to every CDS in the EMBL
nucleotide sequence database. The DR
lines of SWISS-PROT and TrEMBL
entries pointing to an EMBL database
entry are now citing the EMBL
accession number as primary identifier
and the PID as secondary identifier. In
all cases where a ‘PID’ is already
integrated into SWISS-PROT, a
‘db_xref’
qualifier
citing
the
corresponding SWISS-PROT entry is
added to the EMBL nucleotide sequence
database CDS labeled with this ‘PID’. In
the remaining cases, a ‘/db_xref’
qualifier is pointing to the corresponding
TrEMBL entry. This approach helps us
to point precisely from a given SWISSPROT entry to one of potentially many
CDS in the corresponding EMBL entry
and vice versa. We have split TrEMBL
into two main sections: SP-TrEMBL and
REM-TrEMBL:
SP-TrEMBL(SWISS-PROT TrEMBL)
contains the entries which should be
incorporated
into
SWISS-PROT.
SWISS-PROT accession numbers have
been assigned to these entries.
REM-TrEMBL(REMaining TrEMBL)
contains the entries that we do not want
to include in SWISS-PROT. This section
is organized into five subsections:
1. Most REM-TrEMBL entries are
immunoglobulins
and
T-cell
receptors. We do not enter these
sequences into SWISS-PROT since
this highly redundant data would
bias database-wide searches. We
would like to create a specialized
database
dealing
with
these
sequences as a further supplement to
SWISS-PROT and keep only a
representative cross-section of these
proteins in SWISS-PROT.
2. Another category of data which will
not be included in SWISS-PROT are
synthetic sequences. Again we do
not want to leave these entries in
REM-TrEMBL. Ideally one should
build a specialized database for
artificial sequences as a further
supplement to SWISS-PROT.
3. Fragments with less than eight amino
acids.
4. Coding sequences obtained from
patent applications.
5. The last subsection consists of CDS
translations where we have strong
evidence to believe that these CDS
are not coding for real proteins.
SWISS-PROT and TrEMBL are
currently cross-referenced by 470,000
verified links to 28 other databases. Both
the sequences and the annotation of
SWISS-PROT and TrEMBL entries are
constantly updated. The doubling time of
the database is now less than 18 months.
SWISS-PROT and TrEMBL represent
the most complete and up-to-date protein
sequence database with the lowest
degree of redundancy and the highest
standard of annotation publicly available
today. In addition to the regular releases
of TrEMBL, TrEMBLNEW is created
weekly from EMBLNEW and is
available
under
SRS
at
http://www.ebi.ac.uk and ftp.ebi.ac.uk
These entries do not have SWISS-PROT
accession numbers and the PID is used
in the ID and the AC lines.
Redundancy
The stringent requirement of minimal
redundancy as stated above for SWISSPROT applies equally to SP-TrEMBL
but this has created a different set of
challenges due to computer-generated
nature of SP-TrEMBL. SP-TrEMBL is
partially redundant against SWISSPROT since a small but significant
percentage of these entries are actually
additional reports of proteins already
present in SWISS-PROT. We wish to
identify and merge these sequences as
quickly as possible with the
corresponding SWISS-PROT entries to
make SWISS-PROT and SP-TrEMBL
completely non-redundant. There are
different kinds of redundancy, which are
commonplace in many sequence
databases:
-Different literature and sequence reports
of a given protein sequence.
-Mutations, polymorphism, variations in
the sequence which are often given
separate entries in the nucleotide
sequence databases.
These redundancies should not be
present in SWISS-PROT or TrEMBL. It
was necessary to find methods to
manipulate the data from redundant
source databases to meet our stringent
standards of minimal redundancy. The
objective was the development and
application of methods to recognize and
eliminate the redundancy already present
in the databases, and to prevent further
redundancy entering the database.
Algorithms and Methods
We have identified thousands of SPTrEMBL entries which exactly matched
SWISS-PROT or SP-TrEMBL entries
using the CRC32 checksum. These
entries have been merged and this gave
us an indication of the size of the
redundancy problem. However it
became clear that there were still tens of
thousands of cases of potential (not
easily detectable) redundancy which we
also wished to eliminate.
 Exact matches of fragments (a SPTrEMBL entry is a fragment of a
SWISS-PROT entry or vice-versa; or
a SP-TrEMBL entry is a fragment of
another SP-TrEMBL entry).
 SWISS-PROT and SP-TrEMBL protein
entries from the same organism which
should be identical but differ due to
sequencing errors, variants, frameshifts
etc.
We established a pipeline of three main
procedures to achieve the identification
and removal of redundancy in SWISSPROT and TrEMBL. This has also been
integrated into the TrEMBL production
process to ensure such redundancy does
not occur again. It is as follows:
Cyclic Redundancy Check (CRC32)
The Cyclic Redundancy Check (CRC)
calculates a nearly unique and very
compact checksum for each sequence
and this allows fast and accurate
detection of identical sequences.
At
every TrEMBL release, a CRC32 check
is carried out to identify identical
sequences in SP-TrEMBL and SWISSPROT. A curator then merges these
entries manually. There is also a CRC32
check of TrEMBLNEW against SPTrEMBL and SWISS-PROT. The
TrEMBLNEW entries which match
SWISS-PROT entries are collated for
annotation by curators. TrEMBLNEW
entries which match SP-TrEMBL entries
are merged into one entry automatically,
with the following exceptions:
Viral protein fragments
Cross-species protein fragments
MHC fragments
Plasmodium merozites surface antigen
fragments
Outer membrane protein fragments
Fusion protein fragments
Homeobox or Homeodomain protein
fragments
LASSAP and TrEMBL redundancy
The next step in reducing redundancy is
to merge subfragments to longer length
sequences. For this purpose, INRIA has
provided LASSAP (Large Scale
Sequence compArison Package) (Glemet
and Codani 1997 and Codani and
Glemet
1995).
LASSAP
is
a
programmable, high performance system
designed to raise the current limitations
of sequence comparison programs in
order to fit the needs of large scale
analysis.
It provides an API
(Application Programming Interface)
allowing the integration of any generic
pairwise-based algorithm. It allows the
use of several sequence comparison
methods: BLAST (Altschul et al 1990),
FASTA (Pearson and Lipman 1988),
dynamic programming with local (Smith
and Waterman 1981) and global
(Needleman
and
Wunsch
1970)
similarity searches, string matching with
(Landau and Vishkin 1986, Hunt and
Szmanski (1977) or without errors
(Boyer and Moore 1977) and pattern
matching with (Baeza-Yates and Gonnet
1989, Wu and Manber 1991) or without
errors. LASSAP is both an integrated
software for end users and a framework
allowing the integration and combination
of new algorithms. LASSAP has been
modified specifically to identify
redundancy in SWISS-PROT and
TrEMBL. String matching algorithms
have been incorporated and subsequently
modified to cope with SWISS-PROT’s
specific annotation, such as the organism
specification (OS lines of SWISS-PROT
and TrEMBL entries). These modified
algorithms can operate directly on a
whole TrEMBL division, or on TrEMBL
against SWISS-PROT.
Initially this was used for the reduction
of the redundancy already present in SPTrEMBL and SWISS-PROT. This work
has now been completed. Currently, this
step is an integral part of the TrEMBL
production process at each release in
order to check for such subfragment
redundancy within TrEMBLNEW itself
and then between TrEMBLNEW, SPTrEMBL and SWISS-PROT.
MERGE
Finally, a collection of shell scripts and
C programs are used to process the
matches identified by the LASSAP
algorithms. This collection enables the
automatic merging of the entries in a
manner consistent with the required
SWISS-PROT standards of annotation
and accuracy. Multiple matches are
within the scope of this procedure except
when the longer sequences differ outside
the subfragment match range. These
cases are collated and set aside for
curator analysis. In the process of
merging, a master entry is chosen from
the group of entries for merging. This
master entry serves as a template copy to
which new information from the
corresponding grouped entries is added.
For subfragment matching, the master
sequence is the longest sequence and
new citations, cross-references and
alternative protein names from the
subfragments are added to it. When the
subfragment is an SP-TrEMBL entry,
the ID, AC and DATE OF CREATION
DT lines are incorporated in the master
copy.
Results
The data set used for the development of
the redundancy strategy was SWISSPROT (release 35), and EMBL (release
53) and as a result, TrEMBL (release 5)
was produced.
SWISS-PROT (release 35) consisted of
69,113 entries comprising 25,083768
amino acids abstracted from 59,101
references and had 250,000 links to 28
other databases. TrEMBL (release 5)
was created from the EMBL Nucleotide
Sequence Database release 53 which
contained 290,000 CDS. TrEMBL
Release 5 consisted of 166,361 entries
comprising 45,671684 amino acids and
had 220,000 links to 28 other databases.
These statistics were influenced by the
redundancy methods in a number of
ways. Of the 290,000 CDS in EMBL,
100,000 were already present as
sequence reports in SWISS-PROT and
were excluded from the TrEMBL
production process. The remaining
190,000 CDS were merged whenever
possible using the new strategy to reduce
redundancy and the final result was
166,361 entries. This removal of 30,000
entries clearly shows the value of the
redundancy procedures that have been
developed and implemented already.
With the exponentially increasing
dataflow from the genome projects, this
approach will continue to prove useful.
Another useful result was the
highlighting of a number of areas where
further improvement is possible. SPTrEMBL is partially redundant against
SWISS-PROT since approximately
40,000 of these entries are actually
additional sequence reports of proteins
already in SWISS-PROT. It is believed
that this redundancy is due once again to
different literature and sequence reports
of a given protein sequence and the
mutations, polymorphism, variations in
the sequence but in these cases, are nonexact matches. The software used by us
is being modified currently to deal with
this data.
nucleotide sequence databases plus
approximately 50,000 additional
sequence reports. This clearly indicates
that SWISS-PROT and TrEMBL
represent the most complete and up-todate protein sequence database with the
lowest degree of redundancy and the
highest degree of annotation publicly
available today. SWISS-PROT and
TrEMBL is now widely used and will
become the standard non-redundant
protein sequence database. Substantial
redundancy reduction has been already
achieved and methods are now in place
in the TrEMBL production process to
ensure that future potential redundancy
will be avoided. It has proved to be a
worthwhile approach to redundancy
removal. There is also plenty of scope
for further improvement, particularly in
the area of non-exact matches. The
objective is to modify and develop
LASSAP to deal with this kind of
redundancy. This is particularly
important with regards to the vast
amount of genome sequencing data. A
significant proportion of this data is
already present in SWISS-PROT as a
result of work by individual laboratories
and it is an important requirement for us
to merge this valuable biochemical data
with the incoming genome sequencing
data.
Discussion
References
SWISS-PROT and TrEMBL contain the
up-to-date sequences of all 325,000
coding sequence reports currently in the
Altschul,S.F.,
Gish,W.,
Miller,W.,
Myers,E. and Lipman,D. (1990). Basic
Acknowledgements
This work was supported in part by grant
numbers BIO4-CT97-2099 and
BIO4-CT96-4837 of the European
Commission.
local alignment search tool. J. Mol. Biol.
215, 403–410.
Baeza-Yates,R.A. and Gonnet,G.H.
(1989) Efficient text searching of regular
expressions. Proceedings ICALP 16,4661.
Bairoch,A. and Apweiler,R. (1998) The
SWISS-PROT protein sequence data
bank and its supplement TrEMBL in
1998 Nucleic Acids Res. 26,38-42.
Boyer,R. and Moore,S. (1977). A fast
string matching algorithm. C.A.C.M.
20,762-772.
Codani,J.J. and Glemet,E. (1995)
Parallelism in LASSAP, a large scale
sequence comparison package. Lecture
Notes in Computer Science. SpringerVerlag, 919, 787-792. Proceedings of
HPCN95 conference, Milan.
Glémet,E. and Codani,J.J. (1997).
Lassap: a large scale sequence
comparison package. Comp. Appl.
BioSci. 13, 137–143.
Hunt,J. and Szymanski,T.G. (1977) A
fast algorithm for computing longest
common subsequences. Theoretical
Computer Science 20,350-353.
Landau,G.M. and Vishkin,U. (1986)
Efficient string matching with k
mismatches. Theoretical Computer
Science 43,239-249.
Needleman,S.B.
and
Wunsch,C.D.
(1970). A general method applicable to
the search for similarities in the amino
acid sequence of two proteins. J. Mol.
Biol. 48, 443–453.
Pearson,W.R. and Lipman,D.J. (1988)
Improved tools for biological sequence
comparison. Proc. Natl. Acad. Sci.
U.S.A. 85,2444-2448.
Smith,T.F. and Waterman,M.S. (1981).
Identification of common molecular
subsequences. J. Mol. Biol. 147, 195–
197.
Stoesser,G.,
Sterk,P.,
Tuli,M.A.,
Stoehr,P.J. and Cameron,G.N. (1997)
The EMBL Nucleotide Sequence
Database. Nucleic Acids Res. 25:7-13.
Wu,S. and Manber,U. (1991) Fast text
searching with errors. Technical Report
TR 91-11, University of Arizona.
Download