Removing Redundancy in SWISS-PROT and TrEMBL Claire O'Donovan, Maria Jesus Martin, Eric Glemet*, Jean-Jaques Codani* and Rolf Apweiler. EMBL Outstation-The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK and * INRIA 78153 Le-Chesney Cedex, France Abstract Introduction Motivation: One of the distinguishing criteria of the SWISS-PROT protein sequence data bank is minimal redundancy. Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In SWISSPROT, we try as much as possible to merge all these data so as to minimize the redundancy of the database. The introduction of TrEMBL as a supplementary database ensure the comprehensiveness of SWISS-PROT and TrEMBL but introduced some degree of redundancy because some of the sequences in TrEMBL are actually additional reports of sequences already in SWISS-PROT. We wish to identify and merge these sequence as quickly as possible with the already existing SWISS-PROT entries to make SWISSPROT and TrEMBL completely nonredundant. Results:We developed a strategy to identify the redundancy present within SWISS-PROT and TrEMBL and between them and for the subsequent removal of this redundancy from the database by merging these data. Availability: The tools mentioned in this paper are available on request. Contact: odonovan@ebi.ac.uk SWISS-PROT SWISS-PROT, (Bairoch and Apweiler 1998) is an annotated protein sequence database established in 1986 and maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL data library (now the EMBL Outstation- The European Bioinformatics Institute (EBI) (Stoesser et al 1997) It is the most widely used protein sequence database since it distinguishes itself from other sequence databases by three essential criteria. Annotation. One of SWISS-PROT’s leading concepts from the very beginning was to provide far more than a simple collection of protein sequences, but rather a critical view of what is known or postulated about each of these sequences. In SWISS-PROT, each sequence entry consists of the sequence data, the citation information (bibliographical references), the taxonomic data (description of the biological source of the protein) and the annotation which describes the following items: Function(s) of the protein Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPIanchor, etc Domains and sites. For example calcium binding regions, ATPbinding sites, zinc fingers, homeobox, kringle, etc Secondary structure Quaternary structure Similarities to other proteins Disease(s) associated with deficiencie(s) in the protein Sequence conflicts, variants, etc. In SWISS-PROT, annotation is mainly found in the comment lines (CC), in the feature table (FT) and in the keyword lines (KW). A controlled vocabulary is used whenever possible. This approach permits the easy retrieval of specific categories of data from the database. As much annotation as possible is included in SWISS-PROT. This information is obtained from publications that report new sequence data and from review articles that are used to periodically update the annotations of families or groups of proteins. External experts have also been recruited to send us their comments and updates concerning specific groups of proteins. Integration with other databases. It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections. SWISSPROT is currently cross-referenced with 28 different databases. Cross-references are provided in the form of pointers to information related to SWISS-PROT entries and found in data collections other than SWISS-PROT. Minimal redundancy. Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In SWISS-PROT, we try as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry. TrEMBL Ongoing genome sequencing projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences into SWISS-PROT without proper sequence analysis and annotation, we cannot accelerate the incorporation of new incoming data indefinitely. However as we also want to make sequences available as quickly as possible, we introduced TrEMBL (Translation of EMBL nucleotide sequence database) in early 1997. TrEMBL consists of computer-annotated entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL database, except for CDS already present in SWISS-PROT. The production of TrEMBL has emphasized the importance of linking not only to a whole EMBL nucleotide sequence entry but to linking within that entry at the CDS feature level. This linking has now been achieved by using the ‘PID’, the Protein Identification number found in the ‘/db_xref’ qualifier tagged to every CDS in the EMBL nucleotide sequence database. The DR lines of SWISS-PROT and TrEMBL entries pointing to an EMBL database entry are now citing the EMBL accession number as primary identifier and the PID as secondary identifier. In all cases where a ‘PID’ is already integrated into SWISS-PROT, a ‘db_xref’ qualifier citing the corresponding SWISS-PROT entry is added to the EMBL nucleotide sequence database CDS labeled with this ‘PID’. In the remaining cases, a ‘/db_xref’ qualifier is pointing to the corresponding TrEMBL entry. This approach helps us to point precisely from a given SWISSPROT entry to one of potentially many CDS in the corresponding EMBL entry and vice versa. We have split TrEMBL into two main sections: SP-TrEMBL and REM-TrEMBL: SP-TrEMBL(SWISS-PROT TrEMBL) contains the entries which should be incorporated into SWISS-PROT. SWISS-PROT accession numbers have been assigned to these entries. REM-TrEMBL(REMaining TrEMBL) contains the entries that we do not want to include in SWISS-PROT. This section is organized into five subsections: 1. Most REM-TrEMBL entries are immunoglobulins and T-cell receptors. We do not enter these sequences into SWISS-PROT since this highly redundant data would bias database-wide searches. We would like to create a specialized database dealing with these sequences as a further supplement to SWISS-PROT and keep only a representative cross-section of these proteins in SWISS-PROT. 2. Another category of data which will not be included in SWISS-PROT are synthetic sequences. Again we do not want to leave these entries in REM-TrEMBL. Ideally one should build a specialized database for artificial sequences as a further supplement to SWISS-PROT. 3. Fragments with less than eight amino acids. 4. Coding sequences obtained from patent applications. 5. The last subsection consists of CDS translations where we have strong evidence to believe that these CDS are not coding for real proteins. SWISS-PROT and TrEMBL are currently cross-referenced by 470,000 verified links to 28 other databases. Both the sequences and the annotation of SWISS-PROT and TrEMBL entries are constantly updated. The doubling time of the database is now less than 18 months. SWISS-PROT and TrEMBL represent the most complete and up-to-date protein sequence database with the lowest degree of redundancy and the highest standard of annotation publicly available today. In addition to the regular releases of TrEMBL, TrEMBLNEW is created weekly from EMBLNEW and is available under SRS at http://www.ebi.ac.uk and ftp.ebi.ac.uk These entries do not have SWISS-PROT accession numbers and the PID is used in the ID and the AC lines. Redundancy The stringent requirement of minimal redundancy as stated above for SWISSPROT applies equally to SP-TrEMBL but this has created a different set of challenges due to computer-generated nature of SP-TrEMBL. SP-TrEMBL is partially redundant against SWISSPROT since a small but significant percentage of these entries are actually additional reports of proteins already present in SWISS-PROT. We wish to identify and merge these sequences as quickly as possible with the corresponding SWISS-PROT entries to make SWISS-PROT and SP-TrEMBL completely non-redundant. There are different kinds of redundancy, which are commonplace in many sequence databases: -Different literature and sequence reports of a given protein sequence. -Mutations, polymorphism, variations in the sequence which are often given separate entries in the nucleotide sequence databases. These redundancies should not be present in SWISS-PROT or TrEMBL. It was necessary to find methods to manipulate the data from redundant source databases to meet our stringent standards of minimal redundancy. The objective was the development and application of methods to recognize and eliminate the redundancy already present in the databases, and to prevent further redundancy entering the database. Algorithms and Methods We have identified thousands of SPTrEMBL entries which exactly matched SWISS-PROT or SP-TrEMBL entries using the CRC32 checksum. These entries have been merged and this gave us an indication of the size of the redundancy problem. However it became clear that there were still tens of thousands of cases of potential (not easily detectable) redundancy which we also wished to eliminate. Exact matches of fragments (a SPTrEMBL entry is a fragment of a SWISS-PROT entry or vice-versa; or a SP-TrEMBL entry is a fragment of another SP-TrEMBL entry). SWISS-PROT and SP-TrEMBL protein entries from the same organism which should be identical but differ due to sequencing errors, variants, frameshifts etc. We established a pipeline of three main procedures to achieve the identification and removal of redundancy in SWISSPROT and TrEMBL. This has also been integrated into the TrEMBL production process to ensure such redundancy does not occur again. It is as follows: Cyclic Redundancy Check (CRC32) The Cyclic Redundancy Check (CRC) calculates a nearly unique and very compact checksum for each sequence and this allows fast and accurate detection of identical sequences. At every TrEMBL release, a CRC32 check is carried out to identify identical sequences in SP-TrEMBL and SWISSPROT. A curator then merges these entries manually. There is also a CRC32 check of TrEMBLNEW against SPTrEMBL and SWISS-PROT. The TrEMBLNEW entries which match SWISS-PROT entries are collated for annotation by curators. TrEMBLNEW entries which match SP-TrEMBL entries are merged into one entry automatically, with the following exceptions: Viral protein fragments Cross-species protein fragments MHC fragments Plasmodium merozites surface antigen fragments Outer membrane protein fragments Fusion protein fragments Homeobox or Homeodomain protein fragments LASSAP and TrEMBL redundancy The next step in reducing redundancy is to merge subfragments to longer length sequences. For this purpose, INRIA has provided LASSAP (Large Scale Sequence compArison Package) (Glemet and Codani 1997 and Codani and Glemet 1995). LASSAP is a programmable, high performance system designed to raise the current limitations of sequence comparison programs in order to fit the needs of large scale analysis. It provides an API (Application Programming Interface) allowing the integration of any generic pairwise-based algorithm. It allows the use of several sequence comparison methods: BLAST (Altschul et al 1990), FASTA (Pearson and Lipman 1988), dynamic programming with local (Smith and Waterman 1981) and global (Needleman and Wunsch 1970) similarity searches, string matching with (Landau and Vishkin 1986, Hunt and Szmanski (1977) or without errors (Boyer and Moore 1977) and pattern matching with (Baeza-Yates and Gonnet 1989, Wu and Manber 1991) or without errors. LASSAP is both an integrated software for end users and a framework allowing the integration and combination of new algorithms. LASSAP has been modified specifically to identify redundancy in SWISS-PROT and TrEMBL. String matching algorithms have been incorporated and subsequently modified to cope with SWISS-PROT’s specific annotation, such as the organism specification (OS lines of SWISS-PROT and TrEMBL entries). These modified algorithms can operate directly on a whole TrEMBL division, or on TrEMBL against SWISS-PROT. Initially this was used for the reduction of the redundancy already present in SPTrEMBL and SWISS-PROT. This work has now been completed. Currently, this step is an integral part of the TrEMBL production process at each release in order to check for such subfragment redundancy within TrEMBLNEW itself and then between TrEMBLNEW, SPTrEMBL and SWISS-PROT. MERGE Finally, a collection of shell scripts and C programs are used to process the matches identified by the LASSAP algorithms. This collection enables the automatic merging of the entries in a manner consistent with the required SWISS-PROT standards of annotation and accuracy. Multiple matches are within the scope of this procedure except when the longer sequences differ outside the subfragment match range. These cases are collated and set aside for curator analysis. In the process of merging, a master entry is chosen from the group of entries for merging. This master entry serves as a template copy to which new information from the corresponding grouped entries is added. For subfragment matching, the master sequence is the longest sequence and new citations, cross-references and alternative protein names from the subfragments are added to it. When the subfragment is an SP-TrEMBL entry, the ID, AC and DATE OF CREATION DT lines are incorporated in the master copy. Results The data set used for the development of the redundancy strategy was SWISSPROT (release 35), and EMBL (release 53) and as a result, TrEMBL (release 5) was produced. SWISS-PROT (release 35) consisted of 69,113 entries comprising 25,083768 amino acids abstracted from 59,101 references and had 250,000 links to 28 other databases. TrEMBL (release 5) was created from the EMBL Nucleotide Sequence Database release 53 which contained 290,000 CDS. TrEMBL Release 5 consisted of 166,361 entries comprising 45,671684 amino acids and had 220,000 links to 28 other databases. These statistics were influenced by the redundancy methods in a number of ways. Of the 290,000 CDS in EMBL, 100,000 were already present as sequence reports in SWISS-PROT and were excluded from the TrEMBL production process. The remaining 190,000 CDS were merged whenever possible using the new strategy to reduce redundancy and the final result was 166,361 entries. This removal of 30,000 entries clearly shows the value of the redundancy procedures that have been developed and implemented already. With the exponentially increasing dataflow from the genome projects, this approach will continue to prove useful. Another useful result was the highlighting of a number of areas where further improvement is possible. SPTrEMBL is partially redundant against SWISS-PROT since approximately 40,000 of these entries are actually additional sequence reports of proteins already in SWISS-PROT. It is believed that this redundancy is due once again to different literature and sequence reports of a given protein sequence and the mutations, polymorphism, variations in the sequence but in these cases, are nonexact matches. The software used by us is being modified currently to deal with this data. nucleotide sequence databases plus approximately 50,000 additional sequence reports. This clearly indicates that SWISS-PROT and TrEMBL represent the most complete and up-todate protein sequence database with the lowest degree of redundancy and the highest degree of annotation publicly available today. SWISS-PROT and TrEMBL is now widely used and will become the standard non-redundant protein sequence database. Substantial redundancy reduction has been already achieved and methods are now in place in the TrEMBL production process to ensure that future potential redundancy will be avoided. It has proved to be a worthwhile approach to redundancy removal. There is also plenty of scope for further improvement, particularly in the area of non-exact matches. The objective is to modify and develop LASSAP to deal with this kind of redundancy. This is particularly important with regards to the vast amount of genome sequencing data. A significant proportion of this data is already present in SWISS-PROT as a result of work by individual laboratories and it is an important requirement for us to merge this valuable biochemical data with the incoming genome sequencing data. Discussion References SWISS-PROT and TrEMBL contain the up-to-date sequences of all 325,000 coding sequence reports currently in the Altschul,S.F., Gish,W., Miller,W., Myers,E. and Lipman,D. (1990). Basic Acknowledgements This work was supported in part by grant numbers BIO4-CT97-2099 and BIO4-CT96-4837 of the European Commission. local alignment search tool. J. Mol. Biol. 215, 403–410. Baeza-Yates,R.A. and Gonnet,G.H. (1989) Efficient text searching of regular expressions. Proceedings ICALP 16,4661. Bairoch,A. and Apweiler,R. (1998) The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998 Nucleic Acids Res. 26,38-42. Boyer,R. and Moore,S. (1977). A fast string matching algorithm. C.A.C.M. 20,762-772. Codani,J.J. and Glemet,E. (1995) Parallelism in LASSAP, a large scale sequence comparison package. Lecture Notes in Computer Science. SpringerVerlag, 919, 787-792. Proceedings of HPCN95 conference, Milan. Glémet,E. and Codani,J.J. (1997). Lassap: a large scale sequence comparison package. Comp. Appl. BioSci. 13, 137–143. Hunt,J. and Szymanski,T.G. (1977) A fast algorithm for computing longest common subsequences. Theoretical Computer Science 20,350-353. Landau,G.M. and Vishkin,U. (1986) Efficient string matching with k mismatches. Theoretical Computer Science 43,239-249. Needleman,S.B. and Wunsch,C.D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453. Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85,2444-2448. Smith,T.F. and Waterman,M.S. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147, 195– 197. Stoesser,G., Sterk,P., Tuli,M.A., Stoehr,P.J. and Cameron,G.N. (1997) The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 25:7-13. Wu,S. and Manber,U. (1991) Fast text searching with errors. Technical Report TR 91-11, University of Arizona.