MODULE 2 Molecular Biology Databases

AIMS
To introduce the major nucleotide and protein databases
To describe the important features of these databases
To explain how to search appropriate databases
To explain how to retrieve information from databases

OBJECTIVES
The student should be able to:
Choose appropriate databases for information retrieval
Use Boolean operators to search databases
Retrieve nucleotide and protein sequence files from databases

Introduction
The number of databases available to molecular biologists runs to several hundred. These include the major DNA and protein sequence databases and a variety of derived databases. This topic, perhaps more than any other in this course, is plagued by acronyms relating to the databases themselves and the organizations which maintain them. Specific components of the larger databases have been organized into more targeted (e.g. organism-specific) databases. In addition, there are databases dealing with protein structure, metabolic pathways, taxonomy, enzyme classification, genetic diseases and a whole raft of other related topics. There are now databases of molecular biology databases (e.g. DBCAT). Most of these databases can be searched with nucleotide and protein sequences, but there are also text-based procedures for searching for information and retrieving it.

A short historical background
The first sequence database was in fact a protein sequence database. In the 1960s Margaret Dayhoff and her colleagues collected all the known amino acid sequences in the “Atlas of Protein Sequence and Structure”, which evolved over time into the Protein Information Resource (PIR). However, the advent of techniques for rapidly and accurately sequencing DNA, and the resulting accumulation of enormous amounts of sequence information, led inevitably to the use of computers for storing and analysing nucleotide sequence information. Thus, the new discipline of Bioinformatics came into being.
Two databases were established in the 1980s: the EMBL (European Molecular Biology Laboratory) Nucleotide Sequence Database in Europe and the NIH (National Institutes of Health) GenBank in the USA. Both collect and organise all published nucleotide sequence information, and the two databases also exchanged information between themselves. This collaboration was subsequently joined by the DNA Database of Japan (DDBJ). In 1988 these three databases formed the “International Collaboration of DNA Sequence Databases” and agreed rules on how records should be updated, together with a common format for the data within a record to facilitate exchange between the databases. Thus, each of these databases holds essentially the same information and can be used as the primary source for retrieving DNA sequence information.

Primary, Secondary and Composite Databases
The PRIMARY databases hold the experimentally determined nucleotide sequence information, together with the protein sequences inferred from the conceptual translation of these nucleotide sequences. The three major primary nucleotide sequence databases are those already mentioned:
GenBank - maintained by the National Center for Biotechnology Information (NCBI) of the National Institutes of Health (NIH)
EMBL Nucleotide Sequence Database - maintained by the European Bioinformatics Institute (EBI)
DNA Database of Japan (DDBJ)
They are comprehensive collections of all available nucleotide sequence information and essentially represent different points of access to the same information.

The primary protein sequence databases contain the amino acid sequences obtained from what are thought to be the coding sequences within the nucleotide sequences. This is not experimentally derived information: it results from interpretation of the nucleotide sequence and must therefore be treated as potentially containing errors of interpretation.
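The idea of conceptual translation can be illustrated with a minimal Python sketch. It assumes the standard genetic code, a DNA coding sequence that starts in frame 1, and translation that stops at the first stop codon. The function name conceptual_translation and the example sequence are hypothetical and are not taken from any database tool; real annotation pipelines also have to deal with introns, alternative genetic codes and uncertain reading frames, which is exactly why translated entries can contain misinterpretations.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def conceptual_translation(cds: str) -> str:
    """Translate a DNA coding sequence (frame 1) into one-letter amino acids.

    Translation stops at the first stop codon. An incomplete trailing codon
    is ignored, and codons containing anything other than A, C, G or T are
    written as 'X'.
    """
    cds = cds.upper()
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE.get(cds[pos:pos + 3], "X")
        if residue == "*":  # stop codon reached
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    # Hypothetical 12-base coding sequence: ATG GCC AAA TAA
    print(conceptual_translation("ATGGCCAAATAA"))  # prints MAK
```

Changing a single base in the input, or starting one base later, gives a different protein, which shows how sensitive an inferred protein sequence is to the interpretation of the underlying nucleotide record.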
There are a number of primary protein sequence databases and each requires some specific consideration.

GenPept
GenPept is a database which holds the conceptual translations of all the coding sequences (CDS) of the GenBank nucleotide sequences. Have a look at a typical record.

PIR
PIR is a division of the National Biomedical Research Foundation (NBRF) and, together with its collaborators, the Munich Information Centre for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), maintains PIR International. The core of PIR International is the PIR-International Protein Sequence Database, which is an annotated, non-redundant and cross-referenced database of protein sequences. PIR International also contains secondary and structure databases. The PIR database contains four sections, PIR1-4, which differ in data quality and annotation, with PIR1 containing the most reliable and fully annotated information. Have a look at a typical record.

SWISS-PROT
SWISS-PROT is an annotated protein sequence database established in 1986 and maintained collaboratively by the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI). The database provides a very high quality of annotation. Have a look at the SWISS-PROT record for E. coli FtsH.

TrEMBL
TrEMBL is a computer-annotated supplement to SWISS-PROT. TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database which are not yet integrated into SWISS-PROT. TrEMBL is split into two main sections: SP-TrEMBL and REM-TrEMBL. SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries that should eventually be incorporated into SWISS-PROT and can be considered a preliminary section of SWISS-PROT, as all SP-TrEMBL entries have been assigned SWISS-PROT accession numbers. REM-TrEMBL (REMaining TrEMBL) contains the entries that will not be included in SWISS-PROT. REM-TrEMBL entries have no accession numbers.

Choice of protein sequence database
The choice of which of these databases to use is determined by the information you want to obtain. PIR1-4 is arguably the most comprehensive database, but SWISS-PROT provides the highest quality of annotation.

Composite Protein Sequence Databases
Composite databases aim to amalgamate the information held in two or more of the primary databases. This means that you can search one composite database rather than run multiple searches on individual primary databases, e.g.
OWL - a composite of SWISS-PROT, PIR1-4, GenPept and NRL-3D (see below)
NRDB (Non-Redundant DataBase) - a composite of SWISS-PROT+TrEMBL

Secondary Databases
The secondary databases are so termed because they contain the results of analysis of the sequences held in primary databases. Many secondary protein databases are the result of looking for features that relate different proteins.
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify which known protein family (if any) a new sequence belongs to (see Module 5).
Pfam is a collection of multiple alignments and profile hidden Markov models of protein domain families, which is based on proteins from both SWISS-PROT and SP-TrEMBL (see Module 5).

Structure Databases
There are databases such as PDB (Protein Data Bank) that hold the atomic coordinate data for proteins whose structure has been determined by X-ray crystallography and/or NMR. In addition, the MMDB (Molecular Modeling Database) at NCBI is a compilation of the PDB entries as ASN.1 files. Other databases such as SCOP (Structural Classification of Proteins) and CATH (Class, Architecture, Topology, Homology) hold information on the structural relationships of proteins and their structural domains.

Text-based searching and retrieval
There are three major integrated tools for searching for and retrieving molecular biological information:
Entrez is the system developed at the National Center for Biotechnology Information
SRS is the tool developed at the European Bioinformatics Institute
DBGET was developed as part of Japan’s GenomeNet.
The advantage of these tools is that they not only retrieve specific nucleotide or protein sequences, but also provide links to additional related information. Entrez, SRS and DBGET differ slightly in the databases that they search and in the links they provide to related information. Ideally one should be familiar with all three systems, but we focus on the one that is easiest to use, Entrez, despite the fact that SRS and DBGET offer broader information. The more adventurous might want to look at SRS and DBGET to see what they offer.

ENTREZ
Entrez provides an integrated tool to access the NCBI databases. These databases include nucleotide sequences, protein sequences, macromolecular structures, whole genomes and genetic maps, population data sets, taxonomy, and MEDLINE (a literature database). Extensive documentation on Entrez is available at the NCBI, together with short tutorials on using different aspects of Entrez. Although a brief description of Entrez and how to use it is provided here, it is very strongly recommended that you look at the NCBI source information.

The Entrez screen shows the available databases on a black menu bar.
PUBMED - provides access to bibliographic information, which is drawn primarily from MEDLINE, PreMEDLINE and HealthSTAR, as well as publisher-supplied citations.
NUCLEOTIDE DATABASE - contains sequence data from GenBank, EMBL and DDBJ, as well as from the Genome Sequence Database and the US Patent and Trademark Office. It includes STSs and ESTs.
PROTEIN DATABASE - contains sequence data from the translated coding regions of DNA sequences in GenBank, EMBL and DDBJ, as well as protein sequences submitted to PIR, SWISS-PROT, PRF and PDB.
GENOME DATABASE - provides integrated views for a variety of genomes.
STRUCTURE DATABASE - (alias the Molecular Modeling Database) contains experimental data from crystallographic and NMR structure determinations.
POPSET DATABASE - contains aligned sequences submitted as a set resulting from a population, phylogenetic or mutation study.

A key feature of Entrez is that individual records within each database are linked not only to related records within that database, but also to related records in the other databases. These links to related records are termed ‘neighbours’. Thus a retrieved nucleotide sequence would come with links to nucleotide sequence neighbours, protein neighbours, literature references etc.

Searching the databases
The databases can be searched in a variety of ways:
Subject searching - a word or phrase is used to search the database. Phrases must be surrounded by double quotation marks (""). These searches can be made more sophisticated by the use of Boolean Operators. Wildcards can be used at the end of partial words to broaden the search (e.g. cyanobact* would find records that contain the words cyanobacteria, cyanobacterium and cyanobacterial).
Searching for unique identifiers - DNA sequences and proteins are given unique accession numbers when they are entered into the databases. These accession numbers can be entered directly into the search engine.
Searching by author - author names take the form last name plus initial(s), e.g. mann nh

There are other refinements which can be made to searching, such as combining sets and using limits. You should look at the Entrez documentation to familiarize yourself with these techniques.
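The search styles described above can be sketched as small Python helpers that assemble a search string before it is typed or pasted into the Entrez search box. This is only an illustration: the helper names (phrase, author_term, combine, matches_wildcard) are hypothetical, nothing here contacts NCBI, and the wildcard check uses Python's own fnmatch module to mimic what a trailing * is described as doing, not Entrez's actual matching code.

```python
import fnmatch


def phrase(text: str) -> str:
    """Wrap a multi-word phrase in double quotation marks."""
    return '"' + text + '"'


def author_term(last_name: str, initials: str) -> str:
    """Format an author as last name plus initial(s), e.g. 'mann nh'."""
    return f"{last_name} {initials}".lower()


def combine(operator: str, *terms: str) -> str:
    """Join search terms with an upper-case Boolean operator, in parentheses."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT")
    return "(" + f" {operator} ".join(terms) + ")"


def matches_wildcard(word: str, pattern: str) -> bool:
    """Local illustration of which words a trailing-* pattern would cover."""
    return fnmatch.fnmatchcase(word.lower(), pattern.lower())


if __name__ == "__main__":
    query = combine("AND", "cyanobact*", phrase("heat shock"))
    print(query)  # (cyanobact* AND "heat shock")
    for word in ("cyanobacteria", "cyanobacterium", "cyanobacterial", "bacteria"):
        print(word, matches_wildcard(word, "cyanobact*"))
```

Building the string in small pieces makes it easier to see where quotation marks and parentheses belong before the search is run.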

Boolean Operators
Boolean Operators are named after George Boole, an Englishman who invented them as part of a system of logic in the mid-1800s. They tell search engines which keywords you want your results to include or exclude. The Entrez search engine only supports the operators AND, OR and NOT.
AND will locate all records containing both words, e.g. human AND protease
OR will locate all records containing either word, not necessarily both, e.g. human OR protease
NOT will locate records containing one word but not the other, e.g. human NOT protease
Boolean operators in Entrez must be written in upper case and are normally processed left to right. If you wish part of your Boolean expression to be processed out of order, enclose it in parentheses, e.g. human AND protease NOT (IgG OR serine). Phrases can be used with Boolean Operators as long as they are enclosed in double quotation marks.

The Search Results
Typically a search may produce several hits. The default is to present these hits in the Summary display, but there are alternatives in the dropdown display menu. To look at a particular record in more detail you can either click the tick box and then select the appropriate dropdown display option, or click the accession number, which will take you to the record in the default GenBank or GenPept format. These records too can be displayed in alternative formats and the various ‘neighbours’ examined.

Different views
When you retrieve a nucleotide or protein sequence you can look at it in different views. The default view is the GenBank format.

Exercises
Entrez provides extensive online training and help, and most of the following exercises rely on these facilities.
1. Read through the key features of the Entrez help document (you should read in sufficient depth to enable you to complete exercise 3).
2. Work through the Entrez interactive tutorial on searching the nucleotide database.
3. Devise a Boolean search term that will retrieve information on the human YME1 homologue from the NCBI Entrez nucleotide database. Does it work?
4. Look at the GenBank entry for human YME1 and then change the display so that it is in the Fasta format. Copy the Fasta file to the clipboard.

References and Useful Links
Introduction to Molecular Biology Databases - a very useful guide at the EBI.
Survey of Molecular Biology Databases and Servers - provides a brief description of all the major databases.
Molecular Biology Databases on the Internet II - a Biotechniques article from 1997.
Retrieval of information from molecular biology databases - a quiz on information retrieval from databases.
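
As a follow-up to exercise 4, a Fasta record copied from Entrez can be read into a program with a few lines of Python. The sketch below assumes the usual Fasta layout: a header line beginning with ">" followed by one or more lines of sequence. The function name parse_fasta and the example record are hypothetical; the record is not the real YME1 entry.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from Fasta-formatted text.

    The header is the text after '>' on the definition line. Sequence lines
    are joined into one string. Lines before the first header are ignored.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Hypothetical record standing in for text pasted from the clipboard.
    pasted = ">example_id hypothetical record\nATGGCC\nAAATAA\n"
    for header, sequence in parse_fasta(pasted):
        print(header, len(sequence))
```

Keeping the header and the sequence separate makes it straightforward to check the sequence length or pass the sequence on to other tools.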