MODULE 2

Molecular Biology Databases
AIMS

To introduce the major nucleotide and protein databases

To describe the important features of these databases

To explain how to search appropriate databases

To explain how to retrieve information from databases
OBJECTIVES
The student should be able to:

Choose appropriate databases for information retrieval

Use Boolean operators to search databases

Retrieve nucleotide and protein sequence files from databases
Introduction
The number of databases available to molecular biologists runs to several hundred. These
include the major DNA and protein sequence databases and a variety of derived databases.
This topic, perhaps more than any other in this course, is plagued by acronyms relating to the
databases themselves and the organizations which maintain them. Specific components of the
larger databases have been organized into more targeted (e.g. organism-specific) databases. In
addition, there are databases dealing with protein structure, metabolic pathways, taxonomy,
enzyme classification, genetic diseases and a whole raft of other related topics. There are now
databases of molecular biology databases (e.g. DBCAT). Most of these databases can be
searched with nucleotide and protein sequences, but there are also text-based procedures for
searching for information and retrieving it.
A short historical background
The first sequence database was in fact a protein sequence database. In the 1960s Margaret
Dayhoff and her colleagues collected all the known amino acid sequences in “The Atlas of
Protein Sequences and Structures”, which itself evolved over time into the Protein
Identification Resource (PIR). However, the advent of techniques for rapidly and accurately
sequencing DNA with the concomitant accumulation of enormous amounts of sequence
information led inevitably to the involvement of computers in the analysis of nucleotide
sequence information and its storage. Thus, the new discipline of Bioinformatics came into
being. Two databases were established in the 1980s: the EMBL (European Molecular Biology
Laboratory) Nucleotide Sequence Database (in Europe) and the NIH's (National Institutes of
Health) GenBank (in the USA), which collect and organise all published nucleotide sequence
information. The two databases also exchanged information with each other. This
collaboration was subsequently joined by the DNA Database of Japan (DDBJ). In 1988 these
three databases constituted the “International Collaboration of DNA Sequence Databases”,
agreeing rules on how records should be updated and adopting a common format for
data within a given record to facilitate the exchange of data between the databases. Thus, each
of these databases essentially holds the same information and can be used as the primary
source for retrieving DNA sequence information.
Primary, Secondary and Composite Databases
The PRIMARY databases hold the experimentally determined nucleotide sequence
information, together with the protein sequences inferred from the conceptual translation of
these nucleotide sequences. The three major primary nucleotide sequence databases are those
already mentioned:
GenBank – maintained by the National Center for Biotechnology Information (NCBI) of the
National Institutes of Health (NIH)
EMBL Nucleotide Sequence Database – maintained by the European Bioinformatics Institute
(EBI)
DNA Database of Japan (DDBJ)
They are comprehensive collections of all available nucleotide sequence information and
essentially just represent different points of access to the same information.
The primary protein sequence databases contain the amino acid sequences obtained from
what are thought to be the coding sequences within the nucleotide sequences. This of course
is not experimentally derived information, but has arisen as a result of interpretation of the
nucleotide sequence information and consequently must be treated as potentially containing
misinterpreted information. There are a number of primary protein sequence databases and each
requires some specific consideration.
GenPept
GenPept is a database which holds the conceptual translations of all the coding sequences
(CDS) of the GenBank nucleotide sequences. Have a look at a typical record.
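As an illustration of what "conceptual translation" means, here is a minimal Python sketch that translates a coding sequence using the standard genetic code. It is not how GenPept itself is produced; the function name translate_cds and the example sequence are invented for this sketch, and it assumes the CDS is given 5' to 3', in frame, and uses the standard code (some organisms and organelles use variant codes).

```python
from itertools import product

# Standard genetic code, built compactly: codons are enumerated with each
# position cycling through T, C, A, G, and AMINO_ACIDS lists the matching
# one-letter residues in the same order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame coding sequence up to (not including) the first stop."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE.get(cds[i:i + 3], "X")  # 'X' for ambiguous codons
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


# Hypothetical 12-base CDS (not a real database entry): ATG GCC AAA TAA
print(translate_cds("ATGGCCAAATAA"))  # prints MAK
```

The sketch also shows why translated entries need caution: the result depends entirely on the assumed start position and reading frame, which are themselves interpretations of the nucleotide sequence.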
PIR
PIR is a division of the National Biomedical Research Foundation (NBRF) which, together with
its collaborators, the Munich Information Centre for Protein Sequences (MIPS) and the Japan
International Protein Information Database (JIPID), maintains PIR International. The core of
PIR International is the PIR-International Protein Sequence Database, which is an annotated,
non-redundant and cross-referenced database of protein sequences. PIR International also
contains secondary and structure databases. The PIR database contains four sections, PIR1-4,
which differ in terms of data quality and annotation, with PIR1 containing the most reliable
and fully annotated information. Have a look at a typical record.
SWISS-PROT
SWISS-PROT is an annotated protein sequence database established in 1986 and maintained
collaboratively by the Swiss Institute of Bioinformatics (SIB) and the European
Bioinformatics Institute (EBI). The database provides a very high quality of annotation. Have
a look at the Swiss-Prot record for E. coli FtsH.
TrEMBL
TrEMBL is a computer-annotated supplement to SWISS-PROT. TrEMBL contains the
translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence
Database, which are not yet integrated into SWISS-PROT. TrEMBL is split into two main
sections: SP-TrEMBL and REM-TrEMBL. SP-TrEMBL (SWISS-PROT TrEMBL) contains
the entries that should eventually be incorporated into SWISS-PROT and can be considered
as a preliminary section of SWISS-PROT as all SP-TrEMBL entries have been assigned
SWISS-PROT accession numbers. REM-TrEMBL (REMaining TrEMBL) contains the
entries that won’t be included in SWISS-PROT. REM-TrEMBL entries have no accession
numbers.
Choice of protein sequence database
The choice of which of these databases to use is determined by what information you want to
get. PIR1-4 is arguably the most comprehensive database, but SWISS-PROT provides the
highest quality of annotation.
Composite Protein Sequence Databases
Composite databases aim to amalgamate the information held in two or more of the primary
databases. This means that you can search one composite database rather than do multiple
searches on individual primary databases e.g.
OWL – is a composite of SWISS-PROT, PIR1-4, GenPept and NRL-3D (see below)
NRDB (Non-Redundant DataBase) – is a composite of SWISS-PROT+TrEMBL
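The idea of amalgamating sources into one non-redundant collection can be sketched in a few lines of Python. This is only a toy illustration: the function build_composite, the identifiers and the sequences are all invented, and it treats two entries as redundant only when their sequences are identical, whereas real composite databases apply their own redundancy criteria and source priorities.

```python
def build_composite(*sources):
    """Merge (database, identifier, sequence) records, one entry per sequence.

    Sources listed first take priority: their identifier becomes the primary
    one, and identifiers from later sources are kept as cross-references.
    """
    composite = {}
    for records in sources:
        for database, identifier, sequence in records:
            label = f"{database}:{identifier}"
            entry = composite.setdefault(
                sequence.upper(), {"primary": label, "xrefs": []}
            )
            if label != entry["primary"]:
                entry["xrefs"].append(label)
    return composite


# Invented records for illustration only (not real accession numbers).
swissprot = [("SWISS-PROT", "X00001", "MKTAYIAK"), ("SWISS-PROT", "X00002", "MSLLTEVE")]
pir = [("PIR", "Y00001", "MKTAYIAK"), ("PIR", "Y00002", "MGDVEKGK")]

composite = build_composite(swissprot, pir)
for sequence, entry in composite.items():
    print(sequence, entry["primary"], entry["xrefs"])
```

Four input records collapse to three composite entries, because the sequence shared by both sources is stored once with a cross-reference to the second source.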
Secondary Databases
The secondary databases are so termed because they contain the results of analysis of the
sequences held in primary databases. Many secondary protein databases are the result of
looking for features that relate different proteins.
PROSITE is a database of protein families and domains. It consists of biologically significant
sites, patterns and profiles that help to reliably identify to which known protein family (if any)
a new sequence belongs. (see Module 5)
Pfam is a collection of multiple alignments and profile hidden Markov models of protein
domain families, which is based on proteins from both SWISS-PROT and SP-TrEMBL. (see
Module 5).
Structure Databases
There are databases such as PDB (Protein Data Bank) that hold the atomic coordinate data for
proteins whose structure has been determined by X-ray crystallography and/or NMR. In
addition, the MMDB (Molecular Modeling Database) at NCBI is a compilation of the PDB
entries as ASN.1 files. Other databases such as SCOP (Structural Classification of Proteins)
and CATH (Class, Architecture, Topology, Homology) hold information on the structural
relationship of proteins and their structural domains.
Text-based searching and retrieval
There are three major integrated tools for searching for and retrieving molecular biological
information:
Entrez is the system developed at the National Center for Biotechnology Information.
SRS is the tool developed at the European Bioinformatics Institute.
DBGET was developed as part of Japan’s GenomeNet.
The advantage of these tools is that they are not only capable of retrieving specific
nucleotide or protein sequences, but also provide links to additional related
information. ENTREZ, SRS and DBGET differ slightly in the databases that they
search and in the links they provide to related information. Ideally one should be
familiar with all three systems, but we focus on the one that is easiest to use,
ENTREZ, despite the fact that SRS and DBGET offer broader information. The more
adventurous might want to look at SRS and DBGET to see what they offer.
ENTREZ
Entrez provides an integrated tool to access the NCBI databases. These databases include
nucleotide sequences, protein sequences, macromolecular structures, whole genomes and
genetic maps, population data sets, taxonomy, and MEDLINE (a literature database).
Extensive documentation on Entrez is available at the NCBI together with short tutorials on
using different aspects of Entrez. Although a brief description of Entrez and how to use it is
provided here, it is very strongly recommended that you look at the NCBI source information.
The Entrez screen shows the available databases on a black menu bar.
PUBMED - provides access to bibliographic information, which is drawn primarily from
MEDLINE, PreMEDLINE, HealthSTAR, as well as Publisher-Supplied citations.
NUCLEOTIDE DATABASE – contains sequence data from GenBank, EMBL, DDBJ as well
as from the Genome Sequence Database and the US Patent and Trademark Office. It includes
STSs and ESTs.
PROTEIN DATABASE – contains sequence data from the translated coding regions of DNA
sequences in GenBank, EMBL, and DDBJ as well as protein sequences submitted to PIR,
SWISSPROT, PRF and PDB.
GENOME DATABASE – provides integrated views for a variety of genomes.
STRUCTURE DATABASE – (alias the Molecular Modeling Database) contains
experimental data from crystallographic and NMR structure determinations.
POPSET DATABASE - contains aligned sequences submitted as a set resulting from a
population, a phylogenetic, or mutation study.
A key feature of Entrez is that individual records within each database are not only linked to
related records within that database, but also to related records within the other databases. These
links to related files are termed ‘neighbours’. Thus a retrieved nucleotide sequence would
come with links to nucleotide sequence neighbours, protein neighbours, literature references
etc.
Searching the databases
The databases can be searched in a variety of ways:
Subject searching – a word or phrase is used to search the database. Where phrases are used
they must be surrounded by double quotation marks “”. These searches can be made more
sophisticated by the use of Boolean Operators. Wildcards can be used at the end of partial
words to broaden the search (e.g. cyanobact* would find records that contained the words
cyanobacteria, cyanobacterium and cyanobacterial).
Searching for unique identifiers – DNA sequences and proteins are given unique accession
numbers when they are entered into the databases. These accession numbers can be directly
entered into the search engine.
Searching by author – author names take the form last name plus initial(s) e.g. mann nh
There are other refinements which can be made to searching such as combining sets, using
limits etc. You should look at the Entrez documentation to familiarize yourself with these
techniques.
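The search conventions above (quoted phrases, trailing wildcards, Boolean combination) can be sketched as small Python helpers that assemble a query string. The helper names phrase, wildcard and esearch_url are invented for this sketch. The URL shown is NCBI's E-utilities ESearch endpoint, which is not covered in this module; treat the choice of db value as an assumption to check against the NCBI documentation. The code only builds the URL and does not send any request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# NCBI E-utilities ESearch endpoint (an addition to this module's material).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def phrase(text: str) -> str:
    """Wrap a multi-word phrase in straight double quotation marks."""
    return f'"{text}"'


def wildcard(stem: str) -> str:
    """Append a trailing * so that any word ending is matched."""
    return f"{stem}*"


def esearch_url(term: str, db: str = "nucleotide") -> str:
    """Build (but do not fetch) an ESearch URL; the db value is an assumption."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term})}"


# Subject search combining a phrase and a wildcard with a Boolean operator.
term = f'{phrase("heat shock")} AND {wildcard("cyanobact")}'
url = esearch_url(term)

print(term)
print(url)
# Decoding the URL recovers the original search term unchanged.
print(parse_qs(urlsplit(url).query)["term"][0] == term)
```

Building the term separately from the URL keeps the search logic readable, and urlencode takes care of escaping spaces, quotation marks and the asterisk.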
Boolean Operators
Boolean Operators are named after George Boole, an Englishman, who
invented them as part of a system of logic in the mid-1800s. They tell search engines which
keywords you want your results to include or exclude. The Entrez search engine only supports
the operators AND, OR and NOT.
AND will locate all records containing both words e.g. human AND protease
OR will locate all records containing either word, not necessarily both e.g. human OR
protease
NOT will locate records containing one word, but NOT the other word e.g. human NOT
protease
Boolean operators in Entrez must be written in upper case and are normally processed left
to right. If you wish part of your Boolean expression to be processed out of order, enclose it
in parentheses e.g. human AND protease NOT (IgG OR serine). Phrases can be used with
Boolean Operators as long as they are enclosed in double quotation marks.
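The effect of the three operators, and of parentheses, can be made concrete by treating each search word as the set of records that contain it. The following Python sketch uses an invented toy index (the record numbers mean nothing) and a hypothetical evaluate function. It illustrates the logic only and is not how the Entrez search engine is implemented.

```python
# Toy index: search word -> set of record numbers containing it (invented data).
INDEX = {
    "human": {1, 2, 3, 4},
    "protease": {2, 3, 4, 5},
    "IgG": {3},
    "serine": {4, 6},
}


def evaluate(expression, index=INDEX):
    """Evaluate [term, OPERATOR, term, ...] strictly left to right.

    A nested list plays the role of a parenthesised sub-expression and is
    evaluated before it is combined with the running result.
    """
    def records(item):
        if isinstance(item, list):
            return evaluate(item, index)
        return index.get(item, set())

    result = records(expression[0])
    for operator, operand in zip(expression[1::2], expression[2::2]):
        other = records(operand)
        if operator == "AND":
            result = result & other   # records containing both
        elif operator == "OR":
            result = result | other   # records containing either
        elif operator == "NOT":
            result = result - other   # records containing the first only
        else:
            raise ValueError(f"unsupported operator: {operator!r}")
    return result


print(sorted(evaluate(["human", "AND", "protease"])))  # [2, 3, 4]
print(sorted(evaluate(["human", "OR", "protease"])))   # [1, 2, 3, 4, 5]
print(sorted(evaluate(["human", "NOT", "protease"])))  # [1]

# Left to right: ((human AND protease) NOT IgG) OR serine
print(sorted(evaluate(["human", "AND", "protease", "NOT", "IgG", "OR", "serine"])))    # [2, 4, 6]
# Grouping the last two terms: (human AND protease) NOT (IgG OR serine)
print(sorted(evaluate(["human", "AND", "protease", "NOT", ["IgG", "OR", "serine"]])))  # [2]
```

The last two lines use the same words and operators but return different records, which is why grouping with parentheses matters once a query has more than two terms.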
The Search Results
Typically a search may produce several hits. The default is to present these hits in the
Summary display, but there are alternatives presented by the dropdown display menu.
To look at a particular record in more detail you can either click the tick box and then select
the appropriate dropdown display option, or you can click the accession number which will
take you to the record in the default GenBank or GenPept format. These too can be displayed
in alternative formats and the various ‘neighbours’ examined.
Different views
When you retrieve a nucleotide or protein sequence you can look at it in different views. The
default view is the GenBank format.
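One of the alternative views is the Fasta format, which is used in the exercises. A Fasta record is a header line beginning with ">" followed by one or more lines of sequence. The Python sketch below reads such text into (header, sequence) pairs; the function name parse_fasta and the two example records are invented for illustration and are not real database entries.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented example text (hypothetical records, not retrieved from any database).
EXAMPLE = """>example_1 hypothetical record
ATGGCCAAA
TTTGGGTAA
>example_2 second hypothetical record
ATGCCC
"""

for header, sequence in parse_fasta(EXAMPLE):
    print(header, len(sequence), sequence)
```

Because the format carries almost no annotation, it is convenient for passing a sequence to other programs, whereas the GenBank view is the one to read for features and references.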
Exercises
Entrez provides extensive online training and help and most of the following exercises rely on
these facilities.
1. Read through the key features of the Entrez help document (you should read in sufficient
depth to enable you to complete exercise 3).
2. Work through the Entrez interactive tutorial on searching the nucleotide database.
3. Devise a Boolean search term that will retrieve information on the human YME1
homologue from the NCBI Entrez nucleotide database – does it work?
4. Look at the GenBank entry for human YME1 and then change the display so that it is in
the Fasta format. Copy the Fasta file to the clipboard.
References and Useful Links
Introduction to Molecular Biology Databases – a very useful guide at the EBI.
Survey of Molecular Biology Databases and Servers – provides a brief description of all the
major databases.
Molecular Biology Databases on the Internet II – A Biotechniques article from 1997.
Retrieval of information from molecular biology databases – a quiz on information retrieval
from databases.