Protein sequence databases

Introduction to Database Modeling in Bioinformatics
Beate Marx
Database Administrator, EMBL Outstation – The European Bioinformatics Institute. Wellcome
Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
Telephone: +44-1223-494419
Fax: +44-1223-494468
E-mail: Marx@ebi.ac.uk
Introduction
Over the last couple of years a large number of biological databases have become accessible
to the biological community. Universities, scientific institutes, and commercial vendors offer
access to all kinds of biological information systems via the Internet. These information
systems differ with respect to contents, reliability and ease of use.
They also differ with respect to the size of the databases, which can range from a few
hundred megabytes to tens or even hundreds of gigabytes, and the underlying technology
that is used to maintain these databases.
In the sixties, when the first medical databases became publicly available, database
technology was still in its early beginnings and the state of the art was indexed flat files. A few
vendors offered database management systems (DBMS) that were based on the Network
Model or the Hierarchical Data Model. There were no common standards - the first attempt to
standardise database technology was the CODASYL (Conference on Data Systems
Languages) DBTG (Data Base Task Group) proposal that was published in the early
seventies.
With the Relational Data Model developed by Ted Codd and published in a paper in 1970,
and the Entity-Relationship Model published by Chen in 1976, a breakthrough took place.
From the early seventies on relational database theory became very popular within the
computer science community, several large conferences dedicated to database theory were
established during these years and quite a number of research projects on relational
database systems were started.
It was not until the late seventies to early eighties that the first commercial relational
database management systems (RDBMS) became available. Query languages based on
relational algebra were introduced by several research groups, of which SQL (Sequel) is now
the most widespread one.
Database management systems (with reduced features) became available for PCs.
During the eighties database technology advanced in various ways: on a more technical side,
there was a shift from mainframes to client-server based systems and distributed DBMS. On
the conceptual side, research projects emerged to fill the gap between the relational database model and its query languages on the one hand and modern object-oriented programming languages on the other.
The ER-model was extended to capture concepts like complex objects and inheritance and
database capabilities were extended into several directions: deductive databases were
introduced as well as temporal or active databases; first steps to incorporate complex objects
into the relational model were made (resulting in NF2 databases) and DBMS features for
spatial, temporal, and multimedia data were added.
In the nineties two major trends can be identified for commercial DBMS reflecting research
topics of the late eighties and early nineties: relational database systems have become the
most commonly used systems. They are based on a sound mathematical theory, they are
well understood and considerable progress has been made with respect to physical storage
structures, query optimisation, transaction processing and multi-user concurrency control,
backup and recovery features and database security. Some vendors offer so-called object-oriented relational systems, which are relational systems with some object-oriented features, like nested relations and abstract object types implemented on top.
Most of the currently available object-oriented database management systems (ODBMS) - the other trend - come from a completely different background: they can be best described as
object-oriented programming languages (C++ in most cases) with persistence added.
ODBMS lack a formal model and until quite recently no common standard existed; some
systems still don’t offer ad-hoc query capabilities and all kinds of data processing techniques
are proprietary to the respective database system.
Although commercial database management systems have been around for more than 20 years now, publicly accessible biological data is still likely to be found in flat-file organised
databases (some with a highly sophisticated indexing system, like SRS). Yet, over the last ten
years there has been a trend to at least maintain the data inside (most often relational)
database systems thus taking advantage of an advanced technology that offers ad-hoc
declarative query capabilities combined with automatic query optimisation and consistency
checking in a way that is almost impossible to implement on flat files.
Ongoing implementation efforts concentrate on providing forms-based or, more recently,
CORBA based interfaces to biological databases. Direct access and unrestricted queries are
usually prohibited for the simple reason that this would impose an unpredictable load on the
database servers, which would be very difficult to handle.
Current database technology offers some concepts that may prove useful in overcoming the
related problems. High availability is one of the buzzwords, meaning that database servers
are implemented on hardware clusters. In a cluster, if one machine becomes unavailable (e.g.
due to a crash), its processes are transparently failed over to the other machine(s), where
they will continue to run. Ideally, users will not lose connections or become aware of the hardware problem in any other way. Replication is another very useful concept. It allows mirroring the contents of a database (or part of it) into some other database server. (Read-only) access to this mirror can be granted to the public; even a high load on the mirror will not affect the performance of the master database.
Features of database systems
Before discussing the features commonly associated with a database system, we will give a
short introduction into the vocabulary used throughout this paper. A database is a logically
coherent collection of related data with some inherent meaning (or simply a collection of
data). A database management system (DBMS) is a collection of programs that enables
users to create and maintain a database, i.e., it’s the (general purpose) software that
facilitates the processes of defining, constructing, and manipulating databases for various
applications. The database and the software together will be called a database system.
Current database systems can be characterised by certain concepts the underlying software
is commonly supposed to support:

- A DBMS has the capability to represent a variety of complex relationships among data and to define and enforce integrity constraints for these data.
- Multiple users can use the system simultaneously without affecting each other. A transaction-processing component applies concurrency control techniques that ensure non-interference of concurrently executing transactions. (A transaction is a sequence of user-defined operations enclosed in a ‘begin transaction’ and ‘commit’ that must behave like a single operation on the database; a minimal SQL sketch follows this list.)
- Physical data storage is transparent to the user.
- Users express queries (or data manipulation/data definition requests) in a high-level declarative query language (or DML/DDL) such as SQL. Based on the data dictionary that is maintained by the DBMS, an optimal (or almost optimal) execution strategy for retrieving the result (or implementing the changes) of those queries (manipulations/definitions) is devised by the DBMS.
- A backup and recovery subsystem provides the means for recovering from hardware or software failures.
- An authorisation component protects the database against persons who are not authorised to access part of the database or the whole database.
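To illustrate the transaction concept mentioned in the list above, here is a minimal SQL sketch using the InterPro relation ENTRY2METHOD described later in this paper; the accession values are made up, and the exact statement for starting a transaction differs between DBMS (some start one implicitly):

BEGIN;
-- move a (hypothetical) method from one InterPro entry to another;
-- both changes succeed or fail together
DELETE FROM entry2method WHERE entry_ac = 'IPR000001' AND method_ac = 'PF00001';
INSERT INTO entry2method (entry_ac, method_ac) VALUES ('IPR000002', 'PF00001');
COMMIT;
-- ROLLBACK instead of COMMIT would undo both changes

Other users never see the intermediate state in which the method belongs to no entry.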
When to use (or not to use) commercial DBMS
As mentioned before, biological databases can be very large and the data itself can be very
complex in structure. DBMS offer almost 30 years of experience in providing efficient ways to
store and index large amounts of data while at the same time hiding storage details from the
user. Users will instead be provided with a conceptual representation of the data that does not
include many of the details of how the data is stored and a high level query and manipulation
language. Abstraction from storage specific details also facilitates data integrity checks at a
high level that cannot be achieved for data held in file-based systems.
DBMS allow multiple users to access the database at the same time, both for querying and updating the data. Data inside a database system is always up-to-date; there is no delay for reorganising file structures or rebuilding indexes, as might occur in file-based systems whenever updates take place.
Experience shows that the quality of data improves if it is maintained inside a database
system provided that the database is designed following certain rules and (automatic) integrity
checking is applied wherever it makes sense. Database theory provides a measure for the
quality of the database design; integrity constraints are part of the relational data model.
In spite of the advantages database systems provide over file-based systems, one has to be aware of overhead costs that are introduced when using DBMS software:

- There will be a high initial investment in hardware and software.
- A DBMS is a complex software product that needs specially trained system administrators to handle it.
One has also to be aware of problems that may arise if a database is poorly designed or if the
DBMS is not administrated by experienced personnel. Because of the overhead costs and the
potential problems of improper design or system administration, it may be more desirable to
use regular files under the following circumstances:

- The database and applications are simple, well defined, and not expected to change.
- There are stringent real-time requirements for some programs that may not be met because of DBMS overhead.
- Multiple-user access is not required.
How to decide which DBMS to use
Public domain software versus commercial products. In general it will not be easy to find
the right product for a particular application. Small projects that have a highly experimental
status may do very well with public domain software. The advantage with public domain
software is that of low initial costs.
Institutes that run large projects with megatons of data or data centres that deal with huge
data archives are probably better off with a commercial product that comes with proper system support and new versions with new features (and - very important! - bug fixes) that appear at regular intervals.
DBMS specially designed for genome projects. A prominent example of a DBMS
specially designed for a biological project is Acedb. Acedb was originally developed for the
C.elegans genome project, from which its name was derived (A C.Elegans DataBase).
However, the tools in it have been generalised to be much more flexible and the same
software is now used for many different genomic databases from bacteria to fungi to plants to
man. It is also increasingly used for databases with non-biological content.
Relational versus object-oriented. There is an ongoing discussion that splits the database
community into two groups. The supporters of the object-oriented paradigm argue that the
relational data model has been quite successful in the area of traditional business
applications, but that it has certain shortcomings in relation to more complex applications like,
amongst other examples, scientific databases. The object-oriented approach offers the
flexibility to handle requirements of these applications without being limited by the data types
and query languages available in relational systems. A key feature of object-oriented
databases is the power they give the designer to specify both the structure of complex objects
and the operations that can be applied to these objects, thus filling the gap between database
systems and modern programming languages.
The other, more conservative side, although admitting that relational database systems may provide insufficient support for new applications, is quite reluctant to completely give up the idea of a sound underlying theory based on first-order predicate logic that makes database systems objects that can be described and discussed in mathematical terms. They
argue in favour of cautiously extending relational systems to incorporate new features
provided they can be formally described in an extended logical model. Yet, although there is considerable research activity in that area, except O2 (now called ARDENT) no commercially available product has emerged so far. O2 is an example of an ODBMS that started as a research project implementing a logic-based query language with object-oriented features.
Projects that have to deal with data that is very complex, both in structure and in behaviour,
and cannot easily be mapped into a relational model will benefit from using an ODBMS for
data storage. One has to be aware though, that ODBMS are closely integrated with the
object-oriented programming language they were designed for, which in most cases is C++.
Using an ODBMS amounts to giving up what is considered a benefit of relational database systems, the insulation of programs and data, i.e., the fact that the data model and data storage are independent of the programming language the applications are implemented in. As a result, changes in either component, the database or the application programs, have little impact on the other side. When implementing a database in one of the currently available object-oriented DBMS, this independence is lost. Projects will be
chained to one particular programming language and to the object/class definitions in that
language that represent the database schema.
To overcome the reservation against ODBMS that resulted from the above observation, a
consortium of object-oriented DBMS vendors has proposed a standard, known as ODMG-93,
which was recently revised as ODMG 2.0. Once this standard has been established, i.e., once its concepts are fully implemented in the commercial ODBMSs, they will be a real
alternative to RDBMS for the kind of projects described above.
Relational database theory basics
This section will give a short introduction into the entity-relationship model, database design
theory and relational query languages.
Phases of the database design. Figure 1 shows a simplified description of the database
design process. The first step shown is requirements collection and analysis. Once this has
been done, the next step is to create a conceptual schema for the database using a high-level
conceptual data model, followed by the actual implementation of the database. In the final
step the internal storage structures and file organisation of the database will be specified. In
this paper we will concentrate on the conceptual design and the data model mapping.
The Entity-Relationship Model. The Entity-Relationship Model (ERM) is a popular high-level
conceptual data model. It is frequently used for the conceptual design of database
applications, and many database design tools employ its concepts.
The ERM describes data as entities, relationships, and attributes. The basic object is an
entity, which is a “thing” in the real world with an independent existence. Attributes describe
the particular properties of an entity or a relationship; they may be atomic, composite, or
multivalued. Each attribute is associated with a domain, which specifies the set of values that
may be assigned to that attribute.
A database usually contains groups of entities that are similar, i.e., they share the same
attributes and relationships; an entity type describes the schema for a set of entities that
share the same structure. An important constraint on the entities of an entity type is the key or
uniqueness constraint on attributes; the values of key attributes are distinct for each individual
entity, they can be used to identify an entity uniquely.
A relationship type among two or more entity types defines a set of associations among
entities from these types. Informally, each relationship instance is an association of entities
from these types. It represents the fact that those entities are related to each other in some
way in the corresponding miniworld. The degree of a relationship is the number of
participating entity types. Relationships usually have structural constraints (e.g. cardinality
ratio) that limit the possible combinations of entities.
Some entity types may not have key attributes of their own; these are called weak entity
types. Weak entities are identified by being related to specific entities from another entity type
in combination with some of their attribute values. A weak entity type always has a total
participation constraint (existence dependency) with respect to its identifying relationship; a
weak entity cannot be identified without an owner entity.
The Entity-Relationship Model of InterPro. Figure 2 shows a diagram of the InterPro
database as an example of an ERM schema. InterPro entity types, such as ENTRY, METHOD and PROTEIN, are shown in rectangular boxes. Relationship types are shown in diamond-shaped boxes attached to the participating entity types with straight lines. Attributes are
shown in ovals and each attribute is attached to its entity type or relationship type. Weak
entities are distinguished by being placed in double rectangles and by having their identifying
relationship placed in double diamonds.
Note that the figure shows reflexive relationships on ENTRY and PROTEIN. InterPro entries
may be merged (or just be moved around) and there must be a way of reconstructing entries
as they were before. This is modelled by relationship type EAcc (or, respectively PAcc). Via
this relationship an InterPro entry may have a pointer to some other, older InterPro entry that,
e.g. by merging, is now part of the newer entry.
Enhanced data models. The entity-relationship modelling concepts discussed so far are
sufficient for representing a large class of database schemas for traditional applications, and
in most cases they are sufficient for representing biological data.
Influenced by both programming languages and the broad field of Artificial Intelligence and
Knowledge Representation, semantic modelling concepts have been introduced into the
entity-relationship approach to improve its representational capabilities (this data model is
called EERM, Enhanced-Entity-Relationship Model). These include mainly the concepts of
subclass and superclass and the related concepts of specialisation and generalisation.
Associated with a class hierarchy concept is the important mechanism of attribute inheritance.
A far more general approach that has come up lately is the Unified Modelling Language
(UML), which is now a widely accepted standard for object modelling. UML is a modelling
language that fuses the concepts of different object-oriented modelling languages that have
emerged in the field of software engineering when people became interested in object-oriented analysis and design. As a whole, UML supports full object-oriented systems design;
its static parts can be used for describing the structure and semantics of a database on the
conceptual level.
The relational data model. The relational model, which was introduced by Codd in 1970, is
based on a simple and uniform data structure - the relation, i.e., the relational model
represents a database as a collection of relations. Usually the relational algebra, which is a
collection of operations for manipulating relations and specifying queries, is considered to be
an integral part of the relational data model. When a relation is thought of as a table of values,
each row in the table represents a collection of related data values. Column names specify
how to interpret the data values in each row. In relational model terminology, a row is a tuple,
a column header is called attribute, and the table itself is called relation.
More formally, a relation schema is made up of a relation name and a list of attributes. It’s
used to describe a relation. Each attribute is the name of a role played by some domain,
which is a set of atomic values. Atomic means that each value in the domain is indivisible as
far as the relational model is concerned; it cannot be a set of values or some kind of complex
value. A special value, null, is used for cases when the values of some attributes within a
particular tuple may be unknown, or may not apply to that tuple. A relation is a set of data
tuples, or - more precisely - a subset of the Cartesian product of the domains that define the
relation schema.
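Written formally: for a relation schema R(A1, A2, ..., An), a relation instance r satisfies

  r(R) ⊆ dom(A1) × dom(A2) × ... × dom(An)

i.e., it is a set of n-tuples whose i-th component is drawn from dom(Ai).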
A relational database schema is a set of relation schemas plus a set of integrity constraints. A
relational database instance is a set of relations such that each of them is an instance of the
respective relation schema and such that all of them together satisfy the integrity constraints
that are defined on the database schema.
Relational model constraints. The various types of constraints, which can be specified on a
relational database schema, include domain constraints, key constraints, entity integrity, and
referential integrity. Data dependencies are another type of constraints. They include
functional dependencies and multi-valued dependencies, which are mainly used for database
design:

- Domain constraints specify that the value of each attribute must be an atomic value of the domain associated with that attribute.
- Usually, there are subsets of attributes of a relation schema with the property that no two tuples in any relation instance should have the same combination of values for these attributes. A minimal subset with that property is called a (candidate) key of the relation schema. If there is more than one candidate key, commonly one is designated as the primary key. This is the set of attributes whose values are used to identify tuples in a relation.
- The entity integrity constraint states that no primary key value can be null.
- The referential integrity constraint is specified between two relations and is used to maintain the consistency among tuples of the two relations. Informally, the referential integrity constraint states that a tuple in one relation that refers to another relation must refer to an existing tuple in that relation.
  More formally, a set of attributes FK in relation schema R1 is a foreign key of R1 if it satisfies the following rules:
  - The attributes in FK have the same domain as the primary key attributes PK of another relation schema R2; the attributes of FK refer to the relation R2.
  - A value of FK in a tuple of R1 either occurs as a value of PK from some tuple of R2 or is null.
  A foreign key can refer to its own relation.
These integrity constraints are supported by most relational database management systems.
There’s a class of more general semantic integrity constraints. “An employee’s salary should not exceed the salary of his supervisor” is an example of this class that cannot be enforced in a DBMS by simple means, but most RDBMS provide mechanisms to help implement rules like that as well.
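As a sketch of how referential integrity is declared in SQL: the dbcode column of METHOD (see the InterPro schema of Figure 3) could be tied to CV_DATABASE as follows. The constraint name is made up here, and the actual InterPro DDL is not shown in this paper; more general semantic rules like the salary example are typically implemented with mechanisms such as CHECK constraints or triggers, depending on the DBMS.

ALTER TABLE method
  ADD CONSTRAINT fk_method_db
  FOREIGN KEY (dbcode) REFERENCES cv_database (dbcode);
-- from now on the DBMS rejects any METHOD row whose dbcode
-- does not occur as a dbcode value in CV_DATABASE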
Relational Algebra. Relational algebra is a collection of operations that are used to
manipulate entire relations, like for example to select tuples from individual relations or to
combine related tuples from several relations for the purpose of specifying a retrieval request
on the database. These operations are usually divided into two groups. One group includes
set operations from mathematical set theory; these are applicable because relations are
defined to be sets of tuples. They include:

- UNION,
- INTERSECTION,
- DIFFERENCE, and
- CARTESIAN PRODUCT.
The other group consists of operations developed specifically for relational databases; these
include SELECT, PROJECT, and JOIN among others:

- The SELECT operation is used to select a subset of the tuples in a relation that satisfy a selection condition.
- The PROJECT operation selects certain columns from a table and discards other columns.
- The JOIN operation is used to combine related tuples from two relations into single tuples. It is used to process relationships among relations.
The important thing about relational algebra is that relational operators work on relations, i.e., they take relations as input and produce relations as output. As a result, they can be arbitrarily combined to form all kinds of complex queries on the database. There also exists a set of rules for how sequences of operations on relations can be reordered, thus providing database systems with a means to optimise complex queries by regrouping operations and evaluating them in an efficient way.
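To make the composition of operators concrete, the request that appears later as Example 2 (“What’s the InterPro entry name for fingerprint ACONITASE?”) can be written as a single algebra expression, sketched here with ⋈(attr) denoting the equijoin on the named accession attribute:

  π ENTRY.name ( σ METHOD.name = 'ACONITASE' ( METHOD ⋈(method_ac) ENTRY2METHOD ⋈(entry_ac) ENTRY ) )

Because such expressions may be reordered, an optimiser is free, for example, to apply the selection on METHOD before performing the joins, which usually keeps the intermediate results small.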
ER-to-relational mapping. Basically, the algorithm of converting an ER schema into a
relational schema follows a list of steps:

- Entity types are converted into relations. The simple attributes of an entity type become attributes of the relation. Complex attributes are broken down into their atomic parts, which then become attributes of the relation. The multivalued attributes are turned into relations on their own; relations derived from multivalued attributes include the primary key of the entity type as relation attributes.
- Weak entity types are converted into relations just like normal entity types, but here the attributes of the primary key of the owner entity type are added to the relation attributes to make sure that the new relation has a proper primary key.
- Not all relationship types are turned into relations. For binary 1:1 relationship types the participating entity types and the corresponding relations are identified. One of them is chosen (more or less) arbitrarily. Its attributes are augmented by the primary key attributes of the other relation. These attributes now form a foreign key to the other relation.
- Likewise, for a binary 1:N relationship type, there is a relation corresponding to the entity type on the “N”-side of the relationship. This relation is augmented by the primary key attributes of the other relation, which form a foreign key to that relation.
- M:N relationship types are converted into their own relations (see the sketch after this list). Here again, the relations corresponding to the entity types that participate in the relationship are identified and their primary key attributes are combined to form the primary key of a new relation that represents the M:N relationship. Again, the original primary key attributes form foreign keys to their respective relations. If the relationship type had attributes of its own, these are added to the new relation.
- Likewise, for an n-ary relationship type a new relation is created that includes all the primary keys of the relations that represent the participating entity types.
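As a sketch of the M:N rule, applied to the InterPro relationship between ENTRY and METHOD and expressed in the SQL introduced later in this paper (the column types are assumptions; the actual InterPro DDL is not given here):

CREATE TABLE entry2method (
    entry_ac   VARCHAR2(9)   NOT NULL REFERENCES entry (entry_ac),
    method_ac  VARCHAR2(25)  NOT NULL REFERENCES method (method_ac),
    -- the combined primary keys of both participants form the new primary key
    CONSTRAINT pk_entry2method PRIMARY KEY (entry_ac, method_ac)
);

Each of the two inherited key attributes is at the same time a foreign key to the relation it was taken from.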
First informal design guideline for relation schemas: “Design a relation schema so that it
is easy to explain its meaning. Do not combine attributes from multiple entity types or
relationship types into a single relation.”
Note that everybody who accesses a database directly, whether application developers writing application code or users writing ad-hoc SQL queries, must fully understand the meaning of the database schema. Intuitively, if a relation schema is as close as possible to the entity-relationship schema, its meaning tends to be clear.
The InterPro example continued. Figure 3 shows the relational model that was derived from
the InterPro conceptual model shown in figure 2. Note that every entity type has been
converted into one relation type, which has the same attributes as the entity type. Each entity
type had an accession number, a somewhat artificial key attribute, which was not really used
to describe a feature of the entities belonging to that type. These key attributes, like
entry_ac for ENTRY, serve as primary keys of the respective relations. Relationship types
ENTRY2METHOD and MATCH have been converted into their own relation types. Note that
these relation types inherit the combined primary keys of the relation types that correspond to
the entity types participating in the relationship, i.e., entry_ac, method_ac and
protein_ac. Weak entity type DOC has been converted into its own relation type. The weak
relationship between DOC and ENTRY is preserved by including the primary key of ENTRY,
entry_ac, into the attributes of DOC.
Relations CV_DATABASE and CV_ENTRYTYPE provide a “controlled vocabulary”, hence the
CV_-prefix, for database names or entry types that may occur in InterPro. So far, there are only a few databases from which entries are accepted; however, their number may increase in the future.
Two relations ENTRY_ACCPAIR and PROTEIN_ACCPAIR are mainly for tracing old entries (or
proteins), that have been moved or merged together and have a new primary key now. In
case the old entries are needed, they can be accessed via secondary_ac.
Database design theory
Apart from the guideline for a good schema design that was mentioned above, there are other
aspects that have to be considered.
Another look at the InterPro example. Assume that in the InterPro example the description
of an entry type was included in the schema of ENTRY. Figure 4 shows the graphical
representation of the corresponding ENTRY relation type.
Now, while InterPro contains quite a large number of entries, so far only a small number of entry types is supported (currently there are four: “F” for “Family”, “D” for “Domain”, “R” for “Repeat”, and “P” for “PTM”). Each one of these four types has a description, so that any outsider who accesses the InterPro database for the first time will understand the meaning of these types.
In the relation schema that’s depicted in Figure 4 this description is included in the attribute
list of ENTRY. For every tuple of ENTRY, the lengthy text, which makes up the description of
the respective entry type, will thus be included in the tuple and the few descriptions that are
valid will be repeated many times. Obviously this is a waste of storage space, which should
be avoided. There is also a more serious problem related to this kind of design: assume one user has a look at a particular entry, decides that the description of the entry type could be more precise than it is, and updates that tuple with a new description. The same entry type will still have the old description in all other entry tuples, i.e., by updating one tuple of ENTRY an inconsistency on the attribute type_descr has been introduced. This kind of behaviour is commonly called an update anomaly.
There are other types of anomalies. With ENTRY as described in Figure 4, if a new entry type
is introduced, it can only be entered into the database, if an entry already exists there of that
type. This is called an insertion anomaly. Likewise, if the last tuple with a certain entry type is
deleted from the database, then this entry type will not exist anymore in the database. This is
called a deletion anomaly.
Second informal design guideline for relation schemas: “A relational schema should be
designed so that no modification anomalies, i.e., update, insertion, or deletion anomalies,
occur in the relations.”
Note that in some exceptional cases this guideline has to be violated in order to improve the performance of certain queries (usually, query performance is better the fewer joins between relations are involved). The anomalies in those cases must be well documented and understood, so that updates on the relation do not end up in inconsistencies.
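For the ENTRY example, a minimal sketch of the normalised design that removes the anomalies is given below; it corresponds to the CV_ENTRYTYPE relation of Figure 3 (the column names and types used here are assumptions):

CREATE TABLE cv_entrytype (
    code        CHAR(1)        NOT NULL CONSTRAINT pk_cv_entrytype PRIMARY KEY,
    type_descr  VARCHAR2(400)  NOT NULL   -- each description is stored exactly once
);

ALTER TABLE entry
  ADD CONSTRAINT fk_entry_type
  FOREIGN KEY (entry_type) REFERENCES cv_entrytype (code);

ENTRY keeps only the one-character type code; updating a description, adding a new entry type, or deleting the last entry of a type no longer touches ENTRY at all.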
Third informal design guideline for relation schemas: “Design relations in such a way that null values for certain attributes do not apply to the majority of the tuples of that relation.”
Null values are a bit tricky. A null value for some attribute can have multiple interpretations,
such as:

- The attribute does not apply to this tuple.
- The attribute value for this tuple is unknown.
- The value is known, but is absent; that is, it has not been recorded yet.
Having the same representation for all nulls compromises the different meanings they may
have. Another problem is how to account for them in JOIN operations or when aggregate
functions such as COUNT or SUM are applied. So, when nulls are unavoidable, they should
apply in exceptional cases only.
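As a small illustration of the aggregate issue: COUNT(*) counts all rows, whereas COUNT(column) skips rows where that column is null, so the two can differ. In the hypothetical sketch below, some_attr stands for any nullable column of PROTEIN (the InterPro schema discussed here does not actually define such a column):

SELECT COUNT(*)         AS all_proteins,    -- counts every row
       COUNT(some_attr) AS with_value_only  -- ignores rows where some_attr is null
FROM   protein;

Similarly, a tuple whose join attribute is null never satisfies the equality condition of a join and silently disappears from the result.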
Normal forms. So far, situations have been discussed that lead to problematic relation
schemas and informal guidelines for a good relational design have been proposed. In the
following section formal concepts are presented that allow us to define “goodness” or “badness”
of a relational schema more precisely.
The single most important concept in relational design is that of a functional dependency. A
functional dependency is a constraint between two sets of attributes from the database.
Functional dependencies are defined under the assumption that the whole database is
described by a single universal relation schema, i.e., all attributes that occur in the database
schema belong to one relation schema (this is only an abstraction that simplifies the following
definitions; later these definitions and concepts will be applied to the relations that have been
derived from the entity-relationship schema).
A functional dependency, denoted by X → Y, between two sets of attributes X and Y of the universal relation schema, states that for any two tuples, if they have the same values on X, they must also have the same values on Y. The values of the Y component of the tuple are thus determined by the values of the X component; or alternatively, the values of the X component of a tuple uniquely (functionally) determine the values of the Y component.
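Written out, X → Y holds if for any two tuples t1 and t2 of (an instance of) the universal relation

  t1[X] = t2[X]  implies  t1[Y] = t2[Y]

where t[X] denotes the values of tuple t on the attributes in X.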
A functional dependency is one aspect of the meaning (or semantics) of the attributes of the
universal relation schema. It must be defined explicitly by someone who knows this
semantics. Specifying all kinds of functional dependencies on the attributes of a database
schema is a step in the (conceptual) design process that must be performed by a person who is familiar with the real-world application; it cannot be done automatically.
Once a set of functional dependencies is given, more functional dependencies can be inferred
from them by simple rules. Normal forms are defined on the set of all functional
dependencies, the given and the inferred ones.
The normalisation process, as first proposed by Codd, takes a relation schema through a
series of tests to certify whether or not it belongs to a certain normal form. Normalisation of
data can be looked on as a process during which unsatisfactory relation schemas are
decomposed by breaking their attributes into smaller relation schemas that possess certain
desirable properties. One objective of the (original) normalisation process is to ensure that
modification anomalies discussed in the previous section do not occur.
Before proceeding with the definitions of normal forms, the concept of keys needs to be
reviewed: A superkey of a relation schema is a set of attributes with the property that no two
tuples of any legal extension of that schema will have the same values on all those attributes.
A key is a superkey with the additional property that the removal of any attribute will cause it
not to be a superkey any more, i.e., a key is a minimal set of attributes that form a superkey. If
a relation has more than one key, each one is called a candidate key and one candidate key
is arbitrarily chosen to be the primary key. An attribute is called prime if it is the member of
any key; otherwise the attribute is called nonprime.

- The first normal form (1NF) is now considered to be part of the formal definition of a relation. It states that the domains of attributes must include only atomic (indivisible) values and that the value of any attribute in a tuple must be a single value from the domain of that attribute.
- The second normal form (2NF) is based on the concept of full functional dependency. A functional dependency X → Y is a full functional dependency if the removal of any attribute from X means that the dependency does not hold anymore.
  A relation schema is in second normal form (2NF) if every nonprime attribute is fully functionally dependent on every key of the relation schema.
- A relation schema is in third normal form (3NF) if, whenever a functional dependency X → A (A being a single attribute) holds, either (a) X is a superkey or (b) A is a prime attribute of the relation schema.
- A relation schema is in Boyce-Codd normal form (BCNF) if, whenever a functional dependency X → A (A being a single attribute) holds, then X is a superkey of the relation schema.
Note that BCNF is slightly stricter than 3NF, because condition (b) of 3NF, which allows A to
be prime if X is not a superkey, is absent from BCNF.
Usually it is considered best to have relation schemas in BCNF. If that is not possible, 3NF
will do; in practice, most relation schemas that are in 3NF are in BCNF anyway.
To see the “historical” relation between 2NF and 3NF, the concept of transitive dependencies has to be defined: a functional dependency X → Y is a transitive dependency if there is a set of attributes Z that is not a subset of any key of the relation schema and both X → Z and Z → Y hold.
Using the concept of transitive dependencies a more intuitive definition of 3NF can be given: a relation schema is in 3NF if every nonprime attribute is
- fully functionally dependent on every key of the relation schema (2NF) and
- nontransitively dependent on every key of the relation schema.
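Applied to the modified ENTRY schema of Figure 4, the relevant dependencies follow directly from the earlier discussion:

  entry_ac   → entry_type, name, type_descr, ...   (entry_ac is the key)
  entry_type → type_descr                          (each entry type has one description)

entry_type is not a superkey and type_descr is not a prime attribute, so type_descr is transitively dependent on the key via entry_type and the schema violates 3NF (and BCNF). Decomposing type_descr into its own relation keyed by the entry type, as in the CV_ENTRYTYPE relation of Figure 3, removes the transitive dependency and with it the anomalies discussed above.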
The technique for relational database schema design, which is described in this paper, is
usually referred to as top-down design. It involves designing a conceptual schema in a high-level data model, such as the ERM, and then mapping the conceptual schema into a set of
relations. In this technique the normalisation principles, such as avoiding transitive or partial
dependencies by decomposing unsatisfactory relation schemas, can be applied both during
the conceptual schema design and afterwards to the relations resulting from the mapping
algorithm.
Note that normal forms, when considered in isolation from other factors, do not guarantee a
good database design. Unfortunately, it is often not sufficient to check separately that each
relation schema in the database is in 3NF or in BCNF. Rather, the process of normalisation
must also confirm the existence of additional properties that the relational schemas together
should possess (for example, there are certain restrictions that ensure that two relations resulting from the decomposition of a relation schema that is not in 3NF or BCNF produce exactly the relation they were derived from when they are joined). These concepts are beyond
the scope of this paper; in the literature they can be found under the keywords lossless join
property and dependency preservation property.
A short introduction into SQL
Most commercial DBMSs provide a high-level declarative language interface. By declarative
we mean that the user has to specify what the result of his query should be (not how it should
be evaluated), leaving the decision on how to execute and optimise the evaluation of a query
to the DBMS. SQL (Structured Query Language; originally called SEQUEL, for Structured English QUEry Language) is a declarative,
comprehensive database language; it has statements for data definition, query, and update.
In this section table, row and column are used for relation, tuple, and attribute.
Data definition in SQL. The following SQL-statement creates table ENTRY of the InterPro
example:
CREATE TABLE entry (
    entry_ac    VARCHAR2(9)   NOT NULL
                CONSTRAINT pk_entry PRIMARY KEY,
    entry_type  CHAR(1),
    name        VARCHAR2(80),
    created     DATE          NOT NULL,
    timestamp   DATE          NOT NULL );
NOT NULL constraints are usually specified directly in the CREATE TABLE statement; other
constraints, e.g. those that define a certain column to be the primary key of the table, can be
included in the CREATE TABLE statement; they can also be stated later with an ALTER
TABLE statement.
The above is a very simple example of CREATE TABLE. Some DBMS allow users to specify lots of storage details as well, which requires a certain knowledge about the DBMS's storage
management, i.e., users who want to create their own objects in the database need a special
training before they can start doing so.
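As a sketch of the ALTER TABLE variant mentioned above, the primary key constraint could equally have been omitted from the CREATE TABLE statement and added afterwards:

ALTER TABLE entry
  ADD CONSTRAINT pk_entry PRIMARY KEY (entry_ac);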
Queries in SQL. SQL has one basic statement for retrieving information from a database. Its
most general form is:
SELECT <attribute_list>
FROM <table_list>
WHERE <condition>
where

- <attribute_list> is a list of attribute names whose values are to be retrieved by the query.
- <table_list> is a list of the relation names required to process the query.
- <condition> is a conditional search expression that identifies the tuples to be retrieved by the query; it can be empty.
The following two examples show the most important relational operations: JOIN,
SELECTION, and PROJECTION.
Example 1 “Which databases are covered in InterPro?”
SELECT dbname
FROM cv_database;
In terms of relational algebra this is a projection on attribute dbname of relation
CV_DATABASE.
Note that there is no condition specified in this query, because we want the names of ALL
databases that are in InterPro. Assume that we just want the dbcode of “Pfam”. In this case
the query would have to be rewritten into
SELECT dbcode
FROM cv_database
WHERE dbname = 'Pfam';
This is a selection of all the tuples of CV_DATABASE that satisfy condition “dbname = ‘Pfam’”
followed by a projection on attribute dbcode (SQL has been criticised for using the SELECT
clause for specifying a projection and doing the selection of tuples in the WHERE clause, thus
confusing the users. It has become a standard anyway).
Example 2: “What’s the InterPro entry name for fingerprint ACONITASE?”
SELECT entry.name
FROM method, entry2method, entry
WHERE method.name = 'ACONITASE'
AND method.method_ac = entry2method.method_ac
AND entry2method.entry_ac = entry.entry_ac;
Here, we have a join on tables METHOD, ENTRY2METHOD, and ENTRY, combined with a
selection of all tuples that satisfy condition “method.name = ‘ACONITASE’ “, followed by a
projection on entry.name.
Note that although the value to be retrieved stems from table ENTRY, both tables
ENTRY2METHOD and METHOD (and of course ENTRY) must occur in the FROM clause, because
all three tables are needed to process the query. Unlike example 1, where only one table was
involved, there is an ambiguity now with respect to column names, which is why table names
are put in front of column names (table names and column names are separated by a “.”).
Note also that the WHERE clause (a) holds the condition that the method name equals ‘ACONITASE’ and (b) links tables METHOD, ENTRY2METHOD, and ENTRY via their “join” attributes. Whenever tuples are constructed from these three tables, only those tuples are of
interest that follow the “links” on the method_ac and the entry_ac columns; these joins
must be explicitly stated as conditions in the WHERE clause.
More examples for SQL queries. The following examples show SQL queries that
demonstrate the expressive power of SQL. They use aggregate functions and group-by
statements. A discussion of all features of SQL is way beyond the scope of this paper.
‘How many proteins are there in InterPro?’
SELECT COUNT(*)
FROM   protein;
‘How many proteins are in SWISS-PROT and how many are in TrEMBL?’
SELECT cvd.dbname, count(*)
FROM   protein p, cv_database cvd
WHERE  p.dbcode = cvd.dbcode
AND    cvd.dbname IN ('SWISS-PROT', 'TrEMBL')
GROUP BY cvd.dbname;
‘How long is the average protein in SWISS-PROT and in TrEMBL?’
SELECT cvd.dbname, avg(len)
FROM   protein p, cv_database cvd
WHERE  p.dbcode = cvd.dbcode
AND    cvd.dbname IN ('SWISS-PROT', 'TrEMBL')
GROUP BY cvd.dbname;
‘How long is the average method/match?’
SELECT avg(pos_to - pos_from)
FROM   match;
Same as above but this time for each database separately
SELECT cvd.dbname, avg(pos_to - pos_from)
FROM   match m, method me, cv_database cvd
WHERE  m.method_ac = me.method_ac
AND    me.dbcode = cvd.dbcode
GROUP BY cvd.dbname ORDER BY 2;
‘Which methodology matches most proteins?’
SELECT cvd.dbname, count(distinct protein_ac)
FROM   match m, method me, cv_database cvd
WHERE  m.method_ac = me.method_ac
AND    me.dbcode = cvd.dbcode
GROUP BY cvd.dbname
ORDER BY 2;
‘Which methodology matches most amino acids?’
SELECT cvd.dbname, sum(pos_to - pos_from)
FROM   match m, method me, cv_database cvd
WHERE  m.method_ac = me.method_ac
AND    me.dbcode = cvd.dbcode
GROUP BY cvd.dbname
ORDER BY 2;
‘Which patterns match most proteins (and we want the really big ones only…)?’
SELECT me.name, count(distinct protein_ac)
FROM   match m, method me
WHERE  m.method_ac = me.method_ac
GROUP BY me.name
HAVING count(distinct protein_ac) > 1000
ORDER BY 2 desc;
‘Which patterns match the same proteins more than 10 times?’
SELECT me.name, p.name, count(*)
FROM   protein p, match m, method me
WHERE  p.protein_ac = m.protein_ac
AND    m.method_ac = me.method_ac
GROUP BY me.name, p.name
HAVING count(*) > 10
ORDER BY 3 desc;
‘What are the matches for protein ‘Q10466’?’
SELECT me.name, cvd.dbname, pos_from, pos_to
FROM method me, cv_database cvd, match m
WHERE m.protein_ac = 'Q10466'
AND m.method_ac = me.method_ac
AND me.dbcode = cvd.dbcode
ORDER BY pos_from, me.name;
Further reading
Elmasri, Navathe: “Fundamentals of Database Systems”, Addison Wesley, third edition, 1999
Ullman: “Principles of Database and Knowledge-Base Systems (Volume 1 and 2)”, Computer
Science Press, 1988/1989
Date: “An Introduction to Database Systems”, Addison Wesley, fifth edition, 1990
Maier: “The Theory of Relational Databases”, Computer Science Press, 1983
Figure 1. Phases of database design (adapted from Elmasri, Navathe: Fundamentals of
database systems)
Figure 2. The entity-relationship model of InterPro
Figure 3. Relational schema of InterPro
Figure 4. Modified schema of ENTRY