Introduction to Database Modeling in Bioinformatics

Beate Marx
Database Administrator, EMBL Outstation - The European Bioinformatics Institute
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Telephone: +44-1223-494419   Fax: +44-1223-494468   E-mail: Marx@ebi.ac.uk

Introduction

Over the last couple of years a large number of biological databases have become accessible to the biological community. Universities, scientific institutes, and commercial vendors offer access to all kinds of biological information systems via the Internet. These information systems differ with respect to content, reliability and ease of use. They also differ with respect to the size of the databases, which can range from a few hundred megabytes to tens or even hundreds of gigabytes, and with respect to the underlying technology used to maintain them.

In the sixties, when the first medical databases became publicly available, database technology was still in its early beginnings and the state of the art was indexed flat files. A few vendors offered database management systems (DBMS) based on the Network Model or the Hierarchical Data Model. There were no common standards; the first attempt to standardise database technology was the CODASYL (Conference on Data Systems Languages) DBTG (Data Base Task Group) proposal, published in the early seventies. With the Relational Data Model, developed by Ted Codd and published in 1970, and the Entity-Relationship Model, published by Chen in 1976, a breakthrough took place. From the early seventies on, relational database theory became very popular within the computer science community; several large conferences dedicated to database theory were established during these years and quite a number of research projects on relational database systems were started. It was not until the late seventies to early eighties that the first commercial relational database management systems (RDBMS) became available. Query languages based on relational algebra were introduced by several research groups, of which SQL (originally called SEQUEL) is now the most widespread. Database management systems (with reduced features) became available for PCs.

During the eighties database technology advanced in various ways. On the technical side, there was a shift from mainframes to client-server based systems and distributed DBMS. On the conceptual side, research projects emerged to fill the gap between the relational database model and relational query languages on the one hand and modern object-oriented programming languages on the other. The ER model was extended to capture concepts like complex objects and inheritance, and database capabilities were extended in several directions: deductive databases were introduced, as well as temporal and active databases; first steps to incorporate complex objects into the relational model were made (resulting in NF2 databases); and DBMS features for spatial, temporal, and multimedia data were added.

In the nineties two major trends can be identified for commercial DBMS, reflecting research topics of the late eighties and early nineties. Relational database systems have become the most commonly used systems. They are based on a sound mathematical theory, they are well understood, and considerable progress has been made with respect to physical storage structures, query optimisation, transaction processing and multi-user concurrency control, backup and recovery features, and database security.
Some vendors offer so-called object-oriented relational systems, which are relational systems with some object-oriented features, like nested relations and abstract object types, implemented on top. Most of the currently available object-oriented database management systems (ODBMS) - the other trend - come from a completely different background: they can best be described as object-oriented programming languages (C++ in most cases) with persistence added. ODBMS lack a formal model and until quite recently no common standard existed; some systems still don't offer ad-hoc query capabilities, and all kinds of data processing techniques are proprietary to the respective database system.

Although commercial database management systems have been around for more than 20 years now, publicly accessible biological data is still likely to be found in flat-file organised databases (some with a highly sophisticated indexing system, like SRS). Yet, over the last ten years there has been a trend to at least maintain the data inside (most often relational) database systems, thus taking advantage of an advanced technology that offers ad-hoc declarative query capabilities combined with automatic query optimisation and consistency checking in a way that is almost impossible to implement on flat files. Ongoing implementation efforts concentrate on providing forms-based or, more recently, CORBA-based interfaces to biological databases. Direct access and unrestricted queries are usually prohibited for the simple reason that they would impose an unpredictable load on the database servers, which would be very difficult to handle. Current database technology offers some concepts that may prove useful in overcoming the related problems. High availability is one of the buzzwords, meaning that database servers are implemented on hardware clusters. In a cluster, if one machine becomes unavailable (e.g. due to a crash), its processes are transparently failed over to the other machine(s), where they will continue to run. Ideally, users will not lose connections or become aware of the hardware problem in any other way. Replication is another very useful concept. It allows mirroring the contents of a database (or part of it) into some other database server. (Read-only) access to this mirror can be granted to the public; even a high load on the mirror will not affect the performance of the master database.

Features of database systems

Before discussing the features commonly associated with a database system, we will give a short introduction into the vocabulary used throughout this paper. A database is a logically coherent collection of related data with some inherent meaning (or simply a collection of data). A database management system (DBMS) is a collection of programs that enables users to create and maintain a database, i.e., it is the (general-purpose) software that facilitates the processes of defining, constructing, and manipulating databases for various applications. The database and the software together will be called a database system. Current database systems can be characterised by certain concepts the underlying software is commonly supposed to support:

- A DBMS has the capability to represent a variety of complex relationships among data and to define and enforce integrity constraints for these data.
- Multiple users can use the system simultaneously without affecting each other.
- A transaction-processing component applies concurrency control techniques that ensure non-interference of concurrently executing transactions. (A transaction is a sequence of user-defined operations, enclosed in a 'begin transaction' and 'commit', that must behave like a single operation on the database.)
- Physical data storage is transparent to the user.
- Users express queries (or data manipulation/data definition requests) in a high-level declarative query language (or DML/DDL) such as SQL. Based on the data dictionary that is maintained by the DBMS, an optimal (or almost optimal) execution strategy for retrieving the result of those queries (or implementing the changes) will be devised by the DBMS.
- A backup and recovery subsystem provides the means for recovering from hardware or software failures.
- An authorisation component protects the database against persons who are not authorised to access part of the database or the whole database.

When to use (or not to use) commercial DBMS

As mentioned before, biological databases can be very large and the data itself can be very complex in structure. DBMS offer almost 30 years of experience in providing efficient ways to store and index large amounts of data while at the same time hiding storage details from the user. Users will instead be provided with a conceptual representation of the data that does not include many of the details of how the data is stored, and with a high-level query and manipulation language. Abstraction from storage-specific details also facilitates data integrity checks at a high level that cannot be achieved for data held in file-based systems. DBMS allow multiple users to access the database at the same time, both for querying and updating the data. Data inside a database system is always up to date; there is no delay for reorganising file structures or rebuilding indexes, which might occur in file-based systems whenever updates take place. Experience shows that the quality of data improves if it is maintained inside a database system, provided that the database is designed following certain rules and (automatic) integrity checking is applied wherever it makes sense. Database theory provides a measure for the quality of the database design; integrity constraints are part of the relational data model.

In spite of the advantages database systems provide over file-based systems, one has to be aware of overhead costs that are introduced when using DBMS software. There will be a high initial investment in hardware and software. A DBMS is a complex software product that needs specially trained system administrators to handle it. One also has to be aware of problems that may arise if a database is poorly designed or if the DBMS is not administered by experienced personnel. Because of the overhead costs and the potential problems of improper design or system administration, it may be more desirable to use regular files under the following circumstances:

- The database and applications are simple, well defined, and not expected to change.
- There are stringent real-time requirements for some programs that may not be met because of DBMS overhead.
- Multiple-user access is not required.

How to decide which DBMS to use

Public domain software versus commercial products. In general it will not be easy to find the right product for a particular application. Small projects that have a highly experimental status may do very well with public domain software.
The advantage of public domain software is its low initial cost. Institutes that run large projects with megatons of data, or data centres that deal with huge data archives, are probably better off with a commercial product that comes with proper system support and with new versions bringing new features (and - very important! - bug fixes) at regular intervals.

DBMS specially designed for genome projects. A prominent example of a DBMS specially designed for a biological project is Acedb. Acedb was originally developed for the C. elegans genome project, from which its name was derived (A C. elegans DataBase). However, the tools in it have been generalised to be much more flexible, and the same software is now used for many different genomic databases, from bacteria to fungi to plants to man. It is also increasingly used for databases with non-biological content.

Relational versus object-oriented. There is an ongoing discussion that splits the database community into two groups. The supporters of the object-oriented paradigm argue that the relational data model has been quite successful in the area of traditional business applications, but that it has certain shortcomings in relation to more complex applications such as, amongst other examples, scientific databases. The object-oriented approach offers the flexibility to handle the requirements of these applications without being limited by the data types and query languages available in relational systems. A key feature of object-oriented databases is the power they give the designer to specify both the structure of complex objects and the operations that can be applied to these objects, thus filling the gap between database systems and modern programming languages. The other, more conservative side, although admitting that relational database systems may provide insufficient support for new applications, is quite reluctant to completely give up the idea of a sound underlying theory based on first-order predicate logic, which makes database systems objects that can be described and discussed in mathematical terms. They argue in favour of cautiously extending relational systems to incorporate new features, provided these can be formally described in an extended logical model. Yet, although there is considerable research activity in that area, except for O2 (now called ARDENT) no commercially available product has emerged so far. O2 is an example of an ODBMS that started as a research project implementing a logic-based query language with object-oriented features.

Projects that have to deal with data that is very complex, both in structure and in behaviour, and cannot easily be mapped into a relational model will benefit from using an ODBMS for data storage. One has to be aware, though, that ODBMS are closely integrated with the object-oriented programming language they were designed for, which in most cases is C++. Using an ODBMS amounts to giving up what is considered a benefit of relational database systems, namely the insulation of programs and data, i.e., the fact that the data model and data storage are independent of the programming language the applications are implemented in. As a result of that independence, changes in either component, the database or the application programs, have little impact on the other side. On implementing a database in one of the currently available object-oriented DBMS this independence will be lost.
Projects will be chained to one particular programming language and to the object/class definitions in that language that represent the database schema. To overcome the reservations against ODBMS that result from the above observation, a consortium of object-oriented DBMS vendors has proposed a standard, known as ODMG-93, which has recently been revised into ODMG 2.0. Once this standard has been established, i.e., when its concepts are fully implemented in the commercial ODBMSs, they will be a real alternative to RDBMS for the kind of projects described above.

Relational database theory basics

This section will give a short introduction into the entity-relationship model, database design theory and relational query languages.

Phases of the database design. Figure 1 shows a simplified description of the database design process. The first step shown is requirements collection and analysis. Once this has been done, the next step is to create a conceptual schema for the database using a high-level conceptual data model, followed by the actual implementation of the database. In the final step the internal storage structures and file organisation of the database are specified. In this paper we will concentrate on the conceptual design and the data model mapping.

The Entity-Relationship Model. The Entity-Relationship Model (ERM) is a popular high-level conceptual data model. It is frequently used for the conceptual design of database applications, and many database design tools employ its concepts. The ERM describes data as entities, relationships, and attributes. The basic object is an entity, which is a "thing" in the real world with an independent existence. Attributes describe the particular properties of an entity or a relationship; they may be atomic, composite, or multivalued. Each attribute is associated with a domain, which specifies the set of values that may be assigned to that attribute. A database usually contains groups of entities that are similar, i.e., they share the same attributes and relationships; an entity type describes the schema for a set of entities that share the same structure. An important constraint on the entities of an entity type is the key or uniqueness constraint on attributes: the values of key attributes are distinct for each individual entity, so they can be used to identify an entity uniquely. A relationship type among two or more entity types defines a set of associations among entities from these types. Informally, each relationship instance is an association of entities from these types. It represents the fact that those entities are related to each other in some way in the corresponding miniworld. The degree of a relationship is the number of participating entity types. Relationships usually have structural constraints (e.g. a cardinality ratio) that limit the possible combinations of entities. Some entity types may not have key attributes of their own; these are called weak entity types. Weak entities are identified by being related to specific entities from another entity type, in combination with some of their attribute values. A weak entity type always has a total participation constraint (existence dependency) with respect to its identifying relationship; a weak entity cannot be identified without an owner entity.

The Entity-Relationship Model of InterPro. Figure 2 shows a diagram of the InterPro database as an example of an ERM schema. InterPro entity types, such as ENTRY, METHOD and PROTEIN, are shown in rectangular boxes.
Relationship types are shown in diamond-shaped boxes, attached to the participating entity types with straight lines. Attributes are shown in ovals and each attribute is attached to its entity type or relationship type. Weak entities are distinguished by being placed in double rectangles and by having their identifying relationship placed in double diamonds. Note that the figure shows reflexive relationships on ENTRY and PROTEIN. InterPro entries may be merged (or just be moved around) and there must be a way of reconstructing entries as they were before. This is modelled by relationship type EAcc (or, respectively, PAcc). Via this relationship an InterPro entry may have a pointer to some other, older InterPro entry that, e.g. by merging, is now part of the newer entry.

Enhanced data models. The entity-relationship modelling concepts discussed so far are sufficient for representing a large class of database schemas for traditional applications, and in most cases they are sufficient for representing biological data. Influenced by both programming languages and the broad field of Artificial Intelligence and Knowledge Representation, semantic modelling concepts have been introduced into the entity-relationship approach to improve its representational capabilities (this data model is called the EERM, Enhanced Entity-Relationship Model). These include mainly the concepts of subclass and superclass and the related concepts of specialisation and generalisation. Associated with the class hierarchy concept is the important mechanism of attribute inheritance. A far more general approach that has come up more recently is the Unified Modelling Language (UML), which is now a widely accepted standard for object modelling. UML is a modelling language that fuses the concepts of different object-oriented modelling languages that emerged in the field of software engineering when people became interested in object-oriented analysis and design. As a whole, UML supports full object-oriented systems design; its static parts can be used for describing the structure and semantics of a database on the conceptual level.

The relational data model. The relational model, which was introduced by Codd in 1970, is based on a simple and uniform data structure - the relation - i.e., the relational model represents a database as a collection of relations. Usually the relational algebra, which is a collection of operations for manipulating relations and specifying queries, is considered to be an integral part of the relational data model. When a relation is thought of as a table of values, each row in the table represents a collection of related data values. Column names specify how to interpret the data values in each row. In relational model terminology, a row is called a tuple, a column header is called an attribute, and the table itself is called a relation. More formally, a relation schema is made up of a relation name and a list of attributes; it is used to describe a relation. Each attribute is the name of a role played by some domain, which is a set of atomic values. Atomic means that each value in the domain is indivisible as far as the relational model is concerned; it cannot be a set of values or some kind of complex value. A special value, null, is used for cases where the value of some attribute within a particular tuple is unknown or does not apply to that tuple. A relation is a set of data tuples, or - more precisely - a subset of the Cartesian product of the domains that define the relation schema.
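As a small illustration (a sketch only; the attribute names entry_type and type_descr and the four one-character type codes are taken from later sections of this paper, and the descriptions are abbreviated here), the controlled-vocabulary relation CV_ENTRYTYPE of the InterPro schema can be written as a set of tuples over the relation schema CV_ENTRYTYPE(entry_type, type_descr):

    entry_type | type_descr
    -----------+-----------
    F          | Family
    D          | Domain
    R          | Repeat
    P          | PTM

Each row is one tuple, each entry_type value is drawn from the domain of one-character codes, and no two tuples share the same entry_type value, so entry_type can serve as a key of this relation.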
A relational database schema is a set of relation schemas plus a set of integrity constraints. A relational database instance is a set of relations such that each of them is an instance of the respective relation schema and such that all of them together satisfy the integrity constraints defined on the database schema.

Relational model constraints. The various types of constraints that can be specified on a relational database schema include domain constraints, key constraints, entity integrity, and referential integrity. Data dependencies are another type of constraint. They include functional dependencies and multi-valued dependencies, which are mainly used for database design. Domain constraints specify that the value of each attribute must be an atomic value of the domain associated with that attribute. Usually there are subsets of attributes of a relation schema with the property that no two tuples in any relation instance should have the same combination of values for these attributes. A minimal subset with that property is called a (candidate) key of the relation schema. If there is more than one candidate key, commonly one is designated as the primary key: this is the set of attributes whose values are used to identify tuples in a relation. The entity integrity constraint states that no primary key value can be null. The referential integrity constraint is specified between two relations and is used to maintain the consistency among tuples of the two relations. Informally, the referential integrity constraint states that a tuple in one relation that refers to another relation must refer to an existing tuple in that relation. More formally, a set of attributes FK in relation schema R1 is a foreign key of R1 if it satisfies the following rules: the attributes in FK have the same domain as the primary key attributes PK of another relation schema R2 (the attributes of FK are said to refer to the relation R2), and a value of FK in a tuple of R1 either occurs as a value of PK in some tuple of R2 or is null. A foreign key can refer to its own relation. These integrity constraints are supported by most relational database management systems. There is also a class of more general semantic integrity constraints; "an employee's salary should not exceed the salary of his supervisor" is an example of this class. Such constraints cannot be enforced in a DBMS by simple means, but most RDBMS provide mechanisms to help implement rules like that as well.

Relational algebra. Relational algebra is a collection of operations that are used to manipulate entire relations, for example to select tuples from individual relations or to combine related tuples from several relations for the purpose of specifying a retrieval request on the database. These operations are usually divided into two groups. One group includes set operations from mathematical set theory; these are applicable because relations are defined to be sets of tuples. They include UNION, INTERSECTION, DIFFERENCE, and CARTESIAN PRODUCT. The other group consists of operations developed specifically for relational databases; these include SELECT, PROJECT, and JOIN, among others:

- The SELECT operation is used to select a subset of the tuples in a relation that satisfy a selection condition.
- The PROJECT operation selects certain columns from a table and discards the other columns.
- The JOIN operation is used to combine related tuples from two relations into single tuples. It is used to process relationships among relations.
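As a small worked example (a sketch only; it uses the InterPro relations of Figure 3, assumes joins on the shared attributes method_ac and dbcode, and writes the operators by the names used above), the request "accession numbers of all InterPro entries that have at least one method from the database Pfam" can be composed from these operations:

    PROJECT[entry_ac] ( ENTRY2METHOD  JOIN  METHOD  JOIN  SELECT[dbname = 'Pfam'] (CV_DATABASE) )

The innermost SELECT keeps only the CV_DATABASE tuple for Pfam, the JOINs follow the dbcode attribute from CV_DATABASE to METHOD and the method_ac attribute from METHOD to ENTRY2METHOD, and the final PROJECT keeps only the entry_ac column of the combined tuples.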
The important thing about relational algebra is that relational operators work on relations, i.e., they take relations as input and produce relations as output. As a result, they can be arbitrarily combined to form all kinds of complex queries on the database. There also exists a set of rules for how sequences of operations on relations can be reordered, thus providing database systems with a means to optimise complex queries by regrouping operations and evaluating them in an efficient way.

ER-to-relational mapping. Basically, the algorithm for converting an ER schema into a relational schema follows a list of steps:

- Entity types are converted into relations. The simple attributes of an entity type become attributes of the relation. Composite attributes are broken down into their atomic parts, which then become attributes of the relation. Multivalued attributes are turned into relations of their own; relations derived from multivalued attributes include the primary key of the entity type among their attributes.
- Weak entity types are converted into relations just like normal entity types, but here the attributes of the primary key of the owner entity type are added to the relation attributes to make sure that the new relation has a proper primary key.
- Not all relationship types are turned into relations. For a binary 1:1 relationship type the participating entity types and the corresponding relations are identified. One of them is chosen (more or less) arbitrarily, and its attributes are augmented by the primary key attributes of the other relation. These attributes now form a foreign key to the other relation.
- Likewise, for a binary 1:N relationship type, there is a relation corresponding to the entity type on the "N" side of the relationship. This relation is augmented by the primary key attributes of the other relation, which form a foreign key to that relation.
- M:N relationship types are converted into relations of their own. Here again, the relations corresponding to the entity types participating in the relationship are identified, and their primary key attributes are combined to form the primary key of a new relation that represents the M:N relationship. The original primary key attributes form foreign keys to their respective relations. If the relationship type had attributes of its own, these are added to the new relation.
- Likewise, for an n-ary relationship type a new relation is created that includes all the primary keys of the relations that represent the participating entity types.

First informal design guideline for relation schemas: "Design a relation schema so that it is easy to explain its meaning. Do not combine attributes from multiple entity types or relationship types into a single relation." Note that everybody who accesses a database directly, be it application developers writing application code or users writing ad-hoc SQL queries, must fully understand the meaning of the database schema. Intuitively, if a relation schema is as close as possible to the entity-relationship schema, its meaning tends to be clear.

The InterPro example continued. Figure 3 shows the relational model that was derived from the InterPro conceptual model shown in Figure 2. Note that every entity type has been converted into one relation type, which has the same attributes as the entity type. Each entity type had an accession number, a somewhat artificial key attribute, which was not really used to describe a feature of the entities belonging to that type.
These key attributes, like entry_ac for ENTRY, serve as primary keys of the respective relations. Relationship types ENTRY2METHOD and MATCH have been converted into relation types of their own. Note that these relation types inherit the combined primary keys of the relation types that correspond to the entity types participating in the relationship, i.e., entry_ac, method_ac and protein_ac. Weak entity type DOC has been converted into its own relation type. The weak relationship between DOC and ENTRY is preserved by including the primary key of ENTRY, entry_ac, among the attributes of DOC. Relations CV_DATABASE and CV_ENTRYTYPE provide a "controlled vocabulary", hence the CV_ prefix, for the database names and entry types that may occur in InterPro. There are only a few databases from which entries are accepted so far; however, their number may increase in the future. The two relations ENTRY_ACCPAIR and PROTEIN_ACCPAIR are mainly for tracing old entries (or proteins) that have been moved or merged together and have a new primary key now. In case the old entries are needed, they can be accessed via secondary_ac.

Database design theory

Apart from the guideline for a good schema design that was mentioned above, there are other aspects that have to be considered.

Another look at the InterPro example. Assume that in the InterPro example the description of an entry type was included in the schema of ENTRY. Figure 4 shows the graphical representation of the corresponding ENTRY relation type. Now, while InterPro contains quite a large number of entries, so far there is only a small number of entry types that are supported (currently there are four: "F" for "Family", "D" for "Domain", "R" for "Repeat", and "P" for "PTM"). Each one of those four types has a description, so that any outsider who accesses the InterPro database for the first time will understand the meaning of these types. In the relation schema depicted in Figure 4 this description is included in the attribute list of ENTRY. For every tuple of ENTRY, the lengthy text that makes up the description of the respective entry type will thus be included in the tuple, and the few descriptions that are valid will be repeated many times. Obviously this is a waste of storage space, which should be avoided. There is also a more serious problem related to this kind of design: assume one user has a look at a particular entry, decides that the description of the entry type could be more precise than it is, and updates that tuple with a new description. The same entry type will still have the old description in all other entry tuples, i.e., by updating one tuple of ENTRY an inconsistency on the attribute type_descr has been introduced. This kind of behaviour is commonly called an update anomaly. There are other types of anomalies. With ENTRY as described in Figure 4, a new entry type can only be entered into the database if an entry of that type already exists there. This is called an insertion anomaly. Likewise, if the last tuple with a certain entry type is deleted from the database, then this entry type will no longer exist in the database. This is called a deletion anomaly.
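In the InterPro schema of Figure 3 the type description presumably lives in the controlled-vocabulary relation CV_ENTRYTYPE rather than in ENTRY, which avoids these anomalies by decomposition. A minimal sketch of that decomposition in SQL DDL (illustrative only; the column names entry_type and type_descr come from the text above, but the data types, lengths and constraint names are assumptions):

    CREATE TABLE cv_entrytype (
        entry_type  CHAR(1) NOT NULL CONSTRAINT pk_cv_entrytype PRIMARY KEY,
        type_descr  VARCHAR2(2000)    -- the lengthy description, stored once per type
    );

    -- ENTRY keeps only the one-character code and refers to the vocabulary table
    ALTER TABLE entry ADD CONSTRAINT fk_entry_type
        FOREIGN KEY (entry_type) REFERENCES cv_entrytype (entry_type);

Updating a description now means changing a single tuple of CV_ENTRYTYPE, a new entry type can be inserted before any entry of that type exists, and deleting the last entry of some type no longer removes the type's description.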
Second informal design guideline for relation schemas: "A relational schema should be designed so that no modification anomalies, i.e., update, insertion, or deletion anomalies, occur in the relations." Note that in some exceptional cases this principle has to be violated in order to improve the performance of certain queries (usually, the fewer joins between relations a query involves, the better its performance). The anomalies in those cases must be well documented and understood, so that updates on the relation do not end up in inconsistencies.

Third informal design guideline for relation schemas: "Design relations in such a way that null values for certain attributes do not apply to the majority of the tuples of that relation." Null values are a bit tricky. A null value for some attribute can have multiple interpretations, such as:

- The attribute does not apply to this tuple.
- The attribute value for this tuple is unknown.
- The value is known, but is absent; that is, it has not been recorded yet.

Having the same representation for all nulls compromises the different meanings they may have. Another problem is how to account for them in JOIN operations or when aggregate functions such as COUNT or SUM are applied. So, when nulls are unavoidable, they should apply in exceptional cases only.

Normal forms. So far, situations have been discussed that lead to problematic relation schemas, and informal guidelines for a good relational design have been proposed. In the following section formal concepts are presented that allow the "goodness" or "badness" of a relational schema to be defined more precisely. The single most important concept in relational design is that of a functional dependency. A functional dependency is a constraint between two sets of attributes from the database. Functional dependencies are defined under the assumption that the whole database is described by a single universal relation schema, i.e., all attributes that occur in the database schema belong to one relation schema (this is only an abstraction that simplifies the following definitions; later these definitions and concepts will be applied to the relations that have been derived from the entity-relationship schema). A functional dependency, denoted by X → Y, between two sets of attributes X and Y of the universal relation schema states that for any two tuples, if they have the same values on X, they must also have the same values on Y. The values of the Y component of a tuple are thus determined by the values of the X component; or alternatively, the values of the X component of a tuple uniquely (functionally) determine the values of the Y component. A functional dependency is one aspect of the meaning (or semantics) of the attributes of the universal relation schema. It must be defined explicitly by someone who knows this semantics. Specifying all the functional dependencies on the attributes of a database schema is a step in the (conceptual) design process that must be performed by a person who is familiar with the real-world application; it cannot be done automatically. Once a set of functional dependencies is given, more functional dependencies can be inferred from them by simple rules. Normal forms are defined on the set of all functional dependencies, the given and the inferred ones. The normalisation process, as first proposed by Codd, takes a relation schema through a series of tests to certify whether or not it belongs to a certain normal form.
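As a concrete illustration (a sketch; these dependencies are read off the discussion of Figure 4 above rather than stated formally in the original), the modified ENTRY schema that includes the type description satisfies, among others, the following functional dependencies:

    entry_ac   → entry_type, type_descr    (and every other attribute of ENTRY, since entry_ac is the key)
    entry_type → type_descr                (each of the four entry types has exactly one description)

The second dependency, whose left-hand side entry_type is not a key of ENTRY, is exactly what lies behind the update, insertion and deletion anomalies described above; it is the kind of dependency the normal forms defined below are meant to expose.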
Normalisation of data can be looked upon as a process during which unsatisfactory relation schemas are decomposed, by breaking up their attributes, into smaller relation schemas that possess certain desirable properties. One objective of the (original) normalisation process is to ensure that the modification anomalies discussed in the previous section do not occur. Before proceeding with the definitions of normal forms, the concept of keys needs to be reviewed. A superkey of a relation schema is a set of attributes with the property that no two tuples of any legal extension of that schema will have the same values on all those attributes. A key is a superkey with the additional property that the removal of any attribute will cause it to no longer be a superkey, i.e., a key is a minimal set of attributes that form a superkey. If a relation has more than one key, each one is called a candidate key and one candidate key is arbitrarily chosen to be the primary key. An attribute is called prime if it is a member of some key; otherwise the attribute is called nonprime.

The first normal form (1NF) is now considered to be part of the formal definition of a relation. It states that the domains of attributes must include only atomic (indivisible) values and that the value of any attribute in a tuple must be a single value from the domain of that attribute. The second normal form (2NF) is based on the concept of full functional dependency: a functional dependency X → Y is a full functional dependency if the removal of any attribute from X means that the dependency no longer holds. A relation schema is in second normal form (2NF) if every nonprime attribute is fully functionally dependent on every key of the relation schema. A relation schema is in third normal form (3NF) if, whenever a functional dependency X → A (A being a single attribute) holds, either (a) X is a superkey or (b) A is a prime attribute of the relation schema. A relation schema is in Boyce-Codd normal form (BCNF) if, whenever a functional dependency X → A (A being a single attribute) holds, X is a superkey of the relation schema. Note that BCNF is slightly stricter than 3NF, because condition (b) of 3NF, which allows A to be prime if X is not a superkey, is absent from BCNF. Usually it is considered best to have relation schemas in BCNF. If that is not possible, 3NF will do; in practice, most relation schemas that are in 3NF are in BCNF anyway. To see the "historical" relation between 2NF and 3NF, the concept of transitive dependencies has to be defined: a functional dependency X → Y is a transitive dependency if there is a set of attributes Z that is not a subset of any key of the relation schema and both X → Z and Z → Y hold. Using the concept of transitive dependencies a more intuitive definition of 3NF can be given: a relation schema is in 3NF if every nonprime attribute is fully functionally dependent on every key of the relation schema (2NF) and nontransitively dependent on every key of the relation schema.

The technique for relational database schema design described in this paper is usually referred to as top-down design. It involves designing a conceptual schema in a high-level data model, such as the ERM, and then mapping the conceptual schema into a set of relations.
In this technique the normalisation principles, such as avoiding transitive or partial dependencies by decomposing unsatisfactory relation schemas, can be applied both during the conceptual schema design and afterwards to the relations resulting from the mapping algorithm. Note that normal forms, when considered in isolation from other factors, do not guarantee a good database design. Unfortunately, it is often not sufficient to check separately that each relation schema in the database is in 3NF or in BCNF. Rather, the process of normalisation must also confirm the existence of additional properties that the relational schemas together should possess (for example, there are certain restrictions that ensure that two relations resulting from the decomposition of a relation schema that is not in 3NF or BCNF produce exactly the relation they were derived from when they are joined). These concepts are beyond the scope of this paper; in the literature they can be found under the keywords lossless join property and dependency preservation property.

A short introduction into SQL

Most commercial DBMSs provide a high-level declarative language interface. By declarative we mean that the user has to specify what the result of his query should be (not how it should be evaluated), leaving the decision on how to execute and optimise the evaluation of a query to the DBMS. SQL (originally SEQUEL, short for Structured English Query Language) is a declarative, comprehensive database language; it has statements for data definition, query, and update. In this section the terms table, row and column are used for relation, tuple, and attribute.

Data definition in SQL. The following SQL statement creates table ENTRY of the InterPro example:

    CREATE TABLE entry (
        entry_ac    VARCHAR2(9) NOT NULL CONSTRAINT pk_entry PRIMARY KEY,
        entry_type  CHAR(1),
        name        VARCHAR2(80),
        created     DATE NOT NULL,
        timestamp   DATE NOT NULL
    );

NOT NULL constraints are usually specified directly in the CREATE TABLE statement; other constraints, e.g. those that define a certain column to be the primary key of the table, can be included in the CREATE TABLE statement (as shown above for pk_entry), or they can be stated later with an ALTER TABLE statement. The above is a very simple example of CREATE TABLE. Some DBMS allow users to specify a lot of storage details as well, which requires a certain knowledge of the DBMS's storage management, i.e., users who want to create their own objects in the database need special training before they can start doing so.

Queries in SQL. SQL has one basic statement for retrieving information from a database. Its most general form is:

    SELECT <attribute_list>
    FROM <table_list>
    WHERE <condition>

where <attribute_list> is a list of attribute names whose values are to be retrieved by the query, <table_list> is a list of the relation names required to process the query, and <condition> is a conditional search expression that identifies the tuples to be retrieved by the query; it can be empty. The following two examples show the most important relational operations: JOIN, SELECTION, and PROJECTION.

Example 1: "Which databases are covered in InterPro?"

    SELECT dbname
    FROM cv_database;

In terms of relational algebra this is a projection on attribute dbname of relation CV_DATABASE. Note that there is no condition specified in this query, because we want the names of ALL databases that are in InterPro. Assume that we just want the dbcode of "Pfam".
In this case the query would have to be rewritten as:

    SELECT dbcode
    FROM cv_database
    WHERE dbname = 'Pfam';

This is a selection of all the tuples of CV_DATABASE that satisfy the condition "dbname = 'Pfam'", followed by a projection on attribute dbcode. (SQL has been criticised for using the SELECT clause for specifying a projection and doing the selection of tuples in the WHERE clause, thus confusing the users. It has become a standard anyway.)

Example 2: "What is the InterPro entry name for fingerprint ACONITASE?"

    SELECT entry.name
    FROM method, entry2method, entry
    WHERE method.name = 'ACONITASE'
    AND method.method_ac = entry2method.method_ac
    AND entry2method.entry_ac = entry.entry_ac;

Here we have a join on tables METHOD, ENTRY2METHOD, and ENTRY, combined with a selection of all tuples that satisfy the condition "method.name = 'ACONITASE'", followed by a projection on entry.name. Note that although the value to be retrieved stems from table ENTRY, both tables ENTRY2METHOD and METHOD (and of course ENTRY) must occur in the FROM clause, because all three tables are needed to process the query. Unlike Example 1, where only one table was involved, there is now an ambiguity with respect to column names, which is why table names are put in front of column names (table names and column names are separated by a "."). Note also that the WHERE clause (a) holds the condition that the method name equals 'ACONITASE' and (b) links tables METHOD, ENTRY2METHOD, and ENTRY via their "join" attributes. Whenever tuples are constructed from these three tables, only those tuples are of interest that follow the "links" on the method_ac and entry_ac columns; these joins must be explicitly stated as conditions in the WHERE clause.

More examples for SQL queries. The following examples show SQL queries that demonstrate the expressive power of SQL. They use aggregate functions and GROUP BY statements. A discussion of all features of SQL is way beyond the scope of this paper.
'How many proteins are there in InterPro?'

    SELECT COUNT(*)
    FROM protein;

'How many proteins are in SWISS-PROT and how many are in TrEMBL?'

    SELECT cvd.dbname, count(*)
    FROM protein p, cv_database cvd
    WHERE p.dbcode = cvd.dbcode
    AND cvd.dbname IN ('SWISS-PROT', 'TrEMBL')
    GROUP BY cvd.dbname;

'How long is the average protein in SWISS-PROT and in TrEMBL?'

    SELECT cvd.dbname, avg(len)
    FROM protein p, cv_database cvd
    WHERE p.dbcode = cvd.dbcode
    AND cvd.dbname IN ('SWISS-PROT', 'TrEMBL')
    GROUP BY cvd.dbname;

'How long is the average method/match?'

    SELECT avg(pos_to - pos_from)
    FROM match;

Same as above, but this time for each database separately:

    SELECT cvd.dbname, avg(pos_to - pos_from)
    FROM match m, method me, cv_database cvd
    WHERE m.method_ac = me.method_ac
    AND me.dbcode = cvd.dbcode
    GROUP BY cvd.dbname
    ORDER BY 2;

'Which methodology matches most proteins?'

    SELECT cvd.dbname, count(distinct protein_ac)
    FROM match m, method me, cv_database cvd
    WHERE m.method_ac = me.method_ac
    AND me.dbcode = cvd.dbcode
    GROUP BY cvd.dbname
    ORDER BY 2;

'Which methodology matches most amino acids?'

    SELECT cvd.dbname, sum(pos_to - pos_from)
    FROM match m, method me, cv_database cvd
    WHERE m.method_ac = me.method_ac
    AND me.dbcode = cvd.dbcode
    GROUP BY cvd.dbname
    ORDER BY 2;

'Which patterns match most proteins (and we want the really big ones only...)?'

    SELECT me.name, count(distinct protein_ac)
    FROM match m, method me
    WHERE m.method_ac = me.method_ac
    GROUP BY me.name
    HAVING count(distinct protein_ac) > 1000
    ORDER BY 2 desc;

'Which patterns match the same proteins more than 10 times?'

    SELECT me.name, p.name, count(*)
    FROM protein p, match m, method me
    WHERE p.protein_ac = m.protein_ac
    AND m.method_ac = me.method_ac
    GROUP BY me.name, p.name
    HAVING count(*) > 10
    ORDER BY 3 desc;

'What are the matches for protein Q10466?'

    SELECT me.name, cvd.dbname, pos_from, pos_to
    FROM method me, cv_database cvd, match m
    WHERE m.protein_ac = 'Q10466'
    AND m.method_ac = me.method_ac
    AND me.dbcode = cvd.dbcode
    ORDER BY pos_from, me.name;

Further reading

Elmasri, Navathe: "Fundamentals of Database Systems", Addison Wesley, third edition, 1999.
Ullman: "Principles of Database and Knowledge-Base Systems" (Volumes 1 and 2), Computer Science Press, 1988/1989.
Date: "An Introduction to Database Systems", Addison Wesley, fifth edition, 1990.
Maier: "The Theory of Relational Databases", Computer Science Press, 1983.

Figure 1. Phases of database design (adapted from Elmasri, Navathe: Fundamentals of Database Systems).
Figure 2. The entity-relationship model of InterPro.
Figure 3. Relational schema of InterPro.
Figure 4. Modified schema of ENTRY.