CM2006/M:23 IDOD: an integrated marine environmental database

advertisement
CM2006/M:23
IDOD: an integrated marine environmental database
K. De Cauwer, M. Devolder, S. Jans, A. Meerhaeghe, S. Scory
Belgian Marine Data Centre, Management Unit of the North Sea Mathematical
Models, Gulledelle 100, B–1200 Brussels, Belgium, http://datacentre.mumm.ac.be/
Abstract
In 1997, the Belgian Marine Data Centre started developing an integrated marine
environmental database. The initial core model was designed taking into account the
characteristics of concentration data in seawater. This model was soon expanded to
cope with concentrations in air, sediment and biota. Nowadays, also spectral data
(optical properties, sediment granulometry), observations on living organisms and
biodiversity information are stored in the database. Very recently, the structure was
complemented with a taxonomic module that allows the storage of vernacular names,
taxonomic changes, determinations to any level of detail as well as revised
determinations of a given specimen. Next to the storage of measurement results,
special attention is paid to the storage of a comprehensive set of meta-information like
the corresponding research projects, the methodology and the related quality control
information. The main thought about storing and linking results of various disciplines
together is to improve the knowledge of our marine ecosystem in its broadest point of
view and to make the life of the users of the system easier. The online interface
provides many search options and several export formats, in order to allow everyone
to create a subset of the data suited to his/her very own needs. In this paper, we
present the structure of the IDOD database, the technical choices we made to reach
our goals and our plans for the future.
1. Introduction
The Management Unit of the North Sea Mathematical Models (MUMM) is the
Belgian federal scientific service responsible for the protection of the marine
environment. It has since its creation in 1976 always been involved in data collection
and data management. At the end of the 90’s, the need for a new relational database
management system became evident. Thanks to the funding of the Belgian Federal
Science Policy, the IDOD project was then launched to cover all the data management
needs of both the structural monitoring programmes performed by and on behalf of
the State and the various research projects performed in the universities and similar
scientific institutions.
An increasing variety of datasets needed to be stored, archived and disseminated to
the scientists, the policy makers and the general public. The data and/or metadata of
the monitoring and research efforts needed to be reported to international
organisations and networks according to predefined formats (e.g. ICES, European
Directory of Marine Environmental Datasets (EDMED) …). Since 1998, an integrated
marine environmental database with supportive tools was developed. In this database
different data types, like spectral data, biodiversity information, observations and
concentrations in seawater, air, sediment and biota, are well documented and
integrated in a sole structure. By integrating the different dissemination and reporting
functions as well, the information system facilitates greatly the data managers work.
2. Methodology
The most important and first phase of database design is the creation of the conceptual
model describing the information to be stored as abstract objects and their
associations. With the full integration of all data types into one model in mind, we
addressed first the case of the concentrations of substances and similar parameters in
sea water. Once that model had been validated, it was expanded to concentrations in
air, sediment and biota. Afterwards, spectral data, observations and biodiversity
information were added.
Several bilateral contacts with different data providers were necessary to understand
the characteristics and to foresee the import of their data sets. The pan-European data
catalogues EDMED, European Directory of Marine Environmental Research Projects
(EDMERP), CSR and the ICES environmental reporting format were taken into
account to ensure the exchangeability of our holdings in the required format.
All concepts to be stored were carefully described:
-
-
each entity with all its attributes (field type, format, optionality, ...) and a
primary key,
each relation with its cardinalities (minimum and maximum occurrences that
can exist for a relation from an entity to another) and its transferability (“is the
relation updatable?”),
an ‘Entity – Relation’ diagram,
the business rules to maintain the quality and consistency of the data.
For most parts, the three first normal forms of the database conception rules were
checked. Normalization is a process that eliminates redundancy, organizes data
efficiently, reduces the potential for anomalies during data operations and improves
data consistency. Edgar F. Codd (1970) originally established three normal forms,
later refinements were made as additional normal forms. Normalization is described
in more detail by Morris (2005) and for example on Wikipedia (2006).
The main objective, the storage of samples and results, was kept in mind to avoid an
overnormalisation and complex design requiring a lot of programming effort. As
such, the structure for e.g. publications and authors was not fully normalized.
Whenever this part would gain in importance, the structure can still be adapted.
During the logical design phase, the functional model is converted into a technical
design. The relations between entities are translated in foreign keys, or transition
tables in case of many-to-many tables, the indexes to be created were reviewed and
the triggers and procedures were written. The resulting table definitions can directly
be implemented as ‘physical’ tables by executing the data definition queries. The
CASE tool Designer/2000 was used to construct the conceptual model, to convert this
model into table definitions and to build the data definition SQL code. All
documentation of the system, including the change history, is stored in the CASE tool.
3. The conceptual model
3.1. General structure
A general view of the resulting structure is given in Figure 1. The central part of the
scheme represents the physical samples, grouped by sampling event (same place and
time). They are directly linked to the values for water and biodiversity measurements
and divided into subsamples in the case of sediment and biota samples. The spectral
values are related to samples via the entity SPECTRA.
Samples are collected in the frame of a “project”, during an oceanographic
“campaign” on a “platform”. The platform can be a research vessel, "on foot", an
automatic measurement station, etc. Specification of the involved services and
persons are important for both the campaign and the project. The field ‘reference
sampling depth’ (e.g. ‘bottom’, ‘subsurface’…) is added to allow the export of a
tabular format in which values of different samples can be grouped on same place,
time and depth.
SAMPLES are thus characterised by a position, sampling depth, time and sampling
method. The position can either be an exact position or the reference position of the
sampling station to which it is linked. In case of tracks, the start and end positions and
times are entered. A flag indicates when the time reference system is not known
(otherwise converted to UTC). Similarly, another flag states when the position
reference system is unknown.
Services
Campaigns
Persons
Sampling events
Platforms
Stations
Samples
Projects
Sampling gears
Taxa
Water
values
Sediment
subsamples
Spectra
Sediment
values
Spectral
values
Biodiversity
Biota
values
Matrices
Observations
Categoric
parameters
Analysis methods
Parameters
Biota
subsamples
Units
Figure 1. General structure of IDOD database
As parameters measured on water and air samples have the same characteristics, those
results are stored in the same entity WATER VALUES. Sampling depths are negative for
air and positive for water and the matrices differ: ‘rainwater’, ‘aerosols’ or ‘gas’ for
air; and ‘dissolved’, ‘particulate’ or ‘total’ for water.
The optical data like total backscattering coefficient are characterized by a range of
values at given wavelengths. They are grouped by SPECTRA, the entity defining the
analysis method, project and processing flag.
For sediment data, the entity SEDIMENT SUBSAMPLES was created for the storage of
depth profile data. Extra attributes are necessary to define the fraction sizes of the
sediment analyzed.
A sample of living organisms is often subdivided in individuals or pooled subsamples
of a given species or species group (defined by TAXA, see section 3.2). These BIOTA
SUBSAMPLES can be characterized by physical characteristics like weight and length.
Numerical measurement results like contaminant concentrations on a specific tissue of
the subsample are stored as BIOTA VALUES while OBSERVATIONS contain the so-called
‘categorical values’. OBSERVATIONS contain any data like oil coverage in %, presence
of parasites or cachexy index for birds of a given subsample of specimen(s).
BIODIVERSITY measurements, like bird counts, phytoplankton biomass and mammal
abundance, are directly linked to the sample and to a given species or species group
(TAXA). The same specimen can be determined several times by several people over
the years. Especially in the case of historic collections, redeterminations and
taxonomic revisions are very important. Actually, the BMDC is involved in the
digitization process of the Gilson collection. These samples, collected by G. Gilson in
the Southern North Sea between 1899 and 1910 are stored at the Royal Belgian
Institute of Natural Sciences. The number of determinations by different persons
during the last century varies from specimen to specimen. By linking the biodiversity
results to themselves, any number of determinations of a specimen can be recorded
with specification of date and observer.
All resulting values are linked to an ANALYSIS METHOD, that describes and links to all
information on the analysis, sample handling, parameter, matrix, unit and analysis
laboratory. To incorporate information on the quality of the analysis method, two
different aspects are introduced: intercomparison exercises and control chart
information.
3.2. Taxonomic structure
At the early stage of the development process of the database, a simple denormalized
structure for information on species was generated to link with the results of densities
and biomass of given species, observations on species and contaminant concentrations
in specimens. Although that initial structure was useful for species with a wellestablished taxonomy and determinations to the same level, it proved to be
insufficient to cope with all the datasets we had to handle.
In practice, datasets pertaining to living resources range from contaminant
concentrations in well-known commercial fish species to densities of a dynamic
taxonomic group like nematodes. In the latter case, the problem is even more complex
because determination is not always done at the species-level. New species are
encountered and reported before they have an official scientific name. Also some
benthic faunal groups were determined to ecological community level, with for each
community a list of species, subspecies and genera of which it consists.
A flexible system was needed in order to deal with the hierarchal nature of the data,
like in taxonomic databases (European Register of Marine Species (Costello, 2004),
(Costello, 2000), Integrated Taxonomic Information System (ITIS, 2005)). Also
identifiers of taxa in international taxonomic databases, synonyms and informal
names needed to be stored. Informal names are interpreted as those names which are
not part of the official taxonomic vocabulary like temporary names given by
biologists, names in different languages and commonly used names like zooplankton.
In the entity TAXON (see figure 2), the scientific taxon name at a given level (link with
TAXON_LEVEL) is entered and the mandatory link with the parent as a link to itself.
This relation is one-to-many: each taxon has one and only one parent, each taxon has
zero to many child taxa. Besides the scientific names, also non-official names or
‘clades’ are entered in this entity and the relation with the scientific names is
established via the intermediate table TAXON_CLADE to reflect the many-to-many
relation. The term ‘clade’ is used in our database to define all informal names. Clades
and taxa are stored in the same table. Only the informal names are linked to
CLADE_SCHOOL. CLADE_SCHOOL defines the type of informal name and has entries
like ‘English’, ‘ERMS ID’, ‘Euring’ or ‘name given by person x of service x’. This
way a very flexible system is obtained that is able to deal with the uncertainty in
determinations and taxonomy. The results are stored in the way they are reported by
the data originators.
Figure 2. E-R diagram for taxonomy.
For example, in some cases it is hard to distinguish two or more species. Data
providers report the results with the name of both species, e.g. zooplankton species
Chaetoceros compressus-costatus. In the IDOD database the results will be linked to
the non-official name Chaetoceros compressus-costatus that is entered in the table
TAXA. The individual species are entered as well and the link between the informal
name and the two official species is made via TAXON_CLADE. The clade school, linked
to the informal name Chaetoceros compressus-costatus, defines who has given the
name.
Another example is the determinations of Copepoda to a community level like
Copepoda epibenthic species. The densities are directly linked to the group of species
while this group is linked to the official species of which it may consist.
Also the taxonomic discrepancies as described by Chavan et al. (2005) like
differences in taxonomic hierarchies, differences in spelling and homonyms can be
imported as ‘clades’ in this structure.
Although the structure is capable of, it is not the aim of the Belgian Marine Data
Centre to store the complete taxonomic tree. Instead, only taxa for which
measurements or observations exist are entered in the database. Reference is made to
ERMS (European Register of Marine Species) and ITIS (Integrated Taxonomic
Information System) for the complete and most recent taxonomic information.
3.3. Structure to handle user access rights
Users are able to select and download data online. With regards to the data policy,
users do not have the same rights on all data sets. The practical data policy rules can
also depend on the project for which the data are gathered.
For instance, results of official monitoring campaigns are available for the public
immediately after their incorporation in the data base. For the research projects in the
frame of national research programmes, an embargo of two years after the data
transmission deadline is foreseen. After the embargo all data can be used by anyone
respecting the data access rules. Before the embargo, the data are private and only
available to persons nominated by the data originator. Besides this, a few completely
private datasets also exist in the database.
To respect all these rules, two fields in the table PROJECT are added and a table
SPECIAL ACCESS was created as shown in Figure 3. A project can be flagged as either
public or private. For private projects an embargo period can be entered after which
the data become public. The persons granted access to data of a given project, even
when a project is flagged as private, are linked to that project via the table SPECIAL
ACCESS. Business rules ensure that data are only returned when allowed: when a query
for data is executed by a given user, the project of every value is checked. When the
project is flagged as private and the embargo period is not passed, a procedure checks
if the person is granted access for this project. When the user has access, the values
are returned, if not, only the existence and number of private values are mentioned.
Figure 3. E-R diagram for user access.
All requests performed are stored in the table REQUESTS. The user can recall an
existing request to modify it or to retrieve newly imported data meeting the same
selection criteria, if any. The table is also used to log the data selection activity.
4. Database
The resulting database (Oracle8i) consists of more or less 115 tables of which 39
reference tables and 19 tables used for importing and reporting. Business rules are
implemented as PL/SQL procedures and triggers. Triggers are used on every table to
record auditing information like by whom and when the record is created and updated.
For the table SAMPLES and REQUESTS a complete journaling table is kept. For every
change (update, delete, insert), a record is stored with the user name and date of
change. Imports of data are usually done with SQL commands and triggers on import
tables verifying data integrity and consistency.
Migration to Oracle 10g is foreseen in the near future. Java was used to create the first
version of the user interface. The new interface (see next section) is written in the
scripting language php.
5. User interface
An online user interface has been created to enable users to select and download data
after registration. The requested data type should be selected first. In a next step,
several criteria, grouped in tab pages, can then be specified. Species or higher levels
can be selected in a taxonomical tree or can be entered as free text. Geographical
selection on a map will be available in the near future (Meerhaeghe et al., 2006). The
users are able to edit and launch their previously made requests. The resulting dataset
is returned in record format with one line per value. Optionally, also a tabular format
can be selected as output format, in which the data are presented in columns per
parameter or parameter-method suitable for data analysis. All documentation like the
lists of codes used, explanation of the output format and an inventory of the data
available in the IDOD database is presented on the web site.
Figure 4. User interface.
From the developer’s point of view, a disadvantage of the highly normalized
taxonomic structure is the quite complex SQL codes necessary to visualize the
taxonomic tree or to retrieve data of a given taxon. An advantage of OracleSQL is the
support for querying hierarchical data with the START WITH and CONNECT BY clauses
allowing a hierarchical presentation of the data (Gennick, 2003). Other database
systems would need a code that recursively issues a series of SQL statements to walk
up or down the tree (Morris, P.J., 2005).
The user, however, can be sure that all results that are of possible interest to him/her
are returned. When a taxon of any level is selected in the query interface, SQL code
ensures that not only results for that taxon are returned but also for all ‘child’ taxa,
synonyms of the selected taxon and informal names linked to the selected taxon. For
example, when Chaetoceros compressus Lauder 1864 is selected in the tree on the
query interface, the results for the non-official name Chaetoceros compressuscostatus are also returned. Together with the resulting text file, a file is returned with
information on the non-official names used.
In Figure 5 an example of such a file is given for ‘Copepoda communities’. The
‘Copepoda ecotype 3 – Interstitial species’ community, accompanied with
information about its level, parent and clade school, encompasses a list of different
official species and genera. If however an informal species name is encountered in
this list, e.g. Leptastacus cfr. coulli in line 5, this name returns indented, on the next
line in the column ‘clade’ resembling the official species Leptastacus coulli.
Figure 5. Example of file with information on the non-official names used.
With regards to the redeterminations of biodiversity data, the number of previous
determinations is given as an additional column in the results file. Only the most
recent determination is returned in the file. All previous determinations, if any, can be
requested by mail.
6. Discussion
The IDOD database is constructed for the long-term archival of a large range of
different data types. All marine analytical results and observations are stored with the
corresponding meta-information necessary for a good understanding and
interpretation of the results. In the IDOD database, the results related to living
resources are linked to a hierarchical taxonomic structure that also enables the storage
of vernacular names and revisions of taxonomy. An online user interface was created
for the access to the different data types. In the query interface, different criteria can
be selected as well as a tabular output format besides the usual record format.
We can not forget that the usefulness of such an information system is determined by
its contents (Diepenbroek et al., 2002). Table 1 shows the number of data actually
available in the database. Taking into account that the import of data started only mid2001 by a small team (2.5 full time equivalents) also responsible for developments,
and the number of datasets in the ‘import’-queue, these figures are really encouraging.
Table 1. Contents of the IDOD database.
Table
Nr of records
Projects
252
Campaigns
695
Samples
54942
Parameters
393
Methods
668
Data items
Taxonomic items
324389
4211
Even though design and developments costs are high, the data managers, data users
and moreover the quality of the data benefit from the system. It is the central tool used
daily for a range of data management functions. Since its opening in July 2002,
around 340 persons registered to use the database and 800 requests were executed
with, for every request, almost two runs: with or without modified selection criteria.
New developments are continuously ongoing to deal with the external changes, to
improve the performance and to broaden the range of data types. It has always been
possible to extend the model without major problems. As a next step, an indexing and
inventory system for large data files like multibeam records, video or seismic data is
foreseen.
Acknowledgements
The ongoing development of the IDOD information system is partly funded by the
Belgian federal Science Policy Office.
References
Chavan, V., Rane, N., Watve, A. (2005). Resolving taxonomic discrepancies: role of
electronic catalogues of known organisms. Biodiversity Informatics 2 (2005) 70-78.
Codd, E.F. (1970), "A Relational Model of Data for Large Shared Data Banks",
Eprint, Communications of the ACM, 13(6) 377–387.
Costello, M.J.; Bouchet, P.; Boxshall, G.; Emblow, C.; Vanden Berghe, E. (2004).
European Register of Marine Species. http://www.marbef.org/data/erms.php (28-Jul2006).
Costello,M.J. (2000). Developing species information systems: the European Register
of Marine Species (ERMS). Oceanography Vol. 13 No. 3/2000.
Diepenbroek, M., Grobe, H., Reinke, M., Schindler, U., Schlitzer, R., Sieger, R. and
Wefer, G. (2002). PANGAEA – an information system for environmental sciences.
Comuters & Geosciences 28 (2002) 1201-1210.
Gennick, J. (2003). Querying Hierarchies: Top-of-the-Line Support.
http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/gennick_co
nnectby.html (28-Jul-2006).
ITIS (2004). About ITIS. Standards and Database documentation.
http://www.itis.usda.gov/standard.html (28-Jul-2006).
ITIS (2005). Integrated Taxonomic Information System. http://www.itis.usda.gov/
(28-Jul-2006).
Meerhaeghe, A., De Cauwer, K., Devolder, M., Jans, S. and Scory, S. (2006).
Revealing species communities in a spatial and temporal overlap. ICES Annual
Science Conference 2006, Maastricht.
Morris, P.J (2005). Relational Database Design and Implementation for Biodiversity
Informatics. PhyloInformatics 7 (2005) 1-66.
Wikipedia (2006). http://en.wikipedia.org/wiki/Database_normalization (28-Jul2006).
Download