Introduction

While a vast amount of information about the environment and biodiversity can be found in various locations
around the world, the comprehensive gathering, organisation, evaluation and dissemination of such information
is a challenge which can be addressed only by a variety of distributed and centralized, interoperable and
connected services. Such systems should be web-based and provide a common framework for the management
of this information.
In this chapter, we describe the requirements for the effective and efficient integrated management of the
existing biodiversity resources and the creation of a common e-services based infrastructure supporting the
access, organisation, evaluation and dissemination of such data. The existing biodiversity resources include the
entire spectrum from research notes to paper publications, from local IT systems and disparate files to
databases accessible on the Web, from observatories to specimen collections and microbiological samples, from
species catalogues to –omics databases, and from experimental software packages to Web-services with
standard interfaces and behavior.
We classify the users into three major groups. These are:
End Users (e.g., the general public, policy makers, industry) need to locate and extract data that matches their
interests, or to find appropriate data servers from which to retrieve data of the desired level of quality. They put a high
value on the accessibility, interpretability and usefulness of data.
Brokers (e.g., environmental scientists, public authority administrators) maintain secondary information services
for end users or answer administrative inquiries; they maintain or write new software to access measurement
databases and remote sensing data, to build geographical databases, to construct thematic maps, and to
improve the reliability of data using consolidation techniques. In most cases these programs use
multiple data sources, and each data source typically requires its own program to extract the data it provides.
Data and service providers (e.g., biologists, geologists, physicists, ornithologists, entomologists) collect empirical
data, evaluate them and provide access to them for interested parties or the public. This task includes standard
database functions or database services fed and updated by automatic sensors that transmit their data directly
into them, as well as quality assurance of data. They maintain or write new software ranging from standard data
evaluation to innovative research tasks.
Although the HELBIONET infrastructure focuses primarily on the needs of biologists and biodiversity professionals of
the research community, it will be geared to supporting a much wider range of users who produce biodiversity
information or for whom biodiversity information is relevant. Such uses include: industry involved in agriculture and
fishing, industry under environmental control, political decision support, environmental management and
control, and support of other European research infrastructures in the biodiversity domain.
This chapter will examine the technological aspects of a biodiversity infrastructure for accomplishing the
following functionality:
- Finding aids: the capability to obtain comprehensive information about all information sources and
services, their locations and access methods that may be relevant for a particular user question
or research problem, via one virtual homogeneous access point based on centralized
metadata repositories. This includes electronic records and Web services, but also paper
archives and publications, and databases opaque to the Web.
- Centralized service-directory services: services describing the content of and access to electronic
collections and other Web services in a way that compatible ones can be identified and
connected in scientific workflows.
- Access: ability and permission management to access or download all relevant information that
exists in electronic form from any location with Internet access; ability to mediate or transform
between major standard data formats; ability and permission management to invoke Web services
for complex evaluation tasks, including the ability to interlink automated observatories.
- Semantic interoperability and description: homogeneous core metadata capturing the key
concepts and semantic relationships needed to characterize the relevance of a data set or service for
an information request or research problem, such as correlating professional scientific or amateur
observations with species, habitat and molecular data and various theories on the behaviour and
evolution of biodiversity, in particular in order to give answers to complex decision problems in
environmental policy.
- Managing the dynamic resolution and curation of co-reference, i.e. linking identical resources,
and references to identical resources referred to by different names or identifiers, thus leading to
an ever-increasing state of semantic connectivity of the distributed resources in the
infrastructure.
Scope
The semantics of the biodiversity universe of discourse can be characterized by the following core conceptual
model of fundamental categories and their relationships. It is a high-level picture of the relationships between
human activities and their targets, as they appear at the highest level of description of all scientific products
and services. These are the categories necessary for a first-level selection of relevant information, and for managing
and maintaining the referential integrity of the most fundamental contextual information. It is only at a more
specialized level that the entities in this model are refined by specific, open-ended terminologies
(typologies, taxonomies) and a relatively small set of more specific relationships among them:
[Figure: core conceptual model of the biodiversity domain, relating services, global indices, databases, Publications, Records, Samples, Individuals/specimens, collections, Species, the molecular world and its parts, Activities (Observations, Simulations, Evaluations), Habitats, Place and Time, through relationships such as "exemplifies", "is about", "appear in", "cross reference", "create" and "maintain".]
The general idea is that biodiversity Activities (such as Observations, Simulations, Evaluations) of human
Actors deal primarily with individual items or specimens coming from habitats or laboratories and occur in a
particular space-time. Evidence of these activities is produced in the form of physical collections of items
(green) and information objects (blue) which are about themselves and the observed world. The
biodiversity world produces categorical theories (yellow) at the species and population level and on the
associated micro- and molecular phenomena, and regards primary observations and physical items as
examples of those theories. The data records are stored in information systems accessible via services.
Global indices can be created for accessing them comprehensively. The databases also keep information
about species, activities, habitats, etc. (Generically, we regard a habitat as a set of coherent physical
phenomena and their constituents that can be observed over a specific place and time period.)
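To make the conceptual model more concrete, the sketch below expresses one observation and its context as RDF statements using rdflib; the namespace and the class and property names are illustrative placeholders, not an agreed HELBIONET or CIDOC CRM vocabulary.

```python
# Minimal sketch (assumed vocabulary): one observation of a specimen from a
# habitat, carried out by an actor at a place and time, exemplifying a species.
from rdflib import Graph, Namespace, Literal, RDF

HB = Namespace("http://example.org/helbionet/")   # hypothetical namespace
g = Graph()

obs, specimen, species = HB.obs_001, HB.specimen_42, HB.species_X
g.add((obs, RDF.type, HB.Observation))            # an Activity subtype
g.add((obs, HB.carried_out_by, HB.actor_001))     # human Actor
g.add((obs, HB.took_place_at, HB.place_crete))    # Place
g.add((obs, HB.has_time_span, Literal("2010-05-14")))
g.add((obs, HB.observed, specimen))               # Individual/specimen
g.add((specimen, HB.taken_from, HB.habitat_wetland_x))
g.add((specimen, HB.exemplifies, species))        # link to the species level

print(g.serialize(format="turtle"))
```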
Biodiversity research functions may include: species declaration, determination and other kinds of scientific
classification, biodiversity monitoring, hot-spot comparison, phylogenetic and biogeographic analytical tools,
virtual population analysis tools, simulation and prediction tools, multivariate analyses, geospatial management,
etc.
In the above world of biodiversity we consider different levels and kinds of metadata according to the nature of
the described units and the way they can be accessed. These are:
- Data record metadata
- Collection-level metadata describing a digital collection or part of it as one unit
- Metadata of live digital sources (services providing records from automatic sensors or regularly updated databases)
- Metadata of computational services and tools
- Collection-level metadata describing a physical collection or part of it as one unit
- Physical collection item metadata (specimens, samples)
- Paper publication metadata (library records)
- Metadata about paper archives, describing a paper/photo collection or part of it as one unit
The above conceptual model will be refined and used for providing access points and cross-linking for all the
above metadata. In addition, central semantic knowledge services are needed to maintain a global notion of
identity for key national people (see www.viaf.org), activities, habitats and places (georeferencing services),
with suitable linking into the respective international resources. Following the conceptual model above, these
services are needed to connect the source metadata with a clear-cut identification of the associated contextual
elements of observations, on which their understanding and interpretation ultimately rely. Into this generic
conceptual framework, state-of-the-art management of distributed, interlinked terminologies ("ontologies")
plugs in.
State of the Art
Currently there is much effort on distributed systems that handle massive data loads in conjunction with
semantic information about distributed content. On one side, there is the huge body of state-of-the-art work of the bioinformatics
world, including global integration efforts for molecular biology data ("omics"), specimen holdings and species
catalogues. It is not the aim of this infrastructure to replace or improve these efforts, but to connect such kinds
of services and to complement them with an even more generic global directory service which, in turn,
allows "data cloud" and "data grid" like structures to be maintained. State-of-the-art biodiversity services
appear in this architecture as components maintained by disciplinary subgroups of the infrastructure users.
It is only very recently (within about the last two years) that scalable RDF triple store solutions have allowed the implementation of very large
integrated metadata repositories with deep referential identity and integrity, and therefore the tracing of related
resources across all kinds of sources and services, via rich contextual associations that could never before be
established in this comprehensive form.
It should become clear that an integrated semantic access as outlined above is fundamentally different from
Web search engines. The success of the latter is based on the fact that information about some key concepts is
highly redundant on the Web; therefore users can easily be guided towards some relevant content. In a
descriptive science such as biodiversity, users look for facts that have no name as such and are observed only once.
To enable such research, we need semantic networks of deeply linked information that enable targeted navigation
through related contexts and factors: not a "see also", but a "found individuals of species X in place Y in
individuals of species Z".
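As an illustration of such targeted navigation, the following sketch runs a SPARQL query over an RDF metadata repository; the file name, graph contents and property names are hypothetical and follow the placeholder vocabulary used above, not an actual HELBIONET schema.

```python
# Minimal sketch (assumed vocabulary): find places where individuals of a
# given species were observed inside individuals of another species.
from rdflib import Graph

g = Graph()
g.parse("helbionet_metadata.ttl", format="turtle")  # hypothetical export file

query = """
PREFIX hb: <http://example.org/helbionet/>
SELECT ?place ?host WHERE {
  ?obs a hb:Observation ;
       hb:took_place_at ?place ;
       hb:observed ?specimen .
  ?specimen hb:exemplifies hb:species_X ;
            hb:found_within ?hostSpecimen .
  ?hostSpecimen hb:exemplifies ?host .
}
"""
for row in g.query(query):
    print(row.place, row.host)
```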
Here some of the most innovative examples of such semantic integration systems are presented. The
Information Systems Laboratory (ISL) of FORTH and its Center for Cultural Informatics have participated in one
way or another in many of these, and have accumulated considerable know-how. Some of these are directly in the
biodiversity domain, while others are more generic or in the cultural heritage domain. Cultural heritage and
biodiversity have a lot in common, as both deal with records of recent and past observations and their
interpretation, and with holdings of physical collections. Indeed, FORTH has led a working group that identified
common core models for biodiversity museums and cultural museums.
We present here some recent national and international projects that relate to the state of the art of the
technology we propose for providing central services to an integrated, distributed biodiversity infrastructure in
Greece, and to its challenges:
LIFEWATCH
LifeWatch will be a distributed research infrastructure, where the entities of the LifeWatch organization and the
services that it provides will be spread over Europe. This research infrastructure comprises the
permanent elements needed to create an internet/web-based inter-organisational system that links personal and
institutional systems (data resources and analytical tools) for biodiversity research, supported by appropriate
human resources providing assistance to users.
The proposed architecture of the HELBIONET infrastructure is compatible with the proposed LifeWatch
architecture. The main elements of the proposed LifeWatch architecture are:
- The Resource Layer contains the specific resources, such as data repositories, computational
capacity, sensor networks/devices and modelling/analysis tools, that contribute to the LifeWatch
infrastructure. The primary components at this layer are the biological (e.g. species records) and
abiotic data from sites. Additional components include catalogue services (e.g., taxonomic
checklists or gazetteers), analysis tools, and processing resources.
- The e-Infrastructure Layer provides mechanisms for sharing the specific resources as generic
services in a distributed environment, spread across multiple administrative domains. This includes
systems for identifying, accessing and processing resources located within multiple administrative
domains, uniform security and access protocols, and ontologies for semantic integration.
- The Composition Layer supports the intelligent selection and combination of services in order to
complete tasks. This includes workflows, semantic metadata for the discovery of components, and
the storage of additional attributes such as provenance and version information. Viewed from the
perspective of the biodiversity scientist, the Composition Layer consists of a wide range of
application services or biodiversity research capabilities that can be selected and arranged in an
almost infinite number of ways to accomplish specific analytical and experimental tasks.
- The User Layer provides domain-specific presentation environments for the control and monitoring
of tasks and tools to support community collaborations. This includes a LifeWatch portal incorporating
discovery and workflow management tools and offering a single point of access for all users.
Institute of Museum and Library Services, Digital Collections and Content (IMLS DCC)
<http://imlsdcc.grainger.uiuc.edu/about.asp>
The Digital Collections and Content (DCC) project has developed a database describing digital collections (some 900). Each
collection is described by collection-level metadata.
In this project a logic-based framework for classifying collection-level/item-level metadata relationships is
being developed. The idea behind this framework is very similar to what we propose for HELBIONET for the Data
Dictionary services. This framework supports (i) metadata specification developers defining metadata elements,
(ii) metadata librarians describing objects, and (iii) system designers implementing systems that help users take
advantage of collection-level metadata. Users can search for collections and within collections. Collection/item metadata
relationships and collection-level metadata are classified and used for finding,
understanding, and using the items in a collection.
Much effort has been made to map between the different metadata schemas used by many different communities.
In the next phase of development, the employment of CIDOC CRM is being considered.
Classical Art Research Online Services (CLAROS), http://www.clarosnet.org/about/default.htm
CLAROS is an international interdisciplinary research initiative. There are five founder members of CLAROS,
each with extensive datasets on classical antiquity (mainly art) that have been created over the past thirty years.
Collectively they have more than two million records and images that will be made available through CLAROS. The
goals of CLAROS are very similar to the goals of HELBIONET regarding persistent data. The user communities
are different, since CLAROS addresses the needs of the art world.
The architecture of ClarosWeb is very similar to what we propose for HELBIONET. It employs mapping, co-reference, query, search and browsing, and multilingual thesaurus services. Additional classical art data
providers can be integrated into the system at any time, thereby enriching the total content, simply by mapping
their schemas to the core ontology and by making appropriate entries in the co-reference service. Each
member retains its own assets in its own format, its own website and its own IPR.
The CIDOC CRM has been adopted as the core ontology, since it allows common data elements to be used to
represent common concepts without constraining the overall range of information that can be represented.
This ontology forms the foundation for a service that indexes key concepts from these sources to provide a query
facility that operates uniformly across all these data. The CIDOC CRM also supports the targeted use of thesauri for
multi-language support, both in the diversity of data sources and in user queries.
CultureSampo – A National Publication System of Cultural Heritage on the Semantic Web 2.0
http://www.kulttuurisampo.fi/index.shtml
CULTURESAMPO is an application demonstration of a national-level publication system of cultural heritage contents
on the Web, based on the ideas and technologies of the Semantic Web and Web 2.0. The material consists of
heterogeneous cultural content, which comes from the collections of 22 Finnish museums, libraries, archives
and other memory organizations, and is annotated using various ontologies. All of these annotations are
represented using the Resource Description Framework (RDF) and a set of ontologies, including the Finnish
spatio-temporal place ontology SAPO. The material, coming from almost 100 different collections, currently
includes metadata about over 250,000 objects (e.g. artefacts, photographs, maps, paintings, poems, books, folk
songs, videos, et cetera) and millions of other reference resources (concepts, places, times, etc.).
CULTURESAMPO provides a centralized geospatial view that can show all kinds of content related to a place, and
organize these data in various ways in order to make sense of them. It employs a Google Maps view on which,
by default, all locations related to any content are shown. It provides a search box, whereby the
user can search for places by name using semantic autocompletion, and a browsable partonomy of places in the
ontology.
What this project has in common with the HELBIONET node is the use of event-based semantic models for
addressing interoperability problems, and of geospatial models for providing a geospatial view.
Linking Open Data (LOD), http://esw.w3.org/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
The Linking Open Data (LoD) project, an open, collaborative effort carried out in the realm of the W3C SWEO
Community Projects initiative, aims at bootstrapping the Web of Data by publishing datasets in RDF on the Web
and creating large numbers of links between these datasets. Currently the project includes over 50 different
datasets with over two billion RDF triples and three million (semantic) links, representing
a steadily growing, open implementation of the linked data principles.
RDF links enable navigation from a data item within one data source to related data items within other
sources using a Semantic Web browser. RDF links can also be followed by the crawlers of Semantic Web search
engines, which may provide sophisticated search and query capabilities over the crawled data. As query results are
structured data and not just links to HTML pages, they can be used within other applications. Collectively, the
data sets consist of over 13.1 billion RDF triples (dataset statistics), which are interlinked by around 142 million
RDF links (link set statistics) (November 2009).
In the HELBIONET infrastructure, the terminologies and georeferencing resources will all be connected via LoD.
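As a sketch of what "connected via LoD" means in practice, the snippet below dereferences a linked-data URI and lists its owl:sameAs links to co-referenced resources elsewhere; the DBpedia URI is only an illustrative example of a linked-data source, not a committed HELBIONET dependency.

```python
# Minimal sketch: dereference a linked-data URI with rdflib and list the
# owl:sameAs links that point to the same real-world entity in other datasets.
from rdflib import Graph
from rdflib.namespace import OWL

uri = "http://dbpedia.org/resource/Crete"  # illustrative LoD resource
g = Graph()
g.parse(uri)  # rdflib performs content negotiation and parses the returned RDF

for _, _, same in g.triples((None, OWL.sameAs, None)):
    print(same)  # e.g. gazetteer entries describing the same place
```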
3D-COFORM, www.3d-coform.eu
The 3D-COFORM integrated infrastructure allows different classes of users (CH professionals, technicians,
academics, museum users and web users, i.e. European citizens) to create 3D collection items by digitizing,
processing, classifying, annotating, cross-referencing and storing digital objects with rich semantics into a
repository. It employs ontologies to ingest, store, manipulate and export complex digital objects, their
components and related metadata, as well as to enable efficient access, use, reuse and preservation. The CIDOC
CRM is used as a common starting point in 3D-COFORM for data interoperability between the partners and
the supporting cultural heritage institutions in the consortium. Each gallery or museum schema is mapped to the
CIDOC CRM, with extensions developed in collaboration with the standards bodies where necessary (e.g.
the current work of the community on an extension for recording digital provenance).
The decision to target 3D-COFORM interoperability over a CIDOC CRM structured collection is a significant
and important point, and carries with it the requirement to ensure compatibility between a CIDOC CRM
oriented 3D-COFORM repository and other interoperable digital libraries like Europeana, which may have
initially targeted other approaches.
The CCI of ISL of FORTH designed the 3D-COFORM repository infrastructure (RI) and designed and implemented
the metadata repository. Metadata are ingested into the repository as RDF files. The components supply the RI
with RDF or XML files compliant with a specific schema, which is produced based on the extended CIDOC CRM
model. XML files are internally transformed into RDF files in the RI. CCI also developed an appropriate
provenance model for digital objects that is richer than the existing proposals. Its notions of a
digital machine event, digital measurement and formal derivation are very generic and capture the essence of e-science.
The notion of digitization is specific to certain processes and assists reasoning about "depicted objects". Similar
specializations may be created to reason about other measurement devices.
For HELBIONET we propose the same technology for the repository infrastructure and the employment of the
same provenance model.
CIDOC CRM, http://cidoc.ics.forth.gr/
The CIDOC Conceptual Reference Model (CRM) provides definitions and a formal structure for describing the
implicit and explicit concepts and relationships used in cultural heritage documentation. The CIDOC CRM is
intended to promote a shared understanding of cultural heritage information by providing a common and
extensible semantic framework that any cultural heritage information can be mapped to. It is intended to be a
common language for domain experts and implementers to formulate requirements for information systems and
to serve as a guide for good practice of conceptual modelling. In this way, it can provide the "semantic glue"
needed to mediate between different sources of cultural heritage information, such as that published by
museums, libraries and archives.
The CIDOC CRM is the culmination of over 10 years' work by the CIDOC Documentation Standards Working
Group and the CIDOC CRM SIG, which are working groups of CIDOC. Since 9/12/2006 it has been the official standard ISO
21127:2006.
Among other activities, the Centre works as a competence centre for the CIDOC CRM (ISO 21127), building up and
exchanging application know-how, providing consultancy to implementers and researchers, and contributing to the
dissemination, maintenance and evolution of the standard itself.
EUROPEANA, www.europeana.eu
Europeana rolls multimedia library, museum and archive into one digital website combined with Web 2.0
features. It offers direct access to digitised books, audio and film material, photos, paintings, maps,
manuscripts, newspapers and archival documents that are Europe’s cultural heritage. Visitors to
www.europeana.eu can search and explore different collections in Europe’s cultural institutions in their own
language in virtual form, without having to visit multiple sites or countries. The digital objects that users can
find in Europeana are not stored on a central computer, but remain with the cultural institutions and are hosted on
their networks. Europeana collects contextual information about the items, including a small picture. Users will
search this contextual information. Once they find what they are looking for, a simple click provides them with
access to the full content – inviting them to read a book, play a video or listen to an audio recording – that is
stored on the servers of the respective content contributing institutions.
The Europeana prototype gives direct access to more than 2 million digitised items from museums, libraries,
audiovisual and other archives across Europe. Over 1,000 cultural organisations from across Europe have
provided materials to Europeana. The digitised objects come from all 27 Member States, although for some of
them the content may be very limited at this stage. The HELBIONET infrastructure has functionality similar to that of
Europeana. Martin Doerr, head of the Center of Cultural Informatics of ICS-FORTH, is a core expert of
Europeana and has contributed a large part of the EDM model, the schema of the next ("Danube") release of
Europeana. This model is a generalization over the CIDOC CRM, DC and ORE.
ACGT (Advancing Clinico-Genomic Clinical Trials on Cancer - IST-026996)
ACGT aims to deliver to the cancer research community an integrated clinico-genomic ICT environment
enabled by a powerful GRID infrastructure. In achieving this objective, ACGT has formulated a coherent,
integrated work plan for the design, development, integration and validation of all technologically challenging
areas of work, namely:
(a) GRID: delivery of a European Biomedical GRID infrastructure offering seamless mediation services
for sharing data and data-processing methods and tools, and advanced security;
(b) Integration: semantic, ontology based integration of clinical and genomic/proteomic data - taking
into account standard clinical and genomic ontologies and metadata;
(c) Knowledge Discovery: delivery of data-mining GRID services in order to support and improve
complex knowledge discovery processes.
The technological platform will be validated in the concrete setting of advanced clinical trials on cancer.
Pilot trials have been selected based on the presence of clear research objectives raising the need to
integrate data at all levels of the human being. ACGT promotes the principle of open source and open
access, thus enabling the gradual creation of a European Biomedical Grid on Cancer. Hence, the project
plans to introduce additional clinical trials during its lifecycle.
The Information Systems Laboratory of ICS developed, revised and extended, in collaboration with IFOMIS, the
ACGT Master Ontology (MO). The intention of the ACGT Master Ontology is to represent the domain of cancer
research and management in a computationally tractable manner. ISL also worked on:
- the design of the ACGT mapping format;
- the specification of requirements and the evaluation of the following semantic integration tools
(developed by UPM):
• the mapping tool used to map the sources to the MO;
• the ACGT mediator used to integrate clinico-genomic information;
- the development of the Ontology Submission tool: a tool that intends to aid the evolution of an
ontology through requests for change from users and domain experts. The submission tool was
integrated with the Trial Management system (Obtima) using web services.
By mapping the ACGT Master Ontology (MO) to the CIDOC CRM, important relationships about scientific
observation were extracted from the CRM. As a result, we could say that the CRM with minor extensions can serve as a
core model to integrate clinical observations and omics laboratory measurements.
POLEMON
The POLEMON system, a decentralized management information system for the National Monuments Record
funded by the EPET II programme of GSRT, has employed a directory service on a distributed network of cultural
information databases. The system was designed for the Greek Archaeological Service, one of the oldest public
services in Greece, which was organized in a decentralized manner from the very beginning. The peripheral units, called Ephorates
of Antiquities, now approach 67 in number, include a number of large independent museums and some
special services (Underwater Archaeology, Speleology, etc.), and are responsible for the field work, excavations,
restorations, protection and management of the archaeological sites and monuments and their environment. A
special Direction at the General Directorate of Antiquities and Restoration, called the Directorate of Monuments
Records and Publications, collects any useful documents concerning sites and monuments, compiles a general
inventory and classifies the 160-year-old historic archives of the Greek Archaeological Service.
The peripheral units may have their own database management systems to manage their monument data, and
they grant permission to researchers, on request, to access these data.
These dispersed, isolated cultural databases across the Ephorates may or may not share the same logical schema;
their vendors and types may vary, and they may concern different types of monuments.
The CCI of ISL of FORTH was the coordinator and responsible for the design and implementation of the global
access system (directory service) of POLEMON. The exploitation of the POLEMON system started in 1997
with the participation of 12 peripheral units of the Ministry of Culture. The current installation of the POLEMON
system in the Ministry of Culture employs 67 dispersed databases across Greece. The metadata repository of the
Global Access System of POLEMON is very similar to the metadata repository which we propose for the
HELBIONET infrastructure. Also, the mapping method from different database schemas to a global schema is very
similar to what we propose for the HELBIONET mapping mechanism.
German Federal Ministry of Education and Research (BMBF): Research Between Natural and
Cultural History Information: Benefits and IT-Requirements for Transdisciplinarity
This research was partly supported by the German Federal Ministry of Education and Research (BMBF) within
the BIOLOG Biodiversity Informatics and BIOLOG Biota Africa programmes, by funding the DORSA, DIG and
BIOTA East Africa E15 projects. http://doi.acm.org/10.1145/1367080.1367
The core ontology we propose for HELBIONET is very similar to the one employed in this research.
The approach adopted in this research is to integrate transdisciplinary information into a core ontology. The
proposed core ontology is based on the CIDOC Conceptual Reference Model (ISO 21127). When the proposed
ontology is instantiated with some realistic examples taken from the field of biodiversity (collecting,
determination, type creation, expedition and observation events), the formal specification of semantic concepts
makes scientific activities commonly understandable. This research showed that ontologies not only allow one to
describe the results of scientific activities, such as a description of a biological species, but can also help to
clarify the path by which the goal was reached. In particular, ontologies provide a high-level uniform
representation of transdisciplinary research activities and results. It also showed that ontologies as knowledge
representation tools will have a strong impact on methodological questions and research strategies for different
domains such as biology, archaeology, art history and socio-economics. They can be regarded as semantic glue
between and within scientific and scholarly domains, as demonstrated in a series of examples.
Requirements
In the light of the existing infrastructure in Greece, a general requirement is posed for the design and
implementation of a directory service for data and service discovery.
The proposed infrastructure should support biodiversity research functions for the three types of user groups
(end users, brokers, data and service providers). The list of required functionality can be grouped according to
user group.
The end users need functions for discovering, accessing, processing and viewing data. These functions include:
• Searching and browsing mechanisms for distributed data and services.
• A uniform identity framework for data and services.
• Access to existing data and services, distributed among multiple organizations.
• Mechanisms for source data preservation, i.e., access to past versions of data sets that have been
used to produce secondary information.
• Mechanisms for data analysis as well as mapping and modelling tools, using standard ways to
manipulate and view data.
• Mechanisms for data fusion, integrating different sources (such as sensor data, biodiversity
parameters, geographic data, primary data, workflow execution), to allow fast retrieval at different
levels of detail, e.g. for analysis and visualization.
• Support for the understanding of results by the user, by providing tools and mechanisms to enhance
knowledge extraction from discovery as well as from analysis results.
The brokers need functions for composing electronic workflows. These functions require translation or mapping
mechanisms between data formats and semantics. Broker functions include:
• Mediation mechanisms, such as a semantic metadata framework, to enhance interoperability between
resources.
• Promotion of interoperability standards that will allow characteristic user groups to combine particular
services so that they can support collaborative workflow modelling, providing documentation and
enactment of data processing steps.
• Management of provenance metadata and promotion of provenance metadata generation
by data providers following adequate standards.
• Mechanisms for managing data sets and their processing tools independently by their providers, and
mechanisms for reproducing or examining an analysis of a specific version of both the data and the
tools used in this analysis.
• Support for building and managing virtual organisations by allowing the definition of groups and roles
and their authorisation for sharing resources.
Data and service providers continue to manage their data (and services) independently as now, including control
of the creation and modification of data/services. However, data can be accessed by authorised users located
anywhere through a generic mechanism defined by HELBIONET. They need functions for:
• Capturing data from users and lightweight devices, including field sensors and networks providing
continuous streams of new data, and portable computing devices, often with intermittent connectivity.
• Allowing users to add annotations to existing data and services, which may contribute to the quality
assessment and feedback processes.
• Providing control mechanisms for access to data and services, together with monitoring services for the
support of service level agreements and, potentially, including charging mechanisms.
From the technical point of view the requirements include:
• Evolutionary development
• Clear separation between centralized services and local data and service provider responsibilities
• High on-line accessibility of all data
• Technology independence and rigorous use of standards
• Metadata information about any source of data, independent of its format and age
• Metadata integration and central directory services (integrated access to collection-level and item-level metadata)
• Loosely coupled components
• Generic, self-describing infrastructure components
• Distributed systems interoperability
• Unambiguous identification (identifier generation service)
• Dynamic co-reference (identity) resolution of URIs, links and identifiers to: species, specimens,
samples, data records, observation campaigns, publications, archives of scientific work records, services
and software tools, places, actors, habitats
All the above requirements are preconditions which should be met at national level in the forthcoming
phase of the project, in order to accomplish a tighter integration of tools and services in a next step of this
project. A tighter integration of tools and services includes:
• Support for workflow modelling, providing documentation and enactment of data processing steps.
• Collaborative community support tools which allow groups of people to work on a shared objective.
Recommendations and conclusions
In the light of this study, we may say that the Greek HELBIONET infrastructure should have the character of a
directory service for data discovery and access, and should promote interoperability. In addition, substantial access
services will be provided in the form of Linked Open Data (LoD).
The architecture of the national node
A service-oriented architecture (SOA) is an appropriate high-level architecture for the HELBIONET
infrastructure to follow. A deployed SOA-based architecture will provide a loosely integrated suite of services
that can be used within multiple biodiversity domains. SOA defines the interface in terms of protocols and
functionality and separates functions into distinct units, or services, which developers make accessible over a
network in order to allow users to combine and reuse them in the production of applications.
SOA implementations rely on a mesh of software services. Services comprise unassociated, loosely coupled units
of functionality that have no calls to each other embedded in them. Each service implements one action. Instead
of services embedding calls to each other in their source code, they use defined protocols that describe how
services pass and parse messages, using description metadata. These metadata should describe in sufficient
detail not only the characteristics of these services, but also the data that drives them. Extensive use of XML is
made in SOA to structure data, wrapping it in a nearly exhaustive description container. Analogously, the Web
Services Description Language (WSDL) typically describes the services themselves, while the SOAP protocol
describes the communication protocol. The great promise of SOA is that the marginal cost of creating
the n-th application is low, as all of the software required already exists to satisfy the requirements of other
applications. Ideally, one requires only orchestration to produce a new application.
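To illustrate what consuming such a WSDL-described service could look like, the sketch below uses the Python zeep SOAP client against a hypothetical provider endpoint; the WSDL URL and the operation and parameter names are invented for illustration and are not part of any defined HELBIONET interface.

```python
# Minimal sketch (hypothetical service): call a provider's SOAP data service
# described by WSDL and print the records it returns.
from zeep import Client

# Hypothetical WSDL published by a data provider, following the directory's
# prescribed service description.
wsdl_url = "http://provider.example.org/biodiversity?wsdl"
client = Client(wsdl_url)

# Hypothetical operation and parameters exposed by that service.
records = client.service.GetObservations(species="Alytes obstetricans",
                                          region="Crete")
for rec in records:
    print(rec)
```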
The global system of the Greek HELBIONET infrastructure will constitute a directory service of data and services.
The data and service providers are responsible for managing their products (data and services) independently,
including the creation and modification of data/services, and for making public the products they want.
Data-providing services can be distinguished into (a) persistent data services, which provide access
to database data, (b) live services, which provide dynamic data by accessing data-generating devices (sensors,
metering stations, etc.), and (c) computational data services, which, given a set of data inputs, produce a set
of data outputs (simulation data). Only the persistent data services provide data in the form of Linked Open Data
(LoD), and thus metadata on these data can be harvested.
This directory service will provide data/service providers with the means to declare themselves as providers, to
declare the services and data they provide, and to publish (or update already published) data and services.
All providers should declare their services following the WSDL description provided by the directory. The same holds for
all the data which the providers wish to make public. First they have to declare the kind of data they wish to
provide, so that the directory is aware of what kind of data is available from each provider and can thus redirect
users to the specific provider. The providers should follow a specific directory-provided WSDL description for
publishing their data. They have to provide their data as a data service, implementing a Web service on top of
their databases, and to declare this data service to the directory.
For the data of different providers to be accessible through a central search mechanism, the providers should
also supply the metadata of these data. All appropriate semantic information (metadata) should be defined in
advance in order to facilitate the querying and retrieval process.
Data providers should also provide a metadata extraction service that will allow the directory service to harvest
the available metadata from a provider's database.
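A rough sketch of such a harvesting round is given below; the provider registry, the metadata-extraction endpoints and the ListMetadata operation are all assumed names, since the concrete interfaces are to be defined by the directory's WSDL.

```python
# Minimal sketch (assumed interfaces): the directory iterates over registered
# providers, calls each one's metadata extraction service and collects the
# harvested records for later mapping and storage.
from zeep import Client

# Hypothetical registry of providers and their metadata-extraction WSDLs.
provider_registry = {
    "provider-a": "http://provider-a.example.org/metadata?wsdl",
    "provider-b": "http://provider-b.example.org/metadata?wsdl",
}

harvested = []
for provider_id, wsdl in provider_registry.items():
    service = Client(wsdl).service
    # Hypothetical operation returning metadata records added since a date.
    records = service.ListMetadata(since="2010-01-01")
    harvested.extend((provider_id, rec) for rec in records)

print(f"harvested {len(harvested)} metadata records")
```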
All harvested metadata will be mapped from their original XML format to a unified and normalized XML schema
based on CIDOC CRM (CRM Core schema compatible), using well-defined mappings from the different database
schemas to this normalized schema. All metadata will be stored in the metadata repository. The original XML
format of the metadata should be compatible with a specific set of XML schemas and DTDs, according to the
international standards used in these fields (e.g. ISO 21127, Dublin Core, EAD).
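A toy sketch of such a mapping step is shown below; the source elements follow Dublin Core, while the target element names for the CRM-compatible record are invented placeholders, since the actual normalized schema is yet to be specified.

```python
# Minimal sketch: map a Dublin Core-like metadata record to a simplified,
# CRM-core-style target record. The target element names are placeholders.
from lxml import etree

DC = "http://purl.org/dc/elements/1.1/"
source_xml = f"""
<record xmlns:dc="{DC}">
  <dc:title>Herbarium sheet 1234</dc:title>
  <dc:creator>A. Botanist</dc:creator>
  <dc:date>1987-06-02</dc:date>
</record>"""

src = etree.fromstring(source_xml)
target = etree.Element("CRMEntity")  # placeholder for a CRM information object
for dc_name, crm_name in [("title", "has_title"),
                          ("creator", "was_created_by"),
                          ("date", "has_time_span")]:
    value = src.findtext(f"{{{DC}}}{dc_name}")
    if value is not None:
        etree.SubElement(target, crm_name).text = value

print(etree.tostring(target, pretty_print=True).decode())
```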
In parallel, the system will trace and retrieve terms from the metadata and co-reference them with external
authorities. For example, geographical locations will be co-referenced with geographical dictionaries or directories
(gazetteers). Thus, by ingesting the metadata, the system will also ingest terminology that will be used by the
search mechanism of the system.
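As an example of co-referencing a place name against an external gazetteer, the snippet below queries the public GeoNames search API; GeoNames is only one possible authority, and the "demo" username is a placeholder that a deployment would replace with its own registered account.

```python
# Minimal sketch: look up a place name in the GeoNames gazetteer and keep the
# identifier of the best match as a co-reference for the local term.
import requests

def coreference_place(name, username="demo"):  # "demo" is a placeholder account
    resp = requests.get(
        "http://api.geonames.org/searchJSON",
        params={"q": name, "maxRows": 1, "username": username},
        timeout=10,
    )
    hits = resp.json().get("geonames", [])
    if not hits:
        return None
    hit = hits[0]
    # GeoNames ids can be expressed as stable URIs of the form below.
    return f"http://sws.geonames.org/{hit['geonameId']}/"

print(coreference_place("Heraklion"))
```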
The system should be globally accessible via a web interface (and a set of Web-service actions) which will
enable query formulation on data, metadata and services, in conjunction with the terminology manager, in order to
provide successful query results.
In light of the above, a description of the structural and functional components of the proposed system
follows:
Functional components
The system consists of the following functional components:
Directory of Providers and Services
• Responsible for the creation, deletion, manipulation and management of services from all providers.
• Constitutes the central directory that external users will access in order to get information on available
data and/or services.
• Provides the directory of all the providers.
Metadata repository
• Responsible for the creation, deletion, manipulation and management of the metadata that the providers would
like to make public.
• Unambiguous identification (identifier generation service).
• Dynamic co-reference (identity) resolution of URIs, links and identifiers to: species, specimens, samples,
data records, observation campaigns, publications, archives of scientific work records, services and
software tools, places, actors, habitats.
[Figure: HELBIONET directory service system. An access Web-service interface (query interface, output interface) is served by an application server providing access management and query result management, which draws on the Metadata Repository, the co-reference directory and the Directory of Providers and Services. An ingest Web-service interface is served by application servers hosting the terminology extractor, the mapping mechanism and the ingest tool, which load declarations of services and data, and metadata, from the providers' sources: local databases exposed through Web-service interfaces.]
Mapping mechanism
• Provides the mechanism for mapping the metadata from data files described in various formats (ISO
21127, Dublin Core, EAD) to the unique format accepted by the metadata repository (CIDOC CRM
Core schema compatible).
Co-reference directory server
• Adds/deletes/changes terms and their co-references to external authorities.
• Maintains relations between terms.
• Creates/deletes/maintains term hierarchies.
• Answers queries on term co-references and their relations.
Terminology extractor
• Extracts terms from text (metadata or documents).
• Searches external authorities for co-referencing information on the extracted terms.
Co-reference directory controller
• Implements all the logic and control for extracting terminology; stores and queries the extracted
terminology.
• Works like a "glue" between the co-reference directory and the terminology extractor.
• Provides the interface for both (i) the input mechanism of the co-reference directory server [gets a text (in
a buffer or in a document), calls the terminology extractor, extracts the terms and their co-references
with external authorities, and stores them in the co-reference directory server], and (ii) the output
mechanism of the co-reference directory server [based on a term, returns its co-references with
external authorities]. A small sketch of such a co-reference store is given below.
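The following is a very small sketch of what the co-reference directory server that this controller feeds could look like internally, using a union-find structure to keep identifiers that denote the same entity in one equivalence class; this is an illustrative data-structure choice, not a description of the actual server.

```python
# Minimal sketch: a union-find based co-reference store. Identifiers that are
# declared co-referent end up in the same equivalence class.
class CoreferenceStore:
    def __init__(self):
        self.parent = {}

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:            # path halving
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def declare_same(self, a, b):
        """Record that identifiers a and b refer to the same entity."""
        self.parent[self._find(a)] = self._find(b)

    def coreferent(self, a, b):
        return self._find(a) == self._find(b)

store = CoreferenceStore()
# Placeholder identifiers: a local term, a gazetteer URI and a legacy name.
store.declare_same("hb:place_heraklion", "http://sws.geonames.org/0000001/")
store.declare_same("hb:place_heraklion", "gazetteer:candia")
print(store.coreferent("gazetteer:candia", "http://sws.geonames.org/0000001/"))
```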
Ingest-tool controller
• Implements all the logic and control for loading the metadata from the various selected databases.
• Works like a "glue" between the directory repository, the metadata repository and the co-reference directory
server, concerning metadata loading and maintenance.
• When invoked to load a new set of metadata files, the following actions take place: (i) invocation of
the mapping mechanism to produce uniform metadata for all input metadata, (ii) creation of the
appropriate linking from the metadata to the actual data objects (LoD), (iii) invocation of the co-reference
directory controller to extract terms from the metadata and from any data objects (basically the
documents), and their storage along with their co-references to the appropriate authority entries, and (iv) storage of
the metadata in the metadata repository.
Access manager
• To get a result from a database, the user may either use predefined queries or create their own query
by specifying one or more values for one or more tags in the metadata schema.
• To get the list of services from a provider, to get a service description (what the service is about), and to get
the method to invoke the actions of a service.
Results presentation mechanism
• The system will provide mechanisms for the presentation and management of results. These are
merging, ranking, exporting and saving mechanisms.
Access mechanism controller
• Implements all the logic and control for querying and accessing the data from the system.
• Works like a "glue" between the directory repository, the metadata repository and the co-reference directory
server, concerning data and service retrieval and presentation.
• Basically it is controlled by the user who invokes the UI functions, or by the web-service caller, in order to
make queries to various databases and collect the results.
• Is invoked to get the list of services from a provider, or the list of providers that may be candidates for
the kind of data the user wants to query and retrieve, to get a service description (what the service is
about), or to get the way the service actions may be invoked.
• When invoked to execute a query, the following steps take place: (i) a set of queries is
formulated by unfolding the queried terms in order to get all their co-references (see the sketch after this list); (ii) in
some cases the terms and their co-references may be used to get data records from the providers.
• The system will also provide a co-reference browser in order for the user to get more terms to choose
from.
• The system will provide the following results presentation and management functions: merging, ranking,
exporting and saving results, as well as querying on a result set.
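The term-unfolding step in (i) above could look roughly like the sketch below, which expands a user's search term into all of its co-references before querying providers; it builds on the hypothetical CoreferenceStore and provider interfaces sketched earlier and is not a specification of the actual controller.

```python
# Minimal sketch (assumed interfaces): expand a query term into all of its
# co-referenced identifiers, then ask each candidate provider for records.
def unfold_term(store, term, known_identifiers):
    """Return the term plus every known identifier co-referent with it."""
    return {t for t in known_identifiers if store.coreferent(term, t)} | {term}

def execute_query(store, providers, term, known_identifiers):
    expanded = unfold_term(store, term, known_identifiers)
    results = []
    for provider in providers:        # providers: hypothetical client objects
        for identifier in expanded:
            results.extend(provider.get_records(subject=identifier))
    return results
```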
Investment costs
We estimate the following cost for the development phase and for the testing and support phase.
Development phase:
Workload:
The following table presents workload costs:
Personnel type | Number of persons | Number of years | Months per year | Total person-months | Monthly cost (€) | Subtotal (€) | FPA 23% (€) | Subtotal incl. FPA (€)
Developers | 3 | 4 | 12 | 144 | 2.500 | 360.000 | 82.800 | 442.800
Senior researchers | 2 | 4 | 6 | 48 | 6.000 | 288.000 | | 288.000
Senior engineers | 1 | 4 | 12 | 48 | 4.000 | 192.000 | | 192.000
Total | | | | | | 840.000 | | 922.800
Overhead (15%) | | | | | | | | 138.420
Total (incl. overhead) | | | | | | | | 1.061.220
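For transparency, the totals in the table can be reproduced with the short computation below; the figures are taken directly from the table, and the assignment of FPA (VAT) to the developer costs only follows the reconstruction of the table above.

```python
# Reproduce the workload cost totals from the table above (amounts in euro).
developers       = 3 * 4 * 12 * 2500   # 144 person-months -> 360,000
senior_research  = 2 * 4 * 6  * 6000   #  48 person-months -> 288,000
senior_engineers = 1 * 4 * 12 * 4000   #  48 person-months -> 192,000

fpa = 0.23 * developers                 # 82,800 (FPA on developer costs)
subtotal = developers + fpa + senior_research + senior_engineers  # 922,800
overhead = 0.15 * subtotal              # 138,420
total = subtotal + overhead             # 1,061,220
print(total)
```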
S/W: The software platform for the development will be based on open-source technology and development tools. We
anticipate that some of the development software may not be open source, but we estimate that the cost of such
software will not exceed ~10.000 euro (software for geographical representations and maps, e.g. the
Google Maps development suite, 3D presentation and previewing tools, statistics tools, etc.).
H/W: For the development of the project we need two development PCs (~1.500 euro) and one server
(~3.000 euro), running Windows Server or Linux.