EU BON MS251: Specification for the European Biodiversity Portal
Version 1.0

Milestone MS251 for Task 2.5 (European Biodiversity Portal) is due in month 27 (Feb 2015). The release of a beta version of the portal is due in month 39 (Feb 2016).

Contents

1. Introduction
2. Functional specifications
   2.1 Functional specifications summary
   2.2 Search process [SPEC-01]
   2.3 Biodiversity networks integration [SPEC-02]
   2.4 Data supplying, consuming and visualization [SPEC-03]
   2.5 EU BON Taxonomic backbone integration [SPEC-04]
   2.6 Connection with GEOSS [SPEC-05]
   2.7 Integration with DataONE network [SPEC-06]
   2.8 Geospatial capabilities [SPEC-07]
   2.9 Potential users and roles [SPEC-08]
   2.10 Content managing [SPEC-09]
   2.11 Dataset quality specifications [SPEC-10]
3. Architectural Design
   3.1 Current architectural proposal
   3.2 Architectural revision using GI-cat

1. Introduction

Task 2.5 (European Biodiversity Portal) is defined as follows in the Description of Work (DoW).

A European Biodiversity Portal (EBP) will be developed as the main GEO BON information hub. It will link to relevant databases and information systems, policy contacts and recommendations, and structured advice for assessing relevant distributed information/datasets for different user groups, including contributions from citizen science data gathering gateways. The EBP will technically integrate the various data sources under one search facility and spatially/temporally oriented user interface. The portal will build on the tools developed by task 2.3, functions developed by task 2.4. It will provide access to full detailed data, geographic visualisation and remotely sensed data. It will be closely linked to the GCI and GEO Portal, and access layers and data from GEOSS sources. The portal would also act as showcase for the products from the analytical and modelling activities of other WPs and support workflows for building such products using the registered e-services. The portal will also serve general dissemination functions for WP8. (Lead CSIC; UEF, GBIF, UnivLeeds, Pensoft, FIN, Plazi, GlueCAD, NBIC; Months 1-54)

Person month allocations for the participating partners are: CSIC (Lead, 18), UEF (7), GBIF (3), UnivLeeds (3.9), MRAC (0.5), FIN (5), Plazi (1), GlueCAD (11), NBIC (2).

As described in the EU BON DoW, the European Biodiversity Portal (task 2.5) “will technically integrate the various data sources under one search facility and spatially/temporally oriented user interface.
The portal will build on the tools developed in task 2.3, functions developed by task 2.4. It will provide access to full detailed data, geographic visualisations and remotely sensed data. It will be closely linked to the GCI and GEO Portal, and access layers and data from GEOSS sources”.

The EU BON Portal’s first priority is to connect to and access data from GBIF, LTER, testing site databases and other data/metadata providers, allowing users to search biodiversity data through a public web interface. The search engine will look for this information by querying each data provider or aggregator connected through a SOA brokering platform (Enterprise Service Bus). As a network of networks, EU BON will not connect directly to the original sources of data if these are available through existing aggregation services. Instead, the EU BON Portal will use already aggregated data.

This document sets out the functional specifications of the portal. First, a list of specifications is presented; each one is then analysed in more detail in the following sections. The last section analyses the current proposal for the architecture and reviews it after the inclusion of GEOSS GI-cat as the main data brokering tool.

2. Functional specifications

2.1 Functional specifications summary

- The main goal of the portal is to provide integration of biodiversity data, ecosystem data and genetic data, searchable using a common user interface.
  o The portal will provide a common search user interface as an input form. Search filters will include: taxa, geospatial coverage, temporal coverage, EBV and data providers.
  o The portal will obtain species information, including universal identifiers, by querying the EU BON taxonomic backbone.
  o The search results will be presented as a list of datasets and their associated metadata. Biodiversity and ecological metadata will be obtained by consuming web services (e.g. OGC CSW, GBIF API) or by EML parsing.
    Genetic data will be represented as GenBank links and their related metadata.
- The portal will act as a network of networks. This implies that:
  o It will use metadata only, providing links to the data.
  o It will not host any data. If possible, it will request the data from data providers, but this will be very limited and dependent on each network's capabilities.
- The portal must be compatible with the DataONE network solutions, as a requirement to become a DataONE Member Node. In particular, it must provide an output REST API to be consumed by DataONE network services.
- The portal must provide advanced visualisation capabilities.
  o In particular, it must represent search occurrences in a map user interface.
  o It must be able to represent remote sensing data, i.e. consuming OGC-compliant services and representing their output layers.
  o It must be able to use the map as an input interface for the geographic coverage search filter.
- The portal will consider three main user roles: public, researcher and policy maker. The result details or visualisations may vary depending on each user role.
- The portal must provide a way to upload or link EU BON products, documents, tools, guidelines and other relevant information.
- The portal must provide an output interface to be consumed by GEOSS GCI, that is, by the GEOSS Discovery and Access Broker.

The specifications are summarized in Table 1, each with an assigned code to ease traceability with software requirements.

Specification Code   Description
SPEC-01              Provide a filtered search user interface
SPEC-02              Integrate biodiversity networks/providers/test sites by metadata
SPEC-03              Supply heterogeneous biodiversity related data
SPEC-04              Integrate the EU BON Taxonomic backbone
SPEC-05              Provide an interface to GEOSS
SPEC-06              Provide an interface to DataONE
SPEC-07              Provide geospatial capabilities for filtering and visualisation
SPEC-08              Provide different user interfaces for different user roles
SPEC-09              Web content management
SPEC-10              Data quality specifications

Table 1. EU BON Biodiversity Portal specifications summary

2.2 Search process [SPEC-01]

A simplified search use case, using the taxonomic backbone capabilities, could be as follows:

1. The user selects a particular species or habitat. The user can enter a species scientific name, a species vernacular name or any habitat-related keyword.
2. The system will provide several filters.
3. The user selects any desired filter.
4. If species information has been entered as input, the system will query the taxonomic backbone for species information.
   a. The taxonomic backbone will search for the species information on each taxonomic provider accessible from the backbone. It will integrate the information, translate the IDs if needed and return the compiled information to the system.
5. The system will use the integrated species information and habitat-related keywords as input to search for occurrence/habitat/genetic data.
6. The system will return the list of search results to the portal component, which presents the information on the web page.

The search process will use metadata harmonization to enable the lookup and identification of the required information in each provider's datasets. In principle, at least one search standard or protocol must be used as a common output, in order to provide a common API for the search process, e.g. GI-cat OpenSearch and OGC CSW interfaces.
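
Steps 4 to 6 of the use case above can be outlined in a few lines of Python. This is only an illustrative sketch under stated assumptions: `SearchRequest`, `resolve_taxon`, `search_provider` and the in-memory `BACKBONE` and `PROVIDERS` tables are hypothetical stand-ins for the taxonomic backbone and for the provider connections behind the broker. They do not describe any existing EU BON component or API.

```python
from dataclasses import dataclass, field


@dataclass
class SearchRequest:
    """User input: a species name or habitat keyword, plus optional filters."""
    term: str
    filters: dict = field(default_factory=dict)


# Hypothetical stand-in for the taxonomic backbone: maps a lower-cased input
# name to an accepted name and its synonyms (illustrative data only).
BACKBONE = {
    "red fox": {"accepted": "Vulpes vulpes", "synonyms": ["Canis vulpes"]},
}

# Hypothetical stand-in for the metadata catalogues of two providers.
PROVIDERS = {
    "provider_a": [{"title": "Mammal survey", "taxa": ["Vulpes vulpes"]}],
    "provider_b": [
        {"title": "Historic records", "taxa": ["Canis vulpes"]},
        {"title": "Bird atlas", "taxa": ["Parus major"]},
    ],
}


def resolve_taxon(term):
    """Step 4: ask the backbone which names should be searched for."""
    entry = BACKBONE.get(term.lower())
    if entry is None:
        return [term]  # habitat keyword or unknown name: search it as-is
    return [entry["accepted"], *entry["synonyms"]]


def search_provider(name, keywords):
    """Step 5: look up one provider's metadata for any of the keywords."""
    return [
        {"provider": name, "title": ds["title"]}
        for ds in PROVIDERS[name]
        if any(k in ds["taxa"] for k in keywords)
    ]


def search(request):
    """Steps 4 to 6: expand the input, query each provider, merge results."""
    keywords = resolve_taxon(request.term)
    providers = request.filters.get("providers", list(PROVIDERS))
    results = []
    for name in providers:
        results.extend(search_provider(name, keywords))
    return results
```

Calling `search(SearchRequest("Red fox"))` in this sketch finds datasets registered under either the accepted name or the synonym, which is the benefit of resolving names through the backbone before querying providers.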

Because the search is heuristic, federated search is preferred over indexed search or data caching. Note, however, that GI-cat supports both federated search and a hybrid mode (indexing plus federation).

The following search filters have been proposed:
- Species (taxa/vernacular name).
- Geospatial filter (bounding boxes / polygons / locations).
- Providers/testing sites.
- Date/time filter.
- Kind of data.
- Broad species traits.
- EBV class (topics).

2.3 Biodiversity networks integration [SPEC-02]

The system will be able to search biodiversity data and metadata from a wide range of data providers. Metadata-based information will be used to discover datasets, while data will be offered as links to downloadable files in a standardised format whenever possible (e.g. Darwin Core Archive). The portal will integrate the providers harvested using the Registry and Catalogue Specification (MS241, annex 1).

During the first phase of the development, the system will integrate a subset of the data providers, extending the subset in further releases:
- GBIF (through the GBIF API).
- LTER Europe (DEIMS + GeoNetwork, using OGC-CSW web services or EML harvesting).

As the genetic/genomic data provider, the EBP will use GenBank, consuming its “Entrez” WSDL or REST services (http://www.ncbi.nlm.nih.gov/genbank/).

In line with the service-oriented architectural pattern, there will be two possibilities for connecting data providers to the portal:
- Direct connection to each provider through WSDL web services, OGC services or a REST API.
- EML harvesting, through the implementation of a harvester service.

Following the recommendation included in MS241, the EuroGEOSS broker, GI-cat, must be assessed and integrated in the architecture as a specialised message broker, that is, the system that will integrate input sources and generate a common set of outputs. A preliminary assessment is included in section 3.2.
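
For the EML harvesting option mentioned above, the core of a harvester is a parser that extracts search-relevant metadata from an EML document. The following is a minimal sketch using only the Python standard library. The function name `parse_eml`, the layout of the returned dictionary and the sample record are assumptions made for this example; a real harvester would also need to handle multiple coverages, temporal and taxonomic coverage, and missing elements.

```python
import xml.etree.ElementTree as ET

# Made-up EML fragment, for illustration only.
SAMPLE_EML = """
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
         packageId="example.1.1" system="example">
  <dataset>
    <title>Example vegetation plots</title>
    <coverage>
      <geographicCoverage>
        <geographicDescription>Example test site</geographicDescription>
        <boundingCoordinates>
          <westBoundingCoordinate>-6.6</westBoundingCoordinate>
          <eastBoundingCoordinate>-6.2</eastBoundingCoordinate>
          <northBoundingCoordinate>37.2</northBoundingCoordinate>
          <southBoundingCoordinate>36.8</southBoundingCoordinate>
        </boundingCoordinates>
      </geographicCoverage>
    </coverage>
  </dataset>
</eml:eml>
"""


def parse_eml(xml_text):
    """Return the title and bounding box of the first geographic coverage.

    The bounding box is a (west, south, east, north) tuple of floats, or
    None when the document has no bounding coordinates. This sketch assumes
    that, when present, all four coordinates are given.
    """
    root = ET.fromstring(xml_text.strip())
    dataset = root.find("dataset")
    if dataset is None:
        raise ValueError("no <dataset> element found")

    record = {
        "title": dataset.findtext("title", default="").strip(),
        "bbox": None,
    }
    coords = dataset.find("coverage/geographicCoverage/boundingCoordinates")
    if coords is not None:
        record["bbox"] = tuple(
            float(coords.findtext(f"{side}BoundingCoordinate"))
            for side in ("west", "south", "east", "north")
        )
    return record
```

The extracted title and bounding box are the kind of fields a metadata catalogue could index to support the keyword and geospatial search filters.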

GI-cat provides a set of input interfaces (accessors) that translate the messages to a common ISO 19115 data model.
- LTER CSW 2.0.2 services can be connected using the CSW accessor.
- GBIF requires a new accessor (currently in development).

2.4 Data supplying, consuming and visualization [SPEC-03]

As stated in the DoW, EU BON will evaluate seven major biodiversity data types: 1) remote sensing data; 2) products derived from remote sensing data (e.g., vegetation and habitat maps); 3) taxonomic backbone data; 4) ecological data; 5) current and historical specimen data from scientific collections; 6) species profile data; and 7) DNA sequence data.

The portal must act as a common showcase for data and metadata providers. The portal will not include uploading capabilities: it will delegate this responsibility to the catalogue and repository applications installed at each test site or data provider (MS231, data sharing tools).

Several data representation techniques must be supplied, depending on the user role:
- Grids, forms and maps.
- Charts, statistics and reports (these could be the only outputs for the “policymaker” user role).

The portal will be capable of executing pre-built, previously defined workflows:
- Based on available remote services.
- Using datasets exported by functions within the portal.
- Using background data forwarded from GEOSS services.

Particular workflows for running EBV estimations will be defined later on; these workflows will use EBV variables for the estimation processes.

The main metadata standard for describing datasets will be EML (Ecological Metadata Language). For collections, ABCD metadata translation will not be necessary, since collection datasets are already indexed by GBIF.

2.5 EU BON Taxonomic backbone integration [SPEC-04]

A unified taxonomic information service has been developed within the scope of task 1.2.
This backbone has been released and is accessible through a set of REST services (RESTful API), available at http://cybertaxonomy.eu/eu-bon/utis/

The backbone allows running a federated search on multiple European checklists and returns a unified result set built from the individual responses. Currently, the checklists include the Pan-European Species directories Infrastructure (PESI EU-nomen), Catalogue of Life (CoL) and World Register of Marine Species (WoRMS). It is planned to connect more biodiversity catalogues, such as EUNIS and Natura2000, as required by the INSPIRE directive.

The portal must use this backbone to obtain extended input species information prior to searching for datasets among biodiversity data providers.

2.6 Connection with GEOSS [SPEC-05]

EU BON aims at providing European biodiversity data for GEO BON; therefore it will be connected to the GEO Discovery and Access Broker (GEO DAB, http://www.eurogeossbroker.eu). That is, the EBP must implement several provider services according to the GEO DAB API (http://api.eurogeoss-broker.eu/docs/index.html).

The GEO DAB is based on the message broker GI-cat. This broker is able to provide a common output to be consumed by other GI-cat instances, thus constituting a possible connection between the EU BON platform and GEOSS. Nevertheless, as recommended by the INSPIRE directive, OGC-compliant outputs will be generated, and in particular an OGC-CSW 2.0.2 common output.

2.7 Integration with DataONE network [SPEC-06]

DataONE has established particular requirements for participating in its infrastructure: becoming a Member Node or using the Investigator Toolkit. EU BON must be prepared to guarantee future integration with the DataONE network, thus becoming a DataONE Member Node.

The integration as a Member Node (MN) requires the implementation of a set of APIs, depending on the type of Member Node to implement:
- Tier 1: Read, public objects (MNCore and MNRead APIs). Provides read-only access to publicly available objects (science data, science metadata and resource maps), along with core system API calls for monitoring and logging.
- Tier 2: Access control (MNAuthorization API). Allows access to objects to be controlled via access control list (ACL) based authorization and certificate-based authentication.
- Tier 3: Write (MNStorage API). Provides write access (create, update and archive objects). Allows using DataONE interfaces to create and maintain objects on the MN.
- Tier 4: Replication target (MNReplication API). Allows the DataONE infrastructure to use available storage space on the MN for storing copies of objects that originate on other MNs, based on the Node Replication Policy.

Each tier implicitly includes all lower numbered tiers. For instance, a Tier 3 MN must implement the Tier 1, 2 and 3 methods.

The EU BON platform should implement at least Tier 1.

2.8 Geospatial capabilities [SPEC-07]

The portal will use geospatial representation both as a way to visualize search results and as a search filter. For filtering, the user could draw bounding boxes or polygons on a map. Following the INSPIRE recommendations, OGC standards will be used, e.g. the ISO 19115 geospatial metadata model and OGC-CSW services.

Map visualisation must include not only occurrence and geographical coverage, but also the representation of remote sensing data layers. CartoDB must be assessed as the main GIS and web mapping tool for the EU BON Biodiversity Portal, serving as the map visualization and interaction component.
It provides advanced GIS data visualization, including observation occurrences, timelines and animations through time. Nevertheless, other GIS tools may be included as advanced GIS visualisation alternatives.

2.9 Potential users and roles [SPEC-08]

Several user roles and their permissions have been differentiated so far:
- Public or anonymous user: simple search use cases, simple data visualization and export functionality.
- Researcher: more detailed information, data analysis, charts, etc.
- Policy maker: analytical results, EBV estimation, charts, etc.
- Administrator: data provider administration, portal content administration, portal and middleware analytics, etc.

Particular user interfaces and interaction capabilities for each user role will be defined in the requirement analysis phase and improved on each portal release.

2.10 Content managing [SPEC-09]

As stated in the DoW, the EBP must give fast access to EU BON integrated data and products. Since a portal is already deployed for publishing EU BON news and deliverables, among other products, the EBP must facilitate uploading or linking to those products, documents, tools, guidelines and other relevant information. Therefore, the portal could act as a content management system or at least be able to link to the EU BON portal (www.eubon.eu).

2.11 Dataset quality specifications [SPEC-10]

The rapid growth of biodiversity data collected by volunteers and amateur naturalists, known as Citizen Science (CS), offers the chance to gain extensive (if not huge) amounts of data about biodiversity. In parallel, a pressing need has emerged from the scientific community, which doubts the reliability and quality of public-sourced data, to establish and provide methods for quality assurance (QA). The 'Biodiversity Portal Specifications Questionnaire' shows the different approaches and expectations regarding finding, using and annotating QA data.
Hence, within the context of the EU BON Portal specifications, the following topics for discussion regarding quality assurance of observational data are proposed:

1. The convenience of annotating the data with quality-oriented metadata.
2. What information about the quality of data within (registered) datasets is needed: discoverable, identifiable, available, useful, filtered, acceptable, etc.
3. Whether there are feasible actions that can or should be taken by WP1 and WP2 to enhance and annotate quality-controlled data discoverable by the portal, e.g. promoting, developing and recommending methods to enhance QA data.
4. Whether we need, or are expected, to recommend standards or strategies for quality control and annotation, given the lack of standards and the range of methods and tools (techniques and software) available to improve, evaluate, validate and facilitate, for example, the accuracy of CS-based data.

3. Architectural Design

3.1 Current architectural proposal

In line with the principles and concepts of LifeWatch (D2.1, Recommendation 7), EU BON will follow the principles of Enterprise Application Integration in a service-oriented architectural approach. The system will be centred on a common middleware layer that will connect heterogeneous data providers, orchestrating the messages returned by each one.

The EU BON Biodiversity Portal will act as a client of a broader architecture that will connect several heterogeneous providers: biodiversity data/metadata providers, genetic data providers, taxonomic providers and so forth. The majority of them currently supply a web service or REST interface.

Figure 1. Middleware-based architecture for EU BON

The middleware layer will be composed of an Enterprise Service Bus with data service adapters and the message broker.
In particular, the GI-cat message broker, developed by EuroGEOSS, must be evaluated as a candidate for that purpose, due to its ability to integrate disparate metadata catalogues through standardized interfaces and to establish a bridge between EU BON and the GEOSS Common Infrastructure (GCI).

3.2 Architectural revision using GI-cat

GI-cat is a specialized brokering system developed for the GEOSS portal within the context of the EuroGEOSS project. It comprises a Java EE application that acts as a broker among standardized sources and catalogues, and a set of libraries and connectors that provide standardised input and output capabilities.

As a specialized broker, GI-cat can be configured as a direct-access mediation service or as a metadata harvester. GI-cat uses OGC-CSW services to distribute the functionality and return search results. The interaction of EML with CSW services must be studied in detail before testing GI-cat as a candidate for the EU BON message broker.

The GI-cat common data model is based on ISO 19115 plus extensions. The distributor translates the provided data and metadata to this data model, and exposes the information by translating it again to several catalogue formats (e.g. CS-W, OAI-PMH, OpenSearch). Consuming EML interfaces directly may require the development of a new GI-cat accessor.

Figure 1. The GI-cat broker system featuring some catalogue query interfaces (right) and several backend mediation components (source: Nativi et al. 2009).

After reviewing the recommendation to use GI-cat as the brokering system for EU BON, as stated in the MS241 document “Specification for the registry and metadata catalogue”, we have found that, as a specialized broker, GI-cat is a powerful solution for integrating standardized sources and for providing an interface to GEOSS, but it lacks connectors for common WSDL services and for specific input sources needed for EU BON purposes (e.g. GenBank, EU-Nomen, WoRMS, etc.).
We therefore propose a revision of the architecture: a hybrid solution that integrates GI-cat inside a larger ESB-based SOA architecture.

Figure 2. EU BON middleware platform using GI-cat
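
As a rough illustration of the hybrid approach, the sketch below decides, for each source, whether it would be connected through the broker (standardised catalogue protocols) or through a dedicated service adapter on the bus. Everything in it is hypothetical: the `Source` type, the protocol labels, the contents of `BROKER_PROTOCOLS` and the two component names are assumptions made for this example and do not describe GI-cat's actual configuration or API.

```python
from dataclasses import dataclass

# Protocols assumed, for this sketch only, to be handled by the broker.
BROKER_PROTOCOLS = {"CSW", "OAI-PMH", "OpenSearch"}


@dataclass(frozen=True)
class Source:
    """A data or metadata source and the protocol it exposes."""
    name: str
    protocol: str


def route(source):
    """Return the middleware component that should connect to the source."""
    if source.protocol in BROKER_PROTOCOLS:
        return "broker"
    return "esb-adapter"


def plan(sources):
    """Group source names by the component that will connect to them."""
    grouped = {"broker": [], "esb-adapter": []}
    for source in sources:
        grouped[route(source)].append(source.name)
    return grouped
```

For example, `plan([Source("LTER Europe", "CSW"), Source("GenBank", "REST"), Source("WoRMS", "WSDL")])` would assign LTER Europe to the broker and the other two sources to bus adapters, mirroring the split described above.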