Project Acronym: RDA Europe
Project Title: Research Data Alliance Europe
Project Number: 312424
Deliverable Title: First year report on RDA Europe analysis programme
Deliverable No.: D2.4
Delivery Date:
Authors: Herman Stehouwer, Diana Hendrickx

ABSTRACT
All detailed analysis results of the data architectures and organizations of the communities studied and a description of possible generalizations towards solutions for integration and interoperability to foster the RDA Europe Forum and RDA discussions.

RDA Europe (312424) is a Research Infrastructures Coordination and Support Action (CSA) co-funded by the European Commission under the Capacities Programme, Framework Programme Seven (FP7).

DOCUMENT INFORMATION

PROJECT
Project Acronym: RDA Europe
Project Title: Research Data Alliance Europe
Project Start: 1st September 2012
Project Duration: 24 months
Funding: FP7-INFRASTRUCTURES-2012-1
Grant Agreement No.: 312424

DOCUMENT
Deliverable No.: D2.4
Deliverable Title: First year report on RDA Europe analysis programme
Contractual Delivery Date: 09/2013
Actual Delivery Date: 10/2013
Author(s): Herman Stehouwer, MPI-PL; Diana Hendrickx, UM
Editor(s): <Insert deliverable editor(s) – Name, Surname, Org Short Name>
Reviewer(s): Natalia Manola & Giuseppe Fiameni
Contributor(s): Constantino Thanos
Work Package No. & Title: WP2 Access and Interoperability Platform
Work Package Leader: Peter Wittenburg – MPI-PL
Work Package Participants: CSC, Cineca, MPG, EPCC, CNRS, STFC, UM, ACU, ATHENA, CNR
Estimated Person Months: 16
Distribution: public
Nature: Report
Version / Revision: 1.0
Draft / Final: Draft
Total No. of Pages (including cover): 47
Keywords:

DISCLAIMER
RDA Europe (312424) is a Research Infrastructures Coordination and Support Action (CSA) co-funded by the European Commission under the Capacities Programme, Framework Programme Seven (FP7). This document contains information on RDA Europe (Research Data Alliance Europe) core activities, findings and outcomes, and it may also contain contributions from distinguished experts who contribute as RDA Europe Forum members. Any reference to content in this document should clearly indicate the authors, source, organization and publication date. The document has been produced with the funding of the European Commission. The content of this publication is the sole responsibility of the RDA Europe Consortium and its experts, and it cannot be considered to reflect the views of the European Commission. The authors of this document have taken every available measure to ensure that its content is accurate, consistent and lawful. However, neither the project consortium as a whole nor the individual partners that implicitly or explicitly participated in the creation and publication of this document hold any responsibility for consequences that might occur as a result of using its content. The European Union (EU) was established in accordance with the Treaty on the European Union (Maastricht). There are currently 27 member states of the European Union. It is based on the European Communities and the member states' cooperation in the fields of Common Foreign and Security Policy and Justice and Home Affairs. The five main institutions of the European Union are the European Parliament, the Council of Ministers, the European Commission, the Court of Justice, and the Court of Auditors (http://europa.eu.int/). Copyright © The RDA Europe Consortium 2012.
See https://europe.rd-alliance.org/Content/About.aspx?Cat=0!0!1 for details on the copyright holders. For more information on the project, its partners and contributors please see https://europe.rd-alliance.org/. You are permitted to copy and distribute verbatim copies of this document containing this copyright notice, but modifying this document is not allowed. You are permitted to copy this document in whole or in part into other documents if you attach the following reference to the copied elements: "Copyright © The RDA Europe Consortium 2012." The information contained in this document represents the views of the RDA Europe Consortium as of the date it is published. The RDA Europe Consortium does not guarantee that any information contained herein is error-free or up to date. THE RDA EUROPE CONSORTIUM MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, BY PUBLISHING THIS DOCUMENT.

GLOSSARY
RDA Europe: Research Data Alliance Europe
OAI-PMH: Open Archives Initiative Protocol for Metadata Harvesting
CSC: Finnish IT Centre for Science
UM: Maastricht University
MPI-PL: Max Planck Institute for Psycholinguistics
CLST: Centre for Language and Speech Technology
RU: Radboud University
CNRS: Centre national de la recherche scientifique
ENVRI: Common Operations of Environmental Research Infrastructures
TNO: Dutch Organisation for Applied Research
E-IRG: E-Infrastructure Reflection Group
EEF: European E-Infrastructure Forum
ESFRI: European Strategy Forum on Research Infrastructures
ACU: Association of Commonwealth Universities
CERN: European Organization for Nuclear Research
MPG: Max Planck Gesellschaft

TABLE OF CONTENTS
Executive Summary
1 Introduction
2 RDA/Europe Forum
3 Trends and Gaps
  #1 Usage of data management
  #2 Data models
  #3 Data preservation anticipation
  #4 Availability and quality of available data / metadata
  #5 Metadata
  #6 Discoverability of data
  #7 Data reuse is common, but dependent on field
  #8 Use of cloud / grid computing
Annex 1: Interview Reports RDA/Europe
  Community Data Analysis: EMBL-EBI – molecular databases
  Community Data Analysis: Genedata
  Community Data Analysis: TNO
  Community Data Analysis: ENVRI
  Community Data Analysis: Svali
  Community Data Analysis: EISCAT 3D
  Community Data Analysis: Public Sector Information ENGAGE
  Community Data Analysis: ESPAS
  Workshop: On-line databases: from L-functions to combinatorics
  Community Data Analysis: Huma-Num
  Community Data Analysis: INAF centre for Astronomical Archives
  Community Data Analysis: ML-group
  Community Data Analysis: Computational Linguistics Group
  Community Data Analysis: CLST
  Community Data Analysis: Donders Institute
  Community Data Analysis: Meertens Institute
  Community Data Analysis: Microbiology
  Community Data Analysis: Arctic Data

Executive Summary

In this report we show the results of the first year of WP2.3, the analysis programme. One of its main components is the presence in the annex of all the interview reports produced under this programme. The following organizations and communities have been interviewed and reported on:

Institutes/companies:
- TNO (general semi-commercial research institute, toxicogenomics case)
- Genedata (genetic data)
- Donders Institute (brain research)
- Meertens Institute (dialects and linguistic history in the Netherlands)
- Arctic Data Institute (NIOZ)
- INAF centre for Astronomical Archives
- EMBL-EBI (molecular databases)

Research groups/departments:
- Machine Learning research group
- Computational Linguistics research group
- Language and Speech Technology group
- Microbiology research group

Other:
- Math community (on L-functions and combinatorics)
- ENVRI (large environmental community)
- Svali (Arctic Lands and Ice)
- EISCAT 3D (3D imaging radar for atmospheric and geospace research)
- ENGAGE (Public Sector Information)
- ESPAS (near-Earth space research community)
- Huma-Num (Very Large Research Infrastructure on numerical science for SSH)

Interviewing will continue in the second year of the project and the report will be updated to reflect new insights and to include all interview reports. The analysis of the interview reports shows that we still have a long way to go before good data stewardship is commonplace. Furthermore, it underlines the importance of having good metadata. Good metadata enables discoverability and reuse of the data.
Based on the analysis in the report we make a number of concrete suggestions; they are listed in short form below:
- Use basic data management as outlined in the e-IRG white paper
- Consider data management before data collection takes place
- Document your data with high-quality metadata
- Use persistent identifiers
- Ensure discoverability of the data
- Share your data
- Researchers should be more familiar with the available possibilities of cloud computing, grid computing and HPC
- We recommend the use of computer-actionable data management policies

1 Introduction

In this report we give an overview of the first year of the RDA Europe Analysis programme, in which we have conducted many interviews in order to obtain a better and broader understanding of current data practices. We focus specifically on finding gaps and good existing solutions. This report will be updated as soon as the programme is completed during the second year. The report contains a fair amount of additional material that is relevant. In Annex 1 we provide all the analysis reports that are finished at the time of writing. This report will be updated as new interviews come in. In this report we will first describe the RDA/Europe Forum, as the forum is part of the target audience of this report. We will then describe trends and gaps that we note across multiple reports, and our report ends with some concrete proposals to improve data exchange, to foster the RDA Europe Forum and to foster RDA discussions.

2 RDA/Europe Forum

The process to establish the RDA/Europe Forum was started by the need to be engaged at the European level. John Wood (ACU) discussed the possible membership and the types of key stakeholders with Kostas Glinos (DG Connect) and Carlos Morais-Pires (DG Connect). When the iCORDI proposal was written it was not exactly clear what was meant by the RDA/Europe Forum (HLSF at the time); different views existed on this. After the granting of the project it was obvious that a group of people representing organizations relevant to the European scenario was needed, and furthermore a group of people that could give political backing to the needs of the project. The Commission has had the e-IRG, the EEF (http://www.einfrastructure-forum.eu/) and different ESFRI groups, but none has been really effective in bringing together the e-Infrastructure service providers and researchers. The RDA Europe Forum and the RDA Science workshops are also aligned with this purpose. We have established the RDA/Europe Forum by inviting organisations to send a representative. The current members of the forum (and the organisations they represent) are:
- P. Ayris (LIBER)
- Jens Vigen (EIROforum)
- Norman Wiseman (Knowledge Exchange)
- Peter Linton (EIF)
- Donatella Castelli (ERF)
- Martin Vingron (MPG)
- Marja Makarow (ESF)
- Norbert Lossau (LERU)
- Sandra Collins (European Academies)
- Paul Boyle (Science Europe)

The RDA/Europe Forum is organized by John Wood (ACU), Leif Laaksonen (CSC), Peter Wittenburg (MPI-PL), and Herman Stehouwer (MPI-PL). The forum has had one meeting in the first year of the RDA Europe project, on 7 March 2013. The second meeting will take place towards the end of November 2013. We expect to continue with around two meetings per year. At the first HLSF meeting the RDA/Europe project, the RDA, the (then) current state of RDA, and the RDA/E Forum itself were discussed. Several recommendations were made by the forum.
These recommendations include the following:
- "Organizations should be able to endorse the RDA work." The RDA work is performed by individuals in working and interest groups, and the outcomes from these groups must be adopted. All groups are required to have some form of adoption strategy; however, the forum felt it would be useful if there were a way for organisations to state their general support for RDA outputs.
- "Setting up an international legal entity will be a lot of work." Several members of the forum have experience in setting up international legal entities. Their experience is that setting up such an entity is a lot of work and takes around a year.

In the upcoming RDA/E Forum meeting, we will discuss the following items:
- The RDA legal entity
- How to strengthen the utilization of RDA output in Europe
- Liaising more broadly in Europe
- Industrial connections
- The upcoming Plenary meeting (March 2014, Dublin)
- The upcoming RDA/Europe Science workshop
- The upcoming G8+O6 meeting; briefing and feedback

This deliverable will be presented to the forum to provide another overview of the status quo and to allow them to push the concrete proposals made.

3 Trends and Gaps

Here we report a picture of the current practices, technologies, models and standards adopted by the interviewed data organizations. Such a picture is of interest to the RDA activities and can help the working groups. The goal of the interviews was to establish such a picture, to find gaps, and to find existing solutions that could be generalized. These findings are one input to the RDA/Europe Forum, can also inform new RDA working groups, and are input to the RDA process in general.

#1 Usage of data management

Data management is the overall practice of dealing well with data, e.g. defining the lifecycle of pieces of data and defining how data can be found inside the organization. A clear trend that we observe from these interviews is that the quality of the data management is highly variable. Obviously the data infrastructures practise good data management, but once you drop down to the institutional or research group level the usage of data management varies wildly. There is also a need for archives to store the growing amounts of data. In some fields, data are in the order of petabytes and larger.

The underlying data management systems used differ. Simple relational databases are popular in a large number of the interviewed organizations for storing their data. The data in the database are usually linked to the original data in the file system. No clear trend as to data management systems is present. In most cases the system is either driven by human actions or custom built (e.g. a collection of scripts). The use of computer-actionable policy is rare. Based on this point we recommend the encouragement of basic data management as outlined in the e-IRG white paper. To be precise: use standard data formats where possible, if there is no clear need for a new definition; state clearly in the metadata which data format is used; and have metadata.

#2 Data models

With regard to data models we can say that the models used depend on the field interviewed; this is obviously not surprising. For measurements and analyses there is a clear practice of storing these in some XML format. Often data are stored as a time series, either in the raw data or in a relational database, e.g.
the NIOZ stores large time series of DTP (depth, temperature, pressure) measurements, often much richer than just DTP. The use of simpler models (such as array-based) or unusual models (such as graph-based) is limited; to give an example, the biomedical community at EMBL-EBI uses several array-based data models.

#3 Data preservation anticipation

Several interviewees mentioned that data storage, exchange and preservation must be considered in a preparatory phase and not as an afterthought. That is, when determining what to measure and store one also has to consider what to do with the data. When producing data, long-term aspects have to be taken into account. Data preservation is at the moment problematic, but it is clear that, going forward, communities think of preservation with respect to their future requirements. Based on this point we recommend considering data management at an early stage, before any data collection takes place.

#4 Availability and quality of available data / metadata

For several fields, there is a lack of availability of open data. Several interviewees mention that where metadata is available for data, its quality is usually not good enough to find valuable data based on it. There is also a lack of persistent identifiers for data / metadata, and the (meta)data are in different formats. There is a need for standardization. Based on this point we can recommend that data be better documented with high-quality metadata. Persistent identifiers have to be used.

#5 Metadata

Most interviewed organizations used a clear, domain-specific, XML-based metadata standard, though almost all used different ones. However, metadata is not always produced for all data. It is clear, however, that OAI-PMH is the protocol that is used for metadata exchange. Many of the organizations interviewed underlined the importance of metadata for discovery.

#6 Discoverability of data

Several participants on the institutional level (e.g. Meertens, NIOZ) mentioned that the discoverability of their data is essential to encouraging external reuse. Larger participants, i.e. the research infrastructures, also take care that the data are discoverable. There is a need for integrated data discovery, but this is hampered by the lack of interoperability between different data sources. Based on this point, data should be discoverable and interoperable, first through quality metadata, and second through the use of data standards. Persistent identifiers have to be used.

#7 Data reuse is common, but dependent on field

Data reuse is inherent to the approach taken by the larger data infrastructures that were interviewed. Furthermore, smaller groups (e.g. CL, CLST, Meertens) all indicated that for them data reuse was extremely common. However, the small groups often do not have the infrastructure to advertise the available data, so the reuse is often contained within the institute itself. Once the infrastructure is available, data reuse is also common with third parties. A caveat here is that in some fields other than those interviewed here, e.g. linguistics, data sharing and reuse is less common. Furthermore, in some cases the data contain sensitive information and their sharing is not possible, for instance when talking about patient data or other medical measurements on subjects. Based on this point we recommend the encouragement of data sharing in all fields, unless data are too sensitive, i.e. privacy issues are involved. Concretely, we recommend that metadata always be available, so that it is at least clear which data exist.

#8 Use of cloud / grid computing

There is no general trend of using cloud computing for storing and analysing large amounts of data. Based on this point we recommend that researchers become more familiar with the available opportunities for cloud computing to process the growing amounts of data. Furthermore, we recommend expressing data management policy as computer-actionable rules, which reduces the possibility of human error, saves a lot of effort and enforces consistency with the rules; a sketch of what such rules could look like follows below.
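As a minimal illustration of that last recommendation, the sketch below expresses a few data management rules as code that can be checked automatically against a dataset record. The record structure, rule names and thresholds are hypothetical choices made for this example rather than anything prescribed by the interviews; rule-oriented data management systems (e.g. iRODS) provide this kind of mechanism natively.

```python
# Minimal, illustrative sketch of a computer-actionable data management policy.
# The record structure, rules and thresholds are hypothetical examples only.
import hashlib
from pathlib import Path

REQUIRED_METADATA = ("title", "creator", "format", "persistent_identifier")
MIN_REPLICAS = 2  # assumed policy: every dataset is kept at two sites


def check_dataset(record: dict) -> list:
    """Return a list of policy violations for one dataset record."""
    violations = []
    # Rule 1: discovery metadata must be present and non-empty.
    for field in REQUIRED_METADATA:
        if not record.get("metadata", {}).get(field):
            violations.append("missing metadata field: " + field)
    # Rule 2: enough replicas must be registered.
    if len(record.get("replicas", [])) < MIN_REPLICAS:
        violations.append("fewer than %d replicas" % MIN_REPLICAS)
    # Rule 3: if a local copy exists, it must match the registered checksum.
    local = record.get("local_path")
    if local and Path(local).is_file():
        digest = hashlib.md5(Path(local).read_bytes()).hexdigest()
        if digest != record.get("checksum"):
            violations.append("checksum mismatch")
    return violations


if __name__ == "__main__":
    example = {
        "metadata": {"title": "CTD time series", "creator": "Example institute",
                     "format": "netCDF", "persistent_identifier": "hdl:1234/abc"},
        "replicas": ["site-a", "site-b"],
    }
    print(check_dataset(example) or "policy satisfied")
```

Running such checks automatically over every registered dataset, rather than relying on people to remember them, is what distinguishes a computer-actionable policy from a purely human-enforced one.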
Annex 1: Interview Reports RDA/Europe

Community Data Analysis: EMBL-EBI – molecular databases

Goals of analysis
By interviewing EMBL-EBI, we hope to learn more about data storage, data management and data warehouses. Analysis provided by Maastricht University (UM).

Description of the community
EMBL-EBI provides freely available data from life science experiments, performs basic research in computational biology and offers an extensive user training programme, supporting researchers in academia and industry.

Goals of the community with respect to data
- provide freely available data and bioinformatics services;
- coordinate biological data provision throughout Europe.

Data types
EMBL-EBI provides access to data from life science experiments. Submitted data files are mostly in TAB-delimited text format, but some are in XML. For ArrayExpress, (transcriptomics) data are submitted by the user in MAGE-TAB format, or via the MIAMExpress submission tool. The following file formats are used:
- .gpr, .txt, CEL plus EXP (raw data)
- .txt, CHP (normalized data)
- .txt (combined data file)
Metadata are submitted in tab-delimited files when using MAGE-TAB, and via a web interface when using MIAMExpress.

Data flow
Data are collected and processed by external institutes. The data are submitted into one of the EMBL-EBI databases. A team of curators receives the submissions and may curate the files manually. Curated data can be used by other research institutes or companies.

Data organization
EMBL-EBI adopts data management systems based on relational databases (Oracle / MySQL). Data are stored in data warehouses. Examples are ArrayExpress (transcriptomics: microarrays, RNA-seq), PRIDE (proteomics: mass spectrometry), ENA-SRA (genomics: sequencing, next-generation sequencing data), MetaboLights (metabolomics: NMR, MS) and project-specific data warehouses such as the metagenomics portal and diXa (toxicology).

Data exchange
The computer environment of EMBL-EBI is a centralized system in which multiple computers communicate through a network. EMBL-EBI mirrors data from other databases (e.g., ArrayExpress manages GEO (Gene Expression Omnibus) data, a similar database at NCBI). The data are transformed to the format used by EMBL-EBI and loaded into EMBL-EBI databases. EMBL-EBI has a Biological Sample database (BioSamples) for integration of the biological sample dimension of different -omics data (transcriptomics, metabolomics, proteomics).

Data services and Tools
EMBL-EBI provides several databases for experimental data from the life sciences. A recent EBI website redesign resulted in more consistent interfaces of the different EBI databases; this process is still ongoing.
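Relating back to the Data types subsection above: because MAGE-TAB submissions describe samples in plain tab-delimited tables, such files can be inspected with generic tooling. The sketch below is a minimal, assumed illustration using only the Python standard library; the file name and column headings are hypothetical stand-ins and do not reproduce the exact ArrayExpress templates.

```python
# Illustration only: read a tab-delimited, MAGE-TAB-style sample table.
# "samples.sdrf.txt" and the column names below are hypothetical examples.
import csv

with open("samples.sdrf.txt", newline="") as handle:
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Each row links a sample to its annotations and raw data file.
        print(row.get("Source Name"),
              row.get("Characteristics[organism]"),
              row.get("Array Data File"))
```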
If users want to submit data into one or more of the EMBL-EBI databases, they have to register by creating an account for each database. Further, EMBL-EBI provides tools to support the development and use of ontologies. The Experimental Factor Ontology (EFO) provides a systematic description of many experimental variables available in EBI databases and for external projects. It combines parts of several biological ontologies. The scope of EFO is to support the annotation, analysis and visualization of data.

Legal and Policy Issues
All data are open. If users submit data, there is an option to keep the data private for a certain period before they become public. This option is normally used when data accompany a publication that is under review, in which case manuscript reviewers are able to access the dataset before it becomes public. There are no restrictions on data re-use. An exception is the European Genome-phenome Archive (EGA), which serves as a broker of patient data and where data access is carefully managed.

Limitations / issues
EMBL-EBI does not adopt a generic service for persistent data identifiers; however, it guarantees the persistence of data stored in production repositories. EMBL-EBI currently does not have a database that connects data in different databases for multi-omics studies.

Proposed actions
EMBL-EBI, through the diXa project, is currently talking to EU infrastructures that provide services for persistent identifiers (EUDAT). EMBL-EBI is working on a BioStudy database that connects data in different databases for multi-omics studies.

Community Data Analysis: Genedata

Goals of analysis
By interviewing Genedata, we hope to learn more about data services and data integration. Analysis provided by Maastricht University (UM).

Description of the community
Genedata is a bioinformatics company that provides scalable enterprise software solutions and consultancy.

Goals of the community
- providing consultancy for data analysis in the life sciences;
- developing a software platform for analyzing and visualizing large experimental data sets generated from life science experiments.

Data types
Genedata develops software platforms for the management and analysis of data from life science experiments. The data are from (among others) -omics platforms, clinical labs and production environments. Input data formats are vendor (platform) dependent; parsers are available for all major platforms. Metadata formats are either tab-delimited or standard formats (e.g. MAGE-ML, ISA-Tab). Data are stored internally in relational databases (Oracle) or binary files. Data export can be done in tab-delimited files (e.g. GGF genomic feature format), Excel files, binary files, or PDF reports.

Data flow
Genedata supports different stages of the data life cycle: experimental design, data preprocessing, quality control, data management, data analysis, data visualization, and result interpretation. For data preprocessing, quality control and data analysis, the client can use one of the software modules developed by Genedata or consult Genedata for support.

Data organization
Genedata seamlessly integrates with in-house data management systems (e.g. LIMS) as well as custom relational databases (Oracle/SQL). Open file formats / open interfaces are used as much as possible. Genedata's own formats are published. All formats are documented.

Data exchange
The software system is based on a client-server architecture.
The programming language for the client and server is Java, so the system is mostly platform independent. The architecture is a classical three-tier architecture: application server, database server (Oracle), Windows client (WebStart). The vast majority of the data stay on the server. Genedata adopts a symmetric multiprocessing (SMP) architecture and grid technologies (Sun Grid Engine / Open Grid Scheduler / DRMAA).

Data services and Tools
Genedata uses the Lightweight Directory Access Protocol (LDAP) and Active Directory (AD) for authentication / authorization. Genedata provides a huge list of algorithms / methods for data discovery, data analysis, data mining and data visualization. Genedata software allows for automation of processes, e.g. running a complete data processing pipeline automatically when new data arrive in a folder that is monitored by an agent. The workflow system can also be deeply integrated into external processes by a command-line client that allows for fully unattended execution of workflows. Genedata supports data preservation by means of process protocols, reporting, and archiving. The computer system of Genedata is a workflow management system. Genedata supports multi-user, multi-site collaboration through its client-server infrastructure. Genedata uses persistent identifiers: CAS numbers (for chemical compounds), Ensembl Gene IDs, RefSeq Gene IDs, and GUIDs (global unique identifiers, for lab systems). Genedata adopts ontologies; those most often used by customers are the Gene Ontology (GO), ontologies provided by NextBio, the Sequence Ontology (SO), and taxonomies. The software can be configured to work with any of those ontologies. Data are protected by security features of Oracle, e.g. POSIX from Unix. Transmission is fully encrypted (SSH / HTTPS).

Legal and Policy Issues
Genedata is a closed-source company (source code is not made available to the community), but it uses open standards for its interfaces and protocols. Algorithms are documented / referenced, and detailed processing reports are generated automatically (PDF reports). Genedata does not adopt restrictions on data re-use.

Limitations / issues
Genedata currently does not adopt cloud computing, but this is planned mid-term.

Proposed actions (if mature enough)
A cluster computing solution will be available by the middle of this year.

Community Data Analysis: TNO

Community: Netherlands Toxicogenomics Centre (NTC); interview with TNO as a partner of NTC.

Goals of analysis
From interviewing TNO, one of the partners of NTC, we hope to learn more about data organization within a consortium of research institutes and companies. Analysis provided by Maastricht University (UM).

Description of the community
NTC is a collaboration between universities, research institutes and companies in the Netherlands. TNO is a research institute and one of the partners of NTC.

Goals of the community
- employ toxicogenomics to increase the basic understanding of toxicological mechanisms;
- develop new methods that better chart the risk of chemical compounds;
- develop alternatives to animal testing;
- advance collaboration with external partners.

Data types
NTC has experimental data from in vitro and in vivo life science experiments (transcriptomics, metabolomics, ...). Metadata formats (for omics) are SimpleTox (from ArrayTrack) and ISA-Tab.

Data flow
Exposure of cell lines / animals to toxic compounds is measured at different time points / for different doses of the toxic compound.
The different types of data (transcriptomics, metabolomics, ...) are stored in a raw data file. R scripts generate data and graphs for quality control. The data are manually curated after inspection of the QC results. Furthermore, metadata are manually curated as well. Data are preprocessed and normalized, and finally analyzed with several data mining and statistical tools. After publication, data will be made available to other research projects (online distribution).

Data organization
NTC adopts data management systems based on a relational database (MySQL).

Data exchange
The computer environment of NTC is a centralized system with 4 servers, located in Maastricht, that communicate with each other. A backup server is located at TNO. Information about the NTC research projects can be obtained from the NTC website. NTC uses software tools (MetaCore, Ingenuity) to extract data from outside sources. NTC utilizes web services for data exchange with external data sources as well.

Data services and Tools
For data protection, the following techniques and tools are used: SSH secure shell, user and group authentication, firewalls and physical protection. NTC uses persistent identifiers: Entrez Gene IDs, KEGG IDs, CAS numbers. NTC uses BioPortal, a portal that provides access to commonly used biological ontologies. NTC applies bioinformatics tools for data integration and web services (PubChem, KEGG) for retrieving external information. Data analysis techniques adopted by NTC are, among others: classification, clustering, statistics, functional enrichment, network analysis, quality control, and normalization. Data are visualized in heatmaps, PCA plots, line/scatter plots, boxplots, network visualizations, etc. NTC also develops its own tools for data mining, data analysis and data visualization (R scripts).

Legal and Policy Issues
NTC makes data available through data platforms (GEO, ArrayExpress, ...). Within NTC, data can be re-used on demand. Outside NTC, data will be made available after publication. Users of the data have to collaborate with NTC.

Community Data Analysis: ENVRI

Goals of analysis
From the interview of the ENVRI project we hope to get a sophisticated understanding of the data organization within a large environmental community in Europe, which combines a number of ESFRI research infrastructures, a few non-ESFRI projects and partner organizations. What are common data requirements? What are the dissimilarities? What are general and specific data solutions and challenges? The interview covers 4 of the 6 key partners: ICOS, LifeWatch, EPOS and EMSO, as described below. Analysis provided by CSC.

Description of the community
The ENVRI project (Common Operations of Environmental Research Infrastructures) is a collaboration conducted within the European Strategy Forum on Research Infrastructures (ESFRI) Environmental Cluster. The ESFRI Environmental research infrastructures involved in ENVRI include:
The ESFRI Environmental research infrastructures involved in ENVRI including: rdaeurope@rd-alliance.org | europe.rd-alliance.org 17 18 ICOS, European distributed infrastructure dedicated to the monitoring of greenhouse gases (GHG) through its atmospheric, ecosystem and ocean networks, http://www.icos-infrastructure.eu/; EURO-Argo, European contribution to Argo, which is a global ocean observing system, http://www.euro-argo.eu/; EISCAT-3D, European new-generation incoherent-scatter research radar for upper atmospheric science, http://www.eiscat3d.se/; LifeWatch, an e-science Infrastructure for biodiversity and ecosystem research, http://www.lifewatch.com/; EPOS, European Research Infrastructure on earthquakes, volcanoes, surface dynamics and tectonics, http://www.epos-eu.org/; EMSO, European network of seafloor observatories for the long-term monitoring of environmental processes related to ecosystems, climate change and geo-hazards, http://www.emso-eu.org/management/. ENVRI also maintains close contact with the other not-directly involved ESFRI Environmental research infrastructures by inviting them for joint meetings. These projects are: IAGOS, Aircraft for global observing system and SIOS, Svalbard arctic Earth observing system. ENVRI IT community provides common policies and technical solutions for the research infrastructures, which involves the following organization partners: Cardiff University, CNRISTI, CNRS (Centre National de la Recherche Scientifique), CSC, EAA (Umweltbundesamt Gmbh), EGI, ESA-ESRIN, University of Edinburgh, and University of Amsterdam. Goals of the community with respect to data The central goal of the ENVRI project is to implement harmonized solutions and to draw up guidelines for the common needs of the environmental ESFRI projects, with a special focus on issues as architectures, metadata frameworks, data discovery in scattered repositories, visualization and data curation. This will to empower the users of the collaborating environmental research infrastructures and enable multidisciplinary scientists to access, study and correlate data from multiple domains for "system level" research. The collaborative effort will ensure that each infrastructure can fully benefit from the integrated new ICT capabilities beyond the project duration by adopting the ENVRI solutions as part of their ESFRI implementation plans. In addition, the result will strengthen the European contributions to GEOSS - the Global Earth Observation System of Systems. All the nine Social Benefit Areas identified and addressed by GEO-GEOSS will take advantage of such approach. Data types EMSO: The EMSO data infrastructure has been conceived to utilize the existing distributed network of data infrastructures in Europe and use the INSPIRE and GEOSS data sharing principles. A number of standards have been set forth that will allow for state-of-the-art transmission and archiving of data with the kinds of metadata recording and interoperability that allow for more straightforward use and communication of data. These standards include the Open Geospatial Consortium (OGC) Sensor Web Enablement (SWE) suite of standards, namely the OGC standards SensorML, Sensor Registry, Catalogue Service for Web (CS-W), Sensor Observation Service (SOS) and Observations and Measurements (O&M). OGC SensorML is an eXtensible Markup Language (XML) for describing sensor systems and processes. 
Following on progress from EuroSITES and others, a SensorML profile is being created that can be stored in a so-called Sensor Registry, which will act as a catalogue of each EMSO sensor. This dynamic framework can accommodate the diverse array of data and formats used in EMSO, including the addition of delayed-mode data.

Measurements: a non-exhaustive list of measurements that can be performed with seafloor observatory sensors is given below:
- Water conductivity, temperature, pressure, pH, Eh, alkalinity;
- Ground motion velocity and acceleration;
- Earth gravity acceleration and magnetic induction field;
- Geodesy and seafloor deformation (displacement);
- Gas and dissolved element concentrations;
- Sound velocity;
- Heat flow (temperature).

Metadata: EMSO collects metadata both on the physical sensors and observatories and on the data. Observatories are intended to be described by SensorML. Metadata on archived data sets are compatible with ISO 19115, DIF or the NetCDF (CF) specification.

EPOS: Raw measurements: continuous seismic waveform data in SEED format and corresponding metadata, 1-100s of TB; accelerometric waveform data in ASCII. Measurements / quality control data:
- Power Spectral Density: PSD of the background noise as a function of time, for selected frequencies; PSD database (PQLX) and Probability Density Function (PDF) representation (PQLX database).
- Magnitude: histograms of magnitude differences between station magnitude and VEBSN magnitude.
- Time residuals: time residual distribution for each station.

Metadata: metadata definitions are currently an important topic of discussion. Within seismology a task force will be established to define and store the concepts and the vocabulary terms for the metadata items. Dataless SEED is the current international standard format to describe instrument characteristics (derivative XML formats are also in use, but common agreement has not been reached yet). The main requirements for the next phase of the metadata definition can be listed as follows (EUDAT initiative):
- A simple 'flat' metadata standard for discovery (flat metadata means a single record with attributes, rather than a group of linked records, each with attributes and with relationships between the records);
- A structured (linked-entity) standard for context (relating the dataset to provenance, purpose, the environment in which it was generated, etc.);
- Detailed metadata standards for each kind of data to be co-processed.

The following standards are appropriate to support such a model: Discovery: DC; Contextual: CERIF (Common European Research Information Format) or ISO 19115; Detailed: individual standards depending on the type of dataset; for research datasets from large-scale facilities CSMD (e.g. http://www.ijdc.net/index.php/ijdc/article/view/149, see also PaNData, http://www.pan-data.eu/PANDATA_–_Photon_and_Neutron_Data_Infrastructure), for geospatial datasets INSPIRE, http://inspire.jrc.ec.europa.eu/, http://en.wikipedia.org/wiki/INSPIRE (as in ENVRI).

Derived/processed data, publications, software: earthquake catalogues represented in many different formats, from text-based to XML.

ICOS: The data hierarchy of ICOS differs between its two Thematic Centers (Atmospheric and Ecosystem). The data hierarchy in the ICOS Atmospheric Thematic Center (ATC) is divided into 4 levels, defined as:
- Level-0: Raw data (e.g. current, voltages) produced by each instrument;
- Level-1: Parameters expressed in geophysical units, for example GHG concentrations (e.g. ppm CO2). Level 1 is also divided into two sub-levels: Level 1.a, rapid delivery data (NRT, 24 hr), and Level 1.b, long-term validated data;
- Level-2: Elaborated products; for GHG concentrations this can be, e.g., gap-filling, selection, etc.;
- Level-3: Added-value products (to be defined with the AS PIs).

Metadata are provided by PIs via a graphical application developed at the ATC. Raw data are transferred daily to the ATC, where they are automatically processed. Those raw datasets are mainly ASCII files, depending on the instrument considered. A specialized processing chain is dedicated to each type of instrument deployed in an ICOS Atmospheric Station. The process involves the transformation of raw data (Level 0) into higher-level products. A level-1 ICOS atmospheric station continuously measures 18 parameters (among them greenhouse gases, meteorological parameters and planetary boundary layer height). Most data are continuous measurements. About 400 MB per level-1 ICOS Atmospheric Station is uploaded daily to the ATC. Considering that the ICOS Atmospheric network will comprise about 50 atmospheric observatories, the amount of data produced is estimated to be around 20 GB/day, i.e. 7.3 TB/yr. Note that this is an upper bound, since not all stations are going to be labelled as level-1 ICOS atmospheric stations. A data catalogue of produced datasets is not yet automatically available, but is intended to be.

The data hierarchy in the ICOS Ecosystem Thematic Center is divided into 5 levels, defined as:
- Level-0: Raw data;
- Level-1: First set of corrections applied to the raw data;
- Level-2: Consolidated half-hourly fluxes;
- Level-3: Standardized QA/QC and filtering applied to the half-hourly data;
- Level-4: Data gap-filled and aggregated at different resolutions;
- Level-5: Derived variables (calculated data products).

The data collected at the Ecosystem Sites are raw data at 10 Hz time resolution. These data need a first processing step to calculate greenhouse gas fluxes with a typical time resolution of 30 min. These fluxes are further corrected, filtered, gap-filled where necessary, and processed to retrieve additional variables.

LifeWatch: Raw measurements: these are mostly generated by sensors, resulting in data such as organism presence/absence/abundance, species identification, or physiological data (e.g. plant respiration). Measurements: long-term monitoring programmes are managed by established networks for terrestrial and for marine environments. Data are: species composition, biomass, phenology, decomposition, etc. Observations: (mostly human) observations of species presence (identification, date/time, spatial coordinates). Derived/processed data, publications, software: these are input data for other users.

Data flow
By examining the computational characteristics of the participating ESFRI environmental research infrastructures, we have identified 5 common subsystems: Data Acquisition, Data Curation, Data Access, Data Processing and Community Support. A typical data lifecycle spans these five subsystems. This lifecycle begins with the acquisition of raw data from a network of integrated data-collecting instruments (seismographs, weather stations, robotic buoys, human observations, etc.), which are then preprocessed and curated within a number of data stores belonging to an infrastructure or one of its delegate infrastructures.
These data are then made accessible to authorised requests by parties outside the infrastructure, as well as to services within the infrastructure. This results in a natural partitioning into data acquisition, curation and access. In addition, data can be extracted from parts of the infrastructure and made subject to data processing, the results of which can then be situated again within the infrastructure. Finally, the community support subsystem provides tools and services required to handle data outside of the core infrastructure and reintegrate them when necessary.

Data organization
The typical granularity of an EMSO data set is, for example, a given period of time (a month, the duration of an experiment) or a given instrumentation (see examples at PANGAEA or MOIST). Currently the following distinction is made among the data levels within EPOS: L0 - raw data; L1 - QC data; L2 - filtered data; L3 - research-level pre-processed data; L4 - research products. The organization of ICOS ATC and ETC data is based on the levels described above under data types. The ICOS Level-3 products, defined as "added value products", are still under consideration with the PIs, but will include datasets resulting from the aggregation of multiple lower-level ICOS products. LifeWatch data are of very different kinds: species information, species distributions, species abundance, biomass and distributions, species DNA sequences, genes, earth observation data (canopy cover etc.), species compositions, age distributions, radar data, etc.

Data exchange
Within EPOS, the data are received at the data centers in real time, through dedicated TCP/UDP connections to the sensors, adopting a widely known application-level protocol (SeedLink). For ICOS, data are uploaded daily from the ICOS Atmospheric stations to a dedicated FTP server at the ICOS ATC. Note that data can exceptionally be provided as an attachment to an email. ATC raw data are automatically ingested into a MySQL database and then processed. ICOS Ecosystem sites submit their raw data monthly to the ICOS ETC. In addition, preliminary half-hourly fluxes and meteorological data will be transferred automatically to the ETC in near real time (one day). LifeWatch data are generated by various external data providers at the European and global scale. LifeWatch is deploying (shared) data for analysis and modeling. Data reside at the external data providers, and LifeWatch data catalogues assist users in data discovery.

Data services and Tools
Most of the projects are still in their construction phase and their services are not yet fully operational. Services partially in operation and/or under implementation within EMSO:
- PANGAEA OAI-PMH for ESONET data in EMSO sites: harvesting test, integration into the ENVRI metadata catalogue, etc.;
- PANGAEA GeoRSS: embedding a GeoRSS feed;
- Ifremer SOS for EuroSITES oceanographic data in EMSO sites: getCapabilities, getObservation, check O&M format;
- PANGAEA SOS for INGV data in EMSO sites (via MOIST: moist.rm.ingv.it): getCapabilities, getObservation, check O&M format;
- MOIST OpenSearch for INGV data and metadata in EMSO sites: data and metadata search according to time, space or parameter;
- Common NetCDF metadata extraction and transformation service: metadata extraction;
- MOIST OAI-PMH for harvesting INGV data and metadata in EMSO sites: data and metadata harvesting.
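Several of the services listed above (and the metadata exchange trend noted under #5 in Section 3) rely on OAI-PMH harvesting. The sketch below shows, under stated assumptions, what a minimal harvest of Dublin Core records can look like using only the Python standard library; the endpoint URL is a placeholder rather than an actual PANGAEA or MOIST address, and a production harvester would also follow resumptionTokens and handle errors.

```python
# Minimal OAI-PMH harvesting sketch (Python standard library only).
# BASE_URL is a placeholder endpoint, not a real PANGAEA or MOIST address.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.org/oai"  # hypothetical endpoint
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
url = BASE_URL + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as response:
    tree = ET.fromstring(response.read())

for record in tree.iter(OAI + "record"):
    identifier = record.find(OAI + "header").findtext(OAI + "identifier")
    title = record.find(".//" + DC + "title")
    print(identifier, "-", title.text if title is not None else "(no title)")

# A complete harvester would repeat the request with the resumptionToken
# returned in each response until the repository reports no further records.
```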
LifeWatch is an e-Science infrastructure facilitating biodiversity and ecosystem research with state-of-the-art ICT services for analysis and modelling in the context of a systems approach to tackling biodiversity complexity. LifeWatch is at the beginning of its construction phase and comprehensive realizations of services have not yet been done. Web services are intended, and a pilot implementation of these is being undertaken within the Biodiversity Virtual e-Laboratory (BioVeL) project. There are six different levels of services within LifeWatch:
- Communication services, with integration, transformation, messaging, encoding and transfer services;
- Information management services, with data access, thematic data access, annotation, identification, discovery, mediation and user management services;
- Processing services, with spatial processing, analytical and modelling, taxonomic processing, visualisation, thematic processing, metadata and integration services;
- Human interaction services, with portrayal, thematic interaction, interaction, personalization and collaboration services;
- Workflow services, with orchestration services; and
- System management services, with security, quality evaluation and provenance services such as monitoring, service management and transaction services.
For more details see the LifeWatch Reference Model documentation.

Once the metadata structure has been defined for EPOS, the search capabilities will be implemented adopting standard interfaces, e.g. OGC web services. The carbon portal of ICOS will allow data search across the different ICOS databases. Data search restricted to geographical areas or time periods, of a similar nature to the one developed for the INSPIRE GEOPORTAL, will be implemented. The CP will also act as a platform to offer access to higher-level data products and fluxes. By incorporating state-of-the-art Information Retrieval methodologies, LifeWatch will enable scientists to query across datasets and discover statistically significant patterns within all available information. This form of Information Retrieval assumes a model of the world based on the available information in the datasets, and applies statistical methods to reveal patterns in this model. LifeWatch recognizes the need for well-defined semantics and uniformity in datasets and stimulates this practice by promoting standards and protocols. At the same time it realizes that this is an ambition only feasible in the long term, and thus needs a pragmatic approach to suit the needs of scientists in the shorter term. Information Retrieval methodologies, already well established in other scientific fields, will supply this pragmatism and enable LifeWatch to build intelligent query interfaces even without structured data. Within EPOS, VERCE will lay the basis for a transformative development in the data exploitation and modelling capabilities of the earthquake and seismology research community in Europe, and will consequently have a significant impact on data mining and analysis in research. LifeWatch services include custom-made toolboxes for various user (research) areas, consisting of applications to be combined into preferred workflows. Such applications will cover sets of related algorithms.

Legal and Policy Issues
Most of the projects follow an open data sharing policy. The vision of EMSO is to allow scientists all over the world to access observatory data following an open access model. Within EPOS, EIDA data and earthquake parameters are generally open and free to use. A few restrictions are applied to a few seismic networks, where access is regulated via email-based authentication/authorization.
Within EPOS, EIDA data and Earthquake parameters are generally open and free to use. Few restrictions are applied on few seismic networks and the access is regulated depending on email based authentication/authorization. The ICOS data will be accessible through a license with full and open access. No particular restriction in the access and eventual use of the data is anticipated, expected the inability to redistribute the data. Acknowledgement of ICOS and traceability of the data will be sought in a specific, way (e.g. DOI of dataset). A large part of relevant data and resources are generated using public funding from national and international sources. LifeWatch is following the appropriate European policies, such as: The European Research Council (ERC) requirement that data and knowledge generated by public money should also become available in the public domain, http://erc.europa.eu/pdf/ScC_Guidelines_Open_Access_revised_Dec07_FINAL.pdf . The European Commission’s open access pilot mandate in 2008, requiring that the published results of European-funded research in certain areas be made openly available. For publications, initiatives such as Dryad instigated by publishers and the Open Access Infrastructure for Research in Europe (OpenAIRE). The private sector may deploy their data in the LifeWatch infrastructure. A special company will be established to manage such commercial contracts. rdaeurope@rd-alliance.org | europe.rd-alliance.org 23 24 Limitations /issues When software code can also regarded as data, there is the unresolved issue of identities, provenance and interoperability of software codes (for example as components in a workflow); Identity of concepts; Streaming data analytics (computation) of parallel real-time data streams; Interoperability between different research infrastructures e.g., how to support integrated data discovery and access of heterogeneous scientific data. Proposed actions (if mature enough) No efforts yet; See example of Global (species) Names Architecture: http://www.globalnames.org/; Considered by the EUDAT project; In ENVRI, OpenSearch technology is used to create a web data portal which allows users to discover and access data residing at the different federated Digital Repositories (DR) sites on the basis of personal search criteria. Space for general remarks We have realized there is an urgent need of developing a Common Reference Model for the community. A Reference Model is a standard and an ontological framework for the description and characterization of computational and storage infrastructures in order to achieve seamless interoperability between the heterogeneous resources of different infrastructures. Such a Reference Model can serve the following purpose: To provide a way for structuring thinking which helps the community to reach a common vision; To provide a common language which can be used to communicate concepts concisely; To help discover existing solutions to common problems; To provide a framework into which different functional components of research infrastructures can be placed, in order to draw comparisons and identify missing functionality. Only by adopting a good reference model can the community secure interoperability between infrastructures, enable reuse, share resources and experiences, and avoid unnecessary duplication of effort. 
Community Data Analysis: Svali Goals of analysis From the interview of the SVALI consortium we hope to learn more about data organization within well standardized and organized Nordic scientific community. rdaeurope@rd-alliance.org | europe.rd-alliance.org 24 25 Analysis provided by CSC. Description of the community Description The SVALI consortium (Stability and Variations of Arctic Land Ice) Initiative (TRI, www.topforskningsinitiatived.org) Top-level Research NCoEs DEFROST experience and access to other RIs. The partners are working with focus on the Arctic and Sub-Arctic. Goals of the community with respect to data TRI NCoE ICCC aims at a joint Nordic contribution in cryospheric studies to solve one of the most important global climate change research challenges. The programme integrates studies on stability of glaciers, atmospheric chemistry and biogeochemistry. The SVALI collaboration is based on three pillars: 1) common analysis, interpretation and reporting of changes in the cryosphere in the North Atlantic area; 2) common platform for graduate studies and postgraduate research work between the main research institutions and universities involved in cryospheric studies based on exchange of students and researchers, a common pool of observational data and a joint programme for organization and for obtaining support for future cryospheric research, and 3) using NCoE ICCC as a vehicle for wider international collaboration within cryosphere research in the Nordic countries. The objective of NCoE ICCC data management is to ensure the security, accessibility and free exchange of relevant data that both support current research and future use of the data. The main goals of the SVALI community with respect to data are to facilitate open access to SVALI results for research within and outside of SVALI, to ensure safe storage and availability of relevant data beyond the SVALI project period and efficient exchange of data between SVALI partners in collaborative research efforts. Data Types SVALI operates with observational and experimental data, i.e. collected by measurement and computational data produced by simulations. The main data types are (1) remote sensing data of various types (SPOT5, Aster, ICESat, Cryosat-2, ERS, Tandem-X, GRACE, ...) as well as data from airborne lidar surveys, (2) Field data such as mass balance measurements, GPS measurements, data from meteorological stations on glaciers, etc., (3) model simulation results (data from Earth system models (EC-Earth), meteorological models (WRF, HIRHAM5), ice flow models (Elmer, PISM) and mass balance models. These data are typically stored in special format that is appropriate for each type, such as various image formats for remote sensing data, netCDF (modelling results), grib (EC-Earth and HIRLAM5), las (point cloud lidar data), etc. Data Flow Remote sensing data are obtained from international space agencies and data centres. This often involves internet data access ports with online application forms for gaining access to the data. Field data are gathered by individual project partners and simulation results are similarly created by individual partners. In some cases, simulation results are created at computer centres (CSC, ECMWF), pre- processed there and preprocessed data are transferred for further rdaeurope@rd-alliance.org | europe.rd-alliance.org 25 26 analysis to the computer system of the respective partner. Simulations (e.g. 
EC-Earth) often create vast amounts of data in temporary storage areas that are deleted after further processing where e.g. monthly averages are computed from data files with higher temporal resolution. The raw model results may be created in specialized format, e.g. grib, and reformatted into more general purpose format, e.g. netCDF, in the preprocessing. After processing, data may be made available to other project partners over the internet or submitted to international data archives for storage or further analysis. Data Organization NCoE ICCC data are those data generated during the duration of the three individual NCoEs within NCoE ICCC (October 2010 – September 2015) through work packages that are organised as part of the individual NCoEs. Data generated by simulation models or gathered in the field measurement campaigns as a SVALI activity, are freely available to all SVALI participants and to other scientists as appropriate on the shortest feasible timescale in accordance with the TRI/ICCC data policy. To facilitate SVALI data storage and sharing, data providers are requested to submit a data description (or metadata) by filling a corresponding template and send it to the dissemination team at GEUS, Denmark. The metadata will be available at the SVALI website (http://ncoe-svali.org/data). Data of international importance collected within SVALI should be stored for longer term than the duration of the SVALI NCoE and will be submitted to appropriate data centers, depending on the nature of the data. Existing data centers, such as the National Snow & Ice Data Center (www.nsidc.org) or the World Glacier Monitoring Service (www.geo.uzh.ch/microsite/wgms) will be used, rather than the SVALI NCoE creating own storage of data. Formal contact has been made with GCW (WMO programme Global Cryosphere Watch) as well. Some SVALI data (not suitable for storage in NSIDC and WGMS) will be uploaded to the forthcoming GCW data portal (http://gcwdemo.met.no) together with appropriate metadata. This will ensure long-term storage of these data beyond the 5-year TRI period. If this does not work out, other options for ensuring long-term storage will be explored before the end of the project period. Data that are primarily of importance within the SVALI project and alpha- and beta- versions of data sets that need to be shared between project partners will be stored on externally accessible servers as appropriate with metadata descriptions available on the SVALI web. Data Exchange Data are mainly exchanged between partners through the internet (ftp, web-based download from internet servers). Some partners and collaborating institutes have or are developing internet data download centres that provide access to data and maintain a log of the downloads (who is downloading, for what purpose, etc.). Some SVALI created simulation data are available as part of international databases for providing access to model simulation results. As mentioned above, it is the policy of SVALI that data are as far as possible to be submitted to international archives where they are openly available for research. Data Services and Tools The SVALI community uses a variety of software tools for storing, exchanging, processing, analyzing, visualizing data. Software and software packages such as grib, netCDF, Matlab, R, python, FERRET, ERDAS, netCDF tools (nco, ncBrowse, netCDF read/write libraries for analysis software) are among the the most important software packages. 
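The preprocessing described above, in which high-temporal-resolution model output is reduced to monthly averages and written to a general-purpose format such as netCDF, can be sketched in a few lines. The snippet below assumes the xarray package; the file and variable names are purely illustrative, and the grib-to-netCDF conversion itself is left to the dedicated tools listed above.

# Sketch of the post-processing step described above: reducing higher-frequency
# model output to monthly means and writing an archive-ready netCDF file.
# Assumes the xarray package; file and variable names are illustrative.
import xarray as xr

def monthly_means(infile, outfile, variables=None):
    ds = xr.open_dataset(infile)               # e.g. 6-hourly model output already in netCDF
    if variables:
        ds = ds[variables]                      # keep only the fields of interest
    monthly = ds.resample(time="1MS").mean()    # calendar-month averages
    monthly.to_netcdf(outfile)                  # general-purpose format for sharing/archiving

# monthly_means("ecearth_6hourly.nc", "ecearth_monthly.nc", variables=["t2m", "pr"])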
Data are mainly exchanged between partners over the internet (ftp, sftp, web-based download). Legal and Policy Issues rdaeurope@rd-alliance.org | europe.rd-alliance.org 26 27 The SVALI NCoE adheres to an Open Data Policy. The TRI/ICCC Data Policy, which applies to the SVALI NCoE as well as to other TRI/ICCC NCoE, is based on the existing “International Polar Year Data Policy”. The aim of the data policy, as for the IPY policy, is to provide a framework for data to be handled in a consistent manner, and to strike a balance between the rights of investigators and the need for widespread access through the free and unrestricted sharing and exchange of both data and metadata. The policy is compatible with the data principles of the Top-level Research Initiative (TRI, “http://www.toppforskningsinitiativet.org/en”). The data policy is reviewed annually by the Steering Group and any updates will be formally signed by the Project Leader to record their formal adoption and for issue controlling. Limitations/issues Some particularly important data archives for NCoE ICCC are the National Snow and Ice Data Center (NSIDC, http://nsidc.org), World Glacier Monitoring Service (www.geo.uzh.ch/microsite/- wgms), archives related to WCRP CliC and the newly established GCW. It must be recognized that data preservation and access should not be afterthoughts and need to be considered while data collection plans are developed. A subset of data both generated and used by NCoE ICCC needs a specialized policy and access considerations, because they are legitimately restricted in some way. Access to these data may for example be restricted because there may be intellectual property issues. It is the overall aim of NCoE ICCC that data are as freely available as possible within the constraints provided by such legitimate restrictions. Proposed actions (if mature enough) Not applicable Space for general remarks None Community Data Analysis: EISCAT 3D Goals of analysis From the interview of the EISCAT_3D research infrastructure we hope to learn on different aspects of the data organization (general, technical, legal) of the project inside specific international scientific community starting its preparatory phase and find out what are the requirements and challenges facing the high attitude project in the field of data management. Analysis provided by CSC. Description of the community Description EISCAT_3D (European 3D Imaging radar for atmospheric and geospace research, E3D) will be a world-leading international research infrastructure using the incoherent scatter technique to study the atmosphere in the Fenno-Scandinavian Arctic and to investigate how the Earth's atmosphere is coupled to space. E3D is led by Swedish EISCAT Scientific rdaeurope@rd-alliance.org | europe.rd-alliance.org 27 28 Association. Current participants involved are EISCAT partners [China, Finland, Japan, Norway, Sweden, and UK], associated partners [France, Russia, Ukraine] and EISCAT user communities. Goals of the community with respect to data EISCAT_3D is designed for continuous operation, capable of imaging an extended spatial area over northern Scandinavia with multiple beams, interferometric capabilities for small-scale imaging and with real-time access to the extensive data. The goal of the community is continuous measurement of the space environment – atmosphere coupling at the southern edges of the polar vortex and the aurora oval. 
E3D will be a key facility for various researches and operational areas including environmental monitoring, space plasma physics, solar system science and space situational awareness. In addition, EISCAT_3D will provide a platform to develop new applications in radar technology, experiment design and data analysis. Data Types EISCAT operates initially with raw observational data. The following data types are currently in use: EISCAT raw data (Matlab binary); analysed data (Matlab binary, ASCII, HDF5, etc.); CDF format for metadata and analysed data; KMZ format for analysed data; ps, pdf, png for summary plots. EISCAT-3D will operate with a huge volume of data. The structure of these data is considered and metadata is expected to be in accordance with standards. Data Flow The EISCAT_3D facilities will comprise one core site and at least four distant sites equipped with antenna arrays, supporting instruments, platforms for movable equipment and high data rate internet connections. The key part of the core site is a phased-array transmit/receive (TX/RX) system consisting of roughly 10,000 – 16,000 elements and other state-of-the-art signal processing and beam-forming instruments. Each antenna produces 2 x 32bit/sample x 30 Msamples/s (= 2Gbit/s). At 25% duty cycle this is 5 Tbit/day and the data rate of a 16,000 antenna array would be about 80 Pbit/day. Antenna group computes a number of beams from a small number of antennas. The antenna group of about 100 antennas forms a limited number of polarized beams at a selected limited bandwidth I/O. These beam- formed data are stored in a ring buffer for a relatively long duration (hours to days). As the full arrayproduces a raw data rate which is too large to be archived, one has to limit the archived data rates so that e.g. 160 antenna groups form 100 beams with total maximum of 20 Gbit/s data to be stored in archive. At least one set of time-integrated correlated data will be calculated from each set of beam-formed data, and permanently stored in a Web accessible master archive. One or several analyzed data sets will be permanently stored corresponding to each set of correlated data. Next, for further offline work these data shall be transferred from on-site archives to HPC. All data should exist at least at two independent sites, archive and datawarehouse, so that archive and data warehouse works even if one site is offline Well- functioning networks with minimum of 10Gbit/s are required and the starting archiving rate would be of the order of 50PB/year. Data Organization Currently, EISCAT archive is small and about 60TB. This is because EISCAT was not archiving sampled raw data from the start of operations in year 1981. Instead, what often is called EISCAT raw data is in fact correlated data samples, the so called auto-correlation function estimates, which are organized and stored for further analysis as lag-profiles, altitude profiles of different time lags of the auto-correlation function of the signal. The final analyzed EISCAT rdaeurope@rd-alliance.org | europe.rd-alliance.org 28 29 data, physical ionospheric and atmospheric parameters, such as e.g. Electron density and temperature as function of time and altitude etc., are available and organized in the Madrigal DB. Madrigal is an upper atmospheric science DB, used by research groups world-wide and originally specially designed for incoherent scatter radar (ISR) data. 
Madrigal is a distributed DB, where data at each Madrigal site is locally controlled and accumulated, but shared metadata between Madrigal sites allow searching of all Madrigal sites at once. Madrigal is built so that it also can handle future EISCAT_3D data, as it is already handling data from the US and Canadian phased-array radars, PFISR (Poker Flat ISR) and RISR (Resolute Bay ISR). However, further development of volumetric data products is also envisaged. This development will be of benefit for the whole global ISR community. Data Exchange The future system will generate very large volumes of data. An efficient archive and data warehouse shall be deployed during the construction phase using existing e-infrastructures in Northern Scandinavia and synergies with resource centers. At least two independent sites shall support all data to be accessible, backed up and secured. The expected stored data volume in the initial phase of operation is of the order of more than 1000 TB per year. Data Services and Tools Currently, the EISCAT community uses: Madrigal DB, IUGONET Metadata DB, UDAs – IUGONET Data Analysis SW, Dagik visualization tool, CEF (Conjunction Event Finder) web tool for seamlessly browsing quick-look data. Users may also do re-analysis of the data. The basic analysis software GUISDAP is available for platforms running Matlab and there is also a web interface to use GUISDAP if simplified analysis control is all that is needed, which is often the case for standard radar experiments. Future E3D tools are under consideration. In the Preparatory Phase of EISCAT_3D, one is concentrating the development to solutions for the analysis of the data, which would use the latest ISR data analysis strategies. Algorithms and SW modules are being developed for parallelization of the computational tasks both in SW-based beam forming and analysis of multi-beam data and imaging applications. Resources for development of high-level end user data services and visualization tools are not included in the Preparatory Phase. Such work is assumed to be the task of the user communities and their networking efforts with communities who use similarly structured scientific data. Legal and Policy Issues EISCAT has established data policies and procedures for user access. Those will be adapted to the new system keeping in mind the importance that the project places on attracting new users. The new products should target the following four groups of users: 1) experienced EISCAT users; 2) new users attracted by the enhanced conventional capabilities and/or of the new E3D capabilities; and 3) environmental and space weather modelers and service providers. Occasional users interested in E3D for short-duration research projects or as source of supporting data. Data are open and shared internationally by corresponding national research organizations with 1 year exclusive right.Long storage of data is required, both because the nature of the geophysical data as a record of space weather and solar- terrestrial relations with minimum time-scales covering several solar cycles (each 11 years) is relevant to climate change research, as well as due to the fact that re-analysis of raw data would bring in possibly new interpretations, improve results and reveal unforeseen natural processes. Minimum preservation time of data would be 30-40 years. 
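The raw data-rate figures quoted in the Data Flow subsection above can be checked with a short back-of-the-envelope calculation. Note that the quoted daily totals match the computed values when read as bytes rather than bits; that reading is our assumption, not a statement from the EISCAT_3D documentation.

# Back-of-the-envelope check of the per-antenna data rates quoted in the Data
# Flow subsection above. Interpreting the quoted daily totals as bytes is an
# assumption made here, not a figure taken from the project documents.
BITS_PER_SAMPLE = 2 * 32       # two 32-bit values per sample
SAMPLE_RATE = 30e6             # 30 Msamples/s per antenna
DUTY_CYCLE = 0.25              # 25 % duty cycle
SECONDS_PER_DAY = 86400
ARRAY_SIZE = 16_000            # upper end of the quoted element count

rate_bps = BITS_PER_SAMPLE * SAMPLE_RATE                  # ~1.92e9 bit/s, i.e. ~2 Gbit/s
daily_bytes = rate_bps * DUTY_CYCLE * SECONDS_PER_DAY / 8
array_daily_bytes = daily_bytes * ARRAY_SIZE

print(f"per antenna: {rate_bps / 1e9:.2f} Gbit/s, {daily_bytes / 1e12:.1f} TB/day")
print(f"{ARRAY_SIZE}-antenna array: {array_daily_bytes / 1e15:.0f} PB/day")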
rdaeurope@rd-alliance.org | europe.rd-alliance.org 29 30 The proposed new EISCAT agreement, which would cover EISCAT_3D, considers 3 types of participation and financial contributions to the research infrastructure. All these have different implications to data policy. The in–kind Core Science investment is fully relevant to EISCAT core science and in line with the scientific and strategic priorities. Operational costs are then fully met by EISCAT. Open access to data and compliance with the EISCAT data and user access policy is required for such contributions. In-kind Mix of Core and Non-Core Science investment is partly relevant to EISCAT core science, but not fully in line with its scientific and strategic priorities. Operational costs are divided between EISCAT and contributing Associate or Affiliate in proportion to contribution to EISCAT core science and strategy. Open access to data is required in this case, too. Hosting contributions are using an EISCAT site and infrastructure but there is no value to EISCAT core science. Operational costs are then fully met by the contributing Associate, Affiliate or 3rd party. Open access to data is encouraged but not required. Limitations/issues Towards its 1st stage by 2018, EISCAT_3D needs moderate level of archive size about 1PB/year and more HPC capacity (1 Pflop/run) and storage 1 PB. By the next stage in 2023, EISCAT archive is expected to be about 50 PB/year, and HPC performance will up to 1000 Pflop/run. The key issue is transferring data from sites for processing, and fast internet connection is strictly required. Proposed actions (if mature enough) The E3D is now in the Preparatory Phase. Its aims to ensure that E3D project will reach a sufficient level of maturity with respect to technical, legal and financial issues so that the construction of the next generation E3D radar system can begin immediately after the conclusion of the phase. A new EISCAT agreement will be finalized in November 2013. The E3D preliminary design review is planned for October 2014. The EISCAT_3D Preparatory Phase currently works under 14 work packages that cover different aspects of the advanced infrastructure. Taking into accounts E3D needs in HPC and storage capacity, collaboration with EUDAT could be considered. The present EISCAT is already fully integrated in the global network of incoherent scatter radars. E3D is an environmental RI on the EU ESFRI roadmap. Space for general remarks The E3D preparatory phase (October 2010-September 2014) is funded by EC under the call FP7- INFRASTRUCTURES-2010-1 “Construction of new infrastructures: providing catalytic and leveraging support for the construction of new research infrastructures”. The EISCAT implementation time line (2014-2021) incorporates a smooth transition from preparation to implementation in 2014, provided that sufficient funds are allocated, subsequently construction in 2016, and the first operation in 2018. Community Data Analysis: Public Sector Information ENGAGE Goals of analysis Open Public Sector Information (PSI) can be anything varying from election results, statistics from population, unemployment, earnings and transportation to fire incidents, criminal records and illegal immigrants. As the PSI files are mostly unstructured and in non-machine rdaeurope@rd-alliance.org | europe.rd-alliance.org 30 31 processable formats(e.g. 
pdf), it is interesting to see how the ENGAGE curation community works and what tools it uses to make them more structured and what kind of metadata approach is needed to treat such data. Analysis provided by ATHENA. Description of the community The ENGAGE community consists of researchers, innovators, government employees, software developers, media journalists and citizens who create, improve, use, analyze, combine, report on, visualize open data. Goals of the community The goal of the ENGAGE community is to make Public Sector Information (PSI) data openly available with data curation and cleaning facilities, improved metadata, appropriate analysis/visualization tools and portal access. Context: Current practices - achievements and limitations: The ENGAGE platform currently links to original PSI data and derived / curated datasets created, maintained and extended by users (researchers, citizens, journalists, computer specialists) in a collaborative environment. Therefore the ENGAGE is a research / data curation community platform with focus on the Social Sciences and Humanities domain. The vision of the ENGAGE infrastructure is to extract, highlight and enhance the re-use value of PSI data. This can be achieved by moving slowly from low-structured, isolated, difficult to find PSI data to high-structured, easy to link, easy to process datasets through crowd-sourcing. Data types Open data covers almost all research disciplines but currently available datasets are mainly in social sciences, meteorology and transport. Right now the majority are in pdf format with little or no metadata. Next comes .csv or .xls files again with little or no metadata. Around 4% has CKAN or DC as metadata and stored in tables (including Excel) or RDF triples. The rest are mainly in tabular format (relational) although commonly as files rather than databases. Data flow Datasets metadata are harvested or uploaded to the ENGAGE platform where the metadata is enhanced (some automation, dominantly manual) and made openly available (subject to any rights management / licensing restrictions). Data organization ENGAGE stores metadata using the CERIF (Common European Research Information Format – an EU Recommendation to Member States). In addition the ENGAGE platform provides a single point of access to Public Sector Information. Users are able to extend / revise these datasets with a description and type of the extension (e.g. Conversion to other format, Data Enrichment, Metadata enrichment, Snapshots of real-time data, datasets mashups). Users are able to track the entire history of the extensions up to the original dataset through a graph-based diagram of the revisions. rdaeurope@rd-alliance.org | europe.rd-alliance.org 31 32 Data exchange The ENGAGE portal provides metadata and pointers to the open dataset(s). It therefore facilitates interoperation and dataset co-use / mashup. A user is able to upload a new dataset or extend an existing one. The user is allowed to set a maintaining group for this dataset and thus giving managing rights to all the members of this particular group. Data services and Tools The tools cover data upload/download, metadata improvement, data cleaning, analysis and visualization. Also data community/social networking tools. 
In detail the ENGAGE platform supports the following tools: Browse / search for datasets through faceted search Upload / Bulk Upload datasets Download datasets Request datasets Extend / Revise Datasets Visualise Datasets (Core visualization for tabular datasets - Creating chart based visualizations, Creating map based visualizations, integrating visualizations from external engines – e.g. Many Eyes) ENGAGE plug-in for Open Refine Clustering Analysis (K-means clustering) Manage / share related items (Publications, Web applications, APIs related to the dataset) SPARQL endpoint Restful ENGAGE API (JSON format) ENGAGE Wiki Dataset rating and commenting system Legal and Policy Issues ENGAGE strives for a common CC-BY license but records other licensing regimes. Limitations /issues The only limitations are the availability of open data and – particularly – the poor quality of existing metadata. rdaeurope@rd-alliance.org | europe.rd-alliance.org 32 33 Proposed actions (if mature enough) We are looking at automated metadata improvement while providing facilities for human metadata improvement. Community Data Analysis: ESPAS Goals of analysis As the ESPAS community receives data from a variety of sources of different formats and with different practices, it is interesting to observe how such data is treated and how a common denominator may be achieved. Analysis provided by ATHENA. Description of the community Description The ESPAS community consists of researchers in the area of near earth space, i.e. the upper layers of the atmosphere (ionosphere, lower magnetosphere). Goals of the community ESPAS aims at building the data e-Infrastructure necessary to support the access to observations, the modeling and the prediction of the near-Earth space environment (extending from the Earth's atmosphere up to the outer radiation belts). Data types ESPAS data are outputted from a wide range of available instruments to monitor the nearEarth space: both ground-based ones (such as coherent scatter radar, incoherent scatter radar, GNSS receivers, beacon, ionosondes, oblique sounding, magnetometers, riometers, Neutron monitor, Fabry-Perot interferometers) and space-based ones (ELF/VLF wave experiments, radio spectrometers, Langmuir probes, high energy particle spectrometer, electric and magnetic sensors, energy particle sensors, radio occultation experiments, radio plasma imagers, coronal imagers and EUV radiometers, coronographs). Moreover, there is data derived from models, such as the Physics-based plasmaspheric kinetic model, EDAM and CMAT2. The ESPAS data files outputted from the instruments/models are either numerical data or images that describe the observations. There are various file formats available for each category of data (numerical, images) that are supported from ESPAS data providers. So, for the numerical data the available file types are the following: text files (general, CEF, SAO, SAO.XML, DVL and RINEX) and binary files (CDF, GDF, HDF5, MATLAB (.mat), netCDF, PDF, RDF, RSF and SBF). For image files the following file formats are available: FITS, GIF, JPG, PDF and PNG. The variety of the file formats corresponds to the large variety of the origins of the data (data from different instruments, different experiments and different organizations measuring different physical properties) and it is taken into consideration for the development of ESPAS system. 
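Given the breadth of numerical and image formats listed above, a client of such a system typically dispatches on the file type before any common processing can take place. The sketch below is purely illustrative: the suffix-to-reader mapping and the trivial readers are assumptions made for the example and are not part of the ESPAS design, where dedicated libraries would handle formats such as CDF, HDF5 or FITS.

# Illustrative dispatch of heterogeneous data files to format-specific readers.
# The mapping and the trivial readers are assumptions for the example only;
# real readers would come from dedicated libraries (h5py, astropy, cdflib, ...).
from pathlib import Path

def read_text(path: Path):
    return path.read_text(errors="replace")

def read_binary(path: Path):
    return path.read_bytes()

READERS = {
    ".txt": read_text, ".cef": read_text, ".sao": read_text, ".rnx": read_text,
    ".cdf": read_binary, ".h5": read_binary, ".nc": read_binary, ".mat": read_binary,
    ".fits": read_binary, ".gif": read_binary, ".jpg": read_binary, ".png": read_binary,
}

def load(path_str):
    path = Path(path_str)
    reader = READERS.get(path.suffix.lower())
    if reader is None:
        raise ValueError(f"no reader registered for suffix {path.suffix!r}")
    return reader(path)

# load("ionosonde_profile.sao")   # dispatches to the plain-text reader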
Data flow rdaeurope@rd-alliance.org | europe.rd-alliance.org 33 34 Datasets metadata are harvested from the data providers via the ESPAS platform where the metadata is enhanced and made openly available (subject to any rights management / licensing restrictions). The datasets themselves will be harvested via the ESPAS platform only when needed for the computation of algorithms and models (in the general case only metadata is harvested and not the actual dataset). Data organization ESPAS data providers do not have a common approach to metadata creation, data organization and open access to it. ESPAS has defined a common metadata format based on OGC (Open Geospatial Consortium) observation and measurements standard, http://www.opengeospatial.org/standards/om). Moreover, ESPAS has defined and maintains vocabularies (ontologies) for scientific terms used in the metadata (e.g. http://ontology.espasfp7.eu/observedProperty). Data exchange ESPAS has decided the use of OGC Catalogue Service API as a means to exchange metadata between the ESPAS platform and the data providers, and also between the external users and the ESPAS platform. OGC Sensor Service is also used for the exchange of actual data. Data services and Tools The tools cover data upload/download, metadata improvement, data cleaning, analysis, algorithm execution and visualization. Legal and Policy Issues Each data provider has a different data access policy (from open access to closed data). ESPAS tries to handle all different cases and also strives to adopt a more open approach. Limitations /issues The only limitations are the availability of open data and the wide variety of data formats. Proposed actions (if mature enough) In order to homogenize the file formats, the data providers are encouraged to implement the OGC Sensor Service API that defines a common data format. Math Community: combinatorics On-line databases : from L-functions to Edinburgh, 21-25 January 2013 Françoise GENOVA, Final version, 13 March 2013 The workshop The workshop1 sponsored by the American Institute of Mathematics, the International Centre for Mathematical Science (Edinburgh), and the National Science Foundation, was devoted to 1 http://www.aimath.org/ARCC/workshops/onlinedata.html rdaeurope@rd-alliance.org | europe.rd-alliance.org 34 35 the development of new software tools for handling mathematical databases. These tools will assist mathematicians in the integration, display, distribution, maintenance and investigation of mathematical data, particularly in the context of the computer algebra system Sage. F. Genova was invited to present data centres in astronomy and their development, in particular the International Virtual Observatory. This note is based on information gathered during the workshop and on discussions with the participants. Context The starting point is the Sage free open-source mathematics software system 2, a community effort which aims at creating a viable free open source alternative to the commonly used software Magma, Maple, Mathematica and Matlab. “Sage is built out of nearly 100 open-source packages and features a unified interface. Sage can be used to study elementary and advanced, pure and applied mathematics. This includes a huge range of mathematics, including basic algebra, calculus, elementary to very advanced number theory, cryptography, numerical computation, commutative algebra, group theory, combinatorics, graph theory, exact linear algebra and much more. 
It combines various software packages and seamlessly integrates their functionality into a common experience. It is well-suited for education and research. ” There are currently 250 contributors in 170 different places from all around the world3. Sage is a large ecosystem, and the workshop gathered two communities engaged in addedvalue efforts on two different but nearby topics, number theory and combinatorics, looking for possible convergence of their efforts: The LMFDB project, which aims to gather data in number theory, as relevant to the study of L-functions The Sage-combinat project and related projects, concerned with combinatorial problems As it will appear in the following, LMFDB and Sage-combinat are both strongly related to Sage but with different types of relations. LMFDB, the L-functions and modular form database “LMFDB4 is the database of L-functions5, modular forms, and related objects. It intends to be a modern handbook including tables, formulas, links, and references for L-functions http://www.sagemath.org/ http://www.sagemath.org/development-map.html 4 http://www.lmfdb.org/ 5 Wikipedia definition of L-Functions : “In mathematics, an L-function is a meromorphic function on the complex plane, associated to one out of several categories of mathematical objects. An L-series is a power series, usually convergent on a half-plane, that may give rise to an L-function via analytic continuation. ”. LMFDM definition : “By an L-function, we generally mean a Dirichlet series with a functional equation and an Euler product.”which can be explored from http://www.lmfdb.org/knowledge/show/lfunction.definition. 2 3 rdaeurope@rd-alliance.org | europe.rd-alliance.org 35 36 and their underlying objects.” It is thus a handbook for a special class of functions, with lots of connections to basic analysis and other related objects 6 . The information is very well structured, and the database has rich searching functionalities. It implements elements coming from Sage, but also other information. It aims at being as complete as possible, and its implementation has triggered research to fill gaps. It includes an “Encyclopedia” technically based on knowls 7 , which dynamically includes relevant, supplementary information in web pages. Each sub-domain is coordinated by a specialist who is a member of the editorial committee. The target audience is other number theorists and students, with the hope to attract other people to the subject. More than 100 institutions around the world are potentially interested in the topic. One question for the workshop was to assess whether this model can be exported to other parts of maths. Sage-combinat “Sage-combinat 8 is a software project whose mission is to improve the open source mathematical system Sage as an extensible toolbox for computer exploration of (algebraic) combinatorics, and foster code sharing between researchers in this area. In practice, Sage-combinat is a collection of experimental patches (i.e. extensions) on top of Sage, developed by a community of researchers. The intent is that most of those patches get eventually integrated into Sage as soon as they are mature enough, with a typical short lifecycle of a few weeks. In other words: just install Sage, and you will benefit from all the Sagecombinat development, except for the latest bleeding edge features. ” Sage-combinat has around 30 contributors. Sharing software and data in maths A fraction of the community represented in the workshop considers that Sage fulfils its needs. 
Others want to develop added-value services and databases, aggregating information from Sage and eventually additional information. Facilitating the usage of Sage is one goal, and the construction of a “Sage Explorer” using the semantic information in Sage to allow exploration of Sage objects and connection between them was prototyped. The workshop concluded with suggestions for future projects, one being a major conceptual evolution for LMFDB: that all the code handling mathematical calculations should be implemented in Sage, and just called by LMFDB. This could be a path for development of other added-value services in other domains. The workshop was based on a “hands-on” approach, with topics for discussions selected each day, discussions between the interested participants and rapid construction of prototypes to assess ideas and feasibility. General objectives are well understood and possible implementations are assessed in this bottom-up way, which seems to correspond to the disciplinary culture. These projects demonstrate an excellent expertise in building and managing shared software collections at community level. 6 7 8 http://www.lmfdb.org/bigpicture http://www.aimath.org/knowlepedia/ http://wiki.sagemath.org/combinat rdaeurope@rd-alliance.org | europe.rd-alliance.org 36 37 Community Data Analysis: Huma-Num Goal of analysis The goal of the analysis is to present the case of a research community relatively new to “digital science”, in a domain, humanities and social sciences, which has several ESFRI programmes in the data management and dissemination field. Huma-Num is the French national infrastructure and project in the domain, which has its own goals and methods, and which also acts in support to the future ERIC DARIAH and ESFRI CLARIN project (by the way of Aix-Marseille University). Analysis provided by CNRS. Description of the community Huma-Num (http://www.huma-num.fr) is a “Very Large Research Infrastructure” (Très Grande Infrastructure de Recherche or TGIR) labelled by the French Ministry of Higher Education and Research. It was created in March 2013 by the merging of two TGIRs in the domain of social and human sciences, ADONIS (2007) and CORPUS (2011). Its aim is to facilitate the switch to numerical science for the social and human science communities. Human-Num is managed by the CNRS in association with Aix-Marseille University and Campus Condorcet, and the target community is the research and teaching community from CNRS, plus University teams, in the domain. The activities include (1) creation of and support to communities organised thematically (“consortia”) for their adaptation to data conservation and dissemination, and dissemination of technologies and methods so that they can become actors (data and tool providers) in the process; (2) provision of massive storage and long term archiving, and of data dissemination and browsing capacities. Context The humanities and social science community is a newcomer in the domain. It produces very significant data volumes which obliges it to re-think its methods and to implement a new kind of data management with a data life cycle including re-use and medium/long term conservation. Research across the sub-discipline borders is also an incentive. Data types Data types addressed by Huma-Num are very diverse, and include modern and ancient texts, fixed and animated images, audio and movie pictures data, very large data series from surveys, 3D data obtained from in situ numerical captors or reconstructed, etc. 
Data formats One aim is to unify data formats using widely used formats, e.g. XML/TEI for texts, MPEG4, MATROSKA for motion picture, XML/EAD and XML/METS as envelope of complex data, etc. Data are annotated (in XML/TEI format for texts for example) by the data producers or by the researchers which edit text or image corpuses. rdaeurope@rd-alliance.org | europe.rd-alliance.org 37 38 Data life cycle The data life cycle follows the OAIS model for the long term preservation section. Data organization Research produces corpuses, which are documents organised into a set which has a scientific meaning. A corpus can be a researcher’s (including its archival materials), or from a laboratory, field campaign or science & culture heritage project, a survey, etc. Dissemination can be through a platform specific to a community or a discipline, or through the ISIDORE platform developed by Human-Num, which provides a global unified access to data with harvesting, enrichment, links between data, data browsing and APIs and SPARQL endpoint. Data exchange, data services and tools Data Exchange is through APIs provided by the specific community or through the general ISIDORE service. One aim is to expose in the linked open data (with RDF) data and metadata description with scientific communities ontologies. For this, ISIDORE service proposes an assembly line to annotate, turn in RDF, enrich with linked data URIs (dbpedia, Geonames, VIAF, etc.) and expose results in an SPARQL endpoint (http://www.rechercheisidore.fr/sparql). Legal and policy issues Huma-Num advocates open access, including for software. When communities enrich raw data the recommended licence is the Etalab Licence Ouverte/Open Licence (http://www.etalab.gouv.fr/pages/Licence_ouverte_Open_licence-5899923.html), which is compatible with the Creative Commons licences. Challenges Among the challenges is the massive scale of the task since the aim is a change in paradigm in the way this very diverse community deals with its data productions. One important goal is that long term data aspects are systematically taken into account before data production begins. Human-num is working at increasing community awareness but then time and efforts will be needed before adoption by the whole scientific community in the humanities and social sciences. The adoption level currently is very uneven between the different sub-disciplines. Everywhere there are individuals which care about data, but it is not the case for all subdisciplines as communities. One can hope in a kind of “snowball effect” with the most advanced disciplines and individuals progressively motivating their neighbours. Community Data Analysis: INAF centre for Astronomical Archives Goals of analysis The goal is to analyze activities of a medium sized astronomical observatory, Trieste Observatory, which has responsibilities relevant to scientific data at the national and international level. Analysis provided by CNRS. Description of the community rdaeurope@rd-alliance.org | europe.rd-alliance.org 38 39 The Osservatorio Astronomico di Trieste (OAT) performs data management tasks in addition to research in astronomy. It hosts the INAF centre for Astronomical Archives (IA2). IA2 manages Italian data from ground based telescopes, in particular the Telescopio Nazionale Galileo and the Large Binocular Telescope. 
The Telescopio Nazionale Galileo is an Italian telescope based in the Canary Islands, which hosts among its instruments an international one, HARPS-N (High Accuracy Radial Velocity Planet Searcher). The Large Binocular Telescope, based in Arizona, is an international collaboration gathering agencies and laboratories from Italy, the USA and Germany. IA2 also hosts data from research teams and individuals. OAT hosts VObs.it, the Italian Virtual Observatory project.
Goals of the community
Host data, provide data to the astronomical community and work on interoperability standards and tools.
Context: Current practices - achievements and limitations:
IA2 hosts raw data provided by the telescopes and some reduced data. It is in charge of the data pipeline which produces the reduced data of the HARPS-N instrument. It keeps the data archives and distributes public data. Each telescope defines its own data policy. The data is in general made public after one year, and metadata are immediately public in all cases. IA2 also hosts private data from research teams and individuals.
Data types
Images, spectra, radio data, catalogues. FITS (the reference format for astronomical data) is the standard.
Data organization and exchange
Data is accessible through web interfaces and through the IVOA protocols. All public data is made accessible in the astronomical Virtual Observatory.
Data services and Tools
Public data is available through the VO-enabled tools and through web interfaces. IA2 has developed the VODanse system to build VO-enabled databases. The system is used internally and there are plans to release it. Data ingestion is through the home-made NADIR data handling system, which allows management and ingestion at high data rates over several geographic sites. The ingestion part of NADIR is based on the OAIS standard.
Legal and Policy Issues
The Italian policy is that data from a telescope is made publicly available one year after observation. The private period can be extended on demand for programmes which last for a long time.
Limitations /issues
There are no difficulties with the telescopes, since an MoU defining their relations with IA2 has been signed. In some cases data provided by teams or individuals are not well formatted. A minimum standard (FITS data files and VO keywords) is required.
Community Data Analysis: ML-group
Data Management in an AI group
Example: ML-group, Intelligent Systems, iCIS, RU (Tom Heskens, Josef Urban)
Analysis provided by MPI-PL
General Data Flow
The ML group does not collect any data of its own. It works together with other groups that do collect data, such as the Donders Institute, bioinformatics groups at the university itself, and many others. Typically very little is done to the data. After receiving the data it is cleaned up and stored on disk. Experiments are run on the data on disk. In general nothing further is done to the data and no effort at long-term preservation is made. The only computation that is done on the data (outside of the experiments) is cleanup; it is discussed below. In general the group deals with the technology side of the research, not the data side.
Computational Steps
The data as it comes in generally needs to be cleaned up. The extent of the cleanup depends on the type and quality of the data. Typically unit conversion (e.g. degrees Fahrenheit to Celsius) and the removal of clearly erroneous measurements are performed. Some data is already partially cleaned up; MRI data often is.
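To make the cleanup step concrete, the snippet below shows the kind of unit conversion and crude plausibility filtering described above, using pandas. The column names and the accepted value range are assumptions chosen for the illustration, not details of the group's actual pipeline.

# Illustrative cleanup of incoming tabular data: unit conversion and removal of
# clearly erroneous measurements. Column names and the plausibility range are
# assumptions made for this example.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["temp_c"] = (out["temp_f"] - 32.0) * 5.0 / 9.0    # Fahrenheit -> Celsius
    out = out[out["temp_c"].between(-90.0, 60.0)]          # drop implausible readings
    return out.drop(columns=["temp_f"]).reset_index(drop=True)

# cleaned = clean(pd.read_csv("incoming_measurements.csv"))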
Example Projects
The exception to the rule is the project on mathematical proofs led by Josef Urban. The project is developing a system to aid in the creation of mathematical proofs. That is, given a (hard) mathematical theorem, the system tries to find a proof or an inconsistency (given as a counterexample). Computationally this is extremely difficult. The novel approach of the system is that it is non-exhaustive and tells the user when it is too weak to find the proof or disproof. The uniqueness of the system lies in three aspects: 1) it does not do a complete calculus; 2) it tries to learn techniques from pre-existing proofs found in the libraries; and 3) it limits the search space. This system needs data: as input it takes three pre-existing datasets and uses them as a base. This comes with two major problems: translation and inconsistencies. First, these mathematical datasets (attempts to encode all mathematical knowledge in a computer-readable way) are written in different formalisms, such as type theory, higher-order logic, or set theory. In the project this is all stored in first-order logic. Second, these datasets are not wholly consistent with each other or even internally. Each dataset has a core (kernel) part, which is consistent. The inconsistencies mainly arise in more complex theorems and definitions. The use of such proof systems lies in their ability to formally check the correctness of a given theorem given the base knowledge. This has many applications, from chip design to the verification of software systems.
Data Reuse
Data is reused as much as possible, also between projects, so as to maximize the number of publications.
Community Data Analysis: Computational Linguistics Group
Example: CL group of Antal van den Bosch @ RU
Analysis provided by MPI-PL
General Data Flow
The CL group, part of the Language and Speech Technology PI group of the Centre for Language Studies (Faculty of Arts, Radboud University), bases most of its research on textual corpora, experimental data from others, and web crawls (e.g. Twitter, Wikipedia, etc.). The data is lightly pre-processed: usually only tokenization and conversion to a standardized format, FoLiA (an XML-based annotation format suitable for the representation of linguistically annotated language resources; see http://proycon.github.io/folia/), are done. The LDC (the Linguistic Data Consortium, which creates and distributes a large number of language resources) is a major source for the corpora. The group has a corpora store, in which the processed data should end up. This is an informal and not enforced policy; in practice most corpora end up on it relatively quickly. Most data crawled from the internet is easily stripped of HTML and so forth. Data from corpora and other sources comes in a variety of data formats, such as TIE, IMDI, CMDI, Alpino, plain text, etc., and has to be converted. A library of converters is maintained for this purpose.
(Figure: typical dataflow in the CL group. Crawler output, existing corpora, experimental data and other sources are tokenized and converted to FoLiA, placed in the corpus store, and from there used in experiments.)
Computational Steps
In collaboration with the CL group at Tilburg University, the group maintains a Dutch-language computational suite, called Frog, for standard computational linguistic analysis steps. For their own processing this suite is most frequently used for tokenization and packaging (into the FoLiA format). As needed for experiments, this suite is used to add more annotations as well.
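The next paragraph notes that the word forms are typically used in a sliding window, where each focus word is paired with a fixed number of left and right context words, padded at the sentence edges. The generic sketch below only illustrates that representation; it is not the group's actual feature extractor.

# Generic sliding-window feature extraction: pair each focus token with its
# surrounding context tokens, padding at the edges. Illustration only.
def windows(tokens, left=2, right=2, pad="_"):
    padded = [pad] * left + list(tokens) + [pad] * right
    examples = []
    for i, focus in enumerate(tokens):
        j = i + left                                       # focus position in the padded list
        context = padded[j - left:j] + padded[j + 1:j + 1 + right]
        examples.append((focus, context))
    return examples

# windows(["de", "kat", "zat"], left=1, right=1)
# -> [('de', ['_', 'kat']), ('kat', ['de', 'zat']), ('zat', ['kat', '_'])]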
In general machine-learning-based modules that have been trained in advance on annotated corpora and lexica perform the computational steps. A master process combines all module outputs and casts them into FoLiA. Typically the word forms are used in a sliding window 11. Example Projects In Adnext (part of the COMMIT programme) the group works together with a major news service (ANP) to recognize patterns in text (ANP and Twitter) in order to predict events. Data from the ANP news service is combined with data from Twitter to see what people write around events. Events are defined by the fact that the news service wrote about them. Harvested patterns are used to try and predict events from current Twitter data. Data Management As mentioned above processed data is placed in a central store to have a “fixed” version to refer to and use. This store consists of a file system and is managed by hand. In the sliding window approach the focus slides over all available words oneby-one, for each focus word also taking into account some immediate context words. 11 rdaeurope@rd-alliance.org | europe.rd-alliance.org 42 43 Provenance information of data is managed in an ad-hoc manner. I.e. every researcher in the group has his or her own way to handle provenance. This ranges from keeping detailed logs of all operations and experiments to less structured methods. Data Reuse A large amount of data reuse occurs and is encouraged. In CL Data reuse makes experimental results comparable between systems. I.e., if two systems for the same task are tested on the same datasets you can directly compare the results from the publications. Failing that one has to run one’s own tests. The group find that publishing their data, results and programs on the web (e.g. Github for codebases) results in more cooperation, more citations, and also use. Community Data Analysis: CLST CLST group at RU General Data Flow CLST (Centre for Language and Speech Technology) deals with both text and audio data. Partially they make their own recordings; partially these are done via web applications, such as in the case of Wikispeech. In general data is stored at the university and often it is published as a corpus via LDC. Computational Steps The computational steps taken depend on the data and the goal that is had for the data. In general a non-computational step done on most data is that of transcription. Sometimes transcription can be done automatically, for example when prompt files (for teleprompters) are available. Phonetic transcription can sometimes be done automatically, depending on the type and the quality of the audio. Metadata is added as a computational step and also edited by hand. The resulting data, that is transcriptions and metadata, are used to build combinations databases. Concurrently with the databases the audio recordings are stored as files. These databases are then used to process queries on the data and to select specific intervals of audio for further analysis. Example Project An example project for the CLST is the Oral History Annotation Tool. With this tool it is possible to search and annotate historical audio dealing with the second world war, Dutch Indies, and New Guinea. These data is curated at DANS. Data Management There is a corpora disk for internal use. Finished projects are managed externally in the form of corpora. Usually at the LDC, TST or at the project partner. All data is made public, via corpora or via the web. 
Internally the data is managed by the system administration. That means there is a backup, but nothing specific to the type of data.
Data Reuse
Reuse is extremely common, but dependent on the dataset; for instance CGN (the Corpus of Spoken Dutch) is reused extremely often. As mentioned above, the data is also made available publicly. This is advertised on the CLARIN infrastructure as well.
Community Data Analysis: Donders Institute
Analysis provided by MPI-PL
General Data Flow
A large amount of the data in the institute is self-generated for a specific research question. This data comes from several sources, most notably MRI, MEG, EEG, NIRS, behavioural studies, and physiological measurements (eye-tracking, ECG, EMG). The generation and processing of the data happens in a linear process. After planning an experiment, measurements are taken. The data is preprocessed and then analyzed. The pre-processing is ad hoc and differs based on needs. The measurements and results are archived. Archival is file-based and there is no metadata. A project typically has 20 to 50 subjects. More and more longitudinal studies are performed.
Computational Steps
Typical preprocessing steps include motion correction, noise filtering, and compression on a per-subject basis. Over the experiment or a collection of subjects, statistical analysis is performed.
Example Project
A large meta-project happening within the Donders Institute is the BIG (Brain Imaging Genetics) project, which aims to look at the relation between genetics and the brain. Currently the project has some 2,500 brain scans and some 1,300 gene sequences. For statistical analysis from brain structure to genetics, more than 10,000 pairs of scans and sequences are needed, that is, to generate hypotheses. However, even with the limited data currently available, it is already possible to work hypothesis-based. Usually they postulate relations between the presence or absence of certain genes and the volumes or connectivity of brain areas. This has already resulted in many publications: see http://cognomics.nl/big/big-publications.html .
Data Management
Data is stored centrally and backed up daily. Next to the central data, all data is archived on portable hard drives. All mutations to the data are catalogued by hand.
Data Reuse
Data reuse within the group is common. External reuse is uncommon due to privacy laws and ethics. These days the consent forms signed by the subjects are broader, enabling more cases of reuse. There are some external cooperations with other institutes; in such cases an MOA is in place. All communication on the existence of data happens within the scientific dialogue; there is no search system.
Community Data Analysis: Meertens Institute
Analysis provided by MPI-PL
General Data Flow
The dataflow depends on the type of data. In the figure we show the normal dataflow for scanned and OCRed (historical) text. Next to texts, other types of data are frequently added, such as dictionaries. Typically the cleanup of OCRed text is done automatically with the TiCCL tool. Afterwards, if applicable, further automatic and manual annotations are added, such as lemmata, part-of-speech tags, named entities, and increasingly opinion mining. Naturally, detailed metadata is added. This whole set of primary data, secondary data (annotations), and metadata is stored.
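The institute's automatic cleanup of OCRed text uses the TiCCL tool, which is based on anagram hashing. The sketch below is not TiCCL; it only illustrates the general idea of lexicon-based post-correction, snapping noisy OCR tokens to nearby lexicon entries with a much simpler similarity criterion. The lexicon and the cutoff value are assumptions made for the example.

# Simplified lexicon-based OCR post-correction. This is not TiCCL (which uses
# anagram hashing); it only illustrates the general idea of snapping noisy OCR
# tokens to nearby entries of a trusted lexicon.
from difflib import get_close_matches

def correct_tokens(tokens, lexicon, cutoff=0.85):
    lexicon = set(w.lower() for w in lexicon)
    corrected = []
    for tok in tokens:
        if tok.lower() in lexicon:
            corrected.append(tok)
            continue
        match = get_close_matches(tok.lower(), lexicon, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else tok)
    return corrected

# correct_tokens(["historisehe", "teksten"], ["historische", "teksten"])
# -> ['historische', 'teksten']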
Next to this stored set, the possibility nowadays also exists for users to generate their own annotations and store these in the system. These are stored separately. The Meertens Institute has developed and uses profiles to ingest data into the system. That is, if they have ingested the data type before, they have a largely automated workflow for ingesting it.
(Diagram: scanning produces images, which pass through OCR, cleanup and annotation into the main store; user annotations are kept in a separate user store.)
Figure 1: Typical dataflow in the Meertens Institute. Note that it is possible for users to have their own, extra layers of annotation.
Computational Steps
Typical computational steps taken are computational linguistic processes. Frequently used are: tokenization; automatic clean-up of text (using hashing collisions); lemmatization; part-of-speech tagging; sentiment mining; named entity recognition; and metadata generation.
Example Projects
Nederlab is a project that aims to make all digitized Dutch texts available to researchers and students. The collection ranges from the year 800 to the present. Within Nederlab the fully automated process is combined with human curators. These curators control the data quality and make sure, for instance, that texts are linked to the right author entity. Within Nederlab, researchers will be able to construct their digital workflows in order to post-process the selected data. The resulting processes can be stored for further use. The processes can result in further annotations on the primary data. These annotations can also be stored for further use.
Data Reuse
All data in the system is inherently reused, as the Meertens Institute deals with historical texts. Some specific steps are taken to encourage reuse and interoperability, such as the tagging of data categories with ISOcat categories and the creation of an institute-names lexicon (using OpenSKOS). Making sure the data is findable encourages reuse. One of the ways the findability of the data is increased is by the embedding of the Meertens Institute in the European CLARIN infrastructure.
Community Data Analysis: Microbiology
Example: Molecular Biology department of the MPI for Developmental Biology (Jonas Müller, Joffrey Fitz, Xi Wang)
Analysis provided by MPI-PL
General Data Flow
Data at the Molecular Biology department comes in three types, namely: 1) sequencing data (DNA and RNA), 2) automated images of growing plants, and 3) small-scale imaging, e.g. of guppies, microscopy photography, and plants in their environment. Of these, the first generates by far the largest amount of data. For the sequencing data, systems are in place for data management upon release; the other types are generally released as online supplements to publications without any specific data management. In the rest of this short report we focus on sequencing data. Sequencing data is produced in a raw form on the sequencing machine. On the machine itself it is converted to a usable format, and that data is copied to central storage. Each project has a pre-allocated amount of storage. The storage of each project is associated with a detailed description of the project, though there is no fixed standard for this. If the data is no longer needed for active research, it is moved to a tape archive. In the case of RNA sequences, the environment of the sample is pertinent to the sequences; typically time series with a controlled environment are used.
Computational Steps
Sequencing data is filtered after it is copied from the sequencer.
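As the next sentences explain, the sequencer attaches a quality value to each base and low-quality material is discarded. The sketch below shows a simplified, read-level variant of such a filter over FASTQ input (the text describes per-base discarding); the Phred+33 quality encoding and the threshold value are assumptions made for the example, and production pipelines use dedicated tools.

# Minimal FASTQ quality filter: keep only reads whose mean Phred quality meets a
# threshold. The Phred+33 encoding and the threshold are assumptions for this
# illustration; real pipelines filter or trim per base with dedicated tools.
def filter_fastq(in_path, out_path, min_mean_q=20):
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            record = [src.readline() for _ in range(4)]   # @id, sequence, +, qualities
            if not record[0]:
                break                                      # end of file
            quals = [ord(c) - 33 for c in record[3].strip()]
            if quals and sum(quals) / len(quals) >= min_mean_q:
                dst.writelines(record)

# filter_fastq("run_raw.fastq", "run_filtered.fastq", min_mean_q=20)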
The sequencer attaches a quality value (the probability that the base is correct); bases are discarded if this probability is too low. Typically the short sequences that come out of the machine are also aligned. This happens in two ways: 1) by comparing the sequences to a known reference sequence to determine the absolute position on the genome, or 2) by de novo assembly, which typically results in an incomplete genome with many possible parts of sequences; this is still useful. For the main subject of the research of the microbiology group (Arabidopsis thaliana), a reference sequence is available with many annotations on the functions of genes and many other aspects. The sequences found in the sequencing step are addressed using the absolute positions they take on the reference genome. Interesting for research are the differences between the sequenced genome and the reference genome. These variants can take the form of deletions, inversions, or single-base differences.
Example Projects
One project spanning five years looked at the rate of mutation in Arabidopsis thaliana over 30 generations (DOI: 10.1126/science.1180677). Many sequences were needed for this project, as well as good data management. The project found that "Our results imply a spontaneous mutation rate of 7 × 10^−9 base substitutions per site per generation, the majority of which are G:C→A:T transitions. We explain this very biased spectrum of base substitution mutations as a result of two main processes: deamination of methylated cytosines and ultraviolet light-induced mutagenesis". Another example project is ongoing. For this, some 80 Arabidopsis thaliana varieties from around the world were sequenced. The aim of this project is to cross all 80 varieties to study viability. This will help isolate genetic factors which make certain crosses unviable. One important factor is the amount and type of immune-system genes in the cross.
Data Management
Data management is done internally in two stages. First, the data is stored on the working file system, with backups, while it is being used. Second, once an article is published about it, the data is made available over the internet. Typically the data is made available using SRA or ENA. All published data is available freely over the internet. For microbiological genetic data a PID is assigned from a new PID system that is currently being developed by the community.
Data Reuse
Reuse of raw data is rare for two reasons: 1) the subjects of the research (Arabidopsis thaliana) are very easy to grow, and 2) the genetic sequencers get much better over time. The result is that if similar data is needed, it is usually better to redo the relevant parts of the experiment; the resulting data is better, which makes analysis much easier. On the other hand, analysis results derived from the data are reused frequently. These are usually available as a tab-separated file.
Community Data Analysis: Arctic Data
Example: NIOZ
Analysis provided by MPI-PL
General Data Flow
Most data that arrives at the NIOZ institute is collected by its ship, the Pelagia. During a voyage two things are collected: 1) measurements, and 2) samples. Measurements are varied, but the typical one is CTD: conductivity, temperature and depth. The samples consist of water and soil samples, for which location, depth and so on are recorded. The measurements are preprocessed on return of the ship to the institute and then stored in the relevant databases and stores.
The samples are stored at the institute on return of the ship. The sample locations, depth, age, and type of storage are recorded in the archive. If a sample is analyzed at some point, the output of that analysis is stored in the archive as well. Furthermore, in the field, data is made available in real time to third parties. This is for instance used for the calibration of Argo systems. The Argo system consists of autonomous underwater robots that surface once every 10 or so days and that have to calibrate their instruments at that point. All the collected data is made available on the internet. There are databases and stores of several types, including CTD data, anchor points, sediment data and optical measurements, available via www.nodc.nl.
(Diagram: the Pelagia delivers samples to storage and measurements to preprocessing; preprocessed measurements go into the archive, from which data are published; sample analyses are added to the archive at a later time.)
Figure 2: Overview of the rough dataflow in the NIOZ institute. The "archive" consists of a number of databases and stores. Preprocessing is done in many ways by different departments. The dashed line indicates that this step happens at an undetermined time.
Computational Steps
The main computational steps are taken during the pre-processing of the CTD data. The raw measurements output by the instrument are converted to physical units and cleaned up. Cleaning up consists of, amongst other steps, calibration against known-good values and the removal of spikes (an illustrative sketch of this kind of cleanup is given at the end of this section). After cleanup the metadata is checked and added. Furthermore, metadata on "meetstations" (measuring stations) is added. A "meetstation" is a place on the voyage where the ship has maintained a fixed position for some time in order to take measurements. Specific computational steps for specific types of data (e.g. sediment analysis) are performed by the relevant departments within the institute.
Example Projects
A new project is currently underway to integrate the efforts of the NIOZ with respect to the Waddenzee with those of other entities. Other major partners in this project are the universities of Nijmegen and Groningen, IMARES, and SOVON. Their goal is to join up their project with ILTER and LTER-Europe. The project will make all data collected in the Waddenzee, both ecological and socio-economic, available at one central web location.
Data Management
Data is stored in two separate locations in the institute. These locations are at the extreme ends of the building. The usual measures of multiple stores and backups are used. Due to the size of the building, it is extremely unlikely that both data centres would be lost.
Data Reuse
The collected data, as made available on the website, is reused for research both inside and outside the institute. Furthermore, the samples that are stored at the institute are also reused for different projects. After all, it is much cheaper to analyze pre-existing samples than to go out to the Arctic to take new samples. The possibility for data reuse is a major source of new cooperative projects for the institute: the publication of their data results in researchers approaching NIOZ for collaboration. This is a clear benefit.
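The CTD pre-processing described under Computational Steps above, i.e. conversion of raw instrument output to physical units, calibration against known-good values and spike removal, can be sketched as follows. The linear calibration coefficients and the running-median spike criterion are illustrative assumptions, not NIOZ's actual procedure, and the sketch assumes numpy.

# Illustrative CTD pre-processing: convert raw counts to physical units with
# linear calibration coefficients, then replace spikes detected against a
# running median. Coefficients and thresholds are assumptions for the example.
import numpy as np

def calibrate(raw_counts, gain, offset):
    return gain * np.asarray(raw_counts, dtype=float) + offset

def despike(values, window=5, max_dev=3.0):
    v = np.asarray(values, dtype=float)
    half = window // 2
    padded = np.pad(v, half, mode="edge")
    running_median = np.array([np.median(padded[i:i + window]) for i in range(v.size)])
    spread = np.median(np.abs(v - running_median)) or 1e-9   # robust spread estimate
    spikes = np.abs(v - running_median) > max_dev * spread
    cleaned = v.copy()
    cleaned[spikes] = running_median[spikes]                  # replace spikes with local median
    return cleaned

# temperature_c = despike(calibrate(raw_temperature_counts, gain=0.001, offset=-5.0))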