Publishing Data Workflows
RDA Plenary 5 -- March 11, 2015
Session Chairs: Amy Nurnberger and Mary Vardigan
Please sign in: http://bit.ly/1Hju0LM

Agenda
• Introduction: Objectives
• Progress so far
• Workflow Examples
• Get involved
• Dataverse workflow presentation
• SoftwareX workflow presentation
• Use case development
• Group notes document: http://bit.ly/1MlXysR

The working group members (currently)
• Theodora Bloom (BMJ) [CO-CHAIR]
• Sünje Dallmeier-Tiessen (Switzerland, CERN) [CO-CHAIR]
• Elizabeth Newbold (BL) [CO-CHAIR]
• Merce Crosas (US, Harvard University)
• Michael Diepenbroek (PANGAEA)
• Kim Finney (Australia, AADC)
• John Helly (US, UCSD)
• Brian Hole (Ubiquity Press, UK)
• Varsha Khodiyar (Nature Scientific Data)
• Hylke Koers (The Netherlands, Elsevier)
• Rebecca Lawrence (UK, F1000 Research Ltd.)
• Fiona Murphy (UK, Wiley-Blackwell)
• Amy Nurnberger (US, Columbia University Libraries)
• Lisa Raymond (US, Library Woods Hole Oceanographic Institution)
• Johanna Schwarz (Germany, Springer)
• Jonathan Tedds (UK, University of Leicester)
• Mary Vardigan (US, ICPSR)
• Ruth Wilson (UK, Nature)
• Eva Zanzerkia (US, NSF)
• Angus Whyte (UK, DCC)
• And growing… Others are very welcome ☺

Background and Motivation
• Only a small fraction of research data is preserved and shared, often with a bare minimum of metadata
• Often due to the lack of “established” or “trusted” services and workflows

But there are established or emerging workflows!
• Usually in selected disciplines, e.g., Earth Sciences
• Some provide credit via citation mechanisms

Objectives
• Provide an analysis of a representative range of existing and emerging workflows and standards for data publishing
  • Including deposit and citation
• Provide reference models, a “classification”
• Test implementations of key components for application in new workflows
• Illustrate the benefits of the reference models for researchers and organisations

Relevance
• Information about workflows crucial for researchers and other stakeholders to understand the options available to practice open science
• Helps to illustrate different possibilities for data sharing, leading to more efficient and reliable reuse of research data
• Shows those involved in research data where they fit in the overall scheme of things

More detailed work programme
• Identification of a smaller set of reference models covering a range of such workflows to include:
  • For example, when and where QA/QC and data peer-review fit into the publishing process
  • Who does what and when…
  • Automated vs. “manual” processes
• Selection of key use cases and organizations in which components of a reference model can be implemented and tested for suitability
  • For example: dedicated data peer review
  • For example: metadata checks

First results of workflow analysis
http://tinyurl.com/mvtbrek

Workflows in the current list
- STFC, data centre
- NSIDC, data centre
- ENVRI reference model
- OJS/Dataverse
- INSPIRE, digital library
- NPG (PubChem & Scientific Data), publisher
- UK Data Archive/Service
- PREPARDE (NCAR CISL)
- Ocean Data Publication Cookbook (UNESCO IOC)
- PURR, institutional repository
- ICPSR
- Edinburgh Datashare
- F1000 Research
- Ubiquity Press: Open Health Data Journal+...
- PANGAEA - Data Publisher for Earth and Environmental Sciences
- WDC Climate - Data Publisher for Climate Sciences
- CMIP / IPCC DDC - International project series in Climate Sciences
- GigaScience
- Dryad digital repository with integrated journals workflow
- Stanford Digital Repository
- Academic Commons: Columbia University Institutional Research Repository
- Elsevier: Data in Brief
- Integrated data publishing solution at Elsevier [through “traditional” journals]

Categories we are looking at
• Discipline
• Function of workflow
• PID assignment to dataset
• PID type -- e.g., DOI, ARK, etc.
• Peer review of data (e.g., by researcher & editorial review)
• Curatorial review of metadata (e.g., by institutional or subject repository?)
• Technical review & checks (e.g., for data integrity at repository/data centre on ingest)
• Discoverability: Indexing of the data -- if yes, where?
• Formats covered
• Persons/Roles involved, e.g., editor, publisher, data repository manager, etc.
• Link to data paper or “standalone” data
• Links to grants, usage of author PIDs
• Data citation facilitated
• Data life cycle referred to
• Standards compliance

Observations
• The researcher/author generally initiates the workflow
• Discipline-specific repositories have the most rigorous ingest and review processes -- more general institutional repositories have a lighter touch
• Journals vs.
repositories: For the former, any peer review is conducted externally; for many of the latter it is internal

Repository view

Simplified generic repository workflow
Researcher with a central role: submission/deposition
[Diagram labels: Producer; Data Deposit; Ingest; Quality Assurance; Review/QA mainly internal; Data Management; LT Archiving; Dissemination; Access; Consumer/Reuse]

[Diagram labels: Producer; Data Deposit; Ingest; Quality Assurance Light; Quality Assurance Detailed; Data Management; LT Archiving; Dissemination; Access; Consumer (disciplinary); Consumer (interdisciplinary). Designed by M. Stockhause]

Project Repositories:
• Data are published in a federated data infrastructure
• Data are added and corrected
• Poor documentation
• Usually no data backup
• Light-weight quality assurance against intl. and project standards
• Tendency that the project data never become stable
• Currently no PIDs assigned or reserved but Handles planned

Long-term Archive:
• Data are archived for the long term at a single location
• Data are stable and curated
• Detailed documentation
• Data backup/redundancy
• Quality assurance process is more detailed and includes a review
• Data are a “snapshot” of the project data at a certain time
• DOIs assigned to data collections

Lessons Learnt and questions
• Very diverse landscape
• Discipline-specific and cross-discipline actions
• Quality assurance a big topic in discipline-specific repositories
• Widespread persistent identification
• Data citation awareness
• Challenge: Bidirectional data-publication linking
• Challenge: Versioning

Publisher’s perspective

Simplified generic publisher workflow
Researcher takes over several roles: submitter, reviewer, editor potentially?
Who takes on which role and responsibility?
Diagram labels: Producer; Article preparation; Article submission; Data Submission; Peer Review Process; Editing;
Publishing; Consumer/Reuse
- Article/data container
- Separate article and datasets
Example: Dryad repository integrated with journals

Lessons learnt and questions
• Recommended repositories for collaboration? Who decides/how?
• External review
  • Open, plus invitation
  • Closed, upon invitation
  • Blind
• Emerging data and software journal landscape: no information yet on uptake

Current and future work

How to get involved
• Contribute to the workflow analysis: http://bit.ly/1BBQQPW
• Contribute your own workflow “walk-throughs” and use cases
• Tell us what is needed for a “successful” workflow in your institute/discipline …

Moving to implementation
• Tell us if you are interested in learning from a specific example or are considering implementing data publishing workflows
• Tell us if you have code/documentation to share

Break for presentations
Dataverse: Eleni Castro
SoftwareX: Hylke Koers

DATA PUBLISHING WORKFLOWS WITH DATAVERSE
Eleni Castro (ecastro@fas.harvard.edu)
Institute for Quantitative Social Science (IQSS), Harvard University
RDA 5th Plenary, WG RDA/WDS Publishing Data Workflows
March 11, 2015

An Integrated & Automated Journal / Data Publishing Workflow
[Diagram labels: Journal; Repository]

Current Workflows in Dataverse: To Connect Data to Journals
A. Journals include Dataverse as a Recommended Repository
B. Authors Contribute Directly to a Journal’s Dataverse
C. Automated Integration of Journal + Dataverse (e.g., OJS)

Example of Option C: Phase 1 OJS / Dataverse Integration
Project Details: 2012-2014
• Integrating Open Journal Systems (OJS) with Dataverse
• Reference Implementation: Automated via SWORD API
• Pilot with ~50 journals + expand to 1000s using OJS. Dataverse plugin is automatically available w/ OJS.
• Future: Embed Dataverse widgets into journal article.
http://projects.iq.harvard.edu/ojs-dvn

In the Backend: Technical Workflow
Client sends:
• XML file: AtomPub “entry” with Dublin Core Terms (e.g., title, creator, isReferencedBy (article citation), …)
• Zip file: All data files associated with that dataset.
Repository sends:
• XML file: “Deposit Receipt” that sends the data citation from repository to client.
Plus updates from client to server during lifecycle (CRUD): In review, reject (delete), publish first version, update new versions.

On the Frontend: OJS Dataverse Plugin Walkthrough

Journal Manager Sets Up Plugin in OJS

Journal Manager Sets Up Data Policies
Including Guidelines for:
1) Authors (data citation)
2) Reviewers
3) Copyeditors
Read full Data Policies / Guidelines Template: http://bit.ly/1xkLjoZ

Author Submits Manuscript + Data (1)

Author Submits Manuscript + Data (2)
To-Do: Support for adding multiple datasets to a journal article.
Option to: (a) deposit into Dataverse OR (b) if data is already in a repository, include the data citation (w/ persistent URL/identifier).

Editor Reviews Article + Data

Approved = Data Published in Dataverse
When issue is published:
1) URL to Article displays in Dataverse.
2) Data Citation shows up in OJS Article (see next slide).

Article in OJS: Published w/ Data Citation

Video of OJS Dataverse Plugin Demo
http://bit.ly/1D1hphu

Phase 2: Expansion of API + Workflows
2015-2016 (collaboration w/ Odum Institute)
Project Goals
1. Expand to more journals, publishing systems, & workflows
2. Develop Community-Based Repository API Standard: Work w/ RDA, WDS, Data FAIRport, FORCE11, CODATA, etc…
Project Questions
• Should we extend the Repository API beyond SWORD?
• Support for additional Metadata Schemas & fields (non-DC)?
• Support for more/which dataset review workflows?

How Do I Get Involved?
1. Find Out More:
   * Visit our Collaborations page: http://bit.ly/1Bg2nkw
   * Dataverse Project Site: http://dataverse.org
2. Sign up to Contribute:
   Repositories Workshop + Dataverse Community Meeting, June 9-11, 2015 @ Harvard: http://bit.ly/1A51atJ
3. Contact Project Coordinator:
   Eleni Castro (ecastro@fas.harvard.edu)

Thank You! Any Questions?
Contact Me: Eleni Castro (ecastro@fas.harvard.edu)

SoftwareX – a home for research software
Hylke Koers, Head of Content Innovation, Elsevier
RDA Plenary 5, San Diego
Open Access

Software (like data) is high-value but hard to access
[Chart: ease of access vs. importance of access, with regions “High value & easy access” and “High value & difficult to access”. Source: researcher survey, 3824 respondents (Publishing Research Consortium, 2010)]

Why SoftwareX?
• Many scholars develop software, but the current paper-based system does not capture this “born digital” research output systematically
  • Users (readers) can’t find this valuable content
  • Developers (authors) can’t claim credit
• Software is a research method in its own right and deserves to receive full academic recognition

SoftwareX: a home for research software
SoftwareX aims to acknowledge the impact of software on today's research practice, and on new scientific discoveries in almost all research domains. SoftwareX also aims to stress the importance of the software developers who are, in part, responsible for this impact. To this end, SoftwareX aims to support publication of research software in such a way that:
• The software is provided with a peer-reviewed recognition of scientific impact;
• The software developers are given the academic credit they deserve;
• The software is citable, allowing traditional metrics of scientific excellence to apply;
• The academic career paths of software developers are supported rather than hindered;
• The software is publicly available for inspection, validation, and re-use.
Above all, SoftwareX aims to inform researchers about software applications, tools and libraries with a (proven) potential to impact the process of scientific discovery in various domains.
From “Aims & Scope”, see http://www.journals.elsevier.com/softwarex

SoftwareX: a home for research software
• Publishing “Original Software Publications”:
  - The software and code can include post-publication updates
  - Metadata is systematically captured
• Article is Open Access under CC-BY license
• All software and code published is, and will remain, fully owned by their developers.
• Peer-reviewed; dedicated software Editors & Reviewers
• Multi-disciplinary
• Submission in 3 easy steps
• GitHub repository to store and expose all software and code
• Launched at FORCE15
See http://www.journals.elsevier.com/softwarex/news/you-can-now-submit-your-software-to-softwarex/

How does it work?
How to submit your software to SoftwareX in 3 easy steps:
1. Select a repository for your software or pack your software into a zip file or archive. Remember to make your software public so that the reviewers and readers can find it.
2. Download the template for the OSP manuscript, and write your article describing your software following this template.
3. Submit your OSP manuscript via the SoftwareX submission site.
After review and acceptance, software and/or code will be copied to the journal archive on GitHub and integrated with the online version of your Original Software Publication available on ScienceDirect.
See http://www.journals.elsevier.com/softwarex

Template contains structured metadata
Nr | Code metadata description | Please fill in this column
C1 | Current code version | For example v42
C2 | Permanent link to code/repository used of this code version | For example: https://github.com/mozart/mozart2
C3 | Legal Code License | List one of the approved licenses
C4 | Code versioning system used | For example svn, git, mercurial, etc.
put none if none
C5 | Software code languages, tools, and services used | For example C++, python, r, MPI, OpenCL, etc.
C6 | Compilation requirements, operating environments & dependencies |
C7 | If available, link to developer documentation/manual | For example: http://mozart.github.io/documentation/
C8 | Support email for questions |

Template contains structured metadata
Nr | (Executable) software metadata description | Please fill in this column
S1 | Current software version | For example 1.1, 2.4 etc.
S2 | Permanent link to executables of this version | For example: https://github.com/combogenomics/DuctApe/releases/tag/DuctApe-0.16.4
S3 | Legal Software License | List one of the approved licenses
S4 | Computing platforms/Operating Systems | For example Android, BSD, iOS, Linux, OS X, Microsoft Windows, Unix-like, IBM z/OS, distributed/web based etc.
S5 | Installation requirements & dependencies |
S6 | If available, link to user manual - if formally published include a reference to the publication in the reference list | For example: http://mozart.github.io/documentation/
S7 | Support email for questions |

Flexible range of open-source licenses for computer code
• Apache License, 2.0 (Apache-2.0)
• BSD 3-Clause "New" or "Revised" license (BSD-3-Clause)
• BSD 2-Clause "Simplified" or "FreeBSD" license (BSD-2-Clause)
• GNU General Public License (GPL)
• GNU Library or "Lesser" General Public License (LGPL)
• MIT license (MIT)
• Mozilla Public License 2.0 (MPL-2.0)
• Common Development and Distribution License (CDDL-1.0)
• Eclipse Public License (EPL-1.0)
• Creative Commons Zero (CC0)

And now.. The moment you have all been waiting for…

A workflow diagram
Diagram labels:
- Editorial + peer-review process
- OSP published on ScienceDirect
- Bi-directional links
- Submits to journal as OSP + code (supp. mat.)
- Researcher has code and paper
- Code made available on journal GitHub instance

A workflow diagram
Diagram labels:
- OSP submitted to journal
- Editorial + peer-review process
- OSP published on ScienceDirect
- Bi-directional links
- OSP linked with code
- Code deposited to (or built on) code repository
- Code made available on journal GitHub instance

Thank you! Any questions?

Discussion

Use case development

Developing use cases for workflows
● The tools
  ○ Part A: http://goo.gl/forms/Wkc7KyxvX5
  ○ Part B: http://goo.gl/forms/ZFRrzG6krX
● The process
  ○ Walk through the tools
  ○ Form up in groups
  ○ Generate use cases

The tools: Part A
http://goo.gl/forms/Wkc7KyxvX5
Thank you! You have completed Part A of this use case. For the next part, you will fill in one copy of a form for each individual actor listed in this use case. Click this to get to Part B: http://goo.gl/forms/ZFRrzG6krX

The tools: Part B
http://goo.gl/forms/ZFRrzG6krX

Group up!
● The tools
  ○ Part A: http://goo.gl/forms/Wkc7KyxvX5
  ○ Part B: http://goo.gl/forms/ZFRrzG6krX
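
Supplementary sketch: AtomPub entry for the Dataverse technical workflow
The Dataverse backend slide describes the client sending an AtomPub "entry" with Dublin Core Terms (title, creator, isReferencedBy). The sketch below shows roughly what building such an entry could look like. It is an illustration under stated assumptions, not the OJS plugin's actual code: the function name build_deposit_entry and all sample values are hypothetical, and only the standard Atom and DCMI Terms namespace URIs are taken as given. Sending the entry and the zip file to a repository is out of scope here.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for Atom and DCMI Metadata Terms.
ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"


def build_deposit_entry(title, creators, article_citation):
    """Return an Atom entry (as a string) carrying Dublin Core Terms.

    Hypothetical helper for illustration; the fields mirror the ones
    named on the slide (title, creator, isReferencedBy).
    """
    ET.register_namespace("", ATOM_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)

    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}title").text = title
    for name in creators:
        ET.SubElement(entry, f"{{{DCTERMS_NS}}}creator").text = name
    # isReferencedBy carries the citation of the journal article
    # that the dataset supports.
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}isReferencedBy").text = article_citation
    return ET.tostring(entry, encoding="unicode")


# Hypothetical sample values, not taken from the presentation.
entry_xml = build_deposit_entry(
    title="Example survey dataset",
    creators=["Doe, Jane", "Roe, Richard"],
    article_citation="Doe, J. and Roe, R. (2015). Example article. Example Journal 1(1).",
)
print(entry_xml)
```

In a real deposit, the repository would answer with a "Deposit Receipt" containing the data citation, which the journal system can then display alongside the article.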