Archiving Research Data, Dryad, and Publishers
Neil Beagrie, Charles Beagrie Ltd
Bloomsbury Conference, June 2010
With contributions from Julia Chruszcz, Peter Williams, and Todd Vision

Overview
• The Challenge;
• The Dryad Consortium;
• Supplementary Data and Publishers;
• Research Data Preservation Costs (KRDS);
• The Future.

The Challenge

PRC Global Study
[Figure: chart from the PRC global study; sample sizes shown: n=3759, n=2940, n=1262, n=1653, n=2989, n=2118, n=1294, n=2565, n=1868, n=2273, n=841, n=2362]
Source: PRC global study (forthcoming)

Requesting Data
• Wicherts et al. (2006 Am. Psychol. 61, 726) requested data from the 141 most recent articles in American Psychological Association (APA) journals.
“6 months later, after … 400 emails, [sending] detailed descriptions of our study aims, approvals of our ethical committee, signed assurances not to share data with others, and even our full resumes…”
Only 27% of authors shared their data

The Dryad Consortium of Scholarly Societies and publishers (and libraries)

Archiving at publication
• Avoids loss, corruption, obsolescence of data files;
• The point in time when authors are best able to ensure the correctness of data and metadata;
• Authors have incentive to deposit their data in order to complete the publication process;
• Journals are best able to monitor compliance with policy;
• In short, the “Genbank model” works.

Incentives to authors
• Access to colleagues’ data
• Visibility and citability
  – Another way for work to have high impact
• Integration
  – Combinability with other data adds value
• Long-term preservation
  – Including data format migration
• Ad hoc data sharing can be burdensome
  – Deposition to multiple specialized repositories
  – Fulfilling individual requests for data takes effort

Joint Data Archiving Policy
• DEPOSIT AT PUBLICATION – As a condition for publication, all data used in the paper should be archived in an appropriate public archive.
• REPEATABILITY – Data should be given in sufficient detail that, together with the paper content, each result in the published paper may be re-created.
• EMBARGO – Authors may elect to have the data publicly available at time of publication or, if the archive allows, opt to embargo access to the data.
• EXCEPTIONS – Exceptions may be granted at the discretion of the editor, especially for sensitive information such as the location of endangered species.
• COORDINATION – The aim is for the Dryad consortium of journals to adopt this policy simultaneously.

That’s all well and good, but where’s this “appropriate public archive”?

A mosaic of specialized databases
• There is a growing number to which deposition is encouraged/required (Genbank, Treebase)
  – And others are emerging
• A world in which every datatype had its own required database, each with its own submission system:
  – Would be a huge burden on authors
  – Would inevitably leave some data orphaned
  – Might never be financially possible

Overcoming the submission burden
• Integrating journal submission and data submission
  – Prepopulating bibliographic metadata
  – “Handshaking” with specialized repositories
• Enhancing low-quality author-provided metadata
  – Human curation
  – Machine-assisted metadata enhancement

The Dryad Digital Repository

The Repository
• Dryad is a repository (at Duke) for datasets underlying scientific research articles;
• Its initial focus has been evolution and ecology;
• Participating journals subscribe to the Joint Data Archiving Policy;
• Dryad datasets will have digital object identifiers (DOIs) and Creative Commons ‘CC-Zero’ licenses;
• Project funded by the National Science Foundation, 2008-2012;
• Sustainability plan a key deliverable.
Supplementary Data and Publishers

Overview
• Consultancy for Dryad Sustainability: covered areas of the draft business plan and sustainability for Dryad
• Presenting one of the contributions (publishers) to the section on Comparators and Costs
• Outcomes from desk research and 12 interviews with publishers/data publishers + some additional input drawn from Keeping Research Data Safe
• Very brief presentation – article in preparation for Learned Publishing Oct 2010 issue; KRDS2 available from JISC

Interviewees
• Journal of Clinical Investigation
• Journal of the American Medical Association
• Molecular Phylogenetics and Evolution (Elsevier)
• Journal of Heredity (OUP)
• Ecological Society of America
• Wiley-Blackwell + Ecology Letters
• Royal Society
• Federation of American Societies for Experimental Biology
• OECD Publishing
• Internet Archaeology and Archaeology Data Service
• Pangaea: Publishing Network for Geoscientific & Environmental Data
• Dataverse Network (Social Sciences, Harvard)

Some Findings: growth
• Many interviewees stated that supplementary data and materials are showing rapid growth
• 3 gave figures: from 32 articles in 2000 to 251 in 2009 (784% of the 2000 figure, i.e. an increase of 684%); from 6% in 2005 to 38% in 2009; from 2% a decade ago to 87% in 2009.
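
The first of these growth figures can be checked with a short calculation. The sketch below is illustrative only (the function name `growth` is hypothetical and not taken from the study). It separates two quantities that are easy to conflate: the end value expressed as a percentage of the start value, and the percentage increase over the start value. For 32 and 251 articles, 784% corresponds to the former.

```python
def growth(start: float, end: float) -> tuple[float, float]:
    """Return (end as a percentage of start, percentage increase over start)."""
    if start <= 0:
        raise ValueError("start must be positive")
    ratio_pct = end / start * 100
    increase_pct = (end - start) / start * 100
    return ratio_pct, increase_pct


# Figures quoted by one interviewee: 32 articles in 2000, 251 in 2009.
ratio_pct, increase_pct = growth(32, 251)
print(f"2009 figure is {ratio_pct:.0f}% of the 2000 figure")
print(f"Increase over 2000: {increase_pct:.0f}%")
```

The two results always differ by exactly 100 percentage points, which is why the distinction matters when quoting growth rates.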
Some Findings: workflow
• Supplementary data have grown organically at the various journals investigated (author driven);
• Both the work and the costs are being absorbed into the daily running of journals;
• In 4 cases there was minimal impact on work duties; in 5 others there was a significant but often unquantified impact (two of these might be considered data publications, with a focus on publishing data papers or datasets); and in 3 cases the information was not available or unknown;
• These differences can be explained in terms of the level of effort or importance applied: the greatest levels of effort are associated with copy editing, format migration, addition of metadata, etc., whilst the least effort is required for simply hosting the material; and/or by high levels of automation in the workflow.

Some Findings: costs
• These were in most cases unknown or only partially known;
• Costs mentioned but usually not quantified include digital storage costs, salary costs of journal staff, and long-term preservation costs;
• Detailed cost information was really only available from Internet Archaeology via the Archaeology Data Service, which had participated in an activity-based costing study (KRDS2);
• Internet Archaeology archiving costs reflect those for a “dataset publisher”, so they are only a comparator for part of Dryad’s content – large datasets.

Some Findings: revenue
• Only author fees and journal subscription fees were mentioned as current revenue sources for the supplementary materials in journals;
• 3 journals interviewed have author charges for supplementary materials (see next slide);
• The data archiving and sharing organisations interviewed relied primarily on (uncertain) research grants and temporary or recurrent core funding, but one had access to a small endowment and another has a charging policy for some depositors.
Some Findings: author charges
• Journal of Clinical Investigation – authors are charged $300 for supplemental data to appear online with accepted articles;
• Ecological Archives – submission of ‘appendices and supplements’ is free up to 10 MB. Above this, there is a fee of $250 for the first 1 GB and $50 for each subsequent GB. The fee for publication of a data paper is $250 for publication of the abstract in the relevant journal plus publication of up to 10 MB in Ecological Archives. An additional $250 is charged for datasets between 10 MB and 1 GB, and for larger datasets there is an additional fee of $50 per GB;
• The Federation of American Societies for Experimental Biology (FASEB) charges $100 for each supplemental file.

Keeping Research Data Safe (KRDS1 & KRDS2): JISC-funded studies of Research Data Preservation Costs
(separate Dryad costing project by Lori Eakin-Richards based on KRDS approach)

KRDS: what did we learn?
Whole-of-service costing / seeing the “Big Picture”

Selection of 2009 Allocation of UKDA Activity Costs
• Acquisition: 5.8%
• Ingest: 21.5%
• Archival Storage + Preservation Planning: 3.1%
• Access: 16.9%

KRDS: Implications
• Changing view of digital preservation costs:
  – “Getting stuff in and out” costs much more than “keeping it” (bit preservation + migration);
  – Staff costs c. 70% of total costs;
  – Importance of economies of scale and automation;
  – Findings of KRDS and the Dryad Repository’s own activity costing projections fed into Dryad sustainability planning.

Future Plans
• Dryad sustainability plan being put to Dryad member societies and publishers;
• Dryad extending consortium to new members – achieving economies of scale;
• Bid to JISC to establish Dryad-UK;
• Extending KRDS research and implementations.

Further Information
Dryad: see www.datadryad.org
Keeping Research Data Safe 2 (KRDS2) webpage at www.beagrie.com/jisc.php
KRDS2 report available from the JISC website: http://www.jisc.ac.uk/publications/reports/2010/keepingresearchdatasafe2.aspx#downloads
Email: neil@beagrie.com
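
Worked example: the Ecological Archives charges listed under “Some Findings: author charges” form a tiered schedule that can be sketched in a few lines. This is a minimal illustration, not the publisher's own calculator. The function names are hypothetical, and three assumptions are not stated in the source: 1 GB is treated as 1000 MB, partial gigabytes above the first are rounded up, and the $50 per GB charge for data papers applies only to gigabytes beyond the first, as it does for supplements.

```python
import math

FREE_LIMIT_MB = 10   # free allowance stated in the slide
MB_PER_GB = 1000     # assumption: decimal gigabytes
FIRST_GB_FEE = 250   # fee covering everything above 10 MB up to 1 GB
EXTRA_GB_FEE = 50    # fee for each subsequent GB
ABSTRACT_FEE = 250   # data paper: abstract plus up to 10 MB of data


def supplement_fee(size_mb: float) -> int:
    """Fee in USD for 'appendices and supplements' of the given size."""
    if size_mb < 0:
        raise ValueError("size_mb must be non-negative")
    if size_mb <= FREE_LIMIT_MB:
        return 0
    # Assumption: any partial GB beyond the first is charged as a full GB.
    extra_gb = math.ceil(max(0, size_mb - MB_PER_GB) / MB_PER_GB)
    return FIRST_GB_FEE + EXTRA_GB_FEE * extra_gb


def data_paper_fee(size_mb: float) -> int:
    """Fee in USD for a data paper: abstract fee plus any size-based charge."""
    return ABSTRACT_FEE + supplement_fee(size_mb)


for size in (5, 500, 2500):
    print(size, "MB:", supplement_fee(size), "supplement,", data_paper_fee(size), "data paper")
```

Under these assumptions a 2.5 GB supplement would cost $350 and a data paper of the same size $600. A different rounding rule for partial gigabytes would change the results for sizes above 1 GB.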