Entering the Data Era; Digital Curation of Data-intensive Science… and the role Publishers can play
The STM view on publishing datasets
Bloomsbury Conference 2010
London, 24 June 2010
Eefke Smit, International Association of STM publishers
Director, Standards and Technology

Context: The Fourth Science Paradigm
Jim Gray, Microsoft Research, to the National Research Council in 2008:
4 Science Paradigms:
1. Thousand years ago, Science was Empirical: describing natural phenomena
2. Last few hundred years: Theoretical, using models and generalisations
3. Last few decades: Computational, simulating complex phenomena
4. Today: Data Exploration, unifying theory + experiment + simulation
[Figure: pyramid with three layers, top to bottom: Publications; Processed Data / Data Presentations; Raw Data]

Context
“… increased availability of primary sources of data in digital form has the potential to shift the balance away from research based on secondary sources such as publications, thus positioning data as the central element in the scientific process.”
(a statement from the Director of the Directorate General for Information Society and Media of the European Commission, 2008)

“If the raw data doesn’t form a central part of the scientific record then we perhaps need to start asking whether the usefulness of that record in its current form is starting to run out.”
(from a blog called Science in the Open: http://blog.openwetware.org/scienceintheopen/2008/05/16/avoidthe-pain-and-embarassment-make-all-the-raw-data-available/)

“..let us get back to the days where observational scientists could justify peer reviewed publication primarily on the basis of collection, description and reporting of high quality data sets (usually with some basic level of interpretation..”
(quote taken from a discussion paper called “The Risk-Reward Basis for Data Publication”, marine sciences, 2007)

“Problem = scientific community does not see online data as ‘publication’”
(from a presentation called: How to motivate scientists to publish data online, Mark J.
Costello, June 2008)

How the volume of Data will grow
Estimated amount of data stored per research project
[Bar chart. Size categories: 0MB, 1-100MB, 100MB-1GB, 1GB-1TB, 1TB-1PB, 1PB-10PB, >10PB, Don't Know. Series: Current, In 2 years, In 5 years. Percentage axis: 0% to 45%.]

What types of Data?
Data types used by researchers
  Office docs: 94%
  Network-based data: 79%
  Images: 79%
  Plain text: 55%
  Archived data: 53%
  Scientific/statistical data formats: 47%
  Databases: 46%
  Source code: 46%
  Software apps: 46%
  Raw data: 45%
  Multimedia data: 32%
  Structured text: 23%
  Configuration data: 21%
  Structured graphics: 17%
  Other: 5%

What happens to Data now?
Where do you as a researcher store your data for future use?
  Computer at work: 81%
  Portable storage carrier: 66%
  Organisational server: 59%
  Computer at home: 51%
  Submitted with journal (at publisher): 15%
  Digital archive of organisation: 14%
  Digital archive of discipline: 6%
  Other: 3%
  Don't store digital research data: 3%
  External web service: 2%

What plans for digital curation?
Plans for digital archive?
  Yes, <1 year: 5%
  Yes, 1-3 years: 5%
  Yes, 3-5 years: 2%
  Yes, >5 years: 4%
  Don't Know: 84%

Ever needed Data from others that was not available?
Did you ever need digital research data gathered by other researchers that was not available?
  Yes: 53%
  No: 28%
  Don't Know: 19%

Problems with sharing Data - 1
How openly available is your data?
  My data is openly available for my research group / colleagues in research collaboration: 58%
  My data is openly available for everyone: 25%
  Access to my data is temporarily restricted: 16%
  I do not share my data, but I would like to do so in the future: 16%
  My data could be made available with appropriate changes (e.g. anonymous clinical data): 11%
  My data is openly available for my research discipline: 11%
  I do not share my data and I do not want to share it in the future: 6%
  My data is available for a fee: 4%

Problems with sharing Data - 2
Barriers for sharing research data
  Legal issues: 41%
  Misuse of data: 41%
  Incompatible data types: 33%
  Lack of technical infrastructure: 28%
  Lack of financial resources: 27%
  Fear of losing scientific edge: 27%
  Restricted access to data archive: 21%
  No problems foreseen: 16%
  Other: 10%

What do scientists want…

How to locate data?

Where to submit data?

What publishers currently do
Can authors submit their underlying digital research data with their publication to you?
  Number of journals covered in survey (n = 9050): Yes 94%, No / don't know 6%
  [Bar chart by publisher size; series: <50 journals, >50 journals; answers: Yes, No, Don't Know. Bar values shown on the slide: 71%, 57%, 28%, 14%, 15%, 14%.]

What publishers currently do
Data types accepted by publishers
[Bar chart; series: <50 journals, >50 journals. Categories: Office docs, Images, Plain text, Multimedia data, Scientific/statistical data formats, Structured graphics, Databases, Archived data, Structured text, Source code, Network-based data, Raw data, All of the above, Software apps, Other, Configuration data, Don't Know.]

What publishers currently do
Does your organisation have a policy for preservation of digital publications?
  Number of journals covered in survey (n = 9050): Yes 93%, No / don't know 7%
  By publisher size (two series, <50 journals and >50 journals; values in the order shown on the slide):
  Yes: 55% / 84%
  No: 34% / 8%
  Don't Know: 10% / 8%

What publishers currently do
Do you have preservation arrangements for underlying digital research data?
  By publisher size (two series, <50 journals and >50 journals; values in the order shown on the slide):
  No preservation arrangements for digital research data exist (yet): 69% / 69%
  Yes, same as for our publications: 20% / 17%
  Yes, through a data archive other than for our publications: 10% / 3%
  Other (please specify): 2% / 10%

Who should preserve research data?
Who is responsible for the preservation of digital research data?
[Bar chart; series: <50 journals, >50 journals. Categories: Author; The author’s institute; Publisher; National library; Research community (researchers collectively); Government; European Union; A specialised external organisation (Portico, CLOCKSS, etc.); A coalition of publishers; Don’t know; Other (international) organisation. Percentage axis: 0% to 60%.]

Solutions for datasets from publishers
Instructions to authors in “Tetrahedron”

Supplementary files are linked directly from an article’s abstract page.

Supplementary files are referenced within the article text and linked via the article’s abstract page using the DOI.

How do Publishers view research data in the context of “IP”
The Publishing Industry (STM/ALPSP) position is:
“… believe that, as a general principle, data sets, raw data outputs of research, and sets or subsets of that data should wherever possible be made freely accessible to other scholars”
(Statement from STM & ALPSP, June 2006)
It is also stated that:
“… articles published in scholarly journals often include tables and charts in which certain data points are included or expressed. Journal publishers often do seek the transfer of or ownership of the publishing rights in such illustrations.., but this does not amount to a claim to the underlying data itself..”

Research data and the Publisher’s Mission
Publishers are committed to making genuine contributions to the research communities…
Can we meaningfully contribute to an “editorial” process for data?
  Submission processes
  Editorial organization, review
Can we contribute to the data dissemination/retrieval process?
  Storing, Linking
  Search, Discovery
Can we contribute to research workflows?
  Meta-data, collections, ontologies
  Visualization, mining, etc.
• support to the scholarly communication process
• increased availability of research output
• increased citations to research output
• increased overall quality of research
• develop new means of knowledge discovery
• increase in the research efficiency

Support through the journal networks and publishing platforms
Move from…
• General instructions to make available
• Available as supplementary information with the online article
• Textual references to data repositories & datasets
• Verbal instructions, limited support by editorial team
To…
• “More granular” definition of research data and supplementary information
• Specific instructions on how, when and where to submit, and how to cite
• Specific sustainable destinations for research data
• Agreed formats & metadata requirements for data submission
• Expand editorial teams with a “data-editor”
• Hyper-linking between articles and (final) dataset destinations and v.v.
• “Federated searching”
• Intelligent (contextual) referencing of datasets in articles
Note: a successful implementation requires a combination of domain specific and generic solutions

Working examples…

Vice versa

What Publishers are busy solving
• Peer review practices
• Readability, navigation, accessibility, presentation
• Discoverability: search, metadata, linking, citability
• Copyright issues
• Preservation and long term archiving
• Version control / dynamic data
• Access, permissions for re-use
• Editorial practice and support
See joint NISO/NFAIS initiative: http://www.niso.org/topics/tl/supplementary/

What is next: the stuff in between…
[Figure: pyramid with three layers, top to bottom: Publications; Processed Data / Data Presentations; Raw Data]
So stay tuned for new experiments…

Conclusions
• Many publishers are well aware of the impact of the advent of the Data Era and the 4th paradigm in Science
• They are preparing to handle research data and to ensure its longevity, preservation, access and re-use in combination with the publications
• To make solutions scalable and sustainable, publishers need convergence of stakeholders:
  • Good collaboration with all players in the chain: researchers, research institutes, safe data repositories, libraries, policymakers
  • Development of standards and common practice, building on what is in place already: from persistent identifiers and citation conventions to submission guidelines across scholarly journals