Entering the Data Era:
Digital Curation of Data-intensive Science... and the role Publishers can play
The STM view on publishing datasets
Bloomsbury Conference 2010
London, 24 June 2010
Eefke Smit,
International Association of STM publishers
Director, Standards and Technology
Context: The Fourth Science Paradigm
Jim Gray, Microsoft Research to the National Research Council in 2008:
4 Science Paradigms:
1. Thousand years ago, Science was Empirical
describing natural phenomena
2. Last few hundred years: Theoretical
using models and generalisations
3. Last few decades: Computational
simulating complex phenomena
4. Today: Data Exploration
unifying theory + experiment + simulation
[Figure: three layers of research output: Publications; Processed Data / Data Presentations; Raw Data]
Context
“…… increased availability of primary sources of data in digital form has the potential to
shift the balance away from research based on secondary sources such as publications,
thus positioning data as the central element in the scientific process.” (a statement from the
Director of the Directorate General for Information Society and Media of the European Commission, 2008)
“If the raw data doesn’t form a central part of the scientific record then we perhaps need to
start asking whether the usefulness of that record in its current form is starting to run out.”
(from a blog called Science in the Open: http://blog.openwetware.org/scienceintheopen/2008/05/16/avoidthe-pain-and-embarassment-make-all-the-raw-data-available/)
“..let us get back to the days where observational scientists could justify peer reviewed
publication primarily on the basis of collection, description and reporting of high quality data
sets (usually with some basic level of interpretation)..” (from a discussion paper called “The
Risk-Reward Basis for Data Publication”, marine sciences, 2007)
“Problem = scientific community does not see online data as ‘publication’” (from a presentation
called: How to motivate scientists to publish data online, Mark J. Costello, June 2008)
How the volume of Data will grow
Estimated amount of data stored per research project
[Bar chart: share of respondents per storage band (0MB, 1-100MB, 100MB-1GB, 1GB-1TB, 1TB-1PB, 1PB-10PB, >10PB, Don't Know), in three series: Current, In 2 years, In 5 years]
What types of Data?
Data types used by researchers
Office docs: 94%
Network-based data: 79%
Images: 79%
Plain text: 55%
Archived data: 53%
Scientific/statistical data formats: 47%
Databases: 46%
Source code: 46%
Software apps: 46%
Raw data: 45%
Multimedia data: 32%
Structured text: 23%
Configuration data: 21%
Structured graphics: 17%
Other: 5%
What happens to Data now?
Where do you as a researcher store your data for future use?
Computer at work: 81%
Portable storage carrier: 66%
Organisational server: 59%
Computer at home: 51%
Submitted with journal (at publisher): 15%
Digital archive of organisation: 14%
Digital archive of discipline: 6%
Other: 3%
Don't store digital research data: 3%
External web service: 2%
What plans for digital curation?
Plans for digital archive?
Yes, <1 year: 5%
Yes, 1-3 years: 5%
Yes, 3-5 years: 2%
Yes, >5 years: 4%
Don't Know: 84%
Ever needed Data from others that was not available?
Did you ever need digital research data gathered by other researchers that was not available?
Yes: 53%
No: 28%
Don't Know: 19%
Problems with sharing Data - 1
How openly available is your data?
My data is openly available for my research group / colleagues in research collaboration: 58%
My data is openly available for everyone: 25%
Access to my data is temporarily restricted: 16%
I do not share my data, but I would like to do so in the future: 16%
My data could be made available with appropriate changes (e.g. anonymous clinical data): 11%
My data is openly available for my research discipline: 11%
I do not share my data and I do not want to share it in the future: 6%
My data is available for a fee: 4%
Problems with sharing Data - 2
Barriers for sharing research data
Legal issues: 41%
Misuse of data: 41%
Incompatible data types: 33%
Lack of technical infrastructure: 28%
Lack of financial resources: 27%
Fear of losing scientific edge: 27%
Restricted access to data archive: 21%
No problems foreseen: 16%
Other: 10%
What do scientists want...
How to locate data?
Where to submit data?
What publishers currently do
Can authors submit their underlying digital research data with their publication to you?
[Bar chart, two series by publisher size (<50 journals, >50 journals): Yes 71% and 57%; No 14% and 28%; Don't Know 15% and 14%]
[Pie chart, number of journals covered in survey (n = 9050): Yes 94%; No / don't know 6%]
What publishers currently do
Data types accepted by publishers
[Bar chart, two series by publisher size (<50 journals, >50 journals), for each data type: Office docs, Images, Plain text, Multimedia data, Scientific/statistical data formats, Structured graphics, Databases, Archived data, Structured text, Source code, Network-based data, Raw data, Software apps, Configuration data, All of the above, Other, Don't Know]
What publishers currently do
Does your organisation have a policy for preservation of digital publications?
[Bar chart, two series by publisher size (<50 journals, >50 journals): Yes 55% and 84%; No 34% and 8%; Don't Know 10% and 8%]
[Pie chart, number of journals covered in survey (n = 9050): Yes 93%; No / don't know 7%]
What publishers currently do
Do you have preservation arrangements for underlying digital research data?
[Bar chart, two series by publisher size (<50 journals, >50 journals)]
No preservation arrangements for digital research data exist (yet): 69% and 69%
Yes, same as for our publications: 20% and 17%
Yes, through a data archive other than for our publications: 10% and 3%
Other (please specify): 2% and 10%
Who should preserve research data?
Who is responsible for the preservation of digital research data?
[Bar chart, two series by publisher size (<50 journals, >50 journals), for each candidate: Author; The author’s institute; Publisher; National library; Research community (researchers collectively); Government; European Union; A specialised external organisation (Portico, CLOCKSS, etc.); A coalition of publishers; Other (international) organisation; Don’t know]
Solutions for datasets from publishers
Instructions to authors in “Tetrahedron”
Supplementary files are linked directly from an article’s abstract page.
Supplementary files are referenced within the article text and linked via the article’s abstract page using the DOI.
How do Publishers view research data in the context of “IP”?
The Publishing Industry (STM/ALPSP) position is:
“…..believe that, as a general principle, data sets, raw data outputs of research,
and sets or subsets of that data should wherever possible be made freely accessible
to other scholars” (Statement from STM & ALPSP, June 2006)
It is also stated that:
“….articles published in scholarly journals often include tables and charts in which
certain data points are included or expressed. Journal publishers often do seek
the transfer of or ownership of the publishing rights in such illustrations.., but this
does not amount to a claim to the underlying data itself..”
Research data and the Publisher’s Mission
Publishers are committed to making genuine contributions to the research communities...
Can we meaningfully contribute to an “editorial” process for data?
  • Submission processes
  • Editorial organization, review
Can we contribute to the data dissemination/retrieval process?
  • Storing, linking
  • Search, discovery
Can we contribute to research workflows?
  • Metadata, collections, ontologies
  • Visualization, mining, etc.
• support to the scholarly communication process
• increased availability of research output
• increased citations to research output
• increased overall quality of research
• develop new means of knowledge discovery
• increase in research efficiency
Support through the journal networks and publishing platforms
Move from...
• General instructions to make available
• Available as supplementary information with the online article
• Textual references to data repositories & datasets
• Verbal instructions, limited support by editorial team
To...
• “More granular” definition of research data and supplementary information
• Specific instructions on how, when and where to submit, and how to cite
• Specific sustainable destinations for research data
• Agreed formats & metadata requirements for data submission
• Expand editorial teams with a “data-editor”
• Hyper-linking between articles and (final) dataset destinations and vice versa
• “Federated searching”
• Intelligent (contextual) referencing of datasets in articles
Note: a successful implementation requires a combination of domain-specific and generic solutions
Working examples...
Vice versa
What Publishers are busy solving
• Peer review practices
• Readability, navigation, accessibility, presentation
• Discoverability: search, metadata, linking, citability
• Copyright issues
• Preservation and long term archiving
• Version control/ dynamic data
• Access, permissions for re-use
• Editorial practice and support
See joint NISO/ NFAIS initiative: http://www.niso.org/topics/tl/supplementary/
What is next: the stuff in between...
[Figure: three layers of research output: Publications; Processed Data / Data Presentations; Raw Data]
So stay tuned for new experiments...
Conclusions
• Many publishers are well aware of the impact of the advent of the Data Era and the 4th paradigm in Science
• They are getting prepared to handle datasets and to ensure their longevity, preservation, access and re-use in combination with the publications.
• To make solutions scalable and sustainable, publishers need convergence of stakeholders:
• Good collaboration with all players in the chain: researchers, research institutes, safe data repositories, libraries, policymakers
• Development of standards and common practice, building on what is in place already: from persistent identifiers and citation conventions to submission guidelines across scholarly journals