Science as an Open Enterprise: Open Data for Open Science

Professor Brian Collins CB, FREng
UCL, June 2012
Emerging conclusions from a
Royal Society Policy Report
Open data as the engine of
the “scientific revolution”
Publish scientific theories – and the
experimental and observational
data on which they are based – to
permit others to scrutinise them, to
identify errors, to support, reject or
refine theories and to reuse data
for further understanding and
Henry Oldenburg
Why is “open data” a big current issue?
The data deluge from powerful acquisition tools
coupled with
powerful tools for storing, manipulating, analysing,
displaying and transmitting data
citizens’ interest in scrutinising scientific claims
have created
new challenges & new opportunities that require
new forms of openness and novel social dynamics in
• Maintaining scientific self-correction (closing the
concept-data gap)
• Responding to citizens’ demands for evidence in “public
interest science”
• Exploiting data-intensive science – a 4th paradigm?
• The potential of linked data
• “Data is the new raw material for business”
• Exposing malpractice and fraud
• Stimulating citizen science
Aspiration: all scientific literature online, all data online,
and for them to interoperate
Openness of data per se has no value.
Open science is more than disclosure
For effective communication, we need intelligent openness.
Data must be:
• Accessible
• Intelligible
• Assessable
• Usable
Only when these four criteria are fulfilled are data properly open
Metadata must be audience-sensitive
Scientific data rarely fit neatly into an Excel spreadsheet!
Boundaries of openness?
• Legitimate commercial interests
• Privacy
(complete anonymisation is impossible)
• Safety & Security
But the boundaries are fuzzy & complex
Benefits/costs of open data to the science process
Pathfinder disciplines where benefit is recognised and habits are changing
Bioinformatics (-omics disciplines)
Biological science
Particle physics
Environmental science
Longitudinal societal data
Astronomy & space science
e.g. Gene Expression Omnibus (GEO) – 2700 GEO uploads by non-contributors in 2000 led to 1150 papers
(>1000 additional papers over the 16 that would be expected from investment of $400,000)
Tier 1 – International databases – e.g. Worldwide Protein Data Bank: >65 staff; $6.5M pa;
1% of cost of collecting data
Tier 3 – Institutional data management - UK 2011, average UK university repository
- 1.36 FTE (managerial, administrative, technical)
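The benefit and cost figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are invented here, the "1% of cost of collecting data" figure is read as annual curation cost relative to collection cost (an assumption), and ">65 staff" is treated as exactly 65, so the per-staff figure is an upper bound.

```python
# Figures quoted on the slide; variable names are illustrative, not from the report.
geo_papers_from_reuse = 1150       # papers attributed to reuse of GEO uploads
papers_expected_from_grant = 16    # papers expected from a $400,000 investment

# Additional papers beyond what the same money would normally yield.
additional_papers = geo_papers_from_reuse - papers_expected_from_grant
reuse_multiplier = geo_papers_from_reuse / papers_expected_from_grant

# Worldwide Protein Data Bank: $6.5M per year, quoted as ~1% of collection cost.
wwpdb_annual_cost_usd = 6_500_000
curation_share_percent = 1         # assumption: share is annual curation vs. collection
implied_collection_cost_usd = wwpdb_annual_cost_usd * 100 // curation_share_percent

# ">65 staff" is taken as 65 here, so this is an upper bound per member of staff.
wwpdb_cost_per_staff_usd = wwpdb_annual_cost_usd / 65

print(f"Additional papers from GEO reuse: {additional_papers}")
print(f"Papers relative to expectation: {reuse_multiplier:.1f}x")
print(f"Implied cost of collecting the data: ${implied_collection_cost_usd:,}")
print(f"Annual cost per member of staff (upper bound): ${wwpdb_cost_per_staff_usd:,.0f}")
```

On these figures the ">1000 additional papers" claim holds, and curation at Tier 1 looks small next to the implied cost of generating the data in the first place.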
Levels of data curation
Tier 1 – International databases
Tier 2 – National
(e.g. Research Councils)
Tier 3 – Institutions
(Universities & Institutes)
Tier 4 – “Small science” researchers
& research groups
Financial sustainability?
Priorities for action - 1
1) Change the mindset: publicly funded data is a public resource
2) Credit for useful data and productive, novel collaboration (the
Tim Gowers phenomenon)
3) Mandatory access to data underlying publications
4) Common standards for communicating data
5) Sustainability (the power needs of current modes of data
storage will outstrip the global electricity supply within the
Priorities for action - 2
• R & D on software tools (enabling dynamic data;
managing the data lifecycle; tracking provenance; citation,
indexing and searching; standards & interoperability;
sustainability. Note that the ICT industry is often well ahead, and the US prioritises investment here)
• Institutional responsibility for the knowledge they
create (cumulative small science data > cumulative big
science data)
• Data scientists (they are being trained, and the commercial
demand is large)
“Big Iron” is a national infrastructure priority
“Big data” is a science priority – the big costs
are people and software, not computers
Targets for recommendations
• Scientists – changing cultural assumptions
• Employers (universities/institutes) – data responsibilities; crediting researchers
• Funders of research - the cost of curation is a cost of research
• Learned societies – influencing their communities
• Publishers of research – mandatory open data
• Business – exploiting the opportunity; awareness & skills
• Government – efficiency of the science base; exploiting its data
• Governance processes for privacy, safety, security - proportionality