Tomorrow, and tomorrow, and tomorrow


a centre of expertise in data curation and preservation

"Tomorrow, and tomorrow, and tomorrow": the players on the curation stage

Chris Rusbridge

Presentation at OCLC


• "To-morrow, and to-morrow, and to-morrow,

• Creeps in this petty pace from day to day,

• To the last syllable of recorded time;

• And all our yesterdays have lighted fools

• The way to dusty death.

• Out, out, brief candle!

• Life's but a walking shadow; a poor player,

• That struts and frets his hour upon the stage,

• And then is heard no more: it is a tale

• Told by an idiot, full of sound and fury,

• Signifying nothing."

• Shakespeare: Macbeth


• Dunsinane Hill



• Curation and the Digital Curation Centre

• Science and Data Citations

• The “poor players” of data curation

• Sustainability of curated data

• Macbeth again…



• Data increasingly important as evidence

• Experimental verifiability (the basis of science)

• Unrepeatable observations & experiments (particularly environmental in broadest sense)

• Legal, compliance & transactions

• Cultural resources

• “Preservation” view vs “Publishing” view


Lynch remarks

• Closing the Curation Conference

• 3 views of digital curation

• Finite process, handover to preservation

• Whole life process, evolving object(s)

• Collection as a living thing


Digital curation?

For later use: Digital preservation


Digital curation?

In use now (and the future): Digital curation

For later use: Digital preservation



Digital curation

In use now (and the future) and for later use: Digital curation & preservation

“maintaining and adding value to a trusted body of digital information for current and future use”



“The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”


Organisation to Engage & Collaborate

• communities of practice: users

• curation organisations, eg DPC

• Network

• community support & outreach

• service definition & delivery

• management & admin support

• research

• research collaborators

• testbeds & tools

• development co-ordination

• Industry

• standards bodies


Organisation to Engage & Collaborate: Leads

[Diagram: the organisation chart annotated with lead sites. Recoverable labels: communities of practice: users; curation organisations, eg DPC; Glasgow; Edinburgh; Edinburgh; testbeds & tools]




Associated work

• DCC LOCKSS Technical Support Service (Lots of Copies Keep Stuff Safe)

• DCC SCARP Project

• Disciplinary approaches to sharing, curation, reuse and preservation

• EU projects associated


• Digital Preservation Europe



Phase 2

• Externally moderated, reflective self-evaluation completed

• Phase 2 proposal (2007/10) to JISC

• Accepted: focus on science data, reduced scale

• EPSRC-funded Research continues until



2nd International Digital Curation


• Research & invited presentations

• Glasgow, 21/22 November, 2006

• Please register at:


Data resource stages

• Curated data is created…

• Observations? Fixed!

• Or Acquired…

• Data brought/bought from outside

• Ingest

• Development

• Derived, refined, combined, processed data

• Potentially many stages


SDSS (Visual)

2MASS (Infrared)

Slide from Rajendra Bose


Slide from Rajendra Bose


New discovery…

• National Virtual Observatory

• Johns Hopkins press release: “Scientists working to create the NVO, an online portal for astronomical research unifying dozens of large astronomical databases, confirmed discovery of [a] new brown dwarf recently. The star emerged from a computerized search of information on millions of astronomical objects in two separate astronomical databases. Thanks to an NVO prototype, that search, formerly an endeavor requiring weeks or months of human attention, took approximately two minutes.”



• Data meaningless without context

• Linkage

• Metadata of many kinds

• Workflow!

• Provenance

• Computational lineage

• Authenticity




[Figure: data-processing workflow linking University research group1, research group2, research group3 and a local decision-making body. Steps shown: Csat subscene; 8-day composite and subscene; Ctot calc; Pbopt calc; Zeu calc; PPeu calc]


Access and re-use

• Ethics and rights control access

• Weak in expressing this long-term

• Collaboration tools

• Annotation, discussion, review

• Re-use leading to change and development

• “Publication”

• Not just in “print”

• Underlying data should be “published”, too

• Citation…


CLADDIER citation investigation

“My last example was an MST data set held at the BADC, and I was suggesting something like this (for a citation):

<Citation>
  <Author>Natural Environment Research Council</Author>
  <Title>Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth</Title>
  <Medium>Internet</Medium>
  <Publisher>British Atmospheric Data Centre (BADC)</Publisher>
  <PublicationDate status="ongoing">1990</PublicationDate>
  <Identifier></Identifier>
  <Feature>
    <FeatureType>http://featuretype.registry/verticalProfile</FeatureType>
    <LocalID>200409031205</LocalID>
  </Feature>
  <AccessDate>Sep 21 2006</AccessDate>
  <AvailableAt><url></url></AvailableAt>
</Citation>

(Made up tags!)”

Bryan Lawrence Weblog
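A record like the quoted citation could be assembled programmatically, roughly as sketched below. This is only an illustration: the tag names are the made-up ones from the quote, not a published schema, and the function name build_citation and its parameters are hypothetical. Identifier and the AvailableAt url are left empty, as they are in the quoted example.

```python
import xml.etree.ElementTree as ET


def build_citation(author, title, publisher, start_year,
                   feature_type, local_id, access_date):
    """Assemble a data citation using the made-up tags from the quoted example.

    Hypothetical helper for illustration only; not part of CLADDIER.
    """
    citation = ET.Element("Citation")
    ET.SubElement(citation, "Author").text = author
    ET.SubElement(citation, "Title").text = title
    ET.SubElement(citation, "Medium").text = "Internet"
    ET.SubElement(citation, "Publisher").text = publisher

    # status="ongoing" marks a dataset that is still being added to.
    pub_date = ET.SubElement(citation, "PublicationDate", status="ongoing")
    pub_date.text = str(start_year)

    # Left empty, as in the quoted example.
    ET.SubElement(citation, "Identifier")

    # The Feature narrows the citation to one part of the dataset.
    feature = ET.SubElement(citation, "Feature")
    ET.SubElement(feature, "FeatureType").text = feature_type
    ET.SubElement(feature, "LocalID").text = local_id

    ET.SubElement(citation, "AccessDate").text = access_date

    # The url is left empty, as in the quoted example.
    available_at = ET.SubElement(citation, "AvailableAt")
    ET.SubElement(available_at, "url")
    return citation


citation = build_citation(
    author="Natural Environment Research Council",
    title="Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth",
    publisher="British Atmospheric Data Centre (BADC)",
    start_year=1990,
    feature_type="http://featuretype.registry/verticalProfile",
    local_id="200409031205",
    access_date="Sep 21 2006",
)
xml_text = ET.tostring(citation, encoding="unicode")
print(xml_text)
```

Keeping the Feature and LocalID separate from the dataset-level fields means the same dataset description can be reused for citations of different features.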


CLADDIER 2: “Version of record”

• Role of Publisher: add value

• provision of catalogue metadata

• some commitment to maintenance of the resource at the AvailableAt url

• some commitment to the resource being conformant to the description of the Feature

• some commitment to the maintenance of the mapping between the identifier [LocalID] and the resource.

Bryan Lawrence Weblog


CLADDIER 3: persistence

• Wayback Machine

• Only snapshots (eg only 2004 version of Bryan’s home page!)

• WebCite

• allows the creator of content to submit URLs for [archiving], thus ensuring that when one writes an academic document, the material will be archived, and the citation will be persistent

• But no real help for data…

• “… only allow [data citation] when we believe in the persistence of the organisation making the data available…”

Bryan Lawrence Weblog



• Needs a stable resource to cite…

OWL Web Ontology Language


W3C Proposed Recommendation 15 December 2003

This version:

Latest version:

Previous version:

• (FRBR works & expressions?)



• The date alone (as in common web citation approaches) is not enough!

[6] The CIA World Factbook.


Retrieved on 8 Jan 2006.

• Cited object likely to have changed…

• Citation should link to the cited object as it was!


Citation needs…

• An efficient way to reference and access “archived” past states of a changing dataset (work in progress, Buneman et al)

• Not important for original observations

• Don’t mess with those data

• Less important for incremental datasets

• Later stuff should not invalidate earlier

• Very important for revisable datasets

• Eg Genomics… datasets that result from the combined work of curators, or contain opinions or facts likely to change

• Eg Mapping… OS maps represent a huge database that changes on a daily basis

XMLArch: System Architecture



XML Archive at time t - 1

Data Extractor




XML Snapshot at time t

XML Archive at time t

• Carwyn Edwards
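The idea of merging each new snapshot into an archive, so that any past state can be cited and retrieved, might be sketched as follows. This is a toy illustration under stated assumptions, not the XMLArch implementation. It uses a flat key/value dataset rather than XML, and the class VersionedArchive and its methods merge_snapshot and snapshot_at are hypothetical names.

```python
from typing import Any, Dict, List


class VersionedArchive:
    """Toy archive that stores each value once, with the versions it was valid for.

    Each key maps to a list of [first_version, last_version_or_None, value];
    last_version is None while the value is still current.
    """

    def __init__(self) -> None:
        self.version = 0
        self._history: Dict[str, List[List[Any]]] = {}

    def merge_snapshot(self, snapshot: Dict[str, Any]) -> int:
        """Merge the snapshot taken at time t into the archive from time t - 1."""
        self.version += 1
        t = self.version

        # Close the interval of every current value that was changed or removed.
        for key, intervals in self._history.items():
            current = intervals[-1]
            if current[1] is None and (
                key not in snapshot or snapshot[key] != current[2]
            ):
                current[1] = t - 1

        # Open a new interval for every value that is new or changed.
        for key, value in snapshot.items():
            intervals = self._history.setdefault(key, [])
            if not intervals or intervals[-1][1] is not None:
                intervals.append([t, None, value])
        return t

    def snapshot_at(self, t: int) -> Dict[str, Any]:
        """Reconstruct the dataset as it was at version t."""
        result: Dict[str, Any] = {}
        for key, intervals in self._history.items():
            for first, last, value in intervals:
                if first <= t and (last is None or t <= last):
                    result[key] = value
        return result


archive = VersionedArchive()
v1 = archive.merge_snapshot({"gene_a": "kinase", "gene_b": "unknown"})
v2 = archive.merge_snapshot({"gene_a": "kinase", "gene_b": "transporter", "gene_c": "unknown"})
v3 = archive.merge_snapshot({"gene_a": "kinase"})

# A citation that records v1 can still be resolved after later revisions.
print(archive.snapshot_at(v1))
```

Unchanged values are stored only once however many snapshots are merged, which is what makes keeping every past state affordable for a revisable dataset.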


Who are the curation players?


Curation: Individual

• “Small science” 2-3 times more data than “Big science”, but much more at risk

• PhD student? RA? PI? Administrator? IT support?

• Data potentially on local hard drives, or at best shared network drives

• May be inadequately protected

• Liable to policy-led deletion on resignation

• Individual “knows” too much

• Documentation/metadata unlikely to be adequate

• Tomorrow: gone!


Department: eCrystals

• Specialist department archive (& national service)

• Workflow recording of lab parameters (R4L)

• Public & private elements

• Trying to build eCrystals federation (eBank 3)

• But… ReciprocalNet?

French COD efforts?

Fragmented discipline!

• Tomorrow: likely to continue


Institution: Cambridge Chemistry

• 175,000 small molecule structures in CML

• Alongside Archaeology, Manuscripts, Learning Materials, etc

• No library curation skills; dependent on research group enthusiast

• Collection isolated from other Chemistry

• Tomorrow: assured…


Community: CDL

• Shared effort from group of institutions

• Comparison OhioLink?

• Document tradition, not data

• Passive role re collections

• Rely on departmental & domain expertise

• Tomorrow: assured…


Community: SDSC?

• Data specialists

• Multiple disciplines

• Distinct from domains; curation dependent on external expertise

• Research ethos

• Tomorrow: dependent on grant/contract income & research priorities


Community: LOCKSS?

• Self-selected group of collectors: closest to genuine open activity (despite


• Traditionally libraries collecting eJournals

• Model respects IPR

• No domain expertise; rely on origins

• Data limitations…

• Tomorrow: potentially very persistent (low cost, high reliability, attack resistance, distributed)


Discipline: Archaeology

• Staffed by archaeologist curators

• Understand special legal issues

• Strong relationship with community & peers

• Internationally still fragmented?

• Tomorrow: dependent on research council grants + deposit funding


Discipline: Astronomy

• Part of major international effort

• Expensive shared facilities, global reach

• Well integrated into community

• Enable new science

• Tomorrow: assured by community (another large facility)


Discipline: Atmosphere

• Strong believer in need for domain scientists as curators

• Significant participant in “community proxy” agenda-setting activities

• Internationally fragmented resources

• Tomorrow: mostly dependent on grant funding (but strong commitment)


Discipline: Pharmacology

• International Scientific


• Attempting to build credit for data contributions

• DB ownership rotates

• Tomorrow: extremely limited funding


Discipline: Social Sciences

• Mature!

• Staffed by Social Science curators

• Alert to opportunities

• Able to appraise material offered

• Strong relationship to discipline

• Tomorrow: assured through broad mix of funding streams


Publisher: Crystallography

• Publisher and Scientific


• Created key domain crystallographic standard


• Strong motivator for deposit of structure data

• Consistent quality checks

• DOIs used for structure data

• Tomorrow: publishing business model


National bodies: British Library

• Serious and robust approach

• Legal deposit powers & responsibilities as driver

• Oriented primarily towards “cultural heritage” (broadly interpreted)

• Little data, no science domain experience

• Tomorrow: strong future commitment


National bodies: TNA/NDAD

• Specialist archive for government datasets

• Understand government regulations, dynamics & requirements

• Subject generalists; disconnected from associated science

• Technology specialists (understand databases)

• Tomorrow: likely to pass eventually to The National



National bodies: NOAA (etc)

• Government body making serious data available

• Domain scientists curate data

• Operates in current political context (!)

• Tomorrow: reasonably assured but some unfunded mandates?


3rd parties: OCLC?

• Should this be community?

• Demand driven

• No domain science expertise: rely on origins

• Tomorrow: business case


3rd parties: Portico

• Specific area: eJournals

• Depends on publisher agreements

• No data or domain science expertise

• Tomorrow: commitment from Mellon + publishers + subscriptions, good funding mix


3rd Parties: Iron Mountain

• Records management IS a curation problem

• Organisations like this very likely to branch out

• No domain science expertise

• Tomorrow: business case, viability, stock market…


Institutions & the network

• Institutions have some fundamental sustainability

• Disciplines live in the network; sustainability is an issue

• Can we get the best of both?


















Who are the curation players again?


Project StORe findings

• Discipline commonality from survey (Miller, UKDA, 2006):

• 2-way links between data & publication useful

• Barriers to actual deposit of data/outputs

• Sharing data important, likely between colleagues

• Perceived inconsistency across repositories

• Most common searching: Google type

• Researchers favour self-reliance rather than library support

• Recognise need for common minimum metadata

• Aim for pilot linking middleware demonstrator

• “Creating small scale ‘silos’ of information with institutional repositories is not … a compelling information management strategy in the ‘Google age’” (Heery & Anderson for JISC, 2005)


Sustainability: tomorrow is the emerging worry

• Sustainability work package in DCC (new grant!)

• JISC/NDIIPP meeting addressed it

• AHRC report draft soon

• Research Information Network report draft

• JISC study on sustainable IT systems for HE

• Recent ARL/NSF workshop, NSF strategy


Sustainability of what?

• Repository as an organisation

• Repository as a service

• Repository as a system

• Repositories as a network (federation?)

• Collections and objects supported by repositories

• Commit to collection: contract the manager!


Social factors

• Commitment essential… much more than anything else (cf persistent identifiers)

• Funder requirements express social determination

• Policy & grant application forms, selection criteria

• Monitoring essential

• Legal, ethical, IPR impacts all significant

• Public good questions

• Academic credit (citations?)

• Free-loaders (embargos?)

• Disciplines are different!

• Workforce skills: researcher, data librarian/scientist


Sustainability a function of...

• Commitment

• Goals

• Value and cost

• Business model

• Time

• Environment

• Domain knowledge and information

• Dimensions (how much stuff)

• Technical approaches

• Usage


So, tomorrow…

• Digital data repositories already sustained > 30 years

• How?

• Vision, leadership, commitment

• Libraries, archives, museums sustained 100s of years

• How?

• Aggregate value proposition

• Perception now under threat!

• Collectively we need to identify the next steps toward digital data sustainability, for tomorrow, and tomorrow, and tomorrow!


Mission (impossible?)

• To that last syllable of recorded time

• Keep our tales forever full of significance!

Thank you

