Tomorrow, and tomorrow, and tomorrow


a centre of expertise in data curation and preservation

“Tomorrow, and tomorrow, and tomorrow”: the players on the curation stage

Chris Rusbridge

Presentation at OCLC

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License, excluding content that is the property of others. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/; or (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

Funded by:


• "To-morrow, and to-morrow, and to-morrow,

• Creeps in this petty pace from day to day,

• To the last syllable of recorded time;

• And all our yesterdays have lighted fools

• The way to dusty death.

• Out, out, brief candle!

• Life's but a walking shadow; a poor player,

• That struts and frets his hour upon the stage,

• And then is heard no more: it is a tale

• Told by an idiot, full of sound and fury,

• Signifying nothing."

• Shakespeare: Macbeth

OCLC October 2006


• Dunsinane Hill


• Photo by Fabrice


Contents

• Curation and the Digital Curation Centre

• Science and Data Citations

• The “poor players” of data curation

• Sustainability of curated data

• Macbeth again…


Curation

• Data increasingly important as evidence

• Experimental verifiability (the basis of science)

• Unrepeatable observations & experiments (particularly environmental in broadest sense)

• Legal, compliance & transactions

• Cultural resources

• “Preservation” view vs “Publishing” view


Lynch remarks

• Closing the Curation Conference

• 3 views of digital curation

• Finite process, handover to preservation

• Whole life process, evolving object(s)

• Collection as a living thing


Digital curation?

[Diagram built up over three slides. Labels: “In use now (and the future)” and “For later use”; “Dynamic” and “Static”; “Long-term”; “Digital preservation”, “Digital curation”, and finally “Digital curation & preservation”.]

“maintaining and adding value to a trusted body of digital information for current and future use”


Mission

“The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”


Organisation to Engage & Collaborate

[Organisation diagram. Labels: communities of practice: users; curation organisations, eg DPC; Associates Network; community support & outreach; service definition & delivery; management & admin support; research; research collaborators; testbeds & tools; development co-ordination; Industry; standards bodies.]


Organisation to Engage & Collaborate: Leads

[The same organisation diagram, annotated with the lead institutions: Bath, Glasgow, Edinburgh (twice), CCLRC.]


Associated work

• DCC LOCKSS Technical Support Service (Lots of Copies Keep Stuff Safe)

• DCC SCARP Project

• Disciplinary approaches to sharing, curation, reuse and preservation

• EU projects associated

• CASPAR

• Digital Preservation Europe

• PLANETS


Phase 2

• Externally-moderated, reflective self-evaluation completed

• Phase 2 proposal (2007/10) to JISC

• Accepted: focus on science data, reduced scale

• EPSRC-funded Research continues until 2007/8


2nd International Digital Curation Conference

• Research & invited presentations

• Glasgow, 21/22 November, 2006

• Please register at: http://www.dcc.ac.uk/events/dcc-2006/


Data resource stages

• Curated data is created…

• Observations? Fixed!

• Or Acquired…

• Data brought/bought from outside

• Ingest

• Development

• Derived, refined, combined, processed data

• Potentially many stages


[Images: SDSS (Visual) and TWOMASS (Infrared)]

Slides from Rajendra Bose

New discovery…

• National Virtual Observatory

• Johns Hopkins press release: “Scientists working to create the NVO, an online portal for astronomical research unifying dozens of large astronomical databases, confirmed discovery of [a] new brown dwarf recently. The star emerged from a computerized search of information on millions of astronomical objects in two separate astronomical databases. Thanks to an NVO prototype, that search, formerly an endeavor requiring weeks or months of human attention, took approximately two minutes.”
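A search across two astronomical databases of the kind the press release describes can be illustrated with a positional cross-match. The sketch below is not the NVO's method: the catalogue rows, the 2-arcsecond matching radius, the function names, and the selection rule (infrared objects with no visual counterpart) are all assumptions chosen for simplicity.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, which behaves well for small separations.
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))

def unmatched(infrared, visual, radius_arcsec=2.0):
    """Return ids of infrared objects with no visual object within the radius."""
    radius_deg = radius_arcsec / 3600.0
    result = []
    for obj in infrared:
        has_match = any(
            angular_separation_deg(obj["ra"], obj["dec"], v["ra"], v["dec"]) <= radius_deg
            for v in visual
        )
        if not has_match:
            result.append(obj["id"])
    return result

# Hypothetical toy catalogues; real catalogues hold millions of rows.
infrared_catalogue = [
    {"id": "IR-1", "ra": 10.0000, "dec": 20.0000},
    {"id": "IR-2", "ra": 10.5000, "dec": 20.5000},
]
visual_catalogue = [
    {"id": "VIS-1", "ra": 10.0001, "dec": 20.0001},
]

candidates = unmatched(infrared_catalogue, visual_catalogue)
```

This brute-force comparison scales with the product of the catalogue sizes; a real service would need spatial indexing to search millions of objects in minutes.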


Context

• Data meaningless without context

• Linkage

• Metadata of many kinds

• Workflow!

• Provenance

• Computational lineage

• Authenticity
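Provenance and computational lineage can be made concrete with a small sketch. It assumes each data product records the process and the inputs that produced it; the class, the function, and the product names are hypothetical and not taken from any system mentioned here.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    process: str = "observation"                 # how this product was made
    inputs: list = field(default_factory=list)   # products it was derived from

def lineage(product):
    """Return the names of every product this one derives from, nearest first."""
    seen = []
    queue = list(product.inputs)
    while queue:
        current = queue.pop(0)
        if current.name not in seen:
            seen.append(current.name)
            queue.extend(current.inputs)
    return seen

# Hypothetical chain: one raw observation feeding two derived products,
# which are then combined into a final estimate.
raw = DataProduct("raw_scene")
composite = DataProduct("weekly_composite", "compositing", [raw])
temperature = DataProduct("surface_temperature", "calibration", [raw])
estimate = DataProduct("productivity_estimate", "model_run", [composite, temperature])

ancestors = lineage(estimate)
```

Without a record of this kind, the final estimate cannot be traced back to the observation it depends on.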


[Workflow diagram. Organisations: NASA; University research group 1; University research group 2; research group 3; local decision-making body. Data and processing steps: HRPT, PAR, Csat, Csat subscene, E0, SST, 8-day composite and subscene (twice), Ctot calc, Pbopt calc, Zeu calc, PPeu calc.]

Slide from Rajendra Bose


Access and re-use

• Ethics and rights control access

• Weak in expressing this long-term

• Collaboration tools

• Annotation, discussion, review

• Re-use leading to change and development

• “Publication”

• Not just in “print”

• Underlying data should be “published”, too

• Citation…


CLADDIER citation investigation

“My last example was an MST data set held at the BADC, and I was suggesting something like this (for a citation):

<Citation>
  <Author>Natural Environment Research Council</Author>
  <Title>Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth</Title>
  <Medium>Internet</Medium>
  <Publisher>British Atmospheric Data Centre (BADC)</Publisher>
  <PublicationDate status="ongoing">1990</PublicationDate>
  <Identifier>badc.nerc.ac.uk/data/mst/v3/upd15032006</Identifier>
  <Feature>
    <FeatureType>http://featuretype.registry/verticalProfile</FeatureType>
    <LocalID>200409031205</LocalID>
  </Feature>
  <AccessDate>Sep 21 2006</AccessDate>
  <AvailableAt><url>http://badc.nerc.ac.uk/data/mst/v3/</url></AvailableAt>
</Citation>

(Made up tags!)”


• Bryan Lawrence Weblog
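A citation record like the one quoted above can be built and read back programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` and the same made-up tags and values from the quoted example; the `add` helper is hypothetical, and none of this is an established citation schema.

```python
import xml.etree.ElementTree as ET

def add(parent, tag, text=None, **attrs):
    """Append a child element with optional text and attributes."""
    element = ET.SubElement(parent, tag, attrs)
    element.text = text
    return element

citation = ET.Element("Citation")
add(citation, "Author", "Natural Environment Research Council")
add(citation, "Title", "Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth")
add(citation, "Medium", "Internet")
add(citation, "Publisher", "British Atmospheric Data Centre (BADC)")
add(citation, "PublicationDate", "1990", status="ongoing")
add(citation, "Identifier", "badc.nerc.ac.uk/data/mst/v3/upd15032006")
feature = add(citation, "Feature")
add(feature, "FeatureType", "http://featuretype.registry/verticalProfile")
add(feature, "LocalID", "200409031205")
add(citation, "AccessDate", "Sep 21 2006")
available = add(citation, "AvailableAt")
add(available, "url", "http://badc.nerc.ac.uk/data/mst/v3/")

# Serialise, then parse again to show the record survives a round trip.
xml_text = ET.tostring(citation, encoding="unicode")
parsed = ET.fromstring(xml_text)
local_id = parsed.findtext("Feature/LocalID")
status = parsed.find("PublicationDate").get("status")
```

The `Feature` and `LocalID` elements are what let the citation point at a specific part of an ongoing dataset rather than the dataset as a whole.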


CLADDIER 2: “Version of record”

• Role of Publisher: add value

• provision of catalogue metadata

• some commitment to maintenance of the resource at the AvailableAt url

• some commitment to the resource being conformant to the description of the Feature

• some commitment to the maintenance of the mapping between the identifier [LocalID] and the resource.

• Bryan Lawrence Weblog

CLADDIER 3: persistence

• Wayback Machine

• Only snapshots (eg only 2004 version of Bryan’s home page!)

• WebCite

• allows the creator of content to submit URLs for [archiving], thus ensuring that when one writes an academic document, the material will be archived, and the citation will be persistent

• But no real help for data…

• “… only allow [data citation] when we believe in the persistence of the organisation making the data available…”

• Bryan Lawrence Weblog

Citation

• Needs a stable resource to cite…

OWL Web Ontology Language Reference

W3C Proposed Recommendation 15 December 2003

This version: http://www.w3.org/TR/2003/PR-owl-ref-20031215/

Latest version: http://www.w3.org/TR/owl-ref/

Previous version: http://www.w3.org/TR/2003/CR-owl-ref-2003081

• (FRBR works & expressions?)


Citation…

• The date alone (as in common web citation approaches) is not enough!

[6] The CIA World Factbook. www.cia.gov/cia/publications/factbook/. Retrieved on 8 Jan 2006.

• Cited object likely to have changed…

• Citation should link to the cited object as it was!
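One way to see why a retrieval date is not enough is to store a fixity digest of the content alongside the citation. The sketch below is an illustration under stated assumptions, not a recommended citation format: the `SnapshotCitation` record, both functions, and the sample content bytes are hypothetical. A digest can show that the cited object has changed, but it cannot bring back the object as it was; that still needs an archived copy.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class SnapshotCitation:
    title: str
    url: str
    retrieved: str
    sha256: str  # digest of the content as it was when cited

def cite(title, url, retrieved, content: bytes):
    """Create a citation that records a digest of the cited content."""
    return SnapshotCitation(title, url, retrieved, hashlib.sha256(content).hexdigest())

def still_matches(citation, current_content: bytes):
    """Check whether the content served now is what was originally cited."""
    return hashlib.sha256(current_content).hexdigest() == citation.sha256

# Hypothetical page content at the time of citation and at a later date.
original = b"population: 100"
later = b"population: 105"

citation = cite("The CIA World Factbook",
                "www.cia.gov/cia/publications/factbook/",
                "8 Jan 2006",
                original)
unchanged = still_matches(citation, original)
changed = not still_matches(citation, later)
```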


Citation needs…

• An efficient way to reference and access “archived” past states of a changing dataset (work in progress, Buneman et al)

• Not important for original observations

• Don’t mess with those data

• Less important for incremental datasets

• Later stuff should not invalidate earlier

• Very important for revisable datasets

• Eg Genomics… datasets that result from the combined work of curators, or contain opinions or facts likely to change

• Eg Mapping… OS maps represent a huge database that changes on a daily basis


XMLArch: System Architecture

[Architecture diagram. Components: Relational Database; Data Extractor; XML Snapshot at time t; Pre-processor; Version Merger; XML Archive at time t - 1; XML Archive at time t.]

• Carwyn Edwards
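The idea of merging each new snapshot into a single archive, so that any past state can still be referenced, can be sketched as follows. This is not the XMLArch implementation: it assumes flat keyed records instead of XML, consecutive integer version numbers, and hypothetical function and record names.

```python
def merge_snapshot(archive, snapshot, version):
    """Merge a {key: value} snapshot taken at `version` into the archive.

    The archive maps each key to a list of [value, first_version, last_version]
    entries. An entry is extended when the value is unchanged since the
    previous version; otherwise a new entry is appended.
    """
    for key, value in snapshot.items():
        history = archive.setdefault(key, [])
        if history and history[-1][0] == value and history[-1][2] == version - 1:
            history[-1][2] = version
        else:
            history.append([value, version, version])
    return archive

def state_at(archive, version):
    """Reconstruct the snapshot as it was at `version`."""
    return {
        key: value
        for key, history in archive.items()
        for value, first, last in history
        if first <= version <= last
    }

# Hypothetical revisable dataset: one record is revised, then withdrawn.
archive = {}
merge_snapshot(archive, {"record1": "A", "record2": "B"}, 1)
merge_snapshot(archive, {"record1": "A", "record2": "C"}, 2)
merge_snapshot(archive, {"record1": "A"}, 3)

first_state = state_at(archive, 1)
latest_state = state_at(archive, 3)
```

Unchanged values are stored once with a version range, so the archive grows with the amount of change, not with the number of snapshots, while a citation to version 1 can still be resolved.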


Who are the curation players?


Curation: Individual

• “Small science” 2-3 times more data than “Big science”, but much more at risk

• PhD student? RA? PI? Administrator? IT support?

• Data potentially on local hard drives, or at best shared network drives

• May be inadequately protected

• Liable to policy-led deletion on resignation

• Individual “knows” too much

• Documentation/metadata unlikely to be adequate

• Tomorrow: gone!


Department: eCrystals

• Specialist department archive (& national service)

• Workflow recording of lab parameters (R4L)

• Public & private elements

• Trying to build eCrystals federation (eBank 3)

• But… ReciprocalNet? French COD efforts? Fragmented discipline!

• Tomorrow: likely to continue


Institution: Cambridge Chemistry

• 175,000 small molecule structures in CML

• Alongside Archaeology, Manuscripts, Learning Materials, etc

• No library curation skills; dependent on research group enthusiast

• Collection isolated from other Chemistry

• Tomorrow: assured…


Community: CDL

• Shared effort from group of institutions

• Comparison OhioLink?

• Document tradition, not data

• Passive role re collections

• Rely on departmental & domain expertise

• Tomorrow: assured…


Community: SDSC?

• Data specialists

• Multiple disciplines

• Distinct from domains; curation dependent on external expertise

• Research ethos

• Tomorrow: dependent on grant/contract income & research priorities


Community: LOCKSS?

• Self-selected group of collectors: closest to genuine open activity (despite Alliance)?

• Traditionally libraries collecting eJournals

• Model respects IPR

• No domain expertise; rely on origins

• Data limitations…

• Tomorrow: potentially very persistent (low cost, high reliability, attack resistance, distributed)


Discipline: Archaeology

• Staffed by archaeologist curators

• Understand special legal issues

• Strong relationship with community & peers

• Internationally still fragmented?

• Tomorrow: dependent on research council grants + deposit funding


Discipline: Astronomy

• Part of major international effort

• Expensive shared facilities, global reach

• Well integrated into community

• Enable new science

• Tomorrow: assured by community (another large facility)


Discipline: Atmosphere

• Strong believer in need for domain scientists as curators

• Significant participant in “community proxy” agenda-setting activities

• Internationally fragmented resources

• Tomorrow: mostly dependent on grant funding (but strong commitment)


Discipline: Pharmacology

• International Scientific Union

• Attempting to build credit for data contributions

• DB ownership rotates

• Tomorrow: extremely limited funding


Discipline: Social Sciences

• Mature!

• Staffed by Social Science curators

• Alert to opportunities

• Able to appraise material offered

• Strong relationship to discipline

• Tomorrow: assured through broad mix of funding streams


Publisher: Crystallography

• Publisher and Scientific Union

• Created key domain crystallographic standard (CIF)

• Strong motivator for deposit of structure data

• Consistent quality checks

• DOIs used for structure data

• Tomorrow: publishing business model

• Slide from IUCr

National bodies: British Library

• Serious and robust approach

• Legal deposit powers & responsibilities as driver

• Oriented primarily towards “cultural heritage” (broadly interpreted)

• Little data, no science domain experience

• Tomorrow: strong future commitment


National bodies: TNA/NDAD

• Specialist archive for government datasets

• Understand government regulations, dynamics & requirements

• Subject generalists; disconnected from associated science

• Technology specialists (understand databases)

• Tomorrow: likely to pass eventually to The National Archives


National bodies: NOAA (etc)

• Government body making serious data available

• Domain scientists curate data

• Operates in current political context (!)

• Tomorrow: reasonably assured but some unfunded mandates?


3rd parties: OCLC?

• Should this be community?

• Demand driven

• No domain science expertise: rely on origins

• Tomorrow: business case


3rd parties: Portico

• Specific area: eJournals

• Depends on publisher agreements

• No data or domain science expertise

• Tomorrow: commitment from Mellon + publishers + subscriptions, good funding mix


3rd Parties: Iron Mountain

• Records management IS a curation problem

• Organisations like this very likely to branch out

• No domain science expertise

• Tomorrow: business case, viability, stock market…


Institutions & the network

• Institutions have some fundamental sustainability

• Disciplines live in the network; sustainability is an issue

• Can we get the best of both?


Intersections…

[Grid diagram: Disciplines 1, 2, 3, etc set against Institutions 1, 2, 3, etc, with X marks at some of the intersections.]


Who are the curation players again?


Project StORe findings

• Discipline commonality from survey (Miller, UKDA, 2006):

• 2-way links between data & publication useful

• Barriers to actual deposit of data/outputs

• Sharing data important, likely between colleagues

• Perceived inconsistency across repositories

• Most common searching: Google type

• Researchers favour self-reliance rather than library support

• Recognise need for common minimum metadata

• Aim for pilot linking middleware demonstrator

• “Creating small scale ‘silos’ of information with institutional repositories is not … a compelling information management strategy in the ‘Google age’” (Heery & Anderson for JISC, 2005)


Sustainability: tomorrow is the emerging worry

• Sustainability work package in DCC (new grant!)

• JISC/NDIIPP meeting addressed it

• AHRC report draft soon

• Research Information Network report draft

• JISC study on sustainable IT systems for HE

• Recent ARL/NSF workshop, NSF strategy


Sustainability of what?

• Repository as an organisation

• Repository as a service

• Repository as a system

• Repositories as a network (federation?)

• Collections and objects supported by repositories

• Commit to collection: contract the manager!


Social factors

• Commitment essential… much more than anything else (cf persistent identifiers)

• Funder requirements express social determination

• Policy & grant application forms, selection criteria

• Monitoring essential

• Legal, ethical, IPR impacts all significant

• Public good questions

• Academic credit (citations?)

• Free-loaders (embargos?)

• Disciplines are different!

• Workforce skills: researcher, data librarian/scientist


Sustainability a function of...

• Commitment

• Goals

• Value and cost

• Business model

• Time

• Environment

• Domain knowledge and information

• Dimensions (how much stuff)

• Technical approaches

• Usage


So, tomorrow…

• Digital data repositories already sustained > 30 years

• How?

• Vision, leadership, commitment

• Libraries, archives, museums sustained 100s of years

• How?

• Aggregate value proposition

• Perception now under threat!

• Collectively we need to identify the next steps toward digital data sustainability, for tomorrow, and tomorrow, and tomorrow!


Macbeth again…

• "To-morrow, and to-morrow, and to-morrow,

• Creeps in this petty pace from day to day,

• To the last syllable of recorded time;

• …it is a tale

• Told by an idiot, full of sound and fury,

• Signifying nothing."


Mission (impossible?)

• To that last syllable of recorded time

• Keep our tales forever full of significance!

Thank you
