a centre of expertise in data curation and preservation
Chris Rusbridge
Presentation at OCLC
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License, excluding content property of others. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/; or, (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
• "To-morrow, and to-morrow, and to-morrow,
• Creeps in this petty pace from day to day,
• To the last syllable of recorded time;
• And all our yesterdays have lighted fools
• The way to dusty death.
• Out, out, brief candle!
• Life's but a walking shadow; a poor player,
• That struts and frets his hour upon the stage,
• And then is heard no more: it is a tale
• Told by an idiot, full of sound and fury,
• Signifying nothing."
• Shakespeare: Macbeth
OCLC October 2006
• Dunsinane Hill
• Photo by Fabrice
• Curation and the Digital Curation Centre
• Science and Data Citations
• The “poor players” of data curation
• Sustainability of curated data
• Macbeth again…
• Data increasingly important as evidence
• Experimental verifiability (the basis of science)
• Unrepeatable observations & experiments (particularly environmental, in the broadest sense)
• Legal, compliance & transactions
• Cultural resources
• “Preservation” view vs “Publishing” view
• Closing the Curation Conference
• 3 views of digital curation
• Finite process, handover to preservation
• Whole life process, evolving object(s)
• Collection as a living thing
[Diagram, built up over three slides. Labels: "In use now (and the future)" and "For later use"; "Dynamic" and "Static"; "Long-term". The first slide shows only Digital preservation (static, for later use); the second adds Digital curation beside it; the third joins them as "Digital curation & preservation".]
“maintaining and adding value to a trusted body of digital information for current and future use”
“The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”
[Diagram: DCC structure. Activities: community support & outreach; service definition & delivery; management & admin support; research; development co-ordination; testbeds & tools. External links: communities of practice (users); curation organisations (eg DPC); Associates Network; research collaborators; industry; standards bodies.]
[Diagram: the same DCC structure labelled with the partner sites: Bath, Glasgow, Edinburgh and CCLRC.]
• DCC LOCKSS Technical Support Service (Lots of Copies Keep Stuff Safe)
• DCC SCARP Project
• Disciplinary approaches to sharing, curation, reuse and preservation
• EU projects associated
• CASPAR
• Digital Preservation Europe
• PLANETS
• Externally-moderated, reflective self-evaluation completed
• Phase 2 proposal (2007/10) to JISC
• Accepted: focus on science data, reduced scale
• EPSRC-funded Research continues until 2007/8
• Research & invited presentations
• Glasgow, 21/22 November, 2006
• Please register at: http://www.dcc.ac.uk/events/dcc-2006/
• Curated data is created…
• Observations? Fixed!
• Or Acquired…
• Data brought/bought from outside
• Ingest
• Development
• Derived, refined, combined, processed data
• Potentially many stages
SDSS (Visual)
2MASS (Infrared)
Slide from Rajendra Bose
Slide from Rajendra Bose
• National Virtual Observatory
• Johns Hopkins press release: “Scientists working to create the NVO, an online portal for astronomical research unifying dozens of large astronomical databases, confirmed discovery of [a] new brown dwarf recently. The star emerged from a computerized search of information on millions of astronomical objects in two separate astronomical databases. Thanks to an NVO prototype, that search, formerly an endeavor requiring weeks or months of human attention, took approximately two minutes.”
• Data meaningless without context
• Linkage
• Metadata of many kinds
• Workflow!
• Provenance
• Computational lineage
• Authenticity
[Diagram: data products and the parties that handle them. Sources: NASA, HRPT. Parties: university research groups 1, 2 and 3; a local decision-making body. Products: PAR; E0; SST; Csat (subscene; 8-day composite and subscene); calculated Ctot, Pbopt, Zeu and PPeu.]
Slide from Rajendra Bose
• Ethics and rights control access
• Weak in expressing this long-term
• Collaboration tools
• Annotation, discussion, review
• Re-use leading to change and development
• “Publication”
• Not just in “print”
• Underlying data should be “published”, too
• Citation…
“My last example was an MST data set held at the BADC, and I was suggesting something like this (for a citation):
<Citation>
  <Author>Natural Environment Research Council</Author>
  <Title>Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth</Title>
  <Medium>Internet</Medium>
  <Publisher>British Atmospheric Data Centre (BADC)</Publisher>
  <PublicationDate status="ongoing">1990</PublicationDate>
  <Identifier>badc.nerc.ac.uk/data/mst/v3/upd15032006</Identifier>
  <Feature>
    <FeatureType>http://featuretype.registry/verticalProfile</FeatureType>
    <LocalID>200409031205</LocalID>
  </Feature>
  <AccessDate>Sep 21 2006</AccessDate>
  <AvailableAt><url>http://badc.nerc.ac.uk/data/mst/v3/</url></AvailableAt>
</Citation>
(Made up tags!)”
• Bryan Lawrence Weblog
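A citation record of this kind can be assembled and read back with standard XML tooling. The following is a minimal sketch in Python, assuming the made-up tag names from the example citation; the helper names `build_citation` and `format_citation` are hypothetical, introduced only for illustration, and do not correspond to any BADC service or published schema.

```python
import xml.etree.ElementTree as ET

# Field values copied from the example citation. The tag names are the
# "made up" ones from the weblog post, not a published schema.
FIELDS = [
    ("Author", "Natural Environment Research Council"),
    ("Title", "Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth"),
    ("Medium", "Internet"),
    ("Publisher", "British Atmospheric Data Centre (BADC)"),
    ("Identifier", "badc.nerc.ac.uk/data/mst/v3/upd15032006"),
    ("AccessDate", "Sep 21 2006"),
]


def build_citation(fields):
    """Build a <Citation> element with one child per (tag, text) pair."""
    citation = ET.Element("Citation")
    for tag, text in fields:
        ET.SubElement(citation, tag).text = text
    return citation


def format_citation(citation):
    """Render a citation element as a one-line, human-readable reference."""
    get = citation.findtext
    return (
        f"{get('Author')}. {get('Title')} [{get('Medium')}]. "
        f"{get('Publisher')}. {get('Identifier')}. "
        f"Accessed {get('AccessDate')}."
    )


# Round trip: serialise to text, parse it back, then format for a reader.
xml_text = ET.tostring(build_citation(FIELDS), encoding="unicode")
parsed = ET.fromstring(xml_text)
reference = format_citation(parsed)
print(reference)
```

Keeping the identifier and the access date as separate fields is what lets a reader, or a resolver, distinguish the dataset being cited from the moment at which it was consulted.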
• Role of Publisher: add value
• provision of catalogue metadata
• some commitment to maintenance of the resource at the AvailableAt url
• some commitment to the resource being conformant to the description of the Feature
• some commitment to the maintenance of the mapping between the identifier [LocalID] and the resource.
• Bryan Lawrence Weblog
• Wayback Machine
• Only snapshots (eg only 2004 version of Bryan’s home page!)
• WebCite
• allows the creator of content to submit URLs for [archiving], thus ensuring that when one writes an academic document, the material will be archived, and the citation will be persistent
• But no real help for data…
• “… only allow [data citation] when we believe in the persistence of the organisation making the data available…”
• Bryan Lawrence Weblog
• Needs a stable resource to cite…
OWL Web Ontology Language Reference
W3C Proposed Recommendation 15 December 2003
This version: http://www.w3.org/TR/2003/PR-owl-ref-20031215/
Latest version: http://www.w3.org/TR/owl-ref/
Previous version: http://www.w3.org/TR/2003/CR-owl-ref-2003081
• (FRBR works & expressions?)
• The date alone (as in common web citation approaches) is not enough!
• [6] The CIA World Factbook. www.cia.gov/cia/publications/factbook/. Retrieved on 8 Jan 2006.
• Cited object likely to have changed…
• Citation should link to the cited object as it was!
• An efficient way to reference and access “archived” past states of a changing dataset (work in progress, Buneman et al)
• Not important for original observations
• Don’t mess with those data
• Less important for incremental datasets
• Later stuff should not invalidate earlier
• Very important for revisable datasets
• Eg Genomics… datasets that result from the combined work of curators, or contain opinions or facts likely to change
• Eg Mapping… OS maps represent a huge database that changes on a daily basis
XMLArch: System Architecture
[Diagram. Components: Relational Database, Data Extractor, Pre-processor, Version Merger. Inputs and outputs: XML Snapshot at time t, XML Archive at time t - 1, XML Archive at time t.]
• Carwyn Edwards
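The idea of merging each new snapshot into a single archive, so that any past state can be cited and retrieved, can be illustrated with a small sketch. This is not XMLArch's actual implementation: it is a simplified Python model that assumes a snapshot is a flat dictionary of key/value pairs and that versions are increasing integers. The class name `VersionedArchive` and its methods are hypothetical.

```python
class VersionedArchive:
    """Toy archive: each value is stored once, tagged with the versions it spans."""

    def __init__(self):
        # key -> list of [start_version, end_version or None, value];
        # end_version None means the value is still current.
        self._history = {}
        self._latest = 0

    def merge(self, version, snapshot):
        """Merge the snapshot taken at `version` into the archive."""
        if version <= self._latest:
            raise ValueError("versions must increase")
        for key, value in snapshot.items():
            spans = self._history.setdefault(key, [])
            current = spans[-1] if spans and spans[-1][1] is None else None
            if current is not None and current[2] == value:
                continue  # unchanged: the open span simply extends
            if current is not None:
                current[1] = version  # close the superseded value
            spans.append([version, None, value])
        # Keys absent from this snapshot were deleted at this version.
        for key, spans in self._history.items():
            if key not in snapshot and spans[-1][1] is None:
                spans[-1][1] = version
        self._latest = version

    def snapshot_at(self, version):
        """Reconstruct the dataset as it was at `version`."""
        state = {}
        for key, spans in self._history.items():
            for start, end, value in spans:
                if start <= version and (end is None or version < end):
                    state[key] = value
        return state


archive = VersionedArchive()
archive.merge(1, {"population": 100, "capital": "A"})
archive.merge(2, {"population": 120, "capital": "A"})
archive.merge(3, {"population": 120})

state_v1 = archive.snapshot_at(1)
state_v2 = archive.snapshot_at(2)
state_v3 = archive.snapshot_at(3)
```

Because unchanged values are stored only once, the archive grows with the amount of change rather than with the number of snapshots, and a citation that records a version can be resolved to the cited object as it was.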
• “Small science” 2-3 times more data than “Big science”, but much more at risk
• PhD student? RA? PI? Administrator? IT support?
• Data potentially on local hard drives, or at best shared network drives
• May be inadequately protected
• Liable for policy-led deletion on resignation
• Individual “knows” too much
• Documentation/metadata unlikely to be adequate
• Tomorrow: gone!
• Specialist department archive (& national service)
• Workflow recording of lab parameters (R4L)
• Public & private elements
• Trying to build eCrystals federation (eBank 3)
• But… ReciprocalNet? French COD efforts? Fragmented discipline!
• Tomorrow: likely to continue
• 175,000 small molecule structures in CML
• Alongside Archaeology, Manuscripts, Learning Materials, etc
• No library curation skills; dependent on research group enthusiast
• Collection isolated from other Chemistry
• Tomorrow: assured…
• Shared effort from group of institutions
• Comparison OhioLink?
• Document tradition, not data
• Passive role re collections
• Rely on departmental & domain expertise
• Tomorrow: assured…
• Data specialists
• Multiple disciplines
• Distinct from domains; curation dependent on external expertise
• Research ethos
• Tomorrow: dependent on grant/contract income & research priorities
• Self-selected group of collectors: closest to genuine open activity (despite Alliance)?
• Traditionally libraries collecting eJournals
• Model respects IPR
• No domain expertise; rely on origins
• Data limitations…
• Tomorrow: potentially very persistent (low cost, high reliability, attack resistance, distributed)
• Staffed by archaeologist curators
• Understand special legal issues
• Strong relationship with community & peers
• Internationally still fragmented?
• Tomorrow: dependent on research council grants + deposit funding
• Part of major international effort
• Expensive shared facilities, global reach
• Well integrated into community
• Enable new science
• Tomorrow: assured by community (another large facility)
• Strong believer in need for domain scientists as curators
• Significant participant in “community proxy” agenda-setting activities
• Internationally fragmented resources
• Tomorrow: mostly dependent on grant funding (but strong commitment)
• International Scientific Union
• Attempting to build credit for data contributions
• DB ownership rotates
• Tomorrow: extremely limited funding
• Mature!
• Staffed by Social Science curators
• Alert to opportunities
• Able to appraise material offered
• Strong relationship to discipline
• Tomorrow: assured through broad mix of funding streams
• Publisher and Scientific Union
• Created key domain crystallographic standard (CIF)
• Strong motivator for deposit of structure data
• Consistent quality checks
• DOIs used for structure data
• Tomorrow: publishing business model
• Slide from IUCr
• Serious and robust approach
• Legal deposit powers & responsibilities as driver
• Oriented primarily towards “cultural heritage” (broadly interpreted)
• Little data, no science domain experience
• Tomorrow: strong future commitment
• Specialist archive for government datasets
• Understand government regulations, dynamics & requirements
• Subject generalists; disconnected from associated science
• Technology specialists (understand databases)
• Tomorrow: likely to pass eventually to The National Archives
• Government body making serious data available
• Domain scientists curate data
• Operates in current political context (!)
• Tomorrow: reasonably assured but some unfunded mandates?
• Should this be community?
• Demand driven
• No domain science expertise: rely on origins
• Tomorrow: business case
• Specific area: eJournals
• Depends on publisher agreements
• No data or domain science expertise
• Tomorrow: commitment from Mellon + publishers + subscriptions, good funding mix
• Records management IS a curation problem
• Organisations like this very likely to branch out
• No domain science expertise
• Tomorrow: business case, viability, stock market…
• Institutions have some fundamental sustainability
• Disciplines live in the network; sustainability is an issue
• Can we get the best of both?
[Matrix: Disciplines 1, 2, 3 etc set against Institutions 1, 2, 3 etc, with X marks where a discipline and an institution intersect.]
• Discipline commonality from survey (Miller, UKDA, 2006):
• 2-way links between data & publication useful
• Barriers to actual deposit of data/outputs
• Sharing data important, likely between colleagues
• Perceived inconsistency across repositories
• Most common searching: Google type
• Researchers favour self-reliance rather than library support
• Recognise need for common minimum metadata
• Aim for pilot linking middleware demonstrator
• “Creating small scale ‘silos’ of information with institutional repositories is not … a compelling information management strategy in the ‘Google age’” (Heery & Anderson for JISC, 2005)
• Sustainability work package in DCC (new grant!)
• JISC/NDIIPP meeting addressed it
• AHRC report draft soon
• Research Information Network report draft
• JISC study on sustainable IT systems for HE
• Recent ARL/NSF workshop, NSF strategy
• Repository as an organisation
• Repository as a service
• Repository as a system
• Repositories as a network (federation?)
• Collections and objects supported by repositories
• Commit to collection: contract the manager!
• Commitment essential… much more than anything else (cf persistent identifiers)
• Funder requirements express social determination
• Policy & grant application forms, selection criteria
• Monitoring essential
• Legal, ethical, IPR impacts all significant
• Public good questions
• Academic credit (citations?)
• Free-loaders (embargos?)
• Disciplines are different!
• Workforce skills: researcher, data librarian/scientist
• Commitment
• Goals
• Value and cost
• Business model
• Time
• Environment
• Domain knowledge and information
• Dimensions (how much stuff)
• Technical approaches
• Usage
• Digital data repositories already sustained > 30 years
• How?
• Vision, leadership, commitment
• Libraries, archives, museums sustained 100s of years
• How?
• Aggregate value proposition
• Perception now under threat!
• Collectively we need to identify the next steps toward digital data sustainability, for tomorrow, and tomorrow, and tomorrow!
• "To-morrow, and to-morrow, and to-morrow,
• Creeps in this petty pace from day to day,
• To the last syllable of recorded time;
• …it is a tale
• Told by an idiot, full of sound and fury,
• Signifying nothing."
• To that last syllable of recorded time
• Keep our tales forever full of significance!
Thank you