a centre of expertise in data curation and preservation

The future of the DCC
Chris Rusbridge
E-Science Workshop April 2009

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/; or, (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

Contents
• Curation & integrated science
• Poetry & Philosophy of D H Rumsfeld
• Designated Community & Knowledge Base
• DCC services
• Future of the DCC

Curation
• Wikipedia
  • Curator: a content specialist responsible for an institution's collections and, together with a publications specialist, their associated collections catalogs.
  • Digital Curation: the curation, preservation, maintenance, collection and archiving of digital assets
  • Sheer curation: an approach to digital curation where curation activities are quietly integrated into the normal work flow of those creating and managing data and other digital assets.
• DCC: Digital curation is maintaining and adding value to a trusted body of digital information for current and future use.

Integrated Science
• The application of multiple scientific disciplines to one or more core scientific challenges
• Examples of integrated sciences?
  • Archaeology
  • Environmental sciences

Integrated Science implications
• Scientists will be using unfamiliar data, therefore
• Data curators and managers must make their data available for unfamiliar users!
• And now for something unfamiliar?

Poetry & Philosophy of D H Rumsfeld
Hart Seely, April 2, 2003, SLATE
http://www.slate.com/id/2081042/

A Confession
‘Once in a while,
I'm standing here, doing something.
And I think,
"What in the world am I doing here?"
It's a big surprise.’
(May 16, 2001, interview with the New York Times)

The Unknown
‘As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.’
(Feb. 12, 2002, Department of Defense news briefing)

The 4th Rumsfeld?
• 3 epistemological classes (???)
  • Known knowns
  • Known unknowns
  • Unknown unknowns
• 4th class?
  • Unknown knowns?
• Critical issue for cross-disciplinary sciences

Some OAIS Concepts?
• Knowledge Base: allows a consumer to understand something
• Designated Community: the set of consumers for whom the archive curates something
• Representation Information (RepInfo): helps you interpret a data object, yielding an information object
• The amount and nature of RepInfo required depends on the Knowledge Base of the Designated Community
  • If you curate for project colleagues in the short term, little if any RepInfo is required
  • If you curate for those unfamiliar with the data, more RepInfo is needed
• (All broadly interpreted!)
CCSDS (2002). Reference Model for an Open Archival Information System (OAIS). Retrieved from http://public.ccsds.org/publications/archive/650x0b1.pdf.
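
The dependency between these OAIS concepts can be made concrete with a toy sketch. It is not part of the OAIS model or of any DCC tool: the class name `DesignatedCommunity`, the function `repinfo_needed` and the knowledge labels are all hypothetical, and it assumes that understanding can be listed as discrete labels, which is at best a rough approximation ("all broadly interpreted"). Under that assumption, the RepInfo an archive must supply is whatever is needed to interpret the data minus what the Designated Community's Knowledge Base already covers.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class DesignatedCommunity:
    """Hypothetical model: a community summarised by what it already knows."""
    name: str
    knowledge_base: FrozenSet[str]


def repinfo_needed(required: Set[str], community: DesignatedCommunity) -> Set[str]:
    """Return the understanding the archive would have to supply as RepInfo.

    Assumption: knowledge can be enumerated as discrete labels, so the gap
    is a plain set difference between what is required and what is known.
    """
    return set(required) - community.knowledge_base


# Illustrative labels only; not taken from any real dataset.
REQUIRED = {"file format", "column meanings", "units", "sampling method"}

colleagues = DesignatedCommunity("project colleagues", frozenset(REQUIRED))
outsiders = DesignatedCommunity("other disciplines", frozenset({"file format"}))

for dc in (colleagues, outsiders):
    # Sorted only so the printed order is stable.
    print(dc.name, "->", sorted(repinfo_needed(REQUIRED, dc)))
```

For the project colleagues the gap is empty, while the cross-discipline community is missing everything except the file format, which mirrors the two cases on the slide.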

Time
• KB is f1(DC, t)
• DC is f2(t)
• RepInfo needed is f3(f1(DC, t), f2(t))
• (but none of these concepts can be precisely defined!)
• If DC is small and t is short (months to a year or so), then both may be ignored, and RepInfo can be assumed to be part of the KB
• If DC is extensive (eg cross-discipline) and t is long (5 years to 25 plus), then RepInfo must be articulated
• If t is very long, most bets are off (post-hoc reconstruction likely to be needed)

What might RepInfo include
• Structure information: file format definitions, etc
• Semantic information: data dictionaries, code books etc
• Robust methods (working code?)
• Not to mention many kinds of metadata, provenance, documentation of hidden assumptions, etc
• Cross-domain schemas one approach to articulating RepInfo?
• (Never perfect, of course)

What about Rumsfeld 4?
• Biggest concern with unfamiliar users is clashing concepts, eg different baselines, units, geographies, granularity
• Especially where terms are ambiguous or differently interpreted
• The KBs of two DCs conflict, potentially silently
• Happens all the time, of course
• The unspoken: tacit knowledge, unknown knowns!

Timing
• Curation starts before creation
• Before project proposal!
• Data acquisition should not happen at the end
• Continuous acquisition much better?
• Enforcement… or credit for data?

Other curation issues of concern
• Sustainability (work on your survival)
• Succession (what happens to your data if you don’t)
• Data audit (know what you’ve got)
• Data risk assessment (assess your chances of loss)
• Repository external audit???
• Provenance & computational lineage
• Archiving database changes
• Community proxy roles: help your communities develop data standards & data practices
• DCC has tools & support for some of these…

… and Research Outputs?
• Need more semantically aware texts to support cross-community understanding
• Coded up (cf microformats, RDFa)
  • People
  • Citations & references
  • Science features (eg chemicals, reactions)
  • Graphs, spectra, tables linking to
    • Supplementary data
• PDF is pretty bad at this

DCC Phase 3
• Post January 2010?
• Smaller (2/3 budget if we’re lucky)
• Joint planning with JISC
• More tightly managed (hub and spoke)
• No development (says JISC)
• Core services plus optional additional services
• 1st draft seen by JSR
• Feedback session next week

Proposed core services
• Reference Resources and Exemplars
• Training and Staff Development
• Expertise, Advice, Consultancy and Hands-on Support
• Community-building and Information-sharing activities
• Data Management and Sharing Plans
• Policy and Strategic Development
• Providing Access to Tools and Toolkits

Possible additional services
• Development of Tools, Toolkits, Wizards and Templates
• Infrastructure Services
• Model licences for data
• Data citation guidelines

What do you want from the DCC?