The future of the DCC Chris Rusbridge E-Science Workshop April 2009

a centre of expertise in data curation and preservation
The future of the DCC
Chris Rusbridge
E-Science Workshop April 2009
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK:
Scotland License. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/; or (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San
Francisco, California, 94105, USA.
Contents
• Curation & integrated science
• Poetry & Philosophy of D H Rumsfeld
• Designated Community & Knowledge Base
• DCC services
• Future of the DCC
Curation
• Wikipedia
• Curator: a content specialist responsible for an institution's
collections and, together with a publications specialist, their
associated collections catalogs.
• Digital Curation: the curation, preservation, maintenance,
collection and archiving of digital assets
• Sheer curation: an approach to digital curation where
curation activities are quietly integrated into the normal work
flow of those creating and managing data and other digital
assets.
• DCC: Digital curation is maintaining and adding value
to a trusted body of digital information for current and
future use.
Integrated Science
• The application of multiple scientific
disciplines to one or more core scientific
challenges
• Examples of integrated sciences?
• Archaeology
• Environmental sciences
Integrated Science implications
• Scientists will be using unfamiliar data,
therefore
• Data curators and managers must make their
data available for unfamiliar users!
• And now for something unfamiliar?
Poetry & Philosophy of D H Rumsfeld
Hart Seely, April 2, 2003, SLATE, http://www.slate.com/id/2081042/
A Confession
‘Once in a while,
I'm standing here, doing something.
And I think,
"What in the world am I doing here?"
It's a big surprise.’
—May 16, 2001, interview with the New York Times
The Unknown
‘As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.’
—Feb. 12, 2002, Department of Defense news briefing
The 4th Rumsfeld?
• 3 epistemological classes (???)
• Known knowns
• Known unknowns
• Unknown unknowns
• 4th class?
• Unknown knowns?
• Critical issue for cross-disciplinary sciences
Some OAIS Concepts?
• Knowledge Base: allows a consumer to understand
something
• Designated Community: the set of consumers for
whom the archive curates something
• Representation Information: helps you interpret a
data object yielding an information object
• The amount and nature of RepInfo required is dependent on
the Knowledge Base of the Designated Community
• If you curate for project colleagues in the short term, little if
any RepInfo required
• If you curate for those unfamiliar with the data, more RepInfo
is needed
• (All broadly interpreted!)
CCSDS (2002). Reference Model for an Open Archival Information System (OAIS). Retrieved from http://public.ccsds.org/publications/archive/650x0b1.pdf
Time
• The Knowledge Base (KB) is f1(DC, t), where t is time
• The Designated Community (DC) is f2(t)
• RepInfo needed is f3(f1(DC, t), f2(t))
• (but none of these concepts can be precisely defined!)
• If DC is small and t is short (months to a year or so),
then both may be ignored, and RepInfo can be assumed
to be part of the KB
• If DC is extensive (eg cross-discipline) and t is long
(5 years to 25 plus), then RepInfo must be articulated
• If t is very long, most bets are off (post-hoc
reconstruction likely to be needed)
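The rules of thumb above can be sketched as a small decision function. This is only an illustration: the function name, the return labels and the boolean `cross_discipline` flag are hypothetical, the "short" and "long" thresholds are loose readings of the ranges on the slide, and the 50-year figure for "very long" is an assumption, since the slide gives none. As the slide itself says, none of these concepts can be precisely defined.

```python
# Illustrative thresholds in years. SHORT and LONG paraphrase the slide's
# ranges; VERY_LONG is an assumption (the slide gives no figure).
SHORT = 1.5
LONG = 5
VERY_LONG = 50


def repinfo_strategy(cross_discipline: bool, years: float) -> str:
    """Rough RepInfo guidance for a Designated Community over a time horizon."""
    if years >= VERY_LONG:
        return "reconstruct"   # most bets are off; post-hoc reconstruction likely
    if cross_discipline and years >= LONG:
        return "articulate"    # RepInfo must be written down explicitly
    if not cross_discipline and years <= SHORT:
        return "implicit"      # RepInfo assumed to be part of the Knowledge Base
    return "unspecified"       # combinations the slide does not cover


print(repinfo_strategy(False, 0.5))
print(repinfo_strategy(True, 10))
```

The "unspecified" branch is deliberate: the slide says nothing about, for example, a small community over a long period, so the sketch does not guess.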
What might RepInfo include
• Structure information: file format definitions, etc
• Semantic information: data dictionaries, code books etc
• Robust methods (working code?)
• Not to mention many kinds of metadata, provenance,
documentation of hidden assumptions, etc
• Cross-domain schemas one approach to articulating
RepInfo?
• (Never perfect, of course)
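As a minimal sketch of how semantic RepInfo turns a data object into an information object, the example below applies a small data dictionary to a raw CSV fragment. The column names (`stn`, `t_max`), their labels and units are invented for illustration and do not come from any real dataset.

```python
import csv
import io

# Hypothetical data dictionary: semantic RepInfo for two invented columns.
DATA_DICTIONARY = {
    "stn": {"label": "station identifier", "type": str, "unit": None},
    "t_max": {"label": "daily maximum air temperature", "type": float, "unit": "degC"},
}


def interpret(raw_csv: str) -> list:
    """Turn raw CSV text into records keyed by meaning, each value with its unit."""
    records = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        record = {}
        for column, text in row.items():
            entry = DATA_DICTIONARY[column]  # KeyError if a column is undocumented
            record[entry["label"]] = (entry["type"](text), entry["unit"])
        records.append(record)
    return records


print(interpret("stn,t_max\nA1,21.5\n"))
```

Without the dictionary, a reader outside the project sees only `A1` and `21.5`; with it, the same bytes carry a meaning and a unit.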
What about Rumsfeld 4?
• Biggest concern with unfamiliar user is
clashing concepts, eg different baselines,
units, geographies, granularity
• Especially where terms are ambiguous or
differently interpreted
• The KBs of two DCs conflict, potentially silently
• Happens all the time, of course
• The unspoken: tacit knowledge, unknown
knowns!
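One way to stop two Knowledge Bases conflicting silently is to carry the tacit assumptions with the data and refuse to combine values when they differ. The sketch below assumes a hypothetical `Series` type with `unit` and `baseline` fields; the names and sample values are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Series:
    values: tuple
    unit: str      # e.g. "degC" or "K"
    baseline: str  # e.g. the reference period the values are relative to


def combine(a: Series, b: Series) -> Series:
    """Concatenate two series only if their stated assumptions agree."""
    if a.unit != b.unit or a.baseline != b.baseline:
        raise ValueError(
            f"assumptions clash: {a.unit}/{a.baseline} vs {b.unit}/{b.baseline}"
        )
    return Series(a.values + b.values, a.unit, a.baseline)


ours = Series((0.3, 0.5), unit="degC", baseline="1961-1990")
theirs = Series((0.2, 0.4), unit="degC", baseline="1981-2010")

try:
    combine(ours, theirs)
except ValueError as err:
    print(err)  # the clash is reported instead of passing silently
```

This only catches assumptions that someone thought to record; the truly unknown knowns still have to be surfaced by talking to the data creators.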
Timing
• Curation starts before creation
• Before project proposal!
• Data acquisition should not happen at the
end
• Continuous acquisition much better?
• Enforcement… or credit for data?
Other curation issues of concern
• Sustainability (work on your survival)
• Succession (what happens to your data if you don’t)
• Data audit (know what you’ve got)
• Data risk assessment (assess your chances of loss)
• Repository external audit???
• Provenance & computational lineage
• Archiving database changes
• Community proxy roles: help your communities
develop data standards & data practices
• DCC has tools & support for some of these…
… and Research Outputs?
• Need more semantically aware texts to
support cross-community understanding
• Coded up (cf microformats, RDFa)
• People
• Citations & references
• Science features (eg chemicals, reactions)
• Graphs, spectra, tables linking to
• Supplementary data
• PDF is pretty bad at this
DCC Phase 3
• Post January 2010?
• Smaller (2/3 budget if we’re lucky)
• Joint planning with JISC
• More tightly managed (hub and spoke)
• No development (says JISC)
• Core services plus optional additional services
• 1st draft seen by JSR
• Feedback session next week
Proposed core services
• Reference Resources and Exemplars
• Training and Staff Development
• Expertise, Advice, Consultancy and Hands-on
Support
• Community-building and Information-sharing
activities
• Data Management and Sharing Plans
• Policy and Strategic Development
• Providing Access to Tools and Toolkits
Possible additional services
• Development of Tools, Toolkits, Wizards and
Templates
• Infrastructure Services
• Model licences for data
• Data citation guidelines
What do you want from the DCC?