a centre of expertise in data curation and preservation

Looking to the longer term: some perspectives on data curation and preservation
Dr Liz Lyon, DCC Associate Director Outreach
Director, UKOLN, University of Bath, UK
This work is licensed under a Creative Commons Licence Attribution-ShareAlike 2.0
IMechE Workshop, London, 26th September 2006

About UKOLN
• “a centre of expertise in digital information management”
• Funding: Joint Information Systems Committee (JISC) + Museums, Libraries & Archives Council (MLA)
• Portfolio of R&D projects: Delos, DRIVER, Grand Challenge
• 29+ staff based at the University of Bath
• Inform the library, information, education and cultural heritage communities
• Policy, advocacy at national level, build innovative Web-based systems & services, R&D, e-journal Ariadne, workshops and conferences
• http://www.ukoln.ac.uk/
Acknowledgement: Alex Ball, Grand Challenge Project

UK Digital Curation Centre
• Digital Curation Centre
• Funded by JISC & EPSRC
• Development activities
• Research agenda
• Delivering services
• Outreach Programme
• http://www.dcc.ac.uk/

Overview
• Data curation and digital preservation issues
• Draw on research and scholarship perspectives
• Data / information flows and the “business process”
• UK Digital Curation Centre activities
“maintaining and adding value to a trusted body of digital information for current and future use”

Reference datasets as infrastructure?

Datacentric 2020 vision

(Very simple) Product Research Cycle & Data Curation
[Cycle diagram; the label “Data processing” is repeated between the stages. Stages and labels:]
• (New) knowledge extraction: data mining, modelling, analysis, synthesis
• Formulate ideas / hypothesis, test, experiment, observe, design: data creation, collection & capture
• Adding value: data linking, annotation, visualisation, simulation
• e-Infrastructure
• Open ?? access
• Collaboration
• Data management, storage & validation: description, deposit, self-archiving, preservation, certification
• Scholarly communications & business transactions: data disclosure, publication, citation, discovery, re-use

• RepoMMan: Repository Metadata and Management (Hull) using WS-BPEL
• Are your engineering workflows identified and described?

e-Scientist desktop?
Slide: Carole Goble

Workflow
DAME signal processing workflows using Grid Services
[Workflow diagram. Roles: Airport Maintenance Engineer; DS&S Maintenance Analyst (Fleet Manager); Rolls Royce Domain Expert. Steps and labels include: Aircraft Lands, Visual Inspection, Quote Diagnosis, Brief Diagnosis / Prognosis, Check Diagnoses, Diagnosis Result, Detailed Diagnosis / Prognosis, Request Information, Provide Information, Analyst Decision, Detailed Analysis, Request Further Details, Provide Further Details, Expert Decision, Sign-off Diagnosis, Maintenance Procedure, Maintenance Result, Release Engine. Branch conditions: [unknown], [known], [Clear], [diagnosis], [information required], [fault resolved], [fault unresolved].]

Research outputs in institutional repositories: engineering

“JISC Vision”: a global landscape of federated repositories
• Multi-disciplinary, cross-sectoral
• e-Framework and Information Environment context
• National, institutional
• Define common + domain-specific + repository “services”
• Different platforms
• Many format types: data, eprints, images, geospatial
• Interoperability based on open standards, software tools
[Diagram, from Andy Powell: http://www.ukoln.ac.uk/distributed-systems/jiscie/arch/presentations/jiie-jcs-2005/ Labels: repositories; heterogeneous - metadata formats, content formats, identifiers, packaging standards; fusion layer ‘repository federator’; homogeneous - metadata formats, content formats, identifiers, packaging standards; portals.]

PerX: Pilot Engineering Repository Xsearch
http://www.engineering.ac.uk/

Interoperability???
STEP (ISO 10303)

Repositories and OAIS Reference Model
“an archive consisting of an organisation of people and systems that has accepted the responsibility to preserve information and make it available for a Designated Community ... an identified group of potential consumers who should be able to understand a particular set of information”

Assuring permanence: digital preservation
• Trusted Digital Repository Audit Checklist for Certification, Draft, Research Libraries Group-NARA Taskforce 2005. Defined criteria:
– Organisation
– Functions, processes & procedures
– Designated community & usability
– Technologies & technical infrastructure
• Revised Checklist based on feedback and pilot audits (KB, BADC)
• Self-certification: DINI-Zertifikat: requirements & recommendations:
– Server policy / Guidelines
– Author support
– Legal issues
– Authenticity and integrity
– Cataloguing
– Access statistics
– Long-term sustainability
• Has your repository / PLM been audited?

Interdisciplinary discovery
• Validation, publication & discovery of data models & schema
• Harmonisation and normalisation of metadata and semantics
• Packaging standards: METS, MPEG-21 DIDL
• Formal high-level and domain ontologies
• ePrints DC Application Profile http://www.ukoln.ac.uk/repositories/digirep/index/Eprints_Application_Profile
• eBank Application Profile, crystallography data http://www.ukoln.ac.uk/projects/ebankuk/schemas/
• What data models and metadata schema are in place?

Persistent identifiers for data citation
• How will they be used? We need use cases: depositor, author, service provider, researcher, publisher?
• Schemes: DOI, Handle, ARK, PURL
• Global identification: express as http URIs
• Data citation (human and machine-actionable)
• Publication & citation of scientific primary data project, National Library for Science & Technology (TIB), University of Hanover, Germany. STD-DOI Project: DOI registry for datasets http://www.std-doi.de
• Is there a data citation policy?
• What persistent identifiers have been assigned to your data?

Discovering data: eBank Project
• Domain identifier: International Chemical Identifier (INChI) code
• Google molecule using INChI
Slide from Simon Coles
Coles, S.J., Day, N.E., Murray-Rust, P., Rzepa, H.S., Zhang, Y., Org. Biomol. Chem., 2005, (10), 1832-1834. DOI: 10.1039/b502828k

Domain identifiers for engineering?

Format migration challenges?
CAD Program Compatibility Chart http://www.okino.com/conv/filefrmt_cad.htm

Registry development

Development: Representation Information Registry Repository
• “DCC Approach to Digital Curation” based on OAIS
• Representation Information Registry Repository
• Prototype demonstrator: based on 2 key concepts to facilitate sharing of the curation effort
– Curation Persistent Identifier (CPID)
– Descriptive “label” (structural, semantic, other metadata)
• Development of machine-to-machine (M2M) tools and interfaces for creating, using and re-using representation information
• http://dev.dcc.ac.uk Wiki and email list
• EU CASPAR Integrated Project http://www.casparpreserves.info/pages/1/index.htm
• Task Force on the Permanent Access to the Records of Science http://tfpa.kb.nl/

Registry API
• Allows applications to talk to many different registry implementations, e.g. GDFR, PRONOM, UDDI
• GUI access, and access via Web browser: http://registry.dcc.ac.uk

Adding value through annotation
Research at the University of Edinburgh
• Scientific databases: Annotation scoping report
• New annotation model + prototype MONDRIAN
• Intuitive visual interface iMONDRIAN
• Annotate sets of values
• Support for querying annotations

NaCTeM http://www.nactem.ac.uk/
Emerging tools: TerMine, GENIA, Cafetiere

Knowledge extraction:
Nature, 23 March 2006
OTMI: Open Text Mining Interface
• Mining (data, text, structures)
• Modelling (economic, climate, mathematical, biological…)
• Analysis (statistical, lexical, gene….)

Supporting the community: Services
• HELPDESK@dcc.ac.uk
• legal - technical guidance
• Curation Manual: 45 chapters planned
– Metadata (umbrella)
– Open Source
– Archival metadata
– Preservation metadata
– Selection & appraisal
– Curating emails
• Briefing Papers
– Curating emails
– Digital repositories
– Geospatial data
– Data protection
– eScience data
• Case studies

DCC Case Study published: Wide Field Astronomy Unit

Supporting the community: Outreach & Services
• Workshops:
– Geospatial data, NeSC, 27 October
– OAIS 5 year Review, October
– Audit & Certification Forum, October
– Records Management, Liverpool, 30 Nov
– Curation & Preservation Training, Dec
– 2007 Preservation of journals tbc
– 2007 Legal environment tbc
– 2007 Preparing for audit tbc
• Information Days: British Library, Liverpool, UCL
• 2nd International DCC Conference, 21-22 November, Glasgow
• Keynotes: Hans F. Hoffmann (CERN), Clifford Lynch (CNI)

DCC Phase 2: 2007-2010
• Working more closely with data centres, e-Science Programmes and Research Councils
• SCARP Project: disciplinary approach
• JISC Digital Repository Programme collaboration
• RepInfo Registry service migration
• Define self-assessment procedures and tools
• Collaborate with CASPAR, DPE and PLANETS (EU-funded Digital Preservation Projects)
• Workshop Programme, International Conference 2007

Thank you. Questions?
e.lyon@ukoln.ac.uk
Join the DCC Associates Network at www.dcc.ac.uk
University of Bath, 13 September 2006