CASPAR Framework and Lessons Learned
David Giaretta

Overview
• CASPAR
• OAIS
• Threats and Solutions
• Validation

CASPAR Project
• EU FP6 Integrated Project
• Total spend approx. 16 MEuro (8.8 MEuro from EU)
• http://www.casparpreserves.eu

Digital Preservation
• Ensure that digitally encoded information is understandable and usable over the long term
  – Long term could start at just a few years
• Easy to make claims
  – Difficult to provide proof
• Reference Model for an Open Archival Information System (ISO 14721)
  – The basic standard for work in digital preservation
  – Defines terminology and compliance criteria

Information Model & Representation Information
The Information Model is key.
[Diagram: an Information Object consists of a Data Object interpreted using 1+ Representation Information, which is in turn interpreted using further Representation Information. A Data Object is either a Physical Object or a Digital Object; a Digital Object is made up of 1+ Bit Sequences.]
Recursion ends at the KNOWLEDGE BASE of the DESIGNATED COMMUNITY (this knowledge will change over time and region).

Basic concept of CASPAR
• Digital preservation had been dominated by libraries and (state) archives
• However, the focus there was on “rendered objects” and “metadata”
• Tendency to think data is an “easy” add-on
HOWEVER
• Need to deal with DATA: processed into new things, not just rendered
• Need to follow OAIS: finer-grained view
• Need to test and prove that things work

Preservation Strategies
• Emulation
• Access software
• Migration
• Transformation
• Description techniques

Data…
• Level 2 GOME satellite instrument data
• Contains numbers – need meaning
• ...to process to this [figure]
• ...or this [figure]
• ...through complex processing schemes [figure]

Just Format?
sfqsftfoubujpo jogpsnbujpo svmft
• You have a file
• JHOVE tells you it is WORD version 7
..with some extra information..
representation information rules
• Format registries are useful but not enough: formats can be used for multiple purposes, e.g.
audio files used to store configuration parameters

Examples (cont.)
• “504b0304140000000800f696….”
• “This is a ZIP file which contains Word files, each of which contains an encoded message which needs the key ‘!D$G^AJU*KI’ to decode it using encryption method SHA7”

Examples (cont.)
• LaTeX file containing an EPS (Encapsulated PostScript) version of an image
• Web page containing a Java applet generating random numbers
• SWISS-PROT data
• Foreign-language emails

XML enough?
– can stare at this and probably understand it
<family>
  <father>John</father>
  <mother>Mary</mother>
  <son>Paul</son>
</family>

..but what about this?
<VOTABLE version="1.1"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.ivoa.net/xml/VOTable/v1.1 http://www.ivoa.net/xml/VOTable/v1.1"
 xmlns="http://www.ivoa.net/xml/VOTable/v1.1">
<RESOURCE>
<TABLE name="6dfgs_E7_subset" nrows="875">
<PARAM arraysize="*" datatype="char" name="Original Source" value="http://wwwwfau.roe.ac.uk/6dFGS/6dfgs_E7.fld.gz">
<DESCRIPTION>URL of data file used to create this table.</DESCRIPTION>
</PARAM>
<PARAM arraysize="*" datatype="char" name="Comment" value="Cut down 6dfGS dataset for TOPCAT demo usage."/>
<FIELD arraysize="15" datatype="char" name="TARGET">
<DESCRIPTION>Target name</DESCRIPTION>
</FIELD>
<FIELD arraysize="11" datatype="char" name="DEC" unit="DMS">
<DATA>
<FITS>
<STREAM encoding='base64'>
U0lNUExFICA9ICAgICAgICAgICAgICAgICAgICBUIC8gU3RhbmRhcmQgRklUUyBm
b3JtYXQgICAgICAgICAgICAgICAgICAgICAgICAgICBCSVRQSVggID0gICAgICAg
ICAgICAgICAgICAgIDggLyBDaGFyYWN0ZXIgZGF0YSAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAgIE5BWElTICAgPSAgICAgICAgICAgICAgICAgICAgMCAv
IE5vIGltYWdlLCBqdXN0IGV4dGVuc2lvbnMgICAgICAgICAgICAgICAgICAgICAg

Representation Information
The Information Model is key.
Recursion ends at the KNOWLEDGE BASE of the DESIGNATED COMMUNITY (this knowledge will change over time and region).

Representation Information Network
Diagram labels: Rep Info, Virtualisation, /DISCIPLINE, README.txt, TEXT
EDITOR, ENGLISH LANGUAGE

Modules and Dependencies: defining the Designated Community
[Dependency diagram; node labels: WINDOWS XP, FITS FILE, FITS DICTIONARY, FITS STANDARD, MULTIMEDIA PERFORMANCE DATA, C3D, 3D motion data files, DirectX, MAX/MSP, 3D scene data files, motion to music mapping strategy, PDF STANDARD, PDF s/w, FITS JAVA s/w, DICTIONARY SPECIFICATION, XML SPECIFICATION, JAVA VM, UNICODE SPECIFICATION]

[Diagram: Archival Package structure. Recoverable labels: described by, delimited by, Packaging, Package, derived from, Content, further described by, interpreted using, Data Object, Physical Object, Digital Object, Bit, Structure, Other, Reference, Provenance, Context, Fixity, Access Rights, adds meaning to]

[Diagram labels: Cost sharing, DRM, Preservable infrastructure]

USE DATA
• Use application to find data in Repository
• Create DIP with enough RepInfo for the user (via DC profile)
• Obtain more RepInfo from Registry if necessary

Threats and solutions
• Threat: Users may be unable to understand or use the data, e.g. the semantics, format, processes or algorithms involved
  – Requirement for solution: Ability to create and maintain adequate Representation Information
  – CASPAR solution: RepInfo toolkit, Packager and Registry, to create and store Representation Information. In addition the Orchestration Manager and Knowledge Gap Manager help to ensure that the RepInfo is adequate.
• Threat: Non-maintainability of essential hardware, software or support environment may make the information inaccessible
  – Requirement for solution: Ability to share information about the availability of hardware and software and their replacements/substitutes
  – CASPAR solution: Registry and Orchestration Manager to exchange information about the obsolescence of hardware and software, amongst other changes. The Representation Information will include such things as software source code and emulators.
• Threat: The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
  – Requirement for solution: Ability to bring together evidence from diverse sources about the Authenticity of a digital object
  – CASPAR solution: Authenticity toolkit will allow one to capture evidence from many sources which may be used to judge Authenticity.
• Threat: Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future
  – Requirement for solution: Ability to deal with Digital Rights correctly in a changing and evolving environment
  – CASPAR solution: Digital Rights and Access Rights tools allow one to virtualise and preserve the DRM and Access Rights information which exist at the time the Content Information is submitted for preservation.
• Threat: Loss of ability to identify the location of data
  – Requirement for solution: An ID resolver which is really persistent
  – CASPAR solution: Persistent Identifier system: such a system will allow objects to be located over time.
• Threat: The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
  – Requirement for solution: Brokering of organisations to hold data and the ability to package together the information needed to transfer information between organisations ready for long term preservation
  – CASPAR solution: Orchestration Manager will, amongst other things, allow the exchange of information about datasets which need to be passed from one curator to another.
• Threat: The ones we trust to look after the digital holdings may let us down
  – Requirement for solution: Certification process so that one can have confidence about whom to trust to preserve data holdings over the long term
  – CASPAR solution: The Audit and Certification standard to which CASPAR has contributed will allow a certification process to be set up.

Accelerated Lifetime tests
As part of the validation, the CASPAR testbeds simulated the following:
• hardware changes
• software changes
• changes in the environment (including legal framework)
• changes to the knowledge bases of the Designated Communities

Test scenarios vs Threats to digital preservation
Testbeds (columns): STFC, ESA, UNESCO, IRCAM, UnivLeeds, CIANT
Threats (rows):
• Users may be unable to understand or use the data, e.g.
the semantics, format, processes or algorithms involved
• Non-maintainability of essential hardware, software or support environment may make the information inaccessible
• The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
• Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future
• The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
Additional testbed (column): INA

STFC Testbed
– various STP data

ESA testbed

UNESCO testbed
• The Villa Livia dataset is a collection of files used within the "virtual museum of the ancient Via Flaminia" project: a 3D reconstruction of several archaeological sites along the ancient Via Flaminia, the largest of them being Villa Livia.
• This is an elevation grid (height map) of the area where Villa Livia is located. It is an ASCII file in the ESRI GRID file format.

Contemporary Art Testbed
• Performance Viewer: side-by-side comparison and validation of the transformation. From left to right: 3D visualization in Ogre3D, 3D model of the stage including the virtual dancer in VRML.
• Figure 8: Some aspects of acousmatic production

CASPAR Validation
• In all cases members of the Designated Community, with appropriate changes to mimic changes over time, verified that the metadata was adequate for use despite simulated changes of hardware, software, environment and Designated Community.
• Full details are available in the validation report (CASPAR Validation report, 2009)

Links
• CASPAR – http://www.casparpreserves.eu
• CASPAR source code – http://sourceforge.net/projects/digitalpreserve/
• OAIS Reference Model – http://public.ccsds.org/publications/archive/650x0b1.pdf and the updated draft is available from http://public.ccsds.org/sites/cwe/rids/Lists/CCSDS%206500P11/Overview.aspx
• CASPAR Validation report – http://www.casparpreserves.eu/Members/cclrc/Deliverables/casparvalidation-evaluation-report/at_download/file
• PARSE.Insight – www.parse-insight.eu
• Alliance for Permanent Access – www.alliancepermanentaccess.eu
• Digital Curation Centre – www.dcc.ac.uk

FUTURE
• Users may be unable to understand or use the data, e.g. the semantics, format, processes or algorithms involved
• Non-maintainability of essential hardware, software or support environment may make the information inaccessible
• The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
• Access and use restrictions may not be respected in the future
• Loss of ability to identify the location of data
• The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
• The ones we trust to look after the digital holdings may let us down

END
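
Illustration: Representation Information network and knowledge gap
The slides state that the recursion of Representation Information ends at the knowledge base of the Designated Community, and that a Knowledge Gap Manager helps ensure the RepInfo is adequate. The following is a minimal sketch of that idea, not CASPAR's implementation. The module names come from the "Modules and Dependencies" slide, but the dependency edges, the function name knowledge_gap and the two community profiles are all assumptions made for illustration.

```python
from typing import Dict, List, Set

# Hypothetical dependency edges between modules named on the
# "Modules and Dependencies" slide. The edges are assumed, not
# taken from the slide.
DEPENDS_ON: Dict[str, List[str]] = {
    "FITS FILE": ["FITS STANDARD", "FITS DICTIONARY", "FITS JAVA s/w"],
    "FITS STANDARD": ["PDF STANDARD"],
    "PDF STANDARD": ["PDF s/w"],
    "FITS DICTIONARY": ["DICTIONARY SPECIFICATION"],
    "DICTIONARY SPECIFICATION": ["XML SPECIFICATION"],
    "XML SPECIFICATION": ["UNICODE SPECIFICATION"],
    "FITS JAVA s/w": ["JAVA VM"],
}


def knowledge_gap(target: str, knowledge_base: Set[str]) -> List[str]:
    """Return the modules a community still needs in order to use `target`.

    Walks the dependency network depth-first. The recursion stops at any
    module already in the community's knowledge base, and at modules with
    no recorded dependencies.
    """
    gap: List[str] = []
    seen: Set[str] = set()

    def visit(module: str) -> None:
        if module in seen or module in knowledge_base:
            return
        seen.add(module)
        gap.append(module)
        for dep in DEPENDS_ON.get(module, []):
            visit(dep)

    for dep in DEPENDS_ON.get(target, []):
        visit(dep)
    return gap


# Two hypothetical Designated Community profiles.
astronomers = {"FITS STANDARD", "FITS DICTIONARY", "JAVA VM"}
general_public = {"PDF s/w", "JAVA VM"}

print(knowledge_gap("FITS FILE", astronomers))
print(knowledge_gap("FITS FILE", general_public))
```

With these assumed edges, the first community needs only the FITS Java software, while the second needs the whole chain down to the Unicode specification. The same data object therefore requires different amounts of Representation Information depending on how the Designated Community is defined, and the gap grows as the community's knowledge base changes over time.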