Centre de Données de la Physique des Plasmas

How to Evaluate the Ability of a File Format to Ensure Long-Term Preservation for Digital Information?

N. Lormant (1), C. Huc (2), D. Boucon (1), C. Miquel (1)
(1) Silogic, (2) CNES

PV 2005 – Edinburgh 21-23/11/2005

SUMMARY
• Introduction
• Criteria for Evaluating a Format
  - Necessary Condition
  - Principal Rules
  - Additional Recommendations
• Case Study 1: PNG Format
  - Introduction
  - PNG / Criteria
  - Conclusion
• Case Study 2: PDF Format
  - Introduction
  - PDF / Criteria
  - Restrictions on Use
  - Conclusion
• Conclusion

INTRODUCTION
• Long-term preservation means:
  - Storing bit streams on a long-term medium
  - Preserving representation information
• Part of the representation information is contained in the storage format.
• A large variety of formats is generally available to store a given type of data.
=> We need a methodology to evaluate which format is the most suitable in a given context.

NECESSARY CONDITION
• The data format contains a critical part of the representation information.
FORMAT-1 The format of the data must be fully and explicitly specified. The format specification must be known to the body responsible for preserving the data.
• If not:
  - Irrecoverable loss of data
  - Costly migrations
  - Re-entering data
=> All unpublished formats are eliminated.

PRINCIPAL RULES (1/2)
• Choosing a format that suits the type of information to be preserved:
  - Document appearance: PNG
  - Document meaning: PDF/A
• Format capability to structure data and to introduce high-level abstraction.
FORMAT-2 The format of the data must be suitable for representing the semantics and complexity of the information.

PRINCIPAL RULES (2/2)
FORMAT-3 The use of standard formats is recommended. The use of proprietary elements within a standard format should be avoided.
  - No formal prohibition of published proprietary formats.
  - In the absence of any standard, a format specified by an open collegiate group (W3C, …) should be chosen.
FORMAT-4 If a need to be able to modify the data has been identified, the choice of the data format must take account of this constraint.
  - Not applicable to all categories of data.
  - May be a considerable constraint if required.

ADDITIONAL RECOMMENDATIONS (1/2)
FORMAT-5 The choice of a format must take account of the availability and cost of the tools and other facilities needed to create the data.
FORMAT-6 It must be possible to verify automatically that a data file complies with the format specification, and with the rules and restrictions specified for data preservation.
FORMAT-7 The ability to extract all or part of the metadata from the data is a definite advantage.
FORMAT-8 The use of unnecessarily voluminous formats should be avoided.
FORMAT-9 A simple format is preferable to a complex format.

ADDITIONAL RECOMMENDATIONS (2/2)
FORMAT-10 Widely recognized and used formats should be preferred.
FORMAT-11 The choice of format must take account of the availability and cost of the tools needed to convert between formats and to display the data.
FORMAT-12 The choice of format must take account of the availability and potential of developments in value-added services.
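
The rules above suggest a screening order: FORMAT-1 is eliminatory, FORMAT-2 to FORMAT-4 are principal rules, and FORMAT-5 to FORMAT-12 are recommendations. A minimal Python sketch of that ordering follows. The `Assessment` class, the `evaluate` function, the report layout and the "example-format" entry are illustrative assumptions, not part of the method as presented; in particular, the slides define no numeric scoring, so none is attempted here.

```python
from dataclasses import dataclass, field

# Grouping taken from the slides; everything else in this block is hypothetical.
NECESSARY = ("FORMAT-1",)
PRINCIPAL = ("FORMAT-2", "FORMAT-3", "FORMAT-4")
ADDITIONAL = tuple(f"FORMAT-{n}" for n in range(5, 13))


@dataclass
class Assessment:
    """Records which criteria a reviewer judged a format to satisfy."""
    name: str
    satisfied: set = field(default_factory=set)


def evaluate(assessment):
    """Apply the screening order: the necessary condition first, then the rest."""
    if any(rule not in assessment.satisfied for rule in NECESSARY):
        # FORMAT-1 failed: the format is eliminated, nothing else is examined.
        return {"format": assessment.name, "eliminated": True,
                "unmet_principal": [], "unmet_additional": []}
    return {
        "format": assessment.name,
        "eliminated": False,
        "unmet_principal": [r for r in PRINCIPAL if r not in assessment.satisfied],
        "unmet_additional": [r for r in ADDITIONAL if r not in assessment.satisfied],
    }


# Hypothetical candidate, for illustration only.
candidate = Assessment(
    "example-format",
    {"FORMAT-1", "FORMAT-2", "FORMAT-4", "FORMAT-5", "FORMAT-10"},
)
print(evaluate(candidate))
```

The report deliberately lists unmet criteria instead of computing a score, since weighing them against each other depends on the archiving context.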

PNG FORMAT (Introduction)
Portable Network Graphics, designed to be:
• Raster format
• Simple and portable
• Free of rights
• Lossless compression
• Flexible and robust
• Upwards and downwards compatible
• OS independent

PNG FORMAT (1/4)
FORMAT-1 PNG specifications are published (http://www.libpng.org/pub/png/spec/iso).
FORMAT-2
  - Raster format
  - Lossless compression
  - 'True Color' images up to 48 bits/pixel
  - Gamma correction
  - Progressive display
  - Separation of data (raw image) and display information (filtering, gamma correction, transparency, …)
  - Ability to add text information

PNG FORMAT (2/4)
FORMAT-3 Published standard: ISO/IEC 15948 (2004-03-03), W3C recommendation (2003-11-10).
Internal components:
  - ZLIB: published and free of rights (RFC-1950 / IETF)
  - DEFLATE: published and free of rights (RFC-1951 / IETF)
  - LATIN-1: ISO 8859-1
  - UTF-8: ISO/IEC-10646-1
FORMAT-4 All or part of the data can be modified using an image editor (e.g. The Gimp, Photoshop, …).

PNG FORMAT (3/4)
FORMAT-5 Almost all software on the market supports PNG for display, creation, editing, …
FORMAT-6 Free library of checking tools distributed by the PNG Team (integrity, compliance with the specifications, 'chunks' contents, …). The chunk structure makes it easier to develop utilities for checking any given rule.
FORMAT-7 PNGMETA utility freely distributed by the PNG Team. The chunk structure makes metadata recovery easier.
FORMAT-8 A PNG file is ~30% smaller than an equivalent GIF file.
FORMAT-9 Highly structured, therefore a quite simple format.
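
FORMAT-6 and FORMAT-7 both rest on the PNG chunk structure. The sketch below shows, under stated assumptions, how little code a basic check needs: it walks the chunks, verifies each CRC, reads `tEXt` metadata and lists private chunks. It is not one of the PNG Team tools; the function names are hypothetical. The layout it relies on (8-byte signature, then length / type / data / CRC per chunk, with a lowercase second letter marking a private chunk) comes from the published PNG specification.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_chunks(data):
    """Yield (chunk_type, chunk_data, crc_ok) for each chunk of a PNG byte string."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG stream: bad signature")
    pos = 8
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError("truncated chunk header")
        # Each chunk starts with a 4-byte big-endian length and a 4-byte type.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        crc_bytes = data[pos + 8 + length:pos + 12 + length]
        if len(body) != length or len(crc_bytes) != 4:
            raise ValueError("truncated chunk")
        (stored_crc,) = struct.unpack(">I", crc_bytes)
        # The CRC covers the chunk type and data, not the length field.
        crc_ok = zlib.crc32(ctype + body) == stored_crc
        yield ctype.decode("latin-1"), body, crc_ok
        pos += 12 + length
        if ctype == b"IEND":
            break


def text_metadata(data):
    """Collect keyword/text pairs from uncompressed tEXt chunks (Latin-1)."""
    found = {}
    for ctype, body, crc_ok in iter_chunks(data):
        if ctype == "tEXt" and crc_ok:
            keyword, _, text = body.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
    return found


def private_chunks(data):
    """List chunk types whose second letter is lowercase, i.e. private chunks."""
    return [ctype for ctype, _, _ in iter_chunks(data) if ctype[1].islower()]
```

A preservation rule such as "no private chunks" then reduces to checking that `private_chunks(data)` is empty. Compressed (`zTXt`) and international (`iTXt`) text chunks are left out to keep the sketch short.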

PNG FORMAT (4/4)
FORMAT-10 Preferred alternative to the GIF format since 1996. Supported natively by almost all software on the market.
FORMAT-11 Library of utilities to convert PNG to the 'Portable Pixmap' range of formats, freely distributed by the PNG Team. Image editing software can convert PNG to almost all other widespread image formats.
FORMAT-12 The chunk structure and free libraries make data and metadata extraction easier, which simplifies the development of value-added services.

PNG FORMAT (Conclusion)
• An ideal format for raster image preservation
  - Free
  - ISO/IEC standard
  - Fully published
  - Widely supported
BUT
• Private chunks
  - Non-conformity
  - Loss of information
• Web browser transparency support not always optimal
• Quality of the compression implementation

PDF FORMAT (Introduction)
Portable Document Format, designed to:
• Create, display and exchange electronic documents
• Be independent from creation software and display medium
• Be an object collection
• Contain interactive components and high-level application data
• Represent composite documents (text, images, sound, video, …)

PDF FORMAT (1/3)
FORMAT-1 Published specifications, but property of Adobe (http://partners.adobe.com/public/developer/pdf/index_reference.html).
FORMAT-2 Great richness, able to represent all types of information generally contained in a document.
FORMAT-3 Not a standard. Specifications evolve rapidly (about every 18 months).
Proprietary components may be included.

PDF FORMAT (2/3)
FORMAT-4 Modification via commercial software (often expensive). Data integrity is not guaranteed (non-linearized document).
FORMAT-5 Wide range of free and commercial software available for creation.
FORMAT-6 Currently no tool for checking documents. One may be developed in the same way as for PDF/X.
FORMAT-7 Metadata extraction is possible when using 'Tagged PDF' and XMP metadata.
FORMAT-8 PDF documents are smaller than equivalent Word files.

PDF FORMAT (3/3)
FORMAT-9
  - Not a simple format
  - No more complex than the few other candidates
FORMAT-10 De facto standard. Widely used by a growing number of communities (pre-press, pharmaceuticals, government, …).
FORMAT-11 Large number of software suites available for conversion to text, Word, RTF, Excel, … Conversion is imperfect and the suites are expensive.
FORMAT-12 Suites offer a large number of value-added services. Growing popularity and standardization efforts lead to new developments in value-added services.
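
FORMAT-6 notes that no checking tool existed for PDF at the time. As a rough illustration of what an automatic check could look for, the sketch below scans raw PDF bytes for a few names that signal features an archive might want to exclude (encryption, executable actions, multimedia, embedded files). It is only a heuristic under loud assumptions: the `RISKY_MARKERS` table and `scan_pdf_bytes` are hypothetical, the marker list is far from complete, names hidden inside compressed object streams are missed, and binary stream data can cause false positives. A real validator would parse the document structure.

```python
import re

# Hypothetical marker table: PDF name objects that commonly accompany
# features an archive may wish to exclude. Not an exhaustive list.
RISKY_MARKERS = {
    "encryption": rb"/Encrypt\b",
    "javascript": rb"/(?:JavaScript|JS)\b",
    "launch_action": rb"/Launch\b",
    "multimedia": rb"/(?:Movie|Sound)\b",
    "embedded_files": rb"/EmbeddedFile",
}


def scan_pdf_bytes(data):
    """Return the sorted names of markers found in the raw bytes of a PDF file."""
    # The header is expected at the very start; a small tolerance is allowed here.
    if b"%PDF-" not in data[:1024]:
        raise ValueError("missing %PDF- header")
    return sorted(
        name for name, pattern in RISKY_MARKERS.items()
        if re.search(pattern, data)
    )
```

An empty result does not prove the file is suitable for archiving; it only means none of the listed markers appear in uncompressed form.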

PDF FORMAT (Restrictions on Use)
• Avoid use of multimedia and external objects
• Use only standard character fonts, embedded in the document
• Use only colour spaces independent of the creation or display terminal
• Use only standard compression algorithms
• Prohibit encryption of the contents
• Avoid use of hidden content
• Avoid use of transparency
• Forms must not execute actions
• Use 'Tagged PDF' and XMP metadata

PDF FORMAT (Conclusion)
• Undeniable qualities
  - Composite documents
  - Terminal and OS independence
  - Widely used by large communities
BUT
• Adobe dependence
• Many functionalities incompatible with archiving needs
FUTURE
• PDF/A (ISO 19005-1)
• Other solutions based on XML

CONCLUSION
• A set of rules to evaluate file formats in order to preserve digital data
• A set of recommendations to take account of the issues faced by archiving services
• Making the best available compromise between preservation requirements and access to the information
• Need for an international database maintaining information on formats and their evaluations