Centre de Données de la Physique des Plasmas
How to Evaluate the Ability of a
File Format to Ensure Long-Term
Preservation for Digital Information?
N.Lormant1, C. Huc2, D. Boucon1, C.Miquel1
1 Silogic, 2 CNES
PV 2005 – Edinburgh 21-23/11/2005
SUMMARY
• Introduction
• Criteria for Evaluating a Format
Necessary Condition
Principal Rules
Additional Recommendations
• Case Study 1 : PNG Format
Introduction
PNG / Criteria
Conclusion
• Case Study 2 : PDF Format
Introduction
PDF / Criteria
Restrictions On Use
Conclusion
• Conclusion
INTRODUCTION
• Long-Term Preservation means :
- Storing bit streams on a long-term medium
-Preserving representation information
• Part of the representation information is contained
by the storage format.
• A large variety of formats is generally available to
store a given type of data.
=> We need a methodology to evaluate which format
is the most suitable in a given context.
NECESSARY CONDITION
• Data Format contains a critical part of the representation
information
FORMAT-1
The format of the data must be fully and explicitly
specified. The format specification must be known to
the body responsible for preserving the data.
• If not
- Irrecoverable loss of data
- Costly migrations
- Re-entering data
=> All unpublished formats are eliminated
PRINCIPAL RULES
(1/2)
• Choosing a format that suits the type of information to be
preserved:
- Document appearance: PNG
- Document meaning: PDF/A
• Format capability to structure data and to introduce high-level
abstraction.
FORMAT-2
The format of the data must be suitable for
representing the semantics and complexity
of the information.
PRINCIPAL RULES
(2/2)
FORMAT-3
The use of standard formats is recommended. The
use of proprietary elements within a standard
format should be avoided.
-No formal prohibition of proprietary published formats.
-In the absence of any standard, a format specified by an open
collegiate group should be chosen (W3C,…).
FORMAT-4
If a need to be able to modify the data has
been identified, the choice of the data
format must take account of this constraint.
- Not applicable to all categories of data,
- May be a considerable constraint if required.
ADDITIONAL RECOMMENDATIONS
(1/2)
FORMAT-5
The choice of a format must take account of the
availability and cost of the tools and other
facilities needed to create the data.
FORMAT-6
It must be possible to verify automatically that a data file
complies with the format specification, and with the rules
and restrictions specified for data preservation.
FORMAT-7
The ability to extract all or part of the metadata from
the data is a definite advantage.
FORMAT-8
The use of unnecessarily voluminous formats
should be avoided.
FORMAT-9
A simple format is preferable to a complex format.
ADDITIONAL RECOMMENDATIONS
(2/2)
FORMAT-10
Widely recognized and used formats should be
preferred.
FORMAT-11
The choice of format must take account of the
availability and cost of the tools needed to convert
between formats and to display the data.
FORMAT-12
The choice of format must take account of the
availability and potential of developments in value
added services.
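The twelve criteria can be recorded in a small data structure so that evaluations of several formats stay comparable. The sketch below is only one possible encoding: the 0/1/2 rating scale, the class name `FormatEvaluation` and the two example formats are assumptions made for illustration, not part of the presented methodology. It treats FORMAT-1 as an eliminating condition and reports the principal rules and the additional recommendations separately.

```python
from dataclasses import dataclass, field

# Criterion identifiers come from the slides; the grouping mirrors the
# "necessary condition / principal rules / additional recommendations" split.
NECESSARY = "FORMAT-1"
PRINCIPAL = ("FORMAT-2", "FORMAT-3", "FORMAT-4")
ADDITIONAL = tuple(f"FORMAT-{n}" for n in range(5, 13))


@dataclass
class FormatEvaluation:
    """Hypothetical record of ratings (assumed scale: 0 fails, 1 partial, 2 meets)."""
    name: str
    ratings: dict = field(default_factory=dict)

    def is_eligible(self) -> bool:
        # FORMAT-1 is a necessary condition: an unpublished format is eliminated.
        return self.ratings.get(NECESSARY, 0) == 2

    def summary(self) -> dict:
        # Keep the two groups apart instead of merging them into one score,
        # so a weak principal rule is not hidden by many minor strengths.
        return {
            "eligible": self.is_eligible(),
            "principal": sum(self.ratings.get(c, 0) for c in PRINCIPAL),
            "additional": sum(self.ratings.get(c, 0) for c in ADDITIONAL),
        }


# Illustrative placeholder ratings, not measured results.
published = FormatEvaluation("published-format", {f"FORMAT-{n}": 2 for n in range(1, 13)})
unpublished = FormatEvaluation("unpublished-format", {"FORMAT-2": 2})

print(published.summary())
print(unpublished.summary())
```

Unrated criteria default to 0 here, which is a conservative assumption: a criterion that has not been examined does not count in a format's favour.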
PNG FORMAT
(Introduction)
Portable Network Graphics designed to be:
- Raster format
- Simple and Portable
- Free of Rights
- Lossless Compression
- Flexible and Robust
- Upwards and Downwards Compatible
- OS Independent
PNG FORMAT
(1/4)
FORMAT-1
PNG specifications are published
(http://www.libpng.org/pub/png/spec/iso)
FORMAT-2
Raster format,
Lossless compression,
‘True Color’ images up to 48 bits/pixel,
γ correction,
Progressive display,
Separation of data (raw image) and display
information (filtering, γ correction, transparency, …),
Ability to add text information.
PNG FORMAT
(2/4)
FORMAT-3
Published Standard ISO/IEC 15948 (2004-03-03)
W3C recommendation (2003-11-10)
Internal components :
ZLIB : Published and free of rights (RFC-1950 / IETF)
DEFLATE : Published and free of rights (RFC-1951 / IETF)
LATIN-1 : ISO 8859-1
UTF-8 : ISO/IEC 10646-1
FORMAT-4
Possible to modify all or part of the data using an
image editor (e.g. GIMP, Photoshop, …)
PNG FORMAT
(3/4)
FORMAT-5
Almost all software on the market supports
PNG for display, creation, editing, …
FORMAT-6
Free library of checking tools distributed by the PNG
Team (integrity, compliance with the specifications,
chunk contents, …).
The chunk structure makes it easier to develop utilities
that check any given rule.
FORMAT-7
PNGMETA utility freely distributed by PNG Team.
Chunks structure makes metadata recovery easier.
FORMAT-8
PNG file is ~30% smaller than an equivalent GIF file.
FORMAT-9
Highly structured therefore quite simple format.
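To illustrate why the chunk structure helps with FORMAT-6 and FORMAT-7, here is a minimal sketch that walks the chunks of a PNG byte string, checks each CRC and collects the keyword/value pairs of uncompressed `tEXt` chunks. It is not the PNG Team's checking library or PNGMETA; the function names `iter_chunks`, `text_metadata` and `make_chunk` are hypothetical, and the one-pixel sample image is built in memory only so that the example is self-contained.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_chunks(data: bytes):
    """Yield (type, payload, crc_ok) for each chunk of a PNG byte string."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file: bad signature")
    pos = 8
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError("truncated chunk header")
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 12 + length  # 4 length + 4 type + payload + 4 CRC
        if end > len(data):
            raise ValueError("truncated chunk")
        payload = data[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", data[end - 4:end])
        # The CRC covers the chunk type and the payload, not the length field.
        crc_ok = (zlib.crc32(ctype + payload) & 0xFFFFFFFF) == stored_crc
        yield ctype.decode("latin-1"), payload, crc_ok
        pos = end


def text_metadata(data: bytes) -> dict:
    """Collect keyword/value pairs from uncompressed tEXt chunks (Latin-1)."""
    result = {}
    for ctype, payload, crc_ok in iter_chunks(data):
        if ctype == "tEXt" and crc_ok:
            keyword, _, value = payload.partition(b"\x00")
            result[keyword.decode("latin-1")] = value.decode("latin-1")
    return result


def make_chunk(ctype: bytes, payload: bytes) -> bytes:
    """Helper used only to build the in-memory sample below."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


# 1x1 pixel, 8-bit grayscale sample image with one text chunk.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
idat = zlib.compress(b"\x00\x00")  # filter-type byte followed by one pixel
sample = (PNG_SIGNATURE
          + make_chunk(b"IHDR", ihdr)
          + make_chunk(b"tEXt", b"Title\x00Test image")
          + make_chunk(b"IDAT", idat)
          + make_chunk(b"IEND", b""))

for ctype, payload, crc_ok in iter_chunks(sample):
    print(ctype, len(payload), "CRC ok" if crc_ok else "CRC MISMATCH")
print(text_metadata(sample))
```

A real compliance check would also verify chunk ordering, the IHDR field values and the decompressed image data; the sketch only covers the structural layer that every such tool would share.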
PNG FORMAT
(4/4)
FORMAT-10
Preferred alternative to the GIF format since 1996.
Supported natively by almost all software on the market.
FORMAT-11
Library of utilities to convert PNG to the ‘Portable
Pixmap’ range of formats, freely distributed by the PNG
Team.
Image editing software can convert PNG to almost
all other widespread image formats.
FORMAT-12
The chunk structure and free libraries make data and
metadata extraction easier, which simplifies the
development of value added services.
PNG FORMAT
(Conclusion)
• An ideal format for raster image preservation
Free
ISO/IEC Standard
Fully Published
Widely supported
BUT
• Private chunks
Non-conformity
Loss of information
• Web browser transparency support not always optimal
• Quality of the compression implementation
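The private-chunk risk can be checked mechanically, because the PNG specification encodes chunk properties in the letter case of the four-character chunk type: a lowercase second letter marks a private chunk. The following sketch, with hypothetical function names and a made-up private chunk type `prVt`, flags such chunks from a list of chunk type names.

```python
def chunk_properties(chunk_type: str) -> dict:
    """Decode the four property bits carried by the case of each letter."""
    if len(chunk_type) != 4 or not chunk_type.isascii() or not chunk_type.isalpha():
        raise ValueError(f"invalid chunk type: {chunk_type!r}")
    return {
        "ancillary": chunk_type[0].islower(),      # uppercase means critical
        "private": chunk_type[1].islower(),        # uppercase means public
        "reserved_bit_set": chunk_type[2].islower(),
        "safe_to_copy": chunk_type[3].islower(),
    }


def private_chunks(chunk_types) -> list:
    """Return the chunk types that are not publicly registered."""
    return [t for t in chunk_types if chunk_properties(t)["private"]]


# "prVt" is an invented name standing in for an application-specific chunk.
found = private_chunks(["IHDR", "gAMA", "tEXt", "prVt", "IDAT", "IEND"])
print(found)
```

An archive could reject files for which this list is non-empty, or require that the specification of each private chunk be deposited alongside the data.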
PDF FORMAT
(Introduction)
Portable Document Format designed to :
Create, display and exchange electronic documents
Be independent from creation software and display
medium
Be an object collection
Contain interactive components and high-level application
data
Represent composite documents (text, images, sound,
video, …)
PDF FORMAT
(1/3)
FORMAT-1
Published specifications but property of Adobe
(http://partners.adobe.com/public/developer/pdf/index_reference.html)
FORMAT-2
Rich enough to represent all types of information
generally contained in a document.
FORMAT-3
Not a standard.
Specifications evolve rapidly (~18 months).
Proprietary components may be included
PDF FORMAT
(2/3)
FORMAT-4
Modification via commercial software (often
expensive).
Data integrity is not guaranteed (non-linearized document).
FORMAT-5
Wide range of free and commercial software
available for creation.
FORMAT-6
Currently no tool for checking documents.
May be developed in the same way as for PDF/X.
FORMAT-7
Metadata extraction if using ‘Tagged-PDF’ and
XMP metadata.
FORMAT-8
PDF documents are smaller than equivalent Word files
PDF FORMAT
(3/3)
FORMAT-9
Not a simple format
No more complex than the few other candidates
FORMAT-10 De facto standard
Widely used by a growing number of communities
(pre-press, pharmaceuticals, government,…)
FORMAT-11
Large number of software suites available for
conversion to text, Word, RTF, Excel,…
Imperfect conversion and expensive suites
FORMAT-12 Suites offer a large number of value added services.
Growing popularity and standardization efforts lead
to new developments of value added services.
PDF FORMAT
(Restrictions on Use)
Avoid use of multimedia and external objects
Use only standard character fonts embedded in the
document
Use only colour spaces independent of the creation or
display terminal
Use only standard compression algorithms
Prohibit encryption of the contents
Avoid use of hidden content
Avoid use of transparency
Forms must not execute actions
Use “Tagged-PDF” and XMP metadata
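Some of these restrictions lend themselves to a first automated screening. The sketch below is a crude heuristic, not a PDF parser or a PDF/A validator: it searches the raw bytes for a few PDF name tokens that usually accompany encryption, executable actions, embedded files or multimedia. The token-to-restriction mapping and the name `scan_pdf` are assumptions made for illustration, and names stored inside compressed object streams would go unnoticed.

```python
import re

# PDF name tokens that hint at a restricted feature (illustrative selection).
SUSPECT_TOKENS = {
    b"/Encrypt": "encryption of the contents",
    b"/JavaScript": "executable actions",
    b"/Launch": "executable actions",
    b"/EmbeddedFile": "embedded external objects",
    b"/Sound": "multimedia",
    b"/Movie": "multimedia",
}


def scan_pdf(data: bytes) -> list:
    """Return (token, reason) pairs for suspect names found in the raw bytes."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("not a PDF file: missing %PDF- header")
    findings = []
    for token, reason in SUSPECT_TOKENS.items():
        # Match the name only when it is not the prefix of a longer name.
        if re.search(re.escape(token) + rb"(?![A-Za-z0-9])", data):
            findings.append((token.decode("ascii"), reason))
    return findings


# Hand-written fragments standing in for real files.
encrypted = (b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
             b"trailer\n<< /Root 1 0 R /Encrypt 2 0 R >>\n%%EOF\n")
plain = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n%%EOF\n"

print(scan_pdf(encrypted))
print(scan_pdf(plain))
```

An empty result does not prove compliance; a non-empty one is a cheap signal that the document needs closer inspection before ingest.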
PDF FORMAT
(Conclusion)
• Undeniable qualities
Composite documents
Terminal and OS independence
Widely used by large communities
BUT
• Adobe dependence
• Many functionalities incompatible with archiving needs
FUTURE
PDF/A (ISO 19005-1)
Other solutions based on XML
CONCLUSION
• Set of rules to evaluate file formats in order to preserve
digital data
• Set of recommendations to take account of the issues
faced by archiving services
Making the best available compromise between
preservation requirements and access to the
information
Need for an international database maintaining
information on formats and their evaluations