A Guide to Formats

This guidance relates to:
Stage 1: Plan for action
Stage 2: Define your digital continuity requirements
Stage 3: Assess and manage risks to digital continuity
Stage 4: Maintain digital continuity
This guidance is an addendum to our guidance Evaluating Your File Formats.
The National Archives
A Guide to Formats Version: 1
© Crown copyright 2011
You may re-use this document (not including logos) free of charge in any format or medium,
under the terms of the Open Government Licence. To view this licence, visit
http://www.nationalarchives.gov.uk/doc/open-government-licence/open-government-licence.htm;
or write to the Information Policy Team, The National Archives, Kew, Richmond, Surrey, TW9
4DU; or email: psi@nationalarchives.gsi.gov.uk.
Any enquiries regarding the content of this document should be sent to
digitalcontinuity@nationalarchives.gsi.gov.uk
A Guide to Formats
1. Introduction
    1.1 What is the purpose of this guide?
    1.2 Information sources
2. Plain text
    2.1 Introduction
    2.2 ASCII
    2.3 EBCDIC
    2.4 Unicode
3. Mark-up languages
    3.1 Introduction
    3.2 Hypertext Markup Language (.html, .htm)
    3.3 Extensible Markup Language (.xml)
4. File containers
    4.1 Introduction
    4.2 Zip (.zip)
    4.3 Gzip (.gz)
    4.4 Tar (.tar)
    4.5 OLE2 Compound Document Format
5. Documents
    5.1 Introduction
    5.2 Postscript (.ps)
    5.3 Portable Document Format (.pdf)
    5.4 Open XML Paper Specification (.xps)
    5.5 Microsoft Word 97-2003 (.doc)
    5.6 Open Document Text (.odf .odt)
    5.7 Microsoft Word 2007 (.docx)
    5.8 Microsoft Rich Text Format (.rtf)
6. Spreadsheets
    6.1 Introduction
    6.2 Microsoft Excel 97-2003 (.xls)
    6.3 Microsoft Excel 2007 (.xlsx)
    6.4 OpenDocument Spreadsheet (.ods)
7. Presentations
    7.1 Introduction
    7.2 Microsoft PowerPoint 97-2003 (.ppt)
    7.3 Microsoft PowerPoint 2007 (.pptx)
    7.4 OpenDocument Presentation (.odp)
8. Datasets
    8.1 Introduction
    8.2 Microsoft Access (.mdb)
    8.3 Microsoft Access 2007 (.accdb)
    8.4 Comma Separated Values (.csv)
    8.5 Structured Query Language (.sql)
    8.6 Resource Description Framework (.rdf)
9. Emails
    9.1 Introduction
    9.2 EML (.eml)
    9.3 Microsoft Message (.msg)
    9.4 MBOX (.mbox)
    9.5 Personal Storage Table (.pst)
10. Images (raster)
    10.1 Introduction
    10.2 Windows Bitmap (.bmp)
    10.3 Tagged Image File Format (.tif, .tiff)
    10.4 Graphics Interchange Format (.gif)
    10.5 Portable Network Graphics (.png)
    10.6 Joint Photographic Experts Group (.jpg, .jpeg)
11. Images (vector)
    11.1 Introduction
    11.2 Encapsulated Postscript (.eps)
    11.3 Windows Metafile Format (.wmf)
    11.4 Scalable Vector Graphics (.svg)
12. Audio
    12.1 Introduction
    12.2 Waveform Audio File Format (.wav)
    12.3 Windows Media Audio (.wma)
    12.4 MPEG Layer 3 Audio (.mp3)
    12.5 Advanced Audio Coding (.aac)
13. Video
    13.1 Introduction
    13.2 Moving Pictures Expert Group (.mpg, .mpeg)
    13.3 Windows Media Video (.wmv)
    13.4 Audio Video Interleave (.avi)
    13.5 Flash Video (.flv)
1. Introduction
1.1 What is the purpose of this guide?
To help in evaluating file formats, this guide will present factual information about selected
existing file formats, specifically focussing on the digital continuity risks associated with them.
Formats are broken up into a dozen broad groups covering different types of formats. For each
of these groups there is a discussion of the general properties and issues, followed by more
detail on a sample of specific formats.
There are too many formats to write about all of them, so only a representative sample is
presented here. No preference for a file format should be understood by its inclusion. Formats
were selected on the basis that they are widely encountered, or that they serve as an exemplar
of a general type of format. This guide will not state whether any given format is better or worse
than another, as this can only be determined by evaluating formats in your own context, against
your own business needs and technological environment. Formats which work well in one
context may be inappropriate in another.
A separate piece of guidance, Evaluating Your File Formats, 1 outlines a process by which you
can compare different file formats with one another. This document is an addendum to that
guidance and presents information which is useful in following that process, as well as more
general discussion around particular file formats. In particular, the aspects of file formats which
will be described here include:
• Resilience
    o Any standardisation of the format
    o How old the format currently is
    o Whether the format is textual or binary
    o Whether the format is compressed, encrypted or otherwise obscured
    o Any other recoverability features in the format
• Quality
    o Any known precision issues with the format
    o If the format is ‘lossy’ (i.e. does it discard information)
• Flexibility
    o Whether software currently exists to programmatically access information in the
      file format
    o How much existing software can access the file formats on common platforms
1 See Evaluating your File Formats nationalarchives.gov.uk/documents/information-management/evaluating-file-formats.pdf
It should be emphasised that it is file formats, not software, that are described in this guide. While
it is common to refer to file formats by the software most commonly used to create them, file
formats in principle are software-agnostic, even if in practice (particularly for complex formats)
very few applications can actually access information in the format. The degree to which
software can interact with the file formats described here will be assessed, as this can aid in
understanding the continuity of file formats.
Note that these assessments are, by their nature, quite subjective and you may reach
different assessments when looking at the use of file formats in your own environment. For
example, you may determine that for interoperability, you are only interested in platforms or
applications which appear in your own environment, rather than looking at the full spectrum of
support. Nevertheless, the assessments presented here will provide a useful starting point when
assessing formats.
1.2 Information sources
Please note that information contained in this guidance may become out of date, as new
formats are introduced, further standardisation work is undertaken, or new information comes to
light. The information contained here has been assembled by internet research primarily using
search, software vendor web sites, standardisation bodies, industry news sites and Wikipedia.
2. Plain text
2.1 Introduction
Plain text is not technically a file format, in that there is no formal structure (i.e. format) imposed
on the content. A text file simply contains any characters a creator wishes, in any order. There
are conventions for ending lines, producing layout using tab characters, and other ‘control
codes’, but like text itself, these can be used in any way that the creator desires, without any
formal structure.
Many other file formats are built on top of text, as it is easy to read and work with, so there is a
high degree of flexibility. It is generally trivial to read and write text files in software, subject to
the risks outlined below. However, note that if another format is built using text as a base, then
reading this other format may be non-trivial, even if reading the text it is based on is easy.
Plain text is generally very resilient to corruption: in most encodings, a change to a single
byte alters only a single character, without affecting the rest of the text. There
are no direct quality issues with text (although more complex formats based on text may have
them). Text files themselves are not lossy, and have no precision issues.
The only features of plain text which generally need interpreting are the encoding used by the
text (how the characters are numerically represented), and how the ends of lines in the text are
represented (which can occur in at least two common ways). Both of these are described below.
2.1.1 Encodings
Computers do not understand text directly – they only work on numbers. The encoding of a text
file is the method by which different text characters are numerically represented. For example,
the letter ‘A’ may be represented by the number 1, ‘B’ by the number 2, and so on. There are
many different possible encodings, some of which are not directly compatible with one another.
Encodings differ in at least two principal ways:
1. They may represent different sets of characters from one another, making it impossible
to translate between them if characters not shared by both are used.
2. They may use different methods of encoding the same characters, making translation
between them possible, but requiring knowledge of which encodings are being used to
read and write them correctly.
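Both differences can be seen in a short Python sketch. The sample text and the choice of ISO 8859-1 and UTF-8 as the two encodings are illustrative assumptions, not examples taken from this guidance.

```python
# Illustrative sample text (an assumption for this sketch).
text = "café £5"

# Difference 2: the same characters, encoded by two different methods,
# produce different sequences of bytes.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("iso-8859-1")
print(as_utf8)
print(as_latin1)

# Reading the bytes back with the wrong encoding does not fail here,
# but it silently produces the wrong characters.
misread = as_utf8.decode("iso-8859-1")
print(ascii(misread))

# Difference 1: ASCII has no representation for "é" or "£", so this
# text cannot be translated into ASCII at all.
try:
    text.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)
```

Knowing which encoding was used to write the bytes is what makes the second case recoverable; nothing recovers the first.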
Encodings frequently found (at least, in the Western world) are:
• ASCII (see section 2.2)
• EBCDIC (see section 2.3)
• Unicode (see section 2.4)
Note that modern encodings (e.g. Unicode) are very broad, encoding almost all known
characters in them, so translation to Unicode text is almost always possible from any given
source encoding, but not necessarily vice versa. The vast majority of text files produced in the
UK tend to be ASCII, or Unicode UTF-8.
2.1.2 Encoding risks
Loss of encoding knowledge is the principal long-term continuity risk to text, as the encoding
used by a text file is not usually defined anywhere in the file itself. To determine the encoding of
a text file, there are a few libraries of code available 2 which can make a guess at the encoding
given a sample of the text, but these will require custom software development to use and are
not always correct.
It is always possible to manually open an individual text file using a text editor, specifying which
encoding to use, and to check whether the file opened in that way is readable. Clearly, this
approach does not scale up if there are a large number of text files for which the encoding is not
known. Files found together in the same location will frequently (but not always) use the same
encoding. If a text file was automatically produced by a piece of software, then it is likely that all
the files produced by that software will share a common encoding.
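Where many files need checking, a first pass can be automated. The following is a minimal sketch using only the Python standard library, not one of the detection libraries mentioned above. The function name `guess_encoding` and the default candidate list are hypothetical choices for illustration.

```python
from typing import Iterable, Optional


def guess_encoding(
    data: bytes,
    candidates: Iterable[str] = ("ascii", "utf-8", "cp1252"),
) -> Optional[str]:
    """Return the first candidate encoding that decodes data without error.

    This is only a guess: a successful decode shows that the bytes are
    valid in that encoding, not that the encoding is the one used to
    write the file.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding(b"hello"))      # pure ASCII bytes
print(guess_encoding(b"caf\xe9"))    # not valid ASCII or UTF-8
print(guess_encoding(b"\x81"))       # invalid in all three candidates
```

The order of candidates matters, because narrower encodings must be tried first. Some single-byte encodings, such as ISO 8859-1, accept every possible byte sequence, so they can never be ruled out this way and a human check of the decoded text is still needed.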
If you discover you have a large number of different encodings in use, you should consider
migrating them to a single, modern standardised encoding, such as Unicode UTF-8, assuming
your technological environment and business requirements permit this.
Finally, note that some older encodings use ‘code pages’. 3 Code pages are essentially national
variations on a common base of characters, re-using a few numbers to represent different
specifically national symbols. This is done where the encoding scheme does not permit a wide
enough range of numbers to represent all the characters needed for all nations at once. Each
code page is similar, but not identical to other code pages. For example, a French code page
may have an encoding for é, while the German variation could use the same number to mean ü.
2 For example, see International Components for Unicode at http://site.icu-project.org/
3 See http://en.wikipedia.org/wiki/Code_page
Other common characters will be encoded in the same way in both code pages. A subtle risk is
introduced by code pages which are very similar. For example, the difference between US
and UK code pages is very small, varying in only a few symbols, and this difference may not be
easily detectable – for example, £ signs may be visually transformed into # symbols, but almost
all the other text will be unchanged if opened using the wrong code page.
Hence, knowing the code page (if any) is just as important as knowing the overall encoding. It is
helpful to think of code pages as different encodings from each other in the first place (which
simply happen to share a common base of characters).
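The effect of opening text with the wrong code page can be sketched in Python. The two encodings used here, ISO 8859-1 and IBM code page 437, are assumed stand-ins for any pair of similar code pages; the sample bytes are invented for illustration.

```python
# One byte sequence, assumed to have been written using ISO 8859-1.
raw = b"Total: \xa3100"

# Decoded with the code page it was written in.
as_latin1 = raw.decode("iso-8859-1")

# Decoded with a different code page that shares the same ASCII base.
as_cp437 = raw.decode("cp437")

# Only the one national symbol differs; every other character is identical,
# which is why the mistake is easy to miss.
differing = [(a, b) for a, b in zip(as_latin1, as_cp437) if a != b]
print(len(differing))
print(ascii(as_latin1), ascii(as_cp437))
```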
2.1.3 Line ending risks
There are two common ways to encode line endings in text files. Some text files use an invisible
Line Feed (LF) control code 4 to indicate the end of a line, whereas others use a Carriage Return
(CR) control code followed by a Line Feed, reflecting old requirements of teletype printing
systems. In general, UNIX-like systems produce text files with only an LF to terminate lines,
whereas Microsoft DOS and Windows systems produce text files with CR/LF line endings.
Much software will not process text files properly if they use different line endings than expected.
However, it is easy to translate between them, by simply substituting LF for CR/LF and vice
versa.
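A minimal sketch of this substitution in Python, assuming the file contents are held in memory as bytes and that DOS and Windows line endings are the byte pair CR LF (0x0D 0x0A). The function names are hypothetical.

```python
def to_lf(data: bytes) -> bytes:
    """Convert CR LF line endings to bare LF."""
    return data.replace(b"\r\n", b"\n")


def to_crlf(data: bytes) -> bytes:
    """Convert line endings to CR LF.

    Existing CR LF pairs are normalised to LF first, so that they are
    not turned into CR CR LF.
    """
    return to_lf(data).replace(b"\n", b"\r\n")


mixed = b"first\r\nsecond\nthird\r\n"
print(to_lf(mixed))
print(to_crlf(mixed))
```

Working on bytes rather than decoded text keeps the sketch independent of the encoding for ASCII-compatible encodings; it would not be safe for UTF-16 or UTF-32 data, where each control code occupies more than one byte.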
2.1.4 Migration risks
When migrating text from one encoding to another, the main risk is not understanding either the
source encoding, the target encoding, or the characters in your text you specifically need to
migrate.
In general, older encodings such as ASCII or EBCDIC only support a very limited range of
characters, or implicitly use code pages to support a greater range of characters. It is very easy
to think that you are using one national character set, as most characters are in common, when
in fact there is an occasional character which implies a different encoding (code page). For
example, your text may appear to be in UK English, when in fact it is encoded as US English.
This can lead to some symbols migrating incorrectly when transformed into a wider encoding
such as Unicode.
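The migration itself is a short operation once the source encoding is known. The sketch below assumes UTF-8 as the target; the function names `transcode_to_utf8` and `survives_round_trip` are hypothetical.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode data from source_encoding to UTF-8.

    Strict error handling makes the call raise UnicodeDecodeError on
    bytes that are invalid in the source encoding, rather than silently
    dropping or replacing them.
    """
    text = data.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


def survives_round_trip(data: bytes, source_encoding: str) -> bool:
    """Check that migrating and then converting back reproduces the input."""
    migrated = transcode_to_utf8(data, source_encoding)
    return migrated.decode("utf-8").encode(source_encoding) == data


original = b"caf\xe9"  # assumed to have been written using ISO 8859-1
print(transcode_to_utf8(original, "iso-8859-1"))
print(survives_round_trip(original, "iso-8859-1"))
```

Neither check detects a wrongly assumed source encoding: decoding the same bytes with a similar code page will usually succeed and round-trip cleanly while producing the wrong symbols. Knowledge of the source encoding still has to come from outside the file.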
When choosing a format to migrate to, you must consider your own technological environment
and business requirements. However, all other things being equal, a modern encoding such as
Unicode UTF-8 is generally a good choice, as it can support most characters in use today and
is backwards compatible with earlier standards like ASCII.
4 A control code is an invisible, non-printing character encoded by some number not used for normal text.
2.1.5 Continuity properties of plain text
Category      Property           Assessment
Flexibility   Interoperability   Very high. Line endings may vary between platforms.
              Implementability   Very high. Almost all programming languages can read and write most common text encodings.
Quality       Lossiness          None.
              Precision          No issues.
Resilience    Recoverability     Variable. ASCII, EBCDIC, UTF-8 are very high. UTF-16 is average, and UTF-32 is below average.
              Ubiquity           Very high. Almost all software that needs to can read text in common encodings.
              Stability          Very high. Text encodings are highly standardised and survive unchanged for decades.
2.2 ASCII
The American Standard Code for Information Interchange (ASCII) 5 is very common, and many
other encodings are compatible with ASCII. It was first defined in the early 1960s, and is still in
widespread use today.
However, it only provides a very limited range of characters for the English alphabet. Each
character is represented by a single byte, ranging from 0 to 127 in value. Various attempts to
extend ASCII to cover other alphabets by using up to 256 different characters exist. These are
often described as Extended ASCII, 6 but this is not a single standard encoding.
One set of standardised extended ASCII encodings is the ISO 8859 7 family of encodings. These
provide standard encodings for various language families – for example, ISO 8859-1 for
Western European languages and ISO 8859-2 for Eastern European languages. All of the plain
ASCII character codes are common to these standards, with the regional variations occupying values
equal to or above 128.
5 See http://en.wikipedia.org/wiki/ASCII
6 See http://en.wikipedia.org/wiki/Extended_ASCII
7 See http://en.wikipedia.org/wiki/ISO/IEC_8859
One way to determine if a file is likely to be plain ASCII is to check whether all the bytes in it are
less than 128 in value. Generally, text encoded using other standards will include values equal to
or above this number.
By design of the Unicode creators, ASCII files are also completely valid UTF-8 files (a form of
Unicode encoding – see section 2.4.1). Note that the reverse is not necessarily true, as Unicode
can encode far more characters than ASCII.
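Both points can be checked in a few lines of Python. This is a sketch only; the function name `looks_like_ascii` is hypothetical, and a file that passes the test may still have been written in another ASCII-compatible encoding that simply happens to contain no other characters.

```python
def looks_like_ascii(data: bytes) -> bool:
    """Return True if every byte has a value below 128."""
    return all(byte < 128 for byte in data)


plain = b"Plain text"
accented = "caf\u00e9".encode("utf-8")  # contains bytes of 128 or above

print(looks_like_ascii(plain))
print(looks_like_ascii(accented))

# Bytes that are valid ASCII decode to exactly the same text as UTF-8.
same_text = plain.decode("ascii") == plain.decode("utf-8")
print(same_text)
```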
ASCII files are very resilient, in that a change to a byte, or a loss or addition of a byte only
affects that byte – the rest of the text is never affected by local corruption.
2.3 EBCDIC
EBCDIC 8 encoding is generally found on IBM mainframe computers or in systems which
interact with them. It is similar to ASCII in that it can only represent very few characters, and so
uses code pages to extend it to cover other languages. However, it is not compatible with
ASCII, and itself has several versions which are not compatible with each other. It has been in
use since the early 1960s, but it is not formally standardised, being a vendor-controlled encoding.
Because this encoding has existed for a long time, it is possible to encounter EBCDIC encoded
text files, although this is uncommon outside of an IBM environment. If possible in your business
and technological environment, it is recommended to migrate files out of EBCDIC encodings, as
they are not widely used.
2.4 Unicode
Unicode 9 is an international standard for text which allows the representation of most of the
writing systems in the world, by allowing a much greater number of characters within it and
explicit support for various specialised symbols. It does not need code pages to represent
different characters, as the allowable range of numbers in it is large enough to accommodate
national variations, special purpose symbols and any other character requirements. It was first
developed in 1987, and has been through regular revisions since then, adding support for
increasing numbers of characters and languages.
8 See http://en.wikipedia.org/wiki/EBCDIC
9 See http://en.wikipedia.org/wiki/Unicode
It is closely related to the ISO/IEC 10646 10 standard in that the characters defined in it are the
same in both, but the Unicode standard imposes some additional constraints on how those
characters must be processed. The characters of the ISO 8859-1 encoding represent the first
256 characters of the Unicode standard, to make it easy to convert existing Western European
text and ASCII, in which a large number of text files were originally encoded.
However, there are actually several different possible encodings of Unicode text. The Unicode
standard itself defines the characters which can be encoded, assigning each a number (called a
‘code-point’), and there are then several different ways of actually encoding those code-points as
bytes. A useful comparison of Unicode encodings can be found at:
http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
2.4.1 UTF-8
The most common encoding is UTF-8 11, which is directly compatible with ASCII (all ASCII text is
automatically valid UTF-8, but not vice versa if characters outside the first 128 characters are
used). ASCII compatibility is a useful property given the large volume of existing files using
ASCII.
UTF-8 is known as a multi-byte encoding, in that from 1 to 4 bytes are used to encode each
character. For values under 128, a single byte is used to encode a character. For any other
character, more than one byte is used, with every byte in the sequence having a value of 128 or
above to distinguish it from single-byte characters.
File sizes of UTF-8 encoded files are often relatively small, as the standard only uses as many
bytes as needed for each character, in contrast to other encodings which may use a fixed
number of bytes regardless.
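The variable length can be observed directly in Python. The four sample characters are illustrative choices covering each possible length.

```python
# Sample characters chosen to need 1, 2, 3 and 4 bytes respectively.
samples = {
    "latin capital A": "A",
    "e with acute": "\u00e9",
    "euro sign": "\u20ac",
    "grinning face emoji": "\U0001F600",
}

# Number of bytes UTF-8 uses for each character.
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
for name, length in lengths.items():
    print(name, length)


def all_bytes_high(ch: str) -> bool:
    """Return True if every UTF-8 byte of ch has a value of 128 or above."""
    return all(byte >= 128 for byte in ch.encode("utf-8"))


# Multi-byte characters never contain a byte that could be mistaken
# for a single-byte (ASCII) character.
print(all_bytes_high("\u20ac"))
```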
UTF-8 also has the useful property of being self-synchronising, meaning that the loss, insertion
or corruption of a byte in the text will not generally prevent software from determining where the
other characters begin or end. This keeps any problems localised, and the rest of the text
readable in the face of errors.
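A small Python sketch of this behaviour, assuming the damaged text is decoded with a policy of replacing unreadable bytes (rather than rejecting the whole file). The sample text is invented for illustration.

```python
original = "na\u00efve caf\u00e9".encode("utf-8")

# Simulate the loss of one byte: drop the second byte of the
# two-byte sequence that encodes the accented "i".
damaged = original[:3] + original[4:]

# errors="replace" substitutes U+FFFD (the replacement character)
# for anything that cannot be decoded, then carries on.
recovered = damaged.decode("utf-8", errors="replace")
print(ascii(recovered))
```

Only the damaged character is lost; the decoder finds the start of the next character and the remainder of the text, including the later accented letter, is read correctly.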
10 See http://en.wikipedia.org/wiki/Universal_Character_Set
11 See http://en.wikipedia.org/wiki/UTF-8
2.4.2 UTF-16
UTF-16 12 uses groups of 2 bytes to encode what are called ‘code-units’. It usually uses one
code-unit to encode a single character, but will sometimes use 2 code-units (4 bytes) to encode
a character.
UTF-16 encodings have two variants in terms of the order in which the two bytes of each group
are written, termed the ‘endianness’ of the encoding. These variants are termed ‘Big Endian’ and
‘Little Endian’. Some files specify a Byte-Order-Mark (BOM), which is a 2-byte prefix at the start
of the file which indicates the endianness of the file. However, this is not mandatory, and many
files do not include a BOM.
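Code-units, byte order and the BOM can all be seen in a short Python sketch. The function name `byte_order_from_bom` is hypothetical; the BOM constants come from Python's standard `codecs` module.

```python
import codecs
from typing import Optional

# The same character written in each byte order.
big = "A".encode("utf-16-be")
little = "A".encode("utf-16-le")
print(big, little)

# Most characters need one 2-byte code-unit; some need two.
units_for_a = len("A".encode("utf-16-be")) // 2
units_for_emoji = len("\U0001F600".encode("utf-16-be")) // 2
print(units_for_a, units_for_emoji)


def byte_order_from_bom(data: bytes) -> Optional[str]:
    """Return 'big' or 'little' if data starts with a UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    return None


print(byte_order_from_bom(codecs.BOM_UTF16_LE + little))
print(byte_order_from_bom(little))  # no BOM present
```

When the function returns None, the byte order has to be established some other way, for example from knowledge of the system that produced the file.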
UTF-16 can handle corruptions to individual bytes, re-synchronising on the next valid Unicode
code-point, but the loss of bytes or insertion of additional bytes can cause the succeeding text
to become unintelligible.
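The contrast between a changed byte and a lost byte can be sketched as follows, assuming little-endian UTF-16 without a BOM and an invented sample string.

```python
original = "hello world".encode("utf-16-le")

# Case 1: one byte is corrupted in place (the "e" becomes an "a").
changed = original[:2] + b"a" + original[3:]
changed_text = changed.decode("utf-16-le", errors="replace")
print(changed_text)

# Case 2: the same byte is lost instead. Every later pair of bytes
# is now misaligned, so the rest of the text decodes to unrelated
# characters.
dropped = original[:2] + original[3:]
dropped_text = dropped.decode("utf-16-le", errors="replace")
print(ascii(dropped_text))
```

In the first case only one character is wrong. In the second, only the text before the lost byte is still readable.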
File sizes of UTF-16 encoded text are reasonably small, but are usually larger than the
equivalent text encoded in UTF-8 (depending on which characters appear in the text).
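How the comparison depends on the characters used can be seen by measuring two short samples, one English and one Japanese. Both samples are illustrative choices; the function name `encoded_sizes` is hypothetical.

```python
from typing import Tuple


def encoded_sizes(text: str) -> Tuple[int, int]:
    """Return the size in bytes of text as (UTF-8, UTF-16 without a BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


english = "hello"
japanese = "\u65e5\u672c\u8a9e"  # three characters, "Japanese language"

print(encoded_sizes(english))
print(encoded_sizes(japanese))
```

For text that is mostly ASCII, UTF-8 is about half the size of UTF-16. For scripts whose characters need three bytes each in UTF-8, UTF-16 is the smaller of the two.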
UTF-16 is frequently used internally in software and programming languages to represent
Unicode text, and is not infrequently found in text-files, although it is not as common as the
UTF-8 encoding for storage purposes.
2.4.3 UTF-32
UTF-32 13 is known as a fixed-byte encoding, in that UTF-32 always uses 4 bytes to encode
each character. However, since Unicode allows for adjacent characters to be combined in some
circumstances, this does not lead to a direct relationship between the number of bytes and the
number of displayed characters. The value of a UTF-32 character is the direct numeric value of
its corresponding Unicode code-point.
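Both properties can be checked in Python. The sketch assumes big-endian UTF-32 without a BOM; the sample characters are illustrative.

```python
euro = "\u20ac"
encoded = euro.encode("utf-32-be")

# Always 4 bytes, and those bytes are simply the code-point as a number.
value = int.from_bytes(encoded, "big")
print(len(encoded), hex(value), hex(ord(euro)))

# A letter "e" followed by a combining acute accent: two code-points
# (8 bytes) that are normally displayed as one character.
combined = "e\u0301"
code_points = len(combined)
size_in_bytes = len(combined.encode("utf-32-be"))
print(code_points, size_in_bytes)
```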
Using 4 bytes per character is much less space efficient than UTF-8 or UTF-16, resulting in
much larger file or memory sizes when processing text in this encoding.
UTF-32 can handle corruptions to individual bytes, re-synchronising on the next valid Unicode
code-point, but the loss of bytes or insertion of additional bytes can cause the succeeding text
to become unintelligible.
12 See http://en.wikipedia.org/wiki/UTF-16/UCS-2
13 See http://en.wikipedia.org/wiki/UTF-32
Hence, UTF-32 is less commonly found in text files, and is more commonly used as an internal
representation of Unicode code-points in software.
3. Mark-up languages
3.1 Introduction
Mark-up languages are file formats built on text, which use ‘tags’ inside the format to add
additional structure and meaning to the plain text. For example, we could write:
<Title>Format facts</Title>
<Body>To help in evaluating file formats, ...</Body>
Like the text they are based on, they are also fairly resilient to corruption, and can be opened in
a common text editor, a specialised markup editor or processed programmatically using
commonly available libraries of code. Markup languages themselves are not innately lossy and
have no precision issues in principle (although a lossy format or one with precision issues could
be created using markup).
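As a sketch of programmatic processing, the example tags above can be read with the XML parser in Python's standard library. The enclosing `<Document>` tag is an assumption added for this sketch, because an XML parser expects a single outermost tag; the function name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The example markup, wrapped in an assumed outer <Document> tag.
markup = (
    "<Document>"
    "<Title>Format facts</Title>"
    "<Body>To help in evaluating file formats, ...</Body>"
    "</Document>"
)

root = ET.fromstring(markup)
title = root.findtext("Title")
body = root.findtext("Body")
print(title)
print(body)


def is_well_formed(text: str) -> bool:
    """Return True if text can be parsed as XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(markup))
print(is_well_formed("<Title>Format facts"))  # closing tag missing
```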
Almost all markup languages in use today inherit from a specification known as Standard
Generalized Markup Language 14 (SGML), which itself is no longer in widespread use. Markup
languages in widespread use include:
• Hypertext Markup Language, HTML (see section 3.2)
• Extensible Markup Language, XML (see section 3.3)
3.1.1 Schemas
Markup languages define a specific set of tags used to annotate the text. There may be
constraints on the valid structures of tags – for example, which ones appear next to one another
or how they can be nested within others. The definition of valid tags and their structure is called
a schema. There are many ways to define schemas for markup languages, including Document
Type Definitions 15 (DTD), XML Schemas 16 (XSD) and RELAX NG. 17
Schemas both provide a technical level of documentation on how a format defined using
markup is constructed, and a way to automatically validate that a markup-format conforms to a
specification.
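The idea of validating structure can be illustrated with a hand-written check in Python. This is not DTD, XSD or RELAX NG, which are separate languages processed by dedicated tools; it is a simplified stand-in, and the rule table `ALLOWED_CHILDREN`, the tag names and the function name `conforms` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: for each tag, the child tags it must contain, in order.
ALLOWED_CHILDREN = {
    "Document": ["Title", "Body"],
    "Title": [],
    "Body": [],
}


def conforms(element: ET.Element) -> bool:
    """Return True if element and everything nested in it follow the rules."""
    expected = ALLOWED_CHILDREN.get(element.tag)
    if expected is None:
        return False  # a tag the rules do not define
    if [child.tag for child in element] != expected:
        return False  # wrong, missing or misordered children
    return all(conforms(child) for child in element)


valid = ET.fromstring("<Document><Title>a</Title><Body>b</Body></Document>")
misordered = ET.fromstring("<Document><Body>b</Body><Title>a</Title></Document>")
print(conforms(valid))
print(conforms(misordered))
```

A real schema language expresses rules like these declaratively, alongside much richer constraints, so that any conforming validator can apply them.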
14 See http://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language
15 See http://en.wikipedia.org/wiki/Document_Type_Definition
16 See http://en.wikipedia.org/wiki/XML_Schema_(W3C)
17 See http://en.wikipedia.org/wiki/RELAX_NG
3.1.2 Continuity risks
Formats based on markup languages can be quite complex. Because it is also quite easy to define
file formats based on markup languages (XML-based formats in particular are very common),
a large number of highly bespoke and often poorly documented formats exist.
Without knowledge of the schema used to define the format, data encoded in a markup
language can be hard, if not impossible, to interpret, even though it remains technically
accessible. Where the software that processes the data needs maintenance, or where you
otherwise need to access the data programmatically, you should ensure that you understand
the schemas the data uses. Without this understanding you will be at risk of losing continuity
if the software changes or becomes unavailable.
Understanding a schema implies more than simply knowing which schema was used. While
schemas provide a base level of technical documentation for a markup-based format, they are
generally not sufficient to interpret the meaning of the markup and the data contained within
them. For example, if a tag is called ‘<Creator>’, does this refer to a person, or an organisation?
Does it matter? Are there any other constraints on valid data which the schema does not
capture? You should make sure that you have both schemas and documentation explaining the
intended meaning and constraints of the markup.
3.2 Hypertext Markup Language (.html, .htm)
A well-known example of a mark-up file format is HyperText Markup Language 18 (HTML, HTM),
used to create web pages. The first specification of HTML was made available by 1992. Various
versions of HTML exist, including 2.0, 3.2, 4.0, 4.01, and version 5 is currently in development.
Version 2 is now considered obsolete, as it contains elements which have been dropped in
subsequent versions. From version 3.2 onwards, it is standardised through the W3C (World
Wide Web Consortium).
While HTML itself is standardised, many technologies interpret these standards slightly
differently from one another. In addition, many real-world HTML files do not strictly conform to
the standards. This can create situations where an HTML file will render correctly on one
platform, but not on another. Increasingly, HTML processing technologies are becoming more
consistent, but differences can still be expected.
18 See http://en.wikipedia.org/wiki/HTML
In addition to the HTML standards, there is also a variant called XHTML 19, which is a version of
HTML which is fully compatible with XML (see section 3.3). XHTML is stricter than HTML in the
structures it allows, ensuring that all the markup conforms to a precise standard. Theoretically,
this would make it easier to write applications which work with HTML, which in practice is often
quite loosely written. For example, in real HTML documents, markup tags are not always closed
after they are opened. However, XHTML has not been widely adopted on the web, so in
practice it is necessary to be able to deal with HTML as well.
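The difference between loosely written HTML and strict XML-style markup can be illustrated with a short sketch using only the Python standard library. The sample markup and the TextCollector and parses_as_xml names are hypothetical; the point is only that a lenient HTML parser accepts unclosed tags, while a strict XML parser rejects the same input.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Loosely written HTML: neither <p> tag is ever closed.
LOOSE_HTML = "<html><body><p>First paragraph<p>Second paragraph</body></html>"


class TextCollector(HTMLParser):
    """Collects text content, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def parses_as_xml(markup: str) -> bool:
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


collector = TextCollector()
collector.feed(LOOSE_HTML)
collector.close()
html_text = collector.chunks          # text recovered by the lenient parser
strict_ok = parses_as_xml(LOOSE_HTML)  # False: the tags are mismatched
```

In practice this means that tools written for strict XML cannot be assumed to cope with real-world HTML files.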
Note that HTML files are rarely found on their own; they make reference to external resources
(via hyperlinks, or file includes). These external resources may not be HTML themselves: they
can be images, video, audio, scripts (e.g. JavaScript) or style sheets which affect how the
HTML displays (e.g. CSS files). All of these external resources can affect how the HTML
displays, or even which content is ultimately loaded into it. You should ensure that you
understand which external resources are required to use your HTML files into the future, and
whether they pose any continuity risks of their own.
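One way to start an inventory of external resources is to scan an HTML file for attributes that point elsewhere. The following is a rough sketch using the Python standard library; the choice of the src and href attributes, the sample page and the list_resources helper are illustrative assumptions, and real pages can pull in resources by other routes (for example from within style sheets or scripts).

```python
from html.parser import HTMLParser

# Attributes that commonly point at external resources. This set is an
# illustrative assumption, not an exhaustive inventory.
RESOURCE_ATTRS = {"src", "href"}


class ResourceLister(HTMLParser):
    """Records (tag, target) pairs for attributes that reference resources."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in RESOURCE_ATTRS and value:
                self.resources.append((tag, value))


def list_resources(html: str) -> list:
    lister = ResourceLister()
    lister.feed(html)
    lister.close()
    return lister.resources


# A hypothetical page used only to exercise the sketch.
PAGE = """<html><head>
<link rel="stylesheet" href="style.css">
<script src="app.js"></script>
</head><body>
<img src="logo.png" alt="Logo">
<a href="http://www.nationalarchives.gov.uk/">Home</a>
</body></html>"""

found = list_resources(PAGE)
for tag, target in found:
    print(tag, target)
```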
3.2.1 Continuity properties of HTM and HTML
Flexibility
  Interoperability: Very high. All platforms can process HTML data.
  Implementability: Very high. Almost all programming languages can read and write HTML with ease.
Quality
  Lossiness: None.
  Precision: Very high. Note that HTML is a document layout format intended for screens, not print. It does not generally provide ways to represent precise page layouts.
Resilience
  Recoverability: Very high. As long as the text encoding used is recoverable.
  Ubiquity: Very high.
  Stability: Very high. Although new versions of HTML are occasionally introduced, this process is quite slow and backwards compatibility is largely preserved.
3.3 Extensible Markup Language (.xml)
The most common type of markup language in use today after HTML is known as Extensible
Markup Language (XML), 20 also standardised by the World Wide Web Consortium (W3C). In
fact, SGML and XML are not file formats in their own right. They are a standardised method of
creating file formats using markup. XML defines a set of syntax and rules which markup
languages must obey to be considered valid XML, but the specific tags and structures used in
the XML markup are left up to the format designer. XML files are written using Unicode text.
19 See http://en.wikipedia.org/wiki/XHTML
20 See http://en.wikipedia.org/wiki/XML
Almost all recent document formats are created using XML as a base, as are increasing
numbers of other formats. The advantages include human readability of the underlying files,
ease of extending the information within them and programmatic processing. Almost all (if not
all) programming languages can process XML using widely available libraries, and there are
many different software applications available to author, edit and transform XML directly.
However, due to the ease of creating formats based on XML, there has been a huge increase
in bespoke file formats. To assure their continuity requires detailed knowledge of the XML
schema and other documentation on the meaning of the tags.
Disadvantages of XML compared with binary file formats include larger file sizes, although this
can be largely mitigated by compression (XML files typically compress very well), and difficulty
including binary objects (e.g. an image) within an XML document, since XML is a textual format.
Binary objects can be converted to a textual representation, but this takes up a lot of space, and
files which include such objects do not compress as well.
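Both effects can be seen in a small sketch. It assumes Base64 as the textual representation (a common choice, though not one named in this guidance), uses random bytes to stand in for an already-compressed binary object such as an image, and uses a made-up repetitive XML record.

```python
import base64
import os
import zlib

# Repetitive markup, typical of XML, compresses well.
xml_text = ("<Record><Creator>Example</Creator></Record>\n" * 1000).encode("utf-8")
xml_ratio = len(zlib.compress(xml_text)) / len(xml_text)

# Random bytes stand in for an already-compressed binary object
# (an assumption for illustration only).
binary = os.urandom(3000)
encoded = base64.b64encode(binary)

# Base64 emits 4 text characters for every 3 input bytes,
# so the encoded form is about a third larger than the original.
overhead = len(encoded) / len(binary)

# The encoded text does not compress back down as well as ordinary markup.
encoded_ratio = len(zlib.compress(encoded)) / len(encoded)

print(f"XML compressed to {xml_ratio:.1%} of its size")
print(f"Base64 overhead factor: {overhead:.2f}")
```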
3.3.1 Continuity properties of XML
Flexibility
  Interoperability: Very high. All platforms can process XML data.
  Implementability: Very high. Almost all programming languages can read and write XML with ease.
Quality
  Lossiness: None.
  Precision: No issues. But note that formats defined using XML may have precision issues.
Resilience
  Recoverability: Very high. As long as the text encoding used is recoverable.
  Ubiquity: Very high. But note that formats defined using XML may not be ubiquitous.
  Stability: Very high. But note that formats defined using XML may not be stable.
4. File containers
4.1 Introduction
File container formats are designed to contain other files. They are often used deliberately to
archive files, compress them to save storage space or encrypt them. But it may be less well
understood that they are used to support applications which require more than one file-resource
to be available from within a single file (e.g. documents with embedded images).
4.1.1 Continuity risks
There is a broad continuity risk arising from the use of container formats in the first place. By
placing files in container formats, they become obscured from other information management
tools. It is not enough to know you have a zip or a tar file – you need to know what is inside
them to manage your digital continuity properly. However, note that this is not a continuity risk to
file container formats themselves; it is a continuity risk created by using file container formats.
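Finding out what a container holds is usually straightforward to automate. As a minimal sketch, the Python standard library zipfile module can list the members of a zip file without extracting them; the member names and the build_sample_zip and list_members helpers below are hypothetical.

```python
import io
import zipfile


def build_sample_zip() -> bytes:
    """Build a small in-memory zip; the member names are hypothetical."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("reports/2011/summary.txt", "Summary text")
        archive.writestr("images/logo.png", b"\x89PNG placeholder bytes")
    return buffer.getvalue()


def list_members(zip_bytes: bytes) -> list:
    """Return (name, uncompressed size) for every member of the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


members = list_members(build_sample_zip())
for name, size in members:
    print(name, size)
```

A listing like this could feed the same information management tools that would otherwise see only a single opaque container file.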
In general, file containers tend to be very long-lived formats and widely supported in software.
However, note that few container formats are formally standardised – they are typically de facto
standards, with freely available specifications.
File containers are not innately lossy (they must be able to accurately reconstitute the files
contained within them), although most do not preserve all available file system metadata along
with the files, and some have precision issues (e.g. around date-times).
All of the file container formats described here are binary rather than textual, which makes them
more compact. Some text-based email formats encode file attachments textually (e.g.
EML – see section 9.1), but in general most file container formats are binary. Binary
formats tend to be less recoverable than textual formats (as, aside from other considerations,
they store information more densely, meaning errors have a correspondingly greater impact).
Some container formats include error detection and correction features to aid recoverability,
since their role is to hold other files safely.
4.2 Zip (.zip)
The ZIP file format (ZIP) 21 is one of the most widely used file container formats in use today. It is
a binary format, provides good compression of contained files, is fairly fast to compress and
decompress, and supports several different compression algorithms (including using no
compression). Zip files can be accessed by a wide variety of software and support is found in all
programming environments.
21 See http://en.wikipedia.org/wiki/ZIP_(file_format)
It was first created in 1989, and its specification was released into the public domain. It has not
been formally standardised, although ISO is currently investigating whether it
should produce an ISO standard for the zip specification. However, note that the legal status of
recent versions of ZIP (particularly 64-bit zip) is not clear, and software support is more limited.
These versions add support for strong encryption and file sizes greater than four gigabytes.
File system metadata, such as original file names, folder structure and dates are usually
included in zip files. However, note that the default timestamp in zip files is only accurate to two
seconds, so dates and times will often be slightly different when compared to the original file
system. Also note that no other file system metadata is preserved by default, including file
system permissions.
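The two-second timestamp granularity can be observed directly. This sketch uses the Python standard library zipfile module, which writes only the default timestamp field; the file name and the chosen date are arbitrary examples.

```python
import io
import zipfile

# An odd number of seconds cannot be represented in the default zip
# timestamp, which stores the seconds value divided by two.
original_time = (2011, 3, 14, 9, 26, 53)

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    info = zipfile.ZipInfo("note.txt", date_time=original_time)
    archive.writestr(info, "timestamp test")

# Re-open the archive from its bytes, as a different tool would.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    stored_time = archive.getinfo("note.txt").date_time

print(original_time, "->", stored_time)
```

Any comparison of archived and original modification times therefore needs to allow for a difference of up to two seconds.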
The zip file format is inherently extensible, and some extensions provide for more accurate date
times, and some file system permissions to be preserved. Whether these extensions are used
or not will depend on the zip software in use. If zip software cannot understand an extension
written by different zip software, the standard behaviour is simply to ignore it (while still dealing
with what can be understood). This can lead to metadata loss if different zip software is used to
zip and unzip files. If recovery of file system metadata is important, you should ensure that both
the software used to zip and unzip files can handle the same metadata in equivalent ways.
The zip format provides a measure of integrity protection against corruption, using CRC 22
checksums to detect errors, and it stores two copies of a file directory structure to provide some
redundancy. Having a file listing inside the zip file allows access to each file in the zip
independently of the others, without having to read the entire zip file to access them individually.
Each file inside the zip file is compressed separately, meaning that corruption which affects one
file contained in it may not affect others, assuming the files can still be properly located within
the zip file itself. Tools to repair corrupted zip files are readily available, although repair cannot
be guaranteed.
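The per-file error detection described above can be sketched with the Python standard library. The example deliberately damages one stored member of an in-memory zip and then asks zipfile to test every member; the member names and contents are made up, and ZIP_STORED is used only so that the bytes to damage are easy to find.

```python
import io
import zipfile

GOOD = b"This member is left untouched. " * 4
TARGET = b"This member will be damaged. " * 4

buffer = io.BytesIO()
# ZIP_STORED (no compression) keeps the member bytes visible in the
# archive, which makes it easy to damage one member for this demonstration.
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
    archive.writestr("good.txt", GOOD)
    archive.writestr("target.txt", TARGET)

damaged = bytearray(buffer.getvalue())
position = damaged.find(TARGET)
damaged[position] ^= 0xFF  # flip every bit of one byte of the member data

with zipfile.ZipFile(io.BytesIO(bytes(damaged))) as archive:
    # testzip() reads each member and checks its CRC; it returns the
    # name of the first member that fails, or None if all pass.
    first_bad = archive.testzip()
    # The undamaged member can still be read independently.
    good_content = archive.read("good.txt")

print("First corrupt member:", first_bad)
```

The checksum reports that the member is corrupt but cannot say which bytes are wrong, which is why detection does not imply repair.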
22 See http://en.wikipedia.org/wiki/Cyclic_redundancy_check
4.2.1 Continuity properties of ZIP file formats
Flexibility
  Interoperability: Very high. All platforms can process zip files, although note that the 64-bit zip format is not so interoperable.
  Implementability: Very high. Code to process zip files exists in all major programming languages, although note that the 64-bit format is not as well supported.
Quality
  Lossiness: Almost none. Files are contained completely losslessly. However, note that file system metadata depends on extensions which may not be supported in all zip software.
  Precision: High. Date/times are only accurate by default to two seconds.
Resilience
  Recoverability: Above average. Provides several different recovery mechanisms which permit the zip file itself to be read in the face of limited corruption, and error detection (but not correction) for the contained files themselves.
  Ubiquity: Very high. The standard (not 64-bit) zip format is extremely widespread, and serves as a basis for many other formats.
  Stability: Very high. Zip files from the 1980s can still be processed by current software. It is likely that support for zip files will continue into the indefinite future. However, note that it is not formally standardised.
4.3 Gzip (.gz)
Gzip (GZ) 23 is a compression format which, unlike the other file containers described here,
normally only contains a single file. Where multiple files must be compressed, it is common to
first archive them together using the Tar format (see section 4.4) into a single tar file, then to
compress the tar file using gzip. It provides good compression and is fast to compress and
decompress.
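The archive-then-compress pattern can be sketched in a few lines with the Python standard library tarfile module, whose "w:gz" mode writes a tar archive through gzip compression in one step. The file names and contents below are hypothetical.

```python
import io
import tarfile

FILES = {"a.txt": b"first file", "b.txt": b"second file"}  # hypothetical names

buffer = io.BytesIO()
# Mode "w:gz" writes a tar archive and passes it through gzip compression.
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    for name, content in FILES.items():
        member = tarfile.TarInfo(name)
        member.size = len(content)
        archive.addfile(member, io.BytesIO(content))

blob = buffer.getvalue()

# Reading reverses the two steps: decompress with gzip, then read the tar.
with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as archive:
    restored = {m.name: archive.extractfile(m).read() for m in archive.getmembers()}

print(sorted(restored))
```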
The file format was first released in 1992, and the specification is openly available, although it
has not been formally standardised. It was originally created to work around patents (now
expired) which existed on other compression algorithms at the time.
It consists of a short header, followed by the compressed data, ending with a CRC 24 checksum
and the length of the original file. The checksum and original file length provide some error
detection in the face of corruption, but recovery options are limited.
File system metadata such as dates, folder structure and permissions are not preserved by
gzip. Sometimes the original name of the file is included in the format header.
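The layout described above can be inspected with a short sketch using the Python standard library. It compresses some sample bytes (made up for the example), then reads the two identifying bytes at the start of the header and the final eight bytes, which hold the CRC-32 checksum and the original length as little-endian 32-bit integers.

```python
import gzip
import struct
import zlib

data = b"To help in evaluating file formats, ..." * 50
blob = gzip.compress(data)

magic = blob[:2]  # the two bytes that identify a gzip file

# The last eight bytes are the CRC-32 of the original data followed by
# the original length (stored modulo 2**32), both little-endian.
stored_crc, stored_length = struct.unpack("<II", blob[-8:])

crc_matches = stored_crc == zlib.crc32(data)
length_matches = stored_length == len(data)

print(magic.hex(), crc_matches, length_matches)
```

Because there is a single checksum for the whole file, a mismatch indicates that something is wrong but not where.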
23 See http://en.wikipedia.org/wiki/Gzip
24 See http://en.wikipedia.org/wiki/Cyclic_redundancy_check
It is frequently found on UNIX-like systems, although software to process it on other platforms is
widely available. Support for the format in common programming languages is also
widespread. While not as full-featured as other file container formats, it follows the UNIX
philosophy of doing one job well – compressing a file – leaving bundling files together and
preserving file system metadata as tasks for other tools.
4.3.1 Continuity properties of Gzip file formats
Flexibility
  Interoperability: High. Gzip can be processed on most, if not all, platforms.
  Implementability: High. Code to process gzip files is available on most major programming languages, although not as well supported as zip.
Quality
  Lossiness: Almost none. Files are contained completely losslessly. However, note that file system metadata is generally not preserved by the format.
  Precision: No issues.
Resilience
  Recoverability: Average. The gzip format is so simple, it is hard to break the format itself, and easy to repair if the format is corrupt. However, the recoverability of files contained within it is quite low. Corruption can be detected, but not easily fixed.
  Ubiquity: High. The gzip format is very widespread, although it is mostly found on UNIX-based systems.
  Stability: Very high. The gzip format has survived unchanged for many years, and support is very likely into the indefinite future. It is not standardised, but the specification is openly available.
4.4 Tar (.tar)
The Tar format (TAR) 25 takes its name from ‘Tape Archive’, and is used to append multiple files
sequentially into a single file. It originated in the UNIX operating system, and is still
predominantly found on UNIX-like platforms. It was standardised through the IEEE in 1988 as
POSIX.1-1988, 26 and in POSIX.1-2001. The POSIX standard is also the international standard
ISO/IEC 9945. 27
It does not compress the files contained, or obscure them in any way. Tar files are not lossy in
terms of the files they contain, and do not suffer from precision issues. It is common to find that
tar files are themselves compressed using the gzip file format (see section 4.3). Note that the
files are written out sequentially, one after another (reflecting its origin in tape archiving), and
there is no index of files in a tar file, so knowledge of, and access to, all files in it is not possible
without first scanning across the entire tar file.
25 See http://en.wikipedia.org/wiki/Tar_(file_format)
26 See http://en.wikipedia.org/wiki/POSIX
27 See http://www.unix.org/version3/iso_std.html
Some file system metadata is captured by the tar format, including file names, sizes and the last
modified time (stored in numeric UNIX time format). UNIX-style file permissions are also
captured, although these will not translate to other platforms.
It provides a simple checksum to detect corruption for each file which is stored. However, the
checksum is quite basic, and does not check that the file contents themselves have not been
corrupted, only that the metadata block is correct. Hence, recoverability has several different
dimensions. Repairing a corrupted tar file so it can be read can be relatively straightforward, but
the individual files within it may be corrupt and irreparable, and this may not be evident. On the
other hand, a corruption to one part of a tar file may not impact on the recoverability of other
files contained within it.
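The limited scope of the tar checksum can be demonstrated with a sketch using the Python standard library tarfile module. It builds a one-file archive in memory (the file name and content are made up), then damages first the file contents and then the metadata header. Under the behaviour described above, only the second kind of damage is expected to be reported.

```python
import io
import tarfile

CONTENT = b"Important record content"

buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as archive:
    member = tarfile.TarInfo("record.txt")
    member.size = len(CONTENT)
    archive.addfile(member, io.BytesIO(CONTENT))
raw = buffer.getvalue()

# Case 1: damage the file contents. The header checksum does not cover
# them, so the archive reads back without any error being raised.
body_damaged = bytearray(raw)
body_damaged[raw.find(CONTENT)] ^= 0xFF
with tarfile.open(fileobj=io.BytesIO(bytes(body_damaged)), mode="r:") as archive:
    silently_wrong = archive.extractfile("record.txt").read()

# Case 2: damage the header (the first byte of the stored file name).
# The header checksum no longer matches, so opening the archive fails.
header_damaged = bytearray(raw)
header_damaged[0] ^= 0xFF
try:
    tarfile.open(fileobj=io.BytesIO(bytes(header_damaged)), mode="r:")
    header_error = None
except tarfile.ReadError as error:
    header_error = error

print("Content changed silently:", silently_wrong != CONTENT)
print("Header damage reported:", header_error is not None)
```

Where the integrity of the contained files matters, a separate checksum of each file (or of the whole tar file) would need to be kept alongside the archive.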
4.4.1 Continuity properties of TAR file formats
Flexibility
  Interoperability: High. Tar can be processed on most, if not all, platforms.
  Implementability: High. Code to process tar files is available on most major programming languages, although not as well supported as zip.
Quality
  Lossiness: Almost none. Files are contained completely losslessly. However, note that some file system metadata is not preserved by the format.
  Precision: No issues.
Resilience
  Recoverability: Average. The tar format is simple, with most data in it simply being the files contained as they are with no encryption or compression. Corruption of metadata headers can be detected, but not fixed.
  Ubiquity: High. The tar format is very widespread, although it is mostly found on UNIX-based systems.
  Stability: Very high. The tar format has survived unchanged for many years, and support is very likely into the indefinite future. It is standardised through the POSIX standard.
4.5 OLE2 Compound Document Format
The OLE2 Compound Document Format 28 is slightly different to the other file container formats
presented here, in that it is not used as a consumer container format, and tools to manipulate
OLE2 are not widely available. However, it is an important container format, in that it serves as
a base container for almost all binary Microsoft file formats.
Hence, it is unlikely that anyone will ever need to use or choose an OLE2 file format directly,
so it is unlikely to raise direct continuity issues of its own. However, to avoid replicating information
about OLE2 in all the Microsoft binary format descriptions, some information on this key
underlying format is provided here.
Programmatic code to access this format can be found, albeit not always well supported on all
platforms.
Since OLE2’s role is not to archive files from an external file system, but to allow applications to
store and manage multiple resources in a single file, it does not typically preserve file system
metadata at all. However, it is possible to set a file date and time for each contained file if
required.
OLE2 has a complex internal structure, allowing files and folders to be created within it. It
attempts to re-use space as files or folders are changed or deleted, leading to internal
fragmentation of its resources (much as files can become fragmented on a disk). While this
reduces the space required for formats based on OLE2, it also reduces the recoverability of files
based on the format: the contained files are interleaved, so the file indexes are always needed
to reassemble them. A single corruption to the file can prevent the entire file being read
successfully. It provides no built-in error detection or repair.
28 See http://download.microsoft.com/download/0/b/e/0be8bdd7-e5e8-422a-abfd-4342ed7ad886/windowscompoundbinaryfileformatspecification.pdf
4.5.1 Continuity properties of OLE2 Compound Document Format
Flexibility
  Interoperability: Very low. It is not directly used as a consumer container format. However, applications which make file formats on top of this format may have a high interoperability.
  Implementability: Low. Some code to access OLE2 files directly can be found, but it may not be well supported, and may not work in all programming environments.
Quality
  Lossiness: None. All files contained within an OLE2 file are stored losslessly. No file system metadata is preserved.
  Precision: No issues.
Resilience
  Recoverability: Very low. Corruption cannot be detected, and a single corruption can prevent all the files within it being read.
  Ubiquity: Very high. The format serves as a base container format for almost all Microsoft binary formats.
  Stability: Very high. The format has not changed in a long time, and being a base for almost all Microsoft binary formats ensures it will remain supported for some time to come.
5. Documents
5.1 Introduction
Document file formats are among the most common types of file format encountered. There is a
wide variety of document file formats in use today, which fulfil different needs. This guidance will
not describe older document formats no longer in widespread use (although there are many of
these).
5.1.1 Document format types
There tends to be a basic division between page-oriented document formats aimed at print-perfect
layout and those aimed at user editing. Page-oriented document formats are suitable for
publication, but are not suitable where the document needs to be further changed.
Page-oriented formats
• Postscript (PS): see section 5.2
• Portable Document Format (PDF): see section 5.3
• Open XML Paper Specification (XPS): see section 5.4
User-editable formats
• Microsoft Word 97-2003 (DOC): see section 5.5
• Open Document Format Text (ODF, ODT): see section 5.6
• Microsoft Office Open XML (DOCX): see section 5.7
• Microsoft Rich Text Format (RTF): see section 5.8
5.1.2 Complexity risks
Digital documents are often imagined to be quite simple, as they largely consist of text on
pages, replicating physical paper documents which are easily understood. However, in reality
they are extremely complex file formats. The more complex a format, the harder it is to re-use
the data in other contexts, access data in it programmatically, or to migrate to different formats.
The risk of vendor lock-in is substantially increased.
Documents may have many different resources embedded within them, including images, video
and even audio. Spreadsheets or other complex formats may also be directly embedded within
them. They may have programmatic code (e.g. ‘macros’), which perform tasks on the content or
access external data sources. Typically, programmatic code embedded in documents does not
survive migration to other formats, as the code language is usually non-standard and heavily
oriented towards the primary creating application.
Some user-editable document formats track changes to the content (but usually not all kinds of
content), and allow review and commenting of the content by different parties. User-defined
fields may exist to contain defined data (e.g. to support mail-merge functionality). Many
document formats have specifically defined fields to hold user metadata, such as the author of a
document. They may also have embedded dependencies on external data (e.g. a link to
another file on a disk, which can break if either file is moved), and cross-links within the
document which can also break.
Some features of document file formats only exist to preserve backwards compatibility with
documents written in earlier formats. While this mitigates some continuity risks, it also further
increases the complexity of the formats going forwards.
5.1.3 Migration risks
All document migration carries risk, due to the complexity of document formats. It is entirely
normal that a document migration will lose or change some features of the original, unless the
document is very simple. In many cases, the change or loss can be quite minimal and may not
be considered vital (e.g. the style of a heading changes slightly). However, it is essential that all
document migrations are tested thoroughly on a selected set of candidate documents, to assure
that essential features are not lost in the process. Document migration can be largely separated
into three broad types of migration, which typically carry different risks:
• within a family of file formats (e.g. Microsoft Word 95 to Microsoft Word 97-2003)
• across format families (e.g. Microsoft Rich Text Format to OpenDocument Text 1.1)
• from a user-editable to a page-layout format (e.g. OpenDocument Text 1.1 to PDF 1.7).
Within a family of file formats
Upgrading within a family of file formats generally poses few direct continuity risks, as most file
formats are specifically engineered to be backwards-compatible with earlier versions of the
‘same’ format. However, migration is never risk free, and some small changes to documents
may be found – e.g. styles and formatting may change. By contrast, downgrading to earlier
versions may entirely lose formatting, embedded objects, programmatic code or other advanced
features depending on what is supported in the earlier versions. The textual content itself is
usually preserved when downgrading.
Across format families
Migrating from one broad type of document file format to an entirely different one poses the
highest direct continuity risks. No two broad families of document file format support exactly the
same features, in the same ways, so some change and loss to a document should be expected.
For example, the Microsoft family of document formats fundamentally manages the pagination
of documents (replicating a paper-model of documents), whereas the OpenDocument Text
family of formats largely leaves pagination up to the rendering software (given that it is a digital
document which may be printed or displayed at different sizes), and does not therefore store
this information in the format. Therefore translation between them may produce pagination
changes.
In general, migration between recent versions of most document formats will produce
documents which are still readable, but with some formatting changes. However, advanced
features such as embedded programming (‘macros’) and change-tracking will often not survive
the process.
From user-editable to page-layout
A frequent use-case in document workflows is taking a user-editable document and migrating it
to a page-layout format, either for publication or archiving. This process will generally produce a
high-quality output document which preserves the layout and styles of the original. However, all
advanced interactive features will generally be lost (since this is the fundamental difference
between user-editable and page-layout formats).
Some page-layout formats may faithfully replicate the look of a document, but may incidentally
lose other features that are still required. For example, the PDF format can store text in a way
which can be rendered absolutely accurately on screen or paper, but is not electronically
searchable. If the ability to copy and paste out of the document is important, attention should be
paid to how the text can be further manipulated in the page layout format. Some page-layout
formats make it hard to select and copy text out of them (e.g. columns are not properly
wrapped, mixing up text from several columns when it is selected out of the document).
While page-layout formats are very useful for human readability of documents, it is normal that
some form of digital access to the content will still be required. Special attention should be paid
to the features used in the page-layout format and your business requirements for ongoing use
of the information.
5.2 Postscript (.ps)
Postscript (PS) 29 is one of the oldest page layout formats, which has its origin as a printer page
specification language, developed by Adobe Systems and first issued in 1984. It is also used
widely to publish electronically, particularly for academic papers, although Portable Document
Format (PDF) is now supplanting it for most purposes.
29 See http://en.wikipedia.org/wiki/PostScript
Postscript is a textual format, although not a mark-up language, consisting of a series of
programmatic commands to lay out graphics and text. Postscript can only handle numbers up to
a precision of nine decimal digits, so calculations made using its programming language can
produce rounding errors. Most people will not encounter this issue if simply saving documents in
a postscript format – however, advanced users of postscript should be aware of this limitation in
the format.
It is not an international standard, although it has the status of a de facto standard, as it is still in
widespread use and there are many legacy documents written in it. There are three versions of
Postscript – level 1, level 2 and version 3, and the specification is freely available from Adobe
Systems. A large variety of software can read and produce postscript documents, on most
computer platforms.
5.2.1 Continuity properties of Postscript
Flexibility
  Interoperability: High. Postscript is readable on all platforms.
  Implementability: High. Code to manipulate postscript can be found in most programming environments.
Quality
  Lossiness: None.
  Precision: Some issues. Numbers are only represented to a precision of nine decimal digits, potentially creating rounding errors if calculations are performed using the postscript programming language.
Resilience
  Recoverability: Average. Being a textual format, small corruptions to postscript files will often not prevent the file being opened, but no specific error detection or recovery mechanisms are part of the format.
  Ubiquity: Very high. Postscript files are very widespread, and are still in active use, but note that many early uses of postscript are being replaced by PDF.
  Stability: Very high. Postscript files are largely unchanged since they were first specified, and support for the format is likely to be found into the foreseeable future.
5.3 Portable Document Format (.pdf)
Portable Document Format (PDF) 30 is an extremely widely used format for electronic publishing,
also created by Adobe Systems. PDF consists of a subset of Postscript (see section 5.2), along
with other technologies for embedding fonts and storing additional data. Although much of the
content of a PDF file can appear as text, it is a binary format and includes support to compress
parts of the data it stores, and to encrypt its contents. Therefore a PDF file may be more or less
recoverable depending on exactly how the particular file was written out.
Although initially a closed, proprietary format, it was made an open international standard ISO
32000-1:2008 in 2008, which anyone may implement freely without payment of royalties. PDF
files are accessible on almost every platform: a huge range of software can read them, and a
substantial body of software can create them. However, because PDF is a page-oriented format, it
is often difficult or impossible to edit a file once it has been created. Many Software
Development Kits are available to manipulate PDF files on all major platforms.
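As an illustration of how a PDF file can be identified programmatically, the sketch below reads the version from the header comment (for example %PDF-1.4) that PDF files normally begin with. The function name pdf_header_version is hypothetical, and the sketch assumes the header sits at the very start of the data; it does not look at any later version override stored inside the document itself.

```python
import re


def pdf_header_version(data: bytes):
    """Return the version named in a leading PDF header (e.g. '1.4'), or None.

    Assumption: the header comment is at byte 0, as the specification requires.
    Some real-world files carry leading junk that this sketch will not tolerate.
    """
    match = re.match(rb"%PDF-(\d+\.\d+)", data[:16])
    return match.group(1).decode("ascii") if match else None
```

A result of None suggests the file is either not a PDF or has a damaged header, and is worth inspecting by hand.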
There are nine separate versions of the PDF specification dating back to 1993, the most recent
being released in 2009. PDF is now a very complex standard, including many features which go
beyond a simple page layout specification. For this reason, targeted subsets of the PDF
standard have been defined, simplifying and removing unnecessary features, standardised
under the International Standards Organisation. These are:
• PDF/X for the printing and graphic arts (ISO 15930)
• PDF/A for archiving documents (ISO 19005)
• PDF/E for exchange of engineering drawings (ISO 24517)
5.3.1
Continuity properties of PDF
Flexibility
    Interoperability: Very high. PDFs can be accessed on all platforms.
    Implementability: Very high. Code to read and write PDFs is available for most programming environments.
Quality
    Lossiness: None. The PDF format does not discard information given to it. However, you may lose functionality when moving from a user-editable format to a page-oriented format.
    Precision: None.
Resilience
    Recoverability: Average. PDF is a binary format, although much of its content can appear directly as text which if changed would not prevent the file being accessed. Sometimes the content can be compressed or encrypted, which reduces its recoverability.
    Ubiquity: Very high. PDF files are found on all platforms and have been around for a long time.
    Stability: High. The format is an international standard, but note that there are many different versions and subsets of it defined, and more may be defined in future.
30 See http://en.wikipedia.org/wiki/Portable_Document_Format
5.4
Open XML Paper Specification (.xps)
The Open XML Paper Specification (XPS) 31 is an XML-based page layout specification format
created by Microsoft and later standardised through Ecma International as ECMA-388 in 2009.
It consists of XML files (see section 3.3) and other media resources contained in a zip format
(see section 4.2) archive file. Since the file is compressed, damage to the file can result in being
unable to open the file, so recoverability in the face of corruption may be limited. However, note
that there are zip repair tools available which may make it possible to recover a corrupted XPS
file.
This format is not in widespread use as an electronic publishing format, but XPS files are
supported natively on Microsoft Windows Vista, being part of its printing system. Viewers,
converters and Software Development Kits are available on other versions of Windows, and on
some other platforms including Mac OS X and Linux, although support on these platforms is not
as well developed.
5.4.1
Continuity properties of XPS
Flexibility
    Interoperability: Low. It is mostly only supported on recent Microsoft Windows platforms, although software to access it on other platforms can be found.
    Implementability: Low. Code to manipulate this format is not widely found in many programming environments.
Quality
    Lossiness: None.
    Precision: None.
Resilience
    Recoverability: High. The format is an XML-based format, meaning small errors may only produce small content changes, or errors which are easily fixable. However, no specific error detection or correction is included in the format.
    Ubiquity: Low. The format is mostly only found on recent Microsoft Windows platforms.
    Stability: High. Even though the format is not widely used outside of recent Microsoft Windows platforms, support for it is likely to be found for many years into the future. It has been standardised through ECMA.
31 See http://en.wikipedia.org/wiki/Open_XML_Paper_Specification
5.5
Microsoft Word 97-2003 (.doc)
The Microsoft Word 97-2003 (DOC) 32 format is the de facto standard for user-editable business
documents in use today. As its name suggests, it first appeared in 1997, and was used as the
default document format until 2003, after which several new formats appeared. It is still
supported by all major user-editable document software on all platforms.
The format has not been formalised through a standards body, but the specification is now made
available by Microsoft, and it is supported to some degree on almost every platform. Application
Programming Interfaces and Software Development Kits are widely available.
However, the format has several advanced features which are fully supported only on Microsoft
platforms, including programmatic scripts and macros. In addition, DOC files can embed other
objects which may require additional software to be installed to access them.
It is a binary format, consisting of various document resources embedded in an OLE2 container
format (see section 4.5). OLE2 files can be hard to recover in the face of corruption, as they
have a complex and fragmented internal structure.
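Because OLE2 and zip containers look nothing alike at the byte level, a file's container type can usually be guessed from its first few bytes. The sketch below is illustrative only: the function name sniff_container is hypothetical, and it relies on the eight-byte signature that OLE2 compound files start with and the four-byte local file header signature that opens a non-empty zip archive. It identifies the container, not the document type held inside it.

```python
# Signature at the start of an OLE2 compound file (used by DOC, XLS and PPT).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature at the start of a non-empty zip archive
# (used by ODF, DOCX, XLSX and XPS packages).
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(data: bytes) -> str:
    """Guess the container format from the leading bytes of a file."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

A file with a .doc extension that reports "unknown" may be damaged at its start, or may be a different format that has been misnamed.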
5.5.1
Continuity properties of Microsoft Word 97-2003
Flexibility
    Interoperability: Very high. Almost all platforms can read and write this format.
    Implementability: High. Many programming environments can access information in this format.
Quality
    Lossiness: None.
    Precision: No issues.
Resilience
    Recoverability: Low. The binary OLE2 format on which it is based is hard to recover in the face of corruption, and there are not many tools to do so.
    Ubiquity: Very high. The format is found almost everywhere.
    Stability: Very high. Although not formally standardised, its status as a de facto standard ensures that support for the format will be found many years into the future.
32 See http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
5.6
Open Document Text (.odf .odt)
The Open Document Text Format (ODF/ODT) is a user-editable document format, consisting of
various XML-based files (see section 3.3) and other document resources, such as image files,
contained in a zip (see section 4.2) file. Since the file is compressed, damage to the file can
result in being unable to open the file, so recoverability in the face of corruption may be limited.
However, note that there are zip repair tools available which may make it possible to recover a
corrupted ODF file.
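One inexpensive check on a zip-based document is to verify the checksum that the zip format stores for each member. The sketch below uses the testzip method of Python's standard zipfile module for this; the wrapper name first_damaged_member is hypothetical. It detects damage rather than repairing it, and a file whose zip directory is itself unreadable will raise zipfile.BadZipFile instead of returning a name.

```python
import io
import zipfile


def first_damaged_member(data: bytes):
    """Return the name of the first member that fails its CRC check.

    Returns None when every member reads back cleanly. Raises
    zipfile.BadZipFile if the data cannot be opened as a zip archive at all.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.testzip()
```

The same check applies to any of the zip-based formats in this guide, since it only examines the container and not the XML inside it.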
Originally created by Sun Microsystems, it is now developed through the Organisation for the
Advancement of Structured Information Standards (OASIS). There are two versions of the
standard published: 1.0 and 1.1, with a new version, 1.2, near to completion. Version 1.0 was
also standardised in 2006 through the International Standards Organisation as ISO 26300.
The Open Document family of standards (including documents, spreadsheets, presentations
and drawings) are designed to be highly re-usable and interoperable. Software Development
Kits and Application Programming Interfaces are widely available on all major platforms.
It is possible to access ODF documents in Microsoft Word, Open Office and many other
applications, although note that minor changes to formatting may occur in different applications
opening the same file.
5.6.1
Continuity properties of Open Document Text Format (ODF/ODT)
Flexibility
    Interoperability: High. Most platforms can read OpenDocument Text format.
    Implementability: High. Most programming environments can access information in OpenDocument Text format.
Quality
    Lossiness: None.
    Precision: No issues.
Resilience
    Recoverability: Above average. The zip format it is based on provides several error detection and recovery mechanisms, and the use of XML-based textual content is likewise fairly recoverable in the face of errors.
    Ubiquity: Above average. ODF/ODT files are widely found, but they are not the dominant document format.
    Stability: Very high. The OpenDocument Text format is highly standardised, and backwards compatible with earlier versions.
5.7
Microsoft Word 2007 (.docx)
The Microsoft Office Open XML format 33 (DOCX) is a user-editable document format, consisting
of various XML-based files (see section 3.3) and other document resources, such as image
files, contained in a zip file (see section 4.2). Since the file is compressed, damage to the file
can result in being unable to open the file, so recoverability in the face of corruption may be
limited. However, note that there are zip repair tools available which may make it possible to
recover a corrupted DOCX file.
The specification has been standardised through Ecma International as ECMA-376, is the
default file format for Microsoft Word 2007, and can be read using plug-ins in earlier versions of
Microsoft Office. There is some support available on other platforms, which is increasing over
time as more documents are exchanged in this format. Application Programming Interface
support is still largely confined to Microsoft platforms, although code to access it on other
platforms is increasing.
In 2007 Microsoft submitted DOCX to the International Standards Organisation. However, the
format as implemented in Office 2007 was not agreed for standardisation, as it included many
Microsoft-specific legacy technologies which were not deemed suitable for inclusion. The result
of this process was two standards, published in 2008, largely based on DOCX but not
compatible with it:
• ISO 29500 Transitional
• ISO 29500 Strict
ISO 29500 Transitional is intended as an interim standard, to allow migration of legacy Microsoft
documents, by including features relating to implementation-specific details of earlier versions of
Microsoft Office. Note that the ISO committee reserves the right to remove the ‘Transitional’ set
of features from the standard at some point in the future. Microsoft Office 2010 is the first
software to implement read and write support for this variant.
ISO 29500 Strict is intended as a standard for new documents, removing the Microsoft-specific
legacy features which were deemed unacceptable. Only read support for the ‘Strict’ variant will
be included in Microsoft Office 2010; no software can currently write documents conforming to
this standard.
At present, DOCX documents are not ISO 29500 documents, although they are valid ECMA-376
documents. There is a proposal before the ISO committee to amend the ‘Transitional’
standard so that existing DOCX files become compatible with it.
5.7.1
Continuity properties of DOCX
Flexibility
    Interoperability: High. Most platforms can read DOCX files.
    Implementability: Average. Software to programmatically access information in DOCX files is mostly confined to the Microsoft platform, although support in other environments is growing.
Quality
    Lossiness: None.
    Precision: No issues.
Resilience
    Recoverability: Above average. The zip format it is based on provides several error detection and recovery mechanisms, and the use of XML-based textual content is likewise fairly recoverable in the face of errors.
    Ubiquity: Above average. DOCX files are widely used, but they are not the dominant document format, which is still DOC.
    Stability: Unclear. Although subject to several different standardisation processes, these have not resulted in a single standard, and instead produced several different and incompatible standards, which are not yet supported in software. The status of the format is unclear going forward into the future – it may be replaced by one of the newer standards, or the standards may be changed to make existing documents compatible with them. However, since there are a large number of files encoded in the current format, support for it is likely to be found into the near future.
33 See http://en.wikipedia.org/wiki/Office_Open_XML
5.8
Microsoft Rich Text Format (.rtf)
Microsoft Rich Text Format (RTF) 34 is a widely used document format developed by Microsoft in
1987. It has limited features compared with more recent formats, but is implemented on all
major platforms and can serve as a simple document interchange format.
34 See http://en.wikipedia.org/wiki/Rich_Text_Format
RTF is a textual format, so recoverability in the face of corruption is reasonably good. It consists
of a series of nested brackets and control codes surrounding the text, so it is essentially a markup language (see section 3).
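Because RTF groups are delimited by nested brackets, a simple structural check is to confirm that every opening bracket has a matching close. The sketch below is a minimal illustration, not a full RTF parser: the function name rtf_groups_balanced is hypothetical, it treats backslash-escaped brackets and backslashes as literal text, and it assumes the file contains no raw binary sections.

```python
def rtf_groups_balanced(text: str) -> bool:
    """Return True if the { } groups in an RTF string are properly nested."""
    depth = 0
    i = 0
    while i < len(text):
        ch = text[i]
        # An escaped bracket or backslash is literal text, not group structure.
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in "{}\\":
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False  # a group was closed that was never opened
        i += 1
    return depth == 0
```

A truncated file will typically fail this check, because the closing bracket of the outermost group is the last thing written.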
It has not been standardised through a formal body, although the specifications are freely
available from Microsoft. There are ten major versions of the format in existence, the earliest
(version 1.0) being issued in 1987, and the most recent (version 1.9.1 35) being published in
2008.
It is not possible to determine which version of RTF is being used without analysing all of the
features contained in a given document, as the documents themselves do not specify the
version being used. In the past this has made it hard to fully support RTF without continual
maintenance, as the specification was a moving target. However, Microsoft does not now
anticipate making further substantive changes to the last specification.
5.8.1
Continuity properties of RTF files
Flexibility
    Interoperability: Very high. RTF files can be accessed on most platforms.
    Implementability: Very high. Programmatic access to the RTF format is found in most programming environments.
Quality
    Lossiness: None.
    Precision: No issues.
Resilience
    Recoverability: High. The format is a simple textual mark-up-like format, although it does not provide any specific error detection or recovery mechanisms.
    Ubiquity: Very high. RTF files are widely found and still in active use as a simple document interchange format.
    Stability: High. Although not formally standardised, the specifications are openly available, and Microsoft has indicated that it does not intend to make further changes to the specification.
35 See www.microsoft.com/downloads/en/details.aspx?familyid=dd422b8d-ff06-4207-b476-6b5396a18a2b&displaylang=en&tm
6.
Spreadsheets
6.1
Introduction
Spreadsheets are ubiquitous in business, having expanded from their primary role as number-crunchers to become a convenient way of organising tabular (and often non-numeric)
structured data. Spreadsheet formats are not as numerous as document formats, although
there have been many since their first widespread use in VisiCalc in 1979. The formats
described here include:
• Microsoft Excel 97-2003 (XLS), see section 6.2
• Microsoft Excel 2007 (XLSX), see section 6.3
• OpenDocument Spreadsheet (ODS), see section 6.4
Spreadsheets are not lossy in any way, although all have precision issues of some degree,
since they are primarily intended to compute numbers. The degree of precision supported by
each spreadsheet format (and the software which processes them) determines how large any
unavoidable rounding errors may be.
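To make the idea of an unavoidable rounding error concrete, the sketch below measures the difference between a number as typed and the value actually held once it is stored as a double-precision floating point number. It assumes the platform uses IEEE 754 doubles, as standard Python does; the function name rounding_error is hypothetical, and real spreadsheet software may apply further display rounding of its own.

```python
import sys
from decimal import Decimal

# Decimal digits a double is guaranteed to preserve (15 for IEEE 754 doubles).
GUARANTEED_DIGITS = sys.float_info.dig


def rounding_error(text: str) -> Decimal:
    """Return stored value minus typed value for a decimal literal.

    Decimal(float(text)) expands the stored binary value exactly, so a
    non-zero result means the number could not be stored as typed.
    """
    return Decimal(float(text)) - Decimal(text)
```

For example, 0.5 is stored exactly, 0.1 is not, and the 16-digit integer 9007199254740993 is stored as the neighbouring value 9007199254740992.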
6.1.1
Complexity risks
Modern spreadsheets, like documents, carry text, formatting, embedded objects (e.g. images),
links to external resources and embedded programming languages (e.g. macros). Again, like
documents, they often have features intended to preserve backwards compatibility with older
formats, which mitigates some continuity risk while increasing the complexity going forward.
6.1.2
Migration risks
The migration risks of spreadsheets are also similar to those of documents, in that there are
three common migration use-cases:
• within a family of file formats (e.g. Microsoft Excel 95 to Microsoft Excel 97-2003)
• across format families (e.g. Microsoft Excel 2007 to OpenDocument Spreadsheet 1.1)
• from a spreadsheet to a page-layout document format (e.g. OpenDocument Spreadsheet 1.1 to PDF 1.7).
Note that a generic risk in moving from any spreadsheet file format to another spreadsheet file
format lies in the number of rows and columns supported in the format. Early spreadsheet
formats are often limited, supporting (for example) only 65,536 rows in Microsoft Excel 97-2003.
Modern spreadsheet formats typically support 250,000 rows or more. For many spreadsheets this
will not be an issue, but for spreadsheets in which large amounts of tabular data have been
compiled, you should check whether you will exceed the row or column limit for the format you
are migrating to.
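The row and column check described above can be sketched as a simple lookup. The figures used are the worksheet limits of Microsoft Excel for the XLS and XLSX formats; limits for other formats depend on the software used and are deliberately left out. The names LIMITS and fits are hypothetical.

```python
# Maximum (rows, columns) per worksheet. These are Microsoft Excel's limits
# for each format; other software may impose different limits.
LIMITS = {
    "xls": (65_536, 256),
    "xlsx": (1_048_576, 16_384),
}


def fits(rows: int, columns: int, target: str) -> bool:
    """Return True if a sheet of the given size fits within the target format."""
    max_rows, max_columns = LIMITS[target]
    return rows <= max_rows and columns <= max_columns
```

A sheet of 70,000 rows, for instance, fits within XLSX but not XLS, so downgrading it would lose data.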
Within a family of file formats
Upgrading within a family of file formats generally poses few direct continuity risks, as most file
formats are specifically engineered to be backwards-compatible with earlier versions of the
‘same’ format. However, migration is never risk free, and some small changes to spreadsheets
may be found – e.g. styles and formatting may change. By contrast, downgrading to earlier
versions may entirely lose formatting, embedded objects, programmatic code or other advanced
features depending on what is supported in the earlier versions. Many early spreadsheet
formats only support a small number of rows and columns, so it may not be possible to
downgrade a large spreadsheet without losing data entirely.
Across format families
Migrating from one broad type of spreadsheet file format to an entirely different one poses the
highest direct continuity risks. No two broad families of spreadsheet file format support exactly
the same features, in the same ways, so you should expect some change and loss to a
spreadsheet.
In general, migration between recent versions of most spreadsheet formats will produce
spreadsheets which are still workable, but with some formatting changes. However, advanced
features such as embedded programming (‘macros’) will often not survive the process.
More seriously, not all spreadsheets support exactly the same formulae used in calculations –
and there are differences in the implementation of some formulae which can produce different
results. However, differences tend to be found in the more complex functions rather than the
simple, everyday functions (e.g. sum or count). If the answers to any complex calculations must
be preserved as they are, then a review of the compatibility of the functions used must be
undertaken.
From spreadsheet to page-layout document
A frequent use-case in business workflows is taking a spreadsheet and migrating it to a page-layout document format, either for publication or archiving. This process will generally produce a
high-quality output document which preserves the layout and styles of the original. However, all
advanced interactive features will be lost – in particular, any formulae used to calculate values
in the sheet will disappear, with only the results of the calculation left in the final output
document. If it is important for your audiences to understand how the spreadsheet was
calculated, you must either provide these details as an additional piece of documentation, or not
provide the spreadsheet as a document in the first place, instead making a spreadsheet
available.
Some page-layout formats may faithfully replicate the look of a spreadsheet, but may
incidentally lose other features that are still required. For example, the PDF format can store
text in a way which can be rendered absolutely accurately on screen or paper, but is not
electronically searchable. If the ability to copy and paste out of the document is important,
attention should be paid to how the text can be further manipulated in the page layout format.
Some page-layout formats make it hard to select and copy text out of them (e.g. columns are
not properly wrapped, mixing up text from several columns when it is selected out of the
document).
While page-layout formats are very useful for human readability of documents, it is normal that
some form of digital access to the content will still be required. Special attention should be paid
to the features used in the page layout format and your business requirements for ongoing use
of the information.
6.2
Microsoft Excel 97-2003 (.xls)
The Microsoft Excel 97-2003 format (XLS) 36 is the de facto standard for business spreadsheets
in use today. As its name suggests, it first appeared in 1997, and was used as the default
spreadsheet format until 2003, after which several new formats appeared. It is still supported by
all major spreadsheet software on all platforms.
The format has not been formalised through a standards body, but the specification has now
been made available by Microsoft, and it is mostly supported on almost every platform.
Application Programming Interfaces and Software Development Kits are widely available.
However, the format has several advanced features which are fully supported only on Microsoft
platforms, including programmatic scripts and macros. In addition, XLS files can embed other
objects which may require additional software to be installed to access them.
It is a binary format, based on a format called the Binary Interchange File Format (BIFF),
consisting of data stored in records describing the spreadsheet. These records, along with other
resources such as images, are embedded in an OLE2 container format (see section 4.5). OLE2
files can be hard to recover in the face of corruption, as they have a complex and fragmented
internal structure.
36 See http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Excel97-2007BinaryFileFormat(xls)Specification.xps
6.2.1
Continuity properties of Microsoft Excel 97-2003
Flexibility
    Interoperability: Very high. Almost all platforms can read and write this format.
    Implementability: High. Many programming environments can access information in this format.
Quality
    Lossiness: None.
    Precision: High. Numbers are stored using double-precision floating point, which gives them a precision of about 15 significant decimal digits.
Resilience
    Recoverability: Low. The binary OLE2 format on which it is based is hard to recover in the face of corruption, and there are not many tools to do so.
    Ubiquity: Very high. The format is found almost everywhere.
    Stability: Very high. Although not formally standardised, its status as a de facto standard ensures that support for the format will be found many years into the future.
6.3
Microsoft Excel 2007 (.xlsx)
The Microsoft Office Open XML format (XLSX) 37 consists of various XML-based files (see
section 3.3) and other resources, such as image files, contained in a zip file (see section 4.2).
Since the file is compressed, damage to the file can result in being unable to open the file, so
recoverability in the face of corruption may be limited. However, note that there are zip repair
tools available which may make it possible to recover a corrupted XLSX file.
The standardisation status of the Excel 2007 format is complex, but is the same as for the Word
2007 format, which is fully discussed in section 5.7.
6.3.1
Continuity properties of Microsoft Excel 2007 (XLSX)
Flexibility
    Interoperability: High. Most platforms can read XLSX files.
    Implementability: Average. Software to programmatically access information in XLSX files is mostly confined to the Microsoft platform, although support in other environments is growing.
Quality
    Lossiness: None.
    Precision: High. Numbers are stored using double-precision floating point, which gives them a precision of about 15 significant decimal digits.
Resilience
    Recoverability: Above average. The zip format it is based on provides several error detection and recovery mechanisms, and the use of XML-based textual content is likewise fairly recoverable in the face of errors.
    Ubiquity: Above average. XLSX files are widely used, but they are not the dominant spreadsheet format, which is still XLS.
    Stability: Unclear. Although subject to several different standardisation processes, these have not resulted in a single standard, and instead produced several different and incompatible standards, which are not yet supported in software. The status of the format is unclear going forward into the future – it may be replaced by one of the newer standards, or the standards may be changed to make existing spreadsheets compatible with them. However, since there are a large number of files encoded in the current format, support for it is likely to be found into the near future.
37 See http://en.wikipedia.org/wiki/Office_Open_XML
6.4
OpenDocument Spreadsheet (.ods)
The OpenDocument 38 Spreadsheet Format (ODS) consists of various XML-based files (see
section 3.3) and other document resources, such as image files, contained in a zip (see section
4.2) file. Since the file is compressed, damage to the file can result in being unable to open the
file, so recoverability in the face of corruption may be limited. However, note that there are zip
repair tools available which may make it possible to recover a corrupted ODS file.
The overall standardisation status of OpenDocument Spreadsheets is the same for all
OpenDocument formats, and is discussed in section 5.6.
However, note that the formulae used in OpenDocument Spreadsheets have not been
standardised in the 1.0 and 1.1 versions of the standard, although they are standardised in the
upcoming 1.2 standard. In the meantime, most implementations of OpenDocument
Spreadsheet have followed the lead of the Open Office Calc application (from which the
OpenDocument standards were originally derived). A major exception to this rule is the
OpenDocument support in Microsoft Office 2007 SP2, which interprets the standard
differently, creating potential interoperability problems. 39
38 See http://en.wikipedia.org/wiki/OpenDocument
6.4.1
Continuity properties of ODS format
Flexibility
    Interoperability: Above average. Most platforms can read OpenDocument Spreadsheet format. However, note that the dominant platform (Microsoft Office) interprets certain aspects of the format differently to other implementations, which can result in non-interoperable spreadsheets.
    Implementability: High. Most programming environments can access information in OpenDocument Spreadsheet format.
Quality
    Lossiness: None.
    Precision: High. Numbers are stored using double-precision floating point, which gives them a precision of about 15 significant decimal digits.
Resilience
    Recoverability: Above average. The zip format it is based on provides several error detection and recovery mechanisms, and the use of XML-based textual content is likewise fairly recoverable in the face of errors.
    Ubiquity: Average. ODS files are fairly widely found, but they are not the dominant spreadsheet format.
    Stability: High. The OpenDocument Spreadsheet format is highly standardised, and backwards compatible with earlier versions. Support for information in these formats is likely to continue into the indefinite future. Note that formulae will not be standardised until the 1.2 family of standards is approved.
39 See http://en.wikipedia.org/wiki/OpenDocument_software#Microsoft_Office_2007_SP2_support_controversy
7.
Presentations
7.1
Introduction
Presentation formats are somewhat simpler than document formats, as they have one clearly
defined purpose, and consist of a defined number of slides, with no wrapping of content
between them (and hence no pagination issues). Presentation formats described here are:
• Microsoft PowerPoint 97-2003 (PPT), see section 7.2
• Microsoft PowerPoint 2007 (PPTX), see section 7.3
• OpenDocument Presentation (ODP), see section 7.4
7.1.1
Complex media risks
Presentations tend to contain complex media resources, including time-based media like audio
and video, each of which may pose continuity issues of their own. Unlike images, whose
formats are highly standardised, time-based media often use standardised containers, which
compress their content using different ‘codecs’ (compression-decompression). It can be hard to
determine which codecs are in use, or whether support for them will be found in future
platforms.
7.1.2
Linked resource risks
Resources used in a presentation may not be embedded in the presentation file itself, but may
take the form of a link to a file resource on the local computer or on a network shared drive. If the
presentation is moved, or the external resources are unavailable, then the presentation will not
work properly. You should ensure that any resources required by a presentation are embedded,
or that the use of linked resources does not pose any continuity issues for you.
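For presentations saved in the Office Open XML format (PPTX), linked resources can be listed without opening the presentation, because each part's relationships are recorded in small XML files ending in .rels inside the zip package, and links that point outside the package are marked with TargetMode="External". The sketch below is illustrative only: the function name external_targets is hypothetical, it does not apply to PPT or ODP files, and ordinary web hyperlinks will appear in the results alongside linked media.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by relationship (.rels) parts in Office Open XML packages.
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def external_targets(data: bytes):
    """List (rels part name, target) pairs for externally linked resources."""
    found = []
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        for name in package.namelist():
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(package.read(name))
            for rel in root.iter(RELS_NS + "Relationship"):
                if rel.get("TargetMode") == "External":
                    found.append((name, rel.get("Target")))
    return found
```

An empty result suggests every resource is embedded; any file path or shared-drive location in the list is a resource that must travel with the presentation or be embedded before it is moved.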
7.1.3
Migration risks
In common with documents and spreadsheets, there are three typical migration use-cases for
presentations:
• within a family of file formats (e.g. PowerPoint 95 to PowerPoint 97-2003)
• across format families (e.g. PowerPoint 2007 to OpenDocument Presentation 1.1)
• from a presentation to a page-layout document format (e.g. OpenDocument Presentation 1.1 to PDF 1.7).
Within a family of file formats
Upgrading within a family of file formats generally poses few direct continuity risks, as most file
formats are specifically engineered to be backwards-compatible with earlier versions of the
‘same’ format. However, migration is never risk free, and some small changes to presentations
may be found – e.g. styles and formatting may change. By contrast, downgrading to earlier
versions may entirely lose formatting, slide transitions, macros or other features depending on
what is supported in the earlier versions.
Across format families
Migrating from one broad type of presentation file format to an entirely different one poses the
highest direct continuity risks. No two broad families of presentation file format support exactly
the same features, in the same ways, so some change and loss to a presentation should be
expected.
In general, migration between recent versions of most presentation formats will produce
presentations which still roughly contain the same content, but the layout can frequently be
changed in ways which require a lot of manual intervention to fix. The layout of presentations is
quite central to their purpose, so while content may not be lost, automatic migration cannot be
relied upon at present if presentations must be usable after migration without manual
intervention.
From presentation to page-layout document
A frequent use-case in business workflows is taking a presentation and migrating it to a page-layout document format, either for publication or archiving. This process will generally produce a
high-quality output document which preserves the layout and styles of the original.
However, all advanced interactive features will be lost, including slide transitions, animations,
and any time-based media such as audio and video. Despite this, it is quite common for simple
presentations, consisting of text and images to be rendered as a document for download.
Presentation software often also includes a ‘slide-show’ version of the main file format, which
will accurately preserve transitions and complex media, but becomes non-editable.
Some page-layout formats may faithfully replicate the look of a presentation, but may
incidentally lose other features that are still required. For example, the PDF format can store
text in a way which can be rendered absolutely accurately on screen or paper, but is not
electronically searchable. If the ability to copy and paste out of the document is important,
attention should be paid to how the text can be further manipulated in the page layout format.
Some page-layout formats make it hard to select and copy text out of them (e.g. columns are
not properly wrapped, mixing up text from several columns when it is selected out of the
document).
While page-layout formats are very useful for human readability of documents, it is normal that
some form of digital access to the content will still be required. Special attention should be paid
to the features used in the page-layout format and your business requirements for ongoing use
of the information.
7.2
Microsoft PowerPoint 97-2003 (.ppt)
The Microsoft PowerPoint 97-2003 40 (PPT) format is the de-facto standard for business
presentations in use today. As its name suggests, it first appeared in 1997, and was used as the
default presentation format until 2003, after which several new formats appeared.
The format has not been formalised through a standards body, but the specification is now made
available by Microsoft, and almost every platform has some level of support.
PPT is a binary format, consisting of various document resources embedded in an OLE2
container format (see section 4.5). OLE2 files can be hard to recover in the face of corruption,
as they have a complex and fragmented internal structure.
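As a rough illustration of the two container types mentioned in this guide, the Python sketch below distinguishes an OLE2-based file (such as PPT) from a zip-based file (such as PPTX or ODP) by looking at its first few bytes. The function name sniff_container is hypothetical, and a matching signature only identifies the container, not the presentation format stored inside it.

```python
# Leading bytes ("signatures") of the two container formats discussed here.
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file
ZIP_SIGNATURE = b"PK\x03\x04"                        # zip local file header


def sniff_container(data: bytes) -> str:
    """Return a coarse container type for the start of a file's contents."""
    if data.startswith(OLE2_SIGNATURE):
        return "ole2"
    if data.startswith(ZIP_SIGNATURE):
        return "zip"
    return "unknown"
```

In practice the first eight bytes of a file, read in binary mode, would be passed to the function. A result of "unknown" does not prove the file is damaged, only that it does not begin with either signature.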
7.2.1
Continuity properties of Microsoft PowerPoint 97-2003
Flexibility
Interoperability: Very high. Almost all platforms can read and write this format.
Implementability: Average. Some programming environments can access information in this format, although programmatic control over presentations is a fairly uncommon requirement.
Quality
Lossiness: None. Although note that media contained in a presentation can be lossy.
Precision: No issues.
Resilience
Recoverability: Low. The binary OLE2 format on which it is based is hard to recover in the face of corruption, and there are not many tools to do so.
Ubiquity: Very high. The format is found almost everywhere.
Stability: Very high. Although not formally standardised, its status as a de facto standard ensures that support for the format will be found many years into the future.
40
See www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
7.3
Microsoft PowerPoint 2007 (.pptx)
The Microsoft Office Open XML format (PPTX) 41 consists of various XML-based files (see
section 3.3) and other resources, such as image files, contained in a zip file (see section 4.2).
Since the file is compressed, damage to the file can result in the user being unable to open the
file, so recoverability in the face of corruption may be limited. However, note that there are zip
repair tools available which may make it possible to recover a corrupted PPTX file.
The standardisation status of the PowerPoint 2007 format is complex, but is the same as for the
Word 2007 format, which is fully discussed in section 5.7.1.
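The zip-based structure described above can be inspected with general-purpose tools. The following Python sketch, using only the standard library, lists the members of a zip-based package and runs the zip format's built-in CRC check. The helper names are hypothetical, and the demonstration package built in memory is a stand-in for illustration, not a valid PPTX file.

```python
import io
import zipfile


def check_zip_package(data: bytes) -> dict:
    """List the members of a zip-based package and run the zip CRC check."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        names = package.namelist()
        # testzip() returns the name of the first member whose CRC check
        # fails, or None if every member reads back correctly.
        first_bad = package.testzip()
    return {"members": names, "first_corrupt_member": first_bad}


def build_demo_package() -> bytes:
    """Build a tiny stand-in package in memory (not a valid PPTX)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("ppt/presentation.xml", "<presentation/>")
    return buffer.getvalue()


report = check_zip_package(build_demo_package())
print(report)
```

A check of this kind only confirms that the container is intact. It says nothing about whether the XML inside is well formed or whether a presentation application will open the file.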
7.3.1
Continuity properties of Microsoft PowerPoint 2007 (PPTX)
Flexibility
Interoperability: High. Most platforms can read PPTX files.
Implementability: Average. Software to programmatically access information in PPTX files is mostly confined to the Microsoft platform, although programmatic control over presentations is a fairly uncommon requirement.
Quality
Lossiness: None. Although note that media formats contained in a presentation may be lossy.
Precision: No issues.
Resilience
Recoverability: Above average. The zip format it is based on provides several error detection and recovery mechanisms, and the use of XML-based textual content is likewise fairly recoverable in the face of errors.
Ubiquity: Above average. PPTX files are widely found, but they are not the dominant presentation format, which is still PPT.
Stability: Unclear. Although subject to several different standardisation processes, these have not resulted in a single standard, and instead produced several different and incompatible standards, which are not yet supported in software. The status of the format is unclear looking to the future – it may be replaced by one of the newer standards, or the standards may be changed to make existing presentations compatible with them. However, as there are a large number of files encoded in the current format, support for it is likely to be found into the near future.
41
See http://en.wikipedia.org/wiki/Office_Open_XML
7.4
OpenDocument Presentation (.odp)
The OpenDocument 42 Presentation Format (ODP) consists of various XML-based files (see
section 3.3) and other document resources, such as image files, contained in a zip (see section
4.2) file. Since the file is compressed, damage to the file can result in being unable to open the
file, so recoverability in the face of corruption may be limited. However, note that there are zip
repair tools available which may make it possible to recover a corrupted ODP file.
The overall standardisation status of OpenDocument Presentations is the same for all
OpenDocument formats, and is discussed in section 5.6.1.
7.4.1
Continuity properties of OpenDocument Presentation Format (ODP)
Flexibility
Interoperability: High. Most platforms can read OpenDocument Presentation format.
Implementability: Average. Some programming environments can access information in OpenDocument Presentation format, although note that programmatic control over presentations is a fairly uncommon requirement.
Quality
Lossiness: None. Although note that media formats contained in a presentation may be lossy.
Precision: No issues.
Resilience
Recoverability: Above average. The zip format it is based on provides several error detection and recovery mechanisms, and the use of XML-based textual content is likewise fairly recoverable in the face of errors.
Ubiquity: Above average. ODP files are widely found, but they are not the dominant presentation format.
Stability: Very high. The OpenDocument Presentation format is highly standardised, and backwards compatible with earlier versions. Support for information in these formats is likely to continue into the indefinite future.
42
See http://en.wikipedia.org/wiki/OpenDocument
8.
Datasets
8.1
Introduction
Datasets are collections of structured data. File formats containing datasets include desktop
databases, structured text files (see sections 2 and 3), and spreadsheets (see section 6). This
section will specifically focus on desktop database formats, and some specific text file formats
commonly found to contain structured data.
Formats which are described here include:
Desktop databases
• Microsoft Access (MDB): see section 8.2
• Microsoft Access 2007 (ACCDB): see section 8.3
Structured text
• Comma Separated Values (CSV): see section 8.4
• Structured Query Language (SQL): see section 8.5
• Resource Description Framework (RDF): see section 8.6
Note that in the interest of balance, this guide was originally going to describe the
OpenDocument Database (odb) format. This format is related to the other OpenDocument
formats described here (see sections 5.6, 6.4, 7.4), but it has not been standardised as they
have been, and remains accessible only by the Open Office suite of applications. Due to the
almost complete lack of any accessible documentation on this format, it proved impossible to
say anything definitive on it. Hence this format should be regarded as a very high continuity risk.
8.1.1
General dataset continuity
The continuity of datasets is a complex subject, as a dataset can contain any data, in any
structure, with any meaning attached to the data and structure (whose explanation does not
usually appear in the dataset itself). For this reason, to understand and manage the continuity of
datasets in general, please read the separate guidance Managing Dataset Continuity. 43
8.1.2
Desktop database risks
A desktop database is a single-user, small-scale database intended to run within a desktop
environment, by contrast with enterprise database systems, which run on servers and support
multiple concurrent users. Desktop databases typically save data, queries and forms into a file
format which can be passed around like a spreadsheet or other data file.
43
See Managing the Continuity of Datasets nationalarchives.gov.uk/documents/information-management/managing-continuity-of-datasets.pdf
Desktop database file formats are typically hard to access without the specific creating software.
They are not very interoperable, implementable or standardised. They typically rely on structured
text formats for data interchange, although these formats usually only capture the data, not
the queries, forms or other access mechanisms defined in the desktop database format.
In addition, desktop databases are often independently created by staff to fulfil a temporary
need, without the involvement of skilled database designers. Organisations are usually unaware
that important data is being managed in these formats, and it is uncommon to find good quality
documentation on the data and structures found within them. However, they also often end up
being used and expanded beyond their original temporary purpose, and far beyond the point at
which they should be formally documented, controlled and migrated into an enterprise database
system (or entirely replaced by a properly designed system).
For these reasons, desktop database file formats have very poor continuity properties, on
almost all levels. They may not be suitable formats to manage and hold any kind of important
business data – but they can be very useful to facilitate analysis of such data, or to enable quick
solutions to temporary data management issues.
As with all formats you must carefully evaluate the business need you require the format to
meet to ensure that all your requirements are met and any continuity risks are acceptable.
8.1.3
Structured text risks
Aside from the risks which apply to all text files (see section 2.1), structured text files have no
general risks which do not also apply to any other form of dataset format (e.g. the need for
documentation on the structure and meaning of the data). Specific risks do exist for particular
file formats, which will be discussed in each sub-section.
Structured text files are typically very accessible with good interoperability and implementability
properties, and thus serve as data interchange mechanisms between many different kinds of
technology.
8.2
Microsoft Access (.mdb)
The Microsoft Access (MDB) 44 format is a binary, proprietary desktop database format created by
Microsoft, used before 2007. It is not standardised through any standards body, and its
specification is not available. The format supports various advanced features beyond storing
structured data, including macros, queries and forms to enter and validate data.
Application Programming Interfaces to enable programmatic access to MDB files are available
on the Microsoft platform, via Data Access Objects and ActiveX Data Objects, but the data
contained within this format is not widely accessible outside of this software on other platforms,
unless exported into a structured text format.
8.2.1
Continuity properties of Microsoft Access (MDB)
Flexibility
Interoperability: Very low. Almost no software other than Microsoft Access can read MDB files.
Implementability: Very low. Almost no support for programmatic access to MDB files exists outside of Microsoft Access itself.
Quality
Lossiness: None.
Precision: No issues.
Resilience
Recoverability: Very low. MDB is a complex binary format with no specific recoverability features. Due to the absence of other tools available to read and process MDB files, once a file is corrupted, the chance of recovery is very low.
Ubiquity: High. There are many databases defined using MDB.
Stability: Below average. Although the MDB format has been in use for many years, it is not standardised or documented. It has now been replaced by the ACCDB format in more recent versions of Microsoft Access, which remains capable of reading MDB files for the time being – but ongoing support cannot be guaranteed.
8.3
Microsoft Access 2007 (.accdb)
The Microsoft Access 2007 (ACCDB) 45 format is a binary, proprietary desktop database format
created by Microsoft, replacing the earlier MDB format (see section 8.2). It is not standardised
through any standards body, and its specification is not available. The format supports various
advanced features beyond storing structured data, including macros, queries and forms to enter
and validate data.
44
See http://en.wikipedia.org/wiki/Microsoft_Access
45
See http://en.wikipedia.org/wiki/Microsoft_Access
Application Programming Interfaces to enable programmatic access are available via the
Access database engine object library, but the data contained within these formats is not widely
accessible outside of this software on any platform, unless exported into a structured text
format.
8.3.1
Continuity properties of Microsoft Access 2007 (ACCDB)
Flexibility
Interoperability: Very low. Almost no software other than Microsoft Access can read ACCDB files.
Implementability: Very low. Almost no support for programmatic access to ACCDB files exists outside of Microsoft Access itself.
Quality
Lossiness: None.
Precision: No issues.
Resilience
Recoverability: Very low. ACCDB is a complex binary format with no specific recoverability features. Due to the absence of other tools available to read and process ACCDB files, once a file is corrupted, the chance of recovery is very low.
Ubiquity: Average. There are some databases defined using ACCDB, although MDB is still more common.
Stability: Below average. The ACCDB format is relatively new, and is not standardised or documented. Support for it is likely to continue for the foreseeable future, but cannot be guaranteed.
8.4
Comma Separated Values (.csv)
Comma Separated Values (CSV) 46 is an informal family of textual file formats, used to store
tabular data. While the format has been in use since at least a decade before the advent of
personal computers, it is not standardised in any way, and many variations of it exist.
The format is not lossy, and there are no innate precision issues. However, software reading a
CSV file may interpret the data in it inconsistently, as there are no standards defining how to
process the data represented in CSV files.
46
See http://en.wikipedia.org/wiki/Comma-separated_values
The basic format consists of columns of text separated by commas, with each row on a single
line. However, in some countries commas are used to represent decimal points in numbers, so
semi-colons or other characters, including tabs, may be used to separate the columns.
Other variations include whether text columns are quoted or not (usually using
double quotes), and how quotes in the text itself are represented (sometimes by placing two
double quotes next to one another with no intervening text). Sometimes the first line of a CSV
file contains ‘header’ names for each column, but there is no reliable way to determine whether
the first line contains data or headers without prior knowledge or manual review.
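To illustrate why the variant in use needs to be known in advance, the following Python sketch reads the same made-up sample twice with the standard csv module: once with the separator the file actually uses, and once assuming plain commas. The sample data and the helper name read_rows are hypothetical.

```python
import csv
import io

# A hypothetical export using semi-colons as separators, commas as decimal
# points, and doubled double quotes to represent a quote inside a text column.
SAMPLE = 'name;price\r\n"Widget ""A""";1,50\r\n"Gadget";12,00\r\n'


def read_rows(text: str, delimiter: str) -> list:
    """Parse CSV text with an explicitly stated delimiter and quote style."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quotechar='"', doublequote=True)
    return list(reader)


rows_correct = read_rows(SAMPLE, ";")   # the variant the file actually uses
rows_wrong = read_rows(SAMPLE, ",")     # a reader assuming plain commas

print(rows_correct)
print(rows_wrong)
```

With the correct separator the rows split into a name and a price. With the wrong one the reader raises no error but silently produces different columns, which is why documentation of the specific variant should be kept alongside the data.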
8.4.1
Continuity properties of Comma Separated Values (CSV)
Flexibility
Interoperability: Very high. Almost all structured data applications which produce tabular data can read or write data in a CSV format.
Implementability: Very high. All programming environments can produce or consume data in a CSV format.
Quality
Lossiness: None.
Precision: No issues. Although note that the data represented in the CSV file may have precision issues, depending on the applications which read and write the files.
Resilience
Recoverability: High. CSV is a purely textual format, and depending on the text encoding (see section 2.1.1) used to create the file, small changes will remain local and the file will normally remain readable. However, there are no error detection or recovery features.
Ubiquity: Very high. CSV files are found almost everywhere.
Stability: Very high. Despite the lack of formal standardisation, the CSV format family has been in use since before the advent of personal computers, and is still very widespread and in active use.
8.5
Structured Query Language (.sql)
Structured Query Language (SQL) 47 is a family of database programming languages, rather
than being specifically a file format. However, it is common for data and database structure to
be represented using SQL and stored in text files (see section 2), primarily for database creation
and data interchange between database management systems (usually created by the same
vendor).
SQL is not a lossy format, but it does have potential precision issues if SQL is created using one
database product and consumed in another. Not all data-types are handled consistently
between vendors, particularly numbers and date-times.
47
See http://en.wikipedia.org/wiki/SQL
SQL has several standards, the first being an ANSI standard in 1986. It was adopted as an ISO
standard (ISO 9075) in 1987, substantially revised in 1992 (often referred to as SQL-92), and
further revised in 1999, 2003, 2006 and 2008. Earlier standards are forwards compatible with the
later standards (meaning they are valid even if processed with software expecting a later
standard, but the later standards add new features which cannot be understood if an earlier
standard is expected).
However, despite the standards, it is common for database vendors to create non-standard
extensions to the language, and different vendors do not always process elements of the
standard in compatible ways.
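The data-type inconsistencies described above can be seen with a small experiment. The sketch below uses SQLite (via Python's standard sqlite3 module) purely as a convenient example of a consuming database. The table and column names are hypothetical, and other products will behave differently, which is the point being illustrated.

```python
import sqlite3

# SQL that declares a decimal number and a date, as it might be written for
# another database product (the table and column names are hypothetical).
SCRIPT = """
CREATE TABLE payments (amount DECIMAL(10,2), paid_on DATE);
INSERT INTO payments VALUES ('1234.50', '2011-03-01');
"""

connection = sqlite3.connect(":memory:")
connection.executescript(SCRIPT)
amount, amount_type, paid_on, paid_on_type = connection.execute(
    "SELECT amount, typeof(amount), paid_on, typeof(paid_on) FROM payments"
).fetchone()
connection.close()

# SQLite has no fixed-point decimal or date storage type: the amount comes
# back as a floating-point number and the date as plain text.
print(amount, amount_type, paid_on, paid_on_type)
```

The SQL is accepted without complaint, yet the declared types are not honoured in the way the producing database intended. Checks of this kind on numbers and date-times are worth making whenever SQL is moved between products.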
8.5.1
Continuity properties of Structured Query Language (SQL)
Flexibility
Interoperability: Average. The standards are not interpreted consistently between vendors – but the basic language is highly standardised, and usually easy to change to achieve interoperability.
Implementability: Very high. All programming environments can create SQL, and there are many libraries of code to process it.
Quality
Lossiness: None.
Precision: Some issues. Data-types are not always handled consistently between database vendors, so care must be taken with numbers and date/times if moving data between different databases.
Resilience
Recoverability: High. SQL is a purely textual format, and depending on the text encoding (see section 2.1.1) used to create the file, small changes will remain local and the file will normally remain readable. However, there are no error detection or recovery features.
Ubiquity: Very high. SQL files are found almost everywhere.
Stability: High. Although there are many standards, earlier versions are forwards compatible with later ones. However, note that vendor-extensions to the SQL standard cannot be guaranteed to be stable (although in practice, they appear to be fairly stable).
8.6
Resource Description Framework (.rdf)
Resource Description Framework (RDF) 48 is a data model which has several different possible
formats. RDF models information using sets of ‘subject-predicate-object’ statements, or more
colloquially, ‘something relates-to something-else’. RDF is one of the components of the
‘semantic web’ – the attempt to impute meaning and links to data on the web.
The two principal formats in which RDF statements are represented are RDF-XML (RDF
statements written using an XML-based format - see section 3.3), and a simpler textual format
called ‘Notation 3’, or ‘N3’. The RDF model and its XML-based format were standardised through
the World Wide Web Consortium (W3C) as a Recommendation in 1999, and subsequently updated in
2004; Notation 3 is documented by the W3C but is not itself a Recommendation. No
matter which format is used to represent RDF models, both are textual, non-lossy, and have no
precision issues.
From a continuity perspective, both RDF formats score well. However, there is an innate risk
(and opportunity) in using RDF, which is that RDF is designed to facilitate linked data. This
means that an RDF file can reference data which is found elsewhere on the web. You must
ensure that if an RDF file references external data, changes to that data (or its removal entirely)
will not adversely impact your use of the RDF data contained in the format.
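One simple way to gauge this dependency is to list the external hosts that a set of RDF statements refers to. The Python sketch below assumes the statements have already been read into ‘subject-predicate-object’ tuples; every URI in it is made up for illustration, and the function name external_hosts is hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical statements; every URI here is made up for illustration.
TRIPLES = [
    ("http://data.example.org/doc/1",
     "http://purl.example.net/terms/creator",
     "http://data.example.org/staff/42"),
    ("http://data.example.org/doc/1",
     "http://purl.example.net/terms/subject",
     "http://other.example.com/topics/archives"),
]


def external_hosts(triples, local_host):
    """Return the hosts, other than local_host, that the statements refer to."""
    hosts = set()
    for statement in triples:
        for term in statement:
            host = urlparse(term).netloc
            if host and host != local_host:
                hosts.add(host)
    return sorted(hosts)


print(external_hosts(TRIPLES, "data.example.org"))
```

Each host in the result is a party whose changes could affect how the data can be used, so the list is a reasonable starting point for assessing the risk.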
8.6.1
Continuity properties of Resource Description Framework (RDF)
Flexibility
Interoperability: Very high. Software to process RDF can be found on most platforms, and this is increasing as its adoption grows. It is highly standardised.
Implementability: High. Many programming languages on most platforms can process RDF.
Quality
Lossiness: None.
Precision: No issues.
Resilience
Recoverability: High. RDF formats are textual, so inherit the recovery properties of text. However, there are no built-in error detection or recovery mechanisms.
Ubiquity: Below average. RDF adoption is growing steadily, but is still not in widespread use.
Stability: Very high. The RDF formats are standardised and have been in use for more than a decade.
48
See http://en.wikipedia.org/wiki/Resource_Description_Framework
9.
Emails
9.1
Introduction
As a service, email is ubiquitous, highly interoperable and very successful. However, behind the
scenes, email is a mix of standards, conventions and technologies, and emails themselves are
usually stored and managed on dedicated servers. Email servers work in a variety of different
ways, but usually use database technologies rather than file formats to store emails. Hence,
email file formats tend to be used for personal archiving. The email file formats covered in this
guide include:
• EML format (EML): see section 9.2
• Microsoft Message Format (MSG): see section 9.3
• MBOX format (MBX): see section 9.4
• Personal Storage Table (PST): see section 9.5
9.1.1
General risks
Email file formats are not usually standardised, and interoperability can be quite low. Most email
client software supports at least one email file format, but it can be difficult to find software
which can read several formats, or convert emails between formats. Another general risk of
email file formats is that they are frequently used by individual users to archive organisational
mail, often placing them outside of email controls and retention policies set by organisations.
This is frequently done to work around quota limits on the size of email inboxes.
9.1.2
File attachment risks
Emails can have file attachments, which are stored within the email file format, along with the
email itself. Those files are essentially obscured from most information management software,
and may not be searchable, therefore there is a continuity risk to the attached files. This is the
same risk which applies to files contained in generic file containers (see section 4.1.1).
9.2
EML (.eml)
The EML format 49 is a plain ASCII text file (see section 2.2) containing a single email, including
email metadata (e.g. sender, subject, dates), and the text of the email. File attachments are also
included in the same text file, using various encoding schemes which convert binary files into a
textual representation, including Base64 50 and Uuencoding. 51
49
See http://www.ietf.org/rfc/rfc0822.txt
The format was semi-standardised in 1982 as RFC-822, although this standard does not cover
all the data which may appear in an eml file. It is not lossy, and has no precision issues.
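As a minimal sketch of the structure described above, the following Python example builds a small message in memory, serialises it to the text form that would be written to an EML file, and reads it back using the standard email package. The addresses, subject and attachment are hypothetical, and real EML files may contain data that this example does not exercise.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a small message in memory to stand in for the contents of an EML file
# (addresses, subject and attachment are hypothetical).
original = EmailMessage()
original["From"] = "sender@example.org"
original["To"] = "recipient@example.org"
original["Subject"] = "Quarterly figures"
original.set_content("Please find the figures attached.")
original.add_attachment(b"\x00\x01binary data", maintype="application",
                        subtype="octet-stream", filename="figures.bin")
eml_bytes = original.as_bytes()  # what would be written to a .eml file

# Reading it back: headers are the metadata, attachments are decoded from
# their textual (Base64) representation.
parsed = message_from_bytes(eml_bytes, policy=policy.default)
subject = parsed["Subject"]
attachments = {part.get_filename(): part.get_content()
               for part in parsed.iter_attachments()}
print(subject, attachments)
```

Opening the serialised bytes in a text editor would show the headers in readable form and the attachment as a block of Base64 text, which is why attached files are hidden from tools that do not parse the email itself.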
9.2.1
Continuity properties of EML
Flexibility
Interoperability: Above average. Many email clients can read or write EML files, and it is frequently used as an interchange mechanism.
Implementability: Low. Support for reading and writing EML files is hard to find in most programming environments.
Quality
Lossiness: None.
Precision: No issues.
Resilience
Recoverability: High. The format is a plain ASCII text file, meaning corruptions, insertions or deletions have a local effect. Being text, it is easy to open and correct errors, although there are no explicit error detection or recovery mechanisms defined in it.
Ubiquity: Low. The principal use appears to be interchange between email clients. Hence, it is uncommon to find large volumes of emails stored in EML format.
Stability: High. The format has been in use for decades and is at least partially standardised.
9.3
Microsoft Message (.msg)
The Microsoft Message format is a proprietary binary file format used by the Microsoft Outlook
email clients. It is based on the OLE2 Compound Document Format (see section 4.5). It is not
standardised, although the specification was made available from Microsoft in 2008. 52 All file
attachments are embedded inside the msg file.
A large number of emails are found stored in this format, due to the ubiquity of the Microsoft
Outlook application. It is not lossy, and has no precision issues.
9.3.1
Continuity properties of Microsoft Message (MSG)
Flexibility
Interoperability: Average. Few email client applications can read msg files, although various information management tools can, due to the ubiquity of the format.
Implementability: Low. Some libraries of code to access data in msg format files exist, although they may not be supported and are not found in all programming environments.
Quality
Lossiness: None.
Precision: No issues.
Resilience
Recoverability: Low. The format is based on OLE2, which can be hard to recover in the face of corruption. Few tools other than Microsoft’s own email client exist to read them.
Ubiquity: Very high. A large number of emails are archived in this format.
Stability: Above average. Although not standardised, the specification is available and due to the large number of emails found in it, support for this format is likely to be found into the immediate future.
50
See http://en.wikipedia.org/wiki/Base64
51
See http://en.wikipedia.org/wiki/Uuencoding
52
See http://msdn.microsoft.com/en-us/library/cc463912.aspx
9.4
MBOX (.mbox)
MBOX 53 is a family of related, text-based formats (see section 2) originating in the UNIX
operating system, which store entire mailboxes, rather than a single email. Emails are
appended one after another into a single text file. MBOX files are not lossy, and have no
precision issues.
The structure of MBOX has never been officially standardised, although some documentation
can be found, 54 and the format of the emails within an MBOX file is also not standardised,
although emails in an EML-like format are common (see section 9.2).
There are at least four common variations on the MBOX format, which are incompatible with
each other: mboxo, mboxrd, mboxcl and mboxcl2. Even within these broad types, variations can
be found in different software implementations. In general, although MBOX format is
widespread in a broad sense, each MBOX format is largely tied to the software which produces
and reads it, and so, despite being a textual format, must be regarded as having generally poor
continuity properties.
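The way emails are appended into a single text file can be seen with Python's standard mailbox module, which writes one particular MBOX variant. The sketch below is illustrative only: the addresses and file name are hypothetical, and other software may escape message separators differently, which is the incompatibility described above.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def make_message(subject: str, body: str) -> EmailMessage:
    """Create a minimal message (addresses are hypothetical)."""
    message = EmailMessage()
    message["From"] = "sender@example.org"
    message["To"] = "recipient@example.org"
    message["Subject"] = subject
    message.set_content(body)
    return message


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "archive.mbox")
    box = mailbox.mbox(path)
    box.add(make_message("First", "Hello"))
    # A body line starting with "From " would look like a message separator,
    # so the writer has to escape it; how this is done differs by variant.
    box.add(make_message("Second", "From here on, see the attached notes."))
    box.close()

    with open(path, "rb") as handle:
        raw = handle.read()

    reader = mailbox.mbox(path)
    subjects = [message["Subject"] for message in reader]
    reader.close()

escaped = b">From here on" in raw
print(subjects, escaped)
```

Because the separator is an ordinary line of text, a reader that expects a different escaping rule can split one email into two or merge two into one without reporting any error.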
53
See http://en.wikipedia.org/wiki/Mbox
54
See http://tools.ietf.org/html/rfc4155
9.4.1
Continuity properties of MBOX
Flexibility
Interoperability: Very low. MBOX files can generally only be processed by the software which creates them.
Implementability: Very low. There are few libraries of code able to process MBOX files.
Quality
Lossiness: None.
Precision: No issues.
Resilience
Recoverability: Average. The format is based on plain text, meaning corruptions tend to only affect the local area where they occur and they are easy to read and correct manually. However, due to the way that MBOX files are used (holding multiple emails in a single file), corruption can occur fairly easily, and there are no built-in error detection or recovery mechanisms.
Ubiquity: Average. MBOX is common on UNIX platforms – but the specific variations used by each platform are incompatible with each other.
Stability: Low. The family of formats is not standardised, leading to multiple different incompatible implementations. Although MBOX files have been in use for decades, they cannot be considered stable, as they are tightly bound to the particular software which reads them, and could change without warning.
9.5
Personal Storage Table (.pst)
Personal Storage Table (PST) 55 is a proprietary binary file format used to store multiple emails,
folders and calendar items using Microsoft Outlook. The format is not standardised, but the
specification was recently made available by Microsoft. 56 It is not lossy, and has no precision
issues.
Few tools other than Microsoft Outlook can currently read PST files, although there are tools to
convert other formats to PST format. Some libraries of code exist to programmatically access
data in PST files, although most of these were reverse-engineered before the specification was
made available, and so do not necessarily support all features in PST files. There are two major
variants of the PST format: 32 bit and 64 bit, of which the older 32 bit variety has the greater
level of support.
55
See http://en.wikipedia.org/wiki/Personal_Storage_Table
57
See http://www.fileformat.info/format/bmp/egff.htm and http://en.wikipedia.org/wiki/BMP_file_format
9.5.1
Continuity properties of Personal Storage Table (PST)
Flexibility
Interoperability: Low. Few tools can access data in PST files.
Implementability: Low. A few libraries of code enable programmatic access to data in PST files.
Quality
Lossiness: None.
Precision: No issues.
Resilience
Recoverability: Low. The format is a dense binary format, with little tool support and no known error detection or recovery features.
Ubiquity: Very high. Many users archive their inboxes using the PST format.
Stability: Above average. Although not standardised, the 32 bit variety has been in use for many years, and due to the large amount of information recorded in this format, support is likely for the foreseeable future.
10.
Images (raster)
10.1
Introduction
Raster images are images encoded as a rectangular matrix of colour values (‘pixels’), in the
same way that a television or computer monitor displays images. Being a matrix, raster images
have a natural width and height in pixels, or ‘resolution’.
This is in contrast to vector images (see section 11) which store images as a series of
instructions to draw lines and shapes in various ways. Vector images have no natural resolution,
as the shapes can always be redrawn at any desired resolution on the display device.
The raster image formats described here include:
• Windows bitmap format (BMP): see section 10.2
• Tagged Image File Format (TIFF): see section 10.3
• Graphics Interchange Format (GIF): see section 10.4
• Portable Network Graphics Format (PNG): see section 10.5
• Joint Photographic Experts Group Format (JPG, JPEG): see section 10.6
10.1.1
Scaling risks
If a raster image is displayed much larger than its natural resolution, then the image becomes
‘blocky’, as pixels are scaled up into squares, rather than remaining as individual dots in the
image. For this reason, it is generally preferable to keep raster images in as high a resolution as
possible, and produce lower-resolution versions to fill different needs. For example, a high
resolution master version may be kept for print and other high resolution needs, and lower-resolution versions produced for delivery on the web. Once resolution is discarded, it is not
possible to scale up the image again without producing blockiness (or blurring, if ‘interpolation’
algorithms are used to smooth the differences across blocks).
A risk of scaling a raster image down is that areas of high contrast may be entirely lost in the
smaller version, rather than just becoming smaller. For example, text in a large image may
become unreadable in the smaller version, even if the image is still large enough to contain
readable text. This is because the process of scaling an image down involves discarding pixels,
sometimes averaging the colour values across the areas being discarded. This tends to entirely
remove areas of high contrast, such as sharp lines, rather than preserving them as the image
becomes smaller.
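Both scaling risks can be reproduced in a few lines of code. The following is a minimal sketch in Python, assuming a greyscale image held as a list of rows of pixel values. The helper names `upscale_nearest` and `downscale_average` are hypothetical, and real imaging software uses more sophisticated resampling filters than these.

```python
def upscale_nearest(image, factor):
    """Enlarge by repeating each pixel `factor` times in both directions."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def downscale_average(image, factor):
    """Shrink by averaging each `factor` x `factor` block into one pixel."""
    height, width = len(image), len(image[0])
    result = []
    for top in range(0, height - factor + 1, factor):
        new_row = []
        for left in range(0, width - factor + 1, factor):
            block = [
                image[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            ]
            new_row.append(sum(block) // len(block))
        result.append(new_row)
    return result


# A 4 x 4 white image (255) crossed by a one-pixel black line (0).
image = [
    [255, 255, 255, 255],
    [0, 0, 0, 0],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
]

small = downscale_average(image, 2)  # the sharp line is averaged into grey
blocky = upscale_nearest(small, 2)   # scaling back up cannot restore it
```

In this sketch the black line no longer exists as such in `small`, and `blocky` consists of uniform 2 x 2 squares, which is the 'blocky' effect described above.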
10.1.2
Compression risks
It can take a lot of storage space to store raster images, so most raster image file formats apply
compression to the image. Compression algorithms may be lossy (discarding subtle variations
in the image to save space), or non-lossy (reproducing the exact pixels fed into the format). If a
lossy format is used, then the image should generally not be changed and re-saved again, as
each time the image is changed and saved, more information will be discarded, degrading the
quality each time until it becomes unusable.
In addition, by necessity, lossy compression algorithms make assumptions about what
information can be safely discarded. This means that although a lossy algorithm may produce
images almost indistinguishable from the original to the human eye, it can also produce
noticeable artefacts in the compressed image.
Non-lossy algorithms will never discard information, but different types of image may compress
better or worse than others. In extreme cases, a non-lossy compression scheme may actually
create a larger file than the uncompressed version. In general, non-lossy compression works
well on images that contain blocks or lines of identical colour (e.g. graphic art, or black and
white images), but works poorly on images with subtle continuous tones (e.g. photographs).
You should ensure that you understand the suitability of any compression scheme for the types
of image you intend to compress.
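The difference in how well non-lossy compression suits different images can be illustrated with Python's standard `zlib` module. This is a sketch under stated assumptions: the two 'images' are hypothetical 64 x 64 greyscale pixel buffers, and the random noise is a deliberately extreme stand-in for an image with no repeated patterns, not a realistic photograph.

```python
import random
import zlib

# Hypothetical 64 x 64 greyscale images, one byte per pixel.
flat = bytes([200]) * 4096   # a single block of identical colour
rng = random.Random(0)       # fixed seed so the run is repeatable
noisy = bytes(rng.randrange(256) for _ in range(4096))  # no repeated patterns

flat_packed = zlib.compress(flat)
noisy_packed = zlib.compress(noisy)

print(len(flat), "->", len(flat_packed))
print(len(noisy), "->", len(noisy_packed))

# Non-lossy means the exact original bytes come back after decompression.
assert zlib.decompress(flat_packed) == flat
assert zlib.decompress(noisy_packed) == noisy
```

The flat image shrinks to a small fraction of its size, while the noisy one comes out slightly larger than the uncompressed data, which is the extreme case mentioned above.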
10.1.3
Colour-space risks
There are different ways of representing colour, which may give greater or lower fidelity in
different circumstances. For example, the colours which are reproducible when printed are
different to those that can be displayed on a television or computer monitor.
Many raster image formats use 24-bit colour, which allows 256 different values for each of the
red, green and blue components of each pixel (the ‘RGB’ model). The human eye cannot distinguish
much more than 24-bit colour, so RGB may be assumed to be sufficient for most electronic
display purposes. However, if an image is edited and transformed in some ways, then colour
values may be lost during the transformation process. For this reason, some advanced image
formats can store greater than 24-bit colour to allow for the possibility of loss during editing.
Bear in mind that an RGB image may look quite different when printed. For print, another colour
model is frequently used: cyan, magenta, yellow, black (the ‘CMYK’ model). The colours which
can be represented by the two colour models are not equivalent, although reasonable
translations can be made between them.
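As an illustration of the arithmetic behind the two colour models, the sketch below counts the colours available in 24-bit RGB and applies a naive RGB to CMYK translation. The function names are hypothetical, and the formula is a textbook simplification: it assumes ideal inks and ignores the differences in reproducible colours described above, which professional workflows handle with colour profiles.

```python
def rgb_to_cmyk(r, g, b):
    """Naive conversion from 8-bit RGB to CMYK fractions in the range 0 to 1."""
    if (r, g, b) == (0, 0, 0):
        return 0.0, 0.0, 0.0, 1.0  # pure black uses only the black ink
    c, m, y = 1 - r / 255, 1 - g / 255, 1 - b / 255
    k = min(c, m, y)  # the part shared by all three inks is printed as black
    return (c - k) / (1 - k), (m - k) / (1 - k), (y - k) / (1 - k), k


def cmyk_to_rgb(c, m, y, k):
    """Inverse of the naive conversion, rounded back to 8-bit values."""
    return tuple(round(255 * (1 - ink) * (1 - k)) for ink in (c, m, y))


# 256 values for each of three components.
colours_24_bit = 256 ** 3
```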
Another form of colour model is ‘indexed colour’, in which the image only contains a limited
number of different RGB colours. These colours can be any RGB colour, but generally only up
to 256 different colours can appear in the image as a whole. This form of colour model can
degrade the appearance of images with subtle continuous tones, such as photographs, but may
be suitable for graphic art or other images with few variations in colour in them. Indexed colour
innately requires less storage space than full RGB colour images, and will typically compress
better using lossless compression algorithms.
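The storage saving of indexed colour can be sketched as follows. This is an illustrative Python example, not the layout of any particular file format: the function `to_indexed` and the two-colour test image are assumptions made for the purpose of the example.

```python
def to_indexed(pixels):
    """Split a list of (r, g, b) pixels into a palette and per-pixel indexes."""
    palette = []
    lookup = {}
    indexes = []
    for colour in pixels:
        if colour not in lookup:
            if len(palette) == 256:
                raise ValueError("more than 256 colours: cannot index without loss")
            lookup[colour] = len(palette)
            palette.append(colour)
        indexes.append(lookup[colour])
    return palette, bytes(indexes)


red, white = (255, 0, 0), (255, 255, 255)
pixels = [red, white] * 5000  # a hypothetical 100 x 100 two-colour graphic

palette, indexes = to_indexed(pixels)
rgb_size = len(pixels) * 3                      # 3 bytes per pixel
indexed_size = len(palette) * 3 + len(indexes)  # palette plus 1 byte per pixel
restored = [palette[i] for i in indexes]        # identical to the original
```

For an image like this one, the indexed form needs about a third of the space and loses nothing. An image with more than 256 distinct colours would first have to be reduced to a limited palette, which is where the degradation of continuous tones arises.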
In addition to the colours of pixels themselves, some image formats also have what is called an
‘alpha channel’, which records how transparent each pixel in an image is. This lets the
background on which an image is displayed show through to a greater or lesser extent in
different parts of the image. This is common in images intended for display on the web, or for
icons on a computer desktop.
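The effect of an alpha value on a displayed pixel can be sketched as a weighted blend. The example below assumes 8-bit 'straight' (non-premultiplied) alpha, where 0 is fully transparent and 255 is fully opaque; the function name `composite` is hypothetical, and individual formats and display software may differ in detail.

```python
def composite(foreground, alpha, background):
    """Blend one RGB pixel over another using an 8-bit alpha value."""
    return tuple(
        (f * alpha + b * (255 - alpha) + 127) // 255  # +127 rounds to nearest
        for f, b in zip(foreground, background)
    )


red, blue = (255, 0, 0), (0, 0, 255)
opaque = composite(red, 255, blue)      # the image pixel hides the background
invisible = composite(red, 0, blue)     # the background shows through fully
half = composite(red, 128, blue)        # roughly equal mix of the two
```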
10.2
Windows Bitmap (.bmp)
The Windows Bitmap format 57 was first defined in 1985, being the native image format for the
Microsoft Windows 1.0 and IBM OS/2 operating systems. There are many variations of this
format now in existence, being updated for Windows versions 2.0, 3.0, ’95 and NT, each adding
more capability to the format. OS/2 also introduced incompatible versions of the format,
although these are not now very widespread. However, in essence it is still quite a simple
image format, and is widely used, even outside of the original platforms it was defined on.
It is not standardised, although the specifications (at least, for recent versions) are freely
available. The format can store raster images uncompressed, or with a simple lossless
compression scheme (Run Length Encoding 58), which does not compress most images by
much, but can help in reducing storage space. The compression scheme works best for
simple images with large blocks of identical colour.
It supports indexed colour and RGB colour depths up to 24-bit, and alpha channels using 32
bits per pixel.
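The behaviour of Run Length Encoding described above can be seen in a simplified sketch. This Python example stores each run as a (count, value) pair of bytes; it is an assumption-laden illustration of the general technique and is not the exact encoding used inside bmp files, which adds escape codes not shown here.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode runs of identical bytes as (count, value) pairs."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Extend the run while the next byte matches (a count must fit in 1 byte).
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Expand (count, value) pairs back into the original bytes."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        out += bytes([encoded[k + 1]]) * encoded[k]
    return bytes(out)


block = bytes([7]) * 1000      # a large area of identical colour
gradient = bytes(range(256))   # every pixel differs from its neighbour

print(len(block), "->", len(rle_encode(block)))
print(len(gradient), "->", len(rle_encode(gradient)))
```

The block of identical colour shrinks dramatically, while the gradient doubles in size, which is why this kind of scheme only suits simple images.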
10.2.1
Continuity properties of Bitmap (bmp)
Flexibility
• Interoperability: Very high. All platforms can read bmp format.
• Implementability: Very high. All programming environments can process bmp.
Quality
• Lossiness: None.
• Precision: No issues.
Resilience
• Recoverability: Above average. If uncompressed, corruption will usually only change a small
  part of the image. There are no specific error detection or recovery features.
• Ubiquity: Very high.
• Stability: Very high. Although there were frequent changes in the early versions of the bmp
  format, the format as it is encountered today has remained unchanged for over a decade, and is
  likely to be supported for the foreseeable future.
57
See http://www.fileformat.info/format/bmp/egff.htm and http://en.wikipedia.org/wiki/BMP_file_format
58
See http://en.wikipedia.org/wiki/Run-length_encoding
10.3
Tagged Image File Format (.tif, .tiff)
The Tagged Image File Format (TIFF) 59 was first formally specified in 1986 by Aldus
Corporation, after two earlier draft specifications. Hence, the first specification of TIFF is known
as TIFF 3.0. Three further versions were released in 1987, 1988 and 1992, the latest
specification being TIFF 6.0. It has not substantially changed since then, although minor
additions have been made. Note that TIFF files do not specify which version of the specification
they comply with. Each new version simply added more features to the previous version. TIFF
files should be assumed to be version 6.0, as this will always cover all the previous versions.
TIFF is unusual among raster image formats, in that it can hold more than one image at a time
(‘multi-page’), reflecting its origin in a file format to contain scanned images. It is still widely used
as a digitisation format.
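The multi-page structure can be inspected directly. The sketch below counts the images in a classic TIFF file by following its chain of image file directories (IFDs). It relies on the published TIFF 6.0 layout (a byte-order mark, the number 42, then the offset of the first IFD), does not handle the later BigTIFF variant, and the function name `count_tiff_pages` is hypothetical.

```python
import struct


def count_tiff_pages(data: bytes) -> int:
    """Count the image file directories (pages) in a classic TIFF byte string."""
    if data[:2] == b"II":
        endian = "<"  # little-endian ('Intel') byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ('Motorola') byte order
    else:
        raise ValueError("not a TIFF file")
    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")

    pages = 0
    seen = set()
    while offset != 0:
        if offset in seen:
            raise ValueError("corrupt file: directory chain loops")
        seen.add(offset)
        (entry_count,) = struct.unpack_from(endian + "H", data, offset)
        # Each directory entry is 12 bytes; the next IFD offset follows them.
        (offset,) = struct.unpack_from(endian + "I", data, offset + 2 + 12 * entry_count)
        pages += 1
    return pages
```

A result greater than one indicates a multi-page file, which some software will silently display as a single image.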
It is inherently an extensible format, allowing many different options to be specified with it. This
has given rise to compatibility problems, as not all software can process all the options. 60 All
software which can process TIFF today must conform to a baseline specification, which
alleviates many (but not all) of these issues. It supports many different sorts of compression
(including lossy and lossless), colour models (including RGB and CMYK), and many other
features too numerous to list here.
The TIFF specification 61 is now owned by Adobe Corporation and is not itself formally
standardised, although the specifications are openly available. Various other TIFF-like formats
have been standardised, including TIFF/IT (ISO 12639), TIFF/IT P1 (ISO 12639:1998) and
TIFF/IT P2 (ISO 12639:2004), although these are not entirely compatible with TIFF itself.
Care must be taken using TIFF to ensure that the particular features used will be compatible
with the environment it is being used in, but it is a highly flexible format suitable for many
advanced image tasks.
59
See http://en.wikipedia.org/wiki/Tagged_Image_File_Format
60
Giving rise to the joke that TIFF stands for ‘Thousands of Incompatible File Formats’.
61
See http://partners.adobe.com/public/developer/tiff/index.html
10.3.1
Continuity properties of Tagged Image File Format (tif, tiff)
Flexibility
• Interoperability: Mixed. Baseline TIFFs are highly interoperable, being supported on almost all
  platforms. However, the high number of variations possible with the format can limit its
  interoperability.
• Implementability: Mixed. Baseline TIFFs are easy to implement in most programming
  environments.
Quality
• Lossiness: Mixed. TIFF supports both lossy and non-lossy compression schemes.
• Precision: No issues.
Resilience
• Recoverability: Mixed. Very simple TIFFs may be possible to recover, although there are no
  error detection or recovery mechanisms built in.
• Ubiquity: Very high. TIFF files are widely found.
• Stability: High. The specification itself is very stable, being largely unchanged for nearly
  2 decades, but is not formally standardised. However, once again note that while the
  specification is stable, the format itself is so extensible that the stability of images
  encoded with it can be questioned.
10.4
Graphics Interchange Format (.gif)
The Graphics Interchange Format 62 (GIF) was first specified in 1987 as GIF 87a. It supports
more than one image in a single file. A later specification was made in 1989 (GIF 89a), adding
support for animation delays between images. The format is not standardised, although the
specifications are freely available.
62
See http://en.wikipedia.org/wiki/Graphics_Interchange_Format
GIF uses LZW 63 compression, which is a lossless compression algorithm. It provides better
compression than the Run Length Encoding found in Windows Bitmaps (see section 10.2). GIF
does not support full 24-bit RGB colours; it uses indexed colour (see 10.1.3), allowing up to 256
different colours in the image, 64 and also supports transparency.
It is most suitable for simple graphic images with a limited number of colours. GIF images are
often found on the web, used for simple animations.
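The principle behind LZW compression is that repeated sequences are replaced by codes from a dictionary built up as the data is read. The following Python sketch shows the basic algorithm only, under the assumption that codes are kept as a plain list of integers; GIF additionally packs the codes into variable-width bit fields with special clear and end codes, which are omitted here. The function names are hypothetical.

```python
def lzw_encode(data: bytes) -> list:
    """Replace repeated byte sequences with dictionary codes."""
    table = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate  # keep extending a known sequence
        else:
            codes.append(table[current])
            table[candidate] = len(table)  # learn the new sequence
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decode(codes: list) -> bytes:
    """Rebuild the original bytes, reconstructing the same dictionary."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            entry = previous + previous[:1]  # code defined by the step in progress
        else:
            raise ValueError("invalid LZW code")
        out += entry
        table[len(table)] = previous + entry[:1]
        previous = entry
    return bytes(out)


row = b"ABABABABABABABAB"  # a hypothetical row of pixels using two palette entries
codes = lzw_encode(row)
```

Unlike run length encoding, this approach also benefits from repeated patterns, not just runs of a single value, which is one reason it generally compresses better.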
10.4.1
Continuity properties of Graphics Interchange Format (gif)
Flexibility
• Interoperability: Very high. All platforms can read GIF files.
• Implementability: Very high. The format is easy to implement and use in most programming
  environments.
Quality
• Lossiness: None.
• Precision: No issues.
Resilience
• Recoverability: Average. The format is quite simple, but there are no error detection or
  recovery features.
• Ubiquity: Very high. GIF files are extremely widespread.
• Stability: Very high. The format is unchanged over two decades, although it is not formally
  standardised.
10.5
Portable Network Graphics (.png)
The Portable Network Graphics (PNG) 65 file format was first specified in 1996 as Version 1.0,
and is also standardised as a W3C Recommendation. Two further versions were later defined:
version 1.1 in 1998, version 1.2 in 1999, adding a few additional features. It was standardised in
2003 as ISO 15948:2003 and subsequently as ISO 15948:2004. The standardised versions are
marginally different to version 1.2.
It was developed as a competing format to GIF (see section 10.4). At the time of development,
the GIF format was enmeshed in patent issues on the underlying compression algorithm used
by it, although these patents have now expired. 66 PNG uses a patent-free lossless compression
algorithm, 67 which generally achieves better compression than GIF for most images. Unlike
GIF, it is a single image format, with a separate format not described here, Multiple Image
Network Graphics (MNG), 68 being defined for animation purposes.
63
See http://en.wikipedia.org/wiki/Lempel-Ziv-Welch
64
Note that there is a rarely used ‘hack’ which can produce true RGB images without changing the
underlying format. See: http://en.wikipedia.org/wiki/Graphics_Interchange_Format#True_color
65
See http://en.wikipedia.org/wiki/Portable_Network_Graphics
66
See http://en.wikipedia.org/wiki/Graphics_Interchange_Format#Unisys_and_LZW_patent_enforcement
67
See http://en.wikipedia.org/wiki/Portable_Network_Graphics#Compression
The PNG format supports indexed colour and RGB true-colour with an alpha channel for pixel
transparency. However, although any RGB image can be represented using PNG, the
compression works best for graphic images, rather than photographic images, where subtle
variations in colour prevent the compression from working well.
10.5.1
Continuity properties of Portable Network Graphics (png)
Flexibility
• Interoperability: High. Almost all recent graphic software can process PNG images, although
  older web browsers may not be able to.
• Implementability: Very high. Most programming environments can process PNG images.
Quality
• Lossiness: None.
• Precision: No issues.
Resilience
• Recoverability: Average. The format is quite simple. Each chunk of a PNG file carries a
  checksum (CRC), which allows corruption to be detected, but there are no specific recovery
  features.
• Ubiquity: High. PNG images are quite widely found and their adoption is growing.
• Stability: Very high. The format is standardised and has largely remained unchanged for over
  a decade.
10.6
Joint Photographic Experts Group (.jpg, .jpeg)
The Joint Photographic Experts Group format (JPG) 69 is designed for the compact
representation of photographs, or other images with subtle tone variations. These sorts of
image typically compress poorly using lossless compression algorithms, so JPG specifies a
lossy algorithm, which selectively discards small changes in colour to achieve higher levels of
compression. The JPG standard also includes a lossless mode, but many applications which
process JPG files do not support it. JPG has some small precision issues, in
that the lossy compression algorithm can produce small rounding errors using numbers with
decimal points, which may change the final image in small ways. However, these changes are
minimal when compared to the intentional discarding of information performed by the lossy
compression itself.
68
See http://en.wikipedia.org/wiki/Multiple-image_Network_Graphics
69
See http://en.wikipedia.org/wiki/JPEG
JPG was first issued and standardised in 1992, as ISO 10918-1. However, note that this
standard principally covers the method of image compression and decompression (the ‘codec’).
The file formats in which JPG compressed images are commonly contained are known as
EXIF 70 and JFIF 71. However, files encoded in these formats still generally use the common JPG
or JPEG file extensions.
Due to its compression it is a suitable format for the storage of photographic images which will
not change further, or require editing, and for which the loss of subtle data is not critical. Due to
the way that the JPG algorithm works, areas of high contrast (e.g. sharp boundaries, or text) in
the image can end up with visible ‘artefacts’ in the image surrounding the boundary. Therefore,
these sorts of image are not generally suitable for JPG. It is possible to select how much
information JPG discards, trading off space against fidelity.
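The trade-off between space and fidelity can be illustrated with a toy example. The Python sketch below rounds tone values to multiples of a chosen step, which is a simple stand-in for the quantisation stage of a lossy codec. It is not the JPG algorithm itself, which applies quantisation to transformed blocks of the image, and the function name `quantise` and the sample values are assumptions.

```python
def quantise(values, step):
    """Round each value to the nearest multiple of `step` (a lossy operation)."""
    return [step * round(v / step) for v in values]


samples = list(range(0, 256, 5))  # a smooth ramp of 52 distinct tone values

report = {}
for step in (1, 8, 32):
    coarse = quantise(samples, step)
    distinct = len(set(coarse))  # fewer distinct values compress better
    max_error = max(abs(a - b) for a, b in zip(samples, coarse))
    report[step] = (distinct, max_error)

for step, (distinct, max_error) in report.items():
    print(f"step {step}: {distinct} distinct values, maximum error {max_error}")
```

A larger step leaves fewer distinct values to store, at the cost of a larger error in each value, which mirrors the effect of choosing a lower quality setting.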
10.6.1
Continuity properties of Joint Photographic Experts Group (jpg, jpeg)
Flexibility
• Interoperability: Very high. Most platforms can read and process JPG images.
• Implementability: Very high. Most programming environments can process JPG images.
Quality
• Lossiness: Usually lossy. Sharp boundaries (e.g. text) may have visible artefacts surrounding
  them. A lossless mode exists, but is not widely supported.
• Precision: No major issues. Note that the compression algorithm can produce small rounding
  errors in its calculations, even when the compression is minimal.
Resilience
• Recoverability: Below average. The formats are reasonably complex, and there are no specific
  error detection or recovery features.
• Ubiquity: Very high. JPG images are found almost everywhere.
• Stability: Very high. JPG images are highly standardised and have been in use for nearly two
  decades. Good support is likely into the foreseeable future.
70
See http://en.wikipedia.org/wiki/Exif
71
See http://en.wikipedia.org/wiki/JFIF
11.
Images (vector)
11.1
Introduction
Vector images are formats which store images as a series of instructions to draw lines and
shapes in various ways. This is in contrast to raster images (see section 10) which store images
as a rectangular matrix of colour values (‘pixels’), in the same way that a television or computer
monitor displays images. The vector image formats described here include:
• Encapsulated Postscript (EPS): see section 11.2
• Windows Metafile Format (WMF, EMF): see section 11.3
• Scalable Vector Graphics (SVG): see section 11.4
11.1.1
Continuity risks
Vector images have no natural dimensions of width or height (‘resolution’), as the shapes can
always be redrawn at any desired resolution on the display device. This means there are no
scaling risks with vector images.
It is not possible to use lossy compression on a vector image, as you cannot easily determine
which of the drawing instructions can be safely removed or simplified. However, they can be
compressed fairly well using standard lossless compression, for example zip (see section 4.2).
In addition, vector image files are typically much smaller than raster image files in the first place,
as they only store descriptions of how to reproduce a graphic image, rather than the image
itself.
The principal continuity risk of vector images is that interoperability is not as high as for raster
images. In particular, there is no common definition of what features can be specified to draw,
so most formats support different methods of drawing colours, or entirely different
shapes. Hence, migration of vector file formats must be undertaken with special care.
Vector formats are not as widely used as raster formats, although support for them is growing
on many platforms, as they provide compact, resolution-independent representations of graphic
images, such as logos and icons. Freedom from scaling issues and small file sizes are useful
properties when content must be easily viewable on, or repurposed for, a variety of networked
devices with widely different screen sizes, such as mobile smartphones, tablets and full size
desktop screens.
11.2
Encapsulated Postscript (.eps)
An Encapsulated Postscript (EPS) 72 file is a text-based Postscript file (see section 5.2)
which conforms to a specification called Document Structuring Conventions 73 (DSC). It is
intended as a way to use postscript to describe drawings which can be embedded in other
documents.
It was first specified in 1992, but is not formally standardised. It is not a lossy format, but it has a
precision issue, in that numbers are only represented to an accuracy of nine decimal digits,
which can produce rounding errors.
11.2.1
Continuity properties of Encapsulated Postscript (EPS)
Flexibility
• Interoperability: Average. EPS format is usable as a drawing format for some, but by no means
  all, vector graphics software and to embed into other documents.
• Implementability: Average. EPS format support is found in some programming environments, but
  by no means all.
Quality
• Lossiness: None.
• Precision: Some issues. As for Postscript, numbers are only represented to a precision of nine
  decimal digits, potentially creating rounding errors if calculations are performed using the
  postscript programming language.
Resilience
• Recoverability: Average. Being a textual format, small corruptions to EPS files will often not
  prevent the file being opened, but no specific error detection or recovery mechanisms are part
  of the format.
• Ubiquity: Above average. EPS files are widespread, and are still in active use, but it is not
  the vector format of choice for many applications.
• Stability: High. EPS files are largely unchanged since they were first specified, and support
  for the format is likely to be found into the foreseeable future.
72
See http://en.wikipedia.org/wiki/Encapsulated_PostScript
73
See http://en.wikipedia.org/wiki/Document_Structuring_Conventions
11.3
Windows Metafile Format (.wmf)
The Windows Metafile Format (WMF) 74 is a binary 16-bit vector image file format defined in the
1990s, which consists of commands to the Windows Graphics Device Interface (GDI). As such,
it is very highly coupled with Microsoft Windows, although reverse-engineered support for it on
other platforms can be found. It can also optionally include bitmap (raster image) components
in addition to the vector images.
The format defines no compression (so is not lossy), and has minor precision issues in that the
format is innately 16-bit. This may limit the theoretical accuracy of very large drawings specified
with it, but in practice, this should not be a concern.
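The practical meaning of a 16-bit limit can be shown with some simple arithmetic. This Python sketch assumes that drawing coordinates are stored as signed 16-bit integers and that a drawing is mapped across the full range; the function names and the drawing widths are hypothetical.

```python
import struct

INT16_MIN, INT16_MAX = -32768, 32767


def fits_16_bit(coordinate: int) -> bool:
    """True if the coordinate can be stored as a signed 16-bit integer."""
    return INT16_MIN <= coordinate <= INT16_MAX


def pack_coordinate(coordinate: int) -> bytes:
    """Store a coordinate in two bytes; raises struct.error if it is out of range."""
    return struct.pack("<h", coordinate)


def finest_step_mm(drawing_width_mm: float) -> float:
    """Smallest distinguishable step when a drawing spans all 65,536 values."""
    return drawing_width_mm / 65536


print(finest_step_mm(1000))    # a drawing 1 metre wide
print(finest_step_mm(100000))  # a drawing 100 metres wide
```

For a drawing one metre wide the step is a small fraction of a millimetre, which supports the view that the limit is rarely a practical concern; only very large drawings lose noticeable accuracy.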
It is not standardised in any way, although the specifications of the formats were released in
2006. 75
11.3.1
Continuity properties of Windows Metafile Format (WMF)
Flexibility
• Interoperability: Low. It is tightly coupled to the Microsoft Windows platform. Some reverse
  engineered implementations can be found.
• Implementability: Low. Support for the format is mostly limited to Microsoft Windows
  programming environments.
Quality
• Lossiness: None.
• Precision: Small issues. The format is 16-bit only, which may limit the size or accuracy of
  very large drawings.
Resilience
• Recoverability: Below average. It is a dense binary format, there are no specific error
  detection or recovery features, and few tools exist to read or repair it.
• Ubiquity: Low. Although it is a format used internally by Microsoft Windows and as a drawing
  format for Microsoft Office, files encoded in WMF are not particularly widespread.
• Stability: Average. Since the format is essentially a representation of the underlying Windows
  Graphics Device Interface, it has been quite stable. However, it is not standardised, and
  support for the format cannot be guaranteed for much beyond the immediate future, as Windows
  itself changes.
74
See http://en.wikipedia.org/wiki/Windows_Metafile
75
See http://msdn.microsoft.com/en-us/library/cc215212.aspx
11.4
Scalable Vector Graphics (.svg)
The Scalable Vector Graphics file format 76 (SVG) is a textual format based on XML (see section
3.3). It was first defined in 1999 by the World Wide Web Consortium (W3C) and there have
been several versions defined since then. SVG 1.0 became a W3C recommendation in 2001,
1.1 in 2003 and 1.2 Tiny in 2008. SVG 1.2 Full has been a working draft for many years, but is
likely to be replaced by SVG 2.0.
Support for SVG is increasingly common, particularly on the web, although Microsoft Internet
Explorer has only supported it from version 9.
It supports both static and interactive vector graphics, with a built in scripting language
(ECMAScript 77). Note that advanced scripted features will probably not survive migration into
another format. Raster images can also be embedded in an SVG file, and it also includes some
basic page layout features. It is a non-lossy format, and has no precision issues.
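Because SVG is textual XML, a file can be produced and inspected with ordinary XML tools. The sketch below uses Python's standard `xml.etree.ElementTree` module to write the same drawing at two display sizes. The function name `make_svg` and the circle drawing are assumptions made for illustration; the `width`, `height` and `viewBox` attributes are part of SVG.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # write plain <svg> tags without a prefix


def make_svg(width_px: int, height_px: int) -> str:
    """Return a small SVG document drawn in a fixed 100 x 100 coordinate space."""
    svg = ET.Element(
        f"{{{SVG_NS}}}svg",
        {"width": str(width_px), "height": str(height_px), "viewBox": "0 0 100 100"},
    )
    ET.SubElement(
        svg,
        f"{{{SVG_NS}}}circle",
        {"cx": "50", "cy": "50", "r": "40", "fill": "navy"},
    )
    return ET.tostring(svg, encoding="unicode")


icon = make_svg(16, 16)        # small enough for a desktop icon
poster = make_svg(1600, 1600)  # one hundred times larger in each direction
print(icon)
```

Only the `width` and `height` attributes differ between the two documents. The drawing instruction is identical, so no detail is gained or lost at either size.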
11.4.1
Continuity properties of Scalable Vector Graphics (SVG)
Flexibility
• Interoperability: High. Most vector applications and browsers can access SVG format.
• Implementability: High. SVG support is found in many programming environments.
Quality
• Lossiness: None.
• Precision: No issues.
Resilience
• Recoverability: High. Being based on a textual XML format, it is quite easy to repair damaged
  SVG files, although there is no specific error detection or recovery built in.
• Ubiquity: High. SVG files are very widespread, particularly on the web.
• Stability: High. The format is standardised through the W3C. Although new versions appear
  reasonably regularly, support for the format in all versions is likely to continue into the
  foreseeable future.
76
See http://en.wikipedia.org/wiki/Scalable_Vector_Graphics
77
See http://en.wikipedia.org/wiki/ECMAScript
12.
Audio
12.1
Introduction
Audio formats are quite diverse, being engineered to support different qualities, file sizes and
business uses. Consumer grade formats typically focus on small file sizes, support stereo
channel audio, and have relatively low quality (around CD-quality). Professional grade formats
may support higher qualities to give some head-room when editing and a greater number of
channels. Audio formats described here include:
• Waveform Audio File Format (WAV): see section 12.2
• Windows Media Audio (WMA): see section 12.3
• MPEG Layer 3 Audio (MP3): see section 12.4
• Advanced Audio Coding (AAC): see section 12.5
12.1.1
Sampling risks
In order to reproduce audio, computers must capture the sound level at regular intervals of
time, and convert it to a number. The higher the number of samples taken per second, the more
faithfully the sound can be reproduced. Since human ears can hear frequencies up to around
22,000 Hz, then a sampling rate of double this (around 44,000 samples per second) is generally
good enough to reproduce most frequencies a human ear can distinguish. Capturing more
samples gives more flexibility to edit the sound without noticeably degrading the quality.
However, when processing audio, if the sampling rate of the sound is adjusted, this can produce
audible artefacts in the sound. In general, you should capture and store audio in as high a
sample rate as possible.
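The reason the sampling rate must be at least double the highest frequency can be demonstrated numerically. In the Python sketch below, a tone above half the sampling rate produces exactly the same samples as a lower tone, so the two cannot be told apart once captured. The function name `sample_tone` and the chosen frequencies are assumptions for illustration.

```python
import math


def sample_tone(frequency_hz, sample_rate_hz, count):
    """Return `count` samples of a cosine tone captured at the given rate."""
    return [
        math.cos(2 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(count)
    ]


rate = 44_100
limit = rate / 2  # highest frequency this rate can represent: 22,050 Hz

too_high = sample_tone(30_000, rate, 8)     # above the limit
alias = sample_tone(rate - 30_000, rate, 8)  # a 14,100 Hz tone

same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(too_high, alias))
print(same)
```

The 30,000 Hz tone is recorded as if it were a 14,100 Hz tone. This kind of false tone is one source of the audible artefacts that can appear when a sampling rate is reduced without suitable filtering.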
12.1.2
Codec risks
A ‘codec’ refers to the algorithm used to compress and decompress the audio data. Some
codecs are ‘lossy’, in that they intentionally discard data to reduce the file size. Others are
lossless, reproducing the exact sound data fed into them, although these typically do not
compress as much as lossy codecs.
A particular risk of codecs is knowing which codec is actually being used. Many audio file
formats allow many different codecs to be used within them, and this is not evident from the file
extension, which simply tells you which audio file container format is being used, not the codec.
Although it is possible for dedicated audio software to determine the codec in use (otherwise it
could not play back the audio), it is harder for information managers to acquire this information,
which may create risk of unusual or older codecs remaining in use in older audio files.
12.1.3
Digital rights management risks
Some audio file formats use ‘Digital Rights Management’ (DRM) to protect the content from
copyright infringement, or to otherwise control the use of the content. By necessity, DRM
encrypts the content of the audio file format, preventing the use without a key to unlock the
content. Because of this, all audio files with DRM carry a very high continuity risk. In order to
facilitate legitimate playback of content, the software must have the decryption key available to
it. Unless the DRM scheme requires online negotiation, all off-line use (which includes most
audio players) must include the decryption key in the software client.
It is often possible to reverse engineer the decryption key. However, there are serious legal
issues with using such tools to unlock content protected by DRM schemes unless you are the
legitimate copyright owner. 78
12.2
Waveform Audio File Format (.wav)
Waveform Audio File Format (WAV) 79 is a simple audio file format used by the Microsoft
Windows and IBM OS/2 operating systems. However, support for the format is widespread on
other platforms. It is not formally standardised, but the specifications are available. 80
It can store two channels of audio at up to 44,100 samples per second, using 16 bits per sample,
so sound quality is reasonably good, but quality may suffer if edits which transform the audio
are applied. There are no digital rights management issues with the wav format.
It is an innately non-lossy format, but does support compression using a variety of codecs
supplied by the Windows Audio Compression Manager. 81 Like many media formats, the ability
to use a variety of codecs within the format means that you can experience continuity issues if
an unusual codec is selected, as not all systems may support all codecs, and it is not directly
evident from the file which codec is being used. However, note that the wav format is most
frequently used uncompressed, avoiding such issues, although making the file size of wav files
quite large.
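Although the codec is not evident from the file extension, it is recorded inside the file and can be read with a short script. The following Python sketch reads the format code from the 'fmt ' chunk of a wav file, following the published RIFF/WAVE layout. The function name `read_wav_format` is hypothetical, and the table of names covers only a few well-known format codes.

```python
import struct

# A few registered format codes; many others exist.
FORMAT_NAMES = {1: "PCM (uncompressed)", 3: "IEEE float", 6: "A-law", 7: "mu-law"}


def read_wav_format(data: bytes) -> dict:
    """Report the codec and basic audio properties stored in a wav file."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"fmt ":
            tag, channels, rate, _, _, bits = struct.unpack_from("<HHIIHH", data, pos + 8)
            return {
                "codec": FORMAT_NAMES.get(tag, f"other codec (format code {tag})"),
                "channels": channels,
                "samples_per_second": rate,
                "bits_per_sample": bits,
            }
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    raise ValueError("no 'fmt ' chunk found")
```

For example, `read_wav_format(open("recording.wav", "rb").read())` (where `recording.wav` is a placeholder file name) would report whether the file is uncompressed PCM or uses another codec that may need checking.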
78
See http://en.wikipedia.org/wiki/Software_cracking
79
See http://en.wikipedia.org/wiki/WAV
80
See http://msdn.microsoft.com/en-us/windows/hardware/gg463006.aspx
81
See http://en.wikipedia.org/wiki/Audio_Compression_Manager#Audio_Compression_Manager
12.2.1
Continuity properties of Waveform Audio File Format (WAV)
Flexibility
• Interoperability: Very high.
• Implementability: Very high.
Quality
• Lossiness: Usually none, unless an unusual codec is used. The files are usually entirely
  uncompressed.
• Precision: Minor issues. Only uses 16 bits per sample up to 44,100 samples per second.
Resilience
• Recoverability: Average. It is a binary format, which although usually uncompressed has no
  specific error detection or recovery features.
• Ubiquity: Very high. WAV files are very widespread.
• Stability: High. Although not formally standardised, WAV files from many years ago are still
  accessible, and support for the format is likely to be found into the foreseeable future.
12.3
Windows Media Audio (.wma)
The Windows Media Audio (WMA) 82 file format is something of a misnomer, in that there are at
least four incompatible formats defined using the same name. In fact, WMA refers to a family of
four audio codecs defined by Microsoft, which are contained in an Advanced Systems Format 83
media container file, whose specification is available. 84 The four codecs defined are:
• Windows Media Audio: The most common codec, released in 1999. It uses lossy compression,
  encoding two channels (stereo) at up to 48,000 samples per second.
• Windows Media Audio Pro: Uses a better (but still lossy) compression algorithm, supporting up
  to 96,000 samples per second and up to eight discrete channels of sound.
• WMA Lossless: A lossless audio codec, designed for archival purposes, supporting up to 96,000
  samples per second with six discrete channels of sound.
• WMA Voice: A lossy codec designed for low-bandwidth voice communication, supporting up to
  22,000 samples a second for a single channel of sound.
However, most WMA audio files encountered use the first codec, which shares the name
Windows Media Audio. The others were defined later, in 2003, and may be encountered in
specialised scenarios, but their use is not particularly common.
The format optionally supports various forms of digital rights management, which can restrict
playback of content except on authorised devices, or only allow playback for a limited time.
Hence, care must be taken with audio in any WMA format to ensure it is not protected by DRM
schemes if the audio must be reliably accessed into the future.
82
See http://en.wikipedia.org/wiki/Windows_Media_Audio
83
See http://en.wikipedia.org/wiki/Advanced_Systems_Format
84
See http://download.microsoft.com/download/7/9/0/790fecaa-f64a-4a5e-a430-0bccdab3f1b4/ASF_Specification.doc
12.3.1
Continuity properties of Windows Media Audio (WMA)
Flexibility
• Interoperability: High. Many platforms can process wma files, assuming digital rights
  management is not used.
• Implementability: Low. Most programming environments do not include support for WMA files.
Quality
• Lossiness: Mostly lossy. Normally lossy, unless the WMA Lossless codec is used.
• Precision: Minor issues. Quality can vary according to the codec used.
Resilience
• Recoverability: Low. The format is complex and few tools exist to process it.
• Ubiquity: High. WMA files are widespread.
• Stability: Below average. The ASF format specification is available, but is not standardised.
  The WMA codec specifications are harder to acquire, and are also not standardised.
12.4
MPEG Layer 3 Audio (.mp3)
The MPEG Layer 3 Audio (MP3) 85 file format is the de facto standard for consumer-grade digital
music playback, being supported in almost all playback devices. It was defined by the Moving
Picture Experts Group (MPEG) as part of the original MPEG-1 standard, and updated in the
MPEG-2 standard. It was standardised as ISO 11172-3:1993 in 1993, and later as ISO 13818-3:1995
with some additions. It can support two channels of audio at up to 48,000 samples per
second in MPEG-1 mode, and up to 6 channels (5.1 audio) in MPEG-2 mode.
85
See http://en.wikipedia.org/wiki/MP3
MP3 uses a lossy compression algorithm to achieve small file sizes, which discards part of the
audio signal which human ears cannot easily distinguish, particularly when a lower tone
obscures the perception of a higher one.
The amount of loss is configurable by setting the ‘bit-rate’ of the format, where a higher
bit-rate gives a better quality output. Many MP3 files are encoded at a bit-rate of 128 kbit/s,
but 192 kbit/s or higher is not uncommon. In general for continuity purposes, unless space is a
prime consideration, a higher bit-rate should be preferred.
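The trade-off between bit-rate and storage can be estimated with simple arithmetic. The following is a minimal sketch only, assuming a constant bit-rate expressed in kbit/s and ignoring metadata tags and frame padding; the function name `mp3_size_bytes` is hypothetical and not part of any library.

```python
def mp3_size_bytes(bitrate_kbps: int, duration_seconds: float) -> int:
    """Approximate size of constant-bit-rate audio, ignoring tags and padding."""
    total_bits = bitrate_kbps * 1000 * duration_seconds
    return int(total_bits // 8)  # 8 bits per byte


if __name__ == "__main__":
    # Compare common bit-rates for a four-minute recording.
    for rate in (128, 192, 320):
        size_mb = mp3_size_bytes(rate, 4 * 60) / 1_000_000
        print(f"{rate} kbit/s for 4 minutes: about {size_mb:.2f} MB")
```

On these assumptions, moving from 128 kbit/s to 192 kbit/s increases storage by half, which is usually a modest cost relative to the quality retained.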
Since the codec of an MP3 file is part of the format, there are no additional codec risks with
MP3, other than that the MP3 algorithm itself is the subject of patents, which may require license
fees to be paid if implemented in software. MP3 files do not have any digital rights management
issues, allowing the unrestricted playback or modification of content. Note, however, that since
MP3 uses lossy compression, it should not be used if the audio needs to be edited: each time it is
resaved after a change, more information will be discarded.
12.4.1
Continuity properties of MPEG Layer 3 (MP3)
Flexibility
Interoperability
Very high. Almost all platforms can process MP3 files.
Implementability Very high. Almost all programming environments can process
MP3 files.
Quality
Lossiness
Lossy. The MP3 format discards parts of the audio signal
which human ears cannot normally distinguish. The amount of
loss is configurable.
Precision
Resilience Recoverability
No issues.
Above average. Many tools can access and process audio
data in MP3 format. It is possible to recover audio in the face of
local corruption.
Ubiquity
Very high. It is the de facto format for consumer music
playback.
Stability
Very high. The format is standardised and has been in use for
nearly two decades.
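Recovery from local corruption is possible because MP3 audio is stored as a sequence of frames, each beginning with a recognisable header. The sketch below is illustrative only and is not taken from this guidance: it assumes the standard frame header layout (eleven set sync bits followed by version and layer fields, with the layer bits `01` denoting Layer 3), and the function name is hypothetical. It finds candidate frame starts only; a real decoder would also validate the bit-rate and sample-rate fields and confirm that the next frame follows where expected.

```python
def find_mp3_frame_syncs(data: bytes) -> list[int]:
    """Return offsets that look like the start of an MPEG Layer 3 frame header."""
    offsets = []
    for i in range(len(data) - 1):
        # Eleven sync bits: one full 0xFF byte plus the top three bits of the next.
        if data[i] == 0xFF and (data[i + 1] & 0xE0) == 0xE0:
            layer_bits = (data[i + 1] >> 1) & 0x03
            if layer_bits == 0b01:  # 01 means Layer 3
                offsets.append(i)
    return offsets


if __name__ == "__main__":
    # Three bytes of junk (standing in for corruption) before a plausible header.
    damaged = b"\x00\x13\x37" + b"\xff\xfb\x90\x00" + b"\x00" * 4
    print(find_mp3_frame_syncs(damaged))
```

A tool can skip the damaged bytes and resume decoding from the next candidate offset, which is why corruption tends to affect only a short stretch of audio.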
12.5
Advanced Audio Coding (.aac)
The Advanced Audio Coding (AAC) 86 file format is a lossy audio file format designed to achieve
better quality than MP3 for similar file sizes, and includes a number of advanced features. It can
support up to 48 channels of audio, each at up to 96,000 samples per second. It includes
support for error detection and correction within the encoding. Note that AAC is a method of
encoding audio; AAC-encoded audio must be contained in one of various standardised audio
‘container’ formats, including MP4 87, 3GP 88 and other ISO-based media formats. 89
86
See http://en.wikipedia.org/wiki/Advanced_Audio_Coding
It was first standardised as part of the MPEG-2 specification in 1997, as ISO 13818-7:1997. It
was subsequently updated in 1999 as part of the MPEG-4 specification, as ISO 14496-3:1999.
Further additions have been made in 2000 (ISO 14496-3:1999/Amd 1:2000), 2003 (ISO
14496-3:2001/Amd 1:2003), 2004 (ISO 14496-3:2001/Amd 2:2004), 2005 (ISO 14496-3:2005/Amd
2:2006), with the latest being in 2009 (ISO 14496-3:2009).
It is the default audio encoding for the Apple range of consumer hardware and software,
including iPhone, iPad and iTunes. While AAC files do not themselves have any digital rights
management (DRM) built into the specification, it is possible to add DRM to the format. For
example, some AAC files in iTunes are protected by a DRM scheme called FairPlay. 90 Care
must be taken with AAC files to ensure that you can access content which you own for as long
as you need to, and that any DRM restrictions will not prevent the access you need.
12.5.1
Continuity properties of Advanced Audio Coding (AAC)
Flexibility
Interoperability
High. Support for AAC encoded files can be found on many
platforms.
Implementability Average. Support for AAC encoded files can be found in some
programming environments.
Quality
Lossiness
Lossy.
Precision
No issues.
Resilience Recoverability
High. The encoding has explicit support for error detection and
correction, which can be applied flexibly within a file.
Ubiquity
High. It is the default encoding for Apple’s consumer products.
Stability
Above average. While it is standardised, there have been
many revisions to the standard. It is unclear whether there will
be many more. However, support for existing AAC encoded
files should be found into the immediate future.
87
See http://en.wikipedia.org/wiki/MP4
88
See http://en.wikipedia.org/wiki/3GP
89
See http://en.wikipedia.org/wiki/ISO_base_media_file_format
90
See http://en.wikipedia.org/wiki/FairPlay_%28DRM%29
13.
Video
13.1
Introduction
There are many video formats in existence, designed to support differing qualities of video and
audio. Video takes up an extremely large amount of space, so without exception all the formats
described here use lossy compression to reduce the data to manageable (if still large) volumes.
This makes them unsuitable for work which involves repeated changes to the video picture, as
each time they are changed and saved, more quality is lost. Most video formats also include
audio, which may share common codec (compression-decompression) algorithms
with audio-only formats (see section 12).
Video formats described here include:
• Moving Picture Experts Group (MPG, MPEG): see section 13.2
• Windows Media Video (WMV): see section 13.3
• Audio Video Interleave (AVI): see section 13.4
• Flash Video (FLV): see section 13.5
13.1.1
Scaling risks
Video, like raster images (see section 10.1.1), has natural dimensions of width and height in
pixels. If a video is scaled up to a higher resolution, or downscaled to a lower resolution, then
the video can appear blurred, areas of high contrast in the video (such as sharp lines) can be
lost, or flickering can occur as different frames of the video discard slightly different parts of the
image. In general, video should be kept at as high a resolution as possible, with lower quality
versions being produced to fill particular needs (e.g. delivery on the web).
13.1.2
Codec risks
A ‘codec’ refers to an algorithm used to compress and decompress the video or audio data.
Most video codecs are ‘lossy’, in that they intentionally discard data to reduce the file size.
A particular risk of codecs is knowing which codec is actually being used. Many video file
formats allow many different codecs to be used within them, and this is not evident from the file
extension, which simply tells you which video file container format is being used, not the codec.
Although it is possible for dedicated video software to determine the codec in use (otherwise it
could not play back the video), it is harder for information managers to acquire this information,
which may create risk of unusual or older codecs remaining in use in older video files.
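The first step in assessing a video file is to confirm which container format it really is, since a file extension can be wrong or missing. The following is a minimal sketch, not a complete identification tool: it checks the leading bytes of a file against three well-known signatures (the ASF header object GUID, the RIFF/AVI marker, and the MPEG pack and sequence start codes). These signatures come from the respective format specifications rather than from this guidance, and the function names are hypothetical. It identifies only the container; the codec inside must be determined separately.

```python
# ASF header object GUID as it appears on disk (used by WMA and WMV files).
ASF_HEADER_GUID = bytes.fromhex("3026b2758e66cf11a6d900aa0062ce6c")

# MPEG-1/2 program stream pack start code and video sequence header code.
MPEG_START_CODES = (b"\x00\x00\x01\xba", b"\x00\x00\x01\xb3")


def sniff_container(head: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if head.startswith(ASF_HEADER_GUID):
        return "ASF"
    if head[:4] == b"RIFF" and head[8:12] == b"AVI ":
        return "AVI"
    if head[:4] in MPEG_START_CODES:
        return "MPEG"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of a file to apply sniff_container."""
    with open(path, "rb") as handle:
        return sniff_container(handle.read(16))
```

A result of "unknown" means only that none of these three signatures matched; dedicated format identification tools recognise many more.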
13.1.3
Digital rights management risks
Some video file formats use ‘Digital Rights Management’ (DRM) to protect the content from
copyright infringement, or to otherwise control the use of the content. By necessity, DRM
encrypts the content of the video file, preventing its use without a key to unlock the
content. Because of this, all video files with DRM carry a very high continuity risk. In order to
facilitate legitimate playback of content, the software must have the decryption key available to
it. Unless the DRM scheme requires on-line negotiation, software for off-line use (which includes
most video players) must carry the decryption key within the software client itself.
It is often possible to reverse engineer the decryption key. However, there are serious legal
issues with using tools that do so to unlock content protected by DRM schemes unless you are the
legitimate copyright owner. 91
13.2
Moving Picture Experts Group (.mpg, .mpeg)
The Moving Picture Experts Group (MPEG) defined two major video and audio standards with
corresponding file formats: MPEG-1 92 and MPEG-2 93, although both can use the .mpg file
extension. The MP3 audio format (see section 12.4) is also part of the MPEG-1 standard.
After a lengthy development, MPEG-1 was finally approved in 1992 and standardised as ISO
11172 in 1993, with subsequent additions to the same standard being made in 1995 and 1998.
It is intended to encode VHS-tape quality video, and is still in widespread use. MPEG-2 was in
development before MPEG-1 was standardised, and provides higher quality (it is the encoding
used in DVD videos).
MPEG-2 was standardised as ISO 13818 in 1996, with many subsequent additions being made.
MPEG-1 videos are a valid subset of MPEG-2 videos, so software or devices capable of
decoding MPEG-2 videos can automatically decode MPEG-1.
MPEG video uses a lossy codec, and has no built-in digital rights management.
13.2.1
Continuity properties of Moving Picture Experts Group (MPG)
Flexibility
Interoperability
Very high. Almost all platforms can access content in MPEG
format.
Implementability High. Many programming environments have support for the
MPG format, although note that MPEG-2 is subject to patent
restrictions.
91
See http://en.wikipedia.org/wiki/Software_cracking
92
See http://en.wikipedia.org/wiki/MPEG-1
93
See http://en.wikipedia.org/wiki/MPEG-2
Quality
Lossiness
Lossy.
Precision
No issues.
Resilience Recoverability
Above average. The format is complex, but is designed to
work when streaming across networks. Corruption generally
affects only a few frames of the video.
Ubiquity
Very high. MPEG videos are extremely widespread.
Stability
Very high. The formats have been in use for nearly two
decades and are highly standardised.
13.3
Windows Media Video (.wmv)
The Windows Media Video (WMV) 94 file format is something of a misnomer, in that Windows Media
Video refers to a family of codecs, rather than a file format. The codecs are contained in an
Advanced Systems Format 95 media container file, whose specification is available. 96 The three
codecs defined are:
•
Windows Media Video
The most common codec, released in 1999.
•
Windows Media Video Stream
Designed for the capture of live screen content.
•
Windows Media Video Image
A video slideshow codec.
However, most WMV video files encountered use the first codec with the same name –
Windows Media Video. The others may be encountered in specialised scenarios, but their use is
not particularly common. The Windows Media Video codec was first specified in 1999, as
WMV-7. It was subsequently updated to WMV-9, and standardised through the Society of Motion
Picture and Television Engineers (SMPTE) as VC-1. This format is used in both Blu-Ray and
HD-DVD discs.
The format optionally supports various forms of digital rights management, which can restrict
playback of content except on authorised devices, or only allow playback for a limited time.
Hence, care must be taken with video in any WMV format to ensure it is not protected by DRM
schemes if the video must be reliably accessed into the future.
94
See http://en.wikipedia.org/wiki/Windows_Media_Video
95
See http://en.wikipedia.org/wiki/Advanced_Systems_Format
96
See http://download.microsoft.com/download/7/9/0/790fecaa-f64a-4a5e-a430-0bccdab3f1b4/ASF_Specification.doc
13.3.1
Continuity properties of Windows Media Video (WMV)
Flexibility
Interoperability
Very high. Most platforms can process content in the most
common codec.
Implementability Average. Some programming environments have support for
WMV format files.
Quality
Lossiness
Lossy.
Precision
No issues.
Resilience Recoverability
Above average. The format is complex, but is designed to
work when streaming across networks. Corruption generally
affects only a few frames of the video.
Ubiquity
Very high. The format is widespread on the internet and used
as a delivery format for consumer discs like Blu-Ray and HD-DVD.
Stability
High. WMV-9 is standardised as VC-1, but the other variations
are not.
13.4
Audio Video Interleave (.avi)
The Audio Video Interleave (AVI) format 97 was first introduced in 1992 as a proprietary video
and audio container format by Microsoft. In theory, it can contain video and audio encoded
using any codec, but more recent developments in advanced codecs are hard to encapsulate in
it. Hence, AVI files tend to contain video and audio using older codecs. This can present a
continuity risk, as the codecs used by an AVI file are hard to determine without specialised
software. These codecs may themselves be at risk of obsolescence.
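Although the codec is not evident from the .avi extension, it is recorded inside the file. As an illustrative sketch only, the code below looks for a stream header chunk (`strh`) whose stream type is `vids` and returns the four-character handler code that follows it, which usually names the video codec. It assumes the conventional AVI stream header layout and does no validation of the surrounding RIFF structure; the function name is hypothetical. Note that the stream format chunk (`strf`) carries its own compression field, which is often, but not always, the same value.

```python
from typing import Optional


def avi_video_fourcc(data: bytes) -> Optional[str]:
    """Return the handler FourCC of the first video stream header found, if any."""
    pos = data.find(b"strh")
    while pos != -1:
        # Assumed layout: 'strh', 4-byte chunk size, stream type, handler FourCC.
        stream_type = data[pos + 8:pos + 12]
        handler = data[pos + 12:pos + 16]
        if stream_type == b"vids" and len(handler) == 4:
            return handler.decode("ascii", errors="replace")
        pos = data.find(b"strh", pos + 4)
    return None


if __name__ == "__main__":
    # A synthetic fragment standing in for the start of an AVI file.
    sample = (b"RIFF" + b"\x00" * 4 + b"AVI " + b"LIST" + b"\x00" * 4 + b"strl"
              + b"strh" + (56).to_bytes(4, "little") + b"vids" + b"DIVX"
              + b"\x00" * 48)
    print(avi_video_fourcc(sample))
```

Recording this value alongside each AVI file gives information managers a starting point for judging whether the codec is itself at risk of obsolescence.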
The format has no digital rights management issues.
13.4.1
Continuity properties of Audio Video Interleave (AVI)
Flexibility
Interoperability
High. Most platforms can process AVI files. However, note that
availability of the codecs used in an AVI file is the true
measure of interoperability. Some codecs may not be available
on all platforms.
Implementability High. The AVI format is widespread and support for it exists in
many programming environments.
Quality
Lossiness
Mixed. AVI files can use any codec (in theory both lossy and
lossless codecs). In practice, most codecs will be lossy.
97
See http://en.wikipedia.org/wiki/Audio_Video_Interleave
Precision
No issues.
Resilience Recoverability
Unknown. It will largely depend on the choice of codec, in
which most of the information in an AVI file is encoded.
Ubiquity
High. AVI files are widespread, although gradually being
replaced by more modern video formats.
Stability
Below average. While the AVI format itself (as a container of
other data) has not changed, it is not standardised, and there
are several incompatible implementations of various features in
existence. Support for all variations and codecs used in the
format cannot be guaranteed into the future.
13.5
Flash Video (.flv)
The Flash Video (FLV) format 98 is widespread as a delivery mechanism for video on the World
Wide Web. It is a proprietary video container format, created by Adobe Corporation, which allows
the use of various codecs to compress and decompress video and audio data contained within
it. However, the codecs usually used are the Sorenson Spark 99 or VP6 100 video compression
formats, and more recently H.264 video (although this codec is covered by patents). Audio in
Flash videos is usually encoded as MP3 (see section 12.4).
It was first specified in 2003 as the FLV file format (previously, the same video could be
embedded in the Shockwave Flash format, but not standalone as an FLV file). The format was
updated in 2007 to a new container format based on and extending the ISO base media file
format. 101 This is effectively a different file format, but it shares the FLV extension with the
earlier format. Software to decode FLV files must look inside the files to determine what type of
format it actually is.
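Looking inside the file to tell the two container types apart can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes the earlier container begins with the ASCII signature `FLV` followed by a version byte, and that files based on the ISO base media file format carry the marker `ftyp` in bytes 4 to 7, followed by a four-character brand. The function name is hypothetical.

```python
def sniff_flash_video(head: bytes) -> str:
    """Distinguish the original FLV container from an ISO-based one by its first bytes."""
    if len(head) >= 4 and head[:3] == b"FLV":
        return f"FLV version {head[3]}"
    if len(head) >= 12 and head[4:8] == b"ftyp":
        brand = head[8:12].decode("ascii", errors="replace")
        return f"ISO base media (brand {brand})"
    return "unknown"


if __name__ == "__main__":
    print(sniff_flash_video(b"FLV\x01\x05\x00\x00\x00\x09"))
    print(sniff_flash_video(b"\x00\x00\x00\x18ftypisom\x00\x00\x00\x00"))
```

Only the first dozen bytes of a file are needed, so a check of this kind is cheap to run across a large collection.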
Competition currently exists to define a new video standard for internet videos, with various
formats being proposed. There is an ongoing debate on whether internet video should use
non-proprietary, open-standard video formats which do not require license payments to use. 102
98
See http://en.wikipedia.org/wiki/Flash_Video
99
See http://en.wikipedia.org/wiki/Sorenson_Spark
100
See http://en.wikipedia.org/wiki/VP6
101
See http://en.wikipedia.org/wiki/ISO_base_media_file_format
102
See http://en.wikipedia.org/wiki/HTML5_video#Default_video_format_debate
13.5.1
Continuity properties of Flash Video (FLV)
Flexibility
Interoperability
High. Almost all platforms (with the notable exception of the
iPad) can process Flash video.
Implementability High. Many tools and programming environments can process
Flash video.
Quality
Lossiness
Lossy.
Precision
No issues.
Resilience Recoverability
High. The format is designed to support delivery over the
internet, so corruption will generally only affect a few video
frames.
Ubiquity
Very high. The format is the de facto standard for delivery of
video over the internet.
Stability
Below average. The more recent formats are based on a
standardised container, but the extensions are not
standardised. The earlier format is still in use, but Adobe
recommend moving away from it. Support for these formats
cannot be guaranteed except in the immediate future,
particularly if a competing format becomes the new de facto
standard for internet video.