A Guide to Formats

This guidance relates to:
Stage 1: Plan for action
Stage 2: Define your digital continuity requirements
Stage 3: Assess and manage risks to digital continuity
Stage 4: Maintain digital continuity

This guidance is an addendum to our guidance Evaluating Your File Formats.

The National Archives A Guide to Formats Version: 1

© Crown copyright 2011

You may re-use this document (not including logos) free of charge in any format or medium, under the terms of the Open Government Licence. To view this licence, visit http://www.nationalarchives.gov.uk/doc/open-government-licence/open-government-licence.htm; or write to the Information Policy Team, The National Archives, Kew, Richmond, Surrey, TW9 4DU; or email: psi@nationalarchives.gsi.gov.uk.

Any enquiries regarding the content of this document should be sent to digitalcontinuity@nationalarchives.gsi.gov.uk

1. Introduction
   1.1 What is the purpose of this guide?
   1.2 Information sources
2. Plain text
   2.1 Introduction
   2.2 ASCII
   2.3 EBCDIC
   2.4 Unicode
3. Mark-up languages
   3.1 Introduction
   3.2 Hypertext Markup Language (.html, .htm)
   3.3 Extensible Markup Language (.xml)
4. File containers
   4.1 Introduction
   4.2 Zip (.zip)
   4.3 Gzip (.gz)
   4.4 Tar (.tar)
   4.5 OLE2 Compound Document Format
5. Documents
   5.1 Introduction
   5.2 Postscript (.ps)
   5.3 Portable Document Format (.pdf)
   5.4 Open XML Paper Specification (.xps)
   5.5 Microsoft Word 97-2003 (.doc)
   5.6 Open Document Text (.odf .odt)
   5.7 Microsoft Word 2007 (.docx)
   5.8 Microsoft Rich Text Format (.rtf)
6. Spreadsheets
   6.1 Introduction
   6.2 Microsoft Excel 97-2003 (.xls)
   6.3 Microsoft Excel 2007 (.xlsx)
   6.4 OpenDocument Spreadsheet (.ods)
7. Presentations
   7.1 Introduction
   7.2 Microsoft PowerPoint 97-2003 (.ppt)
   7.3 Microsoft PowerPoint 2007 (.pptx)
   7.4 OpenDocument Presentation (.odp)
8. Datasets
   8.1 Introduction
   8.2 Microsoft Access (.mdb)
   8.3 Microsoft Access 2007 (.accdb)
   8.4 Comma Separated Values (.csv)
   8.5 Structured Query Language (.sql)
   8.6 Resource Description Framework (.rdf)
9. Emails
   9.1 Introduction
   9.2 EML (.eml)
   9.3 Microsoft Message (.msg)
   9.4 MBOX (.mbox)
   9.5 Personal Storage Table (.pst)
10. Images (raster)
   10.1 Introduction
   10.2 Windows Bitmap (.bmp)
   10.3 Tagged Image File Format (.tif, .tiff)
   10.4 Graphics Interchange Format (.gif)
   10.5 Portable Network Graphics (.png)
   10.6 Joint Photographic Experts Group (.jpg, .jpeg)
11. Images (vector)
   11.1 Introduction
   11.2 Encapsulated Postscript (.eps)
   11.3 Windows Metafile Format (.wmf)
   11.4 Scalable Vector Graphics (.svg)
12. Audio
   12.1 Introduction
   12.2 Waveform Audio File Format (.wav)
   12.3 Windows Media Audio (.wma)
   12.4 MPEG Layer 3 Audio (.mp3)
   12.5 Advanced Audio Coding (.aac)
13. Video
   13.1 Introduction
   13.2 Moving Pictures Expert Group (.mpg, .mpeg)
   13.3 Windows Media Video (.wmv)
   13.4 Audio Video Interleave (.avi)
   13.5 Flash Video (.flv)

1. Introduction

1.1 What is the purpose of this guide?

To help in evaluating file formats, this guide will present factual information about selected existing file formats, specifically focussing on the digital continuity risks associated with them.

Formats are broken up into a dozen broad groups covering different types of formats. For each of these groups there is a discussion of the general properties and issues, followed by more detail on a sample of specific formats.

There are too many formats to write about all of them, so only a representative sample is presented here. No preference for a file format should be understood by its inclusion. Formats were selected on the basis that they are widely encountered, or that they serve as an exemplar of a general type of format.
This guide will not state whether any given format is better or worse than another, as this can only be determined by evaluating formats in your own context, against your own business needs and technological environment. Formats which work well in one context may be inappropriate in another.

A separate piece of guidance, Evaluating Your File Formats,¹ outlines a process by which you can compare different file formats with one another. This document is an addendum to that guidance and presents information which is useful in following that process, as well as more general discussion around particular file formats.

In particular, the aspects of file formats which will be described here include:

• Resilience
   o Any standardisation of the format
   o How old the format currently is
   o Whether the format is textual or binary
   o Whether the format is compressed, encrypted or otherwise obscured
   o Any other recoverability features in the format
• Quality
   o Any known precision issues with the format
   o Whether the format is ‘lossy’ (i.e. whether it discards information)
• Flexibility
   o Whether software currently exists to programmatically access information in the file format
   o How much existing software can access the file formats on common platforms

¹ See Evaluating your File Formats nationalarchives.gov.uk/documents/information-management/evaluating-file-formats.pdf

It should be emphasised that it is file formats, not software, that are described in this guide. While it is common to refer to file formats by the software most commonly used to create them, file formats in principle are software-agnostic, even if in practice (particularly for complex formats) very few applications can actually access information in the format. The degree to which software can interact with the file formats described here will be assessed, as this can aid in understanding the continuity of file formats.
Note that these assessments are, by their nature, quite subjective and you may reach different assessments when looking at the use of file formats in your own environment. For example, you may decide that for interoperability you are only interested in platforms or applications which appear in your own environment, rather than the full spectrum of support. Nevertheless, the assessments presented here will provide a useful starting point when assessing formats.

1.2 Information sources

Please note that information contained in this guidance may become out of date, as new formats are introduced, further standardisation work is undertaken, or new information comes to light.

The information contained here has been assembled by internet research, primarily using search engines, software vendor web sites, standardisation bodies, industry news sites and Wikipedia.

2. Plain text

2.1 Introduction

Plain text is not technically a file format, in that there is no formal structure (i.e. format) imposed on the content. A text file simply contains any characters a creator wishes, in any order. There are conventions for ending lines, producing layout using tab characters, and other ‘control codes’, but like text itself, these can be used in any way that the creator desires, without any formal structure.

Many other file formats are built on top of text, as it is easy to read and work with, so there is a high degree of flexibility. It is generally trivial to read and write text files in software, subject to the risks outlined below. However, note that if another format is built using text as a base, then reading this other format may be non-trivial, even if reading the text it is based on is easy.

Plain text is generally very resilient to corruption. For most encodings, a change to a single byte alters only a single character, without affecting the rest of the text.
There are no direct quality issues with text (although more complex formats based on text may have them). Text files themselves are not lossy, and have no precision issues.

The only features of plain text which generally need interpreting are the encoding used by the text (how the characters are numerically represented), and how the ends of lines in the text are represented (which can occur in at least two common ways). Both of these are described below.

2.1.1 Encodings

Computers do not understand text directly – they only work on numbers. The encoding of a text file is the method by which different text characters are numerically represented. For example, the letter ‘A’ may be represented by the number 1, ‘B’ by the number 2, and so on.

There are many different possible encodings, some of which are not directly compatible with one another. Encodings differ in at least two principal ways:

1. They may represent different sets of characters from one another, making it impossible to translate between them if characters not shared by both are used.
2. They may use different methods of encoding the same characters, making translation between them possible, but requiring knowledge of which encodings are being used to read and write them correctly.

Encodings frequently found (at least, in the Western world) are:

• ASCII: see section 2.2
• EBCDIC: see section 2.3
• Unicode: see section 2.4

Note that modern encodings (e.g. Unicode) are very broad, encoding almost all known characters, so translation to Unicode text is almost always possible from any given source encoding, but not necessarily vice versa.

The vast majority of text files produced in the UK are ASCII or Unicode UTF-8.

2.1.2 Encoding risks

Loss of encoding knowledge is the principal long-term continuity risk to text, as the encoding used by a text file is not usually defined anywhere in the file itself.
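A minimal Python sketch of this risk, assuming a writer that stored its text in Windows-1252. The sample string and the choice of encodings are illustrative only and are not taken from any particular system. The same bytes are decoded under three different guesses; only one recovers the original text, and one of the wrong guesses fails without raising any error.

```python
# Bytes as a Windows-1252 writer would store them (illustrative sample text).
raw = "£100 café".encode("cp1252")

# Correct guess: the original text is recovered.
as_cp1252 = raw.decode("cp1252")

# Wrong guess (ISO 8859-2): decoding succeeds, but the byte for the
# pound sign now maps to a different character.
as_latin2 = raw.decode("iso8859_2")

# Wrong guess (UTF-8): the bytes at or above 128 are not valid UTF-8 here,
# so each is replaced by U+FFFD, the Unicode replacement character.
as_utf8 = raw.decode("utf-8", errors="replace")

for label, value in [("cp1252", as_cp1252), ("iso8859_2", as_latin2), ("utf-8", as_utf8)]:
    print(label, repr(value))
```

The silent failure is the more dangerous case: nothing signals that the text has changed, which is why recording the encoding alongside the file matters.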
To determine the encoding of a text file, there are a few code libraries available² which can guess the encoding from a sample of the text, but these require custom software development to use and are not always correct.

It is always possible to manually open an individual text file using a text editor, specifying which encoding to use, and to check whether the file opened in that way is readable. Clearly, this approach does not scale up if there are a large number of text files for which the encoding is not known.

Files found together in the same location will frequently (but not always) use the same encoding. If a text file was automatically produced by a piece of software, then it is likely that all the files produced by that software will share a common encoding.

If you discover you have a large number of different encodings in use, you should consider migrating them to a single, modern standardised encoding, such as Unicode UTF-8, assuming your technological environment and business requirements permit this.

Finally, note that some older encodings use ‘code pages’.³ Code pages are essentially national variations on a common base of characters, re-using a few numbers to represent different specifically national symbols. This is done where the encoding scheme does not permit a wide enough range of numbers to represent all the characters needed for all nations at once.

Each code page is similar, but not identical, to other code pages. For example, a French code page may have an encoding for é, while the German variation could use the same number to mean ü. Other common characters will be encoded in the same way in both code pages.

² For example, see International Components for Unicode at http://site.icu-project.org/
³ See http://en.wikipedia.org/wiki/Code_page

A subtle risk is introduced using code pages which are very similar.
For example, the difference between US and UK code pages is very small, varying in only a few symbols, and this difference may not be easily detectable – for example, £ signs may be visually transformed into # symbols, but almost all the other text will be unchanged if opened using the wrong code page.

Hence, knowing the code page (if any) is just as important as knowing the overall encoding. It is helpful to think of code pages as different encodings from each other in the first place (which simply happen to share a common base of characters).

2.1.3 Line ending risks

There are two common ways to encode line endings in text files. Some text files use an invisible Line Feed (LF) control code⁴ to indicate the end of a line, whereas others use a Carriage Return (CR) control code followed by a Line Feed, reflecting old requirements of teletype printing systems.

In general, UNIX-like systems produce text files with only an LF to terminate lines, whereas Microsoft DOS and Windows systems produce text files with CR/LF line endings.

Much software will not correctly process text files whose line endings differ from those it expects. However, it is easy to translate between them, by simply substituting LF for CR/LF and vice versa.

⁴ A control code is an invisible, non-printing character encoded by some number not used for normal text.

2.1.4 Migration risks

When migrating text from one encoding to another, the main risk is not understanding the source encoding, the target encoding, or the characters in your text that you specifically need to migrate.

In general, older encodings such as ASCII or EBCDIC only support a very limited range of characters, or implicitly use code pages to support a greater range of characters. It is very easy to think that you are using one national character set, as most characters are in common, when in fact there is an occasional character which implies a different encoding (code page). For example, your text may appear to be in UK English, when in fact it is encoded as US English. This can lead to some symbols migrating incorrectly when transformed into a wider encoding such as Unicode.

When choosing an encoding to migrate to, you must consider your own technological environment and business requirements. However, all other things being equal, a modern encoding such as Unicode UTF-8 is generally a good choice, as it supports most characters in use today and is backwards compatible with earlier standards like ASCII.

2.1.5 Continuity properties of plain text

Flexibility
   Interoperability: Very high. Line endings may vary between platforms.
   Implementability: Very high. Almost all programming languages can read and write most common text encodings.
Quality
   Lossiness: None.
   Precision: No issues.
Resilience
   Recoverability: Variable. ASCII, EBCDIC, UTF-8 are very high. UTF-16 is average, and UTF-32 is below average.
   Ubiquity: Very high. Almost all software that needs to can read text in common encodings.
   Stability: Very high. Text encodings are highly standardised and survive unchanged for decades.

2.2 ASCII

The American Standard Code for Information Interchange (ASCII)⁵ is very common, and many other encodings are compatible with ASCII. It was first defined in the early 1960s, and is still in widespread use today. However, it only provides a very limited range of characters for the English alphabet. Each character is represented by a single byte, ranging from 0 to 127 in value.

Various attempts to extend ASCII to cover other alphabets by using up to 256 different characters exist. These are often described as Extended ASCII,⁶ but this is not a single standard encoding.

One set of standardised extended ASCII encodings is the ISO 8859⁷ family of encodings.
These provide standard encodings for various language families – for example, ISO 8859-1 for Western European languages and ISO 8859-2 for Eastern European languages. All of the plain ASCII characters are encoded identically in these standards, with the regional variations occupying values of 128 and above.

⁵ See http://en.wikipedia.org/wiki/ASCII
⁶ See http://en.wikipedia.org/wiki/Extended_ASCII
⁷ See http://en.wikipedia.org/wiki/ISO/IEC_8859

One way to determine whether a file is likely to be plain ASCII is to check whether all the bytes in it are less than 128 in value. Generally, text encoded using other standards will include values equal to or above this number.

By design of the Unicode creators, ASCII files are also completely valid UTF-8 files (a form of Unicode encoding – see section 2.4.1). Note that the reverse is not necessarily true, as Unicode can encode far more characters than ASCII.

ASCII files are very resilient, in that a change to a byte, or a loss or addition of a byte, only affects that byte – the rest of the text is never affected by local corruption.

2.3 EBCDIC

EBCDIC⁸ encoding is generally found on IBM mainframe computers or in systems which interact with them. It is similar to ASCII in that it can only represent a limited set of characters, and so uses code pages to extend it to cover other languages. However, it is not compatible with ASCII, and itself has several versions which are not compatible with each other.

It has been in use since the early 1960s, but it is not formally standardised, being a vendor-controlled encoding.

Because this encoding has existed for a long time, it is possible to encounter EBCDIC encoded text files, although this is uncommon outside of an IBM environment.

If possible in your business and technological environment, it is recommended to migrate files out of EBCDIC encodings, as they are not widely used.
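As an illustration only, the following Python sketch converts EBCDIC bytes to UTF-8. It assumes the source text uses the cp037 EBCDIC code page, one of several EBCDIC codecs that ship with Python. The name `ebcdic_to_utf8` is a hypothetical helper introduced here, and a real migration would first need to establish which EBCDIC variant is actually in use.

```python
def ebcdic_to_utf8(data: bytes, code_page: str = "cp037") -> bytes:
    """Decode EBCDIC bytes using the given code page and re-encode as UTF-8.

    The default code page is an assumption for this sketch; choosing the
    wrong variant will silently produce some wrong characters.
    """
    text = data.decode(code_page)
    return text.encode("utf-8")


# EBCDIC and ASCII assign different numbers to the same letter.
letter_in_ebcdic = "A".encode("cp037")
letter_in_ascii = "A".encode("ascii")
print(letter_in_ebcdic.hex(), letter_in_ascii.hex())

# Round trip: build a sample in EBCDIC, then migrate it.
sample = "HELLO".encode("cp037")
converted = ebcdic_to_utf8(sample)
print(converted)
```

Passing the code page as a parameter keeps the decision about the source variant explicit, which matters because the EBCDIC versions are not compatible with each other.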
2.4 Unicode

Unicode⁹ is an international standard for text which allows the representation of most of the writing systems in the world, by allowing a much greater number of characters within it and explicit support for various specialised symbols. It does not need code pages to represent different characters, as the allowable range of numbers in it is large enough to accommodate national variations, special purpose symbols and any other character requirements.

It was first developed in 1987, and has been through regular revisions since then, adding support for increasing numbers of characters and languages.

⁸ See http://en.wikipedia.org/wiki/EBCDIC
⁹ See http://en.wikipedia.org/wiki/Unicode

It is closely related to the ISO/IEC 10646¹⁰ standard in that the characters defined in it are the same in both, but the Unicode standard imposes some additional constraints on how those characters must be processed.

The characters of the ISO 8859-1 encoding represent the first 256 characters of the Unicode standard, to make it easy to convert existing Western European text and ASCII, in which a large number of text files were originally encoded.

However, there are actually several different possible encodings of Unicode text. The Unicode standard itself defines the characters which can be encoded by it (called ‘code-points’); there are then several different ways of actually encoding those characters. A useful comparison of Unicode encodings can be found at: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

2.4.1 UTF-8

The most common encoding is UTF-8,¹¹ which is directly compatible with ASCII (all ASCII text is automatically valid UTF-8, but not vice versa if characters outside the first 128 characters are used). ASCII compatibility is a useful property given the large volume of existing files using ASCII.
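The compatibility described above, together with the "all bytes below 128" test from section 2.2, can be sketched in a few lines of Python. The function name `looks_like_plain_ascii` is a hypothetical helper for this example and the sample strings are illustrative.

```python
def looks_like_plain_ascii(data: bytes) -> bool:
    """Return True when every byte is below 128, i.e. the data could be plain ASCII."""
    return all(byte < 128 for byte in data)


ascii_bytes = "Format facts".encode("ascii")
utf8_bytes = "café".encode("utf-8")

# ASCII bytes decode unchanged as UTF-8.
decoded_as_utf8 = ascii_bytes.decode("utf-8")

# The reverse does not hold once characters outside ASCII appear.
print(looks_like_plain_ascii(ascii_bytes))
print(looks_like_plain_ascii(utf8_bytes))
```

A positive result only shows that the bytes are consistent with ASCII; it does not prove which encoding the creator intended.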
UTF-8 is known as a multi-byte encoding, in that from 1 to 4 bytes are used to encode each character. For values under 128, a single byte is used to encode a character. For any other character, more than one byte is used, and every byte in the sequence has a value of 128 or above, which distinguishes it from single-byte characters.

File sizes of UTF-8 encoded files are often relatively small, as the standard only uses as many bytes as needed for each character, in contrast to other encodings which may use a fixed number of bytes regardless.

UTF-8 also has the useful property of being self-synchronising, meaning that the loss, insertion or corruption of a byte in the text will not generally prevent software from determining where the other characters begin or end. This keeps any problems localised, and the rest of the text readable in the face of errors.

¹⁰ See http://en.wikipedia.org/wiki/Universal_Character_Set
¹¹ See http://en.wikipedia.org/wiki/UTF-8

2.4.2 UTF-16

UTF-16¹² uses groups of 2 bytes to encode what are called ‘code-units’. It usually uses one code-unit to encode a single character, but will sometimes use 2 code-units (4 bytes) to encode a character.

UTF-16 encodings have two variants in terms of the order in which the two bytes of each group are written, termed the ‘endianness’ of the encoding. These variants are termed ‘Big Endian’ and ‘Little Endian’. Some files specify a Byte-Order-Mark (BOM), which is a 2-byte prefix at the start of the file which indicates the endianness of the file. However, this is not mandatory, and many files do not include a BOM.

UTF-16 can handle corruptions to individual bytes, re-synchronising on the next valid Unicode code-point, but the loss of bytes or insertion of additional bytes can cause the succeeding text to become unintelligible.
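The difference in behaviour when a byte is lost can be sketched in Python as follows. This is an illustration under stated assumptions: the sample text is arbitrary, `drop_byte` is a hypothetical helper that simulates the loss of one byte, and the UTF-16 data is little endian with no BOM.

```python
def drop_byte(data: bytes, index: int) -> bytes:
    """Simulate corruption by removing the byte at the given position."""
    return data[:index] + data[index + 1:]


text = "naïve café text"

# UTF-8: byte 3 is the second byte of the two-byte sequence for 'ï'.
# The damaged character is replaced and decoding resumes at the next character.
utf8_damaged = drop_byte(text.encode("utf-8"), 3).decode("utf-8", errors="replace")

# UTF-16 little endian: byte 3 is the second byte of 'a'. Every later
# 2-byte group is now misaligned, so the following text decodes as
# unrelated characters.
utf16_damaged = drop_byte(text.encode("utf-16-le"), 3).decode("utf-16-le", errors="replace")

print(repr(utf8_damaged))
print(repr(utf16_damaged))
```

In the UTF-8 case only one character is affected, whereas in the UTF-16 case everything after the lost byte is affected, which matches the recoverability ratings given in section 2.1.5.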
File sizes of UTF-16 encoded text are reasonably small, but are usually larger than the equivalent text encoded in UTF-8 (depending on which characters appear in the text).

UTF-16 is frequently used internally in software and programming languages to represent Unicode text, and is also found in text files, although it is not as common as the UTF-8 encoding for storage purposes.

2.4.3 UTF-32

UTF-32¹³ is known as a fixed-byte encoding, in that UTF-32 always uses 4 bytes to encode each character. However, since Unicode allows for adjacent characters to be combined in some circumstances, this does not lead to a direct relationship between the number of bytes and the number of displayed characters. The value of a UTF-32 character is the direct numeric value of its corresponding Unicode code-point.

Using 4 bytes per character is much less space efficient than UTF-8 or UTF-16, resulting in much larger file or memory sizes when processing text in this encoding.

UTF-32 can handle corruptions to individual bytes, re-synchronising on the next valid Unicode code-point, but the loss of bytes or insertion of additional bytes can cause the succeeding text to become unintelligible.

¹² See http://en.wikipedia.org/wiki/UTF-16/UCS-2
¹³ See http://en.wikipedia.org/wiki/UTF-32

Hence, UTF-32 is less commonly found in text files, and is more commonly used as an internal representation of Unicode code-points in software.

3. Mark-up languages

3.1 Introduction

Mark-up languages are file formats built on text, which use ‘tags’ inside the format to add additional structure and meaning to the plain text.
For example, we could write:

   <Title>Format facts</Title>
   <Body>To help in evaluating file formats, ...</Body>

Like the text they are based on, they are also fairly resilient to corruption, and can be opened in a common text editor or a specialised markup editor, or processed programmatically using commonly available libraries of code.

Markup languages themselves are not innately lossy and have no precision issues in principle (although a lossy format or one with precision issues could be created using markup).

Almost all markup languages in use today inherit from a specification known as Standard Generalized Markup Language¹⁴ (SGML), which itself is no longer in widespread use.

Markup languages in widespread use include:

• Hypertext Markup Language (HTML): see section 3.2
• Extensible Markup Language (XML): see section 3.3

3.1.1 Schemas

Markup languages define a specific set of tags used to annotate the text. There may be constraints on the valid structures of tags – for example, which ones appear next to one another or how they can be nested within others. The definition of valid tags and their structure is called a schema.

There are many ways to define schemas for markup languages, including Document Type Definitions¹⁵ (DTD), XML Schemas¹⁶ (XSD) and RELAX NG.¹⁷

Schemas provide both a technical level of documentation on how a format defined using markup is constructed, and a way to automatically validate that a markup-format conforms to a specification.

¹⁴ See http://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language
¹⁵ See http://en.wikipedia.org/wiki/Document_Type_Definition
¹⁶ See http://en.wikipedia.org/wiki/XML_Schema_(W3C)
¹⁷ See http://en.wikipedia.org/wiki/RELAX_NG

3.1.2 Continuity risks

Formats based on markup languages can be quite complex.
Since it is also quite easy to define file formats based on markup languages (in particular, XML-based formats are very common), a large number of highly bespoke, and often poorly documented, formats exist. Without knowledge of the schema used to define the format, data encoded in a markup language can be hard, if not impossible, to interpret, even though it remains technically accessible.

In cases where software that provides the necessary layer of processing needs maintenance, or where you otherwise need to programmatically access the data encoded in a markup language, you should ensure that you understand the schemas used. Without this understanding you will be at risk of losing continuity if the software changes or becomes unavailable.

Understanding a schema implies more than simply knowing which schema was used. While schemas provide a base level of technical documentation for a markup-based format, they are generally not sufficient to interpret the meaning of the markup and the data contained within it. For example, if a tag is called ‘<Creator>’, does this refer to a person, or an organisation? Does it matter? Are there any other constraints on valid data which the schema does not capture? You should make sure that you have both schemas and documentation explaining the intended meaning and constraints of the markup.

3.2 Hypertext Markup Language (.html, .htm)

A well-known example of a mark-up file format is HyperText Markup Language [18] (HTML, HTM), used to create web pages. The first specification of HTML was made available by 1992. Various versions of HTML exist, including 2.0, 3.2, 4.0 and 4.01, and version 5 is currently in development. Version 2 is now considered obsolete, as it contains elements which have been dropped in subsequent versions. From version 3.2 onwards, it is standardised through the W3C (World Wide Web Consortium).
While HTML itself is standardised, many technologies interpret these standards slightly differently from one another. In addition, many real-world HTML files do not strictly conform to the standards. This can create situations where an HTML file will render correctly on one platform, but not on another. Increasingly, HTML processing technologies are becoming more consistent, but differences can still be expected.

In addition to the HTML standards, there is also a variant called XHTML [19], which is a version of HTML which is fully compatible with XML (see section 3.3). XHTML is stricter than HTML in the structures it allows, ensuring that all the markup conforms to a precise standard. Theoretically, this would make it easier to write applications which work with HTML, which in practice is often quite loosely written. For example, in real HTML documents, markup tags are not always closed after they are opened. However, XHTML has not been widely adopted on the web, so in practice it is necessary to be able to deal with HTML as well.

Note that HTML files are rarely found on their own; they make reference to external resources (via hyperlinks, or file includes). These external resources may not be HTML themselves: they can be images, video, audio, programming languages (e.g. JavaScript) or style sheets which affect how the HTML displays (e.g. CSS files). All of these external resources can affect how the HTML displays, or even which content is ultimately loaded into it. You should ensure that you understand which external resources are required to use your HTML files into the future, and whether they pose any continuity risks of their own.

[18] See http://en.wikipedia.org/wiki/HTML

3.2.1 Continuity properties of HTM and HTML

Flexibility
    Interoperability: Very high. All platforms can process HTML data.
    Implementability: Very high. Almost all programming languages can read and write HTML with ease.
Quality
    Lossiness: None.
    Precision: Very high. Note that HTML is a document layout format intended for screens, not print. It does not generally provide ways to represent precise page layouts.
Resilience
    Recoverability: Very high. As long as the text encoding used is recoverable.
    Ubiquity: Very high.
    Stability: Very high. Although new versions of HTML are occasionally introduced, this process is quite slow and backwards compatibility is largely preserved.

3.3 Extensible Markup Language (.xml)

The most common type of markup language in use today after HTML is known as Extensible Markup Language (XML) [20], also standardised by the World Wide Web Consortium (W3C). In fact, SGML and XML are not file formats in their own right. They are a standardised method of creating file formats using markup. XML defines a set of syntax and rules which markup languages must obey to be considered valid XML, but the specific tags and structures used in the XML markup are left up to the format designer. XML files are written using Unicode text.

Almost all recent document formats are created using XML as a base, as are increasing numbers of other formats. The advantages include human readability of the underlying files, ease of extending the information within them and ease of programmatic processing. Almost all (if not all) programming languages can process XML using widely available libraries, and there are many different software applications available to author, edit and transform XML directly.

However, due to the ease of creating formats based on XML, there has been a huge increase in bespoke file formats. Assuring their continuity requires detailed knowledge of the XML schema and other documentation on the meaning of the tags.

[19] See http://en.wikipedia.org/wiki/XHTML
[20] See http://en.wikipedia.org/wiki/XML
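As a minimal sketch of the programmatic access described above, the following Python uses the standard library module xml.etree.ElementTree to parse a small XML document and list the tags it uses. The element names (Record, Title, Creator, Body) and the function name tag_inventory are hypothetical, invented for illustration. Note what the sketch cannot do: it recovers the structure, but nothing in the file says what ‘Creator’ is intended to mean.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical bespoke XML format, invented for illustration only.
SAMPLE = """<Record>
  <Title>Format facts</Title>
  <Creator>The National Archives</Creator>
  <Body>To help in evaluating file formats, ...</Body>
</Record>"""


def tag_inventory(xml_text):
    """Count how often each tag appears in an XML document."""
    root = ET.fromstring(xml_text)
    # iter() walks the root element and all of its descendants.
    return Counter(element.tag for element in root.iter())


inventory = tag_inventory(SAMPLE)
for tag, count in sorted(inventory.items()):
    print(tag, count)
```

An inventory like this can be a useful first step when a schema is missing, but it is no substitute for the schema and documentation themselves.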
Disadvantages of XML compared with binary file formats include larger file sizes, although this can be largely mitigated by compressing the files (they typically compress very well), and difficulty including binary objects (e.g. an image) within an XML document, since XML is a textual format. Binary objects can be converted to a textual representation, but this takes up a lot of space (and XML files do not compress as well if such objects are included).

3.3.1 Continuity properties of XML

Flexibility
    Interoperability: Very high. All platforms can process XML data.
    Implementability: Very high. Almost all programming languages can read and write XML with ease.
Quality
    Lossiness: None.
    Precision: No issues. But note that formats defined using XML may have precision issues.
Resilience
    Recoverability: Very high. As long as the text encoding used is recoverable.
    Ubiquity: Very high. But note that formats defined using XML may not be ubiquitous.
    Stability: Very high. But note that formats defined using XML may not be stable.

4. File containers

4.1 Introduction

File container formats are designed to contain other files. They are often used deliberately to archive files, compress them to save storage space or encrypt them. It may be less well understood that they are also used to support applications which require more than one file-resource to be available from within a single file (e.g. documents with embedded images).

4.1.1 Continuity risks

There is a broad continuity risk arising from the use of container formats in the first place. By placing files in container formats, they become obscured from other information management tools. It is not enough to know you have a zip or a tar file; you need to know what is inside them to manage your digital continuity properly. However, note that this is not a continuity risk to file container formats themselves; it is a continuity risk created by using file container formats.
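One way to find out what is inside a container is to list its contents programmatically. The sketch below, which assumes Python and its standard zipfile and tarfile modules, returns the names of the files held in a zip or tar archive. The function names and the demonstration file names (report.txt, data/table.csv) are hypothetical; the demonstration archives are built in memory so that the example is self-contained.

```python
import io
import tarfile
import zipfile


def list_container(data):
    """Return the names of files held in a zip or tar archive given as bytes."""
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        buffer.seek(0)
        with zipfile.ZipFile(buffer) as archive:
            return archive.namelist()
    buffer.seek(0)
    # The default read mode also detects gzip-compressed tar files.
    with tarfile.open(fileobj=buffer) as archive:
        return archive.getnames()


def make_demo_zip():
    """Build a small zip archive in memory (hypothetical contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("report.txt", "hello")
        archive.writestr("data/table.csv", "a,b\n1,2\n")
    return buffer.getvalue()


def make_demo_tar():
    """Build a small tar archive in memory (hypothetical contents)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        payload = b"hello"
        info = tarfile.TarInfo(name="report.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


print(list_container(make_demo_zip()))
print(list_container(make_demo_tar()))
```

A listing only gives file names. Identifying the formats of the contained files still requires a separate file format identification step.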
In general, file containers tend to be very long-lived formats and widely supported in software. However, note that few container formats are formally standardised; they are typically de facto standards, with freely available specifications. File containers are not innately lossy (they must be able to accurately reconstitute the files contained within them), although most do not preserve all available file system metadata along with the files, and some have precision issues (e.g. around date-times).

All of the file container formats described here are binary rather than textual, which makes them more compact. Some email formats which are based on text encode file attachments textually (e.g. EML, see section 9.1). However, in general, most file container formats are binary. Binary formats tend to be less recoverable than textual formats (as, aside from other considerations, they store information more densely, meaning errors have a correspondingly greater impact). Some container formats include error detection and correction features to aid recoverability, since their role is to hold other files safely.

4.2 Zip (.zip)

The ZIP file format (ZIP) [21] is one of the most widely used file container formats today. It is a binary format, provides good compression of contained files, is fairly fast to compress and decompress, and supports several different compression algorithms (including using no compression). Zip files can be accessed by a wide variety of software and support is found in all programming environments.

It was first created in 1989, and its specification was released into the public domain. It has not been formally standardised, although ISO is currently investigating whether it should produce an ISO standard for the zip specification.

[21] See http://en.wikipedia.org/wiki/ZIP_(file_format)
However, note that the legal status of recent versions of ZIP (particularly 64-bit zip) is not clear, and software support is more limited. These versions give support for strong encryption and file sizes greater than four gigabytes.

File system metadata, such as original file names, folder structure and dates, are usually included in zip files. However, note that the default timestamp in zip files is only accurate to two seconds, so dates and times will often be slightly different when compared to the original file system. Also note that no other file system metadata is preserved by default, including file system permissions.

The zip file format is inherently extensible, and some extensions allow more accurate date-times and some file system permissions to be preserved. Whether these extensions are used or not will depend on the zip software in use. If zip software cannot understand an extension written by different zip software, the standard behaviour is simply to ignore it (while still dealing with what can be understood). This can lead to metadata loss if different zip software is used to zip and unzip files. If recovery of file system metadata is important, you should ensure that the software used to zip files and the software used to unzip them handle the same metadata in equivalent ways.

The zip format provides a measure of integrity protection against corruption, using CRC [22] checksums to detect errors, and it stores two copies of the file directory structure to provide some redundancy. Having a file listing inside the zip file allows access to each file in the zip independently of the others, without having to read the entire zip file. Each file inside the zip file is compressed separately, meaning that corruption which affects one contained file may not affect others, assuming the files can still be properly located within the zip file itself.
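The two-second timestamp precision and the per-file CRC check described above can be illustrated with a short sketch, assuming Python's standard zipfile module. The file names and the function name build_zip are hypothetical. The first entry is given an odd number of seconds so that the rounding is visible, and the corruption is simulated by changing one byte of the first (uncompressed) entry.

```python
import io
import zipfile


def build_zip():
    """Build a two-file zip in memory; entries are stored uncompressed."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # 12:00:01 has an odd seconds value, which the default
        # zip timestamp cannot represent.
        info = zipfile.ZipInfo("a.txt", date_time=(2011, 1, 1, 12, 0, 1))
        archive.writestr(info, "first file")
        archive.writestr("b.txt", "second file")
    return buffer.getvalue()


good = build_zip()

# Timestamp precision: read back the stored date-time of the first entry.
with zipfile.ZipFile(io.BytesIO(good)) as archive:
    stored_time = archive.getinfo("a.txt").date_time
    intact_result = archive.testzip()  # None means every CRC matched

# Simulated corruption: alter one byte of the first file's stored data.
damaged = good.replace(b"first file", b"first fiLe")
with zipfile.ZipFile(io.BytesIO(damaged)) as archive:
    first_bad = archive.testzip()             # name of the first failing entry
    still_readable = archive.read("b.txt")    # the other entry is unaffected

print(stored_time, intact_result, first_bad, still_readable)
```

The check detects that a.txt no longer matches its checksum but cannot restore it, which matches the format's provision of error detection rather than error correction for contained files.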
Tools to repair corrupted zip files are readily available, although repair cannot be guaranteed.

[22] See http://en.wikipedia.org/wiki/Cyclic_redundancy_check

4.2.1 Continuity properties of ZIP file formats

Flexibility
    Interoperability: Very high. All platforms can process zip files, although note that the 64-bit zip format is not so interoperable.
    Implementability: Very high. Code to process zip files exists in all major programming languages, although note that the 64-bit format is not as well supported.
Quality
    Lossiness: Almost none. Files are contained completely losslessly. However, note that file system metadata depends on extensions which may not be supported in all zip software.
    Precision: High. Date/times are only accurate by default to two seconds.
Resilience
    Recoverability: Above average. Provides several different recovery mechanisms which permit the zip file itself to be read in the face of limited corruption, and error detection (but not correction) for the contained files themselves.
    Ubiquity: Very high. The standard (not 64-bit) zip format is extremely widespread, and serves as a basis for many other formats.
    Stability: Very high. Zip files from the 1980s can still be processed by current software. It is likely that support for zip files will continue into the indefinite future. However, note that it is not formally standardised.

4.3 Gzip (.gz)

Gzip (GZ) [23] is a compression format which, unlike the other file containers described here, normally only contains a single file. Where multiple files must be compressed, it is common to first archive them together using the Tar format (see section 4.4) into a single tar file, then to compress the tar file using gzip. It provides good compression and is fast to compress and decompress. The file format was first released in 1992, and the specification is openly available, although it has not been formally standardised.
It was originally created to work around patents (now expired) which existed on other compression algorithms at the time. It consists of a short header, followed by the compressed data, ending with a CRC [24] checksum and the length of the original file. This checksum and original file length provide some error detection in the face of corruption, but recovery options are limited.

File system metadata such as folder structure and permissions are not preserved by gzip. The format header can optionally record the original name of the file and a last-modified time.

It is frequently found on UNIX-like systems, although software to process it on other platforms is widely available. Support for the format in common programming languages is also widespread. While not as full-featured as other file container formats, it follows the UNIX philosophy of doing one job well (compressing a file), leaving bundling files together and preserving file system metadata as tasks for other tools.

[23] See http://en.wikipedia.org/wiki/Gzip
[24] See http://en.wikipedia.org/wiki/Cyclic_redundancy_check

4.3.1 Continuity properties of Gzip file formats

Flexibility
    Interoperability: High. Gzip can be processed on most, if not all, platforms.
    Implementability: High. Code to process gzip files is available in most major programming languages, although it is not as well supported as zip.
Quality
    Lossiness: Almost none. Files are contained completely losslessly. However, note that file system metadata is generally not preserved by the format.
    Precision: No issues.
Resilience
    Recoverability: Average. The gzip format is so simple that it is hard to break the format itself, and easy to repair if the format is corrupt. However, the recoverability of files contained within it is quite low. Corruption can be detected, but not easily fixed.
    Ubiquity: High. The gzip format is very widespread, although it is mostly found on UNIX-based systems.
    Stability: Very high.
    The gzip format has survived unchanged for many years, and support is very likely into the indefinite future. It is not standardised, but the specification is openly available.

4.4 Tar (.tar)

The Tar format (TAR) [25] takes its name from ‘Tape Archive’, and is used to append multiple files sequentially into a single file. It originated in the UNIX operating system, and is still predominantly found on UNIX-like platforms. It was standardised through the IEEE in 1988 as POSIX.1-1988 [26], and in POSIX.1-2001. The POSIX standard is also the international standard ISO/IEC 9945 [27].

It does not compress the files contained, or obscure them in any way. Tar files are not lossy in terms of the files they contain, and do not suffer from precision issues. It is common to find that tar files are themselves compressed using the gzip file format (see section 4.3). Note that the files are written out sequentially, one after another (reflecting its origin in tape archiving), and there is no index of files in a tar file, so knowledge of, and access to, all files in it is not possible without first scanning across the entire tar file.

Some file system metadata is captured by the tar format, including file names, size and the last modified time (stored in numeric UNIX time format). UNIX-style file permissions are also captured, although these will not translate to other platforms.

It provides a simple checksum to detect corruption for each file which is stored. However, the checksum is quite basic, and does not check that the file contents themselves have not been corrupted, only that the metadata block is correct. Hence, recoverability has several different dimensions.

[25] See http://en.wikipedia.org/wiki/Tar_(file_format)
[26] See http://en.wikipedia.org/wiki/POSIX
[27] See http://www.unix.org/version3/iso_std.html
Repairing a corrupted tar file so it can be read can be relatively straightforward, but the individual files within it may be corrupt and irreparable, and this may not be evident. On the other hand, a corruption to one part of a tar file may not affect the recoverability of other files contained within it.

4.4.1 Continuity properties of TAR file formats

Flexibility
    Interoperability: High. Tar can be processed on most, if not all, platforms.
    Implementability: High. Code to process tar files is available in most major programming languages, although it is not as well supported as zip.
Quality
    Lossiness: Almost none. Files are contained completely losslessly. However, note that some file system metadata is not preserved by the format.
    Precision: No issues.
Resilience
    Recoverability: Average. The tar format is simple, with most data in it simply being the contained files as they are, with no encryption or compression. Corruption of metadata headers can be detected, but not fixed.
    Ubiquity: High. The tar format is very widespread, although it is mostly found on UNIX-based systems.
    Stability: Very high. The tar format has survived unchanged for many years, and support is very likely into the indefinite future. It is standardised through the POSIX standard.

4.5 OLE2 Compound Document Format

The OLE2 Compound Document Format [28] is slightly different to the other file container formats presented here, in that it is not used as a consumer container format, and tools to manipulate OLE2 are not widely available. However, it is an important container format, in that it serves as a base container for almost all binary Microsoft file formats. Hence, it is unlikely that anyone will ever need to directly use or choose an OLE2 file format, and so they will have no direct continuity issues with it.
However, to avoid replicating information about OLE2 in all the Microsoft binary format descriptions, some information on this key underlying format is provided here. Programmatic code to access this format can be found, although it is not always well supported on all platforms.

Since OLE2’s role is not to archive files from an external file system, but to allow applications to store and manage multiple resources in a single file, it does not typically preserve file system metadata at all. However, it is possible to set a file date and time for each contained file if required.

OLE2 has a complex internal structure, allowing files and folders to be created within it. It attempts to re-use space as files or folders are changed or deleted, leading to internal fragmentation of its resources (much as files can become fragmented on a disk). While this reduces the space required for formats based on OLE2, it also reduces the recoverability of files based on the format: the contents of different files are interleaved, so the file indexes are always needed to reassemble them. A single corruption to the file can prevent the entire file being read successfully. It provides no built-in error detection or repair.

[28] See http://download.microsoft.com/download/0/b/e/0be8bdd7-e5e8-422a-abfd-4342ed7ad886/windowscompoundbinaryfileformatspecification.pdf

4.5.1 Continuity properties of OLE2 Compound Document Format

Flexibility
    Interoperability: Very low. It is not directly used as a consumer container format. However, applications which build file formats on top of this format may have high interoperability.
    Implementability: Low. Some code to access OLE2 files directly can be found, but it may not be well supported, and may not work in all programming environments.
Quality
    Lossiness: None. All files contained within an OLE2 file are stored losslessly. No file system metadata is preserved.
    Precision: No issues.
Resilience
    Recoverability: Very low.
    Corruption cannot be detected, and a single corruption can prevent all the files within it being read.
    Ubiquity: Very high. The format serves as a base container format for almost all Microsoft binary formats.
    Stability: Very high. The format has not changed in a long time, and being a base for almost all Microsoft binary formats ensures it will remain supported for some time to come.

5. Documents

5.1 Introduction

Document file formats are among the most common types of file format encountered. There is a wide variety of document file formats in use today, which fulfil different needs. This guidance will not describe older document formats no longer in widespread use (although there are many of these).

5.1.1 Document format types

There tends to be a basic division between page-oriented document formats aimed at print-perfect layout and those aimed at user editing. Page-oriented document formats are suitable for publication, but are not suitable where the document needs to be further changed.

Page-oriented formats
• Postscript (PS), see section 5.2
• Portable Document Format (PDF), see section 5.3
• Open XML Paper Specification (XPS), see section 5.4

User-editable formats
• Microsoft Word 97-2003 (DOC), see section 5.5
• Open Document Format Text (ODF, ODT), see section 5.6
• Microsoft Office Open XML (DOCX), see section 5.7
• Microsoft Rich Text Format (RTF), see section 5.8

5.1.2 Complexity risks

Digital documents are often imagined to be quite simple, as they largely consist of text on pages, replicating physical paper documents which are easily understood. However, in reality they are extremely complex file formats. The more complex a format, the harder it is to re-use the data in other contexts, access data in it programmatically, or migrate it to different formats. The risk of vendor lock-in is substantially increased.
Documents may have many different resources embedded within them, including images, video and even audio. Spreadsheets or other complex formats may also be directly embedded within them. They may have programmatic code (e.g. ‘macros’), which performs tasks on the content or accesses external data sources. Typically, programmatic code embedded in documents does not survive migration to other formats, as the code language is usually non-standard and heavily oriented towards the primary creating application.

Some user-editable document formats track changes to the content (but usually not all kinds of content), and allow review and commenting of the content by different parties. User-defined fields may exist to contain defined data (e.g. to support mail-merge functionality). Many document formats have specifically defined fields to hold user metadata, such as the author of a document. They may also have embedded dependencies on external data (e.g. a link to another file on a disk, which can break if either file is moved), and cross-links within the document which can also break.

Some features of document file formats only exist to preserve backwards compatibility with documents written in earlier formats. While this mitigates some continuity risks, it also further increases the complexity of the formats going forwards.

5.1.3 Migration risks

All document migration carries risk, due to the complexity of document formats. It is entirely normal that a document migration will lose or change some features of the original, unless the document is very simple. In many cases, the change or loss can be quite minimal and may not be considered vital (e.g. the style of a heading changes slightly). However, it is essential that all document migrations are tested thoroughly on a selected set of candidate documents, to ensure that essential features are not lost in the process.
Document migration can be largely separated into three broad types of migration, which typically carry different risks:

• within a family of file formats (e.g. Microsoft Word 95 to Microsoft Word 97-2003)
• across format families (e.g. Microsoft Rich Text Format to OpenDocument Text 1.1)
• from a user-editable to a page-layout format (e.g. OpenDocument Text 1.1 to PDF 1.7).

Within a family of file formats

Upgrading within a family of file formats generally poses few direct continuity risks, as most file formats are specifically engineered to be backwards-compatible with earlier versions of the ‘same’ format. However, migration is never risk free, and some small changes to documents may be found (e.g. styles and formatting may change). By contrast, downgrading to earlier versions may entirely lose formatting, embedded objects, programmatic code or other advanced features, depending on what is supported in the earlier versions. The textual content itself is usually preserved when downgrading.

Across format families

Migrating from one broad type of document file format to an entirely different one poses the highest direct continuity risks. No two broad families of document file format support exactly the same features, in the same ways, so some change and loss to a document should be expected.

For example, the Microsoft family of document formats fundamentally manages the pagination of documents (replicating a paper-model of documents), whereas the OpenDocument Text family of formats largely leaves pagination up to the rendering software (given that it is a digital document which may be printed or displayed at different sizes), and does not therefore store this information in the format. Therefore translation between them may produce pagination changes.

In general, migration between recent versions of most document formats will produce documents which are still readable, but with some formatting changes.
However, advanced features such as embedded programming (‘macros’) and change-tracking will often not survive the process.

From user-editable to page-layout

A frequent use-case in document workflows is taking a user-editable document and migrating it to a page-layout format, either for publication or archiving. This process will generally produce a high-quality output document which preserves the layout and styles of the original. However, all advanced interactive features will generally be lost (since this is the fundamental difference between user-editable and page-layout formats).

Some page-layout formats may faithfully replicate the look of a document, but may incidentally lose other features that are still required. For example, the PDF format can store text in a way which can be rendered absolutely accurately on screen or paper, but is not electronically searchable. If the ability to copy and paste out of the document is important, attention should be paid to how the text can be further manipulated in the page-layout format. Some page-layout formats make it hard to select and copy text out of them (e.g. columns are not properly wrapped, mixing up text from several columns when it is selected out of the document).

While page-layout formats are very useful for human readability of documents, it is normal that some form of digital access to the content will still be required. Special attention should be paid to the features used in the page-layout format and your business requirements for ongoing use of the information.

5.2 Postscript (.ps)

Postscript (PS) [29] is one of the oldest page layout formats, which has its origin as a printer page specification language, developed by Adobe Systems and first issued in 1984.
It is also used widely to publish electronically, particularly for academic papers, although Portable Document Format (PDF) is now supplanting it for most purposes. Postscript is a textual format, although not a mark-up language, consisting of a series of programmatic commands to lay out graphics and text.

Postscript can only handle numbers up to a precision of nine decimal digits, so calculations made using its programming language can produce rounding errors. Most people will not encounter this issue if simply saving documents in a postscript format. However, advanced users of postscript should be aware of this limitation in the format.

It is not an international standard, although it has the status of a de facto standard, as it is still in widespread use and there are many legacy documents written in it. There are three versions of Postscript (level 1, level 2 and version 3), and the specification is freely available from Adobe Systems. A large variety of software can read and produce postscript documents, on most computer platforms.

[29] See http://en.wikipedia.org/wiki/PostScript

5.2.1 Continuity properties of Postscript

Flexibility
    Interoperability: High. Postscript is readable on all platforms.
    Implementability: High. Code to manipulate postscript can be found in most programming environments.
Quality
    Lossiness: None.
    Precision: Some issues. Numbers are only represented to a precision of nine decimal digits, potentially creating rounding errors if calculations are performed using the postscript programming language.
Resilience
    Recoverability: Average. Being a textual format, small corruptions to postscript files will often not prevent the file being opened, but no specific error detection or recovery mechanisms are part of the format.
    Ubiquity: Very high. Postscript files are very widespread, and are still in active use, but note that many early uses of postscript are being replaced by PDF.
    Stability: Very high. Postscript files are largely unchanged since they were first specified, and support for the format is likely to be found into the foreseeable future.

5.3 Portable Document Format (.pdf)

Portable Document Format (PDF) [30] is an extremely widely used format for electronic publishing, also created by Adobe Systems. PDF consists of a subset of Postscript (see section 5.2), along with other technologies for embedding fonts and storing additional data. Although much of the content of a PDF file can appear as text, it is a binary format and includes support to compress parts of the data it stores, and to encrypt its contents. Therefore a PDF file may be more or less recoverable depending on exactly how the particular file was written out.

Although initially a closed, proprietary format, it was made an open international standard, ISO 32000-1:2008, in 2008, which anyone may implement freely without payment of royalties. PDF files are accessible on almost every platform, there is a huge range of software which can read them, and a substantial body of software which can create them, although because it is a page-oriented format, it is often not easy or possible to edit them once created. Many Software Development Kits are available to manipulate PDF files on all major platforms.

There are nine separate versions of the PDF specification dating back to 1993, the most recent being released in 2009. PDF is now a very complex standard, including many features which go beyond a simple page layout specification. For this reason, targeted subsets of the PDF standard have been defined, simplifying and removing unnecessary features, standardised under the International Organization for Standardization (ISO).
These are:

• PDF/X for the printing and graphic arts (ISO 15930)
• PDF/A for archiving documents (ISO 19005)
• PDF/E for exchange of engineering drawings (ISO 24517)

5.3.1 Continuity properties of PDF

Flexibility
  Interoperability: Very high. PDFs can be accessed on all platforms.
  Implementability: Very high. Code to read and write PDFs is available for most programming environments.
Quality
  Lossiness: None. The PDF format does not discard information given to it. However, you may lose functionality when moving from a user-editable format to a page-oriented format.
  Precision: None.
Resilience
  Recoverability: Average. PDF is a binary format, although much of its content can appear directly as text which, if changed, would not prevent the file being accessed. Sometimes the content can be compressed or encrypted, which reduces its recoverability.
  Ubiquity: Very high. PDF files are found on all platforms and have been around for a long time.
  Stability: High. The format is an international standard, but note that there are many different versions and subsets of it defined, and more may be defined in future.

30 See http://en.wikipedia.org/wiki/Portable_Document_Format

5.4 Open XML Paper Specification (.xps)

The Open XML Paper Specification (XPS) 31 is an XML-based page layout specification format created by Microsoft and later standardised through Ecma International as ECMA-388 in 2009. It consists of XML files (see section 3.3) and other media resources contained in a zip format (see section 4.2) archive file. Since the file is compressed, damage to it can prevent it from being opened, so recoverability in the face of corruption may be limited. However, note that there are zip repair tools available which may make it possible to recover a corrupted XPS file.
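The corruption risk described above applies to any zip-based container, and it can be checked before a repair tool is needed. The following is a minimal sketch, not part of the original guidance, using only the Python standard-library zipfile module. The function name check_container is hypothetical, and the sketch only detects damage; it does not repair anything.

```python
import zipfile


def check_container(path):
    """Return (ok, detail) for a zip-based container such as an XPS file.

    check_container is a hypothetical helper name used for illustration.
    """
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and returns the name of the
            # first one that fails its CRC-32 check, or None if all pass.
            bad_member = archive.testzip()
    except zipfile.BadZipFile as exc:
        # The archive structure itself could not be read.
        return False, f"not a readable zip archive: {exc}"
    if bad_member is not None:
        return False, f"corrupt member: {bad_member}"
    return True, "all members passed their CRC check"
```

A file that fails this check is a candidate for a zip repair tool; a file that passes may still contain XML that an application cannot interpret, so this is a first test rather than a full validation.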
This format is not in widespread use as an electronic publishing format, but XPS files are supported natively on Microsoft Windows Vista, being part of its printing system. Viewers, converters and Software Development Kits are available on other versions of Windows, and on some other platforms including Mac OS X and Linux, although support on these platforms is not as well developed.

5.4.1 Continuity properties of XPS

Flexibility
  Interoperability: Low. It is mostly only supported on recent Microsoft Windows platforms, although software to access it on other platforms can be found.
  Implementability: Low. Code to manipulate this format is not widely found in programming environments.
Quality
  Lossiness: None.
  Precision: None.
Resilience
  Recoverability: High. The format is an XML-based format, meaning small errors may only produce small content changes, or errors which are easily fixable. However, no specific error detection or correction is included in the format.
  Ubiquity: Low. The format is mostly only found on recent Microsoft Windows platforms.
  Stability: High. Even though the format is not widely used outside of recent Microsoft Windows platforms, support for it is likely to be found for many years into the future. It has been standardised through Ecma International.

31 See http://en.wikipedia.org/wiki/Open_XML_Paper_Specification

5.5 Microsoft Word 97-2003 (.doc)

The Microsoft Word 97-2003 (DOC) 32 format is the de facto standard for user-editable business documents in use today. As its name suggests, it first appeared in 1997, and was used as the default document format until 2003, after which several new formats appeared. It is still supported by all major user-editable document software on all platforms.

The format has not been formalised through a standards body, but the specification is now made available by Microsoft, and it is largely supported on almost every platform.
Application Programming Interfaces and Software Development Kits are widely available. However, the format has several advanced features which are fully supported only on Microsoft platforms, including programmatic scripts and macros. In addition, DOC files can embed other objects which may require additional software to be installed to access them.

It is a binary format, consisting of various document resources embedded in an OLE2 container format (see section 4.5). OLE2 files can be hard to recover in the face of corruption, as they have a complex and fragmented internal structure.

5.5.1 Continuity properties of Microsoft Word 97-2003

Flexibility
  Interoperability: Very high. Almost all platforms can read and write this format.
  Implementability: High. Many programming environments can access information in this format.
Quality
  Lossiness: None.
  Precision: No issues.
Resilience
  Recoverability: Low. The binary OLE2 format on which it is based is hard to recover in the face of corruption, and there are not many tools to do so.
  Ubiquity: Very high. The format is found almost everywhere.
  Stability: Very high. Although not formally standardised, its status as a de facto standard ensures that support for the format will be found many years into the future.

32 See http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx

5.6 Open Document Text (.odf .odt)

The Open Document Text Format (ODF/ODT) is a user-editable document format, consisting of various XML-based files (see section 3.3) and other document resources, such as image files, contained in a zip file (see section 4.2). Since the file is compressed, damage to it can prevent it from being opened, so recoverability in the face of corruption may be limited. However, note that there are zip repair tools available which may make it possible to recover a corrupted ODF file.
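Because an OpenDocument file is an ordinary zip archive, its type can be confirmed without opening it in an office application. The sketch below is illustrative only. It assumes the package carries the conventional `mimetype` entry that OpenDocument packages normally include, and the function name identify_odf and the partial lookup table are hypothetical choices made for this example.

```python
import zipfile

# A partial list of OpenDocument media types, chosen for illustration.
ODF_TYPES = {
    "application/vnd.oasis.opendocument.text": "ODT (text document)",
    "application/vnd.oasis.opendocument.spreadsheet": "ODS (spreadsheet)",
    "application/vnd.oasis.opendocument.presentation": "ODP (presentation)",
}


def identify_odf(path):
    """Return a description of the OpenDocument type, or None if unknown.

    identify_odf is a hypothetical helper name, not a standard API.
    """
    with zipfile.ZipFile(path) as package:
        if "mimetype" not in package.namelist():
            return None  # No mimetype entry: cannot identify this way.
        raw = package.read("mimetype")
    media_type = raw.decode("ascii", "replace").strip()
    return ODF_TYPES.get(media_type)
```

Relying on the content of the package rather than the file extension is useful when files have been renamed or when the extension (.odf, .odt) is ambiguous.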
Originally created by Sun Microsystems, it is now developed through the Organisation for the Advancement of Structured Information Standards (OASIS). There are two published versions of the standard, 1.0 and 1.1, with a new version, 1.2, near completion. Version 1.0 was also standardised in 2006 through the International Organization for Standardization (ISO) as ISO 26300.

The Open Document family of standards (including documents, spreadsheets, presentations and drawings) is designed to be highly re-usable and interoperable. Software Development Kits and Application Programming Interfaces are widely available on all major platforms. It is possible to access ODF documents in Microsoft Word, Open Office and many other applications, although note that minor changes to formatting may occur when different applications open the same file.

5.6.1 Continuity properties of Open Document Text Format (ODF/ODT)

Flexibility
  Interoperability: High. Most platforms can read OpenDocument Text format.
  Implementability: High. Most programming environments can access information in OpenDocument Text format.
Quality
  Lossiness: None.
  Precision: No issues.
Resilience
  Recoverability: Above average. The zip format it is based on provides several error detection and recovery mechanisms, and the use of XML-based textual content is likewise fairly recoverable in the face of errors.
  Ubiquity: Above average. ODF/ODT files are widely found, but they are not the dominant document format.
  Stability: Very high. The OpenDocument Text format is highly standardised, and backwards compatible with earlier versions.

5.7 Microsoft Word 2007 (.docx)

The Microsoft Office Open XML format 33 (DOCX) is a user-editable document format, consisting of various XML-based files (see section 3.3) and other document resources, such as image files, contained in a zip file (see section 4.2).
Since the file is compressed, damage to it can prevent it from being opened, so recoverability in the face of corruption may be limited. However, note that there are zip repair tools available which may make it possible to recover a corrupted DOCX file.

The specification has been standardised through Ecma International as ECMA-376. DOCX is the default file format for Microsoft Word 2007, and can be read using plug-ins in earlier versions of Microsoft Office. There is some support available on other platforms, which is increasing over time as more documents are exchanged in this format. Application Programming Interface support is still largely confined to Microsoft platforms, although code to access it on other platforms is increasing.

In 2007 Microsoft submitted DOCX to the International Organization for Standardization (ISO). However, the format as implemented in Office 2007 was not agreed for standardisation, as it included many Microsoft-specific legacy technologies which were not deemed suitable for inclusion. The result of this process was two standards, published in 2008, largely based on DOCX but not compatible with it:

• ISO 29500 Transitional
• ISO 29500 Strict

ISO 29500 Transitional is intended as an interim standard, to allow migration of legacy Microsoft documents, by including features relating to implementation-specific details of earlier versions of Microsoft Office. Note that the ISO committee reserves the right to remove the ‘Transitional’ set of features from the standard at some point in the future. Microsoft Office 2010 is the first software to implement read and write support for this variant.

ISO 29500 Strict is intended as a standard for new documents, removing the Microsoft-specific legacy features which were deemed unacceptable. Only read support for the ‘Strict’ variant will be included in Microsoft Office 2010; no software can currently write documents conforming to this standard.
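One rough way to see which variant a given file claims to follow is to look at the XML namespace of its main document part. The sketch below is a heuristic under stated assumptions, not a conformance test: it assumes the main part is stored at the conventional path word/document.xml, and it relies on the namespace URIs published for the Transitional and Strict variants. The function name docx_variant is hypothetical. Note that the first namespace is shared by ECMA-376 files written by Office 2007 and by ISO 29500 Transitional files, so the namespace alone cannot tell those two apart.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the root element in the main document part.
NAMESPACE_VARIANTS = {
    "http://schemas.openxmlformats.org/wordprocessingml/2006/main":
        "ECMA-376 or ISO 29500 Transitional",
    "http://purl.oclc.org/ooxml/wordprocessingml/main":
        "ISO 29500 Strict",
}


def docx_variant(path, part="word/document.xml"):
    """Guess the OOXML variant from the root namespace of the main part.

    docx_variant and the default part path are assumptions for this sketch.
    """
    with zipfile.ZipFile(path) as package:
        with package.open(part) as stream:
            root = ET.parse(stream).getroot()
    # ElementTree writes a qualified tag as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace = root.tag[1:].split("}", 1)[0]
    else:
        namespace = ""
    return NAMESPACE_VARIANTS.get(namespace, "unknown")
```

A result of "unknown" means only that the root namespace was not one of the two listed; it does not by itself indicate a damaged or invalid file.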
At present, DOCX documents are not ISO 29500 documents, although they are valid ECMA-376 documents. There is a proposal before the ISO committee to amend the ‘Transitional’ standard so that existing DOCX files become compatible with it.

5.7.1 Continuity properties of DOCX

Flexibility
  Interoperability: High. Most platforms can read DOCX files.
  Implementability: Average. Software to programmatically access information in DOCX files is mostly confined to the Microsoft platform, although support in other environments is growing.
Quality
  Lossiness: None.
  Precision: No issues.
Resilience
  Recoverability: Above average. The zip format it is based on provides several error detection and recovery mechanisms, and the use of XML-based textual content is likewise fairly recoverable in the face of errors.
  Ubiquity: Above average. DOCX files are widely used, but they are not the dominant document format, which is still DOC.
  Stability: Unclear. Although subject to several different standardisation processes, these have not resulted in a single standard, and instead produced several different and incompatible standards, which are not yet supported in software. The future status of the format is unclear: it may be replaced by one of the newer standards, or the standards may be changed to make existing documents compatible with them. However, since there are a large number of files encoded in the current format, support for it is likely to be found into the near future.

33 See http://en.wikipedia.org/wiki/Office_Open_XML

5.8 Microsoft Rich Text Format (.rtf)

Microsoft Rich Text Format (RTF) 34 is a widely used document format developed by Microsoft in 1987. It has limited features compared with more recent formats, but is implemented on all major platforms and can serve as a simple document interchange format.
RTF is a textual format, so recoverability in the face of corruption is reasonably good. It consists of a series of nested brackets and control codes surrounding the text, so it is essentially a markup language (see section 3).

It has not been standardised through a formal body, although the specifications are freely available from Microsoft. There are ten major versions of the format in existence, the earliest (version 1.0) being issued in 1987, and the most recent (version 1.9.1 35) being published in 2008. It is not possible to determine which version of RTF is being used without analysing all of the features contained in a given document, as the documents themselves do not specify the version being used. In the past this has made it hard to fully support RTF without continual maintenance, as the specification was a moving target. However, Microsoft does not now anticipate making further substantive changes to the last specification.

5.8.1 Continuity properties of RTF files

Flexibility
  Interoperability: Very high. RTF files can be accessed on most platforms.
  Implementability: Very high. Programmatic access to the RTF format is found in most programming environments.
Quality
  Lossiness: None.
  Precision: No issues.
Resilience
  Recoverability: High. The format is a simple textual mark-up-like format, although it does not provide any specific error detection or recovery mechanisms.
  Ubiquity: Very high. RTF files are widely found and still in active use as a simple document interchange format.
  Stability: High. Although not formally standardised, the specifications are openly available, and Microsoft has indicated that it does not intend to make further changes to the specification.

34 See http://en.wikipedia.org/wiki/Rich_Text_Format
35 See www.microsoft.com/downloads/en/details.aspx?familyid=dd422b8d-ff06-4207-b476-6b5396a18a2b&displaylang=en&tm

6. Spreadsheets

6.1 Introduction

Spreadsheets are ubiquitous in business, having expanded from their primary role as number-crunchers to become a convenient way of organising tabular (and often non-numeric) structured data. Spreadsheet formats are not as numerous as document formats, although there have been many since their first widespread use in VisiCalc in 1979. The formats described here include:

• Microsoft Excel 97-2003 (XLS), see section 6.2
• Microsoft Excel 2007 (XLSX), see section 6.3
• OpenDocument Spreadsheet (ODS), see section 6.4

Spreadsheets are not lossy in any way, although all have precision issues of some degree, since they are primarily intended to compute numbers. The degree of precision supported by each spreadsheet format (and the software which processes them) determines how large any unavoidable rounding errors may be.

6.1.1 Complexity risks

Modern spreadsheets, like documents, carry text, formatting, embedded objects (e.g. images), links to external resources and embedded programming languages (e.g. macros). Again, like documents, they often have features intended to preserve backwards compatibility with older formats, which mitigates some continuity risk while increasing the complexity going forward.

6.1.2 Migration risks

The migration risks of spreadsheets are also similar to those of documents, in that there are three common migration use-cases:

• within a family of file formats (e.g. Microsoft Excel 95 to Microsoft Excel 97-2003)
• across format families (e.g. Microsoft Excel 2007 to OpenDocument Spreadsheet 1.1)
• from a spreadsheet to a page-layout document format (e.g. OpenDocument Spreadsheet 1.1 to PDF 1.7).

Note that a generic risk in moving from any spreadsheet file format to another spreadsheet file format lies in the number of rows and columns supported in the format. Early spreadsheet formats are often limited, supporting (for example) only 65,000 rows.
Modern spreadsheet formats typically support 250,000 rows or more. For many spreadsheets this will not be an issue, but for spreadsheets in which large amounts of tabular data have been compiled, you should check whether you will exceed the row or column limit for the format you are migrating to.

Within a family of file formats

Upgrading within a family of file formats generally poses few direct continuity risks, as most file formats are specifically engineered to be backwards-compatible with earlier versions of the ‘same’ format. However, migration is never risk free, and some small changes to spreadsheets may be found (e.g. styles and formatting may change). By contrast, downgrading to earlier versions may entirely lose formatting, embedded objects, programmatic code or other advanced features depending on what is supported in the earlier versions. Many early spreadsheet formats only support a small number of rows and columns, so it may not be possible to downgrade a large spreadsheet without losing data entirely.

Across format families

Migrating from one broad type of spreadsheet file format to an entirely different one poses the highest direct continuity risks. No two broad families of spreadsheet file format support exactly the same features, in the same ways, so you should expect some change and loss to a spreadsheet. In general, migration between recent versions of most spreadsheet formats will produce spreadsheets which are still workable, but with some formatting changes. However, advanced features such as embedded programming (‘macros’) will often not survive the process. More seriously, not all spreadsheets support exactly the same formulae used in calculations, and there are differences in the implementation of some formulae which can produce different results.
However, differences tend to be found in the more complex functions rather than the simple, everyday functions (e.g. sum or count). If the answers to any complex calculations must be preserved as they are, then a review of the compatibility of the functions used must be undertaken.

From spreadsheet to page-layout document

A frequent use-case in business workflows is taking a spreadsheet and migrating it to a page-layout document format, either for publication or archiving. This process will generally produce a high-quality output document which preserves the layout and styles of the original. However, all advanced interactive features will be lost. In particular, any formulae used to calculate values in the sheet will disappear, with only the results of the calculation left in the final output document. If it is important for your audiences to understand how the spreadsheet was calculated, you must either provide these details as an additional piece of documentation, or make the spreadsheet itself available instead of a page-layout document.

Some page-layout formats may faithfully replicate the look of a spreadsheet, but may incidentally lose other features that are still required. For example, the PDF format can store text in a way which can be rendered absolutely accurately on screen or paper, but is not electronically searchable. If the ability to copy and paste out of the document is important, attention should be paid to how the text can be further manipulated in the page-layout format. Some page-layout formats make it hard to select and copy text out of them (e.g. columns are not properly wrapped, mixing up text from several columns when it is selected out of the document).

While page-layout formats are very useful for human readability of documents, it is normal that some form of digital access to the content will still be required.
Special attention should be paid to the features used in the page-layout format and your business requirements for ongoing use of the information.

6.2 Microsoft Excel 97-2003 (.xls)

The Microsoft Excel 97-2003 format (XLS) 36 is the de facto standard for business spreadsheets in use today. As its name suggests, it first appeared in 1997, and was used as the default spreadsheet format until 2003, after which several new formats appeared. It is still supported by all major spreadsheet software on all platforms.

The format has not been formalised through a standards body, but the specification has now been made available by Microsoft, and it is largely supported on almost every platform. Application Programming Interfaces and Software Development Kits are widely available. However, the format has several advanced features which are fully supported only on Microsoft platforms, including programmatic scripts and macros. In addition, XLS files can embed other objects which may require additional software to be installed to access them.

It is a binary format, based on a format called the Binary Interchange File Format (BIFF), consisting of data stored in records describing the spreadsheet. These records, along with other resources such as images, are embedded in an OLE2 container format (see section 4.5). OLE2 files can be hard to recover in the face of corruption, as they have a complex and fragmented internal structure.

36 See http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Excel97-2007BinaryFileFormat(xls)Specification.xps

6.2.1 Continuity properties of Microsoft Excel 97-2003

Flexibility
  Interoperability: Very high. Almost all platforms can read and write this format.
  Implementability: High. Many programming environments can access information in this format.
Quality
  Lossiness: None.
  Precision: High.
Numbers are stored using double-precision floating point, which gives them a precision of about 15 significant decimal digits.
Resilience
  Recoverability: Low. The binary OLE2 format on which it is based is hard to recover in the face of corruption, and there are not many tools to do so.
  Ubiquity: Very high. The format is found almost everywhere.
  Stability: Very high. Although not formally standardised, its status as a de facto standard ensures that support for the format will be found many years into the future.

6.3 Microsoft Excel 2007 (.xlsx)

The Microsoft Office Open XML format (XLSX) 37 consists of various XML-based files (see section 3.3) and other resources, such as image files, contained in a zip file (see section 4.2). Since the file is compressed, damage to it can prevent it from being opened, so recoverability in the face of corruption may be limited. However, note that there are zip repair tools available which may make it possible to recover a corrupted XLSX file.

The standardisation status of the Excel 2007 format is complex, but is the same as for the Word 2007 format, which is fully discussed in section 5.7.1.

37 See http://en.wikipedia.org/wiki/Office_Open_XML

6.3.1 Continuity properties of Microsoft Excel 2007 (XLSX)

Flexibility
  Interoperability: High. Most platforms can read XLSX files.
  Implementability: Average. Software to programmatically access information in XLSX files is mostly confined to the Microsoft platform, although support in other environments is growing.
Quality
  Lossiness: None.
  Precision: High. Numbers are stored using double-precision floating point, which gives them a precision of about 15 significant decimal digits.
Resilience
  Recoverability: Above average. The zip format it is based on provides several error detection and recovery mechanisms, and the use of XML-based textual content is likewise fairly recoverable in the face of errors.
  Ubiquity: Above average.
XLSX files are widely used, but they are not the dominant spreadsheet format, which is still XLS.
  Stability: Unclear. Although subject to several different standardisation processes, these have not resulted in a single standard, and instead produced several different and incompatible standards, which are not yet supported in software. The future status of the format is unclear: it may be replaced by one of the newer standards, or the standards may be changed to make existing spreadsheets compatible with them. However, since there are a large number of files encoded in the current format, support for it is likely to be found into the near future.

6.4 OpenDocument Spreadsheet (.ods)

The OpenDocument 38 Spreadsheet Format (ODS) consists of various XML-based files (see section 3.3) and other document resources, such as image files, contained in a zip file (see section 4.2). Since the file is compressed, damage to it can prevent it from being opened, so recoverability in the face of corruption may be limited. However, note that there are zip repair tools available which may make it possible to recover a corrupted ODS file.

The overall standardisation status of OpenDocument Spreadsheets is the same for all OpenDocument formats, and is discussed in section 5.6.1. However, note that the formulae used in OpenDocument Spreadsheets have not been standardised in the 1.0 and 1.1 versions of the standard, although they are standardised in the upcoming 1.2 standard. In the meantime, most implementations of OpenDocument Spreadsheet have followed the lead of the Open Office Calc application (from which the OpenDocument standards were originally derived).
A major exception to this rule is the OpenDocument support in Microsoft Office 2007 SP2, which interprets the standard differently, creating potential interoperability problems. 39

6.4.1 Continuity properties of ODS format

Flexibility
  Interoperability: Above average. Most platforms can read OpenDocument Spreadsheet format. However, note that the dominant platform (Microsoft Office) interprets certain aspects of the format differently to other implementations, which can result in non-interoperable spreadsheets.
  Implementability: High. Most programming environments can access information in OpenDocument Spreadsheet format.
Quality
  Lossiness: None.
  Precision: High. Numbers are stored using double-precision floating point, which gives them a precision of about 15 significant decimal digits.
Resilience
  Recoverability: Above average. The zip format it is based on provides several error detection and recovery mechanisms, and the use of XML-based textual content is likewise fairly recoverable in the face of errors.
  Ubiquity: Average. ODS files are fairly widely found, but they are not the dominant spreadsheet format.
  Stability: High. The OpenDocument Spreadsheet format is highly standardised, and backwards compatible with earlier versions. Support for information in these formats is likely to continue into the indefinite future. Note that formulae will not be standardised until the 1.2 family of standards is approved.

38 See http://en.wikipedia.org/wiki/OpenDocument
39 See http://en.wikipedia.org/wiki/OpenDocument_software#Microsoft_Office_2007_SP2_support_controversy

7. Presentations

7.1 Introduction

Presentation formats are somewhat simpler than document formats, as they have one clearly defined purpose, and consist of a defined number of slides, with no wrapping of content between them (and hence no pagination issues).
Presentation formats described here are:

• Microsoft PowerPoint 97-2003 (PPT), see section 7.2
• Microsoft PowerPoint 2007 (PPTX), see section 7.3
• OpenDocument Presentation (ODP), see section 7.4

7.1.1 Complex media risks

Presentations tend to contain complex media resources, including time-based media like audio and video, each of which may pose continuity issues of their own. Unlike images, whose formats are highly standardised, time-based media often use standardised containers whose content is compressed using any of several different ‘codecs’ (compression-decompression schemes). It can be hard to determine which codecs are in use, or whether support for them will be found in future platforms.

7.1.2 Linked resource risks

Resources used in a presentation may not be embedded in the presentation file itself, but may take the form of a link to a file resource on the local computer or on a network shared drive. If the presentation is moved, or the external resources are unavailable, then the presentation will not work properly. You should ensure that any resources required by a presentation are embedded, or that the use of linked resources does not pose any continuity issues for you.

7.1.3 Migration risks

In common with documents and spreadsheets, there are three typical migration use-cases for presentations:

• within a family of file formats (e.g. PowerPoint 95 to PowerPoint 97-2003)
• across format families (e.g. PowerPoint 2007 to OpenDocument Presentation 1.1)
• from a presentation to a page-layout document format (e.g. OpenDocument Presentation 1.1 to PDF 1.7).

Within a family of file formats

Upgrading within a family of file formats generally poses few direct continuity risks, as most file formats are specifically engineered to be backwards-compatible with earlier versions of the ‘same’ format. However, migration is never risk free, and some small changes to presentations may be found (e.g. styles and formatting may change).
By contrast, downgrading to earlier versions may entirely lose formatting, slide transitions, macros or other features depending on what is supported in the earlier versions.

Across format families

Migrating from one broad type of presentation file format to an entirely different one poses the highest direct continuity risks. No two broad families of presentation file format support exactly the same features, in the same ways, so some change and loss to a presentation should be expected. In general, migration between recent versions of most presentation formats will produce presentations which still contain roughly the same content, but the layout can frequently be changed in ways which require a lot of manual intervention to fix. The layout of presentations is quite central to their purpose, so while content may not be lost, automatic migration cannot be relied upon at present if presentations must be usable after migration without manual intervention.

From presentation to page-layout document

A frequent use-case in business workflows is taking a presentation and migrating it to a page-layout document format, either for publication or archiving. This process will generally produce a high-quality output document which preserves the layout and styles of the original. However, all advanced interactive features will be lost, including slide transitions, animations, and any time-based media such as audio and video. Despite this, it is quite common for simple presentations, consisting of text and images, to be rendered as a document for download. Presentation software often also includes a ‘slide-show’ version of the main file format, which will accurately preserve transitions and complex media, but becomes non-editable.

Some page-layout formats may faithfully replicate the look of a presentation, but may incidentally lose other features that are still required.
For example, the PDF format can store text in a way which can be rendered absolutely accurately on screen or paper, but is not electronically searchable. If the ability to copy and paste out of the document is important, attention should be paid to how the text can be further manipulated in the page-layout format. Some page-layout formats make it hard to select and copy text out of them (e.g. columns are not properly wrapped, mixing up text from several columns when it is selected out of the document).

While page-layout formats are very useful for human readability of documents, it is normal that some form of digital access to the content will still be required. Special attention should be paid to the features used in the page-layout format and your business requirements for ongoing use of the information.

7.2 Microsoft PowerPoint 97-2003 (.ppt)

The Microsoft PowerPoint 97-2003 40 (PPT) format is the de facto standard for business presentations in use today. As its name suggests, it first appeared in 1997, and was used as the default presentation format until 2003, after which several new formats appeared. The format has not been formalised through a standards body, but the specification is now made available by Microsoft, and almost every platform has some level of support.

PPT is a binary format, consisting of various document resources embedded in an OLE2 container format (see section 4.5). OLE2 files can be hard to recover in the face of corruption, as they have a complex and fragmented internal structure.

7.2.1 Continuity properties of Microsoft PowerPoint 97-2003

Flexibility
  Interoperability: Very high. Almost all platforms can read and write this format.
  Implementability: Average. Some programming environments can access information in this format, although programmatic control over presentations is a fairly uncommon requirement.
Quality
  Lossiness: None.
Although note that media contained in a presentation can be lossy. Precision Resilience Recoverability No issues. Low. The binary OLE2 format on which it is based is hard to recover in the face of corruption, and there are not many tools to do so. Ubiquity Very high. The format is found almost everywhere. Stability Very high. Although not formally standardised, its status as a de facto standard ensures that support for the format will be found many years into the future. 40 See www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx Page 45 of 83 The National Archives 7.3 A Guide to Formats Version: 1 Microsoft PowerPoint 2007 (.pptx) The Microsoft Office Open XML format (PPTX) 41 consists of various XML-based files (see section 3.3) and other resources, such as image files, contained in a zip file (see section 4.2). Since the file is compressed, damage to the file can result in the user being unable to open the file, so recoverability in the face of corruption may be limited. However, note that there are zip repair tools available which may make it possible to recover a corrupted PPTX file. The standardisation status of the PowerPoint 2007 format is complex, but is the same as for the Word 2007 format, which is fully discussed in section 5.7.1. 7.3.1 Flexibility Continuity properties of Microsoft PowerPoint 2007 (XLSX) Interoperability High. Most platforms can read PPTX files. Implementability Average. Software to programmatically access information in PPTX files is mostly confined to the Microsoft platform, although programmatic control over presentations is a fairly uncommon requirement. Quality Lossiness None. Although note that media formats contained in a presentation may be lossy. Precision Resilience Recoverability No issues. Above average. The zip format it is based on provides several error detection and recovery mechanisms, and the use of xmlbased textual content is likewise fairly recoverable in the face of errors. Ubiquity Above average. 
PPTX files are widely found, but they are not the dominant presentation format, which is still PPT. Stability Unclear. Although subject to several different standardisation processes, these have not resulted in a single standard, and instead produced several different and incompatible standards, which are not yet supported in software. The status of the format is unclear looking to the future – it may be replaced by one of the newer standards, or the standards may be changed to make existing presentations compatible with them. However, as there are a large number of files encoded in the current format, support for it is likely to be found into the near future. 41 See http://en.wikipedia.org/wiki/Office_Open_XML Page 46 of 83 The National Archives 7.2 A Guide to Formats Version: 1 OpenDocument Presentation (.odp) The OpenDocument 42 Presentation Format (ODP) consists of various XML-based files (see section 3.3) and other document resources, such as image files, contained in a zip (see section 4.2) file. Since the file is compressed, damage to the file can result in being unable to open the file, so recoverability in the face of corruption may be limited. However, note that there are zip repair tools available which may make it possible to recover a corrupted ODP file. The overall standardisation status of OpenDocument Presentations is the same for all OpenDocument formats, and is discussed in section 5.6.1. 7.2.1 Flexibility Continuity properties of OpenDocument Presentation Format (ODP) Interoperability High. Most platforms can read OpenDocument Presentation format. Implementability Average. Some programming environments can access information in OpenDocument Presentation format, although note that programmatic control over presentations is a fairly uncommon requirement. Quality Lossiness None. Although note that media formats contained in a presentation may be lossy. Precision Resilience Recoverability No issues. Above average. 
The zip format it is based on provides several error detection and recovery mechanisms, and the use of xmlbased textual content is likewise fairly recoverable in the face of errors. Ubiquity Above average. ODP files are widely found, but they are not the dominant document format. Stability Very high. The OpenDocument Presentation format is highly standardised, and backwards compatible with earlier versions. Support for information in these formats is likely to continue into the indefinite future. 42 See http://en.wikipedia.org/wiki/OpenDocument Page 47 of 83 The National Archives 8. Datasets 8.1 Introduction A Guide to Formats Version: 1 Datasets are collections of structured data. File formats containing datasets include desktop databases, structured text files (see sections 2 and 3), and spreadsheets (see section 6). This section will specifically focus on desktop database formats, and some specific text file formats commonly found to contain structured data. Formats which are described here include: Desktop databases • Microsoft Access MDB see section 8.2 • Microsoft Access 2007 ACCDB see section 8.3 Structured text • Comma Separated Values CSV see section 8.4 • Structured Query Language SQL see section 8.5 • Resource Description Framework RDF see section 8.6 Note that in the interest of balance, this guide was originally going to describe the OpenDocument Database (odb) format. This format is related to the other OpenDocument formats described here (see sections 5.6, 6.4, 7.4), but it has not been standardised as they have been, and remains accessible only by the Open Office suite of applications. Due to the almost complete lack of any accessible documentation on this format, it proved impossible to say anything definitive on it. Hence this format should be regarded as a very high continuity risk. 
8.1.1 General dataset continuity

The continuity of datasets is a complex subject, as a dataset can contain any data, in any structure, with any meaning attached to the data and structure (whose explanation does not usually appear in the dataset itself). For this reason, to understand and manage the continuity of datasets in general, please read the separate guidance Managing Dataset Continuity. [43]

[43] See Managing the Continuity of Datasets nationalarchives.gov.uk/documents/information-management/managing-continuity-of-datasets.pdf

8.1.2 Desktop database risks

A desktop database is a single-user, small-scale database intended to run within a desktop environment, by contrast with enterprise database systems, which run on servers and support multiple concurrent users. Desktop databases typically save data, queries and forms into a file format which can be passed around like a spreadsheet or other data file.

Desktop database file formats are typically hard to access without the specific creating software. They are not very interoperable, implementable or standardised. They use structured text formats as a data interchange format, although these formats usually only capture the data, not the queries, forms or other access mechanisms defined in the desktop database format.

In addition, desktop databases are often independently created by staff to fulfil a temporary need, without the involvement of skilled database designers. Organisations are usually unaware that important data is being managed in these formats, and it is uncommon to find good quality documentation on the data and structures found within them. However, they also often end up being used and expanded beyond their original temporary purpose, and far beyond the point at which they should be formally documented, controlled and migrated into an enterprise database system (or entirely replaced by a properly designed system).
For these reasons, desktop database file formats have very poor continuity properties, on almost all levels. They may not be suitable formats to manage and hold any kind of important business data – but they can be very useful to facilitate analysis of such data, or to enable quick solutions to temporary data management issues. As with all formats, you must carefully evaluate the business need you require the format to meet, to ensure that all your requirements are met and any continuity risks are acceptable.

8.1.3 Structured text risks

Aside from the risks which apply to all text files (see section 2.1), structured text files have no general risks which do not also apply to any other form of dataset format (e.g. the need for documentation on the structure and meaning of the data). Specific risks do exist for particular file formats, which will be discussed in each sub-section. Structured text files are typically very accessible with good interoperability and implementability properties, and thus serve as data interchange mechanisms between many different kinds of technology.

8.2 Microsoft Access (.mdb)

The Microsoft Access (MDB) [44] format is a binary, proprietary desktop database format created by Microsoft, used before 2007. It is not standardised through any standards body, and its specification is not available. The format supports various advanced features beyond storing structured data, including macros, queries and forms to enter and validate data.

Application Programming Interfaces to enable programmatic access to MDB files are available on the Microsoft platform, via Data Access Objects and ActiveX Data Objects, but the data contained within this format is not widely accessible outside of this software on other platforms, unless exported into a structured text format.

[44] See http://en.wikipedia.org/wiki/Microsoft_Access

8.2.1 Continuity properties of Microsoft Access (MDB)

Flexibility
    Interoperability: Very low. Almost no software other than Microsoft Access can read MDB files.
    Implementability: Very low. Almost no support for programmatic access to MDB files exists outside of Microsoft Access itself.
Quality
    Lossiness: None.
    Precision: No issues.
Resilience
    Recoverability: Very low. MDB is a complex binary format with no specific recoverability features. Due to the absence of other tools available to read and process MDB files, once a file is corrupted, the chance of recovery is very low.
    Ubiquity: High. There are many databases defined using MDB.
    Stability: Below average. Although the MDB format has been in use for many years, it is not standardised or documented. It has now been replaced by the ACCDB format in more recent versions of Microsoft Access, which remains capable of reading MDB files for the time being – but ongoing support cannot be guaranteed.

8.3 Microsoft Access 2007 (.accdb)

The Microsoft Access 2007 (ACCDB) [45] format is a binary, proprietary desktop database format created by Microsoft, replacing the earlier MDB format (see section 8.2). It is not standardised through any standards body, and its specification is not available. The format supports various advanced features beyond storing structured data, including macros, queries and forms to enter and validate data.

Application Programming Interfaces to enable programmatic access are available via the Access database engine object library, but the data contained within this format is not widely accessible outside of this software on any platform, unless exported into a structured text format.

[45] See http://en.wikipedia.org/wiki/Microsoft_Access

8.3.1 Continuity properties of Microsoft Access 2007 (ACCDB)

Flexibility
    Interoperability: Very low. Almost no software other than Microsoft Access can read ACCDB files.
    Implementability: Very low. Almost no support for programmatic access to ACCDB files exists outside of Microsoft Access itself.
Quality
    Lossiness: None.
    Precision: No issues.
Resilience
    Recoverability: Very low. ACCDB is a complex binary format with no specific recoverability features. Due to the absence of other tools available to read and process ACCDB files, once a file is corrupted, the chance of recovery is very low.
    Ubiquity: Average. There are some databases defined using ACCDB, although MDB is still more common.
    Stability: Below average. The ACCDB format is relatively new, and is not standardised or documented. Support for it is likely to continue for the foreseeable future, but cannot be guaranteed.

8.4 Comma Separated Values (.csv)

Comma Separated Values (CSV) [46] is an informal family of textual file formats, used to store tabular data. While the format has been in use since at least a decade before the advent of personal computers, it is not standardised in any way, and many variations of it exist. The format is not lossy, and there are no innate precision issues. However, software reading a CSV file may interpret the data in it inconsistently, as there are no standards defining how to process the data represented in CSV files.

[46] See http://en.wikipedia.org/wiki/Comma-separated_values

The basic format consists of columns of text separated by commas, with each row on a single line. However, in some countries commas are used to represent decimal points in numbers, so semi-colons or other punctuation, including tab characters, may be used to separate the columns. Other variations include whether text columns are quoted or not (usually using double quotes), and how quotes in the text itself are represented (sometimes by placing two double quotes next to one another with no intervening text).
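The delimiter and quoting variations described above can be illustrated with a minimal sketch using Python's standard csv module. The sample rows, the variable names and the read_rows helper are hypothetical and exist only for illustration; the point is that the reader has to be told which variant it is dealing with, because the file itself does not say.

```python
import csv
import io

# Two hypothetical files holding the same row in different CSV variants.
# Variant 1: comma-separated, with a decimal point in the number.
comma_variant = 'name,price,note\r\n"Smith, J",1.50,"said ""hello"""\r\n'
# Variant 2: semi-colon-separated, with a decimal comma in the number.
semicolon_variant = 'name;price;note\r\nSmith, J;1,50;"said ""hello"""\r\n'


def read_rows(text, delimiter):
    """Parse CSV text with an explicitly stated delimiter.

    Quoted columns use double quotes, and a quote inside a quoted column
    is written as two double quotes next to one another.
    """
    reader = csv.reader(
        io.StringIO(text), delimiter=delimiter, quotechar='"', doublequote=True
    )
    return list(reader)


rows_comma = read_rows(comma_variant, ",")
rows_semicolon = read_rows(semicolon_variant, ";")

# Reading the semi-colon variant as if it were comma-separated does not
# raise an error: the header row is silently returned as a single column.
rows_wrong = read_rows(semicolon_variant, ",")
```

Note that the price column is returned as text in both cases ('1.50' and '1,50'); deciding that these mean the same number is left to whatever software consumes the rows.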
Sometimes the first line of a CSV file contains ‘header’ names for each column, but there is no reliable way to determine whether the first line contains data or headers without prior knowledge or manual review.

8.4.1 Continuity properties of Comma Separated Values (CSV)

Flexibility
    Interoperability: Very high. Almost all structured data applications which produce tabular data can read or write data in a CSV format.
    Implementability: Very high. All programming environments can produce or consume data in a CSV format.
Quality
    Lossiness: None.
    Precision: No issues. Although note that the data represented in the CSV file may have precision issues, depending on the applications which read and write the files.
Resilience
    Recoverability: High. CSV is a purely textual format, and depending on the text encoding (see section 2.1.1) used to create the file, small changes will remain local and the file will normally remain readable. However, there are no error detection or recovery features.
    Ubiquity: Very high. CSV files are found almost everywhere.
    Stability: Very high. Despite the lack of formal standardisation, the CSV format family has been in use since before the advent of personal computers, and is still very widespread and in active use.

8.5 Structured Query Language (.sql)

Structured Query Language (SQL) [47] is a family of database programming languages, rather than being specifically a file format. However, it is common for data and database structure to be represented using SQL and stored in text files (see section 2), primarily for database creation and data interchange between database management systems (usually created by the same vendor). SQL is not a lossy format, but it does have potential precision issues if SQL is created using one database product and consumed in another. Not all data-types are handled consistently between vendors, particularly numbers and date-times.
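As a minimal sketch of data and structure being represented as SQL text, the example below uses the SQLite engine bundled with Python. The staff table and its contents are hypothetical, and SQLite is only one product among many; a different database product might need the generated text adjusted before it would accept it.

```python
import sqlite3

# Hypothetical source database; table and column names are illustrative only.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, joined TEXT)")
source.execute(
    "INSERT INTO staff (name, joined) VALUES (?, ?)", ("A. Clerk", "2011-03-01")
)
source.commit()

# iterdump() yields the whole database as SQL statements, i.e. plain text
# that could be saved in a .sql file.
sql_text = "\n".join(source.iterdump())

# Replaying the same text rebuilds both the structure and the data.
target = sqlite3.connect(":memory:")
target.executescript(sql_text)
restored = target.execute("SELECT name, joined FROM staff").fetchall()
```

The date is held here as text in a TEXT column. How another product would treat that column when the file is loaded is the kind of data-type question that needs checking when SQL moves between vendors.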
[47] See http://en.wikipedia.org/wiki/SQL

SQL has several standards, the first being an ANSI standard in 1986, which was adopted by ISO (as ISO 9075) in 1987. A major revision was published in 1992 (often referred to as SQL-92), and additions to the standard have been made in 1999, 2003, 2006 and 2008. Earlier standards are forwards compatible with the later standards (meaning they are valid even if processed with software expecting a later standard, but the later standards add new features which cannot be understood if an earlier standard is expected). However, despite the standards, it is common for database vendors to create non-standard extensions to the language, and they do not always process elements of the standard compatibly with one another.

8.5.1 Continuity properties of Structured Query Language (SQL)

Flexibility
    Interoperability: Average. The standards are not interpreted consistently between vendors – but the basic language is highly standardised, and usually easy to change to achieve interoperability.
    Implementability: Very high. All programming environments can create SQL, and there are many libraries of code to process it.
Quality
    Lossiness: None.
    Precision: Some issues. Data-types are not always handled consistently between database vendors, so care must be taken with numbers and date/times if moving data between different databases.
Resilience
    Recoverability: High. SQL is a purely textual format, and depending on the text encoding (see section 2.1.1) used to create the file, small changes will remain local and the file will normally remain readable. However, there are no error detection or recovery features.
    Ubiquity: Very high. SQL files are found almost everywhere.
    Stability: High. Although there are many standards, earlier versions are forwards compatible with later ones. However, note that vendor extensions to the SQL standard cannot be guaranteed to be stable (although in practice, they appear to be fairly stable).
8.6 Resource Description Framework (.rdf)

Resource Description Framework (RDF) [48] is a data model which has several different possible formats. RDF models information using sets of ‘subject-predicate-object’ statements, or more colloquially, ‘something relates-to something-else’. RDF is one of the components of the ‘semantic web’ – the attempt to impute meaning and links to data on the web.

The two principal formats in which RDF statements are represented are RDF-XML (RDF statements written using an XML-based format, see section 3.3), and a simpler textual format called ‘Notation 3’, or ‘N3’. These formats were standardised through the World Wide Web Consortium (W3C) as a Recommendation in 1999, and subsequently updated in 2004. No matter which format is used to represent RDF models, both are textual, non-lossy, and have no precision issues.

From a continuity perspective, both RDF formats score well. However, there is an innate risk (and opportunity) in using RDF, which is that RDF is designed to facilitate linked data. This means that an RDF file can reference data which is found elsewhere on the web. You must ensure that if an RDF file references external data, changes to that data (or its removal entirely) will not adversely impact your use of the RDF data contained in the format.

[48] See http://en.wikipedia.org/wiki/Resource_Description_Framework

8.6.1 Continuity properties of Resource Description Framework (RDF)

Flexibility
    Interoperability: Very high. Software to process RDF can be found on most platforms, and this is increasing as its adoption grows. It is very standardised.
    Implementability: High. Many programming languages on most platforms can process RDF.
Quality
    Lossiness: None.
    Precision: No issues.
Resilience
    Recoverability: High. RDF formats are textual, so inherit the recovery properties of text. However, there are no built-in error detection or recovery mechanisms.
    Ubiquity: Below average. RDF adoption is growing steadily, but is still not in widespread use.
    Stability: Very high. The RDF formats are standardised and have been in use for more than a decade.

9. Emails

9.1 Introduction

As a service, email is ubiquitous, highly interoperable and very successful. However, behind the scenes, email is a mix of standards, conventions and technologies, and emails themselves are usually stored and managed on dedicated servers. Email servers work in a variety of different ways, but usually use database technologies rather than file formats to store emails. Hence, email file formats tend to be used for personal archiving.

The email file formats covered in this guide include:
• EML format (EML): see section 9.2
• Microsoft Message Format (MSG): see section 9.3
• MBOX format (MBX): see section 9.4
• Personal Storage Table (PST): see section 9.5

9.1.1 General risks

Email file formats are not usually standardised, and interoperability can be quite low. Most email client software supports at least one email file format, but it can be difficult to find software which can read several formats, or convert emails between formats.

Another general risk of email file formats is that they are frequently used by individual users to archive organisational mail, often placing them outside of email controls and retention policies set by organisations. This is frequently done to work around quota limits on the size of email inboxes.

9.1.2 File attachment risks

Emails can have file attachments, which are stored within the email file format, along with the email itself. Those files are essentially obscured from most information management software, and may not be searchable; there is therefore a continuity risk to the attached files. This is the same risk which applies to files contained in generic file containers (see section 4.1.1).
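To make the attachment risk concrete, the following minimal sketch uses Python's standard email package to build a hypothetical message with one attachment, serialise it as text, and then recover the attachment. The addresses, subject and file name are invented for illustration, and this shows only one text-based representation of an email, not every email file format.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser

# Build a hypothetical message carrying one small attached file.
msg = EmailMessage()
msg["From"] = "sender@example.org"
msg["To"] = "recipient@example.org"
msg["Subject"] = "Quarterly figures"
msg.set_content("Figures attached.")
attachment_bytes = b"id,total\r\n1,42\r\n"
msg.add_attachment(
    attachment_bytes,
    maintype="application",
    subtype="octet-stream",
    filename="figures.csv",
)

# Serialise the whole message; the attachment is stored in an encoded form,
# so a plain search of the bytes for its content finds nothing.
raw = msg.as_bytes()
content_visible_in_raw = b"id,total" in raw

# Software that understands the email structure can still recover the file.
parsed = BytesParser(policy=policy.default).parsebytes(raw)
attachments = [
    (part.get_filename(), part.get_content()) for part in parsed.iter_attachments()
]
```

A tool that indexes only the raw file would see the attachment's name but not its content, which is the sense in which attached files are obscured.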
9.2 EML (.eml)

The EML format [49] is a plain ASCII text file (see section 2.2) containing a single email, including email metadata (e.g. sender, subject, dates), and the text of the email. File attachments are also included in the same text file, using various encoding schemes which convert binary files into a textual representation, including Base64 [50] and Uuencoding. [51]

The format was semi-standardised in 1982 as RFC-822, although this standard does not cover all the data which may appear in an eml file. It is not lossy, and has no precision issues.

[49] See http://www.ietf.org/rfc/rfc0822.txt

9.2.1 Continuity properties of EML

Flexibility
    Interoperability: Above average. Many email clients can read or write EML files, and it is frequently used as an interchange mechanism.
    Implementability: Low. Support for reading and writing EML files is hard to find in most programming environments.
Quality
    Lossiness: None.
    Precision: No issues.
Resilience
    Recoverability: High. The format is a plain ASCII text file, meaning corruptions, insertions or deletions have a local effect. Being text, it is easy to open and correct errors, although there are no explicit error detection or recovery mechanisms defined in it.
    Ubiquity: Low. The principal use appears to be interchange between email clients. Hence, it is uncommon to find large volumes of emails stored in eml format.
    Stability: High. The format has been in use for decades and is at least partially standardised.

9.3 Microsoft Message (.msg)

The Microsoft Message format is a proprietary binary file format used by the Microsoft Outlook email clients. It is based on the OLE2 Compound Document Format (see section 4.5). It is not standardised, although the specification was made available from Microsoft in 2008. [52] All file attachments are embedded inside the msg file.
A large number of emails are found stored in this format, due to the ubiquity of the Microsoft Outlook application. It is not lossy, and has no precision issues.

[50] See http://en.wikipedia.org/wiki/Base64
[51] See http://en.wikipedia.org/wiki/Uuencoding
[52] See http://msdn.microsoft.com/en-us/library/cc463912.aspx

9.3.1 Continuity properties of Microsoft Message (MSG)

Flexibility
    Interoperability: Average. Few email client applications can read msg files, although various information management tools can, due to the ubiquity of the format.
    Implementability: Low. Some libraries of code to access data in msg format files exist, although they may not be supported and are not found in all programming environments.
Quality
    Lossiness: None.
    Precision: No issues.
Resilience
    Recoverability: Low. The format is based on OLE2, which can be hard to recover in the face of corruption. Few tools other than Microsoft’s own email client exist to read them.
    Ubiquity: Very high. A large number of emails are archived in this format.
    Stability: Above average. Although not standardised, the specification is available and due to the large number of emails found in it, support for this format is likely to be found into the immediate future.

9.4 MBOX (.mbox)

MBOX [53] is a family of related, text-based formats (see section 2) originating in the UNIX operating system, which store entire mailboxes, rather than a single email. Emails are appended one after another into a single text file. MBOX files are not lossy, and have no precision issues.

The structure of MBOX has never been officially standardised, although some documentation can be found, [54] and the format of the emails within an MBOX file is also not standardised, although emails in an EML-like format are common (see section 9.2). There are at least four common variations on the MBOX format, which are incompatible with each other: mboxo, mboxrd, mboxcl and mboxcl2.
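One source of the incompatibility can be sketched as follows. The rules shown are an assumption taken from common descriptions of the variants rather than from this guide: because a line beginning ‘From ’ marks the start of each email in the file, body lines that happen to begin that way must be escaped, and mboxo and mboxrd are usually described as escaping them differently. The function names are hypothetical.

```python
import re


def quote_mboxo(body_lines):
    """mboxo-style (assumed rule): prefix '>' to body lines starting 'From '."""
    return [">" + line if line.startswith("From ") else line for line in body_lines]


def quote_mboxrd(body_lines):
    """mboxrd-style (assumed rule): also prefix '>' to lines that already
    start with one or more '>' followed by 'From ', so quoting is reversible."""
    return [">" + line if re.match(r">*From ", line) else line for line in body_lines]


def unquote_mboxrd(body_lines):
    """Reverse mboxrd-style quoting by removing exactly one leading '>'."""
    return [line[1:] if re.match(r">+From ", line) else line for line in body_lines]


body = ["From here on, see below.", ">From an earlier reply", "Regards"]

stored_mboxo = quote_mboxo(body)
stored_mboxrd = quote_mboxrd(body)
round_trip = unquote_mboxrd(stored_mboxrd)
```

Under these assumed rules, the first two lines of stored_mboxo both begin with '>From ', so a reader cannot tell which one was altered on writing, while the mboxrd form restores the original text exactly. Software expecting one convention will therefore subtly change emails written under the other.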
Even within these broad types, variations can be found in different software implementations. In general, although MBOX format is widespread in a broad sense, each MBOX format is largely tied to the software which produces and reads it, and so, despite being a textual format, must be regarded as having generally poor continuity properties.

[53] See http://en.wikipedia.org/wiki/Mbox
[54] See http://tools.ietf.org/html/rfc4155

9.4.1 Continuity properties of MBOX

Flexibility
    Interoperability: Very low. MBOX files can generally only be processed by the software which creates them.
    Implementability: Very low. There are few libraries of code able to process MBOX files.
Quality
    Lossiness: None.
    Precision: No issues.
Resilience
    Recoverability: Average. The format is based on plain text, meaning corruptions tend to only affect the local area where they occur and they are easy to read and correct manually. However, due to the way that MBOX files are used (holding multiple emails in a single file), corruption can occur fairly easily, and there are no built-in error detection or recovery mechanisms.
    Ubiquity: Average. MBOX is common on UNIX platforms – but the specific variations used by each are incompatible with one another.
    Stability: Low. The family of formats is not standardised, leading to multiple different incompatible implementations. Although MBOX files have been in use for decades, they cannot be considered stable, as they are tightly bound to the particular software which reads them, and could change without warning.

9.5 Personal Storage Table (.pst)

Personal Storage Table (PST) [55] is a proprietary binary file format used to store multiple emails, folders and calendar items using Microsoft Outlook. The format is not standardised, but the specification was recently made available by Microsoft. [56] It is not lossy, and has no precision issues.
Few tools other than Microsoft Outlook can currently read pst files, although there are tools to convert other formats to PST format. Some libraries of code exist to programmatically access data in PST files, although most of these were reverse-engineered before the specification was made available, and so do not necessarily support all features in PST files. There are two major variants of the PST format: 32 bit and 64 bit, of which the older 32 bit variety has the greater level of support.

[55] See http://en.wikipedia.org/wiki/Personal_Storage_Table

9.5.1 Continuity properties of Personal Storage Table (PST)

Flexibility
    Interoperability: Low. Few tools can access data in PST files.
    Implementability: Low. A few libraries of code enable programmatic access to data in PST files.
Quality
    Lossiness: None.
    Precision: No issues.
Resilience
    Recoverability: Low. The format is a dense binary format, with little tool support and no known error detection or recovery features.
    Ubiquity: Very high. Many users archive their inboxes using the PST format.
    Stability: Above average. Although not standardised, the 32 bit variety has been in use for many years, and due to the large amount of information recorded in this format, support is likely for the foreseeable future.

10. Images (raster)

10.1 Introduction

Raster images are images encoded as a rectangular matrix of colour values (‘pixels’), in the same way that a television or computer monitor displays images. Being a matrix, raster images have a natural width and height in pixels, or ‘resolution’. This is in contrast to vector images (see section 11) which store images as a series of instructions to draw lines and shapes in various ways.
Vector images have no natural resolution, as the shapes can always be redrawn at any desired resolution on the display device.

The raster image formats described here include:
• Windows bitmap format (BMP): see section 10.2
• Tagged Image File Format (TIFF): see section 10.3
• Graphics Interchange Format (GIF): see section 10.4
• Portable Network Graphics Format (PNG): see section 10.5
• Joint Photographic Experts Group Format (JPG, JPEG): see section 10.6

10.1.1 Scaling risks

If a raster image is displayed much larger than its natural resolution, then the image becomes ‘blocky’, as pixels are scaled up into squares, rather than remaining as individual dots in the image. For this reason, it is generally preferable to keep raster images in as high a resolution as possible, and produce lower-resolution versions to fill different needs. For example, a high resolution master version may be kept for print and other high resolution needs, and lower-resolution versions produced for delivery on the web. Once resolution is discarded, it is not possible to scale up the image again without producing blockiness (or blurring, if ‘interpolation’ algorithms are used to smooth the differences across blocks).

A risk of scaling a raster image down is that areas of high contrast may be entirely lost in the smaller version, rather than just becoming smaller. For example, text in a large image may become unreadable in the smaller version, even if the image is still large enough to contain readable text. This is because the process of scaling an image down involves discarding pixels, sometimes averaging the colour values across the areas being discarded. This tends to entirely remove areas of high contrast, such as sharp lines, rather than preserving them as the image becomes smaller.

10.1.2 Compression risks

It can take a lot of storage space to store raster images, so most raster image file formats apply compression to the image.
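As a rough worked example of the storage involved, the sketch below estimates the size of the raw pixel data of an uncompressed image. The function name and the image dimensions are hypothetical, and the calculation ignores file headers and any padding a particular format may add.

```python
def uncompressed_size_bytes(width_px, height_px, bits_per_pixel=24):
    """Approximate size of the raw pixel data, ignoring headers and padding."""
    return width_px * height_px * bits_per_pixel // 8


# A hypothetical 4000 x 3000 pixel image at 24 bits (3 bytes) per pixel.
size = uncompressed_size_bytes(4000, 3000)
size_megabytes = size / 1_000_000
```

At 24 bits per pixel this comes to 36,000,000 bytes (36 megabytes) for a single image before any compression is applied.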
Compression algorithms may be lossy (discarding subtle variations in the image to save space), or non-lossy (reproducing the exact pixels fed into the format). If a lossy format is used, then the image should generally not be changed and re-saved again, as each time the image is changed and saved, more information will be discarded, degrading the quality each time until it becomes unusable. In addition, by necessity, lossy compression algorithms make assumptions about what information can be safely discarded. This means that although a lossy algorithm may produce images almost indistinguishable from the original to the human eye, it can also produce noticeable artefacts in the compressed image.

Non-lossy algorithms will never discard information, but different types of image may compress better or worse than others. In extreme cases, a non-lossy compression scheme may actually create a larger file than the uncompressed version. In general, non-lossy compression works well on images that contain blocks or lines of identical colour (e.g. graphic art, or black and white images), but works poorly on images with subtle continuous tones (e.g. photographs). You should ensure that you understand the suitability of any compression scheme for the types of image you intend to compress.

10.1.3 Colour-space risks

There are different ways of representing colour, which may give greater or lower fidelity in different circumstances. For example, the colours which are reproducible when printed are different to those that can be displayed on a television or computer monitor. Many raster image formats use 24-bit colour, which allows up to 256 different values for each of the red, green and blue components of each pixel (the ‘RGB’ model). The human eye cannot distinguish much more than 24-bit colour, so RGB may be assumed to be sufficient for most electronic display purposes.
However, if an image is edited and transformed in some ways, then colour values may be lost during the transformation process. For this reason, some advanced image formats can store greater than 24-bit colour to allow for the possibility of loss during editing.

Bear in mind that an RGB image may look quite different when printed. For print, another colour model is frequently used: cyan, magenta, yellow, black (the ‘CMYK’ model). The colours which can be represented by the two colour models are not equivalent, although reasonable translations can be made between them.

Another form of colour model is ‘indexed colour’, in which the image only contains a limited number of different RGB colours. These colours can be any RGB colour, but generally only up to 256 different colours can appear in the image as a whole. This form of colour model can degrade the appearance of images with subtle continuous tones, such as photographs, but may be suitable for graphic art or other images with few variations in colour in them. Indexed colour innately requires less storage space than full RGB colour images, and will typically compress better using lossless compression algorithms.

In addition to the colours of pixels themselves, some image formats also have what is called an ‘alpha channel’, which records how transparent each pixel in an image is. This allows images to let the background on which they are displayed show through to a greater or lesser extent in different parts of the image. This is common on images intended for display on the web, or for icons on a computer desktop.

10.2 Windows Bitmap (.bmp)

The Windows Bitmap format [57] was first defined in 1985, being the native image format for the Microsoft Windows 1.0 and IBM OS/2 operating systems. There are many variations of this format now in existence, being updated for Windows versions 2.0, 3.0, ’95 and NT, each adding more capability to the format.
OS/2 also introduced incompatible versions of the format, although these are not now very widespread. However, in essence it is still quite a simple image format, and is widely used, even outside of the original platforms it was defined on. It is not standardised, although the specifications (at least, for recent versions) are freely available.

The format can store raster images uncompressed, or with a simple lossless compression scheme (Run Length Encoding [58]), which does not compress most images by much, but can help in reducing storage space. The compression scheme works best for simple images with large blocks of identical colour. It supports indexed colour and RGB colour depths up to 24-bit, and alpha channels using 32 bits per pixel.

10.2.1 Continuity properties of Bitmap (bmp)

Flexibility
• Interoperability: Very high. All platforms can read bmp format.
• Implementability: Very high. All programming environments can process bmp.
Quality
• Lossiness: None.
• Precision: No issues.
Resilience
• Recoverability: Above average. If uncompressed, corruption will usually only change a small part of the image. There are no specific error detection or recovery features.
• Ubiquity: Very high.
• Stability: Very high. Although there were frequent changes in the early versions of the bmp format, the format as it is encountered today has remained unchanged for over a decade, and is likely to be supported for the foreseeable future.

[57] See http://www.fileformat.info/format/bmp/egff.htm and http://en.wikipedia.org/wiki/BMP_file_format
[58] See http://en.wikipedia.org/wiki/Run-length_encoding

10.3 Tagged Image File Format (.tif, .tiff)

The Tagged Image File Format (TIFF) [59] was first formally specified in 1986 by Aldus Corporation, after two earlier draft specifications. Hence, the first specification of TIFF is known as TIFF 3.0.
Three further versions were released in 1987, 1988 and 1992, the latest specification being TIFF 6.0. It has not substantially changed since then, although minor additions have been made. Note that TIFF files do not specify which version of the specification they comply with. Each new version simply added more features to the previous version. TIFF files should be assumed to be version 6.0, as this will always cover all the previous versions.

TIFF is unusual among raster image formats, in that it can hold more than one image at a time (‘multi-page’), reflecting its origin as a file format to contain scanned images. It is still widely used as a digitisation format. It is inherently an extensible format, allowing many different options to be specified with it. This has given rise to compatibility problems, as not all software can process all the options. [60] All software which can process TIFF today must conform to a baseline specification, which alleviates many (but not all) of these issues. It supports many different sorts of compression (including lossy and lossless), colour models (including RGB and CMYK), and many other features too numerous to list here.

The TIFF specification [61] is now owned by Adobe Corporation and is not formally standardised, although the specifications are openly available. Various other TIFF-like formats have been standardised, including TIFF/IT (ISO 12639), TIFF/IT P1 (ISO 12639:1998) and TIFF/IT P2 (ISO 12639:2004), although these are not entirely compatible with TIFF itself.

[59] See http://en.wikipedia.org/wiki/Tagged_Image_File_Format
[60] Giving rise to the joke that TIFF stands for ‘Thousands of Incompatible File Formats’.
[61] See http://partners.adobe.com/public/developer/tiff/index.html
Care must be taken when using TIFF to ensure that the particular features used will be compatible with the environment it is being used in, but it is a highly flexible format suitable for many advanced image tasks.

10.3.1 Continuity properties of Tagged Image File Format (tif, tiff)

Flexibility
• Interoperability: Mixed. Baseline TIFFs are highly interoperable, being supported on almost all platforms. However, the high number of variations possible with the format can limit its interoperability.
• Implementability: Mixed. Baseline TIFFs are easy to implement in most programming environments.
Quality
• Lossiness: Mixed. TIFF supports both lossy and non-lossy compression schemes.
• Precision: No issues.
Resilience
• Recoverability: Mixed. Very simple TIFFs may be possible to recover, although there are no error detection or recovery mechanisms built in.
• Ubiquity: Very high. TIFF files are widely found.
• Stability: High. The specification itself is very stable, being largely unchanged for nearly two decades, but is not formally standardised. However, once again note that while the specification is stable, the format itself is so extensible that the stability of images encoded with it can be questioned.

10.4 Graphics Interchange Format (.gif)

The Graphics Interchange Format [62] (GIF) was first specified in 1987 as GIF 87a. It supports more than one image in a single file. A later specification was made in 1989 (GIF 89a), adding support for animation delays between images. The format is not standardised, although the specifications are freely available.

[62] See http://en.wikipedia.org/wiki/Graphics_Interchange_Format

GIF uses LZW [63] compression, which is a lossless compression algorithm. It provides better compression than the Run Length Encoding found in Windows Bitmaps (see section 10.2).
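To illustrate why Run Length Encoding suits images with large blocks of identical colour and does little for continuous tones, the following is a minimal sketch of the general technique in Python. It is a generic illustration only, not the exact byte layout used by the bmp format, and the function names `rle_encode` and `rle_decode` are hypothetical.

```python
from itertools import groupby


def rle_encode(pixels):
    """Collapse each run of identical values into a (count, value) pair."""
    return [(len(list(run)), value) for value, run in groupby(pixels)]


def rle_decode(pairs):
    """Expand (count, value) pairs back into the original sequence."""
    out = []
    for count, value in pairs:
        out.extend([value] * count)
    return out


# A row of 12 pixels of one colour collapses to a single pair.
flat = [7] * 12
flat_pairs = rle_encode(flat)

# A row where every pixel differs needs one pair per pixel, so the
# encoded form holds twice as many numbers as the original row.
varied = list(range(12))
varied_pairs = rle_encode(varied)
```

In this sketch the flat row shrinks from 12 values to one pair, while the varied row grows from 12 values to 12 pairs, which is the extreme case noted in section 10.1.2 where a non-lossy scheme produces a larger result than the uncompressed data. Decoding either result returns the original row exactly.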
GIF does not support full 24-bit RGB colours; it uses indexed colour (see 10.1.3), allowing up to 256 different colours in the image, [64] and also supports transparency. It is most suitable for simple graphic images with a limited number of colours. GIF images are often found on the web, used for simple animations.

10.4.1 Continuity properties of Graphics Interchange Format (gif)

Flexibility
• Interoperability: Very high. All platforms can read GIF files.
• Implementability: Very high. The format is easy to implement and use in most programming environments.
Quality
• Lossiness: None.
• Precision: No issues.
Resilience
• Recoverability: Average. The format is quite simple, but there are no error detection or recovery features.
• Ubiquity: Very high. GIF files are extremely widespread.
• Stability: Very high. The format is unchanged over two decades, although it is not formally standardised.

[63] See http://en.wikipedia.org/wiki/Lempel-Ziv-Welch
[64] Note that there is a rarely used ‘hack’ which can produce true RGB images without changing the underlying format. See: http://en.wikipedia.org/wiki/Graphics_Interchange_Format#True_color

10.5 Portable Network Graphics (.png)

The Portable Network Graphics (PNG) [65] file format was first specified in 1996 as Version 1.0, and is also standardised as a W3C Recommendation. Two further versions were later defined, version 1.1 in 1998 and version 1.2 in 1999, adding a few additional features. It was standardised in 2003 as ISO 15948:2003 and subsequently as ISO 15948:2004. The standardised versions are marginally different to version 1.2.

It was developed as a competing format to GIF (see section 10.4). At the time of development, the GIF format was enmeshed in patent issues on the underlying compression algorithm used by it, although these patents have now expired. [66] PNG uses patent-free lossless compression algorithms, [67] which generally achieve better compression than GIF for most images. Unlike GIF, it is a single image format, with a separate format not described here, Multiple Image Network Graphics (MNG) [68], being defined for animation purposes.

[65] See http://en.wikipedia.org/wiki/Portable_Network_Graphics
[66] See http://en.wikipedia.org/wiki/Graphics_Interchange_Format#Unisys_and_LZW_patent_enforcement
[67] See http://en.wikipedia.org/wiki/Portable_Network_Graphics#Compression

The PNG format supports indexed colour and RGB true-colour with an alpha channel for pixel transparency. However, although any RGB image can be represented using PNG, the compression works best for graphic images, rather than photographic images, where subtle variations in colour prevent the compression from working well.

10.5.1 Continuity properties of Portable Network Graphics (png)

Flexibility
• Interoperability: High. Almost all recent graphic software can process PNG images, although older web browsers may not be able to.
• Implementability: Very high. Most programming environments can process PNG images.
Quality
• Lossiness: None.
• Precision: No issues.
Resilience
• Recoverability: Average. The format is quite simple, but there are no specific error detection or recovery features.
• Ubiquity: High. PNG images are quite widely found and their adoption is growing.
• Stability: Very high. The format is standardised and has largely remained unchanged for over a decade.

10.6 Joint Photographic Experts Group (.jpg, .jpeg)

The Joint Photographic Experts Group format (JPG) [69] is designed for the compact representation of photographs, or other images with subtle tone variations. These sorts of image typically compress poorly using lossless compression algorithms, so JPG specifies a lossy algorithm, which selectively discards small changes in colour to achieve higher levels of compression.
The JPG standard also includes a lossless mode, but this is not supported by many applications which process JPG files. It has some small precision issues, in that the lossy compression algorithm can produce small rounding errors using numbers with decimal points, which may change the final image in small ways. However, these changes are minimal when compared to the intentional discarding of information performed by the lossy compression itself.

[68] See http://en.wikipedia.org/wiki/Multiple-image_Network_Graphics
[69] See http://en.wikipedia.org/wiki/JPEG

JPG was first issued and standardised in 1992, as ISO 10918-1. However, note that this standard principally covers the method of image compression and decompression (the ‘codec’). The file formats in which JPG compressed images are commonly contained are known as EXIF [70] and JFIF [71]. However, files encoded in these formats still generally use the common JPG or JPEG file extensions.

Due to its compression it is a suitable format for the storage of photographic images which will not change further, or require editing, and for which the loss of subtle data is not critical. Due to the way that the JPG algorithm works, areas of high contrast (e.g. sharp boundaries, or text) in the image can end up with visible ‘artefacts’ in the image surrounding the boundary. Therefore, these sorts of image are not generally suitable for JPG. It is possible to select how much information JPG discards, trading off space against fidelity.

10.6.1 Continuity properties of Joint Photographic Experts Group (jpg, jpeg)

Flexibility
• Interoperability: Very high. Most platforms can read and process JPG images.
• Implementability: Very high. Most programming environments can process JPG images.
Quality
• Lossiness: Usually lossy. Sharp boundaries (e.g. text) may have visible artefacts surrounding them. A lossless mode exists, but is not widely supported.
• Precision: No major issues. Note that the compression algorithm can produce small rounding errors in its calculations, even when the compression is minimal.
Resilience
• Recoverability: Below average. The formats are reasonably complex, and there are no specific error detection or recovery features.
• Ubiquity: Very high. JPG images are found almost everywhere.
• Stability: Very high. JPG images are highly standardised and have been in use for nearly two decades. Good support is likely into the foreseeable future.

[70] See http://en.wikipedia.org/wiki/Exif
[71] See http://en.wikipedia.org/wiki/JFIF

11. Images (vector)

11.1 Introduction

Vector images are formats which store images as a series of instructions to draw lines and shapes in various ways. This is in contrast to raster images (see section 10) which store images as a rectangular matrix of colour values (‘pixels’), in the same way that a television or computer monitor displays images.

The vector image formats described here include:
• Encapsulated Postscript (EPS), see section 11.2
• Windows Metafile Format (WMF, EMF), see section 11.3
• Scalable Vector Graphics (SVG), see section 11.4

11.1.1 Continuity risks

Vector images have no natural dimensions of width or height (‘resolution’), as the shapes can always be redrawn at any desired resolution on the display device. This means there are no scaling risks with vector images.

It is not possible to use lossy compression on a vector image, as you cannot easily determine which of the drawing instructions can be safely removed or simplified. However, they can be compressed fairly well using standard lossless compression, for example zip (see section 4.2). In addition, vector image files are typically much smaller than raster image files in the first place, as they only store descriptions of how to reproduce a graphic image, rather than the image itself.
The principal continuity risk of vector images is that interoperability is not as high as for raster images. In particular, there is no common definition of what features can be specified to draw, hence most formats will support different methods of drawing colours, or entirely different shapes. Hence, migration of vector file formats must be undertaken with special care.

Vector formats are not as widely used as raster formats, although support for them is growing on many platforms, as they provide compact, resolution-independent representations of graphic images, such as logos and icons. Freedom from scaling issues and small file sizes are useful properties when content must be easily viewable on, or repurposed for, a variety of networked devices with widely different screen sizes, such as mobile smartphones, tablets and full size desktop screens.

11.2 Encapsulated Postscript (.eps)

An Encapsulated Postscript (EPS) [72] file is a text-based Postscript file (see section 5.2) which conforms to a specification called Document Structuring Conventions [73] (DSC). It is intended as a way to use Postscript to describe drawings which can be embedded in other documents. It was first specified in 1992, but is not formally standardised. It is not a lossy format, but it has a precision issue, in that numbers are only represented to an accuracy of nine decimal digits, which can produce rounding errors.

11.2.1 Continuity properties of Encapsulated Postscript (EPS)

Flexibility
• Interoperability: Average. EPS format is usable as a drawing format for some, but by no means all, vector graphics software and to embed into other documents.
• Implementability: Average. EPS format support is found in some programming environments, but by no means all.
Quality
• Lossiness: None.
• Precision: Some issues. As for Postscript, numbers are only represented to a precision of nine decimal digits, potentially creating rounding errors if calculations are performed using the Postscript programming language.
Resilience
• Recoverability: Average. Being a textual format, small corruptions to EPS files will often not prevent the file being opened, but no specific error detection or recovery mechanisms are part of the format.
• Ubiquity: Above average. EPS files are widespread, and are still in active use, but it is not the vector format of choice for many applications.
• Stability: High. EPS files are largely unchanged since they were first specified, and support for the format is likely to be found into the foreseeable future.

[72] See http://en.wikipedia.org/wiki/Encapsulated_PostScript
[73] See http://en.wikipedia.org/wiki/Document_Structuring_Conventions

11.3 Windows Metafile Format (.wmf)

The Windows Metafile Format (WMF) [74] is a binary 16-bit vector image file format defined in the 1990s, which consists of commands to the Windows Graphics Device Interface (GDI). As such, it is very highly coupled with Microsoft Windows, although reverse-engineered support for it on other platforms can be found. It can also optionally include bitmap (raster image) components in addition to the vector images.

The format defines no compression (so is not lossy), and has minor precision issues in that the format is innately 16-bit. This may limit the theoretical accuracy of very large drawings specified with it, but in practice, this should not be a concern. It is not standardised in any way, although the specifications of the formats were released in 2006. [75]

11.3.1 Continuity properties of Windows Metafile Format (WMF)

Flexibility
• Interoperability: Low. It is tightly coupled to the Microsoft Windows platform. Some reverse engineered implementations can be found.
• Implementability: Low. Support for the format is mostly limited to Microsoft Windows programming environments.
Quality
• Lossiness: None.
• Precision: Small issues. The format is 16-bit only, which may limit the size or accuracy of very large drawings.
Resilience
• Recoverability: Below average. It is a dense binary format, there are no specific error detection or recovery features, and few tools exist to read or repair it.
• Ubiquity: Low. Although it is a format used internally by Microsoft Windows and as a drawing format for Microsoft Office, files encoded in WMF are not particularly widespread.
• Stability: Average. Since the format is essentially a representation of the underlying Windows Graphics Device Interface, it has been quite stable. However, it is not standardised, and support for the format cannot be guaranteed for much beyond the immediate future, as Windows itself changes.

[74] See http://en.wikipedia.org/wiki/Windows_Metafile
[75] See http://msdn.microsoft.com/en-us/library/cc215212.aspx

11.4 Scalable Vector Graphics (.svg)

The Scalable Vector Graphics file format [76] (SVG) is a textual format based on XML (see section 3.3). It was first defined in 1999 by the World Wide Web Consortium (W3C) and there have been several versions defined since then. SVG 1.0 became a W3C recommendation in 2001, 1.1 in 2003 and 1.2 Tiny in 2008. SVG 1.2 Full has been a working draft for many years, but is likely to be replaced by SVG 2.0. Support for SVG is increasingly common, particularly on the web; however, Microsoft Internet Explorer has only supported it from version 9.

It supports both static and interactive vector graphics, with a built in scripting language (ECMAScript [77]). Note that advanced scripted features will probably not survive migration into another format. Raster images can also be embedded in an SVG file, and it also includes some basic page layout features. It is a non-lossy format, and has no precision issues.
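Because SVG is XML text, a drawing can be inspected with any general-purpose XML parser, and the same drawing instructions serve every display size. The following is a minimal sketch in Python using the standard library XML parser. The one-circle drawing and the helper name `at_size` are illustrative assumptions, not part of any SVG tool.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# A complete (if tiny) SVG image: one instruction to draw a red circle.
svg_text = (
    f'<svg xmlns="{SVG_NS}" viewBox="0 0 100 100">'
    '<circle cx="50" cy="50" r="40" fill="red"/>'
    '</svg>'
)

# An ordinary XML parser can read the drawing instructions.
root = ET.fromstring(svg_text)
shapes = [element.tag.replace("{" + SVG_NS + "}", "") for element in root]


def at_size(svg_root, width, height):
    """Return a copy of the image requesting a different display size.

    Only the width and height attributes change; the drawing
    instructions themselves are left untouched.
    """
    copy = ET.fromstring(ET.tostring(svg_root))
    copy.set("width", str(width))
    copy.set("height", str(height))
    return copy


icon = at_size(root, 16, 16)
poster = at_size(root, 2000, 2000)
```

Both the icon and the poster carry the identical circle instruction, so neither is a degraded version of the other. This is the resolution independence described in section 11.1.1.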
11.4.1 Continuity properties of Scalable Vector Graphics (SVG)

Flexibility
• Interoperability: High. Most vector applications and browsers can access SVG format.
• Implementability: High. SVG support is found in many programming environments.
Quality
• Lossiness: None.
• Precision: No issues.
Resilience
• Recoverability: High. Being based on a textual XML format, it is quite easy to repair damaged SVG files, although there is no specific error detection or recovery built in.
• Ubiquity: High. SVG files are very widespread, particularly on the web.
• Stability: High. The format is standardised through the W3C. Although new versions appear reasonably regularly, support for the format in all versions is likely to continue into the foreseeable future.

[76] See http://en.wikipedia.org/wiki/Scalable_Vector_Graphics
[77] See http://en.wikipedia.org/wiki/ECMAScript

12. Audio

12.1 Introduction

Audio formats are quite diverse, being engineered to support different qualities, file sizes and business uses. Consumer grade formats typically focus on small file sizes, support stereo channel audio, and have relatively low quality (around CD-quality). Professional grade formats may support higher qualities to give some head-room when editing and a greater number of channels.

Audio formats described here include:
• Waveform Audio File Format (WAV), see section 12.2
• Windows Media Audio (WMA), see section 12.3
• MPEG Layer 3 Audio (MP3), see section 12.4
• Advanced Audio Coding (AAC), see section 12.5

12.1.1 Sampling risks

In order to reproduce audio, computers must capture the sound level at particular intervals of time, and convert it to a number. The higher the number of samples taken per second, the more faithfully the sound can be reproduced.
Since human ears can hear frequencies up to around 22,000 Hz, a sampling rate of double this (around 44,000 samples per second) is generally good enough to reproduce most frequencies a human ear can distinguish. Capturing more samples gives more flexibility to edit the sound without noticeably degrading the quality. However, when processing audio, if the sampling rate of the sound is adjusted, this can produce audible artefacts in the sound. In general, you should capture and store audio in as high a sample rate as possible.

12.1.2 Codec risks

A ‘codec’ refers to the algorithm used to compress and decompress the audio data. Some codecs are ‘lossy’, in that they intentionally discard data to reduce the file size. Others are lossless, reproducing the exact sound data fed into them, although these typically do not compress as much as lossy codecs.

A particular risk of codecs is knowing which codec is actually being used. Many audio file formats allow many different codecs to be used within them, and this is not evident from the file extension, which simply tells you which audio file container format is being used, not the codec. Although it is possible for dedicated audio software to determine the codec in use (otherwise it could not play back the audio), it is harder for information managers to acquire this information, which may create a risk of unusual or older codecs remaining in use in older audio files.

12.1.3 Digital rights management risks

Some audio file formats use ‘Digital Rights Management’ (DRM) to protect the content from copyright infringement, or to otherwise control the use of the content. By necessity, DRM encrypts the content of the audio file format, preventing its use without a key to unlock the content. Because of this, all audio files with DRM carry a very high continuity risk.
In order to facilitate legitimate playback of content, the software must have the decryption key available to it. Unless the DRM scheme requires online negotiation, all off-line use (which includes most audio players) must include the decryption key in the software client. It is often possible to reverse engineer the decryption key; however, there are serious legal issues with using such tools to unlock content protected by DRM schemes unless you are the legitimate copyright owner. [78]

12.2 Waveform Audio File Format (.wav)

Waveform Audio File Format (WAV) [79] is a simple audio file format used by the Microsoft Windows and IBM OS/2 operating systems. However, support for the format is widespread on other platforms. It is not formally standardised, but the specifications are available. [80] It can store two channels of audio at up to 44,100 samples per second, using 16 bits per sample, so sound quality is reasonably good, but quality may suffer if edits which transform the audio are applied. There are no digital rights management issues with the wav format.

It is an innately non-lossy format, but does support compression using a variety of codecs supplied by the Windows Audio Compression Manager. [81] Like many media formats, the ability to use a variety of codecs within the format means that you can experience continuity issues if an unusual codec is selected, as not all systems may support all codecs, and it is not directly evident from the file which codec is being used. However, note that the wav format is most frequently used uncompressed, avoiding such issues, although making the file size of wav files quite large.
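The size of an uncompressed wav file follows directly from the figures above: samples per second, multiplied by channels, multiplied by bytes per sample. The following is a minimal sketch in Python, using the standard library `wave` module to write one second of silence to memory and read its parameters back. The helper name `pcm_bytes_per_second` is hypothetical, and the `wave` module handles uncompressed audio only, so this does not demonstrate other codecs.

```python
import io
import wave


def pcm_bytes_per_second(sample_rate, channels, bits_per_sample):
    """Storage needed for one second of uncompressed audio."""
    return sample_rate * channels * (bits_per_sample // 8)


# Two channels, 44,100 samples per second, 16 bits per sample.
rate = pcm_bytes_per_second(44_100, 2, 16)   # 176,400 bytes per second
one_minute = rate * 60                       # 10,584,000 bytes per minute

# Write one second of silence as a wav file held in memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(2)
    writer.setsampwidth(2)          # 2 bytes = 16 bits per sample
    writer.setframerate(44_100)
    writer.writeframes(b"\x00" * rate)

# Read the header back, as an information manager might when checking
# what a wav file actually contains.
buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    params = reader.getparams()
    duration = reader.getnframes() / reader.getframerate()
```

At roughly 10 MB per minute of stereo audio, an hour of uncompressed recording needs over 600 MB. The `comptype` field of `params` reports `"NONE"` for an uncompressed file, which is one simple check against the codec risk described in section 12.1.2.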
[78] See http://en.wikipedia.org/wiki/Software_cracking
[79] See http://en.wikipedia.org/wiki/WAV
[80] See http://msdn.microsoft.com/en-us/windows/hardware/gg463006.aspx
[81] See http://en.wikipedia.org/wiki/Audio_Compression_Manager#Audio_Compression_Manager

12.2.1 Continuity properties of Waveform Audio File Format (WAV)

Flexibility
• Interoperability: Very high.
• Implementability: Very high.
Quality
• Lossiness: Usually none, unless an unusual codec is used. The files are usually entirely uncompressed.
• Precision: Minor issues. Only uses 16 bits per sample up to 44,100 samples per second.
Resilience
• Recoverability: Average. It is a binary format, which although usually uncompressed has no specific error detection or recovery features.
• Ubiquity: Very high. WAV files are very widespread.
• Stability: High. Although not formally standardised, WAV files from many years ago are still accessible, and support for the format is likely to be found into the foreseeable future.

12.3 Windows Media Audio (.wma)

The Windows Media Audio (WMA) [82] file format is something of a misnomer, in that there are at least four incompatible formats defined using the same name. In fact, WMA refers to a family of four audio codecs defined by Microsoft, which are contained in an Advanced Systems Format [83] media container file, whose specification is available. [84] The four codecs defined are:
• Windows Media Audio: The most common codec, released in 1999. It uses lossy compression, encoding two channels (stereo) at up to 48,000 samples per second.
• Windows Media Audio Pro: Uses a better (but still lossy) compression algorithm, supporting up to 96,000 samples per second and up to eight discrete channels of sound.
• WMA Lossless: A lossless audio codec, designed for archival purposes, supporting up to 96,000 samples per second with six discrete channels of sound.
• WMA Voice: A lossy codec designed for low-bandwidth voice communication, supporting up to 22,000 samples a second for a single channel of sound.

[82] See http://en.wikipedia.org/wiki/Windows_Media_Audio
[83] See http://en.wikipedia.org/wiki/Advanced_Systems_Format
[84] See http://download.microsoft.com/download/7/9/0/790fecaa-f64a-4a5e-a430-0bccdab3f1b4/ASF_Specification.doc

However, most WMA audio files encountered use the first codec, which shares the family name: Windows Media Audio. The others were defined later in 2003, and may be encountered in specialised scenarios, but their use is not particularly common.

The format optionally supports various forms of digital rights management, which can restrict playback of content except on authorised devices, or only allow playback for a limited time. Hence, care must be taken with audio in any WMA format to ensure it is not protected by DRM schemes if the audio must be reliably accessed into the future.

12.3.1 Continuity properties of Windows Media Audio (WMA)

Flexibility
• Interoperability: High. Many platforms can process wma files, assuming digital rights management is not used.
• Implementability: Low. Most programming environments do not include support for WMA files.
Quality
• Lossiness: Mostly lossy. Normally lossy, unless the WMA Lossless codec is used.
• Precision: Minor issues. Quality can vary according to the codec used.
Resilience
• Recoverability: Low. The format is complex and few tools exist to process it.
• Ubiquity: High. WMA files are widespread.
• Stability: Below average. The ASF format specification is available, but is not standardised. The WMA codec specifications are harder to acquire, and are also not standardised.

12.4 MPEG Layer 3 Audio (.mp3)

The MPEG Layer 3 Audio (MP3) [85] file format is the de facto standard for consumer-grade digital music playback, being supported in almost all playback devices.
It was defined by the Moving Picture Experts Group (MPEG) as part of the original MPEG-1 standard, and updated in the MPEG-2 standard. It was standardised as ISO 11172-3:1993 in 1993, and later as ISO 13818-3:1995 with some additions. It can support two channels of audio at up to 48,000 samples per second in MPEG-1 mode, and up to 6 channels (5.1 audio) in MPEG-2 mode.

85 See http://en.wikipedia.org/wiki/MP3

MP3 uses a lossy compression algorithm to achieve small file sizes. It discards parts of the audio signal that human ears cannot easily distinguish, particularly when a lower tone obscures the perception of a higher one. The amount of loss is configurable by setting the ‘bit-rate’ of the format: a higher bit-rate gives better quality output. Many MP3 files are encoded at a bit-rate of 128 kilobits per second (kbit/s), but 192 kbit/s or higher is not uncommon. In general for continuity purposes, unless space is a prime consideration, a higher bit-rate should be preferred.

Since the codec of an MP3 file is part of the format, there are no additional codec risks with MP3, other than that the MP3 algorithm itself is the subject of patents, which may require licence fees to be paid if it is implemented in software.

MP3 files do not have any digital rights management issues, allowing the unrestricted playback or modification of content. Note, however, that since MP3 uses lossy compression, it should not be used if the audio needs to be edited: each time it is resaved after a change, more information will be discarded.

12.4.1 Continuity properties of MPEG Layer 3 (MP3)

Flexibility
  Interoperability: Very high. Almost all platforms can process MP3 files.
  Implementability: Very high. Almost all programming environments can process MP3 files.
Quality
  Lossiness: Lossy. The MP3 format discards parts of the audio signal which human ears cannot normally distinguish. The amount of loss is configurable.
  Precision: No issues.
Resilience
  Recoverability: Above average. Many tools can access and process audio data in MP3 format. It is possible to recover audio in the face of local corruption.
  Ubiquity: Very high. It is the de facto format for consumer music playback.
  Stability: Very high. The format is standardised and has been in use for nearly two decades.

12.5 Advanced Audio Coding (.aac)

The Advanced Audio Coding (AAC) 86 file format is a lossy audio file format designed to achieve better quality than MP3 for similar file sizes, and includes a number of advanced features. It can support up to 48 channels of audio, each with up to 96,000 samples per second. It includes support for error detection and correction within the encoding.

Note that AAC is a method of encoding audio, but AAC-encoded audio must be held in one of various standardised audio ‘container’ formats, including MP4 87, 3GP 88 and other ISO-based media formats. 89

It was first standardised as part of the MPEG-2 specification in 1997, as ISO 13818-7:1997. It was subsequently updated in 1999 as part of the MPEG-4 specification, as ISO 14496-3:1999. Further additions have been made in 2000 (ISO 14496-3:1999/Amd 1:2000), 2003 (ISO 14496-3:2001/Amd 1:2003), 2004 (ISO 14496-3:2001/Amd 2:2004), 2005 (ISO 14496-3:2005/Amd 2:2006), with the latest being in 2009 (ISO 14496-3:2009).

It is the default audio encoding for the Apple range of consumer hardware and software, including iPhone, iPad and iTunes.

While AAC files do not themselves have any digital rights management (DRM) built into the specification, it is possible to add DRM to the format. For example, some AAC files in iTunes are protected by a DRM scheme called FairPlay. 90 Care must be taken with AAC files to ensure that you can access content which you own for as long as you need to, and that any DRM restrictions will not prevent the access you need to your content.

86 See http://en.wikipedia.org/wiki/Advanced_Audio_Coding

12.5.1 Continuity properties of Advanced Audio Coding (AAC)

Flexibility
  Interoperability: High. Support for AAC encoded files can be found on many platforms.
  Implementability: Average. Support for AAC encoded files can be found in some programming environments.
Quality
  Lossiness: Lossy.
  Precision: No issues.
Resilience
  Recoverability: High. The encoding has explicit support for error detection and correction, which can be applied flexibly within a file.
  Ubiquity: High. It is the default encoding for Apple’s consumer products.
  Stability: Above average. While it is standardised, there have been many revisions to the standard. It is unclear whether there will be many more. However, support for existing AAC encoded files should be found into the immediate future.

87 See http://en.wikipedia.org/wiki/MP4
88 See http://en.wikipedia.org/wiki/3GP
89 See http://en.wikipedia.org/wiki/ISO_base_media_file_format
90 See http://en.wikipedia.org/wiki/FairPlay_%28DRM%29

13. Video

13.1 Introduction

There are many video formats in existence, designed to support differing qualities of video and audio. Video takes up an extremely large amount of space, so without exception all the formats described here use lossy compression to reduce the data to manageable (if still large) volumes. This makes them unsuitable for work which involves repeated changes to the video picture, as each time they are changed and saved, more quality is lost. Most video formats also include audio with them, which may share common codecs (compression-decompression algorithms) with audio-only formats (see section 12).
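To give a sense of scale for the statement that video takes up an extremely large amount of space, the following sketch estimates the size of uncompressed video. The frame size, frame rate and colour depth used are illustrative assumptions (standard-definition, PAL-style values) and do not come from this guidance; the function name is hypothetical.

```python
def uncompressed_video_bytes(width, height, frames_per_second, seconds,
                             bytes_per_pixel=3):
    """Return the size in bytes of uncompressed video.

    bytes_per_pixel=3 assumes 24-bit colour (an assumption for illustration).
    Audio is not included.
    """
    bytes_per_frame = width * height * bytes_per_pixel
    return bytes_per_frame * frames_per_second * seconds


# Assumed example: one hour of 720 x 576 video at 25 frames per second.
one_hour = uncompressed_video_bytes(720, 576, 25, 3600)
print(f"{one_hour / 1e9:.1f} GB per hour")
```

Under these assumptions, one hour of uncompressed video comes to roughly 112 GB, which suggests why the formats in this section rely on lossy compression.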
Video formats described here include:

• Moving Pictures Expert Group (MPG, MPEG): see section 13.2
• Windows Media Video (WMV): see section 13.3
• Audio Video Interleave (AVI): see section 13.4
• Flash Video (FLV): see section 13.5

13.1.1 Scaling risks

Video, like raster images (see section 10.1.1), has natural dimensions of width and height in pixels. If a video is scaled up to a higher resolution, or downscaled to a lower resolution, then the video can appear blurred, areas of high contrast (such as sharp lines) can be lost, or flickering can occur as different frames of the video discard slightly different parts of the image. In general, video should be kept at as high a resolution as possible, with lower quality versions being produced to fill particular needs (e.g. delivery on the web).

13.1.2 Codec risks

A ‘codec’ refers to an algorithm used to compress and decompress the video or audio data. Most video codecs are ‘lossy’, in that they intentionally discard data to reduce the file size.

A particular risk of codecs is knowing which codec is actually being used. Many video file formats allow many different codecs to be used within them, and this is not evident from the file extension, which simply tells you which video file container format is being used, not the codec. Although it is possible for dedicated video software to determine the codec in use (otherwise it could not play back the video), it is harder for information managers to acquire this information, which may create a risk of unusual or older codecs remaining in use in older video files.

13.1.3 Digital rights management risks

Some video file formats use ‘Digital Rights Management’ (DRM) to protect the content from copyright infringement, or to otherwise control the use of the content. By necessity, DRM encrypts the content of the video file, preventing its use without a key to unlock the content.
Because of this, all video files with DRM carry a very high continuity risk. In order to facilitate legitimate playback of content, the software must have the decryption key available to it. Unless the DRM scheme requires on-line negotiation, all off-line use (which includes most video players) must include the decryption key in the software client. It is often possible to reverse engineer the decryption key; however, there are serious legal issues with using such tools to unlock content protected by DRM schemes unless you are the legitimate copyright owner. 91

13.2 Moving Pictures Expert Group (.mpg, .mpeg)

The Moving Picture Experts Group (MPEG) defined two major video and audio standards with corresponding file formats: MPEG-1 92 and MPEG-2 93. Both can use the .mpg file extension. The MP3 audio format (see section 12.4) is also part of the MPEG-1 standard.

After a lengthy development, MPEG-1 was finally approved in 1992 and standardised as ISO 11172 in 1993, with subsequent additions to the same standard being made in 1995 and 1998. It is intended to encode VHS-tape quality video, and is still in widespread use.

MPEG-2 was in development before MPEG-1 was standardised, and provides higher quality (it is the encoding used in DVD videos). MPEG-2 was standardised as ISO 13818 in 1996, with many subsequent additions being made. MPEG-1 videos are a valid subset of MPEG-2 videos, so software or devices capable of decoding MPEG-2 videos can automatically decode MPEG-1.

MPEG video uses a lossy codec, and has no built-in digital rights management.

13.2.1 Continuity properties of Moving Pictures Expert Group (MPG)

Flexibility
  Interoperability: Very high. Almost all platforms can access content in MPEG format.
  Implementability: High. Many programming environments have support for the MPG format, although note that MPEG-2 is subject to patent restrictions.
Quality
  Lossiness: Lossy.
  Precision: No issues.
Resilience
  Recoverability: Above average. The format is complex, but is designed to work when streaming across networks. Corruption generally affects only a few frames of the video.
  Ubiquity: Very high. MPEG videos are extremely widespread.
  Stability: Very high. The formats have been in use for nearly two decades and are highly standardised.

91 See http://en.wikipedia.org/wiki/Software_cracking
92 See http://en.wikipedia.org/wiki/MPEG-1
93 See http://en.wikipedia.org/wiki/MPEG-2

13.3 Windows Media Video (.wmv)

The Windows Media Video (WMV) 94 file format is something of a misnomer, in that Windows Media Video refers to a family of codecs, rather than a file format. The codecs are contained in an Advanced Systems Format 95 media container file, whose specification is available. 96 The three codecs defined are:

• Windows Media Video: the most common codec, released in 1999.
• Windows Media Video Stream: designed for the capture of live screen content.
• Windows Media Video Image: a video slideshow codec.

However, most WMV video files encountered use the first codec, which shares its name with the family: Windows Media Video. The others may be encountered in specialised scenarios, but their use is not particularly common.

The Windows Media Video codec was first specified in 1999, as WMV-7. It was subsequently updated to WMV-9, and standardised through the Society of Motion Picture and Television Engineers (SMPTE) as VC-1. This format is used in both Blu-Ray and HD-DVD discs.

The format optionally supports various forms of digital rights management, which can restrict playback of content to authorised devices, or only allow playback for a limited time.
Hence, care must be taken with video in any WMV format to ensure it is not protected by DRM schemes if the video must be reliably accessed into the future.

94 See http://en.wikipedia.org/wiki/Windows_Media_Video
95 See http://en.wikipedia.org/wiki/Advanced_Systems_Format
96 See http://download.microsoft.com/download/7/9/0/790fecaa-f64a-4a5e-a430-0bccdab3f1b4/ASF_Specification.doc

13.3.1 Continuity properties of Windows Media Video (WMV)

Flexibility
  Interoperability: Very high. Most platforms can process content in the most common codec.
  Implementability: Average. Some programming environments have support for WMV format files.
Quality
  Lossiness: Lossy.
  Precision: No issues.
Resilience
  Recoverability: Above average. The format is complex, but is designed to work when streaming across networks. Corruption generally affects only a few frames of the video.
  Ubiquity: Very high. The format is widespread on the internet and used as a delivery format for consumer discs like Blu-Ray and HD-DVD.
  Stability: High. WMV-9 is standardised as VC-1, but the other variations are not.

13.4 Audio Video Interleave (.avi)

The Audio Video Interleave (AVI) format 97 was first introduced in 1992 as a proprietary video and audio container format by Microsoft. In theory, it can contain video and audio encoded using any codec, but more recent developments in advanced codecs are hard to encapsulate in it. Hence, AVI files tend to contain video and audio using older codecs. This can present a continuity risk, as the codecs used by an AVI file are hard to determine without specialised software. These codecs may themselves be at risk of obsolescence.

The format has no digital rights management issues.

13.4.1 Continuity properties of Audio Video Interleave (AVI)

Flexibility
  Interoperability: High. Most platforms can process AVI files. However, note that the availability of the codecs used in an AVI file is the true measure of interoperability. Some codecs may not be available on all platforms.
  Implementability: High. The AVI format is widespread and support for it exists in many programming environments.
Quality
  Lossiness: Mixed. AVI files can use any codec (in theory both lossy and lossless codecs). In practice, most codecs will be lossy.
  Precision: No issues.
Resilience
  Recoverability: Unknown. It will largely depend on the choice of codec, in which most of the information in an AVI file is encoded.
  Ubiquity: High. AVI files are widespread, although gradually being replaced by more modern video formats.
  Stability: Below average. While the AVI format itself (as a container of other data) has not changed, it is not standardised, and there are several incompatible implementations of various features in existence. Support for all variations and codecs used in the format cannot be guaranteed into the future.

97 See http://en.wikipedia.org/wiki/Audio_Video_Interleave

13.5 Flash Video (.flv)

The Flash Video (FLV) format 98 is widespread as a delivery mechanism for video on the world wide web. It is a proprietary video container format, created by Adobe Corporation, which allows the use of various codecs to compress and decompress video and audio data contained within it. In practice, the codecs usually used are the Sorenson Spark 99 or VP6 100 video compression formats, and more recently H.264 video (although this codec is covered by patents). Audio in Flash videos is usually encoded as MP3 (see section 12.4).

The FLV file format was first specified in 2003 (previously, the same video could be embedded in the Shockwave Flash format, but could not stand alone as an FLV file). The format was updated in 2007 to a new container format based on and extending the ISO base media file format. 101 This is effectively a different file format, but it shares the FLV extension with the earlier format.
Software to decode FLV files must look inside the files to determine which of the two formats is actually in use.

Competition currently exists to define a new video standard for internet videos, with various formats being proposed. There is an ongoing debate on whether internet video should use non-proprietary, open standards video formats which do not require licence payments to use. 102

98 See http://en.wikipedia.org/wiki/Flash_Video
99 See http://en.wikipedia.org/wiki/Sorenson_Spark
100 See http://en.wikipedia.org/wiki/VP6
101 See http://en.wikipedia.org/wiki/ISO_base_media_file_format
102 See http://en.wikipedia.org/wiki/HTML5_video#Default_video_format_debate

13.5.1 Continuity properties of Flash Video (FLV)

Flexibility
  Interoperability: High. Almost all platforms (with the notable exception of the iPad) can process Flash video.
  Implementability: High. Many tools and programming environments can process Flash video.
Quality
  Lossiness: Lossy.
  Precision: No issues.
Resilience
  Recoverability: High. The format is designed to support delivery over the internet, so corruption will generally only affect a few video frames.
  Ubiquity: Very high. The format is the de facto standard for delivery of video over the internet.
  Stability: Below average. The more recent formats are based on a standardised container, but the extensions are not standardised. The earlier format is still in use, but Adobe recommend moving away from it. Support for these formats cannot be guaranteed except in the immediate future, particularly if a competing format becomes the new de facto standard for internet video.
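
Section 13.5 notes that two different container formats share the .flv extension, so software must look inside a file to tell them apart. The following is a minimal sketch of such a check, not a definitive implementation. It assumes that the original FLV container begins with the ASCII signature "FLV", and that a file based on the ISO base media file format normally begins with a box whose four-character type, at bytes 4 to 7, is "ftyp". The function names and return labels are hypothetical, chosen for illustration only.

```python
def sniff_flv_variant(header: bytes) -> str:
    """Classify the leading bytes of a file that carries the .flv extension."""
    # Original FLV container: the file starts with the signature "FLV".
    if header[:3] == b"FLV":
        return "flv"
    # ISO base media file format: a 4-byte box size, then a 4-byte box type.
    # This sketch assumes the first box is the file type box, "ftyp".
    if len(header) >= 8 and header[4:8] == b"ftyp":
        return "iso-base-media"
    # Anything else is left unclassified rather than guessed at.
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first few bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_flv_variant(handle.read(12))
```

A check of this kind only identifies the container. It does not reveal which video or audio codec is used inside it, which is the separate risk described in section 13.1.2.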