Evaluating Your File Formats

This guidance relates to:
Stage 1: Plan for action
Stage 2: Define your digital continuity requirements
Stage 3: Assess and address risks to digital continuity
Stage 4: Maintain digital continuity

This guidance has been produced by the Digital Continuity Project

The National Archives Evaluating Your File Formats Version: 1.2

© Crown copyright 2011

You may re-use this document/publication (not including the Royal Arms and other departmental or agency logos) free of charge in any format for research, private study or internal circulation within an organisation. You must re-use it accurately and not use it in a misleading context. The material must be acknowledged as Crown copyright and you must give the title of the source document/publication.

Where we have identified any third party copyright material you will need to obtain permission from the copyright holders concerned.

This document/publication is also available at www.nationalarchives.gov.uk/digitalcontinuity

For any other use of this material please apply for a Click-Use PSI License or write to:
Information Policy Team
Kew
Richmond
Surrey
TW9 4DU
Email: psi@nationalarchives.gsi.gov.uk

CONTENTS

1. Introduction
1.1 What is the purpose of this guidance?
1.2 Who is this guidance for?
2. What is a file format?
2.1 Why evaluate file formats?
3. Scoping your evaluation
3.1 Audiences
3.2 Platforms
3.3 Usages
4. Evaluating file formats
4.1 Capability
4.2 Quality
4.3 Resilience
4.4 Flexibility
4.5 Conclusions
5. Further reading

1. Introduction

Digital continuity is the ability to use your information in the way you need, for as long as you need.

Managing digital continuity protects the information you need to do business. This enables you to operate accountably, legally, effectively and efficiently. It helps you to protect your reputation, make informed decisions, avoid and reduce costs, and deliver better public services. If you lose information because you haven't managed your digital continuity properly, the consequences can be as serious as those of any other information loss.
Digital information is more vulnerable than paper. It is underpinned by high volumes of fragile, invisibly related data and access to it is mediated by fast-changing, proprietary or bespoke technologies, often only partially understood by a few people in your organisation. If you do not actively work to ensure digital continuity, your information can easily become unusable.

Digital continuity is not about digital archiving (although that may be a valid strategy in some cases). It is about assuring an appropriate level of access to your digital information for as long as you need it.

1.1 What is the purpose of this guidance?

This guidance will enable you to:
evaluate file formats from a digital continuity perspective
employ various strategies to maintain the continuity of your digital information.

This guidance forms part of a suite of guidance[1] that The National Archives has delivered as part of a digital continuity service for government, in consultation with central government departments.

This piece of guidance provides you with practical information and support to help you complete Stage 4 of the four-stage process of managing digital continuity, which helps you to maintain the continuity of your digital information over time and through change.[2]

1 For more information and guidance, visit www.nationalarchives.gov.uk/digitalcontinuity
2 See Managing Digital Continuity nationalarchives.gov.uk/documents/managing-digital-continuity.pdf

1.2 Who is this guidance for?

This guidance is primarily aimed at information and IT managers who need to assess the usage of file formats in different business situations across their organisation and beyond. While the methodology outlined in this guidance is high-level, specialist technical knowledge will be required to assess the file formats against the criteria.

2. What is a file format?

A file format is an arbitrary method of storing digital content in a file, allowing its later retrieval or interchange with other people and computers. There are many different file formats for different kinds of digital content and often different versions of the ‘same’ file format.

A file format is often confused with the software most often used to create it. For example, it is common to talk about ‘Microsoft Word’ files, or ‘Acrobat PDF’ files. Despite these naming conventions, in principle a file format is not bound to any particular software – even if in practice this is sometimes the case.

A file format is like a language whose only speakers are certain pieces of software. In general, the more languages you have to deal with, and the fewer speakers you have for each, the more risks exist to digital continuity. Translating between languages can cause errors, yet leaving information in old or rare languages creates another kind of risk.

2.1 Why evaluate file formats?

There are many different reasons for evaluating file formats. You can apply the methodology presented in this guidance in all cases, but your reason for evaluation will influence the constraints and criteria you use.

2.1.1 Assessing the continuity of existing information

You may already have information stored in various file formats, and wish to understand which have the best or worst properties in terms of digital continuity in your own organisation. This will influence whether you need to migrate existing information, to standardise on a file format with good continuity properties, or to avoid obsolescence.

Note: the file formats you use or support will have different digital continuity properties from one another. It is not possible to simply recommend the ‘best’ file formats to use from a continuity perspective – you must evaluate these formats in the context of your own business needs and technological environment.
A good format in one circumstance can be a bad one in another, and vice versa.

2.1.2 Upgrading file formats

File formats are often upgraded as software is upgraded. You may wish to assess the properties of your existing file formats against the newer ones, to decide whether to change.

It can often be the case that a newer format presents a higher risk to resilience and flexibility. The format may have underlying weaknesses that are not immediately apparent, which require later correction. Even if a previous format was widely used, there are no guarantees that a successor format will become so. There will be less software that works with it, reducing interoperability and implementability, and less experience available in recovering from any problems with it. Unless there are compelling new capabilities offered by a newer format, it can be better to adopt a ‘wait and see’ policy.

2.1.3 Selecting different file formats

You may be in a position to select a different file format or formats to use in your organisation, taking into account your existing technological infrastructure. In this case, compatibility with existing software will act as a constraint on your assessment and selection. You may also be influenced by technology in use outside of your organisation, to enable interchange with other bodies or the public.

2.1.4 Selecting new software

Your choice of software can equally be influenced by a principled choice of file formats to use.
In this case, your assessment of which file formats have the right properties will influence your choice of software rather than software acting as a constraint on your assessment of file formats.

3. Scoping your evaluation

This guidance helps you to evaluate your file formats so that you can continue to use your information in the way you need, for as long as you need.

To contextualise your file format evaluation, it is important to define your:
audiences – who needs to access the information (section 3.1)
platforms – what technologies the information is deliverable on (section 3.2)
usages – how people need to be able to use the information (section 3.3).

3.1 Audiences

The ‘audiences’ of your information are all the different individuals or communities who may need to access your information in different ways. Often only the immediate audience of a format is considered, but you need to consider other parties who may need access.

For example: a project involves creating advanced spreadsheets and the project team needs full, editable access to them, whereas the rest of your staff only needs to read them. External auditors may also need to be able to review the spreadsheets. Looking more widely, there is a need to do joint work on the spreadsheets with partner organisations. Finally, the public may require access at some later stage once the project is finished, or even earlier if a Freedom of Information (FOI) enquiry is issued.

3.2 Platforms

The platforms for your information are all the technologies you wish to make your information available on.
This covers not only the obvious desktop systems staff typically use on a daily basis, but also server-based systems (e.g. web based publication, or automated processing) and mobile systems (e.g. Smartphones).

For example: the main use of your spreadsheets is by the project staff using Microsoft Windows XP on their desktops. However, some of the staff in your organisation use Mac OS X. Business information is made available on your intranet, and some remote workers also use Smartphones. Finally, project details are regularly published on your website.

3.3 Usages

For each audience and platform, you should define what each of your audiences needs to be able to do with your information.
To ensure that important usages are not missed, ask how each audience can:
create the information
find the information
open the information
work with the information
understand the information
trust the information.

For example, taking the examples above, we may define our usages as follows:

Windows XP Project staff All staff Intranet Smartphone Create, Edit, Read, Read Read Search Read Mac Read Web Read, Search External auditors Review Read, Search Partner organisations Create, Edit, Read, Read Search Public Read, Search Public Review (Freedom of Information)

Notice that we have made a distinction between ‘reviewing’ and ‘reading’. The reason is that reviewing a spreadsheet involves being able to see the formulae inside the spreadsheet. Reading, in our definition, only implies seeing the resulting data in it.

You must be careful to separate different kinds of usage, even if they appear to be superficially the same. Where you use arbitrary terminology to make these distinctions, you should document what you mean by it.

4. Evaluating file formats

Once you have a clear picture of the audiences, platforms and usages for your information, you should evaluate your file formats in that context. To provide a basis for evaluation and comparison, we suggest that you assess file formats against each of the following characteristics:
capability – how well your business requirements are met (section 4.1)
quality – how accurately your information is stored (section 4.2)
resilience – how resilient your information is to time (section 4.3)
flexibility – how well you can adapt to changing requirements (section 4.4).

By providing a measure of these characteristics on a common scale, you can compare different formats to assess how well they each might meet your needs.

For example, you may be evaluating four spreadsheet formats to use across your organisation.
You could produce a table with scores ranging from 0 to 5, like this:

              Format A   Format B   Format C   Format D
Capability    5          3          5          5
Quality       5          4          5          5
Resilience    4          5          3          2
Flexibility   3          3          5          2

You can determine the score for each characteristic by evaluating a file format against various sub-criteria, with the final score being an average of the sub-criteria scores.

These characteristics (and the criteria defined within them) have been chosen to provide a good basis for comparing the continuity properties of file formats. You can, of course, define further characteristics to assess your file formats against, or define further sub-criteria within them. In some cases, it may be useful to pull out distinct areas of importance to your organisation in order to avoid averaging important and less important features. You should feel free to adapt the methodology to fit your particular reasons for assessing file formats.

The following sections will explain each of these four characteristics and their sub-criteria, and how to assess file formats against them, in more detail.

4.1 Capability

The capabilities of your formats are the features they must support to meet your business needs. One way to determine which features are important is to look at the usages and platforms you defined in the scoping stage. By examining the requirements of the audience on each platform, with the usages defined, you can begin to enumerate what capabilities will be required of your formats.

Some capabilities may be mandatory and some only desirable. Any format which does not meet your mandatory capabilities at all can be immediately excluded. If you discover that no format meets your requirements, then you must re-evaluate your requirements.

It may be the case that no single format can meet all your requirements, in which case you should consider whether it is possible to use different formats to meet different usage requirements.
If this is the case, the use of multiple formats imposes another requirement to migrate information between them.

Once you have a list of formats which at least meet your mandatory capabilities, you can compare them by ranking their capabilities on a common scale.

4.1.1 Worked example

Mandatory capabilities

Using the running spreadsheet format example, we can see that we require spreadsheets which work on Windows and Mac, which can be delivered online, and which are capable of being searched.

We have determined that online delivery on the intranet can be satisfied via links to the original spreadsheets, so no special capabilities are required for that platform. The reviewing requirement will be met by using compatible editing software, even though actual editing is not required – the files could be set to be read-only. However, for delivery to the public on the web, an additional constraint is that the public must not need specialised, commercial software to view the spreadsheet results.

In evaluating the four example spreadsheet formats, we may determine that there is no freely available method of viewing any of them via the web or Smartphones. This is not a negotiable requirement, so an alternative method of satisfying the requirement may be to transform[3] the spreadsheets into a PDF format for web and Smartphone viewing (it is assumed that PDF was selected for this purpose using another format evaluation process). This creates a capability requirement that the formats are easily transformable into PDF using available software.

Desirable capabilities

Drilling into what the project team (the creators and editors of the spreadsheets) requires, we find that the spreadsheets would ideally support up to 250,000 rows, support some specialised statistical functions, and be able to produce particular kinds of chart.
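As an illustration only, the split between mandatory and desirable capabilities can be sketched in a few lines of Python. The capability names, the two candidate formats and their scores are hypothetical, as is the convention that a score of 0 means a capability is not met at all.

```python
from statistics import mean

# Hypothetical capability names; a score of 0 is taken to mean "not met at all".
MANDATORY = {"windows", "mac", "searchable", "transform_to_pdf"}


def capability_score(scores, mandatory=MANDATORY):
    """Return the average score, or None if a mandatory capability is unmet."""
    if any(scores.get(capability, 0) == 0 for capability in mandatory):
        return None  # excluded: fails a mandatory capability
    return mean(scores.values())


# Illustrative candidates only; these are not the formats scored in this guidance.
candidates = {
    "Format X": {"windows": 5, "mac": 5, "searchable": 3,
                 "transform_to_pdf": 4, "charts": 2},
    "Format Y": {"windows": 5, "mac": 0, "searchable": 5,
                 "transform_to_pdf": 5, "charts": 5},
}

results = {name: capability_score(scores) for name, scores in candidates.items()}
```

In this sketch Format Y is excluded because it does not work on Mac at all, however well it scores elsewhere, while Format X stays in the comparison with a modest average.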
By scoring all of the capabilities of each format in a table, totalling the scores and dividing by the number of capabilities, we can derive an average capability score for each format.

                     Format A   Format B                            Format C   Format D
Windows              5          5                                   5          5
Mac                  5          2 (poor support)                    5          5
Searchable           5          3 (some metadata not searchable)    5          5
250,000 rows         5          3 (150,000 only)                    5          5
Chart support        5          3 (limited range)                   5          5
Statistics support   5          3 (some missing)                    5          5
Transform to PDF     5          2 (hard to achieve)                 5          5
Average              5          3                                   5          5

3 Guidance on transforming file formats is given in File Format Conversion nationalarchives.gov.uk/documents/format-conversion.pdf

In the table above, we have not actually enumerated which types of chart or statistical functions are required, or by what technology the spreadsheet formats must be searchable. In a real assessment, these would need to be defined, and the reasons for the scores documented to a greater extent.

In this example, all the formats score well except Format B which, although it does meet all mandatory capabilities, has quite low scores. This does not mean that Format B should be immediately excluded.

Note: inability to meet mandatory capability requirements can rule out the use of certain file formats, while merely having low scores does not rule anything out. You will need to assess the three other continuity characteristics (below) to determine which of the file formats have the best overall continuity properties.

4.2 Quality

The quality of a file format refers to how well your information is represented by the format. We can ask two questions of quality:
Precision – is data represented to a sufficient precision?
Lossiness – does the format intentionally throw information away?

You may define other quality measures for your particular formats. For example, you may wish to assess whether a document format will preserve page numbers, or whether international characters are properly represented.
4.2.1 Precision

If a format must store structured data, it is important that the data is represented to a sufficient level of precision. For example, dates and times may only be accurate to a second, or a millisecond. Numbers may only support a limited number of decimal places, or none. Calculations using data at a lower level of precision than required can produce the wrong answers. Clearly, not all formats have precision issues, but many do.

4.2.2 Lossiness

Some formats do not store all the information originally entered into them, and actually throw away information which the format regards as inessential. Typically this is done to achieve smaller file sizes, often for media formats (images, audio and video).

For example, MP3 audio files discard parts of the audio signal which the algorithm deems human ears will not notice. Likewise, the JPG file format used by many digital cameras is a lossy format, which averages areas of colour in the picture to reduce the amount of information that must be stored.

In general, formats which are lossy can be tuned to discard greater or lesser amounts of information. Using a lossy format to store information which must be repeatedly changed is not recommended, as every time the file is resaved, more information is discarded. This is similar to repeatedly photocopying copies of a document – after a few copies of copies of copies, the document becomes unreadable.

So, you should only use lossy formats if the information will not be further changed once stored and the loss of information is within acceptable quality bounds.

4.2.3 Worked example

Again, using our spreadsheet example, we determine that no spreadsheet formats are lossy – no information is intentionally discarded by them. However, there can be precision issues.
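The kind of precision issue described in section 4.2.1 can be demonstrated with a short Python sketch. It stores the same number as a 64-bit and as a 32-bit floating point value and reads it back; the function names are illustrative and do not correspond to any particular spreadsheet format.

```python
import struct


def as_float32(value: float) -> float:
    """Round-trip a number through 32-bit floating point storage."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def as_float64(value: float) -> float:
    """Round-trip a number through 64-bit floating point storage."""
    return struct.unpack("<d", struct.pack("<d", value))[0]


original = 1.222343454545

stored64 = as_float64(original)   # comes back unchanged
stored32 = as_float32(original)   # only about 7 significant digits survive

error32 = abs(stored32 - original)
```

The error in a single value is small, but it can matter when values are totalled, compared for equality, or fed into further calculations.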
Again, we construct a table for our candidate formats:

                      Format A                     Format B                     Format C                     Format D
Numeric precision     5 (64-bit floating point)    3 (32-bit floating point)    5 (64-bit floating point)    5 (64-bit floating point)
Date/time precision   5                            5                            5                            5
Average               5                            4                            5                            5

We can see that there are no significant quality issues with the spreadsheet formats, except that Format B does not handle floating point numbers (e.g. 1.222343454545) to the same level of precision as the others.

4.3 Resilience

File formats are vulnerable to changes over time, including accidental corruption of the information in the file and external changes in the technology landscape. Here we define three characteristics of a resilient format, although again you may define others which relate to your particular environment.

Ubiquity – how widespread is the use of the format? (section 4.3.1)
Stability – how long will the format be supported by software? (section 4.3.2)
Recoverability – how resilient is the format to accidental corruption? (section 4.3.3)

4.3.1 Ubiquity

Formats which are more widely used will tend to be more resilient than those which are not. The chance that the format will cease to be supported in future is much lower.

Measuring the ubiquity of a format must be in relation to its market sector and its natural competitors. For example, Computer Aided Design (CAD) formats are much less widely used than document file formats, but it would be meaningless to compare them in this way, even though in absolute terms it may be true that CAD formats are less resilient than document formats. The purpose of the evaluation is to assess the relative risk between formats in competition with each other.

4.3.2 Stability

The stability of a format relates to how long the format is likely to be supported.
Certain formats which are commonly used and have become a de-facto standard will tend to be more stable than those which are niche players, or which are new in the marketplace. In this sense, past history can be a guide to stability.

An example is the binary Microsoft Office file formats (97-2003). Although there are already newer Office formats, there is such a substantial body of information recorded in the older versions, and they have retained support for so long, that it is quite likely that there will be support for them many years into the future.

Standardisation can be a clue to the stability of a format even in the absence of ubiquity. The standardisation process does not guarantee future support for a format, but it is an indication that the format may be supported into the future. Common media formats such as JPG are standardised, widely used, and likely to be supported for many years.

On the other hand, the recent appearance of standardised formats, or a multitude of competing standards, can also be a clue that formats may become less stable. The presence of a standard does not necessarily create stability, although it can be an indication of future support.

4.3.3 Recoverability

Different file formats have varying levels of susceptibility to accidental corruption over time. Some formats store information very densely, or in ways in which changing a single bit from one to zero (or vice versa) can prevent software from opening the entire file. In others, the information is quite spread out, with few dependencies between different parts of a file.

In general, file formats can be divided into textual formats and binary formats.
Textual formats:
are formats in which the information is internally represented as text, even if the information within them is not text
can be opened using normal text editors and the contents reviewed manually, or processed programmatically fairly easily
are not very dense, and can be quite resilient in the face of small numbers of errors.

Binary formats:
use a variety of methods of representing information, which may include text, but are not purely textual
cannot be meaningfully opened in text editors
tend to represent information in a more dense form, which is more directly understandable by computers.

However, there is no ‘standard’ for binary file formats – each is uniquely determined by the specification or application which controls it, and they bear no relation to each other. Binary formats tend to be less resilient in the face of small numbers of errors, and are generally harder to access programmatically or manually, with or without corruption. It can be very difficult to open a corrupt binary format and extract information from within it.

On top of whether a file format is binary or textual, a file format can be scrambled in a variety of ways. Formats can be compressed (e.g. using zip software), or the contents may be encrypted (e.g. using AES encryption). Both of these methods result in binary files, and the information within them is typically denser and harder to recover in the face of corruption.

The particular method of compression or encryption will affect the recoverability of a file. Some forms of encryption create dependencies between parts of a file, so that a corruption of a single bit damages more than that bit alone. In Cipher Block Chaining (CBC) mode, for example, a single corrupted bit makes the whole of the affected block unintelligible and also alters the following block. In extreme cases, where every part of a file depends on all the parts before it, or where the file is verified as a whole before it can be opened, a single corrupted bit can render the rest of the file unintelligible.
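The difference in recoverability between plain text and compressed content can be illustrated with a small Python sketch. It flips one bit in a short piece of text and one bit in the compressed form of the same text. The sample text and the position of the damage are arbitrary choices made for illustration.

```python
import zlib


def flip_bit(data: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of data with a single bit inverted."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


text = b"Quarterly totals: North 1532, South 2874, East 991, West 4410."

# Textual storage: one flipped bit changes one character and nothing else.
damaged_text = flip_bit(text, len(text) // 2)

# Compressed storage: one flipped bit usually makes the whole stream unreadable.
compressed = zlib.compress(text)
try:
    recovered = zlib.decompress(flip_bit(compressed, len(compressed) // 2))
except zlib.error:
    recovered = None  # decompression failed, so nothing was recovered
```

The damaged text is still readable apart from one character, whereas the damaged compressed stream will normally be rejected outright by the decompressor.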
To guard against corruption, some file formats explicitly include error detection and correction information, which provides redundancy at the expense of larger file sizes. Typically, error detection and correction is provided for file formats intended for secure backup, or for those which represent information very densely, or which are intended for real-time delivery over networks (e.g. streaming media).

Finally, the complexity of the information stored in a format will influence how recoverable a format is. In general, the more complex the information, the more complex the format and therefore the lower the recoverability tends to be.

For formats intended for a single type of information or purpose (e.g. a single image, or to compress any file), the complexity is quite low. For formats which allow lots of different kinds of information to be arranged at the user’s discretion (e.g. documents or CAD files), the complexity is high.

You will usually be comparing formats with similar levels of complexity, as they will be intended for the same types of information or purpose. However, you can compare formats with different levels of complexity where your purpose does not require complexity, but your existing information is already complex. For example, to store text for a long time, you may evaluate various document formats against simple text files.
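Where a format does not carry its own error detection information, a similar check can be approximated outside the format. The Python sketch below records a SHA-256 checksum when a file is stored and compares it later. It is a minimal illustration of detection only, not correction, and the function names and sample content are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 checksum to record alongside the stored file."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, recorded_checksum: str) -> bool:
    """Check that the content still matches the checksum recorded earlier."""
    return fingerprint(data) == recorded_checksum


original = b"Project budget 2011: approved"
recorded = fingerprint(original)          # stored when the file is created

corrupted = bytearray(original)
corrupted[3] ^= 0x01                      # simulate a single flipped bit
```

A check of this kind tells you that something has changed, but not what changed or how to repair it, so it complements rather than replaces backup copies.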
4.3.4 Worked example

Assessing our four spreadsheet formats against ubiquity, stability and recoverability gives us the following results:

                 Format A                    Format B              Format C            Format D
Ubiquity         4 (widely used)             5 (dominant format)   3 (somewhat used)   2 (not widely used)
Stability        4 (standardised)            5 (standardised)      3 (superseded)      1 (very new)
Recoverability   4 (dense textual format)    5 (textual format)    3 (binary format)   3 (binary format)
Average          4                           5                     3                   2

4.4 Flexibility

The flexibility of a format gives a measure of how well you are able to adapt to changing requirements and use your information in the way you require. We can define two characteristics against which to assess its flexibility.

Interoperability – how much existing software can access the format? (section 4.4.1)
Implementability – how easy is it to write software to interact with the format? (section 4.4.2)

4.4.1 Interoperability

File formats are like languages whose only speakers are certain pieces of software. The more software that can ‘speak’ the language of your file format, the more flexibility you have in selecting cost-effective software to access your information, and the easier it will be to exchange information between different audiences and platforms.

When considering interoperability, think beyond the platform on which the information is most commonly created, and consider all the platforms on which you want your information to be accessible, and how it needs to be accessed. For example, while spreadsheets may typically be created using a desktop application, you may require that they are readable on Smartphones, or editable using a web browser.

While the platforms defined in your scoping exercise give you an idea of what you must support today, you should also consider future needs and interoperability with external audiences where you don’t control the platform.
Standardisation (whether formally, or by being a de-facto standard) of a format can help in creating the conditions for interoperability, but be aware that merely being standardised does not ensure interoperability. A file can conform entirely to a standard, yet not actually interoperate with other software, which may have made different judgements in interpreting the standard. The reasons for this can range from poor implementation of the software, to complexity of the standard, to the desire of a dominant vendor to create lock-in to their software.

In general, it is not enough for software to claim to support a format standard – this is a necessary but not sufficient condition to ensure interoperability. It must also implement the standard in a way which actually interoperates with other software. For very simple formats, or formats which are very common (for example, text files or JPG images) this is not generally an issue, but complex or niche formats will require testing with specific software to gain assurance of actual interoperability.

It will not be possible for most organisations to test exchanging information in each format using every possible piece of software, so you must pragmatically scope your scoring on interoperability. You can gain an idea of the level of interoperability of a given file format by examining how many pieces of software on a variety of platforms claim support for a file format in providing the types of access that you require.

You may decide that in your business context, interoperability is only important to assess between certain defined pieces of software or platforms, or that some interoperability is mandatory and some only desirable.
For example, you could run tests on the software between which you know you need to exchange information, and rely only on subjective measures, such as the amount of software claiming support, for other platforms that are less important to you.

4.4.2 Implementability

It is often important to be able to automate processes, or enable access to your information on a new platform. If you need to process your information like this, then you should consider implementability.

The implementability of a file format refers to how easy it is to create bespoke software that interoperates with your information in that format. Simple formats, or those which use highly standardised methods of recording the information, will tend to be easy to implement. For example, XML is a highly standardised method of creating textual file formats, and so it is quite easy to create software which processes information in an XML-based format.

The existence of Application Programming Interfaces (APIs) or Software Development Kits (SDKs) for a format provides assurance of implementability. These are libraries of software which you can use in your own software to access information in the format, without having to know how to read or write the information in the format directly. They are often provided by the vendor of the principal software used to create information in that format, although there may be alternative APIs or SDKs provided by other organisations, or as open-source software. You should assess whether these APIs or SDKs are available on the platforms on which you may want to implement software, and whether they are being actively maintained and supported.

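As a minimal sketch of why XML-based formats tend to be easy to implement, the following Python reads cell values from a small XML spreadsheet using only the standard library. The sample document and its element and attribute names (sheet, row, cell, ref) are hypothetical, invented for illustration, and do not correspond to any real spreadsheet format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML-based spreadsheet. The element and attribute names
# are invented for this illustration and match no real format.
SAMPLE = """
<sheet name="Budget">
  <row><cell ref="A1">Item</cell><cell ref="B1">Cost</cell></row>
  <row><cell ref="A2">Storage</cell><cell ref="B2">1200</cell></row>
</sheet>
"""


def read_cells(xml_text):
    """Return a dict mapping each cell reference to its text content."""
    root = ET.fromstring(xml_text.strip())
    return {cell.get("ref"): cell.text for cell in root.iter("cell")}


cells = read_cells(SAMPLE)
print(cells["B2"])
```

A generic XML parser handles all of the low-level reading here, so the bespoke code only needs to know the structure of the format. A binary format without an API or SDK would typically require far more work to reach the same point.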
4.4.3 Worked example

Assessing our example spreadsheets against interoperability and implementability gives us this table:

                   Format A   Format B   Format C   Format D
Interoperability   2          3          5          3
Implementability   4          3          5          1
Average            3          3          5          2

The reasons for the interoperability scores are:

Format A: Only one vendor on Windows and Mac, only supporting basic features.
Format B: Older but good software on Mac, multiple on Windows of varying quality.
Format C: Lots of different software available, with demonstrably good interoperability.
Format D: A few good implementations on Mac, but only one on Windows.

The reasons for the implementability scores are:

Format A: APIs available on most platforms, not smartphones.
Format B: Older APIs available, but only on the Windows platform.
Format C: APIs available on all platforms.
Format D: No APIs, only automatable using desktop software.

4.5 Conclusions

Once you have assessed your file formats against each continuity category, you can make a final table containing the averages of each category (see worked example in 4.5.1 below). This enables you to see at a glance which formats have better or worse continuity properties.

The best format for your use is not necessarily the one with the highest scores against each category. You should judge which categories are most important to you, and whether any specific issues arose during your assessment which would influence your decision.

Given that you may not pick a format with the highest scores, it is natural to wonder whether it is worth assessing formats at all. However, the risks of not assessing, and simply going with the default, can be high. The highest impact is the loss of critical business information over time and through change. There are also other, less visible, costs, including lower business and technical flexibility and higher information management costs.

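The effect of judging some categories to be more important than others can be sketched in a few lines of Python. The scores are the category averages from the worked example in 4.5.1. The weightings, and the idea of combining the categories into a single weighted score, are assumptions made for illustration and are not part of the method set out in this guidance.

```python
# Category averages for each format, taken from the worked example in 4.5.1.
AVERAGES = {
    "Format A": {"capability": 5, "quality": 5, "resilience": 4, "flexibility": 3},
    "Format B": {"capability": 3, "quality": 4, "resilience": 5, "flexibility": 3},
    "Format C": {"capability": 5, "quality": 5, "resilience": 3, "flexibility": 5},
    "Format D": {"capability": 5, "quality": 5, "resilience": 2, "flexibility": 2},
}


def weighted_score(scores, weights):
    """Weighted mean of category scores; equal weights give a plain mean."""
    total_weight = sum(weights[category] for category in scores)
    return sum(scores[c] * weights[c] for c in scores) / total_weight


def rank(averages, weights):
    """Return format names ordered from highest to lowest weighted score."""
    return sorted(
        averages,
        key=lambda name: weighted_score(averages[name], weights),
        reverse=True,
    )


# Hypothetical weightings, chosen only to illustrate the effect.
EQUAL = {"capability": 1, "quality": 1, "resilience": 1, "flexibility": 1}
RESILIENCE_FIRST = {"capability": 1, "quality": 1, "resilience": 5, "flexibility": 1}

print(rank(AVERAGES, EQUAL))
print(rank(AVERAGES, RESILIENCE_FIRST))
```

With equal weights Format C ranks first. When resilience is weighted five times as heavily as the other categories, Format B moves to the top. A numeric ranking of this kind can support the judgement described above, but it does not replace it.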
The process of assessment gives you a picture of the trade-offs you make in choosing various formats, and therefore enables you to use your information in the way that you need to, for as long as you need to, and to justify any choice you may make.

Key messages

Understanding the file formats you have or plan to use is an important factor in managing your risks to digital continuity.
Formats are not innately better or worse than one another – it depends on the use you need to make of them and the context they are used in.
Using formats by default exposes you to continuity risk. It is better to make an informed choice by assessing the available options.
Formal and open standards are important to the continuity of your information, but de facto standards and ubiquity can be just as important.
Ensure you consider all four continuity factors when considering formats, not just what meets immediate business needs.

4.5.1 Worked example

Taking our spreadsheet example once again, placing the average assessments of each of the categories against each format gives us a final table:

              Format A   Format B   Format C   Format D
Capability    5          3          5          5
Quality       5          4          5          5
Resilience    4          5          3          2
Flexibility   3          3          5          2

We can make a more direct visual comparison of the file formats by creating charts:

Figure 1: Bar chart grouped by format

Figure 2: Radar chart grouped by category

We can see immediately that Format B has some of the lowest average scores, except for a high score for resilience. The other formats all score equally for capability and quality. Resilience and flexibility, however, vary considerably between them. Format D has very low scores for both resilience and flexibility, so cannot be considered a good choice when set against A and C. There is little difference in the resilience of Formats A and C, but a big difference in flexibility.
On the basis of this assessment, Format C would seem to be the best choice, assuming that long-term resilience of the format is not the overriding concern.

Note that Format B in our example was actually the dominant format in use at the time. It is quite common for organisations to choose a dominant format by default. There is certainly an argument for safety in numbers, but that is not an argument for choosing formats by default.

Note that the same formats could be scored very differently depending on the particular audiences, platforms and usages you define, and the criteria deemed most important will also vary. This guidance does not dictate that a format with the highest scores across all categories is necessarily the ‘best’ format. During the process of evaluation, critical criteria may emerge which influence any final decisions. For example, the decision here may be made to go with Format B (safety overriding all other concerns). Whatever decisions are made, they will be informed and justifiable, with a clear picture emerging of the various factors which influence the continuity of your digital information.

5. Further reading

Find out more about file formats in our addendum to this document:

A Guide to File Formats
nationalarchives.gov.uk/information-management/projects-and-work/dc-guidance.htm

To help in evaluating file formats, this guide presents factual information about selected existing file formats used widely across organisations.

Wikipedia is also a good source of information on file formats:
http://en.wikipedia.org/wiki/File_format

The National Archives maintains PRONOM, an online database of information about file formats and software, which is regularly updated:
nationalarchives.gov.uk/pronom/

If you have a requirement to convert information from one file format to another, please see:

File Format Conversion
nationalarchives.gov.uk/documents/format-conversion.pdf

File Format Conversion explains the issues in migrating information between different file formats. It will enable you to understand why, when and how you should convert file formats, and what you should convert them to.

For more information on managing your digital continuity, please see:

Understanding Digital Continuity
nationalarchives.gov.uk/documents/understanding-digital-continuity.pdf

Understanding Digital Continuity is for anyone who wants to know more about digital continuity and digital information management. It is a high-level piece of guidance providing you with an introduction to digital continuity – what it is, why it’s important and why it’s relevant to you and your organisation.

Managing Digital Continuity
nationalarchives.gov.uk/documents/managing-digital-continuity.pdf

Managing Digital Continuity guides you through the steps you will need to take to successfully manage your digital information. This piece of guidance is primarily aimed at Senior Responsible Owners (SROs) who hold overall responsibility for ensuring digital continuity.