Evaluating Your File Formats

This guidance relates to:
Stage 1: Plan for action
Stage 2: Define your digital continuity requirements
Stage 3: Assess and address risks to digital continuity
Stage 4: Maintain digital continuity
This guidance has been produced by the Digital Continuity Project
The National Archives
Evaluating Your File Formats Version: 1.2
© Crown copyright 2011
You may re-use this document/publication (not including the Royal Arms and other
departmental or agency logos) free of charge in any format for research, private study or
internal circulation within an organisation. You must re-use it accurately and not use it in a
misleading context. The material must be acknowledged as Crown copyright and you must
give the title of the source document/publication.
Where we have identified any third party copyright material you will need to obtain
permission from the copyright holders concerned.
This document/publication is also available at www.nationalarchives.gov.uk/digitalcontinuity
For any other use of this material please apply for a Click-Use PSI License or write to:
Information Policy Team
Kew
Richmond
Surrey
TW9 4DU
Email: psi@nationalarchives.gsi.gov.uk
CONTENTS
1. Introduction
  1.1 What is the purpose of this guidance?
  1.2 Who is this guidance for?
2. What is a file format?
  2.1 Why evaluate file formats?
3. Scoping your evaluation
  3.1 Audiences
  3.2 Platforms
  3.3 Usages
4. Evaluating file formats
  4.1 Capability
  4.2 Quality
  4.3 Resilience
  4.4 Flexibility
  4.5 Conclusions
5. Further reading
1. Introduction
Digital continuity is the ability to use your information in the way you need, for as long
as you need.
Managing digital continuity protects the information you need to do business. This enables
you to operate accountably, legally, effectively and efficiently. It helps you to protect your
reputation, make informed decisions, avoid and reduce costs, and deliver better public
services. If you lose information because you haven't managed your digital continuity
properly, the consequences can be as serious as those of any other information loss.
Digital information is more vulnerable than paper. It is underpinned by high volumes of
fragile, invisibly related data and access to it is mediated by fast-changing, proprietary or
bespoke technologies, often only partially understood by a few people in your organisation.
If you do not actively work to ensure digital continuity, your information can easily become
unusable.
Digital continuity is not about digital archiving (although that may be a valid strategy in some
cases). It is about assuring an appropriate level of access to your digital information for as
long as you need it.
1.1 What is the purpose of this guidance?
This guidance will enable you to:
evaluate file formats from a digital continuity perspective
employ various strategies to maintain the continuity of your digital information.
This guidance forms part of a suite of guidance [1] that The National Archives has delivered as
part of a digital continuity service for government, in consultation with central government
departments.
This piece of guidance provides you with practical information and support to help you
complete Stage 4 of the four-stage process of managing digital continuity, which helps you to
maintain the continuity of your digital information over time and through change. [2]
[1] For more information and guidance, visit www.nationalarchives.gov.uk/digitalcontinuity
[2] See Managing Digital Continuity nationalarchives.gov.uk/documents/managing-digital-continuity.pdf
1.2 Who is this guidance for?
This guidance is primarily aimed at information and IT managers who need to assess the
usage of file formats in different business situations across their organisation and beyond.
While the methodology outlined in this guidance is high-level, specialist technical knowledge
will be required to assess the file formats against the criteria.
2. What is a file format?
A file format is an arbitrary method of storing digital content in a file, allowing its later retrieval
or interchange with other people and computers. There are many different file formats for
different kinds of digital content and often different versions of the ‘same’ file format.
A file format is often confused with the software most often used to create it. For example, it
is common to talk about ‘Microsoft Word’ files, or ‘Acrobat PDF’ files. Despite these naming
conventions, in principle a file format is not bound to any particular software – even if in
practice this is sometimes the case.
A file format is like a language whose only speakers are certain pieces of software. In
general, the more languages you have to deal with, and the fewer speakers you have for
each, the more risks exist to digital continuity. Translating between languages can cause
errors, yet leaving information in old or rare languages creates another kind of risk.
2.1 Why evaluate file formats?
There are many different reasons for evaluating file formats. You can apply the methodology
presented in this guidance in all cases, but your reason for evaluation will influence the
constraints and criteria you use.
2.1.1 Assessing the continuity of existing information
You may already have information stored in various file formats, and wish to understand
which have the best or worst properties in terms of digital continuity in your own organisation.
This will influence whether you need to migrate existing information, to standardise a file
format with good continuity properties, or to avoid obsolescence.
Note: the file formats you use or support will have different digital continuity
properties from one another. It is not possible to simply recommend the ‘best’ file
formats to use from a continuity perspective – you must evaluate these formats in the
context of your own business needs and technological environment. A good format in
one circumstance can be a bad one in another, and vice versa.
2.1.2 Upgrading file formats
File formats are often upgraded as software is upgraded. You may wish to assess the
properties of your existing file formats against the newer ones, to decide whether to change.
It can often be the case that a newer format presents a higher risk to resilience and flexibility.
The format may have underlying weaknesses that are not immediately apparent, which
require later correction. Even if a previous format was widely used, there are no guarantees
that a successor format will become so. There will be less software that works with it,
reducing interoperability and implementability, and less experience available in recovering
from any problems with it. Unless there are compelling new capabilities offered by a newer
format, it can be better to adopt a ‘wait and see’ policy.
2.1.3 Selecting different file formats
You may be in a position to select a different file format or formats to use in your
organisation, taking into account your existing technological infrastructure. In this case,
compatibility with existing software will act as a constraint on your assessment and selection.
You may also be influenced by technology in use outside of your organisation, to enable
interchange with other bodies or the public.
2.1.4 Selecting new software
Your choice of software can equally be influenced by a principled choice of file formats to
use. In this case, your assessment of which file formats have the right properties will
influence your choice of software rather than software acting as a constraint on your
assessment of file formats.
3. Scoping your evaluation
This guidance helps you to evaluate your file formats so that you can continue to use your
information in the way you need, for as long as you need. To contextualise your file format
evaluation, it is important to define your:
audiences – who needs to access the information (section 3.1)
platforms – what technologies the information is deliverable on (section 3.2)
usages – how people need to be able to use the information (section 3.3).
3.1 Audiences
The ‘audiences’ of your information are all the different individuals or communities who may
need to access your information in different ways. Often only the immediate audience of a
format is considered, but you need to consider other parties who may need access.
For example: a project involves creating advanced spreadsheets and the project
team needs full, editable access to them, whereas the rest of your staff only needs
to read them. External auditors may also need to be able to review the spreadsheets.
Looking more widely, there is a need to do joint work on the spreadsheets with
partner organisations. Finally, the public may require access at some later stage
once the project is finished, or even earlier if a Freedom of Information (FOI) enquiry
is issued.
3.2 Platforms
The platforms for your information are all the technologies you wish to make your information
available on. This covers not only the obvious desktop systems staff typically use on a daily
basis, but also server-based systems (e.g. web based publication, or automated processing)
and mobile systems (e.g. Smartphones).
For example: the main use of your spreadsheets is by the project staff using
Microsoft Windows XP on their desktops. However, some of the staff in your
organisation use Mac OS/X. Business information is made available on your intranet,
and some remote workers also use Smartphones. Finally, project details are
regularly published on your website.
3.3 Usages
For each audience and platform, you should define what each of your audiences needs to be
able to do with your information. To ensure that important usages are not missed, ask how
each audience can:
create the information
find the information
open the information
work with the information
understand the information
trust the information.
For example, taking the examples above, we may define our usages as follows:
Windows XP
Project staff
All staff
Intranet
Smartphone
Create, Edit,
Read,
Read
Read
Search
Read
Mac
Read
Web
Read,
Search
External auditors
Review
Read,
Search
Partner organisations
Create, Edit,
Read,
Read
Search
Public
Read,
Search
Public
Review
(Freedom of
Information)
Notice that we have made a distinction between ‘reviewing’ and ‘reading’. The reason is that
reviewing a spreadsheet involves being able to see the formulae inside the spreadsheet.
Reading, in our definition, only implies seeing the resulting data in it.
You must be careful to separate different kinds of usage, even if they appear to be
superficially the same. Where you use arbitrary terminology to make these distinctions, you
should document what you mean by it.
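One way to keep such definitions unambiguous is to record them alongside the usage requirements themselves. The following Python sketch is illustrative only: the `Usage` names mirror the terms used in this section, but the `REQUIREMENTS` entries and the helper `platforms_needing` are hypothetical sample values and names, not a transcription of the table above.

```python
from enum import Enum


class Usage(Enum):
    """Usage terms, each documented so the distinction is explicit."""
    CREATE = "produce new spreadsheets"
    EDIT = "change existing spreadsheets"
    READ = "see the resulting data only"
    REVIEW = "see the formulae as well as the resulting data"
    SEARCH = "find information held in the spreadsheets"


# Hypothetical requirements, keyed by (audience, platform).
REQUIREMENTS = {
    ("Project staff", "Windows XP"): {Usage.CREATE, Usage.EDIT, Usage.READ, Usage.SEARCH},
    ("External auditors", "Windows XP"): {Usage.REVIEW},
    ("Public", "Web"): {Usage.READ, Usage.SEARCH},
}


def platforms_needing(usage):
    """Return the sorted platforms on which any audience needs `usage`."""
    return sorted({platform for (_, platform), usages in REQUIREMENTS.items()
                   if usage in usages})
```

Keeping the meaning of each term next to the data means that a later reader can tell 'review' from 'read' without relying on memory or convention.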
4. Evaluating file formats
Once you have a clear picture of the audiences, platforms and usages for your information,
you should evaluate your file formats in that context. To provide a basis for evaluation and
comparison, we suggest that you assess file formats against each of the following
characteristics:
capability – how well your business requirements are met (section 4.1)
quality – how accurately your information is stored (section 4.2)
resilience – how resilient your information is to time (section 4.3)
flexibility – how well you can adapt to changing requirements (section 4.4).
By providing a measure of these characteristics on a common scale, you can compare
different formats to assess how well they each might meet your needs. For example, you
may be evaluating four spreadsheet formats to use across your organisation. You could
produce a table with scores ranging from 0 to 5, like this:
              Format A   Format B   Format C   Format D
Capability    5          3          5          5
Quality       5          4          5          5
Resilience    4          5          3          2
Flexibility   3          3          5          2
You can determine the score for each characteristic by evaluating a file format against
various sub-criteria, with the final score being an average of the sub-criteria scores.
These characteristics (and the criteria defined within them) have been chosen to provide a
good basis for comparing the continuity properties of file formats. You can, of course, define
further characteristics to assess your file formats against, or define further sub-criteria within
them. In some cases, it may be useful to pull out distinct areas of importance to your
organisation in order to avoid averaging important and less important features. You should
feel free to adapt the methodology to fit your particular reasons for assessing file formats.
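The averaging described above, and the option of giving some sub-criteria more importance than others, can be sketched in a few lines of Python. The function name `characteristic_score` and its optional `weights` argument are illustrative assumptions rather than part of the methodology. The unweighted scores are Format B's capability scores from the worked example in section 4.1.1; the weighting is invented for illustration.

```python
def characteristic_score(scores, weights=None):
    """Average sub-criteria scores (0 to 5), optionally weighted.

    `scores` maps sub-criterion name to score; `weights` maps the same
    names to relative importance (hypothetical; defaults to equal weight).
    """
    if not scores:
        raise ValueError("at least one sub-criterion score is required")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


format_b_capability = {
    "Windows": 5, "Mac": 2, "Searchable": 3, "250,000 rows": 3,
    "Chart support": 3, "Statistics support": 3, "Transform to PDF": 2,
}
plain = characteristic_score(format_b_capability)

# Hypothetical weighting: Mac support matters three times as much.
weights = {name: 1.0 for name in format_b_capability}
weights["Mac"] = 3.0
weighted = characteristic_score(format_b_capability, weights)
```

With equal weights the score is the plain average; raising the weight of a poorly supported sub-criterion pulls the score down, which is one way to stop an important weakness being averaged away.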
The following sections will explain each of these four characteristics and their sub-criteria,
and how to assess file formats against them, in more detail.
4.1 Capability
The capabilities of your formats are the features they must support to meet your business
needs.
One way to determine which features are important is to look at the usages and platforms
you defined in the scoping stage. By examining the requirements of the audience on each
platform, with the usages defined, you can begin to enumerate what capabilities will be
required of your formats.
Some capabilities may be mandatory and some only desirable. Any format which does not
meet your mandatory capabilities at all can be immediately excluded. If you discover that no
format meets your requirements, then you must re-evaluate your requirements.
It may be the case that no single format can meet all your requirements, in which case you
should consider whether it is possible to use different formats to meet different usage
requirements. If this is the case, the use of multiple formats imposes another requirement to
migrate information between them.
Once you have a list of formats which at least meet your mandatory capabilities, you can
compare them by ranking their capabilities on a common scale.
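The two steps described in this section, excluding formats that fail a mandatory capability and then ranking the remainder, can be sketched as follows. This is a minimal illustration under stated assumptions: the mandatory set, the format names and all scores are hypothetical, and a score of 0 is taken to mean that a capability is not met at all.

```python
MANDATORY = {"Windows", "Mac", "Searchable"}  # hypothetical mandatory set


def shortlist(candidates, mandatory=MANDATORY):
    """Exclude formats that fail any mandatory capability, then rank the rest.

    `candidates` maps format name to {capability: score}; a score of 0
    (or a missing entry) means the capability is not met at all.
    """
    ranked = []
    for name, scores in candidates.items():
        if any(scores.get(capability, 0) == 0 for capability in mandatory):
            continue  # fails a mandatory capability: excluded outright
        average = sum(scores.values()) / len(scores)
        ranked.append((name, average))
    # Highest average first; ties keep their original order.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical scores for three imaginary formats.
candidates = {
    "Format X": {"Windows": 5, "Mac": 0, "Searchable": 5},
    "Format Y": {"Windows": 4, "Mac": 3, "Searchable": 5},
    "Format Z": {"Windows": 5, "Mac": 5, "Searchable": 5},
}
result = shortlist(candidates)
```

Here Format X is dropped because it does not work on Mac at all, while Format Y stays on the list despite its lower scores.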
4.1.1 Worked example
Mandatory capabilities
Using the running spreadsheet format example, we can see that we require spreadsheets
which work on Windows and Mac, which can be delivered online, and which are capable of
being searched. We have determined that online delivery on the intranet can be satisfied via
links to the original spreadsheets, so no special capabilities are required for that platform.
The reviewing requirement will be met by using compatible editing software, even though
actual editing is not required – the files could be set to be read-only. However, for delivery to
the public on the web, an additional constraint is that the public must not need specialised,
commercial software to view the spreadsheet results.
In evaluating the four example spreadsheet formats, we may determine that there is no freely
available method of viewing any of them via the web or Smartphones. This is not a
negotiable requirement, so an alternative method of satisfying the requirement may be to
transform [3] the spreadsheets into a PDF format for web and Smartphone viewing (it is
assumed that PDF was selected for this purpose using another format evaluation process).
This creates a capability requirement that the formats are easily transformable into PDF
using available software.
Desirable capabilities
Drilling into what the project team of users (the creators and editors of the spreadsheets)
require, we find that the spreadsheets would ideally support up to 250,000 rows, support
some specialised statistical functions, and be able to produce particular kinds of chart.
By scoring all of the capabilities of each format in a table, totalling the scores and dividing by
the number of capabilities, we can derive an average capability score for each format.
                     Format A   Format B                           Format C   Format D
Windows              5          5                                  5          5
Mac                  5          2 (Poor support)                   5          5
Searchable           5          3 (Some metadata not searchable)   5          5
250,000 rows         5          3 (150,000 only)                   5          5
Chart support        5          3 (Some missing)                   5          5
Statistics support   5          3 (Limited range)                  5          5
Transform to PDF     5          2 (Hard to achieve)                5          5
Average              5          3                                  5          5
In the table above, we have not actually enumerated which types of chart or statistical
functions are required, or by what technology the spreadsheet formats must be searchable.
In a real assessment, these would need to be defined, and the reasons for the scores
documented to a greater extent.
[3] Guidance on transforming file formats is given in File Format Conversion
nationalarchives.gov.uk/documents/format-conversion.pdf
In this example, all the formats score well except Format B which, although it does meet all
mandatory capabilities, has quite low scores. This does not mean that format B should be
immediately excluded.
Note: inability to meet mandatory capability requirements can rule out the use of
certain file formats, while merely having low scores does not rule anything out. You
will need to assess the three other continuity characteristics (below) to determine
which of the file formats have the best overall continuity properties.
4.2 Quality
The quality of a file format refers to how well your information is represented by the format.
We can ask two questions of quality:
Precision – is data represented to a sufficient precision?
Lossiness – does the format intentionally throw information away?
You may define other quality measures for your particular formats. For example, you may
wish to assess whether a document format will preserve page numbers, or whether
international characters are properly represented.
4.2.1 Precision
If a format must store structured data, it is important that the data is represented to a
sufficient level of precision. For example, dates and times may only be accurate to a second,
or a millisecond. Numbers may only support a number of decimal places, or none.
Calculations using data at a lower level of precision than required can produce the wrong
answers. Clearly, not all formats have precision issues, but many do.
4.2.2 Lossiness
Some formats do not store all the information originally entered in to them, and actually throw
information away which the format regards as inessential. Typically this is done to achieve
smaller file sizes, often for media formats (images, audio and video).
For example, MP3 audio files discard parts of the audio signal which the algorithm deems
human ears will not notice. Likewise, the JPG file format used by many digital cameras is a
lossy format, which averages areas of colour in the picture to reduce the amount of
information that must be stored. In general, formats which are lossy can be tuned to discard
greater or lesser amounts of information.
Using a lossy format to store information which must be repeatedly changed is not
recommended, as every time the format is resaved, more information is discarded. This is
similar to repeatedly photocopying copies of a document – after a few copies of copies of
copies, the document becomes unreadable. So, you should only use lossy formats if the
information will not be further changed once stored and the loss of information is within
acceptable quality bounds.
4.2.3 Worked example
Again, using our spreadsheet example, we determine that no spreadsheet formats are lossy
– no information is intentionally discarded by them. However, there can be precision issues.
Again, we construct a table for our candidate formats:
                      Format A                    Format B                    Format C                    Format D
Numeric precision     5 (64-bit floating point)   3 (32-bit floating point)   5 (64-bit floating point)   5 (64-bit floating point)
Date/time precision   5                           5                           5                           5
Average               5                           4                           5                           5
We can see that there are no significant quality issues with the spreadsheet formats, except
that format B does not handle floating point numbers (e.g. 1.222343454545) to the same
level of precision as the others.
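The effect of the narrower number type can be seen by round-tripping the example value through both widths. The sketch below uses Python's standard `struct` module and assumes the formats store numbers as IEEE 754 binary floating point, which the labels in the table suggest but do not confirm; the helper names are illustrative.

```python
import struct


def store_as_float32(value):
    """Round-trip a number through a 32-bit float, as a 32-bit format would."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def store_as_float64(value):
    """Round-trip a number through a 64-bit float."""
    return struct.unpack("<d", struct.pack("<d", value))[0]


original = 1.222343454545
as_32 = store_as_float32(original)
as_64 = store_as_float64(original)

print(f"original: {original!r}")
print(f"64-bit:   {as_64!r}")
print(f"32-bit:   {as_32!r}")
print(f"32-bit error: {abs(as_32 - original):.2e}")
```

The 64-bit round trip returns the value unchanged, while the 32-bit round trip returns a nearby but different number. The difference is small, but it can matter once values are summed or compared many times.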
4.3 Resilience
File formats are vulnerable to changes over time, including accidental corruption of the
information in the file and external changes in the technology landscape. Here we define
three characteristics of a resilient format, although again you may define others which relate
to your particular environment.
Ubiquity – how widespread is the use of the format? (section 4.3.1)
Stability – how long will the format be supported by software? (section 4.3.2)
Recoverability – how resilient is the format to accidental corruption? (section 4.3.3)
4.3.1 Ubiquity
Formats which are more widely used will tend to be more resilient than those which are not.
The chance that the format will cease to be supported in future is much lower. Measuring the
ubiquity of a format must be in relation to its market sector and its natural competitors. For
example, Computer Aided Design (CAD) formats are much less widely used than document
file formats, but it would be meaningless to compare them in this way, even though in
absolute terms it may be true that CAD formats are less resilient than document formats. The
purpose of the evaluation is to assess the relative risk between formats in competition with
each other.
4.3.2 Stability
The stability of a format relates to how long the format is likely to be supported. Certain
formats which are commonly used and have become a de-facto standard will tend to be
more stable than those which are niche players, or which are new in the marketplace. In this
sense, past history can be a guide to stability.
An example is the binary Microsoft Office file formats (97-2003). Although there are already
newer Office formats, there is such a substantial body of information recorded in the older
versions, and they have retained support for so long, it is quite likely that there will be support
for them many years into the future.
Standardisation can be a clue to the stability of a format even in the absence of ubiquity. The
standardisation process does not guarantee future support for a format, but it is an indication
that the format may be supported into the future. Common media formats such as JPG are
standardised, widely used, and likely to be supported for many years.
On the other hand, the recent presence of standardised formats, or a multitude of competing
standards, can also be a clue that formats may become less stable. The presence of a
standard does not necessarily create stability, although it can be an indication of future
support.
4.3.3 Recoverability
Different file formats have varying levels of susceptibility to accidental corruption over time.
Some formats store information very densely, or in ways in which flipping a single bit (a one
to a zero, or vice versa) can prevent software from opening the entire file. In others, the
information is quite spread out, with few dependencies between different parts of a file.
In general, file formats can be divided into textual formats and binary formats.
Textual formats:
are formats in which the information is internally represented as text, even if the
information within them is not text
can be opened using normal text editors and the contents reviewed manually, or
processed programmatically fairly easily
are not very dense, and can be quite resilient in the face of small numbers of errors.
Binary formats:
use a variety of methods of representing information, which may include text, but are
not purely textual
cannot be opened in text editors
tend to represent information in a more dense form, which is more directly
understandable by computers.
However, there is no ‘standard’ for binary file formats – each is uniquely determined by the
specification or application which controls it, and they bear no relation to each other. Binary
formats tend to be less resilient in the face of small numbers of errors, and are generally
harder to access programmatically or manually, with or without corruption. It can be very
difficult to open a corrupt binary format and extract information from within it.
On top of whether a file format is binary or textual, a file format can be scrambled in a variety
of ways. Formats can be compressed (e.g. using zip software), or the contents may be
encrypted (e.g. using AES encryption). Both of these methods result in binary files, and the
information within them is typically denser and harder to recover in the face of corruption.
The particular method of compression or encryption will affect the recoverability of a file.
Many methods create a sequential dependency between parts of a file, so that corruption of a
single bit damages more than that bit. Block cipher modes that chain blocks together (for
example, Cipher Block Chaining or CBC mode) garble the whole of the affected block and part
of the next one; in extreme cases, such as many compression methods, the damage cascades
throughout the rest of the file, rendering it unintelligible.
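The difference in recoverability can be illustrated by flipping one bit in a plain text file and in a compressed copy of the same content. This is a sketch only: it uses the `zlib` module from Python's standard library as a stand-in for a compressed format, the sample text is invented, and the helper names are illustrative.

```python
import zlib


def flip_bit(data, byte_index, bit=0):
    """Return a copy of `data` with one bit inverted."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


def try_decompress(data):
    """Return the decompressed bytes, or None if zlib rejects the data."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return None


text = ("Quarterly totals are recorded in the project spreadsheet. " * 20).encode("ascii")
compressed = zlib.compress(text)

damaged_text = flip_bit(text, len(text) // 2)
damaged_compressed = flip_bit(compressed, len(compressed) // 2)

# Count how many bytes of the plain text differ after the damage.
changed = sum(a != b for a, b in zip(text, damaged_text))
recovered = try_decompress(damaged_compressed)
```

In the plain text, a single character changes and everything else remains readable. With the compressed copy, zlib will typically refuse to decompress the data at all, in which case `recovered` is `None`.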
To mitigate corruption, some file formats explicitly include error detection and
correction information, which provides redundancy at the expense of larger file sizes.
Typically, error detection and correction is provided for file formats intended for secure
backup, or for those which represent information very densely, or which are intended for
real-time delivery over networks (e.g. streaming media).
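As a minimal sketch of error detection (not correction), the following Python stores a CRC-32 checksum after the content and checks it on reading. The four-byte trailer layout and the function names are hypothetical and do not describe any particular file format; `zlib.crc32` is part of Python's standard library.

```python
import zlib


def wrap_with_checksum(payload):
    """Append a 4-byte CRC-32 of `payload` (detection only, no correction)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_intact(stored):
    """Check whether the stored payload still matches its trailing CRC-32."""
    payload, checksum = stored[:-4], stored[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == checksum


stored = wrap_with_checksum(b"Project budget: 1250000")
damaged = bytearray(stored)
damaged[3] ^= 0b00000100  # simulate a single flipped bit in storage
```

The checksum costs four extra bytes and only reveals that damage has occurred; formats that can also repair damage carry more redundancy still.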
Finally, the complexity of the information stored in a format will influence how recoverable a
format is. In general, the more complex the information, the more complex the format and
therefore the lower the recoverability tends to be. For formats intended for a single type of
information or purpose (e.g. a single image, or to compress any file), the complexity is quite
low. For formats which allow lots of different kinds of information to be arranged at the user’s
discretion (e.g. documents or CAD files), the complexity is high.
You will usually be comparing formats with similar levels of complexity, as they will be
intended for the same types of information or purpose. However, you can compare formats
with different levels of complexity where your purpose does not require complexity, but your
existing information is already complex. For example, to store text for a long time, you may
evaluate various document formats against simple text files.
4.3.4 Worked example
Assessing our four spreadsheet formats against ubiquity, stability and recoverability gives us
the following results:
                 Format A                   Format B              Format C            Format D
Ubiquity         4 (Widely used)            5 (Dominant format)   3 (Somewhat used)   2 (Not widely used)
Stability        4 (Standardised)           5 (Standardised)      3 (Superseded)      1 (Very new)
Recoverability   4 (Dense textual format)   5 (Textual format)    3 (Binary format)   3 (Binary format)
Average          4                          5                     3                   2
4.4 Flexibility
The flexibility of a format gives a measure of how well you are able to adapt to changing
requirements and use your information in the way you require. We can define two
characteristics against which to assess its flexibility.
Interoperability – how much existing software can access the format? (section 4.4.1)
Implementability – how easy is it to write software to interact with the format? (section 4.4.2)
4.4.1 Interoperability
File formats are like languages whose only speakers are certain pieces of software. The
more software that can ‘speak’ the language of your file format, the more flexibility you have
in selecting cost-effective software to access your information, and the easier it will be to
exchange information between different audiences and platforms.
When considering interoperability, think beyond the platform on which the information is most
commonly created, and consider all the platforms on which you want your information to be
accessible, and how it needs to be accessed. For example, while spreadsheets may typically
be created using a desktop application, you may require that they are readable on
Smartphones, or editable using a web browser. While the platforms defined in your scoping
exercise give you an idea of what you must support today, you should also consider future
needs and interoperability with external audiences where you don’t control the platform.
Standardisation (whether formally, or by being a de-facto standard) of a format can help in
creating the conditions for interoperability, but be aware that merely being standardised does
not ensure interoperability. A file can conform entirely to a standard, yet fail to
interoperate with other software that has interpreted the standard differently. The reasons
for this can range from poor implementation of the software, to complexity of the standard,
to the desire of a dominant vendor to create lock-in to their software.
In general, it is not enough for software to claim to support a format standard – this is a
necessary but not sufficient condition to ensure interoperability. It must also implement the
standard in a way which actually interoperates with other software. For very simple formats,
or formats which are very common (for example, text files or JPG images) this is not
generally an issue, but complex or niche formats will require testing with specific software to
gain assurance of actual interoperability.
It will not be possible for most organisations to test exchanging information in each format using
every possible piece of software, so you must pragmatically scope your scoring on
interoperability. You can gain an idea of the level of interoperability of a given file format by
examining how many pieces of software on a variety of platforms claim support for a file
format in providing the types of access that you require. You may decide that in your
business context, interoperability is only important to assess between certain defined pieces
of software or platforms, or that some interoperability is mandatory and some only desirable.
For example, you could run tests on the software you know you need to exchange information
between, and provide only subjective measures, such as the number of software packages
claiming support, for platforms that are less important to you.
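One way to record this kind of scoping is sketched below. It is illustrative only: the platform list, the placeholder software names and the split into mandatory and desirable platforms are all assumptions, and a claim of support is still only a claim until you have tested it.

```python
# Hypothetical survey results: software claiming support for one format,
# listed per platform. All names here are placeholders, not real products.
claimed_support = {
    "Windows": ["App 1", "App 2", "App 3"],
    "Mac": ["App 1"],
    "Web browser": [],
    "Smartphone": ["App 4"],
}

# Assumed split between platforms you must support and those that are
# only desirable; in practice this comes from your scoping exercise.
mandatory = {"Windows", "Mac"}
desirable = {"Web browser", "Smartphone"}


def summarise_support(support, mandatory, desirable):
    """Summarise claimed support against mandatory and desirable platforms."""
    missing = sorted(p for p in mandatory if not support.get(p))
    covered = sorted(p for p in desirable if support.get(p))
    distinct = {app for apps in support.values() for app in apps}
    return {
        "meets_mandatory": not missing,
        "missing_mandatory": missing,
        "desirable_covered": covered,
        "software_count": len(distinct),
    }


summary = summarise_support(claimed_support, mandatory, desirable)
print(summary)
```

A summary like this can feed a subjective interoperability score for the less important platforms, while the mandatory combinations are the ones worth testing directly.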
4.4.2
Implementability
It is often important to be able to automate processes, or enable access to your information
on a new platform. If you have a need to process your information like this, then you should
consider implementability.
The implementability of a file format refers to how easy it is to create bespoke software that
interoperates with your information in that format. Simple formats or those which use highly
standardised methods of recording the information will tend to be easy to implement. For
example, XML is a highly standardised method of creating textual file formats and so it is
quite easy to create software which processes information in an XML-based format.
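As an illustration of how little code an XML-based format can require, the sketch below reads a small spreadsheet-like document using only Python's standard library. The element names (sheet, row, cell) and the sample data are invented for this example and do not correspond to any real file format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML-based spreadsheet; the element and attribute names are
# invented for this sketch and do not belong to any real format.
SAMPLE = """\
<sheet name="Budget">
  <row><cell>Item</cell><cell>Cost</cell></row>
  <row><cell>Scanner</cell><cell>250</cell></row>
  <row><cell>Storage</cell><cell>400</cell></row>
</sheet>
"""


def read_rows(xml_text):
    """Return the sheet as a list of rows, each a list of cell strings."""
    root = ET.fromstring(xml_text)
    return [
        [cell.text or "" for cell in row.findall("cell")]
        for row in root.findall("row")
    ]


rows = read_rows(SAMPLE)
# Skip the header row and add up the second column.
total = sum(int(cost) for _, cost in rows[1:])
print(rows[0], total)
```

A binary format with no published specification would need far more work to reach the same point, which is the difference that implementability is intended to capture.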
The existence of Application Programming Interfaces (APIs) or Software Development Kits
(SDKs) for a format provides assurance of implementability. These are libraries of
software which you can use in your own software to access information in the format, without
having to know how to read or write the information in the format directly. They are often
provided by the vendor of the principal software used to create information in that format,
although there may be alternative APIs or SDKs provided by other organisations, or as
open-source software. You should assess whether these APIs or SDKs are available on the
platforms on which you may want to implement software, and whether they are being actively
maintained and supported.
4.4.3
Worked example
Assessing our example spreadsheets against interoperability and implementability gives us
this table:
                   Format A   Format B   Format C   Format D
Interoperability   2          3          5          3
Implementability   4          3          5          1
Average            3          3          5          2

Interoperability:
Format A – Only one vendor on Windows & Mac, only supporting basic features.
Format B – Older but good software on Mac, multiple on Windows of varying quality.
Format C – Lots of different software available, with demonstrably good interoperability.
Format D – A few good implementations on Mac, but only one on Windows.

Implementability:
Format A – APIs available on most platforms, not smartphones.
Format B – Older APIs available but only on Windows platform.
Format C – APIs available on all platforms.
Format D – No APIs, only automatable using desktop software.
4.5 Conclusions
Once you have assessed your file formats against each continuity category, you can make a
final table containing the averages of each category (see worked example in 4.5.1 below).
This enables you to see at a glance which formats have better or worse continuity properties.
The best format for your use is not necessarily the one with the highest scores against each
category. You should judge which categories are most important to you, and whether any specific
issues arose during your assessment which would influence your decision.
Given that you may not pick a format with the highest scores, it is natural to wonder whether
it is worth assessing formats at all. However, the risks of not assessing, and simply going
with the default, can be high. The greatest impact is the loss of critical business
information over time and through change. There are also other, less visible, costs, including lower
business and technical flexibility and higher information management costs.
The process of assessment gives you a picture of the trade-offs you make in choosing
various formats, and therefore enables you to use your information in the way that you need
to, for as long as you need to, and to justify any choice you may make.
Key messages
Understanding the file formats you have or plan to use is an important
factor in managing your risks to digital continuity.
Formats are not innately better or worse than one another – it depends on
the use you need to make of them and the context they are used in.
Using formats by default exposes you to continuity risk. It is better to make
an informed choice by assessing the available options.
Formal and open standards are important to the continuity of your
information, but de facto standards and ubiquity can be just as important.
Ensure you consider all four continuity factors when considering formats,
not just what meets immediate business needs.
4.5.1
Worked example
Taking our spreadsheet example once again, placing the average assessments of each of
the categories against each format gives us a final table:
              Format A   Format B   Format C   Format D
Capability    5          3          5          5
Quality       5          4          5          5
Resilience    4          5          3          2
Flexibility   3          3          5          2
We can make a more direct visual comparison of the file formats by creating charts:
Figure 1: Bar chart grouped by format
Figure 2: Radar chart grouped by category
We can see immediately that Format B has some of the lowest average scores, except for a
high score for resilience. The other formats all score equally for capability and quality.
However, resilience and flexibility vary considerably between them. Format D has very low
scores for both resilience and flexibility, so cannot be considered a good choice when set
against A and C.
There is little difference in the resilience of Formats A and C, but a big difference in flexibility.
On the basis of this assessment, Format C would seem to be the best choice, assuming that
long-term resilience of the format is not the overriding concern. Note that Format B in our
example was actually the dominant format in use at the time. It is quite common for
organisations to choose a dominant format by default. There is certainly an argument for
safety in numbers, but not an argument for choosing formats by default.
Note that the same formats could be scored very differently depending on the particular
audiences, platforms and usages you define, and the criteria deemed most important will
also vary.
This guidance does not dictate that a format with the highest scores across
all categories is necessarily the ‘best’ format.
During the process of evaluation, critical criteria may emerge which influence any
final decisions. For example, the decision here may be made to go with Format B
(safety overriding all other concerns).
Whatever decisions are made, they will be informed and justifiable, with a clear
picture emerging of the various factors which influence the continuity of your digital
information.
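The effect of deciding that one category matters more than the others can be sketched as a weighted average. The scores below are taken from the final table in this worked example; the weights, the function names and the idea of combining the categories into a single number are assumptions made for illustration, and the guidance itself does not prescribe any weighting scheme.

```python
# Category averages per format, copied from the final table.
scores = {
    "Format A": {"capability": 5, "quality": 5, "resilience": 4, "flexibility": 3},
    "Format B": {"capability": 3, "quality": 4, "resilience": 5, "flexibility": 3},
    "Format C": {"capability": 5, "quality": 5, "resilience": 3, "flexibility": 5},
    "Format D": {"capability": 5, "quality": 5, "resilience": 2, "flexibility": 2},
}


def weighted_score(category_scores, weights):
    """Return the weighted mean of one format's category scores."""
    total_weight = sum(weights.values())
    return sum(category_scores[c] * w for c, w in weights.items()) / total_weight


def rank(scores, weights):
    """Return format names ordered from highest to lowest weighted score."""
    return sorted(
        scores,
        key=lambda fmt: weighted_score(scores[fmt], weights),
        reverse=True,
    )


# Hypothetical weightings: all categories equal, then resilience dominant.
equal = {"capability": 1, "quality": 1, "resilience": 1, "flexibility": 1}
resilience_first = {"capability": 1, "quality": 1, "resilience": 5, "flexibility": 1}

print(rank(scores, equal))
print(rank(scores, resilience_first))
```

With equal weights Format C ranks first, while weighting resilience heavily moves Format B to the top. A single number like this is only an aid to discussion; specific issues that emerge during assessment may still outweigh it.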
5. Further reading
Find out more about file formats in our addendum to this document:
A Guide to File Formats nationalarchives.gov.uk/information-management/projects-and-work/dc-guidance.htm
To help in evaluating file formats, this guide will present factual information about
selected existing file formats used widely across organisations.
Wikipedia is also a good source of information on file formats:
http://en.wikipedia.org/wiki/File_format
The National Archives maintains PRONOM, an online database of information about file
formats and software which is regularly updated:
nationalarchives.gov.uk/pronom/
If you have a requirement to convert information from one file format to another, please see:
File Format Conversion nationalarchives.gov.uk/documents/format-conversion.pdf
File Format Conversion explains the issues in migrating information between different
file formats. It will enable you to understand why, when and how you should convert
file formats, and what you should convert them to.
For more information on managing your digital continuity, please see:
Understanding Digital Continuity nationalarchives.gov.uk/documents/understanding-digital-continuity.pdf
Understanding Digital Continuity is for anyone who wants to know more about digital
continuity and digital information management. It is a high level piece of guidance
providing you with an introduction to digital continuity – what it is, why it’s important
and why it’s relevant to you and your organisation.
Managing Digital Continuity nationalarchives.gov.uk/documents/managing-digital-continuity.pdf
Managing Digital Continuity is aimed at guiding you through the steps you will need to
take to successfully manage your digital information. This piece of guidance is
primarily aimed at Senior Responsible Owners (SROs) who hold overall responsibility
for ensuring digital continuity.