2.0 Best Practices for File Naming

A filename provides one form of unique identification for each digital asset that the Library
creates. A good file naming system ensures consistency, prevents file loss through accidental
overwriting, and can facilitate retrieval and processing of materials from creation onwards. File
naming conventions and practices should be determined for each digital project, or content set,
at the beginning of the project when other technical specifications (e.g., file format, resolution,
etc.) are being established. A file naming system for a specific project or content set should
employ a directory structure to help guard against filename collisions across projects.
Filenames, however, are only one form of digital asset identifier. After a file is created,
additional identifiers are associated with the digital resource. URLs, handles, PURLs, DOIs, and
CONTENTdm identifiers are examples of such additional identifiers. While file naming
conventions ensure unique identification of a digital file within the scope of a particular local
project or content set, other identifiers are needed to ensure unique identification in larger
scopes, such as on the World Wide Web or within a general archive. Non-filename identifiers also are
used to deal with issues of granularity (e.g., assigning an identifier for the entire digitized book
rather than just an individual digitized page in a book) and versioning (e.g., a corrected PDF of a
digitized book), and can be useful in expressing persistent relationships (e.g., the relationship
between a metadata record and a digitized book object, independent of updates made to the
digitized book's component files). These non-file name identifiers are addressed in a separate
document.
Table of contents
2.1 ISO Standard 9660:1999 (Level 2)
2.2 Root identifiers
• Registry of root identifiers
• Guidelines for constructing root identifiers
   o Monographs
   o Serials
   o Newspapers
   o Collections (non-ContentDM)
2.3 Subsequent directory levels
• Serials
• Monographs
• Collections (non-ContentDM)
2.4 Non-image files
2.5 ContentDM collections
• Root identifiers
• File name structure
• Collection registry
Appendix: Page Naming Conventions for Monographs and Serials
2.1 ISO Standard 9660:1999 (Level 2)
The Library follows ISO Standard 9660:1999 (Level 2) format, which defines a file system for
digital media. This standard stipulates certain restrictions on file names:
• Limit total path length to 207 characters.
• Characters used in file names are restricted to lowercase a-z, 0-9, underscore ( _ ), and
period (.).
• File names shall not include spaces; should not begin or end with a period (.); and should
contain no more than one period (.).
• Limit directory hierarchy to eight levels. Directory names should not use periods.
Names for directories, folders, and files will be no longer than 21 characters (not including
3-letter extensions) and will be unique within the context of the project.
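The restrictions above lend themselves to an automated check before files are delivered. The following is a minimal Python sketch, not part of the Library's tooling. The function name `check_path`, the wording of the messages, and the treatment of everything after the single period as the extension are assumptions made for illustration.

```python
import re

MAX_PATH_LENGTH = 207   # total path length, in characters
MAX_NAME_LENGTH = 21    # per name, excluding the extension
MAX_DEPTH = 8           # directory levels

DIR_NAME = re.compile(r"[a-z0-9_]+")     # directories: no periods
FILE_NAME = re.compile(r"[a-z0-9_.]+")   # files: period allowed


def check_path(path: str) -> list[str]:
    """Return the rule violations found in a '/'-separated relative path."""
    problems = []
    if len(path) > MAX_PATH_LENGTH:
        problems.append("path exceeds 207 characters")

    *directories, filename = path.split("/")
    if len(directories) > MAX_DEPTH:
        problems.append("more than eight directory levels")

    for name in directories:
        if not DIR_NAME.fullmatch(name):
            problems.append(f"bad directory name: {name}")
        if len(name) > MAX_NAME_LENGTH:
            problems.append(f"directory name too long: {name}")

    if not FILE_NAME.fullmatch(filename):
        problems.append(f"bad characters in file name: {filename}")
    if filename.count(".") > 1:
        problems.append(f"more than one period: {filename}")
    if filename.startswith(".") or filename.endswith("."):
        problems.append(f"leading or trailing period: {filename}")
    stem = filename.split(".", 1)[0]   # assumption: extension follows the period
    if len(stem) > MAX_NAME_LENGTH:
        problems.append(f"file name too long: {filename}")
    return problems


if __name__ == "__main__":
    for candidate in ("librarytrends/v00001/i00002/00000100000a.jp2",
                      "Library Trends/page 1.jp2"):
        print(candidate, check_path(candidate))
```

An empty list means the path passed every check in this sketch. Uniqueness within the project is not tested here, since that requires knowledge of the other files in the project.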
2.2 Root identifiers (top level of directory structure)
Each content set, be it a full-text book, a collection of related documents, a group of
photographs, etc., should be assigned a root identifier that is unique to that particular set of
content; the top level file directory for the content set should be named with this unique
identifier. Root identifiers should be no longer than 16 characters and serve as the basis for
naming the image files created from it. Uniqueness of root identifiers should be verified by
checking the root identifier against a Library-wide Registry of Root Identifiers.
• The Registry of Root Identifiers should be a centrally managed resource containing the
following information about each identifier:
a. Date the root identifier was assigned
b. Name of person assigning the identifier
c. Where the content resides
d. Finding aid that describes the content
• Guidelines for constructing root identifiers:
a. Monographs— For monographs, the first four characters will be the first four
letters of the author’s last name; the next two characters will be the first two
letters of the author’s first name; the next four characters will be a zero padded
incrementing number; the next three characters will be the first three characters
of the first word of the title (articles omitted); and the last three characters will
be the first three letters of the second word of the title. For example, the root
identifier for the book “Collected Works of Abraham Lincoln” would be
“lincab0001colwor.”
b. Serials—For serials, the unique root identifier will consist of the first 16 letters of
the journal name (excluding initial articles). If the journal name has fewer than 16
letters, the root identifier will be correspondingly shorter. An example of a root identifier for a
journal is “librarytrends”.
c. Newspapers—File naming conventions for newspapers are specific to this format
and to the presentation software used, e.g., Olive Active Paper. These
conventions are outlined in the best practices document on newspaper
digitization.
d. Collections of letters, photographs, maps, and other non-book or serial content
may already have some kind of naming and numbering scheme associated with
them that could be used as the basis for creating a unique identifier for the
content set. An effort should be made to parallel the file naming convention
described above for monographs. For example, the RBML has collections of
letters from Carl Sandburg to Lillian Sandburg and Vachel Lindsay. The root
identifier for the Carl and Lillian collection might be something like
“sandca0001sanlil” and the Carl and Vachel collection might be something like
“sandca0002linvac”.
e. Stop words—certain words may occur with such frequency that they should be
avoided when constructing root identifiers. Examples might be “association,”
“journal,” etc.
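The monograph and serial rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not an official tool: the function names are hypothetical, only a single leading English article is dropped, and stop words, very short names, one-word titles, and the registry lookup for uniqueness are not handled.

```python
import re

ARTICLES = {"a", "an", "the"}   # assumption: English articles only


def letters(text: str) -> str:
    """Lowercase the text and drop everything except a-z."""
    return re.sub(r"[^a-z]", "", text.lower())


def title_words(title: str) -> list[str]:
    """Split a title into cleaned words, dropping one leading article."""
    words = [letters(word) for word in title.split()]
    words = [word for word in words if word]
    if words and words[0] in ARTICLES:
        words = words[1:]
    return words


def monograph_root(last_name: str, first_name: str, number: int, title: str) -> str:
    """4 letters of last name + 2 of first name + 4-digit number + 3 + 3 title letters."""
    words = title_words(title)
    first_word = words[0][:3] if words else ""
    second_word = words[1][:3] if len(words) > 1 else ""
    return (f"{letters(last_name)[:4]}{letters(first_name)[:2]}"
            f"{number:04d}{first_word}{second_word}")


def serial_root(journal_name: str) -> str:
    """First 16 letters of the journal name, initial article excluded."""
    return "".join(title_words(journal_name))[:16]


if __name__ == "__main__":
    print(monograph_root("Lincoln", "Abraham", 1,
                         "Collected Works of Abraham Lincoln"))
    print(serial_root("Library Trends"))
```

Whatever such a helper produces is only a candidate: the identifier still has to be checked against the Registry of Root Identifiers before it is assigned.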
2.3 Subsequent directory levels
Under the root identifier level, subsequent directory levels should follow consistent patterns as
described below:
• Serials: The logical order for directories for image files from serial publications will be
rootidentifier/volume/issue/page, as illustrated in the example below. Volume and
issue numbers should be preceded by the lowercase letters “v” and “i”, respectively, and
zero-padded to five digits (e.g., v00001, i00002). Page image file names will be divided into
two logical components. The first six characters will contain a zero-padded, sequentially
incremented image sequence number. The final six characters will contain a representation of
the page number as printed on the page. See the Appendix for more specifics on dealing with
unnumbered pages, prefatory matter numbered with Roman numerals, and other deviations.

Root identifier     Volume    Issue #   Page #s
librarytrends       v00001    i00002    00000100000a.jp2
                                        000002000001.jp2

• Monographs: The logical order of directories for image files for monographs will be
rootidentifier/volume/placeholder_for_issue/page. The two directory levels under the
root identifier will usually serve only as dummy directory levels so that the basic
directory structure for monographs and serials is the same. However, multi-volume
monographic sets will use the volume level directory (e.g., v00002). Issue numbers for
monographs will always be “i00000”.

Root identifier     Volume    Issue #   Page #s
lincab0001colwor    v00000    i00000    00000100000a.jp2
                                        000002000001.jp2

• Collections: Generally, the directory levels below the root identifier directory level
should make sense in the context of the project. For instance, there may be just one
additional level containing the individual content files; or a collection of letters might be
best described by using the volume level for the year in which the letters were written and the
issue level for the month. Subsequent directory levels should employ the zero-padding
convention described above. This will ensure that the documents sort as expected
under the character-by-character sort order used by ASCII-based computer systems. Use
enough digits to accommodate the maximum number of items in the collection.
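The directory pattern and the effect of zero-padding on sort order can be illustrated with a short Python sketch. The function names `level` and `page_image_path` are hypothetical, and the page designation is assumed to have been worked out already according to the Appendix.

```python
def level(prefix: str, number: int) -> str:
    """Build a volume or issue directory name such as v00001 or i00002."""
    return f"{prefix}{number:05d}"


def page_image_path(root: str, volume: int, issue: int,
                    sequence: int, page: str, extension: str = "jp2") -> str:
    """Assemble rootidentifier/volume/issue/page for one page image.

    `sequence` is the image sequence number; `page` is the page designation
    (for example "a" or "1"), left-padded here with zeros to six characters.
    """
    file_name = f"{sequence:06d}{page:0>6}.{extension}"
    return f"{root}/{level('v', volume)}/{level('i', issue)}/{file_name}"


if __name__ == "__main__":
    # Serial: volume 1, issue 2
    print(page_image_path("librarytrends", 1, 2, 1, "a"))
    # Monograph: dummy volume and issue levels
    print(page_image_path("lincab0001colwor", 0, 0, 2, "1"))

    # Unpadded names sort character by character, so "v10" lands before "v2";
    # padded names sort in numeric order.
    print(sorted(["v2", "v10"]))
    print(sorted([level("v", 2), level("v", 10)]))
```

The last two lines show why padding matters: a plain string sort places `v10` ahead of `v2`, while `v00002` correctly precedes `v00010`.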
2.4 File naming conventions for non-image files
In addition to image files, other files are often created in a digitization project. Among these
are OCR, PDF, XML, and encoding files. The following conventions should be followed for these
files:
• The file naming convention for the OCR file will be rootidentifier_ocr.txt.
• The file naming convention for PDF files will be rootidentifier.pdf.
• The file naming convention for XML files will be rootidentifier.xml.
• The file naming convention for TEI-encoded files will be rootidentifier_tei.xml.
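As a small illustration, the four patterns can be derived from a root identifier in one place. This Python sketch uses a hypothetical function name and dictionary keys; only the file name patterns themselves come from the conventions above.

```python
def companion_files(root: str) -> dict[str, str]:
    """Map each non-image file type to its file name for a root identifier."""
    return {
        "ocr": f"{root}_ocr.txt",
        "pdf": f"{root}.pdf",
        "xml": f"{root}.xml",
        "tei": f"{root}_tei.xml",
    }


if __name__ == "__main__":
    for kind, name in companion_files("lincab0001colwor").items():
        print(kind, name)
```

With a 16-character root identifier, the longest of these names has 20 characters before the extension, which stays within the 21-character limit in section 2.1.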
2.5 ContentDM collections
For image collections going into CONTENTdm, a file name should be created for each digitized
image, both access and master. The image file name consists of a three-letter root identifier
followed by a seven-character number and letter combination: seven digits for a simple object,
or six digits plus a letter when the image belongs to a compound object such as a postcard or
pamphlet. The file name, with the proper file name extension, should be included in the
metadata of the item. (Please see the minimum metadata element requirements
for CONTENTdm collections.)
• Root identifiers
The root identifier serves as a collection identifier, grouping all associated items
into the collection to which they belong. Since most of the collections reside in CONTENTdm,
we recommend basing the root identifier on the collection alias. When a collection is added
to CONTENTdm, a unique alias is created for it.
The alias can be more than three letters. For the root identifier, please use the first three
letters of the collection alias. (For example, a collection whose alias begins with “emb” has
the root identifier ‘emb.’)
• File name structure
A seven-character number and letter combination will follow the root identifier. If you have a
compound object (e.g., a postcard or pamphlet), each item will share the same number,
but each image will have a different letter. This structure can be seen in the following
examples:
1. Simple object: abc1000000
2. Postcard:
o Front – abc200000a
o Back – abc200000b
3. Pamphlet 1:
o Cover – abc300000a
o Page 1 – abc300000b
o Page 2 – abc300000c
4. Pamphlet 2:
o Cover – abc400000a
o Page 1 – abc400000b
o Page 2 – abc400000c
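The pattern in these examples can be generated programmatically. The Python sketch below is an illustration only. It assumes, based on the examples, that a simple object takes seven digits and that each image of a compound object takes six digits followed by a letter; the function names are hypothetical.

```python
from string import ascii_lowercase


def simple_name(root: str, number: int) -> str:
    """Three-letter root identifier followed by a seven-digit number."""
    return f"{root}{number:07d}"


def compound_names(root: str, number: int, parts: int) -> list[str]:
    """One name per image of a compound object: six digits plus a letter."""
    if parts > len(ascii_lowercase):
        raise ValueError("this sketch supports at most 26 images per object")
    return [f"{root}{number:06d}{letter}" for letter in ascii_lowercase[:parts]]


if __name__ == "__main__":
    print(simple_name("abc", 1000000))        # simple object
    print(compound_names("abc", 200000, 2))   # postcard: front, back
    print(compound_names("abc", 300000, 3))   # pamphlet: cover, page 1, page 2
```

Objects with more than 26 images are not covered by the examples, so the sketch raises an error rather than guessing at a convention.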
• Collection registry
To keep each root identifier unique, a formal registry of root identifiers and collections
is needed. The registry should include the following information for
administrative purposes. These elements are derived from the Dublin Core Collection Description
Application Profile (http://dublincore.org/groups/collections/collection-applicationprofile/2006-08-24/) and the Illinois Harvest Collection Description Application Profile
(\\libgrtyr\harvests\IllinoisHarvest\projectManagement\Illinois Harvest\Collection-Level
Metadata).
The location and management of the registry should be discussed further.
Element | Label | Definition
dc:identifier | Root identifier | The unique root identifier of the collection.
dc:title | Collection title | Title of the collection.
dc:creator | Collection coordinator | Collection coordinator.
vcard:UID | Email | Contact information of the collection coordinator.
dc:description | Collection description | Collection description that could include collection development policy, uniqueness, and other relevant information about the collection.
dc:source | Physical collection | Location of the physical collection.
dc:date | Date | Date information of when the collection was created.
dct:extent | Size | Size of the collection, usually the number of items in the collection.
dc:rights | Rights | Rights statement of the collection.
dcterms:accrualMethod | Completeness | Indicate whether the collection is complete or not.
dc:contributor | Contributor | People involved in the collection creation. Add the CONTENTdm ID.
APPENDIX: Page Naming Conventions for Monographs and Serials
Page image file names will be divided into two logical components. The first six characters will
contain a leading zero padded sequentially incremented image sequence number. The final six
characters will contain a representation of the page number as printed on the page, formulated
according to the following rules:
a) Every image should be assigned a logical page number or appropriate tag to
accompany it. This page number or tag will be used in the image file header and, if
necessary to accommodate ISO 9660 file name restrictions, in the image file name.
b) For the purposes of these instructions, the word “pagination” will refer to the logical
sequential pagination of a series of pages. For example, a page without a printed number
on it which is located between pages imprinted 2 and 4 can be assumed to be page 3.
Similarly, four pages without page numbers printed on them followed by pages 5, 6, 7,
etc., can be assumed to be pages 1, 2, 3 and 4.
c) All pages that are included within the logical pagination should be designated with their
actual page numbers.
d) The first page of every volume will be the production note; the second page will be the
outside front cover; and the third page will be the inside front cover (which may or may
not contain a bookplate). These will always be designated 00000a, 00000b, and 00000c
respectively. If there are more pages before the logical pagination begins, they will be
designated 00000d, 00000e, 00000f, etc.
e) Pagination that appears as Roman numerals in the original will be translated into Arabic
numbers and prefixed with a lowercase “r” for file names (e.g., page vii becomes page
r00007, etc.). In the absence of printed page numbers, it is to be assumed that Roman
numerals continue until logical Arabic pagination commences. In the situation where
sequential pagination continues through a change from Roman numerals to Arabic
numerals, the Arabic numerals will be assumed to start at the change in type of document
content.
f) When there are pages in the material which are not included in the sequential pagination
(commonly occurring with plates), the pages will be designated by the number of the
preceding paginated page followed by a trailing letter, which will increase sequentially
for each page (e.g., 000031, 000032, 00032a, 00032b, 000033, 000034).
g) When pages are numbered incorrectly in the original material, the correct logical
pagination should be used in the image file header and file name, unless otherwise
specified in the pagination instructions which will accompany each volume.
h) When pagination restarts result in duplicate page numbers in the same volume, the
longest section will have its pages recorded unamended. Shorter segments will be
recorded with a letter preceding the number to differentiate them from similarly numbered
pages in the same volume (e.g., a00001, a00002, a00003, a00004; b00001, b00002,
b00003, b00004). Note: Letters that will not be used in this situation are i, l, o, and r. This
procedure does not need to be used for a Roman numeral section if it is the only one in
the volume. If there is more than one, the shorter section(s) will be differentiated in the
same manner as Arabic numbers (e.g., ra0001, ra0002, ra0003, ra0004; rb0001, rb0002,
rb0003, rb0004).
i) Page numbers which actually contain letter prefixes will be recorded according to the
same rules as standard Arabic numbered pages, except that punctuation between the
prefix and the number should be dropped. Thus a page from Appendix A which is labeled
A-9 should be recorded as 0000a9.
j) Page numbers containing characters that are not permitted in ISO 9660 file names should
be recorded with an underscore character in place of the illegal character. For example,
page 22.6 should be recorded as 22_6. In situations such as this, the unmodified page
number should be recorded in the image file header.
k) Adornments around page numbers (for example, a dash, asterisk, parenthesis, or square
bracket that precedes and follows the page number) should be ignored (not
entered).
l) Any pagination situation that falls outside those described above will be noted in the
pagination instructions that will accompany the volume. These instructions will include
how the pages in question should be designated in the image file header and file name. If
the vendor discovers a situation that is not described above and not commented on in the
worksheet, they will contact UIUC for instructions about how to proceed.
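Several of the rules above are mechanical enough to sketch in Python: plain Arabic numbers, Roman numerals (rule e), letter prefixes (rule i), illegal characters (rule j), and adornments (rule k). This is an illustrative sketch with hypothetical function names, not the procedure used by the Library or its vendors. Rules that depend on the surrounding pages (d, f, g, h) are not handled, and padding the rule j result to six characters is an assumption.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'vii' or 'xiv' to an integer."""
    values = [ROMAN_VALUES[ch] for ch in numeral.lower()]
    total = 0
    for index, value in enumerate(values):
        # A smaller value before a larger one is subtracted (iv = 4).
        if index + 1 < len(values) and value < values[index + 1]:
            total -= value
        else:
            total += value
    return total


def designation(printed: str) -> str:
    """Turn a printed page label into a six-character page designation."""
    label = printed.strip().lower().strip("-*()[] ")   # rule k: drop adornments
    if re.fullmatch(r"[0-9]+", label):                 # plain Arabic number
        return f"{int(label):06d}"
    if re.fullmatch(r"[ivxlcdm]+", label):             # rule e: Roman numerals
        return f"r{roman_to_int(label):05d}"
    prefixed = re.fullmatch(r"([a-z]+)[^a-z0-9]*([0-9]+)", label)
    if prefixed:                                       # rule i: A-9 becomes 0000a9
        return (prefixed.group(1) + prefixed.group(2)).rjust(6, "0")
    # rule j: replace characters not allowed in file names with an underscore
    return re.sub(r"[^a-z0-9_]", "_", label).rjust(6, "0")


def page_file_name(sequence: int, printed: str, extension: str = "jp2") -> str:
    """Six-digit image sequence number followed by the page designation."""
    return f"{sequence:06d}{designation(printed)}.{extension}"


if __name__ == "__main__":
    for label in ("33", "vii", "A-9", "[12]", "22.6"):
        print(label, designation(label))
    print(page_file_name(2, "1"))
```

A label made up only of the letters i, v, x, l, c, d, and m is always read as a Roman numeral in this sketch, so ambiguous labels would still need a person to check them against the pagination instructions.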