2.0 Best Practices for File Naming

A filename provides one form of unique identification for each digital asset that the Library creates. A good file naming system ensures consistency, prevents file loss through accidental overwriting, and facilitates retrieval and processing of materials from creation onwards. File naming conventions and practices should be determined for each digital project, or content set, at the beginning of the project, when other technical specifications (e.g., file format and resolution) are being established. A file naming system for a specific project or content set should employ a directory structure to help guard against filename collisions across projects.

Filenames, however, are only one form of digital asset identifier. After a digital resource is created, additional identifiers are associated with it. URLs, handles, PURLs, DOIs, and CONTENTdm identifiers are examples of such additional identifiers. While filename conventions ensure unique identification of a digital file within the scope of a particular local project or content set, other identifiers are needed to ensure unique identification in larger scopes, such as on the World Wide Web or within a general archive. Non-filename identifiers are also used to deal with issues of granularity (e.g., assigning an identifier for the entire digitized book rather than just an individual digitized page in a book) and versioning (e.g., a corrected PDF of a digitized book), and can be useful in expressing persistent relationships (e.g., the relationship between a metadata record and a digitized book object, independent of updates made to the digitized book's component files). These non-filename identifiers are addressed in a separate document.
Table of contents

2.1 ISO Standard 9660:1999 (Level 2)
2.2 Root identifiers
   • Registry of root identifiers
   • Guidelines for constructing root identifiers
      o Monographs
      o Serials
      o Newspapers
      o Collections (non-ContentDM)
2.3 Subsequent directory levels
   • Serials
   • Monographs
   • Collections (non-ContentDM)
2.4 Non-image files
2.5 ContentDM collections
   • Root identifiers
   • File name structure
   • Collection registry
Appendix: Page Naming Conventions for Monographs and Serials

2.1 ISO Standard 9660:1999 (Level 2)

The Library follows the ISO Standard 9660:1999 (Level 2) format, which defines a file system for digital media. This standard stipulates certain restrictions on file names:

• Limit total path length to 207 characters.
• Characters used in file names are restricted to lowercase a-z, 0-9, underscore ( _ ), and period (.).
• File names shall not include spaces; should not begin or end with a period (.); and should contain no more than one period (.).
• Limit the directory hierarchy to eight levels.

Directory names should not use periods. Names for directories, folders, and files will be no longer than 21 characters (not including 3-letter extensions) and will be unique within the context of the project.

2.2 Root identifiers (top level of directory structure)

Each content set, be it a full-text book, a collection of related documents, a group of photographs, etc., should be assigned a root identifier that is unique to that particular set of content; the top-level file directory for the content set should be named with this unique identifier. Root identifiers should be no longer than 16 characters and serve as the basis for naming the image files created from the content set. Uniqueness of a root identifier should be verified by checking it against a Library-wide Registry of Root Identifiers.

• The Registry of Root Identifiers should be a centrally managed resource containing the following information about each identifier:
   a. Date the root identifier was assigned
   b. Name of person assigning the identifier
   c. Where the content resides
   d. Finding aid that describes the content
• Guidelines for constructing root identifiers:
   a. Monographs: The first four characters will be the first four letters of the author’s last name; the next two characters will be the first two letters of the author’s first name; the next four characters will be a zero-padded incrementing number; the next three characters will be the first three letters of the first word of the title (articles omitted); and the last three characters will be the first three letters of the second word of the title. For example, the root identifier for the book “Collected Works of Abraham Lincoln” would be “lincab0001colwor”.
   b. Serials: The unique root identifier will consist of the first 16 letters of the journal name (excluding initial articles). If the journal name has fewer than 16 letters, the root identifier will be correspondingly shorter. An example of a root identifier for a journal is “librarytrends”.
   c. Newspapers: File naming conventions for newspapers are specific to this format and to the presentation software used, e.g., Olive Active Paper. These conventions are outlined in the best practices document on newspaper digitization.
   d. Collections: Collections of letters, photographs, maps, and other non-book, non-serial content may already have some kind of naming and numbering scheme associated with them that could be used as the basis for creating a unique identifier for the content set. An effort should be made to parallel the file naming convention described above for monographs. For example, the RBML has collections of letters from Carl Sandburg to Lillian Sandburg and to Vachel Lindsay. The root identifier for the Carl and Lillian collection might be something like “sandca0001sanlil”, and for the Carl and Vachel collection something like “sandca0002linvac”.
   e. Stop words: Certain words may occur with such frequency that they should be avoided when constructing root identifiers. Examples might be “association” and “journal”.

2.3 Subsequent directory levels

Under the root identifier level, subsequent directory levels should follow consistent patterns as described below:

• Serials: The logical order of directories for image files from serial publications will be rootidentifier/volume/issue/page, as illustrated in the example below. Volume and issue numbers should be preceded by the letters “v” and “i”, respectively, and zero-padded to five digits (e.g., v00001, i00002). Page image file names will be divided into two logical components. The first six characters will contain a zero-padded, sequentially incremented image sequence number. The final six characters will contain a representation of the page number as printed on the page. See the Appendix for more specifics on dealing with unnumbered pages, prefatory matter numbered with Roman numerals, and other deviations.

   Root identifier    Volume   Issue    Page image files
   librarytrends      v00001   i00002   00000100000a.jp2
                                        000002000001.jp2

• Monographs: The logical order of directories for image files for monographs will be rootidentifier/volume/placeholder_for_issue/page. The two directory levels under the root identifier will usually serve only as dummy directory levels, so that the basic directory structure for monographs and serials is the same. However, multi-volume monographic sets will use the volume-level directory (e.g., v00002). Issue numbers for monographs will always be “i00000”.

   Root identifier    Volume   Issue    Page image files
   lincab0001colwor   v00000   i00000   00000100000a.jp2
                                        000002000001.jp2

• Collections: Generally, the directory levels below the root identifier directory level should make sense in the context of the project.
For instance, there may be just one additional level containing the individual content files; or a collection of letters might be best described by using the volume level for the year the letters were written and the issue level for the month. Subsequent directory levels should employ the zero-padding convention described above. This ensures that the documents sort as expected under the natural sort order of ASCII-based computer systems. Make sure you start with enough zeros to accommodate the maximum number of items in the collection.

2.4 File naming conventions for non-image files

In addition to image files, other files are often created in a digitization project. Among these are OCR, PDF, XML, and encoding files. The following conventions should be followed for these files:

• The file naming convention for the OCR file will be rootidentifier_ocr.txt.
• The file naming convention for PDF files will be rootidentifier.pdf.
• The file naming convention for XML files will be rootidentifier.xml.
• The file naming convention for the TEI-encoded file will be rootidentifier_tei.xml.

2.5 ContentDM collections

For image collections going into CONTENTdm, file names should be created for each digitized image, both access and master. The image file name consists of a three-letter root identifier followed by a seven-character combination of digits and, when the object is a compound object such as a postcard or pamphlet, a trailing letter. The file name should be included in the metadata of the item with a proper file name extension. (Please see the minimum requirement of the metadata element for CONTENTdm collections.)

• Root identifiers
The root identifier works as a collection identifier and groups all the associated items into the collection where they belong. Since most of the collections reside in CONTENTdm, we recommend using the collection alias as the basis for the root identifier. When the collection is added into CONTENTdm, we create a unique alias for each collection.
The alias can be more than three letters. For a root identifier, please use the first three letters of the collection alias. (For this collection, the root identifier should be ‘emb’.)

• File name structure
A seven-character combination of digits and letters will follow the root identifier. If you have a compound object (e.g., a postcard or pamphlet), each item will share the same number but each image will have a different letter. This structure can be seen in the following examples:

   1. Simple object: abc1000000
   2. Postcard:
      o Front: abc200000a
      o Back: abc200000b
   3. Pamphlet 1:
      o Cover: abc300000a
      o Page 1: abc300000b
      o Page 2: abc300000c
   4. Pamphlet 2:
      o Cover: abc400000a
      o Page 1: abc400000b
      o Page 2: abc400000c

• Collection registry
To keep each root identifier unique, a formal registry of root identifiers and collections is needed. The registry should include the following information for administrative purposes. These elements are derived from the Dublin Core Collection Description Application Profile (http://dublincore.org/groups/collections/collection-applicationprofile/2006-08-24/) and the Illinois Harvest Collection Description Application Profile (\\libgrtyr\harvests\IllinoisHarvest\projectManagement\Illinois Harvest\Collection-Level Metadata). The location and management of the registry should be discussed further.

   Element                 Label                    Definition
   dc:identifier           Root identifier          The unique root identifier of the collection
   dc:title                Collection title         Title of the collection
   dc:creator              Collection coordinator   Collection coordinator
   vcard:UID               Email                    Contact information of the collection coordinator
   dc:description          Collection description   Collection description that could include collection development policy, uniqueness, and other relevant information about the collection
   dc:source               Physical collection      Location of the physical collection
   dc:date                 Date                     Date when the collection was created
   dct:extent              Size                     Size of the collection, usually the number of items in the collection
   dc:rights               Rights                   Rights statement for the collection
   dcterms:accrualMethod   Completeness             Indicates whether the collection is complete or not
   dc:contributor          Contributor              People involved in the collection creation

Add the CONTENTdm ID.

APPENDIX: Page Naming Conventions for Monographs and Serials

Page image file names will be divided into two logical components. The first six characters will contain a zero-padded, sequentially incremented image sequence number. The final six characters will contain a representation of the page number as printed on the page, formulated according to the following rules:

a) Every image should be assigned a logical page number or appropriate tag. This page number or tag will be used in the image file header and, where necessary modified to accommodate ISO 9660 file name restrictions, in the image file name.
b) For the purposes of these instructions, the word “pagination” will refer to the logical sequential pagination of a series of pages. For example, a page without a printed number that is located between pages imprinted 2 and 4 can be assumed to be page 3. Similarly, four pages without printed page numbers followed by pages 5, 6, 7, etc., can be assumed to be pages 1, 2, 3, and 4.
c) All pages that are included within the logical pagination should be designated with their actual page numbers.
d) The first page of every volume will be the production note; the second page will be the outside front cover; and the third page will be the inside front cover (which may or may not contain a bookplate). These will always be designated 00000a, 00000b, and 00000c, respectively. If there are more pages before the logical pagination begins, they will be designated 00000d, 00000e, 00000f, etc.
e) Pagination that appears as Roman numerals in the original will be translated into Arabic numbers and given a leading “r” in file names (e.g., page vii becomes r00007). In the absence of printed page numbers, it is to be assumed that Roman numerals continue until logical Arabic pagination commences. Where sequential pagination continues through a change from Roman numerals to Arabic numerals, the Arabic numerals will be assumed to start at the change in type of document content.
f) When there are pages in the material that are not included in the sequential pagination (commonly occurring with plates), the pages will be designated by the number of the preceding paginated page followed by a trailing letter that increases sequentially for each page (e.g., 000031, 000032, 00032a, 00032b, 000033, 000034).
g) When pages are numbered incorrectly in the original material, the correct logical pagination should be used in the image file header and file name, unless otherwise specified in the pagination instructions that will accompany each volume.
h) When pagination restarts result in duplicate page numbers in the same volume, the longest section will have its pages recorded unamended. Shorter sections will be recorded with a letter preceding the number to differentiate them from similarly numbered pages in the same volume (e.g., a00001, a00002, a00003, a00004; b00001, b00002, b00003, b00004). Note: The letters i, l, o, and r will not be used in this situation. This procedure does not need to be used for a Roman numeral section if it is the only one in the volume. If there is more than one, the shorter section(s) will be differentiated in the same manner as Arabic-numbered sections (e.g., ra0001, ra0002, ra0003, ra0004; rb0001, rb0002, rb0003, rb0004).
i) Page numbers which actually contain letter prefixes will be recorded according to the same rules as standard Arabic numbered pages, except that punctuation between the prefix and the number should be dropped. Thus a page from Appendix A which is labeled A-9 should be recorded as 0000a9.
j) Page numbers containing characters that are not permitted in ISO 9660 file names should be recorded with an underscore character in place of the illegal character. For example, page 22.6 should be recorded as 22_6. In situations such as this, the unmodified page number should be recorded in the image file header.
k) Adornments around page numbers, such as if the page number is both preceded and followed by a dash, asterisk, parenthesis, square bracket, etc., should be ignored (not entered).
l) Any pagination situation that falls outside those described above will be noted in the pagination instructions that will accompany the volume. These instructions will include how the pages in question should be designated in the image file header and file name. If the vendor discovers a situation that is not described above and not commented on in the worksheet, they will contact UIUC for instructions about how to proceed.
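
The page naming rules in this appendix, together with the rootidentifier/volume/issue/page layout from section 2.3 and the character restrictions from section 2.1, can be illustrated with a short Python sketch. This is only one possible illustration and not part of the Library's specification: all function names below (is_valid_file_name, arabic_label, roman_label, page_file_name, page_path) are hypothetical, the default "jp2" extension is taken from the examples in section 2.3, and only the simple cases are covered (rules c, d, e, and f). Situations such as restarted pagination (rule h) would still need the pagination instructions that accompany each volume.

```python
import re

SEQ_WIDTH = 6    # image sequence number: first six characters
LABEL_WIDTH = 6  # page label: final six characters

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def is_valid_file_name(name: str) -> bool:
    """Check a file name against the section 2.1 restrictions (hypothetical helper)."""
    if not re.fullmatch(r"[a-z0-9_.]+", name):
        return False  # only lowercase a-z, 0-9, underscore, and period
    if name.startswith(".") or name.endswith(".") or name.count(".") > 1:
        return False  # at most one period, and not at either end
    stem = name.split(".")[0]
    return len(stem) <= 21  # 21-character limit, extension not counted


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'vii' to an integer."""
    values = [ROMAN_VALUES[ch] for ch in numeral.lower()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (e.g. 'iv' is 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def arabic_label(page: int, suffix: str = "") -> str:
    """Rules c, d, f: zero-padded page number with an optional trailing letter.

    Front matter (rule d) is page 0 plus a letter, e.g. arabic_label(0, "a").
    """
    width = LABEL_WIDTH - len(suffix)
    return f"{page:0{width}d}{suffix}"


def roman_label(numeral: str) -> str:
    """Rule e: Roman-numbered page as 'r' plus a zero-padded Arabic number."""
    width = LABEL_WIDTH - 1
    return f"r{roman_to_int(numeral):0{width}d}"


def page_file_name(sequence: int, label: str, ext: str = "jp2") -> str:
    """Join the image sequence number and the page label into a file name."""
    name = f"{sequence:0{SEQ_WIDTH}d}{label}.{ext}"
    if not is_valid_file_name(name):
        raise ValueError(f"file name violates section 2.1 restrictions: {name!r}")
    return name


def page_path(root: str, volume: int, issue: int, file_name: str) -> str:
    """Build rootidentifier/volume/issue/page as described in section 2.3."""
    return "/".join([root, f"v{volume:05d}", f"i{issue:05d}", file_name])


if __name__ == "__main__":
    # Production note page (first image) of Library Trends, volume 1, issue 2.
    first = page_file_name(1, arabic_label(0, "a"))
    print(page_path("librarytrends", 1, 2, first))
    # Page vii as the eighth image of a monograph (dummy volume and issue levels).
    print(page_path("lincab0001colwor", 0, 0, page_file_name(8, roman_label("vii"))))
```

Keeping the label construction separate from the sequence number mirrors the two logical components of the page image file name, so a change to one rule (for example, how plates are lettered) does not affect how the sequence number is produced.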