Metadata Workflow Discussion
9/9/13, Gorgas 402

Attendees: Jason Battles (chair), Millie Jackson, Mary Bess Paluzzi, Janet Lee-Smeltzer, Donnelly Lancaster Walton, Jody DeRidder (minutes), Will Jones, Mary Alexander, April Burnett, Jeremiah Colonna-Romano.

Intro: Jason Battles

This was a kick-off meeting to establish a common understanding of how our metadata efforts interact. The goal is to take detailed workflow documentation from all three areas, create consistency, and develop two levels of documentation: an executive summary, and a second level of specific information, linked as needed from the top-level overview. To achieve this, Jason proposed that we work through each area, starting with the archivists and moving forward through the pipeline, to get a sense of when, where, and how things change or get added. We are looking for ways to improve, as well as seeking to understand the process; we also want to eliminate misperceptions. At a later date, we hope to use our findings to inform Acumen development of search, retrieval, and display. Along the way, we also hope to better understand the level of processing, who does what, and with what results. This discussion needs to be user-based: the outcome for the user is the primary consideration. To start, these meetings will occur every two weeks, but they may later shift to monthly.

Jason then asked Donnelly to begin with a description of how finding aids are created and how content is selected and prepared for digitization.

Archivists' Process: Donnelly Lancaster Walton and April Burnett

After seeking clarification on whether to focus on previous methods or current ones, Donnelly described the current archivist workflow. The archivists process collections and store all collection information in Archivists' Toolkit (AT), which will be replaced by ArchivesSpace this fall (the change should be seamless). They select a collection in AT and process it until the finding aid meets DACS (Describing Archives: A Content Standard, officially approved by the Society of American Archivists). DACS is integrated into AT: the software displays the applicable rules when hovering over fields during data entry. DACS is primarily collection-based and does not address many item-level concerns, so for the controlled vocabulary for names they use the Library of Congress Name Authority File (LCNAF), which usually corresponds to DACS. They are also considering using the ACRL DCRM(MSS) (Descriptive Cataloging of Rare Materials (Manuscripts)) in the future for item-level data entry. If DACS does not answer a question about how to input something, they consult the Chicago Manual of Style. They also have an in-house processing manual.

From the processed collections, content is selected for the digitization queue based on the following qualifications:
- No copyright issues
- No preservation issues
- There must be a finding aid
- If there is known demand for access to the collection, it gets priority

After deciding on content, they enter information about the selected collections in the Selection spreadsheet (on the share drive in S:\Digital Projects\Organization\Digital_Program). Unless something is bumped up in priority, April works through this list in order, creating item-level metadata. When she is finished, she cuts and pastes the entry for the collection from one tab of this spreadsheet to another.

[The Selection spreadsheet contains the following tabs: Sandbox, In Progress, Queue, DS inProgress, DS Digitization Complete.
Each tab represents a step in the pipeline on a collection-level basis, so that anyone can tell at a glance where a collection is in the progression from the archivists through digitization:
- Sandbox: for sharing ideas among the archivists about possible content for digitization.
- In Progress: content identified for April to create item-level metadata.
- Queue: content waiting for Digital Services to digitize.
- DS inProgress: content undergoing digitization.
- DS Digitization Complete: content digitized and online.
The spreadsheet originally started as a single page with color-coding, but has evolved to this form because it is more practical. Movement of entries is a simple cut and paste, and occurs as content moves through the pipeline.]

April uses the M01 spreadsheet supplied by the metadata librarians (see Template Registry) and the metadata librarians' input guidelines (see the Metadata Creation section here: https://intranet.lib.ua.edu/cataloging/metadata ). April creates item-level names but not subjects, unless something stands out. When she does add a subject, she also adds it to AT for the collection. The finding aid is not released until the collection is ready for digitization, so that April can correct it as she works through the content.

When April has finished creating the metadata, she:
- Moves the collection row in the Selection spreadsheet to the Queue tab
- Creates directories in the Digital Services area on the share drive (S:\Digital Projects\Digital_Coll_in_progress\Digital_Coll_Waiting) according to Digital Services protocols
- Places the metadata spreadsheet for the collection in the Metadata folder she has just created
- Sends an email to Digital Services to notify them that a new collection is ready

Finding aids are placed in the "new" or "remediated" folders in S:\Special Collections\Digital_Program_files\EAD, where they are picked up every Friday night for processing and web delivery. (At this point, Jody clarified that the finding aids follow a different path than the item-level content. We agreed to continue following the item-level content for now and come back to the finding aids later.)

Digital Services Process: Jeremiah Colonna-Romano

The overview for the Digital Services workflow is online. Item-level digitization is organized according to staff production needs. Some of the things that must be considered include:
- Types of content
- Formats of material
- Size of material
- Different hardware needs
- Availability of personnel trained on that hardware and type of content

Digital Services must juggle these factors, particularly with collections that contain a variety of materials, as parts of them may need to be captured on different stations, potentially by different personnel. The material exchange process for obtaining and returning boxes is documented in S:\Digital Projects\Administrative\Pipeline\Material_exchange_pipeline and works well.

Staff members select a collection entry from the Queue tab in the Selection spreadsheet, move that row to the next tab (DS inProgress), and copy the information into an XML collection file, which later serves to feed the database for browsing collections and acts as a landing page in Acumen when no EAD is yet online. This file is named appropriately and placed in the Admin directory for the collection, and the collection directories are moved into S:\Digital Projects\Digital_Coll_in_progress.
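[Note: the schema of the collection XML file was not specified in the meeting. Purely as an illustrative sketch, the step of writing such a file into a collection's Admin directory might look like the following Python snippet; the element names, file naming, and paths are assumptions, not the actual convention.]

```python
# Hypothetical sketch only: the real collection XML schema, element names, and
# file-naming convention are defined by Digital Services, not by these minutes.
from pathlib import Path
import xml.etree.ElementTree as ET

def write_collection_stub(admin_dir, collection_id, title, finding_aid_url=None):
    """Write a minimal collection-level XML stub into the Admin directory."""
    root = ET.Element("collection", id=collection_id)    # hypothetical element names
    ET.SubElement(root, "title").text = title
    if finding_aid_url:
        ET.SubElement(root, "finding_aid").text = finding_aid_url
    out_path = Path(admin_dir) / f"{collection_id}.xml"  # naming convention assumed
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
    return out_path

# Example usage (identifier and path are made up):
# write_collection_stub(r"S:\Digital Projects\Digital_Coll_in_progress\u0001_example\Admin",
#                       "u0001_example", "Example Manuscript Collection")
```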
Additional columns for tracking and facilitating our work are temporarily added to the existing metadata spreadsheet provided by April:
- Number of Captures
- Captured With
- Captured By
- Date
- OCR? (1=yes or 0=no)
- DS Notes
- Metadata changed

(These columns will later be exported as a tab-delimited text log file for storage in the archive, and deleted from the spreadsheet before it is transferred to the metadata librarians. The remainder, April's work, will be exported as tab-delimited metadata for translation into MODS to be uploaded to Acumen. There was some discussion about the fact that the DS export scripts correct for embedded encodings that may have been inadvertently incorporated into the spreadsheet from MS Word, PDF, OCLC, or elsewhere; an illustrative sketch of this kind of cleanup appears after the script notes below. These export scripts may be helpful for the metadata librarians, who will also be working with the same spreadsheets.)

When a staff member prepares to digitize a box of content, they first compare the actual items to the metadata in the spreadsheet. Any anomalies (such as torn pages, missing items, or content too fragile to digitize) are noted in the "DS Notes" column, and if page numbers need to be corrected, that is noted in the "Metadata changed" column. Captures are made and progress is logged, including whether an item should be processed for OCR (optical character recognition) capture. The movement of metadata after capture is described online.

After the collection, or a batch of the collection, is completed, the collection is moved to S:\Digital Projects\Digital_Coll_Complete. The content here undergoes two levels of quality control (QC) review: one by the digitizer, and another by an assigned peer or supervisor. The tab-delimited log file is exported from the metadata spreadsheet, named appropriately, and placed in the Admin directory; the tracking columns are then deleted from the metadata spreadsheet. The metadata itself is also exported as tab-delimited UTF-8 and processed through Archivists Utility to generate MODS into a folder in the Metadata folder for the collection. Once quality control is completed, the spreadsheet itself is placed in S:\Digital Projects\Administrative\Pipeline\collectionInfo\forMDlib\needsRemediation for the metadata librarians to pick up. The next part of this process is described online here.

Once QC is complete, DS personnel log into the libcontent server (a Linux server where Acumen and the archive reside) and run the makeJpegs script, which performs more quality control checks, generates JPEGs from the large TIFF files (for web delivery), and extracts OCR text from images if indicated by the exported log file. This script also uploads the MODS to the Linux server, placing them in a directory next to the JPEGs, OCR text, and any transcriptions. A second script (relocate_all) distributes this content into Acumen. Thus all digitized content goes online with April's metadata before the metadata librarians ever see the spreadsheet.

[A third script (moveContent):
- tests the collection XML file, then inserts/updates the collection entry in the InfoTrack database, which feeds our collection browse page
- picks up the exported log and metadata files, the MODS, and the TIFFs
- transports them across the network to the Deposits directory on the Linux server (where they will be processed for the archive)
- tests the TIFF copies to verify they did not change when crossing the network
- deletes the content on the share drive if everything copied successfully
This script must wait until after indexing is completed for new collections, to avoid creating dead links in our collection browse page.]
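[Note: the DS export scripts themselves were not reviewed in the meeting. As an illustrative sketch only, the kind of cleanup and tab-delimited UTF-8 export described above might look something like the following Python snippet; the specific character substitutions and function names are assumptions, not the actual scripts.]

```python
# Illustrative sketch only: the actual DS export scripts were not shown in the
# meeting; the character substitutions and file layout here are assumptions.
import csv

# Common "smart" punctuation that can sneak in from Word/PDF/OCLC copy-paste.
SUBSTITUTIONS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u00a0": " ",                  # non-breaking space
}

def clean_cell(value):
    """Replace problematic embedded characters with plain equivalents."""
    for bad, good in SUBSTITUTIONS.items():
        value = value.replace(bad, good)
    return value.strip()

def export_tab_delimited(rows, out_path):
    """Write cleaned rows as tab-delimited UTF-8, one item per row."""
    with open(out_path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for row in rows:
            writer.writerow([clean_cell(cell) for cell in row])
```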
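[Note: the verification step in moveContent was likewise not detailed. A common way to implement "verify the copies, then delete the originals" is to compare checksums of the source and destination files; the sketch below assumes that approach and uses made-up paths, so it should not be read as the actual script.]

```python
# Minimal sketch of "verify, then delete" as described for moveContent.
# The real script's logic and paths were not shown; this only illustrates the idea.
import hashlib
from pathlib import Path

def sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_and_delete(source_dir, dest_dir):
    """Delete share-drive TIFFs only if the server copies match exactly."""
    sources = sorted(Path(source_dir).glob("*.tif*"))
    for src in sources:
        dest = Path(dest_dir) / src.name
        if not dest.exists() or sha256(src) != sha256(dest):
            raise RuntimeError(f"Copy of {src.name} does not match; nothing deleted.")
    for src in sources:  # only reached if every copy verified
        src.unlink()
```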
Conclusion

We stopped at this point because we ran out of time. Next time we will begin with the Metadata Unit's process. Our next meeting is set for Monday, September 23rd at 1 pm in 402 Gorgas.

Jody noted that the TrackingFilenames spreadsheet (not to be confused with TrackingFiles), where we document the collection identifiers and names as well as how they are organized, may be of interest to the group. This document lives in S:\Digital Projects\Organization\Digital_Program_Logs. (A copy was later distributed by email.)