
Metadata Workflow Discussion
9/9/13, Gorgas 402
Attendees: Jason Battles (chair), Millie Jackson, Mary Bess Paluzzi, Janet Lee-Smeltzer, Donnelly Lancaster
Walton, Jody DeRidder (minutes), Will Jones, Mary Alexander, April Burnett, Jeremiah Colonna-Romano.
Intro: Jason Battles
This was a kick-off meeting to obtain a common understanding of how our metadata efforts interact. The
goal is to take detailed workflow documentation from all three areas, create consistency, and develop two
levels of documentation: an executive summary, and a second level with specific information, linked as
necessary from the top level overview.
To achieve that, Jason proposed we will work through each area, starting with the archivists and moving
forward through the pipeline to get a sense of when/where/how things change or get added. We’re looking
for ways to improve, as well as seeking to understand the process. We also want to eliminate
misperceptions. At a later date, we hope to use our findings to inform Acumen development of search and
retrieval, and display.
Along the way, we are also hoping to better understand the level of processing, and who does what, with
what results. This discussion needs to be user-based: consideration of the outcome for the user is primary.
To start, these meetings will occur every 2 weeks, but later may shift to monthly.
Jason then asked Donnelly to start with a description of how finding aids are created, and how content is
selected and prepared for digitization.
Archivist’s Process: Donnelly Lancaster Walton and April Burnett
After seeking clarification on whether to focus on previous or current methods, Donnelly described the
current archivist workflow.
They process collections and store all collection information in Archivists' Toolkit (AT), which will be
replaced by ArchivesSpace this fall (the change should be seamless). They select a collection in AT and
process it until the finding aid meets DACS standards (Describing Archives: A Content Standard, officially
approved by the Society of American Archivists (SAA)). DACS is integrated into AT: the software shows the
applicable rules when hovering over fields during data entry.
DACS is primarily collection-based and does not address many item-level details. So, for the controlled
vocabulary for names, they use the Library of Congress Name Authority File (LCNAF), which usually
corresponds to DACS. They are also considering using the ACRL DCRM-MSS (Descriptive Cataloging of Rare
Materials (Manuscripts)) in the future for item-level data entry. If DACS does not answer a question about
how to input something, they use the Chicago Manual of Style. They also have an in-house processing
manual.
From the processed collections, content is selected for the digitization queue, based on the following
qualifications:

- No copyright issues
- No preservation issues
- There must be a finding aid
- If there's a known demand for access to the collection, it gets priority
After deciding on content, they enter information about the selected collections in the Selection spreadsheet
(on the share drive in S:\Digital Projects\Organization\Digital_Program). Unless something is bumped up in
priority, April works through this list in order, creating item level metadata. When she’s finished, she cuts
and pastes the entry for the collection from one tab of this spreadsheet to another.
[The Selection spreadsheet contains the following tabs: Sandbox, In Progress, Queue, DS inProgress, DS
Digitization Complete. Each tab represents a step in the pipeline on a collection-level basis, so that anyone
can tell at a glance where a collection is in the progression from the archivists through digitization:

- The Sandbox is for sharing ideas among the archivists of possible content for digitization.
- In Progress is content identified for April to create item-level metadata.
- Queue is content waiting for Digital Services to digitize.
- DS inProgress is content undergoing digitization.
- DS Digitization Complete is content digitized and online.

The spreadsheet originally started as a single page with color-coding but has evolved to this form as it's more
practical. Movement of entries is a simple cut and paste, and occurs as content moves through the pipeline.]
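The tab-per-stage pipeline above can be sketched in code. This is a minimal illustration, not the actual spreadsheet tooling: the tab names come from the minutes, but the data model and the `advance` helper are assumptions.

```python
# Sketch of the Selection spreadsheet's tab-per-stage pipeline.
# Tab names are taken from the minutes; the data model is illustrative.

PIPELINE = ["Sandbox", "In Progress", "Queue", "DS inProgress",
            "DS Digitization Complete"]

def new_tracker():
    """One list of collection rows per tab."""
    return {tab: [] for tab in PIPELINE}

def advance(tracker, collection):
    """Cut a collection's row from its current tab and paste it into
    the next tab, mirroring the manual cut-and-paste step."""
    for i, tab in enumerate(PIPELINE[:-1]):
        if collection in tracker[tab]:
            tracker[tab].remove(collection)
            tracker[PIPELINE[i + 1]].append(collection)
            return PIPELINE[i + 1]
    raise ValueError(f"{collection!r} not found or already complete")

tracker = new_tracker()
tracker["Sandbox"].append("Example Papers")
advance(tracker, "Example Papers")   # -> "In Progress"
```

Because each collection appears on exactly one tab, a glance at the tracker (like a glance at the spreadsheet) shows where everything stands.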
April uses the M01 spreadsheet supplied by the metadata librarians (see Template Registry) and the
metadata librarians’ input guidelines (see Metadata Creation section here:
https://intranet.lib.ua.edu/cataloging/metadata ). April creates item-level names but not subjects, unless
something stands out. When she does add a subject, she also adds it to AT for the collection. The finding aid
is not released until the collection is ready for digitization, so that April can correct it as she works through
the content.
When April has finished creating the metadata, she:

- Moves the collection row in the Selection spreadsheet to the Queue tab.
- Creates directories in the Digital Services area on the share drive (S:\Digital
  Projects\Digital_Coll_in_progress\Digital_Coll_Waiting) according to Digital Services protocols.
- Places the metadata spreadsheet for the collection in the Metadata folder she's just created.
- Sends an email to Digital Services to notify them that a new collection is ready.
Finding aids are placed in the “new” or “remediated” folders in S:\Special
Collections\Digital_Program_files\EAD where they are picked up every Friday night for processing and web
delivery.
(At this point, Jody clarified that the finding aids follow a different path than the item-level content. We
agreed to continue to follow the item-level content at this point in the discussion, and come back to the
finding aids later.)
Digital Services Process: Jeremiah Colonna-Romano
The overview for the Digital Services workflow is online.
Item-level digitization is organized according to staff production needs. Some of the things that must be
considered include:

- Types of content
- Formats of material
- Size of material
- Different hardware needs
- Availability of personnel with the training on that hardware and type of content
Digital Services juggles these factors, particularly with collections that contain a variety of materials, as
parts of them may need to be captured on different stations, potentially by different personnel.
The material exchange process for obtaining and returning boxes is documented here: S:\Digital
Projects\Administrative\Pipeline\Material_exchange_pipeline and works well.
Staff members select a collection entry from the Queue in the Selection spreadsheet, move that row to the
next tab (DS inProgress) and copy the information into an XML collection file, which later serves to feed the
database for browsing collections, and as a landing page in Acumen when no EAD is yet online. This file is
named appropriately and placed in the Admin directory for the collection, and the collection directories are
moved into S:\Digital Projects\Digital_Coll_in_progress. Additional columns for tracking and facilitating our
work are temporarily added to the existing metadata spreadsheet provided by April:

- Number of Captures
- Captured With
- Captured By
- Date
- OCR? (1=yes or 0=no)
- DS Notes
- Metadata changed
(These columns will later be exported as a tab-delimited text log file for storage in the archive, and deleted
from the spreadsheet before it is transferred to the metadata librarians. The remainder, April's work, will be
exported as tab-delimited metadata for translation into MODS to be uploaded to Acumen. There was some
discussion of the fact that the DS export scripts correct for embedded encodings that may have inadvertently
been incorporated into the spreadsheet from MS Word, PDF, OCLC, or elsewhere. These export scripts may
be helpful for the metadata librarians, who will also be working with the same spreadsheets.)
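The kind of encoding cleanup the export scripts perform can be sketched as follows. The character mapping below is a common set of MS Word "smart" characters; the actual substitutions made by the DS scripts are not listed in the minutes, so treat this as illustrative.

```python
# Sketch of cleanup for characters pasted in from Word/PDF/OCLC.
# The substitution table is an assumption, not the scripts' actual list.

SMART_CHARS = {
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u00a0": " ",                  # non-breaking space
}

def clean_cell(text):
    """Normalize embedded 'smart' characters to plain ASCII
    before tab-delimited export."""
    for smart, plain in SMART_CHARS.items():
        text = text.replace(smart, plain)
    return text

clean_cell("\u201cBama\u201d letters, 1901\u20131903")   # -> '"Bama" letters, 1901-1903'
```

Running every cell through such a filter before export keeps stray word-processor characters out of the downstream MODS records.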
When a staff member prepares to digitize a box of content, they first compare the actual items to the
metadata in the spreadsheet. Any anomalies are noted in the “DS Notes” column (such as torn pages,
missing items, content too fragile to digitize) and if page numbers need to be corrected, that is noted in the
“Metadata changed” column. Captures are made and progress is logged, including whether an item should
be processed for OCR (optical character recognition) capture.
The movement of metadata after capture is described online. After the collection, or a batch of the
collection, is completed, the collection is moved to S:\Digital Projects\Digital_Coll_Complete.
The content here undergoes two levels of quality control (QC) review: one by the digitizer, and another by an
assigned peer or supervisor. The tab-delimited log file is exported from the metadata spreadsheet, named
appropriately, and placed in the Admin directory; these columns are then deleted from the metadata
spreadsheet. The metadata itself is also exported into tab-delimited UTF-8 and processed through Archivists
Utility to generate MODS into a folder in the Metadata folder for the collection. Once quality control is
completed, the spreadsheet itself is then placed in S:\Digital
Projects\Administrative\Pipeline\collectionInfo\forMDlib\needsRemediation for metadata librarians to pick up.
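The tab-delimited-to-MODS translation can be illustrated with a toy converter. The real conversion is done by Archivists Utility; the column names ("Title", "Date") and the choice of MODS elements below are assumptions for the sketch, not its actual mapping.

```python
# Sketch: one tab-delimited metadata row -> a minimal MODS record.
# Field names and element mapping are illustrative; the real work
# is done by Archivists Utility.
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

def row_to_mods(header, line):
    """header and line are tab-delimited UTF-8 strings."""
    fields = dict(zip(header.split("\t"), line.split("\t")))
    ET.register_namespace("", MODS_NS)
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = fields["Title"]
    origin = ET.SubElement(mods, f"{{{MODS_NS}}}originInfo")
    ET.SubElement(origin, f"{{{MODS_NS}}}dateCreated").text = fields["Date"]
    return ET.tostring(mods, encoding="unicode")

record = row_to_mods("Title\tDate", "Letter to J. Smith\t1904-05-02")
```

One MODS file per spreadsheet row is what ends up in the collection's Metadata folder for upload to Acumen.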
The next part of this process is described online here.
Once QC is complete, DS personnel log into the libcontent server (a Linux server where Acumen and the archive
reside) and run the makeJpegs script, which performs more quality-control checks, generates JPEGs from the large
TIFF files (for web delivery), and extracts OCR text from images if indicated by the exported log file. This
script also uploads the MODS to the Linux server, placing them in a directory next to the JPEGs, OCR text, and
any transcriptions. A second script (relocate_all) distributes this content into Acumen.
Thus all digitized content goes online with April’s metadata before the metadata librarians ever see the
spreadsheet.
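How makeJpegs might consult the exported log to decide which items need OCR can be sketched like this. The "OCR?" column name comes from the minutes; the filenames are made up, and the real script's log parsing is not documented here.

```python
# Sketch: read the exported tab-delimited log to find items flagged
# for OCR. Column names follow the minutes; filenames are invented.
import csv, io

LOG = """Filename\tOCR?\tDS Notes
item_0001.tif\t1\t
item_0002.tif\t0\ttorn page
"""

def items_needing_ocr(log_text):
    reader = csv.DictReader(io.StringIO(log_text), delimiter="\t")
    return [row["Filename"] for row in reader if row["OCR?"] == "1"]

items_needing_ocr(LOG)   # -> ['item_0001.tif']
```

Driving OCR from the log keeps the digitizer's per-item judgment (recorded at capture time) authoritative during server-side processing.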
[A third script (moveContent):

- tests the collection XML file,
- inserts or updates the collection entry in the InfoTrack database, which feeds our collection browse page,
- picks up the exported log and metadata files, the MODS, and the TIFFs,
- transports them across the network to the Deposits directory on the Linux server (where they'll be
  processed for the archive),
- tests the TIFF copies to verify they did not change when crossing the network, and
- deletes the content on the share drive if everything copied successfully.

This script must wait until after indexing is completed for new collections, to avoid creating dead links in our
collection browse page.]
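The verify-then-delete step can be sketched as a checksum comparison. The minutes only say the TIFF copies are "tested"; the hashing approach below is an assumption about how such a test might work, not moveContent's actual method.

```python
# Sketch of verify-then-delete: remove the share-drive source only
# after confirming the network copy is byte-identical. The SHA-256
# comparison is assumed; the minutes do not specify the test used.
import hashlib
from pathlib import Path

def file_digest(path, chunk=1 << 20):
    """SHA-256 of a file, read in chunks so large TIFFs fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify_and_delete(source: Path, copy: Path) -> bool:
    """Delete the source only if the copy matches it exactly."""
    if file_digest(source) == file_digest(copy):
        source.unlink()
        return True
    return False
```

Deleting only after a successful comparison is what makes it safe for the script to clean up the share drive automatically.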
Conclusion
We stopped at this point because we ran out of time. Next time we will begin with the Metadata Unit’s
process. Our next meeting is set for Monday, September 23rd at 1 pm in 402 Gorgas.
Jody noted that the TrackingFilenames spreadsheet (not to be confused with TrackingFiles), where we
document the collection identifiers and names, as well as the organization of them, may be of interest to the
group. This document lives in S:\Digital Projects\Organization\Digital_Program_Logs. (A copy was later
distributed by email.)