HATHI TRUST
A Shared Digital Repository

Digital Repositories for Preservation and Access
Digital Directions 2013
Jeremy York
July 22, 2013

Unless otherwise noted, these slides and their contents are licensed under a Creative Commons Attribution Unported License.

Digital repositories
• Primary mission to preserve content
• Performs actions to this end

Reasons to preserve content
• For access
• Guard against threats to content
  – Digitization accepted method of preservation reformatting
  – Digital deteriorates, is fragile

Reasons to provide access
• Meet needs of designated community
• Check on integrity of content
• Content that is accessible is more likely to be valued and preserved in the future

Reasons access might not be offered
• Copyright
• Privacy
• Licensing
• Needs of user community
  – Content available elsewhere
• Technical limitations
  – Networking and storage requirements

A number of models
• Full user access to preserved digital objects
• No end-user access to digital objects
• Delayed or triggered user access to digital objects
• Partial access to digital objects

Requirements to preserve content
• OAIS
  – “An OAIS is an Archive, consisting of an organization...of people and systems that has accepted the responsibility to preserve information and make it available for a Designated Community.” [does not imply unrestricted access]

OAIS
• Support information model
  – Define target of preservation (content data and representation information)
  – Define metadata needed to preserve, identify, contextualize information (PDI)
• Fulfill responsibilities
  – Accept information from Producers
  – Obtain control sufficient to preserve
  – Ensure understandable to designated community
  – Ensure preservation
  – Make available to designated community with information supporting authenticity

Ensure preservation
• Some strategies:
  – Transformation
  – Validation
  – Checks on integrity
  – Replication
  – Choice of formats
  – Migration

TRAC
• Starts with “a mission to provide reliable,
long-term access to managed digital resources to its designated community, now and into the future”
• Encompasses
  – Organizational Infrastructure
  – Digital Object Management
  – Technical Infrastructure

TRAC (2)
• Borrows vocabulary from OAIS
• Adapts ideas for applying criteria from nestor and Digital Curation Centre
  – Documentation (evidence)
  – Transparency
  – Adequacy
  – Measurability

[Diagram: Preserve Content. Labels: Mission, OAIS, TRAC, Provenance, Reference, Context, Fixity, Access Rights, Content Data, Representation Information, Preservation Actions, Integrity, Authenticity, Transparency, Documentation, Organizational Infrastructure, Reliability, Adequacy, Digital Object Management, Designated Community, Measurability, Technical Infrastructure]

Where does access come in
• Some level of access is necessary
  – Management, integrity
• What is preserved may not be what is most useful to the end user
• Implications across the repository

Content formats
• Can the content you are preserving be delivered over the Web?
  – Will you be storing derivative files?
  – Is some kind of transformation needed?
  – Do the files offer consistent functionality?
• Implications for scale of repository, access systems, changes to services
• In HathiTrust:
  – Limited to 3 formats, largely uniform in technical characteristics
    • ITU G4 TIFF
    • JPEG2000
    • Unicode (with and without coordinates)

Storage of information about content
• Is information about object adequately available for both preservation and access?
  – Structural information
  – Preservation information with implications for interface
• HathiTrust uses METS as a wrapper
  – Available for preservation and access

Content Package
[Diagram labels: images, text, Source METS, Zip, HT METS]

Architecture
../uc1/pairtree_root/b3/54/34/86/b34543486
  b34543486.zip
  b34543486.mets.xml
[Diagram labels: images, HT METS, text, Source METS]

Storage
• Does the storage system support needs for ingest and access?
• In HathiTrust:
  – Need to have fast access to repository systems to support services

Security
• Data Integrity
  – Checksum validation, digital object provenance
• Physical security
  – Biometric door systems, locked racks
• Network security
  – Firewalling, vulnerability scanning
• Application security
  – Developer best practices, input validation
• Access control…

Differential access to content
• Rights database
  – Ensures appropriate access
• Holdings database
  – Facilitates lawful uses of materials

Authentication/Authorization
• Mechanisms to enable differential access, ensure security and appropriate use

User services
• Bibliographic and full-text search indexes
• Collection-building capabilities
• User interfaces

APIs and Datasets
• Data API
• Bibliographic API
• OAI
• “Hathifiles”
• Datasets

More
• Quality
• User Support
• Correction

[Diagram: Provide Access. Labels: Content Formats, Content Package, Architecture, Storage, Security, Authentication, Authorization, Differential Access, Copyright/Agreements, Lawful Uses, Indexes, Services / User Interfaces, APIs and Datasets, Information Quality, User Support, Correction]

[Diagram: Mission. Preservation labels: OAIS, TRAC, Provenance, Reference, Context, Fixity, Access Rights, Content Data, Representation Information, Preservation Actions, Integrity, Authenticity, Documentation, Organizational Infrastructure, Transparency, Reliability, Adequacy, Digital Object Management, Measurability, Technical Infrastructure, Designated Community. Access labels: the same as in the Provide Access diagram above]

Thank you!

How to find out more
• About: http://www.hathitrust.org/about
• Twitter: http://twitter.com/hathitrust
• Facebook: http://www.facebook.com/hathitrust
• Monthly newsletter:
  – http://www.hathitrust.org/updates
  – RSS: http://www.hathitrust.org/updates_rss
• Contact us: feedback@issues.hathitrust.org
• Blogs: http://www.hathitrust.org/blogs
  – Large-scale Search
  – Perspectives from HathiTrust
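
Supplementary sketch for the Architecture slide
• The path on that slide follows the Pairtree convention: the object identifier is split into two-character directory segments, ending in a directory named for the identifier
• The Python below is a minimal illustration under stated assumptions, not HathiTrust's own code
  – Function names and the sample identifier are hypothetical
  – Only the three single-character substitutions from the Pairtree specification are applied; its hex-escaping step is omitted
  – The zip and METS file names are assumed to follow the pattern shown on the slide

```python
from pathlib import PurePosixPath

# Single-character substitutions from the Pairtree specification.
# Assumption: the spec's hex-escaping of other special characters is omitted.
_PAIRTREE_SUBS = str.maketrans({"/": "=", ":": "+", ".": ","})


def pairtree_object_dir(namespace: str, local_id: str) -> PurePosixPath:
    """Return an item's object directory, relative to the repository root."""
    cleaned = local_id.translate(_PAIRTREE_SUBS)
    # Split the cleaned identifier into two-character directory segments;
    # an odd-length identifier leaves a final one-character segment.
    segments = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return PurePosixPath(namespace, "pairtree_root", *segments, cleaned)


def package_files(namespace: str, local_id: str) -> tuple[PurePosixPath, PurePosixPath]:
    """Return the assumed (zip, METS) file paths for an item."""
    obj_dir = pairtree_object_dir(namespace, local_id)
    cleaned = obj_dir.name
    return obj_dir / f"{cleaned}.zip", obj_dir / f"{cleaned}.mets.xml"
```

• For a hypothetical item 39015012345678 in namespace mdp, pairtree_object_dir returns mdp/pairtree_root/39/01/50/12/34/56/78/39015012345678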