An Interim Report from DAWG
Digital Architecture and Infrastructure Working Group
DAWG - April 9, 2003

Chartered by Grace Agnew to:
- Develop policies and procedures to support an integrated, secure, and effective common infrastructure
- Develop a digital library infrastructure to support an integrated, sustainable digital library initiative

Goals include:
- Provide sustainability of the digital content and technology platform
- Support of the RUL Data Architecture
- Apply new interoperability protocols
- Support state-wide initiatives

DAWG Team
- Anne Butman
- Tom Frusciano
- Judy Gardner
- Michael Giarlo
- Nick Gonzaga
- Dave Hoover
- Patrick Huey
- Ron Jantz (chair)
- Sam McDonald
- Ann Montanaro
- Lynn Mullins
- Robert Nahory
- Jeffery Triggs
- Karen Wenk
- Yang Yu

Challenges in Digital Libraries
- Integration across diverse digital collections
- Scale to millions of objects
- Flexibility to handle many digital formats
- Ability to customize by adding special tools and services
- Preservation of digital objects
- Sustainability and interoperability

Initial Focus of DAWG
- Infrastructure: evaluating and selecting a large mass storage system to accommodate millions of digital objects
- Architecture: developing the architecture and prototype for an RU digital library network

Concepts and Terminology

RU Digital Library Network (DLN)
A system of people, standards, software, and hardware that provides access to, management of, and preservation of digital repositories of interest to RU.

RUL Digital Library Repository (DLR)
A repository designed and managed by RUL to contain and provide access to digital resources created by RU and RUL. The DLR is part of the DLN.

Digital Object Architecture: support of complex objects
- multiple manifestations, e.g. a book represented as images, text, and digital sound
- multiple formats, e.g. a map represented as TIFF, DjVu, and MrSID
- multiple behaviors, e.g. display at different resolutions, rotate a 3D object, etc.

Architecture Design Philosophy
- Design principles: interoperability, sustainability, and extensibility.
- Informed by the Open Archival Information System (OAIS) Reference Model.
- Designed to contain the output of RU (both scholarly material and administrative data). Policy decisions will determine content and how distributed or centralized the repository will ultimately become.
- Will accommodate a virtual network of repositories, enabling access to existing metadata repositories (IRIS, Luna) and providing a framework for accessing and searching external metadata resources.
- The technological framework and content must be sustainable.
- All information resources, on submittal to the repository, should have, at a minimum, a core set of metadata that can be mapped to RU Core.
- The architecture is flexible (customizable) and extensible. For example, discipline-specific portals can be developed.

RU Digital Library Network - Features
- Large-scale, stable digital repository
- Searching across multiple repositories
- Searching and browsing using RU Core
- Flexible metadata support
- Access through portals by community, content, and format
- Easy-to-use submission process
- Digital preservation with persistent identifiers
- Flexible digital object architecture
- Access to existing digital collections
- Sustainability through open source, standards, and support of critical workflow processes

RU Digital Library Network - Possible Content
- Maps (e.g. digitized historic New Jersey maps)
- Historic documents
- Electronic journals
- 3D objects (e.g. glass art, Roman coins, scrapbooks)
- Multimedia objects (e.g. digital video)
- Special ebook collections
- Numeric data
- Preprints, learning objects from RU faculty
- Dissertations
- Operational and administrative RU reports
- Object-level access to existing digital collections (e.g. NJEDL)
- Searchable metadata collected through harvesting
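
A minimal sketch of how the harvested metadata mentioned in the content list above might be collected with OAI-PMH. The endpoint URL, the `maps` set name, and the sample record are hypothetical and exist only for illustration; the `ListRecords` verb, the `oai_dc` metadata prefix, and the XML namespaces are taken from the OAI-PMH protocol. This illustrates the protocol in general, not the DAWG prototype.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://repository.example.edu/oai"

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional selective harvesting by set
    return base_url + "?" + urlencode(params)


def harvested_titles(response_xml):
    """Return the dc:title text of every record in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [t.text for t in root.iterfind(".//oai:record//dc:title", NS)]


# Hand-written sample response (not real harvested data).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample map of New Jersey</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url(BASE_URL, set_spec="maps"))
    print(harvested_titles(SAMPLE_RESPONSE))
```

Because the harvester only needs a base URL and a metadata prefix, any repository that exposes OAI-PMH can be added to the search index without changes on the repository side, which is what makes this approach loosely coupled.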

RU Digital Library Network
[Diagram: a Search and Browse Interface reaches repositories in two ways: federated searching (Z39.50, tightly coupled) and harvesting (OAI-PMH, loosely coupled). Nodes shown: RU Digital Library Repository, IRIS, NJEDL, LUNA, other nodes.]

Cross-Repository Searching - An Early DLN Prototype

Digital Object Structure - Three Types
[Diagram labels: Persistent ID, METS Wrapper, Digital Objects. The three types shown are metadata with a byte stream, metadata with a pointer to an external digital object, and harvested metadata.]

Repository Architecture and Metadata
- METS (Metadata Encoding and Transmission Standard) will be used to encapsulate descriptive, preservation, structural, and behavior metadata.
- For interoperability, all metadata schemas must map to NJCore and Dublin Core.
- The architecture must support creation of simple metadata (NJCore, Dublin Core) and complex metadata (FGDC, MPEG-7, IEEE LOM, etc.).

Metadata and Dynamic Mapping - An Example
[Diagram: input arrives in a METS wrapper (FGDC for maps, preservation, structure, object) and goes into the object repository. Global search/retrieval is via RU Core; FGDC search is via lat & long.]

Open Source Digital Repositories

DSpace - A digital library repository
DSpace is a specialized type of digital asset management or content management system: it manages and distributes digital items, made up of digital files (or “bitstreams”) and allows for the creation, indexing, and searching of associated metadata to locate and retrieve the items. It is designed to support the long-term preservation of the digital material stored in the repository. (http://dspace.rutgers.edu)

Fedora - A digital object repository
Fedora is a foundation upon which interoperable web-based digital libraries can be built. Fedora consists of APIs (application program interfaces) for creating access and management applications.

Archival Storage and Preservation
- A physically separate archive is managed for preservation purposes.
- The archive is separate from the presentation form (website) and the daily backup.
- The intent of the archive is to capture all the required forms of the digital material in non-proprietary format.
- Each digital object would have preservation metadata and a persistent ID.

Mass Storage System - Requirements
- Initial capacity of 10 to 20 terabytes (TB)
- Extensible to 100 TB, 200 TB, and beyond
- Low management overhead
- Information must survive migrations across software and platforms
- History/audit trails required for each object
- Mirroring to a remote cluster (e.g. a cluster in NB and one in Newark) to provide offsite backup
- Global name space across all RUL locations
- Platforms required: Windows 2000, Unix, Linux

Technologies and Standards
- Persistent ID: CNRI Handle System
- OAI-PMH: protocol for metadata harvesting
- METS: Metadata Encoding and Transmission Standard
- OpenURL
- SCORM

Detailed Requirements
- Ingest
- Administration
- Access
- Data Management
- Preservation
- Storage
- System Level

Progress To Date

Infrastructure
- Commercial product discussions and quote from EMC for mass storage. Also examining ADIC and IBM’s Storage Tank (an open source product).
- CamdenBase directories/permissions standardized for transfer to systems.
- Developed initial criteria for an RUL server registry.

Architecture
- Educating ourselves in various technologies: 1) OAI-PMH, 2) CNRI Handle System, 3) SCORM, 4) Z39.50/YAZ, 5) METS, 6) OpenURL
- Draft for requirements and architecture
- Cross-repository search prototype
- Downloaded DSpace (from MIT) and started evaluation
- UVa (Fedora) visit planned for early March

Next Steps
- Fedora: visit UVa for half-day tutorial
- Continue reviewing and select mass storage system
- Prepare interim communication package
  - High-level architecture and requirements
  - Cross-repository searching prototype
  - Preliminary assessment of DSpace and Fedora
- Communicate and get feedback
- Begin more detailed evaluation of DSpace and Fedora
- Produce architecture/functional specification
- Develop prototype with sample content

Tasks and Timeline for DAWG-A
- January, 2003: Requirements/Architecture document
- February, 2003: Discussion, feedback with RUL and RU
- March to May, 2003: Evaluation of candidate systems (DSpace, Fedora, et al.)
- June to August, 2003: Select system and prototype sample content
- September to November, 2003: Prototype-trial of multiple repositories

Tasks and Timeline for DAWG-I
- December, 2002: Determine requirements for mass storage system
- December to January, 2003: Transfer CamdenBase to Systems, test, and evaluate process
- December to January, 2003: Research and evaluate possible mass storage products
- March, 2003: Recommend mass storage solution
- March, 2003: Develop RUL server registry criteria