Architecting an Extensible Digital Repository
Anoop Kumar, Ranjani Saigal, Rob Chavez, Nikolai Schwertner
Tufts University, Medford, MA

Overview
• Background information on the evolution of TDL
• Design requirements
• TDL architecture
• Applications that interface with TDL
  – Tufts DL search
  – VUE

History of Digital Collections at Tufts
• About Tufts
  – Interdisciplinary
  – Focus on teaching and learning
• Digital collections at Tufts
  – Perseus (Classics)
  – Tufts University Science Knowledgebase (TUSK-Medicine)
  – Artifact (Art History)
  – Digital Collections and Archives (DCA): Bolles, etc.
  – Other (Crime and Punishment)

Projects, Materials, and Tools
• Perseus DL
  – Materials: 50 million words; highly structured TEI-encoded XML texts of many types; 50,000 images
  – Tools: Perseus document management system and tools
• DCA
  – Materials: 13 million words; 35,000 images; geospatial datasets; multimedia objects
  – Tools: Perseus document management system and tools
• TUSK
  – Materials: 15,000 documents, including full-text syllabi, digital slide images, lecture recordings (audio and video), text notes, exam questions, evaluation forms, and bibliographies linked to full-text articles
  – Tools: Networked course management system interface
• Artifact
  – Materials: 2,500 images; links to the Art History slide collection database containing 120,000 entries
  – Tools: On-demand viewing and searching with Internet-based adaptations of traditional learning aids, such as flashcards, for review and study

Why TDL? (Tufts Digital Library)
• The collections were continuously expanding, adding content in a variety of formats.
• The architecture of these libraries was not built to accommodate such expansion.
• Needed a university-wide digital repository that can manage the ever-increasing content while continuing to serve discipline-specific needs and leveraging existing and new tools and services.

Designing TDL
• Digital Collections and Archives partnered with Academic Technology to create a digital library that can manage the content while supporting teaching and learning.
• Commitment to comply with standards in the library and the open source community.
• Ensure scalability, flexibility, reusability, extensibility, and interoperability.

Design Requirements
• Ingest:
  – Ability to enforce archival standards
• Management:
  – Use of information packages to facilitate storage and dissemination
  – Ability to incorporate content models
• Persistence:
  – Use of persistent identifiers (mapped URNs)

Requirements and System Services
• Unique and persistent identification of materials: Naming Service
• Use of Archival Information Packages (AIP): Digital Object Provider (DOP) Service (Fedora)
• Use of Submission Information Packages (SIP): Drop Box, Ingestion Service
• Use of Dissemination Information Packages (DIP): DOP Service
• Authentication and integrity checking: DOP Service
• Dissemination: Disseminators, Caching Service, Digital Library Application, Search Service
• Access: Search Service and other applications

Tufts DL Architecture
[Figure: Tufts DL architecture diagram. Component labels: Fedora Client, Application Creation Service, Application Data, Application Interface, Drop Box, Fedora, Search Interface, Naming Service, Ingestion Service, Search, Indexing Service, Search Index. Actor labels: U = Users, M = Manager, A = Administrators.]

Components of TDL
• Drop Box and Ingestion Service: validation, tagging, preprocessing, ingestion
• Naming Service: unique persistent identifiers mapped to objects (“tufts:dca:central:MS102.33.1345”)
• Fedora Repository: management and access framework for digital objects
• Search and Indexing Service: provides the search mechanism
• Application Creation Service: provides a mechanism for external applications to interface with the repository

TDL Architecture
• Drop Box and Ingestion Service
• Naming Service
• Fedora Repository Service at Tufts
• Indexing Service and Search Engine
• Application Creation Service

Drop Box and Ingestion Service

Naming Service
• Assigns, reserves, and resolves URNs
• URN format
  – tufts:school name:owner:[collection:]item name
  – Example: tufts:dca:central:MS102.33.1345
• URN properties
  – Provides a unique ID for each object deposited into the repository
  – The service assures resolution to a unique resource

Fedora Repository Service@Tufts
• Fedora: key features
• Repository at Tufts
• Content models at Tufts
  – Objects, behaviors, and disseminators
• Implementation challenges

Flexible Extensible Digital Object Repository Architecture (Fedora)
• Support for heterogeneous data types
• Accommodation of new types as they emerge
• Aggregation of mixed, possibly distributed, data into complex objects
• The ability to specify multiple content disseminations of these objects
• The ability to associate rights management schemes with these disseminations

Repository Model
[Figure: repository model diagram. Component labels: User, Caching Service, Fedora, Processing Service, HTTP Server, Storage Device. Link labels: HTTP Request (twice), High Bandwidth (20Mb TIFF), Medium Bandwidth (20Mb TIFF), Medium Bandwidth (200Kb JPEG).]

Content Model (CM) Hierarchy
• Indexing Disseminators: getIndexTerms, getForIndexing, etc.
• Repository-Level Disseminators: getArchivalCopy, getPreview, getClass, etc.
• Text CM: getTOC, getChunksList, getChunk, etc.
• VUE CM: getConceptMap, getResource, etc.
• Image CM: getThumbnail, getAccessHigh, getImageStats, etc.
• Binary CM: getObject, getMIME, etc.
• Collection CM: getObjects, getInfo, etc.
• Specific implementations (TEI text, EAD text, Encyclopedia, Directory, TIFF image, etc.)

Implementation Challenges
• Processing large XML documents
• Transforming large images
• Modeling collections
• Advanced search
• Customized search
• Caching disseminations

Indexing Service and Search Engine
• Indexing
  – Specialized polymorphic disseminators
• Implementation
  – Lucene
• Supported types of search
  – Basic keyword
  – Advanced, metadata-based
• Accessing the service
  – HTTP GET/POST
  – SOAP

Application Creation Service
• An important design requirement for TDL was to let current digital library applications interface easily with TDL and give seamless access to its content from within their own environments.
• Current applications like Perseus can interface with this service so that their tools can disseminate the content that resides in TDL.
• The service has been designed not only to support current applications but also to accommodate the needs of future, yet-to-be-defined applications such as course management systems, learning tools, and portals.

Applications Accessing TDL Content
• Tufts DL Search
• Visual Understanding Environment (VUE)

Visual Understanding Environment (VUE)

VUE Technical Infrastructure
[Figure: VUE architecture diagram. Labels: OKI, OKI-FEDORA Bridge, DR API, DR Implementations, FEDORA, Digital Repository (twice).]

Why TDL? (Tufts Digital Library)
• The collections are continuously expanding, adding content in a variety of formats.
• The current architecture of these libraries is not built to accommodate such expansion.
• Need a university-wide digital repository that can manage the ever-increasing content while continuing to serve discipline-specific needs and leveraging existing and new tools and services.

Future Direction
• Authentication and authorization service
• Customization and enhancement of Fedora@Tufts to address a wide variety of needs
• Provide an automated browsing service for the repository
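
Naming Service: illustrative sketch
The URN format given for the Naming Service (tufts:school name:owner:[collection:]item name) and its three operations (assign, reserve, resolve) can be sketched as below. This is a minimal illustration under stated assumptions, not the TDL implementation: the names TuftsUrn and NamingService, the in-memory dictionary, and the location string "example-object-1" are hypothetical, since the slides do not describe the real service's interface or storage.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TuftsUrn:
    """A URN of the form tufts:school:owner:[collection:]item (hypothetical class)."""
    school: str
    owner: str
    item: str
    collection: Optional[str] = None  # the only optional field in the format

    @classmethod
    def parse(cls, text: str) -> "TuftsUrn":
        parts = text.split(":")
        # Four fields without a collection, five with one; no field may be empty.
        if parts[0] != "tufts" or len(parts) not in (4, 5) or not all(parts):
            raise ValueError(f"not a valid TDL URN: {text!r}")
        if len(parts) == 4:
            _, school, owner, item = parts
            return cls(school, owner, item)
        _, school, owner, collection, item = parts
        return cls(school, owner, item, collection)

    def __str__(self) -> str:
        fields = ["tufts", self.school, self.owner]
        if self.collection:
            fields.append(self.collection)
        fields.append(self.item)
        return ":".join(fields)


class NamingService:
    """In-memory stand-in for a naming service (assumption: storage is a dict)."""

    def __init__(self) -> None:
        # URN string -> location; None means "reserved but not yet assigned".
        self._locations: Dict[str, Optional[str]] = {}

    def reserve(self, urn: str) -> str:
        key = str(TuftsUrn.parse(urn))
        if key in self._locations:
            raise KeyError(f"URN already taken: {key}")
        self._locations[key] = None
        return key

    def assign(self, urn: str, location: str) -> str:
        key = str(TuftsUrn.parse(urn))
        if self._locations.get(key) is not None:
            raise KeyError(f"URN already assigned: {key}")
        self._locations[key] = location
        return key

    def resolve(self, urn: str) -> str:
        key = str(TuftsUrn.parse(urn))
        location = self._locations.get(key)
        if location is None:
            raise LookupError(f"URN does not resolve to an object: {key}")
        return location


if __name__ == "__main__":
    service = NamingService()
    service.reserve("tufts:dca:central:MS102.33.1345")
    service.assign("tufts:dca:central:MS102.33.1345", "example-object-1")
    print(service.resolve("tufts:dca:central:MS102.33.1345"))
```

Refusing to reassign a bound URN is one way to model the stated property that each URN resolves to a unique resource; a production service would also need durable storage and concurrency control, which this sketch omits.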