Symposium: Open Access to Information
Panel 2: Open Access & Institutional Repositories
24 August 2006, Brasilia

Digital Libraries, Electronic Theses and Dissertations (ETDs), and NDLTD
http://fox.cs.vt.edu/talks/2006/20060824IBICTp2

Edward A. Fox, fox@vt.edu
Executive Director, NDLTD
Chair, IEEE-CS Tech. Committee on Digital Libraries
Professor, Department of Computer Science
Director, Digital Library Research Laboratory
Virginia Tech, Blacksburg, VA 24061 USA

Outline
• Key Ideas
• Acknowledgements
• Digital Libraries
• DLs & Scholarly Communication
• Institutional Repositories
• NDLTD
• Summary
• DL Futures

Key Ideas - Overview
• Theorem 1: Supporters of Open Access should support NDLTD.
• Theorem 2: 5S can guide us to better support of Open Access.

Acknowledgements
• Students
• Faculty, Staff
• Collaborators
• Support
• Mentors

Acknowledgements: Students
• Pavel Calado, Yuxin Chen, Fernando Das Neves, Shahrooz Feizabadi, Robert France, Marcos Gonçalves, Nithiwat Kampanya, S.H. Kim, Aaron Krowne, Bing Liu, Ming Luo, Paul Mather, Unni Ravindranathan, Ryan Richardson, Rao Shen, Ohm Sornil, Hussein Suleman, Ricardo Torres, Wensi Xi, Baoping Zhang, Qinwei Zhu, …

Acknowledgements: Faculty, Staff
• Lillian Cassel, Debra Dudley, Roger Ehrich, Joanne Eustis, Weiguo Fan, James Flanagan, C. Lee Giles, Eberhard Hilf, John Impagliazzo, Filip Jagodzinski, Rohit Kelapure, Neill Kipp, Douglas Knight, Deborah Knox, Aaron Krowne, Alberto Laender, Gail McMillan, Claudia Medeiros, Manuel Perez, Naren Ramakrishnan, Layne Watson, …

Other Collaborators (Selected)
• Brazil: FUA, IBICT, UFMG, UNICAMP, USP
• Case Western Reserve University
• Emory, Notre Dame, Oregon State
• Germany: Humboldt U., U. Oldenburg
• Mexico: UDLA (Puebla), Monterrey
• College of NJ, Hofstra, Penn State, Villanova
• University of Arizona
• University of Florida, Univ.
of Illinois
• University of Virginia
• VTLS (slides on digital repositories, NDLTD)

Acknowledgements: Support
• Course: UNESCO, CETREDE, IFLALAC, AUGM, CLEI, UFC
• Sponsors: ACM, Adobe, AOL, CAPES, CNI, CONACyT, DFG, IBM, Microsoft, NASA, NDLTD, NLM, NSF (IIS-9986089, 0086227, 0080748, 0325579, 0535057; ITR-0325579; DUE-0121679, 0136690, 0121741, 0333601), OCLC, SOLINET, SUN, SURA, UNESCO, US Dept. Ed. (FIPSE), VTLS

Acknowledgements - Mentors
• JCR Licklider
  – undergrad advisor (1969-71)
  – Author in 1965 of “Libraries of the Future”
  – Before, at ARPA, funded start of Internet
• Michael Kessler
  – BS thesis advisor
  – Project TIP (technical information project)
  – Defined bibliographic coupling
• Gerard Salton
  – graduate advisor (1978-83)
  – “Father of Information Retrieval”

Digital Libraries
• Definitions
• DL Manifesto – Reference Model
• Book in process (Fox & Gonçalves), 5S
• DL Curriculum Project

DL Definitions - 1
• “A digital library is an organized and focused collection of digital objects, including text, images, video, and audio, along with methods of access and retrieval, and for selection, creation, organization, maintenance, and sharing of the collection.”
• Witten & Bainbridge
  – “How to Build a Digital Library”
  – Morgan Kaufmann 2003

DL Definitions - 2
• “Digital libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use by a defined community or set of communities”
• Waters, D.J. CLIR Issues, July/August 1998
• www.clir.org/pubs/issues/issues04.html

DL Definitions - 3
• Issues and Spectra
  – Collection vs. Institution
  – Content vs. System
  – Access vs. Preservation
  – “Free” vs. Quality
  – Managed vs. Comprehensive
  – Centralized vs.
Distributed

DL Definitions - 4
• NOT a “digitized library”
• NOT a “deconstruction” of existing systems and institutions, moving them to an electronic box in a Library
• IS a new way to deal with knowledge
  – Authoring, Self-archiving, Collecting,
  – Organizing, Preserving,
  – Accessing, Propagating, Re-using

Digital Library Content
(Diagram labels)
• Content types: Text Documents; Video; Audio; Geographic Information; Software, Programs; Bio Information; Images and Graphics
• Examples: Articles, Reports, Books; Speech, Music; (Aerial) Photos; Models, Simulations; Genome; Human, animal, plant; 2D, 3D, VR, CAT

DL Manifesto - 1
• DL Reference Model
• In support of the future European Digital Library
• Developed by team connected with DELOS (Candela, Castelli, Ioannidis, Koutrika, Meghini, Pagano, Ross, Schek, Schuldt)
• Draft 2.2 presented in Frascati, near Rome, June 2006 – 79 pages
• Could be integrated with work of DLF, JISC, etc.

DL Manifesto – 2: 3 Tiers

DL Manifesto – 3: Main Concepts

DL Manifesto – 4: Actor Roles

Fox & Gonçalves DL Book Parts
• Ch. 1. Introduction (Motivation, Synopsis)
• Part 1 – The “Ss”
• Part 2 – Higher DL Constructs
• Part 3 – Advanced Topics
• Appendix

Book Parts and Chapters - 1
• Ch. 1. Introduction (Motivation, Synopsis)
• Part 1 – The “Ss”
  – Ch. 2: Streams
  – Ch. 3: Structures
  – Ch. 4: Spaces
  – Ch. 5: Scenarios
  – Ch. 6: Societies

Informal 5S & DL Definitions
DLs are complex systems that
• help satisfy info needs of users (societies)
• provide info services (scenarios)
• organize info in usable ways (structures)
• present info in usable ways (spaces)
• communicate info with users (streams)

A Minimal DL in the 5S Framework
(Diagram labels: Streams; Structures; Spaces; Scenarios; Societies; Structured Stream; Structural Metadata Specification; Descriptive Metadata Specification; services; indexing; browsing; searching; hypertext; Digital Object; Collection; Metadata Catalog; Repository; Minimal DL)

Book Parts and Chapters - 2
• Part 2 – Higher DL Constructs
  – Ch. 7: Collections
  – Ch.
8: Catalogs
  – Ch. 9: Repositories and Archives
  – Ch. 10: Services
  – Ch. 11: Systems
  – Ch. 12: Case Studies

Book Parts and Chapters - 3
• Part 3 – Advanced Topics
  – Ch. 13: Quality
  – Ch. 14: Integration
  – Ch. 15: How to build a digital library
  – Ch. 16: Research Challenges, Future Perspectives
• Appendix
  – A: Mathematical preliminaries
  – B: Formal Definitions: Ss
  – C: Formal Definitions: DL terms, Minimal DL
  – D: Formal Definitions: Archeological DL
  – E: Glossary of terms, mappings

DL Curriculum Framework
(Diagram with three regions labelled RELATED TOPICS, CORE DL TOPICS, and COURSE STRUCTURE; labels listed in extraction order)
• Course structure
  – Semester 1: DL collections: development/creation
  – Semester 2: DL services and sustainability
• Topics
  – Digitization, Storage, Interchange
  – Metadata, Cataloging, Author submission
  – Digital objects, Composites, Packages
  – Architectures (agents, buses, wrappers/mediators), Interoperability
  – Spaces (conceptual, geographic, 2/3D, VR)
  – Documents, E-publishing, Markup
  – Multimedia streams/structures, Capture/representation, Compression/coding
  – Bibliographic information, Bibliometrics, Citations
  – Content-based analysis, Multimedia indexing
  – Naming, Repositories, Archives
  – Services (searching, linking, browsing, etc.)
  – Archiving and preservation, Integrity
  – Architectures (agents, buses, wrappers/mediators), Interoperability
  – Thesauri, Ontologies, Classification, Categorization
  – Multimedia presentation, rendering
  – Info. Needs, Relevance, Evaluation, Effectiveness
  – Intellectual property rights mgmt., Privacy, Protection (watermarking)
  – Routing, Filtering, Community filtering
  – Search & search strategy, Info seeking behavior, User modeling, Feedback
  – Info summarization, Visualization

Project Teams/NSF Grant
• Project Team at VT (IIS-0535057):
  – PI: Dr. Edward A. Fox (fox@vt.edu)
  – GRA: Seungwon Yang (seungwon@vt.edu)
• Project Team at UNC-CH (IIS-0535060):
  – Co-PI: Dr. Barbara Wildemuth (wildem@ils.unc.edu)
  – Co-PI: Dr.
Jeffrey Pomerantz (pomerantz@unc.edu)
  – GRA: Sanghee Oh (shoh@email.unc.edu)

DLs & Scholarly Communication
• Asynch
• Information Life Cycle
• Flattening
• Author skills, toward Semantic Web
• Crossing the Chasm
• OAI

Asynchronous, Digital Library Mediated Scholarly Communication
Different time and/or place

Information Life Cycle
(Diagram labels: Authoring; Modifying; Using; Creating; Retention / Mining; Organizing; Indexing; Accessing; Filtering; Storing; Retrieving; Distributing; Networking)

Digital Libraries Shorten the Chain from
(Diagram labels: Editor; Reviewer; Publisher; A&I; Consolidator; Library)

DLs Shorten the Chain to
(Diagram labels: Author; Teacher; Reader; Editor; Reviewer; Learner; Librarian; Digital Library)

Important skills for authors
• Authoring (Word Processing -> e-pub)
• Rendering, presenting
• Tagging, Markup (XML, SGML)
• “Semi-structured information”
• Dual-publishing, eBooks
• Styles (XSL, XSLT)
• Structured queries

OAI – Repository Perspective
(Diagram labels: Required: Protocol; MDO, repeated; DO, repeated)

OAI – Black Box Perspective
(Diagram labels: OA 1 through OA 7)

The World According to OAI
(Diagram labels: Service Providers; Discovery; Current Awareness; Preservation; Data Providers)

Institutional Repositories
• Definitions, Goals
• Eprints
• DSpace
• Fedora, VITAL
• Comparisons
• ODL + 5S Suite (not shown)

Institutional Repositories - 1
• “Institutional repositories are digital collections that capture and preserve the intellectual output of a single university or a multiple institution community of colleges and universities.”
• Crow, R. “Institutional repository checklist and resource guide”, SPARC, Washington, D.C., USA
• www.arl.org/sparc/IR/IR_Guide_v1.pdf

Institutional Repositories - 2
• “A university-based institutional repository is a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members.
It is most essentially an organizational commitment to the stewardship of these digital materials, including long-term preservation where appropriate, as well as organization and access or distribution.”
• Lynch, C.A. In ARL Bimonthly Report 226, pp. 1-7, Feb. 2003, www.arl.org/newsltr/226/ir.html

What is a Digital Object Repository?
• Also called: digital rep., digital asset rep., institutional repository
• Stores and maintains digital objects (assets)
• Provides external interface for Digital Objects Creation, Modification, Access
• Enforces access policies
• Provides for content type disseminations
Adapted from Slide by V. Chachra, VTLS

Goals of Institutional Repositories (by Stevan Harnad, U. Southampton)
• Self Archiving of Institutional Research
• Theses and Dissertations (VTLS NDLTD Project)
• Article preprints and post prints
• Internal documents and maps
• Management of digital collections
• Preservation of materials – decentralized approach
• Housing of teaching materials
• Electronic Publishing of journals, books, posters, maps, audio, video and other multimedia objects
Adapted from Slide by V. Chachra, VTLS

What is Fedora™?
Flexible Extensible Digital Object Repository Architecture
• Slides courtesy Vinod Chachra of VTLS

History of Fedora™
• 1997-Present
  – DARPA and NSF-funded research project at Cornell (Conceptual framework developed by Sandra Payette and Carl Lagoze)
  – Reference implementation developed at Cornell
• 1999-2001
  – University of Virginia digital library prototype (Thornton Staples and Ross Wayland)
• 2002-Present
  – Andrew W. Mellon Foundation granted Virginia and Cornell $1 million to develop a production-quality Fedora system
  – Fedora 1.0 released in May 2003 as Open Source under the Mozilla Public License.
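
Sketch: a toy digital object repository
The functions listed on the "What is a Digital Object Repository?" slide (store objects, offer an interface for creation, modification, and access, enforce access policies, provide disseminations) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class names, method names, PID format, and the reader-list policy are all hypothetical, and none of it is the Fedora API or any vendor's implementation.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class DigitalObject:
    """A digital object: a persistent ID plus named datastreams."""
    pid: str
    datastreams: dict = field(default_factory=dict)  # name -> bytes


class Repository:
    """Toy in-memory repository; every name here is illustrative."""

    def __init__(self, namespace, readers):
        self._namespace = namespace
        self._readers = set(readers)  # assumed policy: users allowed to read
        self._objects = {}
        self._serial = count(1)

    def create(self, **datastreams):
        """Store a new object and return its generated PID."""
        pid = f"{self._namespace}:{next(self._serial)}"
        self._objects[pid] = DigitalObject(pid, dict(datastreams))
        return pid

    def modify(self, pid, name, content):
        """Add or replace one datastream of an existing object."""
        self._objects[pid].datastreams[name] = content

    def access(self, user, pid, name):
        """Return one datastream, checking the access policy first."""
        if user not in self._readers:
            raise PermissionError(f"{user} may not read {pid}")
        return self._objects[pid].datastreams[name]

    def disseminate(self, user, pid, name, max_bytes):
        """A simple 'view' of the content: a truncated preview."""
        return self.access(user, pid, name)[:max_bytes]


# Usage: create an object, add a datastream, request a preview.
repo = Repository("demo", readers={"alice"})
pid = repo.create(DC=b"<dc:title>Sample thesis</dc:title>")
repo.modify(pid, "ABSTRACT", b"A short abstract.")
print(pid)
print(repo.disseminate("alice", pid, "ABSTRACT", 7))
```

Keeping the policy check inside access(), and routing disseminate() through it, means every view of the content passes the same gate; a production repository would of course need persistent storage and a richer policy model.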
Fedora™ Terms
• Metadata
• Digital Objects (data)
• Complex Objects (Object consisting of many objects in a complex/hierarchical relationship)
• Content (Data and Metadata together)
• Data-streams (are content for dissemination)
• Disseminators (are services)
  – A dissemination is defined as a stream of data that manifests a view of the digital object's content.

Digital Object w. multiple datastreams
(Diagram labels: Digital Object; Datastreams; DC; EAD; Admin Metadata)

Example Disseminators
(Diagram labels: Persistent ID (PID); Disseminators; Default: Get Profile, List Items, Get Item, List Methods, Get DC Record; Simple Image: Get Thumbnail, Get Medium, Get High, Get VeryHigh; System Metadata; Datastreams)

Fedora™ Repository
(Architecture diagram labels: Client Application; Batch Program; Web Browser; Server Application; HTTP; SOAP; FTP; Web Service Exposure Layer; Manage; Access; Search; OAI Provider; Session Management; User Authentication; Management Subsystem; Security Subsystem; Access Subsystem; Storage Subsystem; Policy Mgmt; Policy Enforcement; Object Mgmt; Component Mgmt; Object Reflection; Object Dissemination; Object Validation; Users/Groups; PID Generation; External Content Source; External Content Retriever; Digital Objects; XML Files; Datastreams; Local Service Policies; Remote Service Content; Relational DB)
Adapted from Slide by V. Chachra, VTLS

Fedora Advantage
• Extensible digital object model
• Repository exposed by Web services APIs
  – Management (Creation, Deletion, Maintenance, Validation)
  – Access (Search, Disseminations)
• Scalable, persistent storage for content and metadata
• Content can be local and/or remote
• Content versioning
• Open source solution

Comparison of DSpace and Fedora
• DSpace is a standalone product in a box whereas Fedora can be standalone or integrated with ILS
• In Fedora the metadata and the content are treated the same way as data-streams; in DSpace the metadata and content get separate treatments.
• Fedora can define complex objects more easily.
• DSpace is not as extensible as Fedora, as it deals with both the repositories and workflows. Fedora focuses only on the data model.
• Fedora uses the Mozilla licensing model and DSpace uses GNU license. The Mozilla model makes it easier for software companies to provide extensions to the model.

VITAL / Fedora Relationship

Prospero: Summary of features of the three software packages compared
• What you get
  – DSpace: A package with front-end web interface directly linked to a database
  – E-prints: A package with front-end web interface directly linked to a database
  – Fedora: A repository database, with internal database.
• Server requirements
  – DSpace: Unix environment, Java, Apache Ant, Apache Tomcat, PostgreSQL or Oracle
  – E-prints: Unix environment, Perl, Apache+mod-perl, MySQL
  – Fedora: Unix or Windows, Java. (optional: MySQL or Oracle)
• Subject classification
  – DSpace: Yes
  – E-prints: Yes
  – Fedora: Yes
• Community groups
  – DSpace: Yes
  – E-prints: No
  – Fedora: Possible but … (see below)
• Where from?
  – DSpace: MIT and Hewlett-Packard.
  – E-prints: Southampton University, outcome of a JISC project.
  – Fedora: Cornell University and the University of Virginia Library.

NDLTD
• DL case study
• Goals
• How, Workflow
• Union Catalog
• Services atop the Union Catalog
• Sustainability and Impact
• UK related report (Aug. 2006)

A Digital Library Case Study
• Domain: graduate education, research
• Genre: ETDs = electronic theses & dissertations
• Submission: http://etd.vt.edu
• Collection: http://www.theses.org
Project: Networked Digital Library of Theses & Dissertations (NDLTD) http://www.ndltd.org

NDLTD Goals
• For Students:
  – Gain knowledge and skills for the Information Age, especially about Digital Libraries
  – Richer communication (digital information, multimedia, …)
• For Universities:
  – Easy way to enter the digital library field and benefit thereby
• For the World:
  – Global digital library – large, useful, many services

NDLTD: How can a university get involved?
• Select planning/implementation team
  – Graduate School
  – Library
  – Computing / Information Technology
  – Institutional Research / Educ. Tech.
• Join online, give us contact names
  – www.ndltd.org/join
• Adapt Virginia Tech or other proven approach
  – Build interest and consensus
  – Start trial / allow optional submission

(Workflow diagram labels: Student Gets Committee Signatures and Submits ETD; Signed; Grad School; Library Catalogs ETD, Access is Opened to the New Research; WWW; NDLTD)

Union catalog: OCLC
• OCLC will expand OAI data provider on TDs.
• Is getting data from WorldCat (so, from many sites!).
• Will harvest from all others who contact them.
• Need DC and either ETD-MS or MARC.
• Has a set for ETDs.

ETD Union Search Mirror Site in China (CALIS)
(http://ndltd.calis.edu.cn – popular site!)

VTLS Union Catalog Content Languages
The VTLS NDLTD Union Catalog has data in 6 different languages. These are:
• English
• German
• Greek
• Korean
• Portuguese
• Spanish
Examples follow

Full-text Services
• Running since Sept 2005: Scirus
• In beta test: Google Scholar
• Challenges:
  – Data quality problems
  – Inconsistency in way to get from metadata to the full-text file(s)
  – Broadening the coverage since OAI use has not spread as widely as we would like

What are we doing?
• Aiding universities to enhance graduate education, publishing and IPR efforts
• Helping improve the availability and content of theses and dissertations
• Educating ALL future scholars so they can publish electronically and effectively use digital libraries (i.e., are Information Literate and can be more expressive)
-> support Open Access

UK Report of Aug. 2006
• EVALUATION OF OPTIONS FOR A UK ELECTRONIC THESIS SERVICE
• Study report edited by Alma Swan
• Key Perspectives Ltd & UCL Library Services
• EThOS project (Electronic Theses Online Service) - commissioned to develop a model for a workable, sustainable and acceptable national service for the provision of open access to electronic doctoral theses.

EThOS: Stakeholders
• Academic registrars
• University administrators (graduate schools)
• Librarians
• Repository managers (3; 2)
• Authors (or potential authors) of theses and dissertations

Assessment of the organisational models
(Three models compared: Distributed model, Centralised model, Mixed architecture model)
• Viability
  – Distributed model: Dependent upon individual institutions’ capabilities and resources, which are highly variable
  – Centralised model: Good, providing service provider selects correct business model and satisfies HEI concerns on rights, liabilities, etc.
  – Mixed architecture model: Good, providing service provider selects correct business model and satisfies HEI concerns on rights, liabilities, etc.
• Disadvantages
  – Distributed model: Dependent upon individual institutions’ capabilities and resources, which are highly variable. This would lead to a service of patchy quality for at least a decade. Potentially chaotic with respect to standards and consistency levels
  – Centralised model: HEIs lose control to an extent and may lose some benefits in terms of PR and other institutional-purpose benefits that accrue with local service provision
  – Mixed architecture model: Offers potential for inconsistencies unless well-managed by hub provider
• Advantages
  – Distributed model: Self-organising, cheap, simple
  – Centralised model: HEIs need only to provide access to e-theses; central service provider does the rest. Standards applied across the board. Guaranteed consistent access. Scope for added-value services. One interface; a true national collection as well as a national gateway. Easy to hook up to other national or international services.
  – Mixed architecture model: Gives the greatest flexibility to HEIs to select the most appropriate options; HEIs can retain control of selected elements. Standards applied across the board. Guaranteed consistent access. Scope for added-value services. One interface (multiple sites of supply). National gateway. Easy to hook up to other national or international services.
• HEI community views
  – Distributed model: Strong feeling against this option
  – Centralised model: Second most popular option
  – Mixed architecture model: Highest level of support for this option
• Comments
  – Distributed model: No support in the HEI community
  – Centralised model: Strong support within HEI community
  – Mixed architecture model: Very strong support within HEI community

EThOS Survey: familiar with IPR issues related to e-theses
• 8% know very little
• 30% not very familiar
• 51% familiar
• 11% very familiar

EThOS Survey: my institution’s handling of PhD e-theses
• 83% not yet
• 11% from some students
• 5% from most students
• 1% from all students

EThOS Survey: my institution’s policy position on PhD e-theses
• 55% no policies yet
• 34% currently planning policies
• 11% has a policy

EThOS: Benefits
• Hugely increased visibility of UK doctoral research output
• Resulting in increased usage and impact of UK doctoral research output
• The opportunities for resulting new research efforts and collaborations

Summary: Key Ideas
• Theorem 1: Supporters of Open Access should support NDLTD.
• Theorem 2: 5S can guide us to better support of Open Access.

Theorem 1: Supporters of Open Access should support NDLTD - 1
• DLs will lead to enormous benefit at all levels, from personal to global.
• An IR is a type of DL, in the middle of the levels (requiring support from below, and providing support for above levels).
• Having a DL at every university (i.e., IR) greatly encourages Open Access.

Theorem 1: Supporters of Open Access should support NDLTD - 2
• The easiest way to launch an IR at a university is with ETDs.
• NDLTD is the lead world organization promoting ETD activities.
• NDLTD’s goals are all in support of Open Access and IRs.

Theorem 2: 5S can guide us to better support of Open Access - 1
• 5S helps us think formally about Open Access, hence clearly, hence to find focus.
• 5S helps us design and build DLs, hence IRs.
• Societies
  – Individuals: members of institution, discipline
  – Social influence can promote DL (re)use.
  – Economic, political, and social issues lead us to a distributed architecture.

Theorem 2: 5S can guide us to better support of Open Access - 2
• Distributed infrastructure + services lead us to harvesting (vs. federation, gathering).
• 5S helps make harvesting a success:
  – Streams of content flow from individuals.
  – Structures: ETD-ms, (browsing) classification
  – Spaces: indexes, interfaces
  – Scenarios: submission, workflow, harvesting
  – Societies (see above)
    • More collaboration (social networks)
    • Prestige is more widely spread.
    • Access is more open

DL Futures
• History
• People, Content, Tools
• Sustainable Infrastructure
• Future Work
• Links
• For More Information

People
• Digital librarians
• DL system developers
• DL system administrators
• DL managers
• DL collection development staff
• DL evaluators
• DL users
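
Sketch: harvesting ETD metadata with OAI-PMH
Theorem 2 argues that a distributed infrastructure leads to harvesting, and the union catalog slide asks data providers for DC metadata. As a hedged illustration of what a harvester does, the sketch below builds an OAI-PMH ListRecords request and pulls identifiers and Dublin Core titles out of a response. The base URL, the set name "etd", and the sample record are invented for this example; only the verb, the argument names, and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications. No network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


# Hypothetical response from an imaginary data provider.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:etd-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample thesis</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://example.org/oai", set_spec="etd")
records = parse_records(SAMPLE)
print(url)
print(records)
```

A real harvester would also follow resumption tokens for large result sets and handle error responses; both are omitted here to keep the sketch short.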