The Open Archives Initiative (OAI) and Electronic Theses and Dissertations (ETDs)
ASIDIC ‘2000
Orlando, FL - March 27, 2000
Edward A. Fox (fox@vt.edu)
http://fox.cs.vt.edu
Virginia Tech, Blacksburg, VA, USA

Acknowledgements (Selected)
  Sponsors: ACM, Adobe, ARL, Belgian Science Found., CLIR, DARPA, IBM, LANL, Microsoft, NSF, OCLC, SPARC, US Dept. of Ed. (FIPSE), …
  VT Faculty/Staff: Tony Atkins, Thomas Dunbar, John Eaton, Gwen Ewing, Peter Haggerty, Gary Hooper, Gail McMillan, Len Peters, James Powell
  VT Students: Emilio Arce, Fernando Das Neves, Brian DeVane, Robert France, Marcos Goncalves, Scott Guyer, Robert Hall, Neill Kipp, Paul Mather, Tim McGonigle, Todd Miller, Constantinos Phanouriou, William Schweiker, Ohm Sornil, Hussein Suleman, Patrick Van Metre, Laura Weiss

Virginia Tech Background
  Largest university in Virginia, land-grant, football, town population 35K plus 25K students
  Blacksburg Electronic Village, since 1992, with > 80% of community on Internet
  Net.Work.Virginia, largest ATM network, with over 750 sites, for education, research, gov’t
  LMDS, Local Multipoint Distribution Service, gigabit wireless networking - 1/3 of Virginia
  Math Emporium, 500 workstations
  Faculty Development Initiative, round 2

Digital Libraries Shorten the Chain from
  [Diagram labels: Editor, Reviewer, Publisher, A&I, Consolidator, Library]

DLs Shorten the Chain to
  [Diagram labels: Author, Teacher, Digital Library, Reader, Editor, Reviewer, Learner, Librarian]

The Networked Digital Library of Theses and Dissertations
www.NDLTD.org
  Training Authors
  Expanding Access
  Preserving Knowledge
  Improving Graduate Education
  Enhancing Scholarly Communication
  Empowering Students & Universities
  Leader of the Worldwide ETD (Electronic Thesis and Dissertation) Initiative

Open Archives initiative
OAi
www.openarchives.org
openarchives@openarchives.org

OAi Philosophy
  Self-archiving = submission mechanism
  Long-term storage system = archive
  Open interface = harvesting mechanism
  Data provider + service provider
  Start with “gray literature” –
    e-prints/pre-prints, reports, dissertations, …

Tiered Model of Interoperability
  [Diagram labels: Mediator services; Metadata harvesting; Document models; Repository of Digital Objects; Repository Access Protocol; handle; terms and conditions; Digital object]

Open Archives initiative History
  xxx at LANL = Los Alamos National Laboratory (Ginsparg) for high-energy physics - 1991
  CSTR + WATERS = NCSTRL (Lagoze) - 1994
  xxx + NCSTRL = CoRR collaboration - 1998
  UPS (Universal Preprint Service) – 1999 mtg
    – Herbert Van de Sompel (U. Ghent, SFX) …
    – Dublin Core (DC), XML
    – Dienst protocol and software (Lagoze)
  Renamed late 1999 as OAi

Open Archives (protoproto)
  ArXiv & Los Alamos National Lab
  CogPrints & U. Southampton
  NACA & NASA (reports)
  NCSTRL & Cornell U.
  NDLTD & Virginia Tech
  RePEc & U. Surrey
  Total of around 200K records

Original Open Archives Members
  Caroline Arms, Library of Congress
  Leslie Carr, University of Southampton
  Mark Doyle, American Physical Society
  Dale Flecker, Harvard University
  Edward A. Fox, Virginia Tech
  Michael Friedman, HighWire Press, Stanford U.
  Paul M. Gherman, Vanderbilt U. & SPARC
  Paul Ginsparg, Los Alamos National Lab. & xxx
  Stevan Harnad, University of Southampton
  Thomas Krichel, University of Surrey & RePEc
  Carl Lagoze, Cornell University
  …

Original Open Archives Members cont’d
  Rick Luce, Los Alamos National Laboratory
  Clifford Lynch, Coalition for Networked Info.
  Kurt Maly, Old Dominion University
  Michael Nelson, NASA Langley Research Center
  John Ober, California Digital Library
  Bob Parks, Washington University & EconWPA
  Herbert Van de Sompel, University of Ghent
  Eric F. Van de Velde, Caltech
  Don Waters, The Andrew W. Mellon Foundation
  Ken Weiss, California Digital Library

Open Archives Future
  EconWPA (U.
  Washington)
  e-biomed -> PubMed Central (NIH)
  PubScience (DOE)
  Clinical Medicine Netprints (+ other HighWire Press holdings)
  University ePub (California Digital Library)
  All public e-prints (MIT)
  Scholar’s Forum (Caltech)
  Int’l: CERN, Germany, India, Mexico, …
  Goal: millions of books/articles/reports / yr

Approaches to Open Archives
  Build By Institution
  Build By Discipline
  [Diagram labels: Author, Category, Interdisciplinary, Year, Language, Query, …]

Open Archives initiative (OAi)
www.openarchives.org
  Santa Fe meeting, Oct. 21-22, 1999, protoproto
  Next mtg June 3, San Antonio, between HT’00 & DL’00
  LANL, CNI, DLF, Mellon, …
  Convention (see Feb. D-Lib Magazine article)
  Archives -> Open Archives
    – Support unique archive identifiers
    – Implement Open Archives Metadata Set (DC-based, using XML)
    – Implement Dienst harvesting interface
    – Register the archive
  Build tools, layer other services: linking, searching, …

Figure 1. Layers Related to Open Archives Initiative
  [Layered diagram. Layer and box labels, in extraction order: Services; Citation / Linking; Authoring; Submission; SFX; Editorial: Reviewing, Certification; CiteSeer; Summarization; Metadata Creation; Registry; Citation Checking; Archives: Name, ID, Description, Terms and Conditions, …; Text/MM Editing; Citation DB Updating; Authority Control; Preservation Conversion; Metadata Formats: Name, XML DTD, …; Gazetteer; Cataloging; Copy-Edit / Add Value; Archive Formats: Name, Standard, Preservation Process, …; Search/Browse; Protocols: …; Annotation; Collaboration; Services; Tools; …; Repository]

Repository for NDLTD
  [Diagram labels: Metadata Formats: OA Metadata Set, NDLTD Standard (DC-based); Set; Transaction Log; Training Resources; Open Archives Harvesting Protocol; VT Partition: Record (Metadata), Record (Full Content); UVA Partition: Metadata …, Content …; Caltech Partition: Metadata, Content; NCSTRL Repository; EconWPA Repository; RePEc Repository; …]

Mechanisms
  Sharing
    – Join federation, run software
    – Make metadata and archive available
  Aggregating
    – By discipline
    – By institution
    – By
      genre
  Automating
    – Workflow
    – Harvesting and providing services
    – Federated searching
    – Dynamic linking (e.g., with SFX)

Report on Open Archives work in progress at Virginia Tech
  With students:
    – Hussein Suleman (hussein@vt.edu)
    – Dave Watkins (dwatkins@cs.vt.edu)
    – Robert France (france@vt.edu)
    – Marcos Andre Goncalves (mgoncalv@cs.vt.edu)

VT View of the Open Archives initiative (OAi)
  Enable sharing of publication metadata and full-text by digital libraries
  Standardize low-level mechanisms to share contents of libraries
  Build higher-level user-centric and administrative services in meta-libraries
  Install organizational mechanisms to support the technical processes

Virginia Tech Projects
  MARC XML-DTD
  Computer Science Teaching Centre (CSTC)
  W3C Web Characterization Repository
  OAi Repository Explorer
  Networked Digital Library of Theses and Dissertations (NDLTD)

MARC XML-DTD
  XML Transport format for US-MARC records
  Standardized metadata exchange format for traditional library services joining OAi

CS Teaching Center (CSTC)
  Collection of reviewed online resources used to aid in teaching of Computer Science
  Supports author submission and peer-review process for new ACM Journal of Educational Resources In Computing (JERIC)
  Connected with NSDL (NSF 00-44)
  http://www.cstc.org

W3C Web Characterization Repository
  Online database of metadata related to publications, tools and data sets dealing with Web characterization
  Project of the Web Characterization Activity working group of the World-Wide-Web Consortium (www.w3c.org/WCA)
  http://purl.org/net/repository

OAi Repository Explorer
  Serves as a compliancy test
  Allows browsing of open archives using only OAi protocol
  Sends requests on behalf of user, parses and checks responses and displays browsable interface
  Will detect most discrepancies in protocol
  http://purl.org/net/explorer

NDLTD
  Work has begun on interoperability between Virginia Tech and partners in Germany
  Wrappers have been created to harvest data from remote sites which use
    other protocols
  Harvested data to be stored in a central OAi-compliant database (work in progress)

[Diagram labels: Grad Program, Library, IT, Ed Tech]

A Digital Library Case Study
  Domain: graduate education, research
  Genre: ETDs = electronic theses & dissertations
  Submission: http://etd.vt.edu
  Collection: http://www.theses.org
  Project: Networked Digital Library of Theses & Dissertations (NDLTD) http://www.ndltd.org with 225 people at 3rd Intl Symposium, March 2000

What are we doing?
  Aiding universities to enhance graduate education, publishing and IPR efforts
  Helping improve the availability and content of theses and dissertations
  Educating ALL future scholars so they can publish electronically and effectively use digital libraries (i.e., are Information Literate and can be more expressive)

Key Ideas: Scalability
  Networked infrastructure
  University collaboration
  Workflow, automation
  Education is the rationale
  Maximal Access
  8th graders vs. grads
  Authors must submit
  Standards
    – PDF, SGML, MM, MARC, DC, URNs, Federated search

[Workflow diagram]
  Student Defends & Finalizes ETD
  Student Gets Committee Signatures and Submits ETD
  Graduate School Approves ETD, Student is Graduated
  Library Catalogs ETD, Access is Opened to the New Research
  [Diagram labels: My Thesis, ETD, Signed, Grad School, Ph.D., Library, WWW, NDLTD]

User Search Support (multilingual, XML)
  [Diagram labels: NDLTD; World; Federated Search; User Interface; Virginia Tech ... (univ); Dissertations Online (Germany); OhioLink (lib / univ group); Portuguese NL ... (national lib); Australia (regional); OAS, ISTEC (Latin America)]
  Note: All groups shown are connected with NDLTD.

www.theses.org
  James Powell student project, D-Lib Magazine description in Sept. 1998
  XML description of each site
    – type of search engine / service
    – language
    – coverage (for resource discovery)
  Adding Z39.50 gateway capability and integrating with MARIAN, along with Harvest and Open Archives protocols

Access Possibilities
  [Diagram labels: Web search engines; www.theses.org; Virginia Tech; MIT; National Library of Portugal; www.openarchives.org; library catalog clients; CBUC (Spain); OhioLink; 3rd Party Services (e.g., UMI); National Projects: AU, GE, …]

PetaPlex Digital Library Machine (“super” object store)
  Parallel computer / storage utility
  Knowledge Systems Incorporated is supplying VT-PetaPlex-1 with
    – high speed backbone connection
    – 2.5 terabytes through 100 nodes: Net connection + 25GB disk + 233 MHz Pentium + Linux

How does this relate to UMI?
  1987 UMI workshop to explore ETDs
  Support letter for US Dept. of Ed. proposal
  Steering committee membership
  ProQuest Direct pilot of scanning works started 1/1/97, free 2 yr access to front part
  Collaborating on:
    – accepting electronic author submissions
    – standards (e.g., representation)

ETD Initiative (and UMI)
  [Diagram labels, in extraction order: Students; Learn about DL, EPub; TDs become more expressive; Global TDs become more accessible, archived; Universities; UMI; N. Amer. (T)Ds are accessible, archived]

NDLTD Members
  [Chart: Number of Members (axis 0 to 80) by Date Joined, 3/11/97 to 11/11/99]

US University Members (41)
  Air University (Alabama)
  Baylor University
  Brigham Young University
  Caltech
  Clemson University
  College of William & Mary
  Concordia University (Illinois)
  East Carolina University
  East Tenn. State U. – require fall 2000
  Florida Institute of Tech.
  Florida International University
  George Washington University
  Marshall University (W. Va.)
  Miami U. of Ohio
  MIT
  Michigan Tech
  Naval Postgraduate School (CA)
  North Carolina State U.
  Penn. State University
  Rochester Institute of Tech.
  U. of Colorado Health Science Center
  U. of Florida
  U. of Georgia
  University of Hawaii, Manoa
  U. of Iowa
  U. of Kentucky
  U. of Maine
  U. of North Texas – required since 8/99
  U. of Oklahoma
  U. of South Florida
  U. of Tennessee, Knoxville
  U. of Tennessee, Memphis
  U. of Texas at Austin
  U. of Virginia
  U. Wisconsin - Madison
  Vanderbilt U.
  Virginia Commonwealth U.
  Virginia Tech - required since 1/97
  West Virginia U. - required fall 1998
  Western Michigan U.
  Worcester Polytechnic Inst.

Institutional Members
  Coalition for Networked Information (CNI)
  Committee on Institutional Cooperation (CIC)
  Diplomica.com
  Dissertation.com
  Dissertationen Online (Germany)
  Ibero-American Science & Technology Education Consortium (ISTEC, www.istec.org)
  National Library of Portugal (for all universities)
  Organization of American States (SEDI/OAS)
  UNESCO (www.unesco.org/webworld/etd)

Australian Project Members
  U. New South Wales (lead institution)
  U. of Melbourne
  U. of Queensland
  U. of Sydney
  Australian National University
  Curtin U. of Technology
  Griffith U.

German Project Members
  Humboldt University (lead institution)
  3 other universities
  5 learned societies
    – Mathematics, Physics, Chemistry, Sociology, Education
  1 computing center
  2 major libraries

CBUC (www.cbuc.es, Spain)
  Consorci de Biblioteques Universitàries de Catalunya, as group, with 9 members:
    – Universitat de Barcelona
    – Universitat Autònoma de Barcelona
    – Universitat Politècnica de Catalunya
    – Universitat Pompeu Fabra
    – Universitat de Girona
    – Universitat de Lleida
    – Universitat Rovira i Virgili
    – Universitat Oberta de Catalunya
    – Biblioteca de Catalunya

Other International Members
  Chinese University of Hong Kong
  Chungnam National U. (S. Korea - CS)
  City University, London (UK)
  Darmstadt U. of Tech. (Germany)
  Free University of Berlin (GE - Vet. Med.)
  Gyeongsang National U. (Korea)
  Indian Institute of Tech., Bombay (India)
  Nanyang Technological U. (Singapore, pt)
  National U. of Singapore (Singapore, pt)

Other International Members cont’d
  Polytechnic University of Valencia (Spain)
  Rhodes U. (South Africa)
  St. Petersburg St. Tech. U. (Russia)
  Univ. de las Américas Puebla (Mexico)
  Univ. of Alicante (Spain)
  Univ. of Pisa (Italy)
  U. Laval; U. of Guelph; U. Waterloo; Wilfrid Laurier U. (Canada), …

What are the long term goals?
  400K US students / year getting grad degrees are exposed / involved
  200K/yr rich hypermedia ETDs that may turn into electronic portfolios (images, video, audio, …)
  Dramatic increase in knowledge sharing: literature reviews, bibliographies, …
  Services providing lifelong access for students: browse, search, prior searches, citation links
  Hundreds/thousands of downloads / year / work

For professional societies
  Like “writing across the curriculum”, e.g., Chemical Markup Language, MathML, …
  Besides writing: computing/communications, information literacy, personal digital library management, tool use, research methods, collaboration, archiving/preservation
  Data sets, communities of users of them
  Classification systems / browsing / searching

Extending Services - 1 of 2
  Working with publishers
    – Motivate students: awards, …
    – Publicize support of NDLTD: ACM, ACS, IEEE-CS, Elsevier, …
    – Allow students to increase level of access
  Arranging preservation
    – Mirroring worldwide
    – Involving long-term trusted parties

Extending Services - 2 of 2
  Adding services currently prototyped
    – annotation and SDI (routing) capabilities
    – Dublin Core metadata, crosswalk to MARC
    – support for XML, *ML, preservation
    – harvesting, federated search
  Adding other services planned
    – building/using citation DB (CiteSeer, SFX, …)
    – implementing plagiarism check (like “SCAM”)

Remember!
  Digital Libraries (technology base)
  OAi (help establish enormous international cooperative of data and service providers)
  NDLTD - improve graduate education
    – www.ndltd.org/join
    – (www.ndltd.org/talks for this)
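
Backup note: Dublin Core to MARC crosswalk (illustrative sketch)
  The crosswalk to MARC listed under Extending Services might look roughly like the Python sketch below. It is not the NDLTD implementation. The function names (crosswalk, render) and the sample record are hypothetical, and the tag table is an assumed subset modeled on the commonly cited Library of Congress mapping for unqualified Dublin Core; it should be checked against the published crosswalk before any real use.

```python
# Illustrative sketch only: a tiny unqualified Dublin Core -> MARC crosswalk.
# ASSUMPTION: the tag/subfield table below is a partial mapping modeled on the
# Library of Congress unqualified crosswalk; verify it before relying on it.
DC_TO_MARC = {
    "title": ("245", "a"),
    "creator": ("720", "a"),
    "subject": ("653", "a"),
    "description": ("520", "a"),
    "publisher": ("260", "b"),
    "date": ("260", "c"),
    "language": ("546", "a"),
    "rights": ("540", "a"),
}


def crosswalk(dc_record):
    """Map {dc_element: [values]} to (tag, subfield, value) triples.

    Returns (fields, unmapped): fields sorted by MARC tag, and the list of
    Dublin Core elements that have no entry in DC_TO_MARC.
    """
    fields = []
    unmapped = []
    for element, values in dc_record.items():
        target = DC_TO_MARC.get(element.lower())
        if target is None:
            unmapped.append(element)  # report rather than silently drop
            continue
        tag, code = target
        for value in values:
            fields.append((tag, code, value))
    fields.sort(key=lambda field: field[0])  # stable sort by tag
    return fields, unmapped


def render(fields):
    """Render triples as simple 'TAG $code value' lines for inspection."""
    return "\n".join(f"{tag} ${code} {value}" for tag, code, value in fields)


if __name__ == "__main__":
    # Hypothetical ETD record, for demonstration only.
    sample = {
        "title": ["A Hypothetical Thesis"],
        "creator": ["Doe, Jane"],
        "date": ["2000"],
        "coverage": ["Virginia"],
    }
    marc_fields, skipped = crosswalk(sample)
    print(render(marc_fields))
    print("unmapped elements:", skipped)
```

  Elements without a table entry are reported instead of being dropped, so gaps in the mapping stay visible when records from different sites are harvested.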