Moving Electronic Theses from ETD-db to EPrints: The Best of Both Worlds

Betsy Coles
Technical Manager, Digital Library Systems
California Institute of Technology
bcoles@library.caltech.edu

Katherine Johnson
CODA Coordinator and Metadata Librarian
California Institute of Technology
kjohnson@library.caltech.edu

CNI Spring 2010 Membership Meeting, Baltimore, MD
April 12, 2010

Where We Came From
The Background

Background: CODA
  The Caltech Collection of Digital Archives (CODA) grew out of a longstanding commitment to scholarly communication and open access
  Since its inception, CODA has been integral to the Caltech Library’s mission
    Not a separate department or function
    Many staff involved, from all levels
    No special funding – support is from general operating budget

Background: CODA
  First repository: Computer Science Technical Reports, using EPrints software from the University of Southampton (April 2001)
  Thesis archive, using ETD-db software from Virginia Tech (July 2001)
  In 2010: largest CODA repository is CaltechAUTHORS, with almost 15,000 non-thesis items

Background: Why ETD-db for Theses?
  In 2001, there were not many options!
  ETD-db was designed specifically for theses
    Thesis-specific metadata
    Support for thesis workflow and approval process
    Support for withholding theses, or parts of theses, pending journal publication or patent application
    “Notifications” allowing staff to communicate with authors via email

Background: ETD Evolution
  Voluntary ETD submission for the first year (2001/2002)
  Mandatory submission for PhD students after July 2002
  Electronic thesis is now the version of record, and we are committed to its preservation
  But we still keep and bind a paper copy

Background: From Paper to Bits
  Retrospective conversion of older theses began in 2002
  Library staff handle both scanning and submitting
  No dedicated staff – it’s a “spare time” activity, and many staff participate
  We are currently more than halfway through the backlog of pre-2002 theses

Background: Numb3rs
  New theses: we add between 180 and 250 per year
  Retrospective conversion: we add about 500 per year
  As of April 2010:
    5,518 electronic theses in collection
    1,433 of these were born digital

Background: Numb3rs
  Usage is high; most users come from Google
  In March 2010:
    19,000+ website visits from 14,000+ unique users
    More than 22,000 document file downloads (.pdf, .doc, and .ps)
    More than 4,300 supplemental file downloads (video, data files, software, etc.)

What’s Right, What’s Wrong?
The Problem

Problem: Multiple Platforms
  As of 2008, we were running ETD-db for theses and EPrints for everything else
    Duplication of effort for software maintenance and system administration
    Staff had to learn two systems
    Users had to search two interfaces

Problem: Resource Crunch
  In 2008, we were operating with reduced staff and limited resources (like everyone else)
  Since we had no dedicated repository staff or funding, efficiency was crucial

Problem: Need for Flexibility
  By 2008, ETD-db software was aging
  In contrast, the EPrints platform is in active development; EPrints 3 includes:
    Plugin architecture offering easy customization and extension to support new features and protocols
    Active development team and contributing user community worldwide
    Much more …

The Road to Oz
The Goals

Goals: One Platform
  We wanted all our repositories on one platform
  Greater efficiency would allow more forward development

Goals: Workflow
  Need to improve instructions and documentation for students submitting theses and staff processing theses
  Better understanding of thesis workflow
  No disturbance of delicate and hard-won relationships with other campus organizations involved with theses (Graduate Office, academic departments)
  Need to modernize our process for submitting theses to ProQuest/UMI

Goals: Technical
  Retain useful features of ETD-db
    Thesis-specific metadata fields and search capabilities (committee, major/minor options, etc.)
    Ability to communicate with authors, via email, from within the system interface
    Special limited-access categories for thesis materials (restricted, withheld), at the file level as well as the record level
    Support for the complex thesis-approval workflow

Goals: Technical
  Add brand-new features
    New metadata elements including advisor(s), major and minor field, funders, additional dates, references, internal notes, and others
    Ability to store and identify related documents (copyright permissions, signed thesis forms, etc.)
    in a hidden part of the record
    Expanded ability to track theses through the complex approval process
    Additional automatically generated emails

Migrating Theses to EPrints 3
The Plan

The Plan: Outsourced Elements
  EPrints Services in Southampton would create:
    Metadata conversion scripts, plus metadata and data migration scripts
    New email trigger function in EPrints
    New complex data structure for degree-granting departments
    New functionality to accommodate “hidden” documents, e.g. permissions letters, signed thesis forms

The Plan: In-House Elements
  Caltech Library Services staff would do:
    Metadata analysis and modifications
    Analysis and customization of the user interface
    Customization of the system “workflow” – the movement of theses through the stages of approval
    Migration of persistent URLs within the Caltech Library’s locally developed PURL system

The Plan: In-House Elements
  Local staff would also:
    Customize and localize screen text and help text
    Create a new web guide for student submitters
    Write documentation for library staff using the new system

The Plan: Timeline
  Six-month timespan allocated for migration (March through August 2009)
  Informal scheduling process: timeline and task list maintained on the library wiki
  Schedule did slip by one month, but the project was complete by the beginning of the academic year (Sept.
  2009)

On the Road
The Process

Process: People
  Initial task group:
    The coordinator for the ETD-db repository
    The programmer/system administrator for the digital repositories
    One subject liaison librarian with extensive CODA involvement
    The Metadata Group support staff person who processes submitted theses

Process: People
  Group expanded gradually as project progressed:
    Other subject librarians reviewed progress, especially interface issues
    Staff were asked to test specific features
    Final testing phase was open to all library staff
    Feedback was received from a wide range of staff, from the University Librarian to circulation desk staff

Process: What & Where
  EPrints Services staff in Southampton were able to fit their work into our timeline. Code was often delivered ahead of schedule
  Integration of contracted code was done at Caltech
  Testing of system components, migration process, and user interface was also done locally

Process: Testing
  We used a “staged” process to test the conversion and migration
    Multiple rounds of testing the migration process, with a larger number of records and a larger number of testers each time
    Final test was a full “dress rehearsal” of the complete migration process
  We wish we had had time for formal usability testing, but we didn’t

Process: Arrival
  The actual migration was done on a weekend
  Thesis submission was unavailable for 48 hours, but the public search interface was up
  Actual conversion process took about 8 hours
  The remainder of the weekend was devoted to reviewing and testing the results
  CaltechTHESIS was open for business Monday morning, as planned

Are We There Yet?
The Outcome

Outcome: Immediate Feedback
  First thesis was submitted by a student within hours
  We emailed the submitter:
    “Congratulations! You are the first student to have deposited his thesis into the new CaltechTHESIS database. Would you mind giving us some feedback on your experience?
    Ease of use, problems encountered, confusion?”

Outcome: Immediate Feedback
  The student’s reply:
    “Thanks! I was wondering when this had changed, realized it must have been recent. I found the submission quite easy, it took me only a couple of minutes for the whole process.”

Outcome: The View from Here
  Our experience in the months since we “went live” with CaltechTHESIS has confirmed our initial impression that we now have a modern, flexible system that provides a better user experience and smoother process for both students and library staff.

Outcome: Specifics
  Goals met – we now have:
    A better and more easily supported technical system
    More efficient thesis processing by library staff
    A better user experience for:
      Students submitting theses
      Library staff processing theses
      Searchers worldwide

Outcome: On the Horizon
  No, we’re not “done.” To-do list:
    Data cleanup remaining from the conversion (fairly minor), and filling in the new metadata fields added as part of the migration project
    Export plugin for ProQuest/UMI’s XML metadata format (currently being tested)
    ETD-MS format for OAI harvesting (in process)
    User interface tweaks and improvements (a never-ending task!)

Outcome: On the Horizon
  More “to-do’s”:
    Upgrade to recently released EPrints v. 3.2
    Upgrade to new, faster hardware and Red Hat 5 64-bit Linux operating system
    Implement available EPrints add-ons:
      IRSTATS statistics module
      DROID/PRESERV plugin to support preservation status monitoring

Outcome: On the Horizon
  Even more “to-do’s”:
    Perhaps most important: complete the task of documenting the migration in technical terms and uploading migration scripts into the EPrints wiki, so that others may make use of what we’ve done.

The Rear-view Mirror
Lessons Learned

Lesson: Best of Both Worlds?
  We avoided the “buy vs.
  build” dilemma by contracting out specific parts of the migration development work to experts, while using our own resources where our local skill set could be put to best use and where local control was crucial to success.

Lesson: Best of Both Worlds?
  We now have, in EPrints 3, a single, full-featured repository platform for all of our institutional materials, but we haven’t lost any of the valuable functionality of the older system.

The Future
  We look forward to beginning our second decade of institutional repository management with a strong and flexible foundation.

Links
  CaltechTHESIS – http://thesis.library.caltech.edu
  CaltechAUTHORS – http://authors.library.caltech.edu
  CODA – http://library.caltech.edu/digital
  Thesis workflow planning document – http://library.caltech.edu/etd/System_Independent_Thesis_Workflow.pdf
  Web guide for student submitters – http://libguides.caltech.edu/theses
  This presentation – http://resolver.caltech.edu/CaltechLIB:2010.001

More Links
  EPrints software – http://software.eprints.org
  EPrints Services – http://www.eprints.org/services/
  ETD-db software – http://scholar.lib.vt.edu/ETD-db/developer/index.shtml

Screenshots