Moving Electronic Theses from ETD-db to EPrints: The Best of Both Worlds

Betsy Coles
Technical Manager, Digital Library Systems
California Institute of Technology
bcoles@library.caltech.edu
Katherine Johnson
CODA Coordinator and Metadata Librarian
California Institute of Technology
kjohnson@library.caltech.edu
CNI Spring 2010 Membership Meeting, Baltimore, MD
April 12, 2010
Where We Came From
The Background
Background: CODA
 The Caltech Collection of Digital Archives (CODA) grew
out of a longstanding commitment to scholarly
communication and open access
 Since its inception CODA has been integral to the
Caltech Library’s mission
 Not a separate department or function
 Many staff involved, from all levels
 No special funding – support is from general operating
budget
Background: CODA
 First repository: Computer Science Technical Reports,
using EPrints software from the University of
Southampton (April 2001)
 Thesis archive, using ETD-db software from Virginia
Tech (July 2001)
 In 2010: largest CODA repository is CaltechAUTHORS,
with almost 15,000 non-thesis items
Background: Why ETD-db for
Theses?
 In 2001, there were not many options!
 ETD-db was designed specifically for theses
 Thesis-specific metadata
 Support for thesis workflow and approval process
 Support for withholding theses, or parts of theses,
pending journal publication or patent application
 “Notifications” allowing staff to communicate with authors
via email
Background: ETD Evolution
 Voluntary ETD submission for the first year (2001/2002)
 Mandatory submission for PhD students after July 2002
 Electronic thesis is now the version of record and we
are committed to its preservation
 But we still keep and bind a paper copy
Background: From Paper to Bits
 Retrospective conversion of older theses began in
2002
 Library staff handle both scanning and submitting
 No dedicated staff – it’s a “spare time” activity, and
many staff participate
 We are currently more than halfway through the
backlog of pre-2002 theses
Background: Numb3rs
 New theses: we add between 180 and 250 per year
 Retrospective conversion: we add about 500 per year
 As of April 2010
 5,518 electronic theses in collection
 1,433 of these were born digital
Background: Numb3rs
 Usage is high; most users come from Google
 In March 2010
 19,000+ website visits from 14,000+ unique users
 More than 22,000 document file downloads (.pdf, .doc,
and .ps)
 More than 4,300 supplemental file downloads (video, data
files, software, etc)
What’s Right, What’s Wrong?
The Problem
Problem: Multiple Platforms
 As of 2008, we were running ETD-db for theses and
EPrints for everything else
 Duplication of effort for software maintenance and system
administration
 Staff had to learn two systems
 Users had to search two interfaces
Problem: Resource Crunch
 In 2008, we were operating with reduced staff and
limited resources (like everyone else)
 Since we had no dedicated repository staff or funding,
efficiency was crucial
Problem: Need for Flexibility
 By 2008, ETD-db software was aging
 In contrast, the EPrints platform is in active
development; EPrints 3 includes:
 Plugin architecture offering easy customization and
extension to support new features and protocols
 Active development team and contributing user
community worldwide
 Much more ….
The Road to Oz
The Goals
Goals: One Platform
 We wanted all our repositories on one platform
 Greater efficiency would allow more forward
development
Goals: Workflow
 Need to improve instructions and documentation for
students submitting theses and staff processing theses
 Better understanding of thesis workflow
 No disturbance of delicate and hard-won relationships
with other campus organizations involved with theses
(Graduate Office, academic departments)
 Need to modernize our process for submitting theses to
ProQuest/UMI
Goals: Technical
 Retain useful features of ETD-db
 Thesis-specific metadata fields and search capabilities
(committee, major/minor options, etc.)
 Ability to communicate with authors, via email, from within
the system interface
 Special limited-access categories for thesis materials
(restricted, withheld), at the file level as well as the record
level
 Support for the complex thesis-approval workflow
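The record-level and file-level access categories listed above can be pictured with a small model. This is an illustrative sketch only, not how ETD-db or EPrints implements access control: the class names, the file names, and the rule that the more restrictive of the record and file settings wins are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Access(IntEnum):
    """Hypothetical access levels, ordered from least to most restrictive."""
    PUBLIC = 0
    RESTRICTED = 1
    WITHHELD = 2


@dataclass
class ThesisFile:
    name: str
    access: Access = Access.PUBLIC  # file-level setting


@dataclass
class ThesisRecord:
    title: str
    access: Access = Access.PUBLIC  # record-level setting
    files: List[ThesisFile] = field(default_factory=list)

    def effective_access(self, f: ThesisFile) -> Access:
        # Assumed rule: the more restrictive of the two settings applies.
        return max(self.access, f.access)

    def public_files(self) -> List[str]:
        # Names of the files that an anonymous searcher could download.
        return [f.name for f in self.files
                if self.effective_access(f) is Access.PUBLIC]
```

Under these assumptions, a public record can withhold a single chapter while the rest of the thesis stays downloadable, and restricting the whole record hides every file regardless of its own setting.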
Goals: Technical
 Add brand-new features
 New metadata elements including advisor(s), major and
minor field, funders, additional dates, references, internal
notes, and others
 Ability to store and identify related documents (copyright
permissions, signed thesis forms, etc.) in a hidden part of
the record
 Expanded ability to track theses through the complex
approval process
 Additional automatically generated emails
Migrating Theses to EPrints 3
The Plan
The Plan: Outsourced Elements
 EPrints Services in Southampton would create
 Metadata conversion scripts and metadata and data
migration scripts
 New email trigger function in EPrints
 New complex data structure for degree-granting
departments
 New functionality to accommodate “hidden” documents,
e.g. permissions letters, signed thesis forms
The Plan: In-House Elements
 Caltech Library Services staff would do
 Metadata analysis and modifications
 Analysis and customization of the user interface
 Customization of the system “workflow” – the movement
of theses through the stages of approval
 Migration of persistent URLs within the Caltech Library’s
locally developed PURL system
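The persistent URL migration in the list above amounts to keeping each persistent identifier unchanged while pointing it at the record's new location. A minimal sketch of that idea follows. It does not describe Caltech's locally developed PURL system: the table layout, the identifier format, the example.org URLs, and the function name are all hypothetical.

```python
def retarget_purls(purl_table, id_map, new_url_template):
    """Point each persistent URL at its record's new location.

    purl_table:       {persistent_id: old_target_url}   (hypothetical layout)
    id_map:           {old_record_id: new_record_id}    (produced by the migration)
    new_url_template: string with an "{id}" placeholder for the new record id
    Returns (updated_table, unmapped_ids). Entries with no new id keep their
    old target and are reported so staff can fix them by hand.
    """
    updated = {}
    unmapped = []
    for purl, old_target in purl_table.items():
        # Assume the old record id is the last path segment of the old URL.
        old_id = old_target.rstrip("/").rsplit("/", 1)[-1]
        new_id = id_map.get(old_id)
        if new_id is None:
            updated[purl] = old_target
            unmapped.append(purl)
        else:
            updated[purl] = new_url_template.format(id=new_id)
    return updated, unmapped
```

The persistent identifiers themselves never change, so links already cited elsewhere keep working after the move; only the targets behind them are rewritten.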
The Plan: In-House Elements
 Local staff would also
 Customize and localize screen text and help text
 Create a new web guide for student submitters
 Write documentation for library staff using the new
system
The Plan: Timeline
 6-month timespan allocated for migration (March
through August 2009)
 Informal scheduling process: timeline and tasks list
maintained on the library wiki
 Schedule did slip by one month, but project was
complete by the beginning of the academic year (Sept.
2009)
On the Road
The Process
Process: People
 Initial task group
 the coordinator for the ETD-db repository
 the programmer/system administrator for the digital
repositories
 one subject liaison librarian with extensive CODA
involvement
 the Metadata Group support staff person who processes
submitted theses
Process: People
 Group expanded gradually as project progressed:
 Other subject librarians reviewed progress, especially
interface issues
 Staff were asked to test specific features
 Final testing phase was open to all library staff
 Feedback was received from a wide range of staff, from
the University Librarian to circulation desk staff
Process: What & Where
 EPrints Services staff in Southampton were able to fit
their work into our timeline. Code was often delivered
ahead of schedule
 Integration of contracted code was done at Caltech
 Testing of system components, migration process, and
user interface was also done locally
Process: Testing
 We used a “staged” process to test the conversion and
migration
 Multiple rounds of testing the migration process, with a
larger number of records each time and a larger number
of testers
 Final test was a full “dress rehearsal” of the complete
migration process.
 We wish we had had time for formal usability testing,
but we didn’t
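The staged approach described above, with more records in each round and a full dress rehearsal at the end, can be sketched roughly as follows. This is not the conversion code written for the project: the field names, the `convert` and `check` functions, and the stage sizes are hypothetical stand-ins, and the real ETD-db and EPrints schemas are considerably richer.

```python
import random


def convert(old):
    """Hypothetical conversion of one old-style record to a new-style record."""
    return {
        "title": old["title"].strip(),
        "creators": [old["author"]],
        "date": old["year"],
    }


def check(old, new):
    """Return a list of problems found in one converted record."""
    problems = []
    if not new["title"]:
        problems.append("empty title")
    if new["creators"] != [old["author"]]:
        problems.append("author mismatch")
    if new["date"] != old["year"]:
        problems.append("date mismatch")
    return problems


def staged_rehearsal(records, stage_sizes, seed=0):
    """Run the conversion on progressively larger samples.

    Returns one (records_tested, records_with_problems) pair per stage.
    A stage size at or above len(records) is a full dress rehearsal.
    """
    rng = random.Random(seed)  # fixed seed so a test round can be repeated
    report = []
    for size in stage_sizes:
        sample = records if size >= len(records) else rng.sample(records, size)
        failures = sum(1 for old in sample if check(old, convert(old)))
        report.append((len(sample), failures))
    return report
```

Starting with small samples keeps early rounds quick to review by hand, while the final full run exercises the complete migration under the same conditions as the actual cutover.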
Process: Arrival
 The actual migration was done on a weekend
 Thesis submission was unavailable for 48 hours, but
public search interface was up
 Actual conversion process took about 8 hours
 The remainder of the weekend was devoted to
reviewing and testing the results
 CaltechTHESIS was open for business Monday
morning, as planned
Are We There Yet?
The Outcome
Outcome: Immediate Feedback
 First thesis was submitted by a student within hours
 We emailed the submitter:
“Congratulations! You are the first student to have
deposited his thesis into the new CaltechTHESIS
database. Would you mind giving us some
feedback on your experience? Ease of use,
problems encountered, confusion?”
Outcome: Immediate Feedback
 The student’s reply:
“Thanks! I was wondering when this had changed,
realized it must have been recent. I found
the submission quite easy, it took me only a couple
of minutes for the whole process.”
Outcome: The View from Here
 Our experience in the months since we “went live” with
CaltechTHESIS has confirmed our initial impression
that we now have a modern, flexible system that
provides a better user experience and smoother
process for both students and library staff.
Outcome: Specifics
 Goals met – we now have
 A better and more easily supported technical system
 More efficient thesis processing by library staff
 A better user experience for
 Students submitting theses
 Library staff processing theses
 Searchers worldwide
Outcome: On the Horizon
 No, we’re not “done.” To-do list:
 Data cleanup remaining from the conversion (fairly
minor), and filling in the new metadata fields added as
part of the migration project
 Export plugin for ProQuest/UMI’s XML metadata format
(currently being tested)
 ETD-MS format for OAI harvesting (in process)
 User interface tweaks and improvements (a never-ending task!)
Outcome: On the Horizon
 More “to-do’s”
 Upgrade to recently released EPrints v. 3.2
 Upgrade to new faster hardware and Red Hat 5 64-bit
Linux operating system
 Implement available EPrints add-ons:
 IRSTATS statistics module
 DROID/PRESERV plugin to support preservation status
monitoring
Outcome: On the Horizon
 Even more “to-do’s”
 Perhaps most important: complete the task of
documenting the migration in technical terms and
uploading migration scripts into the EPrints wiki, so that
others may make use of what we’ve done.
The Rear-view Mirror
Lessons Learned
Lesson: Best of Both Worlds?
 We avoided the “buy vs. build” dilemma by contracting
out specific parts of the migration development work to
experts, while using our own resources where our local
skill set could be put to best use and where local
control was crucial to success.
Lesson: Best of Both Worlds?
 We now have, in EPrints 3, a single, full-featured
repository platform for all of our institutional materials,
but we haven’t lost any of the valuable functionality of
the older system.
The Future
 We look forward to beginning our second decade of
institutional repository management with a strong and
flexible foundation.
Links
 CaltechTHESIS – http://thesis.library.caltech.edu
 CaltechAUTHORS – http://authors.library.caltech.edu
 CODA – http://library.caltech.edu/digital
 Thesis workflow planning document:
http://library.caltech.edu/etd/System_Independent_Thesis_Workflow.pdf
 Web guide for student submitters – http://libguides.caltech.edu/theses
 This Presentation: http://resolver.caltech.edu/CaltechLIB:2010.001
More Links
 EPrints software – http://software.eprints.org
 EPrints Services – http://www.eprints.org/services/
 ETD-db software – http://scholar.lib.vt.edu/ETD-db/developer/index.shtml