OAIster: A “No Dead Ends” Digital Object Service
Kat Hagedorn
OAIster Librarian
University of Michigan Libraries
October 3, 2003

background
• One-year Mellon grant project to test the feasibility of making OAI-enabled metadata for digital objects accessible to the public
• Digital Library Production Service at University of Michigan Libraries began work in December 2001
• Publicized as OAIster in February 2002
• Launched in June 2002

highlights
• Any audience
• Any subject matter
• Any format
• Freely accessible
• No dead ends
• One-stop shopping
…retrieving the “hidden web”

the protocol
• OAI = Open Archives Initiative
• OAI-PMH = Open Archives Initiative Protocol for Metadata Harvesting
• Designed to make it easy to exchange metadata among interested parties
• Consists of 6 HTTP requests to identify repositories / metadata and perform “harvesting”

tool we borrowed
• University of Illinois Urbana-Champaign open-source OAI protocol harvester
• Java edition for our Unix environment
• Worked collaboratively to iron out kinks
– resumptionToken / retryAfter
– inexplicable kill
– bogus records in MySQL table

development environment
• Digital Library Extension Service (DLXS)
• Develop open-source middleware and license XPAT search engine for building and mounting digital libraries
• Middleware consists of document classes, i.e., Text, Image, Bib, FindAid
• Originally designed to make SGML encoded texts available online

tool we developed
• Runs in DLXS environment using BibClass
• Current BibClass web templates modified
• Additional Java-based transformation tool:
– DC metadata records concatenated
– No-digital-object records filtered out
– Records counted
– Conversion from UTF-8 to ISO-8859-1
– XSLT used to transform DC records into BibClass records

system design
[Diagram components: OAI-enabled DC records; non-OAI-enabled DC records; UIUC harvester; record storage; XSLT transformation tool; XSL stylesheets (per source type); BibClass indexes; search interface (XPAT)]

result
• One place to look
for digital objects
• Big
– 1,484,767 metadata records
– 195 institutions (as of August 03)
• Popular
– Averages 3300 search sessions / month
– Picked up in March 03: average 3700 now
– 43,894 searches total (through July 03)

www.oaister.org: search

www.oaister.org: limiters

www.oaister.org: sort

www.oaister.org: results

www.oaister.org: repositories

repositories: e.g.,
– Online Archive of California: manuscripts, photographs, and works of art held in institutions across California
– arXiv Eprint Archive: math and physics pre- and post-prints
– Sammelpunkt, Elektronisch Archivierte Theorie: archive of philosophical publications
– British Women Romantic Poets Project: collection of poems written by British women between 1789 and 1832

repositories: stats
• As of July 03, out of 191 repositories…
• U.S. and foreign
– U.S.: 49% (94)
– Foreign: 51% (97)
• By subject
– Humanities: 26% (50)
– Science: 30% (58)
– Mixed: 43% (83)
• E-prints and pre-prints
– Using eprints.org software: 41% (78)
– Not using eprints.org software: 58% (110)

major issues encountered
• Metadata variation
• Records not leading to digital objects
• Access restrictions on digital objects described in records
• Duplicate records for a single digital object

issue: metadata variation
• With more records, users need more restrictions
• Consistent metadata needed to facilitate these restrictions
• One option: normalization of data

issue: metadata variation
• Type: the obvious quick win
– 240 metadata values mapped to four generic values (text, image, audio, video)
– e.g.,
  audio, sound = audio
  motion, animation, newsreels, etc. = video
  watercolour, watercolor, slides, etc. = image
  article, articles, booklet, diss, story, etc. = text

issue: metadata variation
• Date: where to begin?
– Most records with at least one date
– Some records include up to seven dates
– No consistent style of date
• Subject: out of context, what meaning?
– Many records with at least one subject element
– But over 100 records with more than 50 subjects
– And one record with 1000!

issue: metadata variation
• Sample date values
<date>2-12-01</date>
<date>2002-01-01</date>
<date>0000-00-00</date>
<date>1822</date>
<date>between 1827 and 1833</date>
<date>18--?</date>
<date>November 13, 1947</date>
<date>SEP 1958</date>
<date>235 bce</date>
<date>Summer, 1948</date>

issue: metadata variation
• Sample subject values
<subject>30,51,52</subject>
<subject>1852, Apr. 22. E[veritt] Judson, letter to Philuta [Judson].</subject>
<subject>Slavery--United States--Controversial literature</subject>
<subject>view of interior with John Henry sculpture</subject>
<subject>Particles (Nuclear physics) -Research.</subject>

issue: no digital objects
• Some records contain links to further description of digital object
• But not the digital object itself
• Culling difficult
• One option: add explanatory text to site

issue: access restrictions
• No records where metadata itself is restricted in use (as far as we know!)
• Definitely some records where objects are restricted to licensed users
• One option: add explanatory text to site

issue: access restrictions
• DC Rights element: often not enough info about viewing restrictions
• Currently no protocol method for indicating restricted digital objects (i.e., “yes/no” toggle element)
• Need to assess whether users feel informed or frustrated when encountering restricted objects

issue: duplicate records
• Two records harvested, different identifiers, same object described and pointed to
• Acquired in two ways:
– Harvesting of original repository and aggregator
– Receiving “static” DC records provided by content creator and harvesting aggregator

issue: duplicate records
• Aggregators can contain records not currently available through OAI channels
• Aggregators do not always contain all the records of a particular original repository
• So, need to harvest both aggregator and original repositories

issue: duplicate records
• Harvest records from aggregator
• Also receive from original content creator, but as snapshot
– e.g., MEO and cogprints
– Snapshot before aggregator
– Creator unsure all records would be aggregated

issue: duplicate records
• Were duplicates to be identified, how to deal with the issue?
– Suppress?
– Group?
– Flag?
• So far, not addressed in OAIster

assessment
• Large survey (over 400 respondents)
• 2 rounds of face-to-face and remote user testing
• Conducted before design and after phase one rollout

assessment: survey
• Online journals and reference materials wanted over other digital objects
• Difficult to search for information; every service different; where to start
• Number of respondents (5%) indicated they were generally successful in finding resources online

assessment: user testing
• No short and long record formats: one size fits all
• Want clearly defined and labeled AND/OR searching options
• Results clear and easy to understand
• Want to sort by title, date, institution, resource format…you name it!
• Use OAIster for academic, trustworthy, authentic materials

service providers: comparison
[Chart: service providers plotted by usability (low to high) against content (some to all); labeled points: UIUC, Emory, etc.; Ad hoc; OAIster; DP-9]
• Focus on high usability
• Focus on all content available
• Some service providers have increased functionality (e.g., deduplication, integration of thesauri)

future of OAIster
• Make it faster
• Advanced searching
• Grouping to aid browsing
• Saving/emailing/downloading records
• Further normalization of data
• Handling duplicate records
• Collaboration with other services: search, instructional…

current state of protocol
• Popular
• As Peter Suber says:
– “…no other single idea or technology in the [open-source] movement has enjoyed this density of endorsement and adoption in a six month period.”
• Data providers over one year:
– June 02: 56 repositories / 274,062 records
– June 03: 187 repositories / 1,246,953 records
– Over three-fold increase for repositories
– Over four-fold increase for records

future of protocol
• Branching out
– HTTP vs. SOAP
– DC required vs. highly recommended
– Use of OAI in closed environments
– Static repository protocol
• Need for add-on applications
• OAI evangelism

what can you do?
• OAI-enable your data
– DLXS customer: easiest
– Make sure data is UTF-8 / Unicode compliant
– Provide as much metadata as you can
– Use standard element tags
– Develop “sets” for service providers
• Let us know you’re ready to be harvested
• Keep us informed about changes to the harvesting URL, new data and deleted data, change in contact info

contact info
• Kat Hagedorn
• University of Michigan Libraries, Digital Library Production Service
• khage@umich.edu
• http://www.oaister.org/
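
appendix: type normalization sketch
The Type mapping described under “issue: metadata variation” could look roughly like the following. This is a minimal illustration only, not OAIster's actual transformation tool (which the slides describe as Java and XSLT based). The table holds just the sample values shown on the slide rather than the full 240, and the name normalize_type and the choice to return None for unmapped values are assumptions made for the example.

```python
# Hypothetical sketch: map free-text DC "type" values to the four generic
# values named on the slide (text, image, audio, video).
# Only the slide's sample values are listed; the real mapping covered 240.
TYPE_MAP = {
    "audio": "audio",
    "sound": "audio",
    "motion": "video",
    "animation": "video",
    "newsreels": "video",
    "watercolour": "image",
    "watercolor": "image",
    "slides": "image",
    "article": "text",
    "articles": "text",
    "booklet": "text",
    "diss": "text",
    "story": "text",
}


def normalize_type(raw):
    """Return the generic type for a raw value, or None if it is unmapped."""
    # Assumption: matching ignores case and surrounding whitespace.
    key = raw.strip().lower()
    return TYPE_MAP.get(key)


if __name__ == "__main__":
    for value in ["Sound", " watercolour ", "diss", "map"]:
        print(repr(value), "->", normalize_type(value))
```

Returning None for unmapped values leaves the decision about unknown types (drop, keep raw, or review by hand) to the caller, which seems to suit a mapping that grows as new repositories are harvested.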