OAIster: A “No Dead Ends”
Digital Object Service
Kat Hagedorn
OAIster Librarian
University of Michigan Libraries
October 3, 2003
background
• One-year Mellon grant project to test
the feasibility of making OAI-enabled
metadata for digital objects accessible
to the public
• Digital Library Production Service at
University of Michigan Libraries began
work in December 2001
• Publicized as OAIster in February 2002
• Launched in June 2002
highlights
• Any audience
• Any subject matter
• Any format
• Freely accessible
• No dead ends
• One-stop shopping
…retrieving the “hidden web”
the protocol
• OAI = Open Archives Initiative
• OAI-PMH = Open Archives Initiative
Protocol for Metadata Harvesting
• Designed to make it easy to exchange
metadata among interested parties
• Consists of 6 HTTP requests to identify
repositories / metadata and perform
“harvesting”
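The six request types can be sketched as plain URL construction. This is a minimal illustration, not OAIster's code: the verb names come from the OAI-PMH specification, while `build_request` and the repository address are hypothetical names introduced here.

```python
from urllib.parse import urlencode

# The six request types ("verbs") defined by OAI-PMH.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_request(base_url, verb, **params):
    """Return the URL for one OAI-PMH request against base_url."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    query.update(params)
    return base_url + "?" + urlencode(query)


# Hypothetical repository address, used only for illustration.
BASE = "http://repository.example.org/oai"
print(build_request(BASE, "Identify"))
print(build_request(BASE, "ListRecords", metadataPrefix="oai_dc"))
```

The first three verbs identify a repository and what it offers; the last three retrieve identifiers and records, which is the "harvesting" part.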
tool we borrowed
• University of Illinois Urbana-Champaign
open-source OAI protocol harvester
• Java edition for our Unix environment
• Worked collaboratively to iron out kinks
– resumptionToken / retryAfter
– inexplicable kill
– bogus records in MySQL table
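For context on the resumptionToken / retryAfter item, here is a rough sketch of how a harvester typically pages through a repository. It is not the UIUC harvester's code. The `fetch` callable is an assumption of this sketch (it stands in for the HTTP layer and returns status, headers, and body), and the Retry-After value is assumed to be a number of seconds.

```python
import time
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch, sleep=time.sleep, max_retries=3):
    """Collect every <record> element by following resumptionTokens.

    fetch(params) is a caller-supplied function (an assumption of this
    sketch) returning (status, headers, body) for one ListRecords request.
    """
    records = []
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    retries = 0
    while True:
        status, headers, body = fetch(params)
        if status == 503:
            # Repository asked us to wait: honor Retry-After, then retry.
            retries += 1
            if retries > max_retries:
                raise RuntimeError("repository kept asking us to retry")
            sleep(int(headers.get("Retry-After", "60")))
            continue
        retries = 0
        root = ET.fromstring(body)
        records.extend(root.iter(OAI_NS + "record"))
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            return records  # no token, or an empty one: the list is complete
        # A resumed request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Passing `fetch` and `sleep` in as arguments keeps the loop testable without a live repository.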
development environment
• Digital Library Extension Service (DLXS)
• Develop open-source middleware and
license XPAT search engine for building
and mounting digital libraries
• Middleware consists of document
classes, i.e., Text, Image, Bib, FindAid
• Originally designed to make SGML
encoded texts available online
tool we developed
• Runs in DLXS environment using
BibClass
• Current BibClass web templates modified
• Additional Java-based transformation tool:
– Concatenates DC metadata records
– Filters out no-digital-object records
– Counts records
– Converts from UTF-8 to ISO-8859-1
– Uses XSLT to transform DC records into
BibClass records
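The filtering, counting, and encoding steps might look roughly like the sketch below. The actual tool was Java-based and its rules are not given here, so everything in this example is an assumption: records are modeled as dictionaries, a record is treated as leading to a digital object when any identifier is a URL, and characters outside ISO-8859-1 are written as numeric character references. The concatenation and XSLT steps are left out.

```python
import re

URL_PATTERN = re.compile(r"^(https?|ftp)://", re.IGNORECASE)


def has_digital_object(identifiers):
    """Assumed heuristic: the record is usable if any identifier is a URL."""
    return any(URL_PATTERN.match(value.strip()) for value in identifiers)


def filter_records(records):
    """Keep records with a URL identifier; return (kept, counts)."""
    kept = [r for r in records if has_digital_object(r.get("identifier", []))]
    counts = {
        "harvested": len(records),
        "kept": len(kept),
        "filtered_out": len(records) - len(kept),
    }
    return kept, counts


def to_latin1(text):
    """Convert text to ISO-8859-1 bytes; characters outside that set
    become XML numeric character references instead of being lost."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")
```

Writing unmappable characters as references is one way to avoid silent data loss when moving from UTF-8 to a single-byte encoding.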
system design
[Diagram. Components: UIUC harvester; OAI-enabled DC records; non-OAI-enabled DC records; record storage; XSLT transformation tool with XSL stylesheets (per source type); BibClass indexes; search interface (XPAT)]
result
• One place to look for digital objects
• Big
– 1,484,767 metadata records
– 195 institutions (as of August 03)
• Popular
– Averages 3300 search sessions / month
– Picked up in March 03: average 3700 now
– 43,894 searches total (through July 03)
www.oaister.org: search
www.oaister.org: limiters
www.oaister.org: sort
www.oaister.org: results
www.oaister.org: repositories
repositories: e.g.,
– Online Archive of California: manuscripts,
photographs, and works of art held in
institutions across California
– arXiv Eprint Archive: math and physics pre-
and post-prints
– Sammelpunkt, Elektronisch Archivierte
Theorie: archive of philosophical
publications
– British Women Romantic Poets Project:
collection of poems written by British
women between 1789 and 1832
repositories: stats
• As of July 03, out of 191 repositories…
• U.S. and foreign
– U.S.: 49% (94)
– Foreign: 51% (97)
• By subject
– Humanities: 26% (50)
– Science: 30% (58)
– Mixed: 43% (83)
• E-prints and pre-prints
– Using eprints.org software: 41% (78)
– Not using eprints.org software: 58% (110)
major issues encountered
• Metadata variation
• Records not leading to digital objects
• Access restrictions on digital objects
described in records
• Duplicate records for a single digital
object
issue: metadata variation
• With more records, users need more
restrictions
• Consistent metadata needed to
facilitate these restrictions
• One option: normalization of data
issue: metadata variation
• Type: the obvious quick win
– 240 metadata values mapped to four
generic values (text, image, audio, video)
– e.g.,
audio, sound = audio
motion, animation, newsreels, etc. = video
watercolour, watercolor, slides, etc. = image
article, articles, booklet, diss, story, etc. = text
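A lookup table is one simple way to express this kind of mapping. The sketch below uses only the sample values listed above, not the full set of 240. `TYPE_MAP` and `normalize_type` are hypothetical names, and returning `None` for unmapped values is an assumption about how leftovers would be handled.

```python
# Sample values from the slide, mapped to the four generic types.
TYPE_MAP = {
    "audio": "audio", "sound": "audio",
    "motion": "video", "animation": "video", "newsreels": "video",
    "watercolour": "image", "watercolor": "image", "slides": "image",
    "article": "text", "articles": "text", "booklet": "text",
    "diss": "text", "story": "text",
}


def normalize_type(raw):
    """Return one of text/image/audio/video, or None when unmapped."""
    return TYPE_MAP.get(raw.strip().lower())


print(normalize_type("Watercolour"))
print(normalize_type("newsreels"))
```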
issue: metadata variation
• Date: where to begin?
– Most records with at least one date
– Some records include up to seven dates
– No consistent style of date
• Subject: out of context, what meaning?
– Many records with at least one subject element
– But over 100 records with more than 50 subjects
– And one record with 1000!
issue: metadata variation
• Sample date values
<date>2-12-01</date>
<date>2002-01-01</date>
<date>0000-00-00</date>
<date>1822</date>
<date>between 1827 and 1833</date>
<date>18--?</date>
<date>November 13, 1947</date>
<date>SEP 1958</date>
<date>235 bce</date>
<date>Summer, 1948</date>
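To show why date normalization is hard, here is a minimal sketch that tries only to pull four-digit years out of the sample values. It is not what OAIster does. The function name and the decision to accept only years from 1000 to 2099 are assumptions made for illustration.

```python
import re

# Four-digit years from 1000 to 2099, not embedded in a longer number.
YEAR_PATTERN = re.compile(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)")


def extract_years(value):
    """Return every plausible four-digit year found in a date string."""
    return [int(year) for year in YEAR_PATTERN.findall(value)]


SAMPLES = [
    "2-12-01", "2002-01-01", "0000-00-00", "1822",
    "between 1827 and 1833", "18--?", "November 13, 1947",
    "SEP 1958", "235 bce", "Summer, 1948",
]
for sample in SAMPLES:
    print(sample, "->", extract_years(sample))
```

Even this narrow rule leaves four of the ten samples ("2-12-01", "0000-00-00", "18--?", "235 bce") with no year at all, and it returns two years for the range value.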
issue: metadata variation
• Sample subject values
<subject>30,51,52</subject>
<subject>1852, Apr. 22. E[veritt] Judson, letter to
Philuta [Judson].</subject>
<subject>Slavery--United States--Controversial
literature</subject>
<subject>view of interior with John Henry
sculpture</subject>
<subject>Particles (Nuclear physics) -Research.</subject>
issue: no digital objects
• Some records contain links to further
description of digital object
• But not the digital object itself
• Culling difficult
• One option: add explanatory text to site
issue: access restrictions
• No records where metadata itself is
restricted in use (as far as we know!)
• Definitely some records where objects
are restricted to licensed users
• One option: add explanatory text to site
issue: access restrictions
• DC Rights element: often not enough
info about viewing restrictions
• Currently no protocol method for
indicating restricted digital objects (i.e.,
“yes/no” toggle element)
• Need to assess whether users feel
informed or frustrated when
encountering restricted objects
issue: duplicate records
• Two records harvested, different
identifiers, same object described and
pointed to
• Acquired in two ways:
– Harvesting of original repository and
aggregator
– Receiving “static” DC records provided by
content creator and harvesting aggregator
issue: duplicate records
• Aggregators can contain records not
currently available through OAI
channels
• Aggregators do not always contain all
the records of a particular original
repository
• So, need to harvest both aggregator
and original repositories
issue: duplicate records
• Harvest records from aggregator
• Also receive from original content
creator, but as snapshot
– e.g., MEO and cogprints
– Snapshot before aggregator
– Creator unsure all records would be
aggregated
issue: duplicate records
• Were duplicates to be identified, how to
deal with the issue?
– Suppress?
– Group?
– Flag?
• So far, not addressed in OAIster
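As a sketch of the "group" option only: since duplicates have different identifiers but point to the same object, one possible approach is to group records by the URL they point to. This is not implemented in OAIster. The record layout (`id`, `url`) and the light URL cleanup are assumptions of this example.

```python
from collections import defaultdict


def group_by_object(records):
    """Map each object URL to the identifiers of the records pointing at it.

    Only URLs shared by more than one record are returned.
    """
    groups = defaultdict(list)
    for record in records:
        # Assumed cleanup: ignore surrounding whitespace and a trailing slash.
        key = record["url"].strip().rstrip("/")
        groups[key].append(record["id"])
    return {url: ids for url, ids in groups.items() if len(ids) > 1}


# Hypothetical records: one harvested from the original repository,
# one from an aggregator, both pointing at the same object.
records = [
    {"id": "oai:orig:1", "url": "http://example.org/a"},
    {"id": "oai:agg:77", "url": "http://example.org/a/"},
    {"id": "oai:orig:2", "url": "http://example.org/b"},
]
print(group_by_object(records))
```

The same grouping could feed a suppress or flag policy. Matching on URL alone would miss duplicates whose records point to different addresses for the same object.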
assessment
• Large survey (over 400 respondents)
• 2 rounds of face-to-face and remote
user testing
• Conducted before design and after
phase one rollout
assessment: survey
• Online journals and reference materials
wanted over other digital objects
• Difficult to search for information; every
service different; where to start
• Number of respondents (5%) indicated
they were generally successful in
finding resources online
assessment: user testing
• No short and long record formats: one
size fits all
• Want clearly defined and labeled
AND/OR searching options
• Results clear and easy to understand
• Want to sort by title, date, institution,
resource format…you name it!
• Use OAIster for academic, trustworthy,
authentic materials
service providers: comparison
[Chart: service providers plotted by usability (low to high) against content (some to all): UIUC, Emory, etc.; ad hoc; OAIster; DP-9]
• Focus on high usability
• Focus on all content
available
• Some service providers
have increased
functionality (e.g., deduplication, integration
of thesauri)
future of OAIster
• Make it faster
• Advanced searching
• Grouping to aid browsing
• Saving/emailing/downloading records
• Further normalization of data
• Handling duplicate records
• Collaboration with other services:
search, instructional…
current state of protocol
• Popular
• As Peter Suber says:
– “…no other single idea or technology in the [open-source] movement has enjoyed this density of
endorsement and adoption in a six month period.”
• Data providers over one year:
– June 02: 56 repositories / 274,062 records
– June 03: 187 repositories / 1,246,953 records
– Over three-fold increase for repositories
– Over four-fold increase for records
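The growth figures can be checked with a few lines of arithmetic. The variable names are introduced here for illustration.

```python
# Figures from the slide: June 02 versus June 03.
repos_2002, repos_2003 = 56, 187
records_2002, records_2003 = 274_062, 1_246_953

repo_growth = repos_2003 / repos_2002
record_growth = records_2003 / records_2002

print(f"{repo_growth:.2f}x repositories, {record_growth:.2f}x records")
```

Both ratios are consistent with the "over three-fold" and "over four-fold" descriptions.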
future of protocol
• Branching out
– HTTP vs. SOAP
– DC required vs. highly recommended
– Use of OAI in closed environments
– Static repository protocol
• Need for add-on applications
• OAI evangelism
what can you do?
• OAI-enable your data
– DLXS customer: easiest
– Make sure data is UTF-8 / Unicode compliant
– Provide as much metadata as you can
– Use standard element tags
– Develop “sets” for service providers
• Let us know you’re ready to be harvested
• Keep us informed about changes to the
harvesting URL, new data and deleted data,
change in contact info
contact info
• Kat Hagedorn
• University of Michigan Libraries, Digital
Library Production Service
• khage@umich.edu
• http://www.oaister.org/