Unicorn
The Myth of Federated Searching
Photo by Norman Walsh
OhioLINK Digital Resource
Commons Team
DSpace @ Open Repositories @ Georgia Tech, May 2009, Atlanta
Ohio’s Academic Library Consortium
Digital Resource Commons
Nearly 90 member institutions
The Ohio Library and Information Network, OhioLINK, is a
consortium of Ohio college and university libraries, and the State
Library of Ohio, that work together to provide Ohio students, faculty
and researchers with the information they need for teaching and
research.
The Ohio Digital Resource Commons is a robust, statewide
platform that enables institutions to save, discover and share—free
of charge—the instructional, research, historic and creative
materials produced by the University System of Ohio and Ohio's
liberal arts colleges.
We want the Digital Resource Commons to be Ohio’s choice for digital
repositories.
We want all of Ohio’s stuff.
The OhioLINK Branding Challenge
Brand not Bland
DSpace 1.5 with xmlui
Wright State University Library
Wright State University DRC
University of Toledo Library
University of Toledo DRC
Southern State Community College Library
Southern State Community College DRC
Marietta College Library
Marietta College DRC
Bowling Green State University Library
Bowling Green State University DRC
Digital Archive of Literacy Narratives
The Knowledge Bank – operated independently by The Ohio State University
Digital.Maag – operated independently by Youngstown State University
Digital Resources Management
Committee
• Facilitates cooperation in the creation and
sharing of digital collections by developing
a community of digitization practitioners to
share expertise and resources
• Acts as advisory board for new Digital
Resource Commons developments
Sometimes referred to as “Doctor Emcee”
Doctor Emcee said:
• We’d like for the Digital Resource
Commons to be seen as the statewide
digital repository for Ohio’s academic
libraries
• The ability to search all of these
institutional repositories simultaneously is
needed so that the DRC can be
viewed as a single resource
Our desire
• To be able to search and find any record
in any DRC instance from within any other
DRC instance
Why the name “Unicorn?”
• “Uni” in Unicorn sounds like “University”
• “Uni” can mean one, as in “one place from
which to search” – “United Search”
• The unicorn is mythical and elusive, and
federated* search can also be mythical
and elusive
Photo by Wolfgang Sauber
DRC Team Philosophy
July 2007 – Early 2009
Build it now!
ETD Center -> DRC Experiment
Early Import Experiment
• Import of metadata using the RDFizer OAI-PMH parser, shell scripts, and manual intervention
(Diagram: Electronic Theses and Dissertations Center -> import metadata into DRC -> Digital Resource Commons)
ETD collections in the DRC
(Diagram nodes: Oberlin, Toledo, Xavier, Wilmington, OhioLINK, Bowling Green)
How would this model scale?
(Diagram nodes: UC, Marietta, Southern State, Cleveland, Oberlin, Xavier, Wilmington, OhioLINK, Bowling Green, Toledo)
Doctor Emcee said:
• Can you use the Open Archives Harvester
software from the Public Knowledge
Project?
• PKP Harvester is an OAI-PMH Harvester
with a built-in metadata indexer and web
search interface
• Uses PHP/MySQL/Smarty
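As an illustration of what an OAI-PMH harvester has to do before it can index anything, here is a minimal Python sketch that parses a ListRecords response in the oai_dc format. It is not how the PKP Harvester (which is PHP) is implemented. The sample response, the example.edu identifier, and the function name parse_list_records are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample response. A real harvester would fetch this from
# <repository base URL>?verb=ListRecords&metadataPrefix=oai_dc
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.edu:2374.OBE/1087</identifier>
        <datestamp>2009-01-15T12:00:00Z</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample photograph</dc:title>
          <dc:identifier>http://hdl.handle.net/2374.OBE/1087</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text):
    """Return one dict per harvested record: OAI identifier plus DC fields."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.findall(".//oai:record", NS):
        header = record.find("oai:header", NS)
        fields = {}
        # Every child of <oai_dc:dc> is an unqualified Dublin Core element.
        for element in record.findall(".//oai_dc:dc/*", NS):
            name = element.tag.split("}", 1)[1]  # drop the "{namespace}" prefix
            fields.setdefault(name, []).append((element.text or "").strip())
        records.append({
            "oai_identifier": header.findtext(
                "oai:identifier", default="", namespaces=NS),
            "fields": fields,
        })
    return records


records = parse_list_records(SAMPLE_RESPONSE)
print(records[0]["fields"]["title"])
```

A full harvester would also follow resumptionToken elements to page through large result sets, which this sketch leaves out.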
(Diagram nodes: PKP Harvester with Oberlin, Toledo, Xavier, Wilmington, OhioLINK, Bowling Green, UC, Marietta, Southern State, Cleveland)
Going to school – practice project
• Practice in customizing search results with Smarty templates
• Practice in resolving harvesting issues (memory errors, invalid encoding, timeouts)
Doctor Emcee requirements:
• Show image thumbnails for items in image
collections
• Identify the name of the repository in
which each result was found and display
the institution name to the user
Emerging Requirements
Make the DRC Unicorn Search look and feel
more like other OhioLINK searches
Similar colors
Similar fonts
(more)… widgets
Spell correction
Highlighting
Institution branding (TDL style)
Customize DRC pages
Supporting Thumbnails in
Federated Search Results
• Based on a similar project by Nathan Pugh
at University of Utah with CONTENTdm
collections
• Accomplished in DSpace by adding a
servlet to the xmlui project
• Servlet returns a thumbnail based on the
item handle passed in the URL
Example Thumbnail URL
• http://drcobe.ohiolink.edu/GetThumbnail?itemHandle=2374.OBE/1087
• Gets a thumbnail for the item on
drcobe.ohiolink.edu with the handle
2374.OBE/1087
GetThumbnail Servlet
• Code at
https://dev.ohiolink.edu/svn/dspace1_5/dspace-xmlui/dspace-xmluiapi/src/main/java/edu/ohiolink/org/dspace/GetThumbnail.java
• Believed to be usable in jspui with small
modifications, but untested
• Alternatives: get Thumbnail URL from
mets.xml, OAI-ORE
Hack!
• The search application does not know in
advance if an item has a Thumbnail, so
the servlet will return a default thumbnail if
one does not exist for an item
• Configurable in dspace.cfg with
– drc.search.thumbnail.defaultjpegfile
– drc.search.thumbnail.bundles
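The fallback can be sketched in a few lines of Python. This is not the GetThumbnail servlet itself, which is Java. The slides name the two dspace.cfg keys but not their values, so the values below are hypothetical, as is the assumption that drc.search.thumbnail.bundles holds a comma-separated list of bundle names checked in order.

```python
# Hypothetical values: only the two key names come from the slides.
CONFIG = {
    "drc.search.thumbnail.defaultjpegfile": "/dspace/config/default-thumbnail.jpg",
    "drc.search.thumbnail.bundles": "THUMBNAIL,BRANDED_PREVIEW",
}


def pick_thumbnail(item_bundles, config=CONFIG):
    """Return the path of the image to serve for one item.

    item_bundles maps a bundle name to a list of bitstream paths.
    The first bitstream of the first configured bundle that has any
    content wins; otherwise the configured default image is returned.
    """
    bundle_names = [
        name.strip()
        for name in config["drc.search.thumbnail.bundles"].split(",")
    ]
    for name in bundle_names:
        bitstreams = item_bundles.get(name, [])
        if bitstreams:
            return bitstreams[0]
    # No thumbnail found: fall back so the search page never shows a broken image.
    return config["drc.search.thumbnail.defaultjpegfile"]


print(pick_thumbnail({"THUMBNAIL": ["photo-thumb.jpg"]}))
print(pick_thumbnail({"ORIGINAL": ["thesis.pdf"]}))
```

Always returning an image keeps the search application simple, because it can emit the same thumbnail URL for every result without first checking whether the item has one.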
PKP Harvester customizations
• Modify look of search results pages with
stylesheets and Smarty templates
• Build thumbnail image URLs from handle
stored in Dublin Core identifier field
• Customized PKP Harvester source at
https://dev.ohiolink.edu/svn/drc_search/
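Building the thumbnail URL from the Dublin Core identifier might look like the following Python sketch. The actual customization lives in the PHP/Smarty code at the URL above and is not reproduced here. The sketch assumes the identifier is a hdl.handle.net URI, and the prefix-to-host table is hypothetical apart from the 2374.OBE and drcobe.ohiolink.edu pairing shown in the example thumbnail URL.

```python
from urllib.parse import urlparse

# Hypothetical lookup table from handle prefix to DRC host. Only this one
# pairing appears in the slides; other institutions would need entries too.
HOSTS_BY_PREFIX = {"2374.OBE": "drcobe.ohiolink.edu"}


def thumbnail_url(dc_identifiers):
    """Return a GetThumbnail URL for the first usable handle, or None."""
    for identifier in dc_identifiers:
        parsed = urlparse(identifier)
        if parsed.netloc != "hdl.handle.net":
            continue  # not a handle URI (for example an ISBN or a local ID)
        handle = parsed.path.lstrip("/")      # e.g. "2374.OBE/1087"
        prefix = handle.split("/", 1)[0]      # e.g. "2374.OBE"
        host = HOSTS_BY_PREFIX.get(prefix)
        if host:
            return f"http://{host}/GetThumbnail?itemHandle={handle}"
    return None


print(thumbnail_url(["http://hdl.handle.net/2374.OBE/1087"]))
```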
Multimedia versus Metadata
versus Text, DRC
• The overwhelming majority of items in OhioLINK DRC repositories are non-textual in nature
• For textual items, in many cases (ETD) full-text is not stored
Full-text Storage
(Chart: Stored, 1800 items; Not stored or N/A, 217000 items)
Most common Search Problems
• Problems related to issues with metadata
quality and consistency
• Problems related to indexing and
searching algorithms used
• General usability issues
Example Indexing Problem
• The PKP Harvester does not perform word stemming, so forms such as “sail” and
“sails” are indexed as different words and a search for one misses the other
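To show what stemming would change, here is a deliberately naive Python sketch that strips a few common English suffixes before comparing words. It is not an algorithm used by the PKP Harvester or DSpace; a production index would use an established stemmer such as Porter's. The function names are made up for the example.

```python
def naive_stem(word):
    """Strip one common English suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query, text, stem=False):
    """Return True if the query word occurs in the text, optionally stemmed."""
    normalize = naive_stem if stem else str.lower
    terms = {normalize(term) for term in text.split()}
    return normalize(query) in terms


title = "Ships with sails"
print(matches("sail", title))             # exact-word matching
print(matches("sail", title, stem=True))  # matching on stems
```

Without stemming the query "sail" misses a record that only contains "sails"; with stemming both reduce to the same index term.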
Search Result Bias
• Search results are weighted based on the
number of times a keyword appears in the
metadata
• This leads to a strong bias for items with
long descriptions, such as Electronic
Theses and Dissertations
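The bias is easy to reproduce with a toy scorer. The Python sketch below compares a raw keyword count with a length-normalized count; the normalization is one common mitigation, shown here as an assumption and not as something the PKP Harvester supports. Both sample records are invented.

```python
def raw_score(query, text):
    """Score a record by how many times the keyword appears."""
    return text.lower().split().count(query.lower())


def length_normalized_score(query, text):
    """Score a record by keyword count divided by total word count."""
    words = text.lower().split()
    return words.count(query.lower()) / len(words) if words else 0.0


# Invented records: a short image description and a long thesis abstract.
short_record = "Lake Erie lighthouse photograph"
long_record = (
    "This thesis studies a lighthouse "
    + "and related coastal structures " * 20
    + "including one more lighthouse"
)

for name, text in (("short", short_record), ("long", long_record)):
    print(name, raw_score("lighthouse", text),
          round(length_normalized_score("lighthouse", text), 3))
```

With raw counts the long abstract outranks the photograph even though the photograph is entirely about the keyword; dividing by record length reverses the order.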
Example Metadata Problem
• Values in the date.created field come in so
many different formats that date range searches
are broken
December 17, 1903
2006-06-15T16:18:48Z
1964
16th Century
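One way to make such values searchable is to reduce each to a range of years before indexing. The Python sketch below handles the four sample formats above; it is an illustration under that narrow assumption, not a general date parser, and the function name year_range is made up.

```python
import re


def year_range(value):
    """Return (first_year, last_year) for a free-form date string, or None."""
    text = value.strip()
    # "16th Century" style: the Nth century spans (N-1)*100+1 through N*100.
    century = re.fullmatch(
        r"(\d{1,2})(?:st|nd|rd|th)\s+century", text, re.IGNORECASE)
    if century:
        n = int(century.group(1))
        return ((n - 1) * 100 + 1, n * 100)
    # Otherwise take the first standalone four-digit number as the year.
    year = re.search(r"\b(\d{4})\b", text)
    if year:
        y = int(year.group(1))
        return (y, y)
    return None


for sample in ("December 17, 1903", "2006-06-15T16:18:48Z", "1964", "16th Century"):
    print(sample, "->", year_range(sample))
```

Storing the two years alongside the original string lets a range search compare integers while the display still shows what the cataloger wrote.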
Metadata Collapse!
• Qualified Dublin Core fields collapsed when crossed over to Federated Search
(Diagram: the DRC fields date.accessioned, date.created, date.available and date.issued all become the single PKP field date)
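The collapse can be mimicked with a short Python sketch: each qualified field name is cut back to its base element, so all date qualifiers end up in one list. This imitates the effect of exposing records as unqualified Dublin Core; it is not the DSpace crosswalk code, and the sample item values are invented.

```python
def collapse_to_simple_dc(qualified_fields):
    """Merge qualified fields such as "date.issued" into their base element."""
    simple = {}
    for name, values in qualified_fields.items():
        element = name.split(".", 1)[0]  # "date.issued" -> "date"
        simple.setdefault(element, []).extend(values)
    return simple


# Invented item with three different kinds of date.
item = {
    "title": ["First flight photograph"],
    "date.created": ["December 17, 1903"],
    "date.accessioned": ["2006-06-15T16:18:48Z"],
    "date.issued": ["2006"],
}

print(collapse_to_simple_dc(item)["date"])
```

After the collapse the search index can no longer tell when the object was created from when it was added to the repository, which compounds the date format problem.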
Example Usability Issue
• The federated search interface does not understand DSpace security levels
• Could store rights in metadata and handle them in the results display
Infrequent Usage
New search used only in one institutional repository
Operational
• PKP Harvester had trouble harvesting
OAI-PMH via HTTPS
• The administrative web interface has a
harvester tool, but it times out, so we run
the command-line harvester by hand as needed
• We’ve seen problems when records are
deleted from the repositories. We need to
delete them manually from the Unicorn
index
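OAI-PMH lets a repository flag removed records with status="deleted" on the record header, which a harvester could use to prune its index automatically. The Python sketch below shows the idea. It only helps if the repositories actually report deletions, which the slides do not say; the sample response and the dict-based index are assumptions for the example.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical ListIdentifiers response containing one deleted record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header status="deleted">
      <identifier>oai:example.edu:2374.OBE/1087</identifier>
      <datestamp>2009-04-01</datestamp>
    </header>
    <header>
      <identifier>oai:example.edu:2374.OBE/1088</identifier>
      <datestamp>2009-04-02</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def deleted_identifiers(xml_text):
    """Return the OAI identifiers whose header is marked status="deleted"."""
    root = ET.fromstring(xml_text)
    return [
        header.findtext(OAI + "identifier")
        for header in root.iter(OAI + "header")
        if header.get("status") == "deleted"
    ]


def prune_index(index, xml_text):
    """Remove deleted records from a dict keyed by OAI identifier."""
    for identifier in deleted_identifiers(xml_text):
        index.pop(identifier, None)
    return index


index = {
    "oai:example.edu:2374.OBE/1087": {"title": "Removed item"},
    "oai:example.edu:2374.OBE/1088": {"title": "Kept item"},
}
print(sorted(prune_index(index, SAMPLE)))
```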
What’s next – Pegasus Search?
• Most likely: replace with OAI-ORE/OAI-PMH harvesting components from Texas Digital Library
• Possible: Solr integration
• Interface changes
Photo by Beatrice Murch
Pegasus Should
• Support federation for other DSpace repositories in the state
• Support federation for repositories based on other software
• Be usable!
Photo by Beatrice Murch
Unified OhioLINK Vision
Users
Discovery Layer
DMC
EJC
EBC
ETD
Branded
DRC
DRC Team Goals
• Communicate better with:
– Users, to determine needs
– DSpace developers, to learn and save time
– Repository community, to take home new ideas