Unicorn
The Myth of Federated Searching
Photo by Norman Walsh

OhioLINK Digital Resource Commons Team
DSpace @ Open Repositories @ Georgia Tech, May 2009, Atlanta

Ohio’s Academic Library Consortium
Digital Resource Commons
Nearly 90 member institutions

The Ohio Library and Information Network, OhioLINK, is a consortium of Ohio college and university libraries, and the State Library of Ohio, that work together to provide Ohio students, faculty and researchers with the information they need for teaching and research.

The Ohio Digital Resource Commons is a robust, statewide platform that enables institutions to save, discover and share, free of charge, the instructional, research, historic and creative materials produced by the University System of Ohio and Ohio's liberal arts colleges.

We want the Digital Resource Commons to be Ohio’s choice for digital repositories.
We want all of Ohio’s stuff.

The OhioLINK Branding Challenge
Brand not Bland
DSpace 1.5 with xmlui

Wright State University Library
Wright State University DRC

University of Toledo Library
University of Toledo DRC

Southern State Community College Library
Southern State Community College DRC

Marietta College Library
Marietta College DRC

Bowling Green State University Library
Bowling Green State University DRC

Digital Archive of Literacy Narratives
The Knowledge Bank – operated independently by The Ohio State University
Digital.Maag – operated independently by Youngstown State University

Digital Resources Management Committee
• Facilitates cooperation in the creation and sharing of digital collections by developing a community of digitization practitioners to share expertise and resources
• Acts as advisory board for new Digital Resource Commons developments
Sometimes referred to as “Doctor Emcee”

Doctor Emcee said:
• We’d like for the Digital Resource Commons to be seen as the statewide digital repository for Ohio’s academic libraries
• The ability to search all of these institutional repositories
simultaneously is needed so that the DRC can be viewed as a single resource

Our desire
• To be able to search and find any record in any DRC instance from within any other DRC instance

Why the name “Unicorn”?
• “Uni” in Unicorn sounds like “University”
• “Uni” can mean one, as in “one place from which to search” – “United Search”
• The unicorn is mythical and elusive, and federated* search can also be mythical and elusive
Photo by Wolfgang Sauber

DRC Team Philosophy
July 2007 – Early 2009
Build it now!

ETD Center -> DRC Experiment

Early Import Experiment
• Import of metadata using RDFizer OAI-PMH parser, shell scripts, and intervention
[Diagram: Electronic Theses and Dissertations Center -> Import metadata into DRC -> Digital Resource Commons]

ETD collections in the DRC
[Diagram: OhioLINK with Oberlin, Toledo, Xavier, Wilmington, Bowling Green]

How would this model scale?
[Diagram: OhioLINK with Oberlin, Toledo, Xavier, Wilmington, Bowling Green, UC, Marietta, Southern State, Cleveland]

Doctor Emcee said:
• Can you use the Open Archives Harvester software from the Public Knowledge Project?
• PKP Harvester is an OAI-PMH harvester with a built-in metadata indexer and web search interface
• Uses PHP/MySQL/Smarty
[Diagram: OhioLINK PKP Harvester with Oberlin, Toledo, Xavier, Wilmington, Bowling Green, UC, Marietta, Southern State, Cleveland]

Going to school – practice project
• Practice in customizing search results with Smarty templates
• Practice in resolving harvesting issues (memory errors, invalid encoding, timeouts)

Doctor Emcee requirements:
• Show image thumbnails for items in image collections
• Identify the name of the repository in which each result was found and display the institution name to the user

Emerging Requirements
Make the DRC Unicorn Search look and feel more like other OhioLINK searches
Similar colors
Similar fonts
(more)… widgets
Spell correction
highlighting
institution branding (TDL style)
Customize DRC pages

Supporting Thumbnails in Federated Search Results
• Based on a similar project by Nathan Pugh at University of Utah with CONTENTdm collections
• Accomplished in DSpace by adding a servlet to the xmlui project
• Servlet returns a thumbnail based on the item handle passed in the URL

Example Thumbnail URL
• http://drcobe.ohiolink.edu/GetThumbnail?itemHandle=2374.OBE/1087
• Gets a thumbnail for the item on drcobe.ohiolink.edu with the handle 2374.OBE/1087

GetThumbnail Servlet
• Code at https://dev.ohiolink.edu/svn/dspace1_5/dspace-xmlui/dspace-xmluiapi/src/main/java/edu/ohiolink/org/dspace/GetThumbnail.java
• Believed to be usable in jspui with small modifications, but untested
• Alternatives: get thumbnail URL from mets.xml, OAI-ORE

Hack!
• The search application does not know in advance whether an item has a thumbnail, so the servlet returns a default thumbnail if none exists for the item
• Configurable in dspace.cfg with
– drc.search.thumbnail.defaultjpegfile
– drc.search.thumbnail.bundles

PKP Harvester customizations
• Modify look of search results pages with stylesheets and Smarty templates
• Build thumbnail image URLs from handle stored in Dublin Core identifier field
• Customized PKP Harvester source at https://dev.ohiolink.edu/svn/drc_search/

Multimedia versus Metadata versus Text, DRC
• The overwhelming majority of items in OhioLINK DRC repositories are non-textual in nature
• For textual items, in many cases (ETD) full-text is not stored

Full-text Storage
Stored: 1,800 items
Not stored or N/A: 217,000 items

Most common Search Problems
• Problems related to metadata quality and consistency
• Problems related to the indexing and searching algorithms used
• General usability issues

Example Indexing Problem
• The PKP Harvester does not perform word stemming, so “sail” and “sails” are not matched as the same word

Search Result Bias
• Search results are weighted based on the number of times a keyword appears in the metadata
• This leads to a strong bias for items with long descriptions, such as Electronic Theses and Dissertations

Example Metadata Problem
• Information in the date.created field is in so many formats that date range searches are broken
December 17, 1903
2006-06-15T16:18:48Z
1964
16th Century

Metadata Collapse!
• Qualified Dublin Core fields collapsed when crossed over to Federated Search
[Diagram: DRC fields date.accessioned, date.created, date.available, date.issued collapse into the single PKP field date]

Example Usability Issue
• The federated search interface does not understand DSpace security levels
• Could store rights in metadata and handle in results display

Infrequent Usage
New search used only in one institutional repository

Operational
• PKP Harvester had trouble harvesting OAI-PMH via HTTPS
• Administrative web interface has a harvester tool, but it times out, so we use the command line harvester at whim
• We’ve seen problems when records are deleted from the repositories. We need to delete them manually from the Unicorn index

What’s next – Pegasus Search?
• Most likely: replace with OAI-ORE/OAI-PMH harvesting components from Texas Digital Library
• Possible: Solr integration
• Interface changes
Photo by Beatrice Murch

Pegasus Should
• Support federation for other DSpace repositories in the state
• Support federation for repositories based on other software
• Be usable!
Photo by Beatrice Murch

Unified OhioLINK Vision
[Diagram: Users; Discovery Layer; DMC, EJC, EBC, ETD, Branded DRC]

DRC Team Goals
• Communicate better with:
– Users, to determine needs
– DSpace developers, to learn and save time
– Repository community, to take home new ideas
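
Sketch: building a thumbnail URL from a handle
The PKP Harvester customization described earlier builds thumbnail image URLs from the handle stored in the Dublin Core identifier field. The following is a minimal Python sketch of that idea, not the actual Smarty/PHP customization. It assumes the identifier is either a bare handle or a handle URI such as http://hdl.handle.net/2374.OBE/1087. The function name thumbnail_url and the HOSTS_BY_PREFIX table are hypothetical; only the Oberlin host and handle come from the example URL in the slides.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping from handle prefix to DRC host. Only the Oberlin
# entry is taken from the example thumbnail URL in the slides.
HOSTS_BY_PREFIX = {"2374.OBE": "drcobe.ohiolink.edu"}


def thumbnail_url(identifier: str) -> Optional[str]:
    """Return a GetThumbnail URL for a handle-style identifier, else None."""
    parsed = urlparse(identifier)
    # Accept either a full handle URI or a bare "prefix/suffix" handle.
    handle = parsed.path.lstrip("/") if parsed.netloc else identifier.strip()
    prefix, sep, suffix = handle.partition("/")
    host = HOSTS_BY_PREFIX.get(prefix)
    if not sep or not suffix or host is None:
        # Not a handle we recognize (for example an ISBN or a local call number).
        return None
    return f"http://{host}/GetThumbnail?itemHandle={handle}"


print(thumbnail_url("http://hdl.handle.net/2374.OBE/1087"))
```

Because the servlet returns a default image when an item has no thumbnail, a caller of a function like this would not need to check whether a thumbnail exists before emitting the image tag.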
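
Sketch: normalizing date.created values to year ranges
The metadata problem described earlier is that date.created holds values in so many formats that date range searches break. One possible mitigation, shown here only as a hedged Python sketch and not as anything the DRC team implemented, is to reduce each value to a coarse year range at harvest time. The function name year_range and the two regular expressions are assumptions for illustration; the sample values are the four shown on the slide.

```python
import re
from typing import Optional, Tuple

# "16th Century" style values: an ordinal number followed by "century".
CENTURY = re.compile(r"\b(\d{1,2})(?:st|nd|rd|th)\s+century\b", re.IGNORECASE)
# A standalone four-digit year, not embedded in a longer run of digits.
YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def year_range(value: str) -> Optional[Tuple[int, int]]:
    """Map a free-form date string to an inclusive (start_year, end_year)."""
    century = CENTURY.search(value)
    if century:
        n = int(century.group(1))
        # The nth century runs from year (n - 1) * 100 + 1 through n * 100.
        return ((n - 1) * 100 + 1, n * 100)
    year = YEAR.search(value)
    if year:
        y = int(year.group(1))
        return (y, y)
    # Leave unrecognized formats for manual review rather than guessing.
    return None


for sample in ["December 17, 1903", "2006-06-15T16:18:48Z", "1964", "16th Century"]:
    print(sample, "->", year_range(sample))
```

A range search could then compare against the stored start and end years instead of the raw string. This deliberately discards month and day, which is a design trade-off: it keeps the mixed formats searchable at the cost of precision.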