Search to Discovery: Finding Global Scholarly Resources with Primo
Pascal Calarco & Alison Hitchens, Library
December 6, 2011

Agenda
• The state of search in libraries (Pascal)
• Expanding Primo beyond the local catalogue (Alison)
• Questions

Library Information Systems: Milestones
[Timeline figure; axis ticks 1960, 1980, 1990, 2000, 2010. Labels: Discovery; Metasearch; Citation Linking; ILS 3rd gen (Client-server; 1990s); ILS 2nd gen (Mainframe; 1980s); OCLC (library network; 1972); Early systems; MARC]

In the beginning, there was the card catalog (1901+)
Indexes:
• Subject
• Author
• Title
• Interfiled cards, call number access

Library of Congress National Union Catalog (pre-1956)

Henriette Avram, Developer of MARC
• Programmer/analyst at Library of Congress
• Developed system for printing card catalog information (MARC)
• ISO certification 1973

Later, there was the Online Public Access Catalog (OPAC)
• Machine Readable Cataloging (MARC)
• Inventory of the print/physical holdings of a library
• Better than the card catalog; keyword searching & Boolean functionality
• Non-intuitive; required training or intermediation (information professional)
• Generally limited to a single library

Library networks & resource sharing

Print to Electronic

Now: Electronic Almost Ubiquitous
• 85%+ of journal literature digital
• Hundreds of specialized scholarly databases
• Mass print book digitization efforts
• Electronic books going mainstream
• Aggregated meta-indexes: 750 million metadata records for journal/newspaper articles

Goal: improve user experience
• Users want to FIND not search
• Source required information to user regardless of format or location
• Leverage our knowledge of academic community @ uWaterloo
• Integrate into key services: LMS, CMS, other library services

Database Content Silos
[Diagram labels: Content Silos; ScienceDirect; Catalog; Web of Science; ILL; JSTOR; ETDs; EEBO; Website; Metasearch; eReserve; System Silos]

Metasearch: an interim step
• aka Federated Search; emerged 2003
• Distributed
search from one interface via web services, SOAP/XML gateways
• Idiosyncratic and slow; vendors implemented variously
• Relevancy of merged results problematic

Problems with catalog searching & evolution to discovery
• UCLA & Berkeley: information retrieval & user behavior (1986-1996)
• Google Books: “digitize the world’s knowledge” (2002)
• Karen Schneider, Andrew Pace, Roy Tennant: “The OPAC ‘Sucks’” (2002)
• Next generation catalogs -> Discovery (2008+)

Catalogs: Information Science Research
• Christine L. Borgman (1986) “Why are online catalogs hard to use? Lessons learned from information retrieval studies” Journal of the American Society for Information Science
• Ray R. Larson (1991) “The decline of subject searching: Long-term trends and patterns of index use in an online catalog” Journal of the American Society for Information Science
• Ray R. Larson (1992) “Evaluation of advanced retrieval techniques in an experimental online catalog” Journal of the American Society for Information Science
• Ray R. Larson (1996) “Cheshire II: designing a next-generation online catalog” Journal of the American Society for Information Science
• Christine L. Borgman (1996) “Why are online catalogs still hard to use?” Journal of the American Society for Information Science

How Users Search: What We’ve Learned
• Most people make typos at least some of the time
• Most searches are 2, 3, 4 words with no Boolean operators
• Most searches use keyword
• Search is a hesitant, iterative, often random process of discovery
• Most people start elsewhere
• Few read help screens
• Few use advanced search – this is true even in Google

The Google Effect
• Expectations for web search tools now:
– Radically simplified UI, fast results
– Aggregated content
– Relevant results on first page
– Natural language queries
– Spelling correction/adaptation

The OPAC “Sucks”
• The OPAC lacks common features of most search engines
– Relevance ranking vs.
last in, first out
– Spell checking (related: “Did you mean?”)
– Popular query operators like + and –
– Refine search
– Sort flexibility
– Faceting
– Citation indexing vs full text
– Developed for print materials, limitations with electronic materials or atomized items (like articles)
– Difficult for certain known item search

Industry Trends
• Decouple the front end (search and discovery) from the back end (inventory and cataloguing)
• Service Oriented Architecture – many programs loosely coupled
• Cloud services (SaaS)
• The 5th generation of library business systems emerging now – hosted, cloud solutions

Discovery Characteristics
• Enhanced Search Functionality
– Faceted browse
– Relevance ranking
– “Did you mean?” / Spell Checking
  • auto-correction, resubmit search
– Content aggregation
  • Integrating search for books, articles, etc.
– Single, Simple Search Box
– FRBR (Functional Requirements for Bibliographic Records): grouping editions

Discovery Characteristics, cont.
• Enhanced Experience
– Sometimes fun and engaging
– Interactive/Collaborative
– User centered design
• Enhanced Services
– Find it / Get it for me
– Book Covers / Synopsis
– Full text
– Availability on same page as results

Discovery Characteristics, cont.
• Enhanced Content
– Article Searching
– Commercial Data
– Merging Special Collections
– Harvesting Online Collections
  • Grey Literature
  • Free Content
• Enhanced Access
– Syndication: getting into users’ tools
  • Course Management Systems
  • Browser and Desktop Tool Bars
  • Portals

Discovery Components
1. Next Generation Catalog
2.
Next Generation “Unified Search”
[Diagram labels: Aid; Full Text; Vendor Data; OAI; User Interface; OPAC; ILS Circ; MARC; Data Normalization & Apache SOLR/Lucene; MetaSearch]

Content Components
[Diagram labels: Phase I; TUG; Phase II; Future; OCUL; Others; Primo Central; HathiTrust; Archives; Geospatial; RACER; Primo]

Evolution of Discovery
[Diagram labels: Primo; Catalog; Metasearch; Primo Central]

Options for Expanding Primo
• Local ingestion of resources using FTP or OAI harvesting
• Searching remote resources in Primo using the Primo DeepSearch API*
• Subscribing to a large centralized index, such as Primo Central
*Application Programming Interface

Local ingestion of records
• Example: Hathi Trust Digital Library
– Harvest the public domain records from Hathi Trust Digital Library
– Normalize the records
– Index the records in our local Primo database
– Schedule updates from Hathi Trust into Primo

Normalization: creating local sort field (Date – Oldest)

Primo Normalized XML (PNX)

Open source & Open platform
• Primo uses Lucene for its indexing
• SOLR exposes Lucene as a web service and allows for faceting
• APIs and web services allow flexibility and customization

We can’t index everything!
• Trying out a subscription to Primo Central, a centralized index of scholarly journal articles, newspapers, conference proceedings etc.
• User sees one interface; user is searching 2 indexes

What is Primo Central Index?
• A centralized index
– of free and restricted resources
– primarily articles & e-books
– based on metadata & full-text provided by publishers/aggregators
– based on the collections selected by the library in the Primo Administration module
– created & maintained by our vendor, Ex Libris

What is Primo Central Index?
• A centralized index
– of records harvested using the same process as our local Primo database
– created using the same PNX record structure as our local Primo database
– indexed using the same indexing tools as our local Primo database

Blending local and remote resources
• Both local and remote results are represented in the facets
• Blended relevance ranking
– Primo can be configured to boost high-ranking local results, so that when it ranks our 4 million records alongside hundreds of millions of Primo Central records, users don’t miss local results

Search = local resources & Primo Central

How does it work?
• Ex Libris has created & indexed records for millions of items based on information from the publishers
• Primo searches Primo Central the same way it searches the local database
• Full text availability is determined in advance by our URL resolver, SFX, i.e. [image in original slide]
• Delivery of the resource uses menu for [image in original slide]

New features: snippets give context
If your search term is found in the full text, Primo supplies a snippet highlighting the term

New features: expanding the search
Defaults to our library’s electronic subscriptions, but users can expand the search to all of Primo Central

New Facets & Facet Values

Added value: bX Recommender

Trouble-shooting remote resources
• We can view the PNX records using web services, but we have no control over the content or the normalization rules
• Records have the same structure as our local records but are missing local fields and don’t reflect local policies

Assessing Primo Central
• Over 65 hours of one-on-one usability testing and focus groups with undergraduate students, graduate students, faculty, staff and alumni
• Library staff survey
• Feedback form
• Statistics from Cognos

Looking to the future
• What other content should be added to Primo?
• How can we improve/enhance the interface?
• What is the right balance for boosting local physical resources?
• How do we point users to resources that can’t be searched using Primo?

Questions?
• Pascal Calarco
– Associate University Librarian, Digital & Discovery Services
– pvcalarco@uwaterloo.ca
• Alison Hitchens
– Cataloguing & Metadata Librarian
– ahitchen@uwaterloo.ca
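
Supplementary sketch: blended relevance ranking
The idea from “Blending local and remote resources” (boosting local results so they are not buried under a much larger central index) might be illustrated roughly as below. This is not Primo’s actual ranking or configuration. The multiplicative boost, the Hit class, the blend function, the "local"/"central" labels and the sample scores are all hypothetical, and the sketch assumes scores from the two indexes are directly comparable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    title: str
    source: str   # "local" or "central" (labels assumed for this sketch)
    score: float  # relevance score; assumed comparable across both indexes


def blend(local_hits, central_hits, local_boost=1.5, limit=10):
    """Merge two result lists, multiplying local scores by local_boost.

    A plain multiplicative boost is an assumption made for illustration;
    the slides do not describe how Primo implements boosting.
    """
    def boosted(hit):
        return hit.score * local_boost if hit.source == "local" else hit.score

    merged = sorted(list(local_hits) + list(central_hits),
                    key=boosted, reverse=True)
    return merged[:limit]


# Made-up sample data: one local record competing with two central records.
local = [Hit("Waterloo county atlas", "local", 0.60)]
central = [
    Hit("Atlas article A", "central", 0.80),
    Hit("Atlas article B", "central", 0.70),
]

unboosted = blend(local, central, local_boost=1.0)
boosted_results = blend(local, central, local_boost=1.5)

print([h.title for h in unboosted])
print([h.title for h in boosted_results])
```

With no boost the local record ranks last of the three; with the assumed boost of 1.5 it moves to the top. Choosing that factor is the “right balance” question raised under “Looking to the future”.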