A New Kind of Catalog
Charley Pennell
Principal Cataloger for Metadata
North Carolina State University
North Carolina Library Association 2007

Where is this talk headed?
  Local motivation
  National trends
  What is Endeca?
  Features
  Does Endeca work?
  Where are we going from here?
  Where is everybody else going?

Why a new catalog?
What was wrong with the old one?

A little TRLN catalog primer
  TRLN libraries (Duke, NCCU, NCSU, UNC-CH) jointly develop and maintain BIS, 1985-1992
  DRA implemented for catalog (UNC & Duke continue Acq/Serials modules), 1991-1993
  No integrated keyword/browse capability, 1993-1999
  Web2 catalog implemented, 1999-
  Sirsi & DRA “merge” in 2002; Taos DOA

A little TRLN catalog primer 2
  NCSU & NCCU to Unicorn; Duke to Aleph; UNC-CH to Millennium, 2003-2004
  Sirsi/Dynix merger, 2004: vendor focus shifts (even more) toward school/public market
  While agreeing to continue to support Web2, S/D increasingly looking to merge all product catalogs into single interface

What was the catalog lacking?
  Simplicity: a simple, hopefully uncluttered interface
  Interactivity: ways to interact with results to get better results
  Forgiveness: just fix my typos and case errors, don’t make me feel stupid!
  Response time: always
  Real-time sorting: the limit is how many?!!
  Relevance ranking: as if!
  Web services: use the Web to repurpose data, enable mash-ups, add-ons & improvements

Which interface is ready for immediate use?
  [Chart: series East, West, North by quarter, 1st-4th Qtr; values not recoverable]

So, why DOES everyone think that the catalog stinks?
  "Most integrated library systems, as they are currently configured and used, should be removed from public view."
  -- Roy Tennant, OCLC

The old model

The integrated library system
  Historically, the ILS developed as an inventory control system for use by library staff only
  First library automation systems (Plessey, CLSI, Geac, Innovative) were designed around circulation or acquisitions functions
  Interaction time was calibrated to the slow pace of backroom work where the audience was basically captive
  Staff focus on known-item searching, not resource discovery

The catalog as part of the ILS
  The first integrated OPACs were veneers on top of existing inventory management systems; patrons & staff competed for system resources! They still do!
  First OPACs allowed for browse only; early keyword searching restricted to certain fields (author/title/subject) only
  Libraries with no IT support were stuck with what their vendor provided and the enhancement process for improvements
  Libraries with IT support created their own systems: BIS, NOTIS, Claremont Colleges, Georgetown, PALS, DOBIS/LIBIS

The state of the ILS in 2007
  Customer demands for increasing functionality in a marketplace with little $$ to spend have reduced the ILS vendor pool through mergers and buyouts
  New functionality (multi-search, ERMS, E-Ref, ILL, etc.) increasingly being met by stand-alone and third-party applications
  Increasing competition from open source (Koha, Evergreen, Scriblio, LibraryThing) and e-commerce
  Q: Is our dogged adherence to MARC the only thing keeping the remaining ILS vendors afloat?
The state of the catalog 2007
  Library users’ search expectations have been conditioned by interactions with commercial Websites and Google, with which Libraries can barely afford to compete, but must
  Libraries are becoming increasingly virtual as users interact with us online (e-resources, Second Life)
  User expectations for online experiences are more interactive, instantaneous, and inviting

Perhaps most importantly…
  The information resources represented in the catalog represent a shrinking percentage of what end users need or want
  Calhoun’s Aristotelian vs. Copernican views of the catalog

What do users want from the OPAC?
  Make subject searching in online catalogs easier using post-Boolean probabilistic searching with automatic spelling correction, term weighting, intelligent stemming, relevance feedback, and output ranking
  Streamline users' book selection decisions at the catalog by adding tables of contents and back-of-the-book indexes to cataloging (i.e., metadata) records
  Reduce the many failed subject searches by expanding the online catalog with full texts: journal and newspaper articles, encyclopedias, dissertations, government documents, etc.
  Increase finding strategies in online catalogs through the library classification
  -- Markey, Karen (2007). “The online library catalog: Paradise lost and paradise regained”, D-Lib Magazine, 13(1/2).

“Many researchers express surprise at the brevity (from one to three words) of the queries people submit to online systems. Belkin tells why so few words make up their queries, "Precisely because of the inquirer's lack of knowledge about a problem area, it is impossible to specify what would resolve it." For Belkin, the saving grace is the inquirer's ability to recognize what he or she wants or does not want during the course of the search.
Therein lies an important solution to the problem -- information systems that report results for easy eyeballing and instantaneous recognition of relevant possibilities.”
  -- Karen Markey

What is an Endeca?
  A software company based in Cambridge, MA
  A search and information access technology provider for a number of major e-commerce websites
  Developers of the Endeca Information Access Platform

Endeca features
  Commercial-strength search/sort speeds
  Site-customizable relevance ranking
  Faceted browse
  True browsing (LC classification)
  Spell-checking, “Did you mean?”
  Automatic word stemming

Endeca at NCSU Libraries
  Went live in January 2006
  Works with a text version of a daily snapshot of Libraries’ MARC & other metadata
  Used to improve the discovery portion of the library catalog
  Interoperates with ILS for holdings, current availability status
  Web2 interface still present for known-item & authority searching

Implementation timeline
  License / negotiation: Spring 2005
  Acquire: Summer 2005
  Implementation:
    August 2005: vendor training
    September 2005: finalize requirements
    October 2005 - January 2006: design and development
    January 12, 2006: go-live date
  Widen to TRLN partners: Winter 2008

Implementation Team
  Implementation Team brought together from IT, DLI, Cataloging, Collections, Reference, Circulation
  Worked on indexing, UI, usability testing, etc.

Areas of contention
  Number of initial search boxes (1 or 2)
  Order, grouping of facets
  Placement of classification hierarchies, breadcrumbs
  Use of “search” and “browse” on tabs
  Visualization aided by Tito’s wireframes

Brief view vs. Full view gives user choice about displaying holdings. Reduces complexity of continuing and online resources.

8th (and Final) Revision: Aggregate holdings information by library.
NCSU Endeca features
  Breadcrumbs
  Call # browse
  Results
  Facets

Features we started with
  Faceted browse
  Availability facet
  Breadcrumbs
  Spell check / Did you mean
  Hierarchical subject browse based on LCC
  Fuzzy link to live Web2 data
  New book browse for titles added in last week only

Features that we’ve added
  New book browse based on relative date (last week, last month, last three months)
  RSS feeds based on user results
  “Search within” results
  Send search to TRLN partners
  Static unique link to live Web2 data

Relevance ranking
  Based on locally customizable algorithm:
    Most relevant: query exactly as entered
    For multi-term searches: phrase match
    Field match: title match more relevant than notes match
    Other factors:
      number of fields matched
      weighted frequency
      static ordering (publication date, circulation stats)

Faceting at the NCSU Libraries
  Follows on what we have learned from the commercial Web search model
  Mines metadata already available via MARC record, local class number, ILS item categories, circ status, and date stamping
  Required massive clean-up of 6xx subdivisions
  Allows both pre- and post-coordinate limits
  Uses table mapping to enable drilling down through call number results

Facet refinements
  Availability
  Author
  Library
  Format
  Language
  New(ness)
  LC Classification
  Subject: Topic
  Subject: Genre
  Subject: Region
  Subject: Era

A single facet need not represent data from a single field
  Single Unicorn item types (Book, Kit, Manuscript, Map, Data set)
  Multiple Unicorn item types (Audio, Microform, Thesis/Dissertation, Software & Multimedia, Videos)
  Leader byte 07 (Bib lvl): Journal, Magazine
  Library (Online)

Ranking facet results by number of postings makes sense in a short list, but not in a long list

The author facet is less useful in some types of searches …
… than others!
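
The relevance ranking tiers listed earlier (exact query, phrase match, field match, other factors) can be read as a lexicographic sort key. The sketch below is only an illustration of that idea in Python, not Endeca's algorithm or NCSU's configuration: the Record fields, the FIELD_WEIGHTS numbers, and the use of simple substring matching are all assumptions made for the example, and "weighted frequency" is left out for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical field weights: the slides say only that a title match
# outranks a notes match; the numbers themselves are assumptions.
FIELD_WEIGHTS = {"title": 3, "subject": 2, "notes": 1}


@dataclass
class Record:
    """Toy stand-in for a catalog record (not a real MARC structure)."""
    title: str
    subject: str
    notes: str
    pub_year: int


def sort_key(record: Record, query: str) -> tuple:
    """Build a sort key for one record; larger tuples rank higher."""
    q = query.lower().strip()
    terms = q.split()
    fields = {name: getattr(record, name).lower() for name in FIELD_WEIGHTS}

    # Tier 1: some field equals the query exactly as entered.
    exact = any(text == q for text in fields.values())
    # Tier 2: for multi-term queries, the terms appear together as a phrase.
    phrase = len(terms) > 1 and any(q in text for text in fields.values())
    # Fields in which every query term occurs (naive substring test).
    matched = [n for n, text in fields.items() if all(t in text for t in terms)]
    # Tier 3: the best-weighted matching field (title beats notes).
    best_field = max((FIELD_WEIGHTS[n] for n in matched), default=0)
    # Other factors: number of fields matched, then a static ordering
    # (publication year here, newest first).
    return (exact, phrase, best_field, len(matched), record.pub_year)


def rank(records: list[Record], query: str) -> list[Record]:
    """Return records ordered from most to least relevant."""
    return sorted(records, key=lambda r: sort_key(r, query), reverse=True)
```

Because the key is a tuple, a later factor only matters when every earlier one ties, which mirrors the "most relevant first, then other factors" ordering on the slide. A site could change the ranking by reordering the tuple or adjusting the weights.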
Technical overview
  [Diagram labels: Raw MARC data; NCSU exports and reformats; Flat text files; Information Access Platform (Data Foundry: parse text files; Indices; MDEX Engine); HTTP; NCSU Web Application; HTTP]

MARC ingest
  MARC to flat text file(s) for ingest by Endeca.
  Transformation accomplished with MARC4J.
  Opportunity to manipulate data on the back-end.

Transformed data

The end result…
  Video

Other Endeca library catalogs
  Phoenix Public Library: http://www.phoenixpubliclibrary.org/
  McMaster University: http://libcat.mcmaster.ca
  Florida Center for Library Automation: http://catalog.fcla.edu/
  Individual Florida universities: http://fs.catalog.fcla.edu/, etc.

Does Endeca work?

Problems: authority control
  Endeca is a keyword search engine; “browse” can only be effected using sort options
  There is no authority control within Endeca itself; rather, it relies on AC within ILS
  To make use of available metadata, subjects were split along subdivisions. Authors were not
  Talks were held with the vendor to explain the potential for drawing on authority x-refs to collocate searches

Problems: subject context
  Problems with wrong delimiter values (esp. $v)
  Problems maintaining context in atomized LCSH
    One-way relationships
      English language$vDictionaries$xSpanish
    Chronological headings devoid of geographic context
      Cuba$xHistory$yRevolution, 1959
    Phrase headings expressed in multiple subdivisions
      Prisoners$xAbuse of

Problems: subject hierarchies
  Chronological hierarchy not built into $y
    “19th century” does not subsume 1800-1809, 1801-1861, 1809-1817, 1815-1861, 1817-1825, Civil War, 1861-1865, etc.
    Geological periods exist as text only (Ordovician, Pleistocene, etc.)
  Some chronological headings are expressed as text in 650$a
    Middle Ages
    Nineteen sixties
  Geographic hierarchy not consistent between 651 and 650
    $zNorth Carolina$zRaleigh
    $aRaleigh (N.C.)
  BT/NT/RT relationships from authority file lacking

Some potential solutions
  Search behavior education
  FAST (Faceted Application of Subject Terminology)
  Web2 x-refs to redirect searches to Endeca
  Combining $z hierarchies
  Hierarchy lists

What do our users think?
  “The new Endeca system is incredible. It would be difficult to exaggerate how much better it is than our old online card catalog (and therefore that of most other universities). I've found myself searching the catalog just for fun, whereas before it was a chore to find what I needed.”
  -- NCSU Undergrad, Statistics
  “The new library catalog search features are a big improvement over the old system. Not only is the search extremely fast, but seemingly it's much more intelligent as well.”
  -- NCSU faculty, Psychology

Usability testing
  [Pie charts: Task Difficulty, New Catalog and Old Catalog. Extracted labels: Easy 43%, Easy 59%, Medium 12%, Medium 12%, Hard 7%, Hard 22%, Failed 22%, Failed 23%; which value belongs to which chart is not recoverable from the text]

Usability testing
  [Bar chart: Average Task Duration, Old vs New Catalog, Tasks 1-10; time axis 00:00.0 to 03:36.0; values not recoverable]

Usage statistics
  [Bar chart: Searches by Field Type, July 06 - Jan 07; categories Keyword (default), ISBN, Title, Author, Subject, Multi-Field; count axis 0 to 420,000; values not recoverable]

Newness wearing off?
Requests by Search Type: Search and Navigation
  Period                    Search   Search -> Navigation   Navigation
  March '06 - May '06       51%      29%                    20%
  July '06 - January '07    67%      25%                    8%

Navigation by Dimensions, July 06 - Jan 07
  Subject: Topic       26%
  LC Classification    21%
  Format               10%
  New                  10%
  Library              10%
  Subject: Genre       6%
  Author               6%
  Subject: Region      4%
  Language             3%
  Subject: Era         2%
  Availability         2%

Navigation by Dimension (most used), July 06 - Jan 07
  [Bar chart of requests, axis 0 to 140,000, in descending order: Subject: Topic, LC Classification, Format, New, Library, Subject: Genre, Author, Subject: Region, Language, Subject: Era, Availability]

Navigation by Dimension (order of UI presentation), July 06 - Jan 07
  Dimension            Requests
  Availability         9,286
  LC Classification    120,644
  Subject: Topic       145,589
  Subject: Genre       34,096
  Format               57,667
  Library              54,476
  Subject: Region      22,818
  Subject: Era         12,257
  Language             16,009
  Author               32,650

Where are we going from here?

Future directions
  Additional hierarchies (geographic names, dates)
  Make use of NAF, SAF, particularly cross-reference structure
  Massage underlying metadata
    Addition of Date Cataloged - Done!
    Addition of LC Class numbers to e-resources - Done!
    FRBR work numbers/records? - Tested!
    FAST headings?
  Accommodation of true browse for all indexes

Future opportunities
  Expanding the scope of the implementation to the 10M records in TRLN (Duke, NCCU, NCSU, UNC-Chapel Hill)
  Enrich catalog through external web services: book jackets, reviews, TOC, etc. (Amazon, OCLC, LibraryThing, Bowker Syndetics)
  Build use-case based cross-application shopping cart functionality
  Integrate catalog w/other tools through web services: “Free the Data”

Web services…

Mobile device searching

Where is everybody else going?
  Catalogs detaching themselves from ILS
    Detached data lends itself to experimentation
    Don’t have to throw out baby with bathwater when better interfaces come out
    Data itself safe and secure in ILS
  MARC becoming superfluous; MARC’s granularity NOT!
  Social interaction: reviews, folksonomic tags, ratings

Phoenix Public Library on Endeca
III’s new faceted catalog, Encore
ExLibris Primo at Vanderbilt
Athens County, OH: Koha Zoom open source
Georgia PINES: Evergreen open source
Casey Bisson’s Scriblio
Danbury Public powered by LibraryThing
OCLC WorldCat Local at UW

Thanks for listening!
Charley Pennell
Principal Cataloger for Metadata
NCSU Libraries
North Carolina State University
Raleigh, NC 27695-7111
cpennell@ncsu.edu
More info at: http://www.lib.ncsu.edu/endeca/