Publisher Name Authority Project: An Attempt to Enhance Data Mining for Collection Analysis & Comparison

Lynn Silipigni Connaway, Consulting Research Scientist III
Akeisha Heard, Technical Intern
XXV Annual Charleston Conference
04 November 2005

Introduction

Research Goals
Develop a service to support advanced collection intelligence
Cluster collected objects based on their issuing entity
• As can be determined via metadata about the objects
• Gain intelligence about the nature of individual publishers
• Collection intelligence
• Acquisition patterns
• User behavior

Research Objectives
Resolve
• ISBN prefixes to publisher name
• Variant publisher names to a preferred form
Capture and make available for use various attributes of individual publishers
• Location of publisher
• Language(s) of materials published
• Genre(s)/format(s) of materials published
• Dominant subject domain(s) of the publisher's output
• Parent company and subsidiaries

Theoretical Foundation: Authority Control
Adhere to authorized form
• Personal names
• Corporate entities
Why no authorized form for publishing entities?

Pragmatic Foundation: Collection Development
Identified publisher series
• Retrospective conversion project (1984)
Family tree
• Which publishers are related?
Approval plans
• Which publishers publish which subjects?

Pragmatic Foundation: OCLC WorldCat Data Mining
Collection Analysis
• Which libraries have the most items by a publisher in a particular subject area?
• How do library holdings by publisher compare?
E-books for a particular STM publisher (2000)
• Cataloged as reproductions
• 2 publishers!
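
The two "Resolve" objectives above (ISBN prefix to publisher name, variant name to preferred form) can be pictured as two lookups. The following is a minimal sketch, not the project's implementation: the function names and both tables are hypothetical, the only prefix entry (019, Oxford University Press) is the example this deck uses for clustering, and the variant spellings are invented for illustration. It assumes 10-digit ISBNs and checks the standard ISBN-10 check digit before resolving the prefix.

```python
import re

# Hypothetical lookup tables, for illustration only. The prefix entry is the
# example used in this deck; the variant spellings are invented.
PREFIX_TO_PUBLISHER = {"019": "Oxford University Press"}
VARIANT_TO_PREFERRED = {
    "oxford univ press": "Oxford University Press",
    "oup": "Oxford University Press",
}


def clean_isbn(raw):
    """Strip hyphens, spaces and other separators; keep digits and X."""
    return re.sub(r"[^0-9Xx]", "", raw).upper()


def is_valid_isbn10(raw):
    """Check the ISBN-10 check digit: weighted sum (10..1) must be 0 mod 11."""
    isbn = clean_isbn(raw)
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    values = [10 if ch == "X" else int(ch) for ch in isbn]
    total = sum(weight * value for weight, value in zip(range(10, 0, -1), values))
    return total % 11 == 0


def publisher_from_isbn(raw):
    """Return the publisher for the longest matching prefix, or None."""
    if not is_valid_isbn10(raw):
        return None
    isbn = clean_isbn(raw)
    for prefix in sorted(PREFIX_TO_PUBLISHER, key=len, reverse=True):
        if isbn.startswith(prefix):
            return PREFIX_TO_PUBLISHER[prefix]
    return None


def preferred_name(variant):
    """Map a variant publisher string to its preferred form, or None."""
    key = " ".join(re.sub(r"[^a-z0-9 ]", "", variant.lower()).split())
    return VARIANT_TO_PREFERRED.get(key)


if __name__ == "__main__":
    print(publisher_from_isbn("0-19-852663-6"))
    print(preferred_name("Oxford Univ. Press"))
```

A real authority file would need many prefixes per publisher and many variants per name; the dictionaries here only show the shape of the two mappings.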

Pragmatic Foundation: Citation Analysis
Sweetland (1989)
• Reader functions of citations
• Information retrieval via citation databases
• Document retrieval
  • Includes interlibrary loan verification
• Bibliometrics
  • Faculty and researcher productivity measure
Other functions
• Creation of references/bibliographies

Pragmatic Foundation: Education for Librarians
Collection development & acquisitions librarian education
• Subject focuses of publishers
• Parent and subsidiary relationships

Specialized Corporate Authority Files
ACOLIT (Ruggeri, 2004)
• Names, uniform titles, Italian and international Catholic institutions, Catholic religious communities, and institutions
• Related to the Catholic Church, Papal State, and Vatican City State
COPAR (Boddaert, 2004)
• French official corporate bodies
• Mainly national and preceding the French Revolution
CORELI (Boddaert, 2004)
• Religious corporate bodies from 3 French ancient specialized catalogues

Specialized Corporate Authority Files
Chinese Modern Author Authority Database (Hu, Tam & Lo, 2004)
• Chinese authors of expanded works and Chinese corporate bodies since 1912
Chinese Name Authority Database (Hu, Tam & Lo, 2004)
• Mainly Taiwanese personal names with some Taiwanese corporate bodies

Specialized Corporate Authority Files
Case study by Elias & Fair (1983)
• Standard Oil Co.’s Media Query File
• No authority control
• 3 professionals in 6 months averaged 12 telephone calls/day from reporters
• Decided against canonical list for media names
• Noted 20 unique variants for Wall Street Journal including WSJ, Wall St. Jnl, Wall Street Jnl

Specialized Corporate Authority Files
Case study by French, Powell & Schulman (1997, 2000)
• Smithsonian Astrophysical Observatory’s Astrophysics Data System database
• Programmatically identify author affiliations and map variant names to canonical name
• Investigated various techniques separately and iteratively to bring variants together including:
  • Lexical cleanup
  • Data clustering algorithms
  • Approximate string-matching
• Reduced number of unique strings by 55%
• Required manual review of clusters

Database Quality

Literature: Database Quality
Review by O’Neill & Vizine-Goetz (1988)
• Busch (1981)
  • < 35% of 141 OCLC libraries routinely reported errors
• Pollock & Zamora (1983)
  • Noted misspellings comprise 90-96% of errors & include:
    • Omission
    • Insertion
    • Substitution
    • Transposition

Literature: Database Quality
Intner (1989)
• Reviewed 215 matching records in OCLC and RLIN
• Errors relating to publishers:

    OCLC                    RLIN
    Count (Total)   %       Count (Total)   %
    64 (205)        31.2    52 (191)        27.2
    MARC tagging in 260 field
    4 (25)          16.0    3 (26)          11.5
    Typographic errors
    4 (32)          12.5    6 (45)          13.3
    Application of AACR2 & LCRI

Literature: Database Quality
Romero (1994)
• Evaluated cataloging of library science students
• Noted 221 errors (28.22%) in the publisher description area

Issues: Historical Practices
Different rules for abbreviations
• LC Rule Interpretation B.14
  • State postal (2-letter) abbreviation if it appears in the item along with the place
• Anglo-American Cataloguing Rules, Revised (2002)
  • Abbreviations included in Appendix B.14

Issues: Historical Practices
ALA Catalog Rules (1941)
• Multiple places of publication and publishers and neither or first is prominent
  • Include first listed first, indicate omission
• Multiple places of publication and publishers and first is not prominent
  • Include prominent first
  • Include first listed second
• Unknown place of publication – [n.p.]
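
The four misspelling categories noted by Pollock & Zamora (1983) in the database quality review above can be made concrete with a small classifier. This is a minimal sketch under stated assumptions: the function names are hypothetical, only single-edit errors are recognized, and the sample pairs reuse misspellings that appear in the Ballard-based typo slides of this deck, paired with their standard spellings.

```python
def _is_one_deletion(longer, shorter):
    """True if removing exactly one character from `longer` gives `shorter`."""
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def classify_error(misspelled, correct):
    """Classify a single-edit misspelling relative to the correct form.

    Returns one of: "none", "omission", "insertion", "substitution",
    "transposition", or "other" (anything needing more than one edit).
    """
    if misspelled == correct:
        return "none"
    m, c = len(misspelled), len(correct)
    # Omission: a character of the correct word is missing.
    if m + 1 == c and _is_one_deletion(correct, misspelled):
        return "omission"
    # Insertion: the misspelling carries one extra character.
    if m == c + 1 and _is_one_deletion(misspelled, correct):
        return "insertion"
    if m == c:
        diffs = [i for i in range(m) if misspelled[i] != correct[i]]
        # Substitution: exactly one position differs.
        if len(diffs) == 1:
            return "substitution"
        # Transposition: two adjacent characters are swapped.
        if (
            len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and misspelled[diffs[0]] == correct[diffs[1]]
            and misspelled[diffs[1]] == correct[diffs[0]]
        ):
            return "transposition"
    return "other"


if __name__ == "__main__":
    pairs = [
        ("Tallahasee", "Tallahassee"),
        ("Worchester", "Worcester"),
        ("Institite", "Institute"),
        ("Metheun", "Methuen"),
    ]
    for wrong, right in pairs:
        print(wrong, "->", classify_error(wrong, right))
```

The classifier needs the correct form as input, which is the hard part in practice; as the deck notes for wildcard searches, candidate errors still require manual examination.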

Issues: Historical Practices
Anglo-American Cataloging Rules (1967)
• Multiple places of publication and publishers and neither or first is prominent
  • Include first listed only, omit others
• Multiple places of publication and publishers and first is not prominent
  • Include prominent only, omit others
• Unknown place of publication – [n. p.]

Issues: Historical Practices
Anglo-American Cataloguing Rules, Revised (2002)
• Multiple places of publication and publishers and neither or first is prominent
  • Include first listed only, omit others
• Multiple places of publication and publishers and first is not prominent
  • Include first listed first
  • Include prominent second
• Unknown place of publication – [S.l.]

Issues: Historical and Local Practices
“u.a.”
• At least one German institution uses “u.a.” as mark of omission
• Means “et al.”
• Not an AACR2r rule
• Local practice?
• Is local practice/policy an error?

Issues: Historical and Local Practices
WorldCat enhanced records
• Eliminate or lessen the probability of these issues

Examining Quality of WorldCat

WorldCat: Publisher Name Selection Criteria
Fixed field lang = “eng”
WorldCat by Language: Non-English 39%, English 61%

WorldCat: ISBN Validation Errors
WorldCat records with ISBNs: 22.69%
ISBNs by Language: Non-English 45%, English 55%

WorldCat: ISBN Validation Errors

                       Valid                   Invalid
    English Language   7,561,445 (99.90%)      7,600 (0.10%)
    All Languages      13,147,325 (99.88%)     15,654 (0.12%)

WorldCat: MARC Tagging Errors
Examined English language records based on some known issues and manual evaluation
Total MARC tagging errors found: 11,874 (0.03%)
WorldCat Tagging Errors: MARC 260 vs 300 tagging 55%, Dates tagging 43%, Other 2%

WorldCat: MARC Tagging Errors
MARC 260 vs 300 tagging
• In 260 field, information from 300 field in $a, $b, $c and/or $e
Dates tagging
• Date in $a or $b
• Five digit year
• “cm” follows year

WorldCat: Typographical Errors
Used “Typographical Errors in Library Databases” to identify and quantify English language WorldCat errors (Ballard, 2005)
• Total errors: 26,599 (0.08%)
• Require manual examination to determine if actual errors
• Searching for Institi*
  • Misspelled:
    • American Institite of Physics
    • British Standards Institition
  • Spelled correctly:
    • Institiúid Ard-Léinn Bhaile Átha Cliath (Dublin Institute for Advanced Studies)

WorldCat: Typographical Errors
Top words (10.4%):

    Word                        Probability According to Ballard   Error Type      WorldCat Count
    Worchester                  Highest                            Insertion       398
    Metheun                     High                               Transposition   355
    Universt*                   Highest                            Omission        299
    Unives*                     Highest                            Omission        275
    Westminister [and] Press    Highest                            Insertion       266
    Niagr*                      High                               Omission        260
    Phildel*                    High                               Omission        235
    Tallahasee                  High                               Omission        234
    John Hopkins Press          Highest                            Omission        227
    Institi*                    High                               Substitution    226

WorldCat: Typographical Errors
“Westminister”
• Only included on Ballard list in combination with other words
• Total errors in WorldCat: 628 (2.36%)
• Require manual review

Where are we now?

WorldCat: MARC 260 Evaluation
Top 10 terms in 260 $b in WorldCat

    Term           Count
    press          2,094,111
    co             1,664,005
    university     1,550,435
    dept           1,084,647
    pub            984,234
    research       853,954
    service        710,314
    institute      660,346
    office         649,794
    chu ban she    620,735

WorldCat: MARC 260 Evaluation
University Press names in 260 $b in WorldCat

    Term         Count
    oxford       35,804
    hopkins      22,564
    cambridge    21,951
    harvard      17,069
    cornell      11,305
    stanford     10,900
    purdue       5,468
    yale         5,076
    princeton    4,746
    rutgers      3,854

Clustering
Attempting programmatic clustering of publishers using ISBN prefixes
• Data clustering (The Free Dictionary)
  • "The science of extracting useful information from large data sets or databases"
  • Classification of similar objects into different groups
  • Partitioning of a data set into subsets (clusters)
  • Data in each subset (ideally) share some common trait

WorldCat: Clustering Example
Used ISBN prefix 019 (Oxford University Press)
• Total WorldCat records: 58,004,317
• Records with ISBN prefix 019: 84,276 (0.15%)
• Non-unique publisher names from ISBN prefix records: 91,528

                                              One or more 019 ISBN   All 019 ISBNs
    NACO normalized unique publisher names    1,550                  1,386
    Number of clusters                        919                    799
    Non-singleton clusters                    222 (24.16%)           205 (25.66%)
    Largest cluster                           82 text strings        81 text strings

Challenges: Publisher Name Authority File
Quality issue
• Level of acceptance for cluster
• What is acceptable?
Subsidiaries and Relationships
• Oxford & Auckland
• Examined manually to determine relationship
Form of name
• What is acceptable?
• Likely to use the most prominent form of name

Questions and Discussion
Contact Information:
connawal@oclc.org
hearda@oclc.org
Project Web Site: http://www.oclc.org/research/projects/publisherns/
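
The measures reported in the clustering example (unique normalized names, number of clusters, non-singleton clusters, largest cluster) can be illustrated with a toy pipeline. This is a sketch under loud assumptions, not the project's method: the deck does not say which clustering algorithm was used, so token-set Jaccard similarity with single-link merging is swapped in as a stand-in; `normalize` is a simplified approximation and not the full NACO normalization rules; the sample strings and the 0.5 threshold are invented for illustration.

```python
import re
from collections import defaultdict


def normalize(name):
    """Simplified stand-in for NACO normalization: lowercase, turn
    punctuation into spaces, collapse runs of whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", name.lower()).split())


def jaccard(a, b):
    """Token-set Jaccard similarity between two normalized names."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def cluster(names, threshold=0.5):
    """Single-link clustering of unique normalized names (union-find).

    Returns a list of clusters (lists of normalized names), largest first.
    """
    unique = sorted({normalize(n) for n in names if normalize(n)})
    parent = list(range(len(unique)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(unique)):
        for j in range(i + 1, len(unique)):
            if jaccard(unique[i], unique[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, name in enumerate(unique):
        groups[find(i)].append(name)
    return sorted(groups.values(), key=len, reverse=True)


if __name__ == "__main__":
    # Hypothetical 260 $b strings, invented for this sketch.
    sample = [
        "Oxford University Press",
        "OXFORD UNIVERSITY PRESS,",
        "Oxford Univ. Press",
        "Oxford University Press, Auckland",
        "Clarendon Press",
    ]
    clusters = cluster(sample)
    print("clusters:", len(clusters))
    print("non-singleton:", sum(1 for c in clusters if len(c) > 1))
    print("largest:", len(clusters[0]), "text strings")
```

The pairwise comparison is quadratic in the number of unique names, which is workable for the roughly 1,500 normalized names of the 019 example but would need blocking or indexing at larger scale. The threshold is exactly the "level of acceptance for cluster" question raised under Challenges.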