Publisher Name Authority Project: An
Attempt to Enhance Data Mining for
Collection Analysis & Comparison
Lynn Silipigni Connaway
Consulting Research Scientist III
Akeisha Heard
Technical Intern
XXV Annual Charleston Conference
04 November 2005
Introduction
Research Goals
 Develop a service to support advanced collection
intelligence
 Cluster collected objects based on their issuing entity
• As can be determined via metadata about the objects
• Gain intelligence about the nature of individual
publishers
• Collection intelligence
• Acquisition patterns
• User behavior
Research Objectives
 Resolve
• ISBN prefixes to publisher name
• Variant publisher names to a preferred form
 Capture and make available for use various attributes of
individual publishers
• Location of publisher
• Language(s) of materials published
• Genre(s)/format(s) of materials published
• Dominant subject domain(s) of the publisher's output
• Parent company and subsidiaries
Theoretical Foundation: Authority Control
 Adhere to authorized form
• Personal names
• Corporate entities
 Why no authorized form for publishing entities?
Pragmatic Foundation: Collection Development
 Identified publisher series
• Retrospective conversion project (1984)
 Family tree
• Which publishers are related?
 Approval plans
• Which publishers publish which subjects?
Pragmatic Foundation: OCLC WorldCat Data Mining
 Collection Analysis
• Which libraries have the most items by a publisher in a
particular subject area?
• How do library holdings by publisher compare?
 E-books for a particular STM publisher (2000)
• Cataloged as reproductions
• 2 publishers!
Pragmatic Foundation: Citation Analysis
 Sweetland (1989)
• Reader functions of citations
• Information retrieval via citation databases
• Document retrieval
• Includes interlibrary loan verification
• Bibliometrics
• Faculty and researcher productivity measure
 Other functions
• Creation of references/bibliographies
Pragmatic Foundation: Education for Librarians
 Collection development & acquisitions librarian education
• Subject focuses of publishers
• Parent and subsidiary relationships
Specialized Corporate Authority Files
 ACOLIT (Ruggeri, 2004)
• Names, uniform titles, Italian and international Catholic
institutions, Catholic religious communities, and
institutions
• Related to the Catholic Church, Papal State, and
Vatican City State
 COPAR (Boddaert, 2004)
• French official corporate bodies
• Mainly national and preceding the French Revolution
 CORELI (Boddaert, 2004)
• Religious corporate bodies from 3 French ancient
specialized catalogues
Specialized Corporate Authority Files
 Chinese Modern Author Authority Database (Hu, Tam &
Lo, 2004)
• Chinese authors of expanded works and Chinese corporate
bodies since 1912
 Chinese Name Authority Database (Hu, Tam & Lo, 2004)
• Mainly Taiwanese personal names with some Taiwanese
corporate bodies
Specialized Corporate Authority Files
 Case study by Elias & Fair (1983)
• Standard Oil Co.’s Media Query File
• No authority control
• 3 professionals in 6 months averaged 12 telephone
calls/day from reporters
• Decided against canonical list for media names
• Noted 20 unique variants for Wall Street Journal
including WSJ, Wall St. Jnl, Wall Street Jnl
Specialized Corporate Authority Files
 Case study by French, Powell & Schulman (1997, 2000)
• Smithsonian Astrophysical Observatory’s Astrophysics
Data System database
• Programmatically identify author affiliations and map
variant names to canonical name
• Investigated various techniques separately and
iteratively to bring variants together including:
• Lexical cleanup
• Data clustering algorithms
• Approximate string-matching
• Reduced number of unique strings by 55%
• Required manual review of clusters
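The combination of lexical cleanup and approximate string matching can be sketched as follows. This is a minimal illustration using Python's standard `difflib`, not the techniques French, Powell & Schulman actually used; the helper names, the greedy clustering strategy, and the 0.7 threshold are all assumptions chosen for the example. The sample strings are the Wall Street Journal variants quoted on the Elias & Fair slide.

```python
import re
from difflib import SequenceMatcher


def lexical_cleanup(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


def cluster_variants(names, threshold=0.7):
    """Greedy clustering (hypothetical helper): a name joins the first
    cluster whose first member is similar enough, else starts a new one."""
    clusters = []  # each cluster is a list of the original strings
    for name in names:
        cleaned = lexical_cleanup(name)
        for cluster in clusters:
            representative = lexical_cleanup(cluster[0])
            score = SequenceMatcher(None, cleaned, representative).ratio()
            if score >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


variants = ["Wall Street Journal", "Wall Street Jnl", "Wall St. Jnl", "WSJ"]
clusters = cluster_variants(variants)
```

With these inputs the abbreviated spellings join the full name, but the initialism "WSJ" is left in a cluster of its own, which is one reason such output still needs manual review.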
Database Quality
Literature: Database Quality
 Review by O’Neill & Vizine-Goetz (1988)
• Busch (1981)
• < 35% of 141 OCLC libraries routinely reported
errors
• Pollock & Zamora (1983)
• Noted misspellings comprise 90-96% of errors &
include:
• Omission
• Insertion
• Substitution
• Transposition
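The four misspelling types can be told apart mechanically when the correct spelling is known. The sketch below is an illustration only, not Pollock & Zamora's procedure; the function name is hypothetical and it assumes the observed string is exactly one edit away from the correct one.

```python
def classify_typo(correct: str, observed: str) -> str:
    """Classify a misspelling that is a single edit away from `correct`."""
    if correct == observed:
        return "none"
    n, m = len(correct), len(observed)
    # Find the first position where the two strings differ.
    i = 0
    while i < min(n, m) and correct[i] == observed[i]:
        i += 1
    # One character of the correct word is missing.
    if m == n - 1 and correct[:i] + correct[i + 1:] == observed:
        return "omission"
    # One extra character was added.
    if m == n + 1 and observed[:i] + observed[i + 1:] == correct:
        return "insertion"
    if m == n:
        # One character was replaced by another.
        if correct[i + 1:] == observed[i + 1:]:
            return "substitution"
        # Two adjacent characters were swapped.
        if (i + 1 < n
                and correct[i] == observed[i + 1]
                and correct[i + 1] == observed[i]
                and correct[i + 2:] == observed[i + 2:]):
            return "transposition"
    return "other"
```

For example, `classify_typo("Methuen", "Metheun")` returns `"transposition"` and `classify_typo("Tallahassee", "Tallahasee")` returns `"omission"`.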
Literature: Database Quality
 Intner (1989)
• Reviewed 215 matching records in OCLC and RLIN
• Errors relating to publishers:
Error type                     OCLC count (total)   OCLC %   RLIN count (total)   RLIN %
MARC tagging in 260 field      64 (205)             31.2     52 (191)             27.2
Typographic errors             4 (25)               16.0     3 (26)               11.5
Application of AACR2 & LCRI    4 (32)               12.5     6 (45)               13.3
Literature: Database Quality
 Romero (1994)
• Evaluated cataloging of library science students
• Noted 221 errors (28.22%) in the publisher
description area
Issues: Historical Practices
 Different rules for abbreviations
• LC Rule Interpretation B.14
• State postal (2-letter) abbreviation if it appears in
the item along with the place
• Anglo-American Cataloguing Rules, Revised (2002)
• Abbreviations included in Appendix B.14
Issues: Historical Practices
 ALA Catalog Rules (1941)
• Multiple places of publication and publishers and neither
or first is prominent
• Include first listed first, indicate omission
• Multiple places of publication and publishers and first is
not prominent
• Include prominent first
• Include first listed second
• Unknown place of publication – [n.p.]
Issues: Historical Practices
 Anglo-American Cataloging Rules (1967)
• Multiple places of publication and publishers and neither
or first is prominent
• Include first listed only, omit others
• Multiple places of publication and publishers and first is
not prominent
• Include prominent only, omit others
• Unknown place of publication – [n. p.]
Issues: Historical Practices
 Anglo-American Cataloguing Rules, Revised (2002)
• Multiple places of publication and publishers and neither
or first is prominent
• Include first listed only, omit others
• Multiple places of publication and publishers and first is
not prominent
• Include first listed first
• Include prominent second
• Unknown place of publication – [S.l.]
Issues: Historical and Local Practices
 “u.a.”
• At least one German institution uses “u.a.” as mark of
omission
• Means “et al.”
• Not an AACR2r rule
• Local practice?
• Is local practice/policy an error?
Issues: Historical and Local Practices
 WorldCat enhanced records
• Eliminate or lessen the probability of these issues
Examining Quality of WorldCat
WorldCat: Publisher Name Selection Criteria
 Fixed field lang = “eng”
WorldCat by Language
• English: 61%
• Non-English: 39%
WorldCat: ISBN Validation Errors
 WorldCat records with ISBNs: 22.69%
ISBNs by Language
• English: 55%
• Non-English: 45%
WorldCat: ISBN Validation Errors
                    Valid                  Invalid
English Language    7,561,445 (99.90%)     7,600 (0.10%)
All Languages       13,147,325 (99.88%)    15,654 (0.12%)
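The slides do not say how ISBNs were validated. One plausible reading is the standard ISBN-10 check digit test, sketched below under that assumption; the function name is hypothetical. Each of the ten characters is weighted from 10 down to 1, "X" in the last position stands for 10, and the weighted sum must be divisible by 11.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Return True if `isbn` passes the ISBN-10 check digit test."""
    # Ignore the hyphens and spaces commonly used as separators.
    chars = [c for c in isbn.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # "X" is only legal as the check character
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0
```

A string such as `0-306-40615-2` passes, while the same string with two adjacent digits swapped (`0-306-40651-2`) fails, so the test catches many of the transposition errors discussed earlier.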
WorldCat: MARC Tagging Errors
 Examined English language records based on some known
issues and manual evaluation
 Total MARC tagging errors found: 11,874 (0.03%)
WorldCat Tagging Errors
• MARC 260 vs 300 tagging: 55%
• Dates tagging: 43%
• Other: 2%
WorldCat: MARC Tagging Errors
 MARC 260 vs 300 tagging
• In 260 field, information from 300 field in
$a, $b, $c and/or $e
 Dates tagging
• Date in $a or $b
• Five digit year
• “cm” follows year
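The date-related symptoms listed above lend themselves to simple pattern checks. The sketch below is an assumption about how such screening could be done, not the project's actual code; the function name, the regular expressions, and the sample records in the usage note are all made up for illustration. A 260 field is modelled as a dictionary from subfield code to text.

```python
import re

# A plausible publication year (1500-2099) not embedded in a longer number.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d\d|20\d\d)(?!\d)")
# Exactly five consecutive digits, e.g. a mistyped year.
FIVE_DIGIT = re.compile(r"(?<!\d)\d{5}(?!\d)")
# A four-digit year followed later in the same subfield by "cm".
CM_AFTER_YEAR = re.compile(r"(?<!\d)\d{4}(?!\d).*\bcm\b", re.IGNORECASE)


def flag_260(subfields: dict) -> list:
    """Return the suspected tagging problems for one 260 field."""
    flags = []
    for code in ("a", "b"):
        if YEAR.search(subfields.get(code, "")):
            flags.append(f"date in ${code}")
    date = subfields.get("c", "")
    if FIVE_DIGIT.search(date):
        flags.append("five-digit year")
    if CM_AFTER_YEAR.search(date):
        flags.append("cm follows year")
    return flags
```

For instance, `flag_260({"a": "Boston :", "b": "Beacon Press,", "c": "2001. 24 cm."})` returns `["cm follows year"]`. A publisher name that legitimately contains a year-like number would also be flagged, so hits are candidates for review rather than confirmed errors.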
WorldCat: Typographical Errors
 Used “Typographical Errors in Library Databases” to identify
and quantify English language WorldCat errors (Ballard,
2005)
• Total errors: 26,599 (0.08%)
• Require manual examination to determine if actual
errors
• Searching for Institi*
• Misspelled:
• American Institite of Physics
• British Standards Institition
• Spelled correctly:
• Institiúid Ard-Léinn Bhaile Átha Cliath (Dublin
Institute for Advanced Studies)
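The Institi* example shows why a truncated search only produces candidates. A minimal sketch of such a search follows; the function name is hypothetical and the fourth sample name is added here only to show a non-match.

```python
import re


def wildcard_hits(stem, publisher_names):
    """Return names containing a word that starts with `stem` (like stem*)."""
    pattern = re.compile(r"\b" + re.escape(stem), re.IGNORECASE)
    return [name for name in publisher_names if pattern.search(name)]


names = [
    "American Institite of Physics",            # misspelled
    "British Standards Institition",            # misspelled
    "Institiúid Ard-Léinn Bhaile Átha Cliath",  # correct Irish form
    "American Institute of Physics",            # correct, not matched
]
hits = wildcard_hits("Institi", names)
```

The search returns the first three names, including the correctly spelled Irish one, which is why each hit still has to be examined by hand.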
WorldCat: Typographical Errors
 Top words (10.4%):
Word                        Probability according to Ballard   Error type      WorldCat count
Worchester                  Highest                            Insertion       398
Metheun                     High                               Transposition   355
Universt*                   Highest                            Omission        299
Unives*                     Highest                            Omission        275
Westminister [and] Press    Highest                            Insertion       266
Niagr*                      High                               Omission        260
Phildel*                    High                               Omission        235
Tallahasee                  High                               Omission        234
John Hopkins Press          Highest                            Omission        227
Institi*                    High                               Substitution    226
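The 10.4% figure can be checked against the counts in the table and the total of 26,599 typographical errors reported on the previous slide. The short calculation below is only a sanity check on the slide's numbers.

```python
# WorldCat counts for the ten words in the table, in table order.
top_ten_counts = [398, 355, 299, 275, 266, 260, 235, 234, 227, 226]
total_typo_errors = 26_599  # total reported for English language records

top_ten_total = sum(top_ten_counts)
top_ten_share = round(100 * top_ten_total / total_typo_errors, 1)
```

The ten words account for 2,775 of the 26,599 errors, which rounds to the 10.4% shown.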
WorldCat: Typographical Errors
 “Westminister”
• Only included on Ballard list in combination with other
words
• Total errors in WorldCat: 628 (2.36%)
• Require manual review
Where are we now?
WorldCat: MARC 260 Evaluation
 Top 10 terms in 260 $b in WorldCat
Term           Count
press          2,094,111
co             1,664,005
university     1,550,435
dept           1,084,647
pub            984,234
research       853,954
service        710,314
institute      660,346
office         649,794
chu ban she    620,735
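A term count of this kind can be sketched with a simple tally. The slides do not describe the tokenization, so everything here is an assumption: the function name is hypothetical, each term is counted once per record, and only single words are counted, whereas the table above also includes the phrase "chu ban she". The sample strings are invented.

```python
import re
from collections import Counter


def term_counts(publisher_strings):
    """Count how many 260 $b strings contain each lowercase word."""
    counts = Counter()
    for text in publisher_strings:
        # A set ensures a term is counted once per record.
        terms = set(re.findall(r"[a-z]+", text.lower()))
        counts.update(terms)
    return counts


sample = [
    "Oxford University Press,",
    "Cambridge University Press,",
    "Pub. Co.,",
    "Johns Hopkins Press,",
]
counts = term_counts(sample)
```

Here `counts.most_common(1)` gives `[("press", 3)]`, mirroring on a tiny scale why generic words such as "press" and "co" dominate the real list.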
WorldCat: MARC 260 Evaluation
 University Press names in 260 $b in WorldCat
Term         Count
oxford       35,804
hopkins      22,564
cambridge    21,951
harvard      17,069
cornell      11,305
stanford     10,900
purdue       5,468
yale         5,076
princeton    4,746
rutgers      3,854
Clustering
 Attempting programmatic clustering of publishers using
ISBN prefixes
• Data clustering (The Free Dictionary)
• "The science of extracting useful information from
large data sets or databases"
• Classification of similar objects into different groups
• Partitioning of a data set into subsets (clusters)
• Data in each subset (ideally) share some common
trait
WorldCat: Clustering Example
 Used ISBN prefix 019 (Oxford University Press)
• Total WorldCat records: 58,004,317
• Records with ISBN prefix 019: 84,276 (0.15%)
• Non-unique publisher names from ISBN prefix records:
91,528
                                        One or more 019 ISBN   All 019 ISBNs
NACO normalized unique publisher names  1,550                  1,386
Number of clusters                      919                    799
Non-singleton clusters                  222 (24.16%)           205 (25.66%)
Largest cluster                         82 text strings        81 text strings
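The first step in the table, collapsing raw publisher strings to normalized unique names, can be sketched as below. This is a simplified approximation and not the full NACO normalization rules; the function names and sample strings are assumptions. The slides do not describe how normalized names were then merged into clusters, so the sketch stops at grouping strings that normalize identically.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(name: str) -> str:
    """Rough NACO-like key: strip diacritics, lowercase, drop punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    spaced = re.sub(r"[^\w\s]", " ", no_marks.lower())
    return " ".join(spaced.split())


def group_by_normalized(names):
    """Map each normalized key to the raw strings that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)


names = [
    "Oxford University Press,",
    "OXFORD UNIVERSITY PRESS",
    "Oxford Univ. Press",
    "Clarendon Press ;",
    "Clarendon Press",
]
groups = group_by_normalized(names)
non_singleton = [g for g in groups.values() if len(g) > 1]
```

In this sample, five raw strings reduce to three normalized names, two of which are non-singleton groups. "Oxford Univ. Press" stays separate because normalization alone does not expand abbreviations, which is the kind of variant a later clustering step would need to bring together.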
Challenges: Publisher Name Authority File
 Quality issue
• Level of acceptance for cluster
• What is acceptable?
 Subsidiaries and Relationships
• Oxford & Auckland
• Examined manually to determine relationship
 Form of name
• What is acceptable?
• Likely to use the most prominent form of name
Questions and Discussion
Contact Information:
connawal@oclc.org
hearda@oclc.org
Project Web Site:
http://www.oclc.org/research/projects/publisherns/