The Future of Bibliographic Control

Users and Uses of Bibliographic Data:
The Promise and Paradox of Bibliographic Control
NCSU Case study: Faceted Navigation
Andrew K. Pace
Head, Information Technology
NCSU Libraries
Library of Congress Working Group on the Future of Bibliographic Control ~ March 8, 2007
Agenda
NCSU’s Endeca-powered catalog
Data Reality Check
Relevance ranking in online catalogs
Statistics: what are patrons doing?
The Metadata Paradox (in 3 parts)
A brief wish-list for the future of bibliographic control
NCSU’s Endeca-powered catalog
Rumsfeld’s Law of Bibliographic Control
You search the data you have, not the data you wish you had.
Existing catalogs are hard to use
Known item searching works pretty well (sometimes), but …
Lots of topical searches and poor subject access
  keyword gives too many or too few results, which leads to general distrust among users
  authority searching is under-utilized and misunderstood
Relevance = system sort order
Impossible to browse the collection
Unforgiving on spelling errors, stemming
Response time doesn’t meet expectations of web-savvy users
Valuable metadata is buried
Subject headings are not leveraged in keyword searching
  they should be browsed or linked from, not searched
Data from the item record is not leveraged
  should be able to easily filter based on user’s changing requirements using item type, location, circulation status, popularity
In a nutshell…
"Most integrated library systems, as they are currently configured and used, should be removed from public view."
- Roy Tennant, CDL
What’s the big picture?
Improve the quality of the library catalog user experience
Exploit our existing authority infrastructure (aka make MARC data work harder)
Build a more flexible catalog tool that can be integrated with discovery tools of the future.
“This-Gen” search tools

Proving that it’s possible to improve the search experience beyond the functionality that traditional online catalogs have supported.
What is Endeca?
Software company based in Cambridge, MA
Search and information access technology provider for a number of major e-commerce websites
Developers of the Endeca Information Access Platform
Why Endeca?
Customized relevance ranking of results
Better subject access by leveraging available metadata (including item level data!) through facets
Improved response time
Enhanced natural language searching through spell correction, etc.
True browse
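The idea of facets built from bibliographic and item-level data can be illustrated with a minimal sketch. It is not Endeca's API or NCSU's configuration: the record fields, values, and the helper names `refine` and `facet_counts` are all hypothetical, chosen only to show how a result set is narrowed by facet values and how counts per value are produced for display.

```python
from collections import Counter

# Hypothetical flat records; field names and values are invented for illustration.
RECORDS = [
    {"title": "Bioethics and the Law", "format": "Book", "library": "D.H. Hill", "available": True},
    {"title": "Bioethics: A Reader", "format": "eBook", "library": "Online", "available": True},
    {"title": "History of Technology", "format": "Book", "library": "D.H. Hill", "available": False},
    {"title": "Textile Science", "format": "Book", "library": "Textiles", "available": True},
]


def refine(records, **selected):
    """Keep only records whose fields equal every selected facet value."""
    return [r for r in records if all(r.get(f) == v for f, v in selected.items())]


def facet_counts(records, field):
    """Count how many records carry each value of one facet."""
    return Counter(r[field] for r in records)


# Narrow to books, show the remaining choices for "library", then narrow again
# using item-level data (availability).
books = refine(RECORDS, format="Book")
print(facet_counts(books, "library"))
print(refine(books, available=True, library="D.H. Hill"))
```

Each refinement is just another filter over the current result set, which is why item-level fields such as availability can sit alongside subject facets.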
Demo
Data Reality Check
Data Reality Check: Part I
Our Integrated Library System has ~80 MARC fields and subfields indexed in its keyword index
33 additional indexed fields (29%!) are not publicly displayed
Displayed fields use 37 different labels
Data Reality Check: Part II
A lot of those same fields are indexed in Endeca (just much more quickly and efficiently)
~50 MARC fields are indexed
New catalog has ~37 Properties and 11 Dimensions derived from ~160 MARC fields and subfields
Data Reality Check: Part III
Simple data is the best
MARC4J to convert MARC into flat files
Lots of stripping of punctuation…ugh
Perl to update files
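The punctuation stripping mentioned above can be sketched roughly as follows. This is not the NCSU pipeline (which used MARC4J and Perl); it is a Python illustration under stated assumptions: the input is a hypothetical dict of MARC tag to subfield values, the output layout (`tag=value|value`, tab-separated) is invented, and only trailing ISBD separators are removed. Trailing periods are left alone because they may belong to abbreviations.

```python
import re

# Trailing ISBD separators: whitespace, slash, colon, semicolon, comma, equals sign.
_TRAILING_ISBD = re.compile(r"[\s/:;,=]+$")


def strip_isbd(value):
    """Remove trailing ISBD punctuation from one subfield value."""
    return _TRAILING_ISBD.sub("", value.strip())


def flatten(record):
    """Turn {tag: [subfield values]} into one tab-separated flat line (hypothetical layout)."""
    parts = []
    for tag in sorted(record):
        values = [strip_isbd(v) for v in record[tag]]
        parts.append(f"{tag}=" + "|".join(v for v in values if v))
    return "\t".join(parts)


# Hypothetical record with typical cataloging punctuation at the end of subfields.
record = {"245": ["Moby Dick /"], "260": ["New York :", "Harper,", "1851."]}
print(flatten(record))
```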
Relevance Ranking
TF/IDF alone is inadequate for determining relevance order of bibliographic metadata
The Endeca MDEX Engine offers NCSU alternatives
  Matching techniques: matchall, matchany, matchboolean, matchallpartial
  A suite of relevance ranking options is applied to Boolean-type searches
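To make the difference between the matching techniques concrete, here is a toy sketch. It does not call the Endeca MDEX Engine; the functions `match_all`, `match_any`, and `match_all_partial` are hypothetical stand-ins, and their behavior reflects an assumption about what the mode names mean: all terms required, any term sufficient, and all terms with a fallback to a partial match when nothing satisfies all of them. The `min_terms` threshold is likewise an assumption.

```python
def tokens(text):
    """Lowercased whitespace tokens; real engines also stem and spell-correct."""
    return set(text.lower().split())


def match_all(query, docs):
    """Every query term must appear in the document."""
    q = tokens(query)
    return [d for d in docs if q <= tokens(d)]


def match_any(query, docs):
    """At least one query term must appear in the document."""
    q = tokens(query)
    return [d for d in docs if q & tokens(d)]


def match_all_partial(query, docs, min_terms=2):
    """Try match_all first; if nothing matches, accept documents with enough terms."""
    hits = match_all(query, docs)
    if hits:
        return hits
    q = tokens(query)
    needed = min(min_terms, len(q))
    return [d for d in docs if len(q & tokens(d)) >= needed]


DOCS = [
    "history of the american revolution",
    "revolutionary war diaries",
    "war and peace",
]
print(match_all("revolutionary war", DOCS))
print(match_any("revolutionary war", DOCS))
print(match_all_partial("american revolutionary war", DOCS))
```

The fallback in the last function is where an interface has to explain itself: the user asked for three terms and silently received records matching two.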
Relevance Ranking (cont.)
Individual RR modules are combined and prioritized according to our specifications to form an overall RR strategy, or algorithm
Current strategy includes seven modules,
  5 of which rank results dynamically on things like: phrase, rank of the field in which term appears, weighted frequency
  2 final rules provide static ordering based on publication date and aggregate circulation totals
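One way to picture a strategy built from prioritized modules is as a lexicographic sort: each module scores a record, and later modules only break ties left by earlier ones. The sketch below is an assumption-laden illustration, not NCSU's actual seven-module strategy or Endeca's implementation. It uses four hypothetical modules (two dynamic, two static), and the `Record` fields and `FIELD_PRIORITY` order are invented.

```python
from dataclasses import dataclass


@dataclass
class Record:
    id: str
    title: str
    subject: str
    year: int
    circulations: int


# Hypothetical field priority: a hit in the title outranks a hit in the subject.
FIELD_PRIORITY = ("title", "subject")


def phrase_module(rec, query):
    """Dynamic: 1 if the query appears as a phrase anywhere in the record."""
    return 1 if query.lower() in f"{rec.title} {rec.subject}".lower() else 0


def field_module(rec, query):
    """Dynamic: higher score when the phrase occurs in a higher-priority field."""
    for position, name in enumerate(FIELD_PRIORITY):
        if query.lower() in getattr(rec, name).lower():
            return len(FIELD_PRIORITY) - position
    return 0


def date_module(rec, query):
    """Static: newer publication date first."""
    return rec.year


def circulation_module(rec, query):
    """Static: higher aggregate circulation first."""
    return rec.circulations


# Order of the tuple is the priority of the modules.
STRATEGY = (phrase_module, field_module, date_module, circulation_module)


def rank(records, query):
    return sorted(records, key=lambda r: tuple(m(r, query) for m in STRATEGY), reverse=True)


RECORDS = [
    Record("A", "The Revolutionary War", "United States", 1990, 5),
    Record("B", "Colonial America", "Revolutionary War, 1775-1783", 2005, 50),
    Record("C", "Revolutionary War Letters", "United States", 2001, 1),
]
print([r.id for r in rank(RECORDS, "revolutionary war")])
```

Reordering `STRATEGY` changes the outcome without touching any module, which is the kind of experimentation the modular design allows.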
Relevance Ranking: Challenge and Promise
Challenge
  Uncharted territory required a best-guess approach
  More experimentation required with “matchallpartial” to provide an interface intuitive enough for users to know what is happening
Promise
  Having technology nimble enough to support experimentation with indexing and relevance strategies
Statistics: What patrons are doing
Requests by Search Type (July 06 – Jan 07)
Search: 67%
Search -> Navigation: 25%
Navigation: 8%
Navigation by Dimensions (July 06 – Jan 07)
Subject: Topic       26%
LC Classification    21%
Library              10%
New                  10%
Format               10%
Author                6%
Subject: Genre        6%
Subject: Region       4%
Language              3%
Subject: Era          2%
Availability          2%
Requests by Search Type (July 06 – Jan 07)
Search: 67%
Navigation: 33% (includes Search -> Navigation, 25%, and Navigation, 8%)
  Subj./Class navigation: 19.4%
Navigation by Dimension (most used), July 06 – Jan 07
[Bar chart: requests per dimension, axis from 0 to 140,000 requests. Dimensions from most to least used:]
Subject: Topic
LC Classification
Format
New
Library
Subject: Genre
Author
Subject: Region
Language
Subject: Era
Availability
Navigation by Dimension (order of UI presentation), July 06 – Jan 07
Dimension            Requests
Availability            9,286
LC Classification     120,644
Subject: Topic        145,589
Subject: Genre         34,096
Format                 57,667
Library                54,476
Subject: Region        22,818
Subject: Era           12,257
Language               16,009
Author                 32,650
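As a small arithmetic check, the request counts listed for this chart can be turned into shares and a usage ranking. The numbers are copied from the slide; the variable names are hypothetical. Note that the "New" dimension has no count in this chart, so the shares computed here are relative to the ten listed dimensions only and will not match the percentages in the pie chart, which includes "New".

```python
# Request counts per dimension, in order of UI presentation (from the slide).
REQUESTS = {
    "Availability": 9_286,
    "LC Classification": 120_644,
    "Subject: Topic": 145_589,
    "Subject: Genre": 34_096,
    "Format": 57_667,
    "Library": 54_476,
    "Subject: Region": 22_818,
    "Subject: Era": 12_257,
    "Language": 16_009,
    "Author": 32_650,
}

total = sum(REQUESTS.values())
shares = {dim: 100 * count / total for dim, count in REQUESTS.items()}
ranked = sorted(REQUESTS, key=REQUESTS.get, reverse=True)

# Print dimensions from most to least used with their share of listed requests.
for dim in ranked:
    print(f"{dim:<20}{REQUESTS[dim]:>9,}{shares[dim]:>7.1f}%")
```

Comparing `ranked` with the order of UI presentation shows that the most used dimensions are not simply the ones shown first: Availability is presented first but is used least.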
Searches by Search Key (July 06 – Jan 07)
Keyword (default)    58%
ISSN/ISBN            16%
Title                13%
Author                8%
Subject               4%
Multi-Field           1%
Sorting Requests (July 06 – Jan 07)
Pub Date        53%
Call Number     13%
Title A-Z       12%
Most Popular    12%
Author A-Z      10%
Most Popular Dimension Values (July 06 – Jan 07)
Out of 765,170 Navigation Requests
Physical
  New: New (92,037)
  Format: Book (40,183)
  Availability: Available (33,125)
  Library: D.H. Hill (33,091)
  Language: English (22,668)
  Format: eBook (21,177)
Topical
  LC Class: Q-Science (25,277)
  Subject|Region: US (20,954)
  Subject|Topic: History (20,861)
  LC Class: T-Technology (16,951)
  LC Class: H-Soc. Sci. (16,345)
  Subject|Topic: Bioethics (12,933)
The Metadata Paradox
Plugging Holes in the System
Natural language problem
  LCSH=United States--History--Revolution, 1775-1783
  keyword=revolutionary war (834 hits)
  Subject keyword=“United States--History--Revolution, 1775-1783” (3081 hits)
Facets taken out of the free-floating and hierarchical context of LCSH can be misleading
I’ve followed many a tag cloud, but assuming that browsing is still popular, how does one browse keywords?
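The natural language problem and the loss of context can both be seen in a few lines. This is a minimal sketch under stated assumptions: the helper names `subdivisions` and `terms` are hypothetical, matching is by exact lowercased term with no stemming, and the heading is split on either a double hyphen or a long dash.

```python
import re

HEADING = "United States--History--Revolution, 1775-1783"


def subdivisions(heading):
    """Split a subject heading string into its parts (double hyphen or long dash)."""
    return [p.strip() for p in re.split(r"--|\u2014", heading) if p.strip()]


def terms(text):
    """Lowercased alphanumeric terms, with no stemming."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


parts = subdivisions(HEADING)
overlap = terms("revolutionary war") & terms(HEADING)

# Each part could become a facet value, but a value such as "History" on its own
# no longer says whose history or which period it refers to.
print(parts)
# The patron's words share no exact term with the heading.
print(overlap)
```

Without stemming or cross-references, the query terms and the heading terms never meet, which is consistent with the much smaller hit count for the keyword search on the slide.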
Paradox #1
We finally have interesting discovery tools that make use of bibliographic data in ways that show us that the data are not completely adequate for use with the new discovery tools.
“Subject Keywords”
“[from Recommendations:] ... Abandon the attempt to do comprehensive subject analysis manually with LCSH in favor of subject keywords; urge LC to dismantle LCSH”
-- Karen Calhoun, The Changing Nature of the Catalog and its Integration with Other Discovery Tools, report prepared for the Library of Congress, March 2006
Paradox #2
“Subject keywords” should replace the controlled vocabulary from which the keywords themselves are most easily derived.
Let’s build bridges between the mountains of bibliographic description so that we can tear down the mountains.
If not LCSH, then what?
(with some help from colleague Charley Pennell, Principal Cataloger for Metadata, NCSU)
Dissertation Abstracts Model
  Constrained list of subject terms
  Dissertations are at the edge of scholarship, so a granular preexisting thesaurus is inadequate
  Vocabulary cannot be “thwarted” by authors
Social Tagging Model
  Human intervention times n users
  New difficulty of matching tags between creators
  Contrary to the “literary warrant” that requires a hierarchical thesaurus to differentiate thousands of titles from one another
Full Text Model
  Good for currency: hypothetically uses the most contemporary language
  Harder for research: some sort of citation analysis required; but arguably could be solved computationally
  The question of scale
Paradox #3
Computational (i.e., non-human-mediated) creation of subject-based facets will work perfectly once all the full text of every work is available in electronic format.
What does a search and retrieval system for 50 million books and 50 million articles look like?
Gooooooooooooooooooooooooooooogle
Bibliographic Control Wish List
A classification or subject thesaurus system that enables faceted navigation
A work identifier for books and serials
Something other than LC Name Authority for “organizations”
Physical descriptions that help libraries send books to off-site shelving and to patrons’ mailboxes
Something other than MARC in which to encode all of the above
Systems that can actually use the encoding
Paradox #4: The Ultimate Paradox
“You’re damned if you do and you’re damned if you don’t.”
- Bart Simpson
Thanks
NCSU project site (includes these slides): http://www.lib.ncsu.edu/endeca
Andrew K. Pace
Head, Information Technology, NCSU Libraries
andrew_pace@ncsu.edu