PowerPoint - Cornell University

CS 430 / INFO 430
Information Retrieval
Lecture 16
Library Catalogs 1
Course Administration
Information Retrieval with High Recall
Full-text Indexing (automated)
• Text only. Most effective on medium-length documents
  on related topics. High recall requires tuning system to the
  specific collection and skilled users.
Catalogs and Indexes (created manually)
• Can be used for all formats of material
• Requires close quality control of metadata creation
• High recall requires tuning system to the specific
  collection and skilled users.
Descriptive metadata
Information discovery can be very effective when applied
to metadata rather than raw information
• Allows fielded searching
author = "Goethe"
• Suitable for non-textual material
type = "picture" and subject = "Ithaca"
• Can be used with controlled vocabulary
language = "en" (English)
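A minimal sketch of fielded searching over descriptive metadata. The records and the helper `fielded_search` are hypothetical and only reuse the field names from the examples above; a real catalog would use a database or search engine rather than a list scan.

```python
# Hypothetical metadata records; field names follow the slide's examples.
records = [
    {"author": "Goethe", "type": "text", "subject": "Faust", "language": "de"},
    {"author": "Unknown", "type": "picture", "subject": "Ithaca", "language": "en"},
    {"author": "Unknown", "type": "picture", "subject": "Boston", "language": "en"},
]


def fielded_search(records, **criteria):
    """Return the records whose fields equal every criterion (logical AND)."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


# author = "Goethe"
by_author = fielded_search(records, author="Goethe")

# type = "picture" and subject = "Ithaca" (works for non-textual material)
ithaca_pictures = fielded_search(records, type="picture", subject="Ithaca")
```

Because the query names the field, a search for "Ithaca" as a subject does not match items that merely mention Ithaca elsewhere.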
Examples of Library Catalogs
Cornell University Library catalog:
http://catalog.library.cornell.edu/
Library of Congress, Prints and Photographs:
http://www.loc.gov/rr/print/catalog.html
Origins of Library Catalogs
Bibliographic Objective:
• To bring together like items
• To differentiate among similar ones
Sir Anthony Panizzi, Keeper of Printed Books (1837-56) and
Principal Librarian (1856-66) at the British Museum.
His Ninety-One Rules (1841) were the
basis of modern catalog rules.
Origins of Library Catalogs
Information Discovery:
• to enable a person to find a book of which either the
  author, title or subject is known
• to show what the library has by a given author, on a
  given subject, or in a given kind of literature
• to assist in the choice of a book as to its edition
  (bibliographically) or to its character (literary or topical).
Charles Ammi Cutter
Librarian of the Boston Athenaeum
Rules for a Dictionary Catalog, 1876
Origins of Library Catalogs
Classification:
Division of subject matter into a hierarchy.
Typically used in libraries to provide a subject-based order for shelving books.
Melvil Dewey
Acting Librarian of Amherst College (1874)
The Dewey Decimal system of book
classification uses the numbers 000 to 999
to cover the general fields of knowledge and
decimals to fit special subjects.
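A small sketch of how a Dewey number encodes a hierarchy: the hundreds digit gives the main class, the tens digit the division, the units digit the section, and the decimals refine the subject further. The function `dewey_levels` is a hypothetical helper, and the class labels are the commonly published summaries of the ten main classes, not taken from the slides.

```python
# The ten main classes of the Dewey Decimal Classification, by hundreds digit.
DEWEY_MAIN_CLASSES = {
    0: "Computer science, information and general works",
    1: "Philosophy and psychology",
    2: "Religion",
    3: "Social sciences",
    4: "Language",
    5: "Science",
    6: "Technology",
    7: "Arts and recreation",
    8: "Literature",
    9: "History and geography",
}


def dewey_levels(number: str) -> dict:
    """Split a Dewey number such as '027.7' into its hierarchy levels."""
    whole, _, decimals = number.partition(".")
    whole = whole.zfill(3)  # tolerate input such as '27.7'
    return {
        "main_class": whole[0] + "00",
        "main_class_label": DEWEY_MAIN_CLASSES[int(whole[0])],
        "division": whole[:2] + "0",
        "section": whole,
        "decimals": decimals,
    }


levels = dewey_levels("027.7")
```

For 027.7 this yields main class 000, division 020, section 027, and the decimal refinement 7.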
Technology
Materials to be catalogued:
• Originally books
• Extended to serials, maps, music, etc., but concepts still rely
  heavily on experience with books
Form of catalog:
• Entries in books (Panizzi)
• Index cards (Cutter)
• Online databases (Kilgour)
Catalogs as Investments
Costs:
• Conventional catalog records are created by skilled librarians
  (cost estimate: $100 per record).
• OCLC's catalog has 52 million records. Total investment is
  several billion dollars.
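The "several billion dollars" figure follows directly from the two numbers above. A short worked calculation, using only the slide's estimates (the function name is hypothetical):

```python
COST_PER_RECORD_USD = 100   # the slide's cost estimate per record
OCLC_RECORDS = 52_000_000   # the slide's figure for OCLC's catalog


def catalog_investment(records: int, cost_per_record: int = COST_PER_RECORD_USD) -> int:
    """Estimated total cost, in dollars, of creating the given number of records."""
    return records * cost_per_record


total = catalog_investment(OCLC_RECORDS)
print(f"${total / 1e9:.1f} billion")  # prints "$5.2 billion"
```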
Cataloguing Standards:
• Enable libraries to share records
• Combine records of the past with records created today
• Allow readers and librarians to move between libraries
Shared Cataloguing: OCLC
OCLC -- Large centralized transaction processing database system
When a library catalogs a book, it deposits a MARC record in OCLC
Other libraries can copy the record
• saves duplication of cataloguing
• builds database of holdings
OCLC database has 52 million records, serves 47,000 libraries
When developed in 1967, OCLC was a pioneering computer
system (had to develop own network, computer terminal, etc.)
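The deposit-and-copy workflow described above can be sketched as a toy in-memory model. The class `SharedCatalog`, its method names, and the library codes are all hypothetical; this is not OCLC's actual interface.

```python
class SharedCatalog:
    """Toy union catalog: one record per title, many holding libraries."""

    def __init__(self):
        self.records = {}    # record id -> bibliographic record (a dict here)
        self.holdings = {}   # record id -> set of library codes

    def deposit(self, library, record_id, record):
        """The first library to catalog a title contributes the record."""
        self.records.setdefault(record_id, record)
        self.holdings.setdefault(record_id, set()).add(library)

    def copy_record(self, library, record_id):
        """Another library copies the record and registers its holding."""
        record = self.records[record_id]
        self.holdings[record_id].add(library)
        return dict(record)  # the copying library gets its own copy


catalog = SharedCatalog()
catalog.deposit(
    "LibraryA",
    "89-16879",
    {"245": "Campus strategies for libraries and electronic information"},
)
copied = catalog.copy_record("LibraryB", "89-16879")
```

The title is cataloged once, and the holdings set records which libraries own it.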
Layers of a Library Catalog
Encoding
• Rules that define how catalog records are encoded in a
computer system, e.g., XML mark-up.
Syntax
• Rules that define the fields and subfields, whether repeated,
optional, etc.
Semantics
• Rules that define the values of the field and subfield, with
instructions for cataloguers of what data to include and how
to decide when choices have to be made.
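To make the syntax layer concrete, here is a minimal sketch of a syntax specification and a checker. The `SYNTAX` table is hypothetical: it borrows three MARC tags for illustration, and its required/repeatable flags are assumptions, not the real MARC rules. Encoding (how the record is serialized) and semantics (what a cataloguer should put in each field) are deliberately left out.

```python
# Hypothetical syntax specification: which fields exist, and whether each is
# required or repeatable. The flags are illustrative assumptions.
SYNTAX = {
    "245": {"required": True, "repeatable": False},   # title statement
    "260": {"required": False, "repeatable": False},  # publisher
    "650": {"required": False, "repeatable": True},   # subject heading
}


def check_syntax(record):
    """Return a list of syntax problems for a record of (tag, value) pairs."""
    problems = []
    tags = [tag for tag, _ in record]
    for tag in sorted(set(tags)):
        if tag not in SYNTAX:
            problems.append(f"unknown field {tag}")
    for tag, rule in SYNTAX.items():
        count = tags.count(tag)
        if rule["required"] and count == 0:
            problems.append(f"missing required field {tag}")
        if not rule["repeatable"] and count > 1:
            problems.append(f"field {tag} may not repeat")
    return problems


good = [("245", "A title"), ("650", "Libraries"), ("650", "Automation")]
bad = [("260", "A publisher"), ("260", "Another publisher"), ("999", "?")]
```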
Library Cataloging using the Anglo American Cataloguing Rules
Anglo American Cataloguing Rules (AACR2)
• Rules for each category of material, e.g., monographs
(books). Specify what fields should be used and what data
to include in each field. Text strings were originally
intended for printed catalog cards.
MARC format
• An exchange format for catalog records. Includes encoding
rules and syntax specification.
"MARC Catalog"
• Catalog in MARC format, where content of each field
follows AACR2.
Anglo American Cataloguing Rules
The Anglo American Cataloguing Rules (AACR) provide detailed
rules for
• the choice of fields
• the content of the data that goes into each field
• the syntax of the data that goes into each field
The rules are an excellent example of technical writing, precise but
clear. For an example, see:
http://www.cs.cornell.edu/Courses/cs430/2004fa/slides/AACR.pdf
Example: Controlled Vocabulary
Level 1: Arts
Level 2:
  Architecture
  Art therapy
  Careers*
  Computers in art
  Dance
  Drama/dramatics
  Film
  History*
  Informal education*
  Instructional issues*
  Music
  Photography
  Popular culture*
  Process skills*
  Technology*
  Theater arts
  Visual arts
Terms marked * can appear in other hierarchies
Source: presentation by Diane Hillmann, 2004
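A sketch of how a cataloguing tool might enforce this controlled vocabulary. The term lists are copied from the example above; the data layout and the function `validate_subject` are hypothetical.

```python
# Level 2 terms under the level 1 term "Arts", copied from the example.
VOCABULARY = {
    "Arts": {
        "Architecture", "Art therapy", "Careers", "Computers in art", "Dance",
        "Drama/dramatics", "Film", "History", "Informal education",
        "Instructional issues", "Music", "Photography", "Popular culture",
        "Process skills", "Technology", "Theater arts", "Visual arts",
    },
}

# Terms marked * in the example: they may also appear in other hierarchies.
SHARED_TERMS = {
    "Careers", "History", "Informal education", "Instructional issues",
    "Popular culture", "Process skills", "Technology",
}


def validate_subject(level1: str, level2: str) -> bool:
    """Accept a subject only if both levels come from the controlled list."""
    return level2 in VOCABULARY.get(level1, set())
```

A free-text term such as "Painting" is rejected, which is what keeps like items together under a single agreed term.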
MARC Format
The MARC format was developed in the late 1960s as a
tagging scheme for exchanging catalog records on magnetic
tape. It remains the standard way to represent such data.
At present, MARC is slowly being converted to
modern computing formats, e.g., Unicode, XML.
MARC: Monograph catalog record
Citation
Caroline R. Arms, editor, Campus strategies for libraries and
electronic information. Bedford, MA: Digital Press, 1990.
MARC fields
tag  value
001  89-16879 r93
050  Z675.U5C16 1990
082  027.7/0973 20
245  Campus strategies for libraries and electronic
     information/Caroline Arms, editor.  [title statement]
260  {Bedford, Mass.} : Digital Press, c1990.  [publisher]
300  xi, 404 p. : ill. ; 24 cm.  [collation]
440  EDUCOM strategies series on information technology  [series title]
504  Includes bibliographical references (p. {373}-381).
020  ISBN 1-55558-036-X : $34.95
MARC fields (continued)
650  Academic libraries--United States--Automation.  [subject heading]
650  Libraries and electronic publishing--United States.
650  Library information networks--United States.
650  Information technology--United States.
700  Arms, Caroline R. (Caroline Ruth)
040  DLC DLC DLC
043  n-us---
955  CIP ver. br02 to SL 02-26-90
985  APIF/MIG
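The field listing above shows two features that matter to software: tags can repeat (four 650 fields), and subject headings pack a hierarchy into one string with "--". A minimal sketch, assuming the record is held as a list of (tag, value) pairs; the helper names are hypothetical.

```python
# A few fields from the example record, as (tag, value) pairs.
record = [
    ("245", "Campus strategies for libraries and electronic information/"
            "Caroline Arms, editor."),
    ("650", "Academic libraries--United States--Automation."),
    ("650", "Libraries and electronic publishing--United States."),
    ("650", "Library information networks--United States."),
    ("650", "Information technology--United States."),
    ("700", "Arms, Caroline R. (Caroline Ruth)"),
]


def field_values(record, tag):
    """Return every value for a tag, since MARC fields may repeat."""
    return [value for field_tag, value in record if field_tag == tag]


def split_heading(heading):
    """Split a subject heading into its main heading and subdivisions."""
    return heading.rstrip(".").split("--")


subjects = field_values(record, "650")
parts = split_heading(subjects[0])
```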
MARC Encoding
tag: 260
subfield a: {Bedford, Mass.} :
subfield b: Digital Press,
subfield c: c1990.
Note that the content is designed to be part of a printed catalog record
and is not in a convenient format for computer manipulation.
MARC encoding:
&2600#abc#{Bedford, Mass.} :#Digital Press,#c1990.%
[Definitely not a modern encoding!]
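A sketch of a parser for the encoded line above. It assumes the slide's illustrative notation only: "&" starts the field, "%" ends it, "#" separates the parts, the first three characters are the tag, and the letters after the first "#" name the subfields in order. This is not a parser for real MARC exchange files, and the character after the tag is kept but not interpreted.

```python
import re


def parse_field(encoded: str) -> dict:
    """Parse one field written in the slide's illustrative notation."""
    if not (encoded.startswith("&") and encoded.endswith("%")):
        raise ValueError("expected the field to be wrapped in '&' ... '%'")
    parts = encoded[1:-1].split("#")
    header, codes, values = parts[0], parts[1], parts[2:]
    if len(codes) != len(values):
        raise ValueError("subfield codes and values do not line up")
    return {
        "tag": header[:3],
        "rest_of_header": header[3:],  # kept as-is, meaning not assumed
        "subfields": dict(zip(codes, values)),
    }


field = parse_field("&2600#abc#{Bedford, Mass.} :#Digital Press,#c1990.%")

# The values still carry print punctuation, so extra cleanup is needed
# before a program can use them, e.g. to get the year as a number.
year_match = re.search(r"\d{4}", field["subfields"]["c"])
year = int(year_match.group()) if year_match else None
```

The cleanup step is the practical cost of content that was designed for printed cards: "c1990." has to be unpicked before it can be sorted or compared as a date.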
Name authority files
An Authority File "brings together like items and differentiates
among similar ones."
• Caroline R. Arms or Caroline Ruth Arms?
• Which William Phillips of Cardiff?
• Mark Twain or Samuel Clemens?
• Epithets:
    of Cardiff
    doctor
• Dates:
    1832 - 1876
    flourished 1860
    circa 1832 - 1876
Name authority: example
LC Control Number: n 87870182
HEADING: Arms, Caroline R. (Caroline Ruth)
000  00907cz 2200205n 450
001  4383796
005  19890706143144.8
008  70909n|acannaab |a aaa c
010  __ |a n 87870182
035  __ |a (DLC)n 87870182
040  __ |a InU |c DLC |d DLC
100  10 |a Arms, Caroline R. |q (Caroline Ruth)
400  10 |w nna |a Arms, Caroline Ruth
400  10 |a Arms, C. R. |q (Caroline Ruth)
670  __ |a Arms, W.Y. Report on the performance problems of the
        RLIN computer system, 1982: |b t.p. (Caroline R. Arms)
670  __ |a LC data base, 8/24/87 |b (hdg.: Arms, Caroline Ruth;
        usage: Caroline R. Arms, C. R. Arms)
670  __ |a Campus networking strategies, 1988: |b CIP t.p.
        (Caroline Arms)
670  __ |a Phone call to pub., 2/10/88 |b (Caroline Ruth Arms;
        studied at Oxford)
670  __ |a Campus strategies for libraries and electronic
        information, c1990: |b CIP t.p. (Caroline Arms) data sheet
        (b. 10-24-45)
953  __ |a bz46 |b bd24
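In the authority record above, field 100 holds the authorized heading and the 400 fields hold variant forms that should lead to it. A minimal sketch of that lookup follows; the function names and the normalization rule (lowercase, ignore periods and extra spaces) are assumptions for illustration.

```python
AUTHORIZED_HEADING = "Arms, Caroline R. (Caroline Ruth)"   # field 100
VARIANT_FORMS = [                                          # fields 400
    "Arms, Caroline Ruth",
    "Arms, C. R. (Caroline Ruth)",
]


def normalize(name: str) -> str:
    """Lowercase, drop periods, and collapse runs of whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def build_authority_index(heading, variants):
    """Map every normalized form of a name to its authorized heading."""
    index = {normalize(heading): heading}
    for variant in variants:
        index[normalize(variant)] = heading
    return index


AUTHORITY_INDEX = build_authority_index(AUTHORIZED_HEADING, VARIANT_FORMS)


def authorized_form(name: str):
    """Return the authorized heading for a name, or None if it is unknown."""
    return AUTHORITY_INDEX.get(normalize(name))
```

Every variant resolves to the same heading, so all works by the same person file together, while a name that is not in the file returns `None` and needs a cataloguer's judgment.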
Subject information
Library of Congress Subject Headings
Academic libraries--United States--Automation
Hierarchical classification
Library of Congress call number: Z675.U5C16
Dewey Decimal Classification: 027.7
Creation and maintenance of lists of subject headings and
classifications is a never-ending task.
Online public access catalog (OPAC)
History: First stage
• Library mounts its MARC records on a central computer
• Provides a simple terminal interface and dedicated terminals
• Boolean search -- fielded searching
[Most university libraries reached this stage about 1990]
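A sketch of Boolean fielded searching as a first-stage OPAC might implement it: an inverted index keyed by (field, word), with set operations for AND, OR and NOT. The two records are illustrative (the second record's subject is invented), and the helper names are hypothetical.

```python
from collections import defaultdict

# Illustrative records keyed by record id; the second subject is invented.
records = {
    1: {"title": "Campus strategies for libraries and electronic information",
        "subject": "Academic libraries United States Automation"},
    2: {"title": "Campus networking strategies",
        "subject": "Computer networks"},
}


def build_index(records):
    """Map (field, lowercased word) to the set of record ids containing it."""
    index = defaultdict(set)
    for record_id, record in records.items():
        for field, text in record.items():
            for word in text.lower().split():
                index[(field, word.strip(".,:;/"))].add(record_id)
    return index


index = build_index(records)


def term(field, word):
    """Look up one fielded term; unknown terms match nothing."""
    return index.get((field, word.lower()), set())


both = term("title", "campus") & term("title", "strategies")       # AND
either = term("title", "libraries") | term("title", "networking")  # OR
without = term("title", "campus") - term("title", "networking")    # AND NOT
```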
History: Second stage
• Library connects computer to a campus network and Internet
• Converts card catalog records to MARC (retrospective
conversion)
Library information systems
When the catalog is online ...
Add other collections and services:
• Secondary information (Inspec, Medline, Chemical Abstracts)
• Reference works (dictionaries, encyclopedias)
Improve user interface
• Add full text searching
• Add web interface
Add gateway to off-campus information sources:
• Scientific journals
• Databases (census, genome)
Library management systems
A library management system, sometimes called an integrated
library system, integrates the internal processes of a library, e.g.,
acquisitions, cataloguing, binding, and circulation.
It usually contains an online public access catalog, but does not
provide integrated services to users.
Library management systems are produced by small companies
that lack the capital and technical expertise to develop modern
digital libraries.
Notes on MARC
A great achievement:
• Developed in 1960s
• Magnetic tape exchange format for printing catalog records
• The dawn of computing:
    mixed upper and lower case
    variable length fields
    repeated fields
    non-Roman scripts
• 100(?) million records with standard content and format
• Thousands of trained librarians (millions?)
Notes on MARC
A great problem:
• Not designed for computer algorithms
• One record per item (poor links between records)
• Tied to traditional materials and traditional practices
• Not Unicode
• 100 million records at $100 each -- $10 billion
A classic legacy system!