CS 430 / INFO 430 Information Retrieval Metadata 1 Lecture 19

CS 430 / INFO 430
Information Retrieval
Lecture 19
Metadata 1
Course Administration
Descriptive Metadata
Some methods of information retrieval search and browse
descriptive metadata about the objects.
Descriptive metadata typically consists of a catalog or indexing
record, or an abstract, one record for each object. The record acts
as a surrogate for the object.
• Usually the metadata is stored separately from the object
that it describes, but sometimes it is embedded in the object.
• Usually the metadata is a set of text fields.
Textual metadata can be used to describe non-textual objects,
e.g., software, images, music.
Documents and Surrogates
Document
The sea is calm to-night.
The tide is full, the moon lies fair
Upon the straits; -- on the French coast the light
Gleams and is gone; the cliffs of England stand,
Glimmering and vast, out in the tranquil bay.
Come to the window, sweet is the night-air!
Only, from the long line of spray
Where the sea meets the moon-blanch'd land,
Listen! you hear the grating roar
Of pebbles which the waves draw back, and fling,
At their return, up the high strand,
Begin, and cease, and then again begin,
With tremulous cadence slow, and bring
The eternal note of sadness in.
Surrogate (catalog record)
Author: Matthew Arnold
Title: Dover Beach
Genre: Poem
Date: 1851
Notes:
1. The surrogate is also a document
2. Every word is different!
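The second note can be checked mechanically. The following is a minimal sketch, assuming a naive lowercase tokenizer and using only the first five lines of the poem; the function names `terms` and `matches` are hypothetical and not part of any retrieval system discussed in the lecture.

```python
import re

# First five lines of the document, as shown on the slide.
DOCUMENT = """The sea is calm to-night.
The tide is full, the moon lies fair
Upon the straits; -- on the French coast the light
Gleams and is gone; the cliffs of England stand,
Glimmering and vast, out in the tranquil bay."""

# The surrogate (catalog record), as shown on the slide.
SURROGATE = """Author: Matthew Arnold
Title: Dover Beach
Genre: Poem
Date: 1851"""


def terms(text):
    """Naive tokenizer (an assumption): lowercase runs of letters, digits, apostrophes."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def matches(text, query):
    """True if every query term occurs somewhere in the text."""
    return terms(query) <= terms(text)


# Terms that occur in both the document excerpt and its surrogate.
shared_terms = terms(DOCUMENT) & terms(SURROGATE)

print(shared_terms)
print(matches(SURROGATE, "Dover Beach"), matches(DOCUMENT, "Dover Beach"))
```

A query for the title finds the surrogate but not the text of the poem itself, which is why searching the catalog record and searching the full text are complementary.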
Surrogates for Non-textual materials
Text-based methods of information retrieval can search a
surrogate for a photograph
Document
Surrogate (catalog record)
See next page for a textual catalog record about a non-textual item (photograph).
Library of Congress catalog record
(part)
CREATED/PUBLISHED: [between 1925 and 1930?]
SUMMARY: U. S. President Calvin Coolidge sits at a desk and
signs a photograph, probably in Denver, Colorado. A group of
unidentified men look on.
NOTES: Title supplied by cataloger. Source: Morey Engle.
SUBJECTS:
Coolidge, Calvin,--1872-1933.
Presidents--United States--1920-1930.
Autographing--Colorado--Denver--1920-1930.
Denver (Colo.)--1920-1930.
Photographic prints.
MEDIUM: 1 photoprint ; 21 x 26 cm. (8 x 10 in.)
Categories of Descriptive Metadata
Catalog: metadata records that have a consistent structure,
organized according to systematic rules. (Example: Library of
Congress Catalog)
Abstract: a free text record that summarizes a longer document.
Indexing record: less formal than a catalog record, but more
structured than a simple abstract. (Example: PubMed)
Metadata Format
A metadata format is a set of rules that describe the content
and format of a set of metadata records, e.g.:
• AACR (Anglo American Cataloging Rules) / MARC
• Dublin Core
• FGDC (Federal Geographic Data Committee's Content
Standard for Digital Geospatial Metadata)
• IEEE Standard for Learning Object Metadata
Uses of Metadata in Information Retrieval
Metadata is used in Information Retrieval systems in conjunction
with or instead of full text indexing:
• For physical objects, e.g., books
• For non-textual materials, e.g., pictures, maps, datasets
• For specialized areas where high recall is important (e.g.,
medicine), or where features such as intended audience are
hard to extract from the text (e.g., education)
• When people are ignorant of the power of full text indexing
(which is surprisingly common)
Uses of Metadata in Information Retrieval
Descriptive metadata provides capabilities that are not
possible with full text indexing:
• Allows fielded searching
author = "Goethe"
• Suitable for non-textual material
type = "picture" and subject = "Ithaca"
• Can be used with controlled vocabulary
language = "en" (English)
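The three query examples above can be sketched as lookups over a small set of metadata records. This is a minimal illustration, assuming records held as Python dictionaries; the sample records, the `fielded_search` helper, and the `LANGUAGE_CODES` table are hypothetical and serve only to show exact matching on named fields.

```python
# Hypothetical catalog: the field names follow the slide's examples,
# the record contents are invented for illustration.
records = [
    {"author": "Goethe", "title": "Faust", "type": "text",
     "subject": "Drama", "language": "de"},
    {"author": "Unknown", "title": "Lake view", "type": "picture",
     "subject": "Ithaca", "language": "en"},
    {"author": "Arnold", "title": "Dover Beach", "type": "text",
     "subject": "Poetry", "language": "en"},
]

# A tiny controlled vocabulary for the language field (assumed subset).
LANGUAGE_CODES = {"en": "English", "de": "German"}


def fielded_search(records, **criteria):
    """Return records whose fields equal every criterion (logical AND)."""
    return [r for r in records
            if all(r.get(name) == value for name, value in criteria.items())]


def has_valid_language(record):
    """Check the language field against the controlled vocabulary."""
    return record.get("language") in LANGUAGE_CODES


print(fielded_search(records, author="Goethe"))
print(fielded_search(records, type="picture", subject="Ithaca"))
print([LANGUAGE_CODES[r["language"]] for r in fielded_search(records, language="en")])
```

Because the language field is restricted to known codes, a query for `language = "en"` cannot miss records that spell the language differently, which is the point of a controlled vocabulary.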
Information Retrieval with High Recall
Full-text Indexing (automated)
• Text only. Most effective on medium-length documents
on related topics. High recall requires tuning system to the
specific collection and skilled users.
Catalogs and Indexes (created manually)
• Can be used for all formats of material
• Requires close quality control of metadata creation
• High recall requires tuning system to the specific
collection and skilled users.
Using Metadata for Information Retrieval
The basic operation of information retrieval is to match
the way that a user describes an information requirement
(a query), against the way that items are described (an
index).
The success of conventional catalogs (e.g., MARC +
Anglo-American Cataloguing Rules) or indexing services
(e.g., Medline) comes from the combination of:
• precise language to describe items
• trained and experienced users to formulate queries.
Library Catalogs
Examples:
Cornell University Library catalog:
http://catalog.library.cornell.edu/
Library of Congress, Prints and Photographs:
http://www.loc.gov/rr/print/catalog.html
Origins of Library Catalogs
Bibliographic Objective:
• To bring together like items
• To differentiate among similar ones
Sir Anthony Panizzi, Keeper of Printed Books (1837-56) and
Principal Librarian (1856-66) at the British Museum.
His Ninety-One Rules (1841) were the
basis of modern catalog rules.
Origins of Library Catalogs
Information Discovery:
• to enable a person to find a book of which either the
author, title or subject is known
• to show what the library has by a given author, on a
given subject, or in a given kind of literature
• to assist in the choice of a book as to its edition
(bibliographically) or to its character (literary or topical).
Charles Ammi Cutter
Librarian of the Boston Athenaeum
Rules for a Dictionary Catalog, 1874
Origins of Library Catalogs
Classification:
• Division of subject matter into a hierarchy.
• Typically used in libraries to provide a subject-based order for shelving books.
Melvil Dewey
Acting Librarian of Amherst College (1874)
The Dewey Decimal system of book
classification uses the numbers 000 to 999
to cover the general fields of knowledge and
decimals to fit special subjects.
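The hierarchy implied by a Dewey number can be read off its digits. The sketch below is illustrative only: the main-class captions are taken from the published Dewey summaries as an assumption, and the helper names `dewey_main_class` and `dewey_path` are hypothetical. It uses the number 027.7 that appears later in these slides.

```python
# Captions for the ten main classes (assumed wording from the DDC summaries).
DEWEY_MAIN_CLASSES = {
    0: "Computer science, information & general works",
    1: "Philosophy & psychology",
    2: "Religion",
    3: "Social sciences",
    4: "Language",
    5: "Science",
    6: "Technology",
    7: "Arts & recreation",
    8: "Literature",
    9: "History & geography",
}


def _whole_part(number):
    """Return the three digits before the decimal point, or raise ValueError."""
    whole = number.split(".")[0]
    if len(whole) != 3 or not whole.isdigit():
        raise ValueError(f"not a Dewey number: {number!r}")
    return whole


def dewey_main_class(number):
    """Map a Dewey number such as '027.7' to the caption of its main class."""
    return DEWEY_MAIN_CLASSES[int(_whole_part(number)[0])]


def dewey_path(number):
    """List the successively narrower classes that contain the number."""
    whole = _whole_part(number)
    path = [whole[0] + "00", whole[:2] + "0", whole]
    if "." in number:
        path.append(number)
    # Drop repeats such as '000' appearing twice for the number '000'.
    return list(dict.fromkeys(path))


print(dewey_main_class("027.7"))
print(dewey_path("027.7"))
```

Each step in the path narrows the subject, so shelving books in numeric order keeps related subjects physically together.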
Library Catalogs: Technology Changes over the Years
Materials to be catalogued:
• Originally books
• Extended to serials, maps, music, etc., but concepts still rely
heavily on experience with books
Form of catalog:
• Entries in books (Panizzi)
• Index cards (Cutter)
• Online databases (Kilgour)
Shared Cataloguing: OCLC
OCLC -- Large centralized transaction processing database
system (http://www.oclc.org/)
When a library catalogs a book, it deposits a MARC record in OCLC.
Other libraries can copy the record:
• saves duplication of cataloguing
• OCLC has a database of holdings from all libraries
The OCLC database has 69 million records and serves 42,000 libraries.
When developed by Fred Kilgour in 1967, OCLC was a pioneering
computer system (it had to develop its own network, computer terminal,
etc.)
Catalogs as Investments
Costs:
• Conventional catalog records are created by skilled librarians
(cost estimate $100 per record).
• OCLC's catalog has 69 million records. Total investment is
several billion dollars.
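The "several billion dollars" figure follows directly from the two numbers above. A minimal sketch of the arithmetic, treating the $100 per record estimate as a flat average (an assumption; real cataloguing costs vary by record):

```python
COST_PER_RECORD_USD = 100    # cost estimate from the slide
OCLC_RECORDS = 69_000_000    # size of OCLC's catalog from the slide


def total_investment(records, cost_per_record):
    """Estimated total investment in dollars, assuming a flat cost per record."""
    return records * cost_per_record


total = total_investment(OCLC_RECORDS, COST_PER_RECORD_USD)
print(f"${total / 1e9:.1f} billion")  # prints "$6.9 billion"
```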
Cataloguing Standards:
• Enable libraries to share records
• Combine records of the past with records created today
• Allow readers and librarians to move between libraries
Layers of a Library Catalog
Encoding
• Rules that define how catalog records are encoded in a
computer system, e.g., XML mark-up.
Syntax
• Rules that define the fields and subfields, whether repeated,
optional, etc.
Semantics
• Rules that define the values of the field and subfield, with
instructions for cataloguers of what data to include and how
to decide when choices have to be made.
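The syntax layer is the one that lends itself most directly to automatic checking. The sketch below assumes a record held as a list of (tag, value) pairs and a hypothetical rule table `FIELD_RULES`; the repeatable and required flags are illustrative assumptions, not a statement of the MARC specification. The semantics layer (what a cataloguer should put in each field) is not something code like this can verify.

```python
from collections import Counter

# Illustrative syntax rules (assumed for this sketch).
FIELD_RULES = {
    "245": {"repeatable": False, "required": True},   # title statement
    "260": {"repeatable": False, "required": False},  # publisher
    "650": {"repeatable": True, "required": False},   # subject heading
}


def check_syntax(record):
    """Return a list of syntax problems for a list of (tag, value) pairs."""
    counts = Counter(tag for tag, _ in record)
    problems = []
    for tag, rule in FIELD_RULES.items():
        if rule["required"] and counts[tag] == 0:
            problems.append(f"missing required field {tag}")
        if not rule["repeatable"] and counts[tag] > 1:
            problems.append(f"field {tag} must not repeat")
    for tag in counts:
        if tag not in FIELD_RULES:
            problems.append(f"unknown field {tag}")
    return problems


good = [
    ("245", "Campus strategies for libraries and electronic information"),
    ("650", "Academic libraries--United States--Automation."),
    ("650", "Information technology--United States."),
]
print(check_syntax(good))
print(check_syntax([("650", "Information technology--United States.")]))
```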
Library Cataloging using the Anglo American Cataloguing Rules
Anglo American Cataloguing Rules (AACR2)
• Rules for each category of material, e.g., monographs
(books). Specify what fields should be used and what data
to include in each field. Text strings were originally
intended for printed catalog cards.
MARC format
• An exchange format for catalog records. Includes encoding
rules and syntax specification.
"MARC Catalog"
• Catalog in MARC format, where content of each field
follows AACR2.
Anglo American Cataloguing Rules
The Anglo American Cataloguing Rules (AACR) provide detailed
rules for
• the choice of fields
• the content of the data that goes into each field
• the syntax of the data that goes into each field
The rules are an excellent example of technical writing: precise but
clear. For an example, see:
http://www.cs.cornell.edu/Courses/cs430/2006fa/slides/AACR.pdf
Name authority files
An Authority File "brings together like items and differentiates
among similar ones."
• Caroline R. Arms or Caroline Ruth Arms?
• Which William Phillips of Cardiff?
• Mark Twain or Samuel Clemens?
• Epithets:
  of Cardiff
  doctor
• Dates:
  1832 - 1876
  flourished 1860
  circa 1832 - 1876
Name authority: example
LC Control Number: n 87870182
HEADING: Arms, Caroline R. (Caroline Ruth)
000 00907cz 2200205n 450
001 4383796
005 19890706143144.8
008 70909n|acannaab |a aaa c
010 __ |a n 87870182
035 __ |a (DLC)n 87870182
040 __ |a InU |c DLC |d DLC
100 10 |a Arms, Caroline R. |q (Caroline Ruth)
400 10 |w nna |a Arms, Caroline Ruth
400 10 |a Arms, C. R. |q (Caroline Ruth)
670 __ |a Arms, W.Y. Report on the performance problems of the RLIN computer system, 1982: |b t.p. (Caroline R. Arms)
670 __ |a LC data base, 8/24/87 |b (hdg.: Arms, Caroline Ruth; usage: Caroline R. Arms, C. R. Arms)
670 __ |a Campus networking strategies, 1988: |b CIP t.p. (Caroline Arms)
670 __ |a Phone call to pub., 2/10/88 |b (Caroline Ruth Arms; studied at Oxford)
670 __ |a Campus strategies for libraries and electronic information, c1990: |b CIP t.p. (Caroline Arms) data sheet (b. 10-24-45)
953 __ |a bz46 |b bd24
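In the record above, field 100 carries the authorized heading and the 400 fields carry variant forms that should lead to it. A minimal sketch of how a system might use such a record, assuming a simple dictionary keyed by control number; the structure, the `normalize` rule, and the function names are hypothetical, not a description of how the Library of Congress implements its authority file.

```python
# One authority record, transcribed from the 100 and 400 fields above.
AUTHORITY = {
    "n 87870182": {
        "heading": "Arms, Caroline R. (Caroline Ruth)",
        "see_from": [
            "Arms, Caroline Ruth",
            "Arms, C. R. (Caroline Ruth)",
        ],
    },
}


def normalize(name):
    """Assumed matching rule: ignore case and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def build_index(authority):
    """Map every normalized heading and variant to its control number."""
    index = {}
    for control_number, record in authority.items():
        for form in [record["heading"], *record["see_from"]]:
            index[normalize(form)] = control_number
    return index


INDEX = build_index(AUTHORITY)


def authorized_heading(name):
    """Return the authorized heading for any known form, else None."""
    control_number = INDEX.get(normalize(name))
    if control_number is None:
        return None
    return AUTHORITY[control_number]["heading"]


print(authorized_heading("Arms, Caroline Ruth"))
print(authorized_heading("Twain, Mark"))
```

Every known form resolves to one heading, which "brings together like items"; forms that are absent return nothing rather than guessing, which keeps similar names apart.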
Subject information
Library of Congress Subject Headings
  Academic libraries--United States--Automation
Hierarchical classification
  Library of Congress call number: Z675.U5C16
  Dewey Decimal Classification: 027.7
Creation and maintenance of lists of subject headings and
classifications is a never ending task.
MARC Format
The MARC format was developed in the late 1960s as a
tagging scheme for exchanging catalog records on magnetic
tape. It remains the standard way to represent such data.
At present, MARC is slowly being converted to
modern computing formats, e.g., Unicode, XML.
MARC: Monograph catalog record
Citation
Caroline R. Arms, editor, Campus strategies for libraries and
electronic information. Bedford, MA: Digital Press, 1990.
MARC fields
tag value [description]
001 89-16879 r93
050 Z675.U5C16 1990
082 027.7/0973 20
245 Campus strategies for libraries and electronic information/Caroline Arms, editor. [title statement]
260 {Bedford, Mass.} : Digital Press, c1990. [publisher]
300 xi, 404 p. : ill. ; 24 cm. [collation]
440 EDUCOM strategies series on information technology [series title]
504 Includes bibliographical references (p. {373}-381).
020 ISBN 1-55558-036-X : $34.95
MARC fields (continued)
650 Academic libraries--United States--Automation. [subject heading]
650 Libraries and electronic publishing--United States.
650 Library information networks--United States.
650 Information technology--United States.
700 Arms, Caroline R. (Caroline Ruth)
040 DLC DLC DLC
043 n-us--
955 CIP ver. br02 to SL 02-26-90
985 APIF/MIG
MARC Encoding
tag: 260
subfield a: {Bedford, Mass.} :
subfield b: Digital Press,
subfield c: c1990.
Note that the content is designed to be part of a printed catalog record
and is not in a convenient format for computer manipulation.
MARC encoding:
&2600#abc#{Bedford, Mass.} :#Digital Press,#c1990.%
[Definitely not a modern encoding!]
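The encoded string above can be unpacked with a few lines of code. This sketch parses only the simplified notation used on the slide, under these assumptions: `&` starts the field, `%` ends it, `#` separates parts, the first three characters are the tag, and the list of subfield codes precedes the values. The meaning of the `0` after the tag is not stated on the slide, so it is kept as an uninterpreted `extra` value. The function name `parse_slide_field` is hypothetical, and this is not a parser for real MARC exchange files.

```python
def parse_slide_field(encoded):
    """Parse one field written in the slide's simplified notation."""
    if not (encoded.startswith("&") and encoded.endswith("%")):
        raise ValueError("expected a field of the form &...%")
    parts = encoded[1:-1].split("#")
    if len(parts) < 2:
        raise ValueError("expected a header and a list of subfield codes")
    header, codes, values = parts[0], parts[1], parts[2:]
    if len(codes) != len(values):
        raise ValueError("number of subfield codes and values differ")
    return {
        "tag": header[:3],
        "extra": header[3:],  # meaning not given on the slide
        # A list of pairs, so that repeated subfield codes are preserved.
        "subfields": list(zip(codes, values)),
    }


field = parse_slide_field("&2600#abc#{Bedford, Mass.} :#Digital Press,#c1990.%")
print(field["tag"])
for code, value in field["subfields"]:
    print(code, repr(value))
```

Even after parsing, the values still carry print punctuation such as the trailing ` :` and `,`, which illustrates the slide's point that the content was designed for catalog cards rather than for computation.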
Modernizing MARC
1. Keep the content of the catalog record
2. Convert to Unicode for representing scripts
3. Convert to XML for tagging cataloguing metadata.
MARCXML (MARC 21 XML)
http://www.loc.gov/standards/marcxml/
[Direct conversion to XML tagging]
Metadata Object Description Schema (MODS)
http://www.loc.gov/standards/mods/
[Subset of MARC with data clean-up]
MARC XML
• Simple and Flexible MARC XML Schema
The schema retains the semantics of MARC. Fields are
treated as elements with the tag as an attribute and
indicators treated as attributes. Subfields are treated as
subelements with the subfield code as an attribute.
• Lossless Conversion of MARC to XML
• Roundtripability from XML back to MARC
• Data Presentation by writing an XML stylesheet
• Validation of MARC data
• Extensibility
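The mapping described in the first bullet, and the round trip in the third, can be sketched with Python's standard XML library. The element and attribute names (`datafield`, `subfield`, `tag`, `ind1`, `ind2`, `code`) and the namespace follow the MARC 21 XML schema; the blank indicators used for the 260 field are an assumption, since the slides do not give them, and the two helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARC 21 XML namespace


def datafield_to_xml(tag, ind1, ind2, subfields):
    """Build a <datafield> element from a tag, two indicators, and subfields."""
    field = ET.Element(f"{{{MARC_NS}}}datafield",
                       {"tag": tag, "ind1": ind1, "ind2": ind2})
    for code, value in subfields:
        sub = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": code})
        sub.text = value
    return field


def xml_to_datafield(field):
    """Recover (tag, ind1, ind2, subfields) from a <datafield> element."""
    subfields = [(s.get("code"), s.text)
                 for s in field.findall(f"{{{MARC_NS}}}subfield")]
    return field.get("tag"), field.get("ind1"), field.get("ind2"), subfields


original = ("260", " ", " ", [
    ("a", "{Bedford, Mass.} :"),
    ("b", "Digital Press,"),
    ("c", "c1990."),
])

xml_bytes = ET.tostring(datafield_to_xml(*original))
round_tripped = xml_to_datafield(ET.fromstring(xml_bytes))

print(xml_bytes.decode())
print(round_tripped == original)
```

Nothing in the field is reinterpreted on the way into XML, which is what makes the conversion lossless: the content, including the card-style punctuation, is carried over unchanged.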
MODS Example (extracts)
<mods>
  <titleInfo>
    <title>Sound and fury :</title>
    <subTitle>the making of the punditocracy /</subTitle>
  </titleInfo>
  <name type="personal">
    <namePart>Alterman, Eric</namePart>
    <role>
      <roleTerm type="text">creator</roleTerm>
    </role>
  </name>
MODS Example (extracts)
  <typeOfResource>text</typeOfResource>
  <originInfo>
    <place>
      <placeTerm type="text">Ithaca, N.Y</placeTerm>
    </place>
    <publisher>Cornell University Press</publisher>
    <dateIssued>c1999</dateIssued>
  </originInfo>
  <language>
    <languageTerm authority="iso639-2b" type="code">eng</languageTerm>
  </language>
</mods>
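Because MODS uses descriptive element names, extracting fields from the record above needs no knowledge of MARC tags. A minimal sketch, assuming the two extracts are joined into one document exactly as shown (without an XML namespace; complete MODS records normally declare one, which would change the paths below). The function name `mods_summary` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The two slide extracts joined into a single record.
MODS_EXTRACT = """<mods>
  <titleInfo>
    <title>Sound and fury :</title>
    <subTitle>the making of the punditocracy /</subTitle>
  </titleInfo>
  <name type="personal">
    <namePart>Alterman, Eric</namePart>
    <role>
      <roleTerm type="text">creator</roleTerm>
    </role>
  </name>
  <typeOfResource>text</typeOfResource>
  <originInfo>
    <place>
      <placeTerm type="text">Ithaca, N.Y</placeTerm>
    </place>
    <publisher>Cornell University Press</publisher>
    <dateIssued>c1999</dateIssued>
  </originInfo>
  <language>
    <languageTerm authority="iso639-2b" type="code">eng</languageTerm>
  </language>
</mods>"""


def mods_summary(xml_text):
    """Pull a few descriptive fields out of a namespace-free MODS record."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("titleInfo/title"),
        "subtitle": root.findtext("titleInfo/subTitle"),
        "creator": root.findtext("name/namePart"),
        "publisher": root.findtext("originInfo/publisher"),
        "date": root.findtext("originInfo/dateIssued"),
        "language": root.findtext("language/languageTerm"),
    }


summary = mods_summary(MODS_EXTRACT)
for name, value in summary.items():
    print(f"{name}: {value}")
```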
Notes on MARC
A great achievement:
• Developed in 1960s
• Magnetic tape exchange format for printing catalog records
• The dawn of computing:
  mixed upper and lower case
  variable length fields,
  repeated fields
  non-Roman scripts
• 100(?) million records with standard content and format
• Thousands of trained librarians (millions?)
Notes on MARC
A great problem:
• Not designed for computer algorithms
• One record per item (poor links between records)
• Tied to traditional materials and traditional practices
• Not Unicode
• 100 million records at $100 each: $10 billion
A classic legacy system!