CS/INFO 430 Information Retrieval Metadata 2 Lecture 20

CS/INFO 430
Information Retrieval
Lecture 20
Metadata 2
Course Administration
Cataloguing Online Materials:
Dublin Core
Dublin Core is an attempt to apply cataloguing methods to online
materials, notably the Web.
History
It was anticipated that the methods of full text indexing that were
used by the early Web search engines, such as Lycos, would not
scale up.
"... [automated] indexes are most useful in small collections within
a given domain. As the scope of their coverage expands, indexes
succumb to problems of large retrieval sets and problems of cross
disciplinary semantic drift. Richer records, created by content
experts, are necessary to improve search and retrieval."
Weibel 1995
Dublin Core
Simple set of metadata elements for online information
• 15 basic elements
• intended for all types and genres of material
• all elements optional
• all elements repeatable
Developed by an international group chaired by Stuart Weibel
since 1995.
(Diane Hillmann of Cornell has been very active in this group.)
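The four properties listed above map onto a very small data structure. The following is a minimal sketch, assuming a record is stored as a mapping from element name to a list of values. The names `DC_ELEMENTS` and `make_record`, and the two subject values, are illustrative and not part of the Dublin Core specification.

```python
# The 15 elements of the simple Dublin Core element set.
DC_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def make_record(pairs):
    """Build a record from (element, value) pairs.

    Every element is optional (absent keys are fine) and repeatable
    (each key maps to a list of values).
    """
    record = {}
    for element, value in pairs:
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element!r}")
        record.setdefault(element, []).append(value)
    return record


record = make_record([
    ("title", "Dublin Core Metadata Initiative (DCMI) Home Page"),
    ("language", "en"),
    ("format", "text/html"),
    ("subject", "metadata"),     # hypothetical value, for illustration
    ("subject", "cataloguing"),  # the same element, repeated
])
print(record["subject"])
```

Storing every value in a list keeps the "repeatable" rule uniform: a single title and a repeated subject are handled the same way.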
Dublin Core record for the Dublin Core Web Site
contributor: Dublin Core Metadata Initiative
description: The Dublin Core Metadata Initiative is an open
forum engaged in the development of interoperable online
metadata standards that support a broad range of purposes and
business models...
title: Dublin Core Metadata Initiative (DCMI) Home Page
date: 2004-10-05
format: text/html (MIME type)
language: en (English)
Dublin Core elements
Element Name: Title
Definition: A name given to the resource.
Comment: Typically, Title will be a name by which the
resource is formally known.
Element Name: Creator
Definition: An entity primarily responsible for making the
content of the resource.
Comment: Examples of Creator include a person, an
organization, or a service. Typically, the name of a
Creator should be used to indicate the entity.
Dublin Core elements
Element Name: Subject
Definition: A topic of the content of the resource.
Comment: Typically, Subject will be expressed as keywords, key
phrases or classification codes that describe a topic of the
resource. Recommended best practice is to select a value from
a controlled vocabulary or formal classification scheme.
Element Name: Description
Definition: An account of the content of the resource.
Comment: Examples of Description include, but are not limited
to: an abstract, table of contents, reference to a graphical
representation of content or a free-text account of the content.
Dublin Core elements
Element Name: Publisher
Definition: An entity responsible for making the resource available
Comment: Examples of Publisher include a person, an organization,
or a service. Typically, the name of a Publisher should be used to
indicate the entity.
Element Name: Contributor
Definition: An entity responsible for making contributions to the
content of the resource.
Comment: Examples of Contributor include a person, an
organization, or a service. Typically, the name of a Contributor
should be used to indicate the entity.
Dublin Core elements
Element Name: Date
Definition: A date of an event in the lifecycle of the resource.
Comment: Typically, Date will be associated with the creation or
availability of the resource. Recommended best practice for
encoding the date value is defined in a profile of ISO 8601
[W3CDTF] and includes (among others) dates of the form YYYY-MM-DD.
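A date value can be checked mechanically against the recommended encoding. The sketch below covers only the date-only forms of W3CDTF (YYYY, YYYY-MM, YYYY-MM-DD); the profile also allows times and time zones, which are ignored here. The function name `is_w3cdtf_date` is an assumption made for illustration.

```python
import re
from datetime import date

# Date-only forms of W3CDTF: YYYY, YYYY-MM, YYYY-MM-DD.
_W3CDTF_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_w3cdtf_date(value):
    """Return True if value is a valid date-only W3CDTF string."""
    match = _W3CDTF_DATE.fullmatch(value)
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # A missing month or day defaults to 1 so the calendar check still runs.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True


for candidate in ["2004-10-05", "1997-11", "January 2000", "2004-02-30"]:
    print(candidate, is_w3cdtf_date(candidate))
```

A free-text value such as "January 2000" is still a legal Date value, since the encoding is only recommended best practice; a check like this flags it rather than rejecting the record.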
Dublin Core elements
Element Name: Type
Definition: The nature or genre of the content of the resource.
Comment: Type includes terms describing general categories,
functions, genres, or aggregation levels for content.
Recommended best practice is to select a value from a controlled
vocabulary (for example, the DCMI Type Vocabulary [DCT1]). To
describe the physical or digital manifestation of the resource, use
the FORMAT element.
Dublin Core elements
Element Name: Format
Definition: The physical or digital manifestation of the resource.
Comment: Typically, Format may include the media-type or
dimensions of the resource. Format may be used to identify the
software, hardware, or other equipment needed to display or
operate the resource. Examples of dimensions include size and
duration. Recommended best practice is to select a value from a
controlled vocabulary (for example, the list of Internet Media
Types [MIME] defining computer media formats).
Dublin Core elements
Element Name: Identifier
Definition: An unambiguous reference to the resource within a
given context.
Comment: Recommended best practice is to identify the resource
by means of a string or number conforming to a formal
identification system. Formal identification systems include but
are not limited to the Uniform Resource Identifier (URI)
(including the Uniform Resource Locator (URL)), the Digital
Object Identifier (DOI) and the International Standard Book
Number (ISBN).
Dublin Core elements
Element Name: Source
Definition: A reference to a resource from which the present
resource is derived.
Comment: The present resource may be derived from the Source
resource in whole or in part. Recommended best practice is to
identify the referenced resource by means of a string or number
conforming to a formal identification system.
Dublin Core elements
Element Name: Language
Definition: A language of the intellectual content of the resource.
Comment: Recommended best practice is to use RFC 3066
[RFC3066] which, in conjunction with ISO 639 [ISO639], defines
two- and three-letter primary language tags with optional subtags.
Examples include "en" or "eng" for English, "akk" for Akkadian,
and "en-GB" for English used in the United Kingdom.
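The structure of such a tag (a primary language tag followed by optional subtags) can be illustrated with a short sketch. It checks only the shape described above, a two- or three-letter primary tag; it does not look the code up in ISO 639. The function name `split_language_tag` is hypothetical.

```python
def split_language_tag(tag):
    """Split a tag such as "en-GB" into (primary tag, list of subtags)."""
    primary, *subtags = tag.split("-")
    is_letters = primary.isascii() and primary.isalpha()
    if not (is_letters and 2 <= len(primary) <= 3):
        raise ValueError(f"unexpected primary language tag: {primary!r}")
    return primary.lower(), subtags


print(split_language_tag("en-GB"))
print(split_language_tag("akk"))
```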
Element Name: Relation
Definition: A reference to a related resource.
Comment: Recommended best practice is to identify the referenced
resource by means of a string or number conforming to a formal
identification system.
Dublin Core elements
Element Name: Coverage
Definition: The extent or scope of the content of the resource.
Comment: Typically, Coverage will include spatial location (a place
name or geographic coordinates), temporal period (a period label,
date, or date range) or jurisdiction (such as a named administrative
entity). Recommended best practice is to select a value from a
controlled vocabulary (for example, the Thesaurus of Geographic
Names [TGN]) and to use, where appropriate, named places or
time periods in preference to numeric identifiers such as sets of
coordinates or date ranges.
Dublin Core elements
Element Name: Rights
Definition: Information about rights held in and over the resource.
Comment: Typically, Rights will contain a rights management
statement for the resource, or reference a service providing such
information. Rights information often encompasses Intellectual
Property Rights (IPR), Copyright, and various Property Rights. If
the Rights element is absent, no assumptions may be made about
any rights held in or over the resource.
Qualifiers
A qualifier refines the element name to add specificity
Example: element qualifier
Example: Date
DC.Date.Created      1997-11-01
DC.Date.Issued       1997-11-15
DC.Date.Available    1997-12-01/1998-06-01
DC.Date.Valid        1998-01-01/1998-06-01
Qualifiers
Example: value qualifiers
Example: Subject
DC.Subject.DDC     509.123    (Dewey Decimal Classification)
DC.Subject.LCSH    Digital libraries-United States    (Library of Congress Subject Heading)
Dumbing Down Principle
"The theory behind this principle is that consumers
of metadata should be able to strip off qualifiers and
return to the base form of a property. ... this principle
makes it possible for client applications to ignore
qualifiers in the context of more coarse-grained,
cross-domain searches."
Lagoze 2001
Dumbing Down Principle
Qualified version
DC.Date.Created    1997-11-01
DC.Subject.LCSH    Digital libraries-United States
Dumbed-down version
DC.Date       1997-11-01    (a valid date)
DC.Subject    Digital libraries-United States    (a valid subject description)
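Dumbing down is mechanical enough to sketch in a few lines. The example assumes qualified names are written in the dotted form used on these slides (`DC.Element.Qualifier`) and that a record is a list of (name, value) pairs; the function name `dumb_down` is illustrative.

```python
def dumb_down(record):
    """Strip qualifiers from (name, value) pairs.

    "DC.Date.Created" becomes "DC.Date"; values are kept unchanged.
    """
    simple = []
    for name, value in record:
        parts = name.split(".")
        # parts[0] is the "DC" prefix and parts[1] the element name;
        # anything after that is a qualifier and is dropped.
        simple.append((".".join(parts[:2]), value))
    return simple


qualified = [
    ("DC.Date.Created", "1997-11-01"),
    ("DC.Subject.LCSH", "Digital libraries-United States"),
]
for name, value in dumb_down(qualified):
    print(name, value)
```

The principle holds only if each value still makes sense without its qualifier, which is why the values here remain a valid date and a valid subject description after stripping.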
Dublin Core with qualifiers
See the next two slides for an example
of a Dublin Core record for a web site
prepared by a professional cataloguer at
the Library of Congress.
Note that the record does not follow the
principle of dumbing-down.
Theoretical Problems in Metadata:
What to Catalog
The IFLA Model
Work. A work is the underlying abstraction, e.g.,
• The Iliad
• The Computer Science departmental web site
• Beethoven's Fifth Symphony
• Unix operating system
• The 1996 U.S. census
This is roughly equivalent to the concept of "literary
work" used in copyright law.
IFLA Model
Expression. A work is realized through an expression, e.g.,
• The Iliad has oral expressions and written expressions
• A musical work has score and performance(s).
• Software has source code and machine code
Many works have only a single expression, e.g. a Web page, or a
book.
IFLA Model
Manifestation. An expression is given form in one or more
manifestations, e.g.,
• The text of The Iliad has been manifest in numerous
manuscripts and printed books.
• A musical performance can be distributed on CD, or
broadcast on television.
• Software is manifest as files, which may be stored or
transmitted in any digital medium.
IFLA Model
Item. When many copies are made of a manifestation,
each is a separate item, e.g.,
• a specific copy of a book
• computer file
[Works, expressions, manifestations and items are
explored in CS 431, Architecture of Web
Information Systems.]
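The four levels form a containment hierarchy, which can be sketched as nested data classes. This is an illustration of the idea only, not the IFLA model's formal definition; the class fields, the helper `count_items`, and the sample copies are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    label: str  # one specific copy


@dataclass
class Manifestation:
    form: str  # e.g. printed book, CD, file
    items: list = field(default_factory=list)


@dataclass
class Expression:
    form: str  # e.g. written text, performance
    manifestations: list = field(default_factory=list)


@dataclass
class Work:
    name: str  # the underlying abstraction
    expressions: list = field(default_factory=list)


def count_items(work):
    """Count every item under every manifestation of every expression."""
    return sum(
        len(manifestation.items)
        for expression in work.expressions
        for manifestation in expression.manifestations
    )


iliad = Work("The Iliad", [
    Expression("written text", [
        Manifestation("printed book", [Item("copy 1"), Item("copy 2")]),
        Manifestation("manuscript", [Item("a single manuscript")]),
    ]),
    Expression("oral recitation"),
])
print(count_items(iliad))
```

The cataloguing question is which level a metadata record describes: a record for the work says nothing about a particular copy, while a record for an item inherits most of its content from the levels above it.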
Theoretical Problems in Metadata:
Events
[Figure: Version 1, with new material added, becomes Version 2]
Should Version 2 have its own record or should extra
information be added to the Version 1 record?
How are these represented in Dublin Core or MARC?
Theoretical Problems in Metadata:
Complex Objects
[Figure: complex objects and their metadata records, for the complete object and for its sub-objects]
• Article within a journal
• Page within a Web site
• A thumbnail of another image
• The March 28 final edition of a newspaper
Theoretical Problems in Metadata:
Packaging Rules
When an object consists of various parts, how should their
interaction be described?
Example: An object on the Web may consist of several html pages
with images, applets, etc.
Metadata Object Description Schema (MODS)
http://www.loc.gov/standards/mods/
MPEG 21
http://www.chiariglione.org/mpeg/standards/mpeg-21/mpeg-21.htm
MPEG 21
Theoretical Problems in Metadata:
Flat v. linked records
Flat record
All information about an item is held in a single record (e.g., a
Dublin Core record), including information about related items
• convenient for access and preservation
• information is repeated -- maintenance problem
Linked record
Related information is held in separate records with a link from the
item record
• less convenient for access and preservation
• information is stored once
Compare with normal forms in relational databases
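The trade-off can be made concrete with a small sketch. The article titles and the key `s1` are hypothetical; the serial name and ISSN are those of D-Lib Magazine. In the flat form the serial's details are copied into every article record, while in the linked form they are stored once and reached through a link.

```python
# Flat: the serial's details are repeated in every article record.
flat_records = [
    {"title": "Article A", "serial_name": "D-Lib Magazine", "issn": "1082-9873"},
    {"title": "Article B", "serial_name": "D-Lib Magazine", "issn": "1082-9873"},
]

# Linked: the serial is stored once; each article record holds a link to it.
serials = {"s1": {"serial_name": "D-Lib Magazine", "issn": "1082-9873"}}
linked_records = [
    {"title": "Article A", "serial": "s1"},
    {"title": "Article B", "serial": "s1"},
]


def resolve(record):
    """Follow the link to build the flat view of a linked record."""
    flat = {key: value for key, value in record.items() if key != "serial"}
    flat.update(serials[record["serial"]])
    return flat


# Both forms carry the same information...
print([resolve(r) for r in linked_records] == flat_records)

# ...but a correction to the serial is one update in the linked form,
# and one update per article record in the flat form.
serials["s1"]["serial_name"] = "D-Lib Magazine (corrected)"
print(resolve(linked_records[1])["serial_name"])
```

The cost of the linked form is the extra lookup in `resolve`: a linked record is not self-contained, which is what makes it less convenient for access and preservation.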
Representations of Dublin Core:
XML (with qualifiers)
<title>Digital Libraries and the Problem of Purpose</title>
<creator>David M. Levy</creator>
<publisher>Corporation for National Research Initiatives</publisher>
<date date-type = "publication">January 2000</date>
<type resource-type = "work">article</type>
<identifier uri-type = "DOI">10.1045/january2000-levy</identifier>
<identifier uri-type = "URL">http://www.dlib.org/dlib/january00/01levy.html</identifier>
<language>English</language>
<rights>Copyright (c) David M. Levy</rights>
to be continued
Dublin Core with flat record extension
Continuation of D-Lib Magazine record
<relation rel-type = "InSerial">
<serial-name>D-Lib Magazine</serial-name>
<issn>1082-9873</issn>
<volume>6</volume>
<issue>1</issue>
</relation>
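A client can read such a record with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on an abridged copy of the D-Lib Magazine record. The `<record>` wrapper is an assumption: the slides show only the child elements, and XML needs a single root element to parse.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the record; <record> is an assumed root element.
XML = """\
<record>
  <title>Digital Libraries and the Problem of Purpose</title>
  <creator>David M. Levy</creator>
  <identifier uri-type="DOI">10.1045/january2000-levy</identifier>
  <identifier uri-type="URL">http://www.dlib.org/dlib/january00/01levy.html</identifier>
  <relation rel-type="InSerial">
    <serial-name>D-Lib Magazine</serial-name>
    <issn>1082-9873</issn>
    <volume>6</volume>
    <issue>1</issue>
  </relation>
</record>
"""

root = ET.fromstring(XML)

# A repeated element: the attribute plays the role of the qualifier.
identifiers = {el.get("uri-type"): el.text for el in root.findall("identifier")}
print(identifiers["DOI"])

# The flat record extension is nested under <relation>.
print(root.findtext("relation/issn"))

# A simple view: keep elements without children and ignore attributes,
# so the nested extension is skipped.
simple = [(el.tag, el.text) for el in root if len(el) == 0]
print(simple)
```

A client that understands only simple Dublin Core can use the last view and still find the title, creator and identifiers; only the serial details in the extension are lost.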
Theoretical Problems in Metadata:
Many Languages
See:
Thomas Baker, Languages for Dublin Core, D-Lib Magazine
December 1998,
http://www.dlib.org/dlib/december98/12baker.html