CS 430: Information Discovery Descriptive Metadata 3 Dublin Core

CS 430: Information Discovery
Lecture 7
Descriptive Metadata 3
Dublin Core
Automatic Generation of Catalog Records
1
Course Administration
2
Relationship between Library of Congress, OCLC and
American Memory
Dublin Core elements
1. Title The name given to the resource by the creator or publisher.
2. Creator The person or organization primarily responsible for
the intellectual content of the resource. For example, authors in the
case of written documents, artists, photographers, or illustrators in
the case of visual resources.
3. Subject The topic of the resource. Typically, subject will be
expressed as keywords or phrases that describe the subject or
content of the resource. The use of controlled vocabularies and
formal classification schemes is encouraged.
3
Dublin Core elements
4. Description A textual description of the content of the resource,
including abstracts in the case of document-like objects or content
descriptions in the case of visual resources.
5. Publisher The entity responsible for making the resource
available in its present form, such as a publishing house, a university
department, or a corporate entity.
6. Contributor A person or organization not specified in a creator
element who has made significant intellectual contributions to the
resource but whose contribution is secondary to any person or
organization specified in a creator element (for example, editor,
transcriber, and illustrator).
4
Dublin Core elements
7. Date A date associated with the creation or availability of the
resource.
8. Type The category of the resource, such as home page, novel,
poem, working paper, preprint, technical report, essay,
dictionary.
9. Format The data format of the resource, used to identify the
software and possibly hardware that might be needed to display
or operate the resource.
10. Identifier A string or number used to uniquely identify the
resource. Examples for networked resources include URLs and
URNs.
5
Dublin Core elements
11. Source Information about a second resource from which
the present resource is derived.
12. Language The language of the intellectual content of the
resource.
13. Relation An identifier of a second resource and its
relationship to the present resource. This element permits links
between related resources and resource descriptions to be
indicated. Examples include an edition of a work
(IsVersionOf), or a chapter of a book (IsPartOf).
6
Dublin Core elements
14. Coverage The spatial locations and temporal durations
characteristic of the resource.
15. Rights A rights management statement, an identifier that
links to a rights management statement, or an identifier that
links to a service providing information about rights
management for the resource.
7
Qualifiers
Element qualifier
Example: Date
DC.Date -> Created: 1997-11-01
DC.Date -> Issued: 1997-11-15
DC.Date -> Available: 1997-12-01/1998-06-01
DC.Date -> Valid: 1998-01-01/1998-06-01
8
Qualifiers
Value qualifiers
Example: Subject
DC.Subject -> DDC: 509.123
DC.Subject -> LCSH: Digital libraries--United States
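
The two kinds of qualifier shown above can be modelled as optional fields on a single statement. The following is a minimal sketch, not part of any Dublin Core specification: the class name `Statement` and the field names `refinement` and `scheme` are assumptions chosen for illustration. It also shows that dropping the qualifiers still leaves a usable unqualified statement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    """One Dublin Core statement; both qualifiers are optional."""
    element: str                       # e.g. "Date" or "Subject"
    value: str
    refinement: Optional[str] = None   # element qualifier, e.g. "Created"
    scheme: Optional[str] = None       # value qualifier, e.g. "DDC"

    def unqualified(self) -> "Statement":
        """Drop both qualifiers, keeping only element and value."""
        return Statement(self.element, self.value)


# Values copied from the Date and Subject examples on the slides.
record = [
    Statement("Date", "1997-11-01", refinement="Created"),
    Statement("Date", "1997-11-15", refinement="Issued"),
    Statement("Subject", "509.123", scheme="DDC"),
]

# The same record as a simple 15-element application would see it.
simple = [s.unqualified() for s in record]
```

An element qualifier narrows the meaning of the element (which date?), while a value qualifier says how to interpret the value (which classification scheme?), so the sketch keeps them in separate fields.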
9
Metadata about subjects
(a) Classification (usually manual)
Dewey Decimal Classification (DDC)
324.973 political web site
Library of Congress classification system (LCC)
E840.8.G65 political web site
(b) Subject headings (usually manual)
Keywords assigned from controlled vocabulary
e.g., Medical Subject Headings (MeSH)
Library of Congress subject headings (LCSH)
Political campaigns--United States
(c) Terms extracted from text (automatic)
Automatic indexing [CS 430]
Methods from computational linguistics [CS 374/474]
10
Dewey Decimal Classification
Main classes:
000 Computers, information, & general reference
100 Philosophy & psychology
200 Religion
300 Social sciences
400 Language
500 Science
600 Technology
700 Arts & recreation
800 Literature
900 History & geography
11
Dewey Decimal Classification
Hierarchy, e.g.:
600 Technology (Applied sciences)
630 Agriculture and related technologies
636 Animal husbandry
636.7 Dogs
636.8 Cats
Uses:
• Shelving collections of physical objects so that items on
similar subjects are shelved together
• Crude subject access
Scorpion project (OCLC):
Automatic subject recognition and assignment of DDC classes
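
The hierarchy in the 600 > 630 > 636 > 636.7 example is carried by the notation itself, which is what makes crude subject access possible: a broader class can be found by shortening the number. The sketch below assumes the hierarchy is purely notational, which holds for this example but not for every DDC number; the function name `ddc_ancestors` is hypothetical.

```python
def ddc_ancestors(number: str) -> list[str]:
    """Return the broader classes of a DDC number, most specific first."""
    whole, _, fraction = number.partition(".")
    whole = whole.zfill(3)
    ancestors = []
    # Shorten the decimal part one digit at a time: 324.973 -> 324.97 -> 324.9
    for end in range(len(fraction) - 1, 0, -1):
        ancestors.append(f"{whole}.{fraction[:end]}")
    if fraction:
        ancestors.append(whole)
    # Then zero out the units and the tens digit: 636 -> 630 -> 600
    for keep in (2, 1):
        broader = whole[:keep].ljust(3, "0")
        if broader != whole and broader not in ancestors:
            ancestors.append(broader)
    return ancestors


print(ddc_ancestors("636.7"))   # ['636', '630', '600']
```

A search for 636 (animal husbandry) could then match any item whose number has 636 among its ancestors, including dogs (636.7) and cats (636.8).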
12
13
14
Limits of Dublin Core
Complex objects
Metadata records
Complete object
Sub-objects
• Article within a journal
• A thumbnail of another image
• The March 28 final edition of a newspaper
15
Flat v. linked records
Flat record
All information about an item is held in a single Dublin Core
record, including information about related items
convenient for access and preservation
information is repeated -- maintenance problem
Linked record
Related information is held in separate records with a link from the
item record
less convenient for access and preservation
information is stored once
Compare with normal forms in relational databases
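
The trade-off can be made concrete with two articles from the same serial. In this sketch the dictionary layout, the `issn:` key format, and the function name `flatten` are assumptions made for illustration; the titles and ISSN are taken from elsewhere in these slides.

```python
# Linked form: the serial is described once and referenced by key.
serials = {
    "issn:1082-9873": {"serial-name": "D-Lib Magazine", "issn": "1082-9873"},
}
linked_items = [
    {"title": "Digital Libraries and the Problem of Purpose",
     "relation": "issn:1082-9873"},
    {"title": "Languages for Dublin Core",
     "relation": "issn:1082-9873"},
]


def flatten(item: dict, serials: dict) -> dict:
    """Copy the linked serial's fields into the item, giving a flat record."""
    flat = {k: v for k, v in item.items() if k != "relation"}
    flat.update(serials[item["relation"]])
    return flat


# Flat form: every item carries its own copy of the serial information.
flat_items = [flatten(item, serials) for item in linked_items]
```

In the flat form each record can be read or preserved on its own, but a correction to the serial name must be made in every record. In the linked form it is made once, in `serials`.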
16
17
Dublin Core with qualifiers
<title>Digital Libraries and the Problem of Purpose</title>
<creator>David M. Levy</creator>
<publisher>Corporation for National Research Initiatives</publisher>
<date date-type = "publication">January 2000</date>
<type resource-type = "work">article</type>
<identifier uri-type = "DOI">10.1045/january2000-levy</identifier>
<identifier uri-type = "URL">http://www.dlib.org/dlib/january00/01levy.html</identifier>
<language>English</language>
<rights>Copyright (c) David M. Levy</rights>
18
Dublin Core with flat record extension
Continuation
<relation rel-type = "InSerial">
<serial-name>D-Lib Magazine</serial-name>
<issn>1082-9873</issn>
<volume>6</volume>
<issue>1</issue>
</relation>
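
A record in this form can be read into (field, qualifier, content) rows with a standard XML parser. This is only a sketch: the slides show a bare list of elements, so the `<record>` wrapper is an assumption added to make the fragment well-formed, and the function name `rows` is hypothetical. Only some of the elements are repeated here to keep the example short.

```python
import xml.etree.ElementTree as ET

# The <record> wrapper is not on the slides; it is added so that the
# fragment has a single root element and can be parsed.
FRAGMENT = """<record>
<title>Digital Libraries and the Problem of Purpose</title>
<date date-type = "publication">January 2000</date>
<identifier uri-type = "DOI">10.1045/january2000-levy</identifier>
<relation rel-type = "InSerial">
<serial-name>D-Lib Magazine</serial-name>
<issn>1082-9873</issn>
</relation>
</record>"""


def rows(record: ET.Element):
    """Yield (field, qualifier, content) for each part of the record."""
    for element in record:
        if len(element):
            # Structured element such as <relation>: one row per
            # attribute, then one row per sub-element.
            for name, value in element.attrib.items():
                yield element.tag, name, value
            for child in element:
                yield element.tag, child.tag, (child.text or "").strip()
        else:
            # Simple element: the attribute value, if any, is the qualifier.
            qualifier = next(iter(element.attrib.values()), "")
            yield element.tag, qualifier, (element.text or "").strip()


table = list(rows(ET.fromstring(FRAGMENT)))
for row in table:
    print(row)
```

The structured `<relation>` element is the part that goes beyond simple Dublin Core: it packs information about a second object, the serial, into the record for the article.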
19
Events
Version 1
Version 2
New material
Should Version 2 have its own record or should extra
information be added to the Version 1 record?
How are these represented in Dublin Core?
20
Minimalist versus structuralist
Minimalist
15 elements, no qualifiers, suitable for non-professionals
encourage creators to provide metadata
Structuralist
15 elements, qualifiers, RDF, detailed coding rules
will require trained metadata experts
[For an example of how complex Dublin Core can become, see
the source of: http://purl.org/dc/documents/rec-dces199809.htm#]
21
Dublin Core in many languages
See:
Thomas Baker, Languages for Dublin Core, D-Lib Magazine
December 1998,
http://www.dlib.org/dlib/december98/12baker.html
22
Dublin Core: Personal Opinion
Dublin Core is a simple way to describe digital content that:
• is a single, self-contained object ("document-like")
• is static with time
• has few relationships
Some web sites satisfy these criteria
Dublin Core is not suitable for digital content that:
• is heavily structured
• changes dynamically
23
Automatic extraction of catalog data
Example: Dublin Core records for web pages
Strategies
• Manual by trained cataloguers
- high quality records, but expensive and time-consuming
• Entirely automatic
- fast, almost zero cost, but poor quality
• Automatic followed by human editing
- cost and quality depend on the amount of editing
• Manual collection-level record, automatic item-level record
- moderate quality, moderate cost
24
DC-dot
DC-dot is a Dublin Core metadata editor for web pages,
created by Andy Powell at UKOLN
http://www.ukoln.ac.uk/metadata/dcdot/
DC-dot has two parts:
(a) A skeleton Dublin Core record is created automatically
from clues in the web page
(b) A user interface is provided for cataloguers to edit the
record
25
26
Automatic record for CS 430 home page
DC-dot applied to http://www.cs.cornell.edu/courses/cs430/2001sp/
<link rel="schema.DC" href="http://purl.org/dc">
<meta name="DC.Title" content="CS 430: Information Discovery">
<meta name="DC.Subject" content="wya@cs.cornell.edu; Course
Structure; Readings and references; Slides; Basic Information; William
Y. Arms; Information Retrieval Data Structures and Algorithms;
cs430@cs.cornell.edu; Assignments; Syllabus; Text Book; Laptop
computers; Assumed Background; Nomadic Computing Experiment;
Notices; Course Description; Code of practice; Assignments and
Grading; Last changed: February 6, 2001">
continued on next slide
27
Automatic record for CS 430 home page
(continued)
DC-dot applied to http://www.cs.cornell.edu/courses/cs430/2001sp/
<meta name="DC.Publisher" content="Cornell University">
<meta name="DC.Date" scheme="W3CDTF" content="2001-02-07">
<meta name="DC.Type" scheme="DCMIType" content="Text">
<meta name="DC.Format" content="text/html">
<meta name="DC.Format" content="5781 bytes">
<meta name="DC.Identifier"
content="http://www.cs.cornell.edu/courses/cs430/2001sp/">
28
Observations on DC-dot applied to
CS430 home page
DC.Title is a copy of the html <title> field
DC.Publisher is the owner of the IP address where the page was
stored
DC.Subject is a list of headings and noun phrases presented for
editing
DC.Date is taken from the Last-Modified field in the http header
DC.Type and DC.Format are taken from the MIME type of the http
response
DC.Identifier was supplied by the user as input
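
The observations above suggest how a skeleton record might be assembled from the page and its HTTP response. The sketch below is not DC-dot's actual code: the names `TitleParser` and `skeleton_record` are hypothetical, the headers are passed in as a plain dictionary rather than fetched, and the time of day in the sample `Last-Modified` value is made up. The publisher lookup by IP address is left out.

```python
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.title:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def skeleton_record(url, headers, html):
    """Build (name, content) pairs from the URL, HTTP headers, and page."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    record = [("DC.Identifier", url)]          # supplied by the user
    if parser.title.strip():
        record.append(("DC.Title", parser.title.strip()))
    if "Last-Modified" in headers:
        modified = parsedate_to_datetime(headers["Last-Modified"])
        record.append(("DC.Date", modified.date().isoformat()))
    if "Content-Type" in headers:
        mime = headers["Content-Type"].split(";")[0].strip()
        record.append(("DC.Format", mime))
        if mime.startswith("text/"):
            record.append(("DC.Type", "Text"))
    return record


record = skeleton_record(
    "http://www.cs.cornell.edu/courses/cs430/2001sp/",
    {"Last-Modified": "Wed, 07 Feb 2001 15:00:00 GMT",
     "Content-Type": "text/html"},
    "<html><head><title>CS 430: Information Discovery</title></head></html>",
)
```

Every field here comes from something the server or the page already states, which is why these fields are cheap to extract and why fields such as creator are missing.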
29
30
Automatic record for George W. Bush
home page
DC-dot applied to http://www.georgewbush.com/
<link rel="schema.DC" href="http://purl.org/dc">
<meta name="DC.Subject" content="George W. Bush; Bush;
George Bush; President; republican; 2000 election; election;
presidential election; George; B2K; Bush for President; Junior;
Texas; Governor; taxes; technology; education; agriculture;
health care; environment; society; social security; medicare;
income tax; foreign policy; defense; government">
<meta name="DC.Description" content="George W. Bush is
running for President of the United States to keep the country
prosperous.">
continued on next slide
31
Automatic record for George W. Bush
home page (continued)
DC-dot applied to http://www.georgewbush.com/
<meta name="DC.Publisher" content="Concentric Network
Corporation">
<meta name="DC.Date" scheme="W3CDTF" content="2001-01-12">
<meta name="DC.Type" scheme="DCMIType" content="Text">
<meta name="DC.Format" content="text/html">
<meta name="DC.Format" content="12223 bytes">
<meta name="DC.Identifier"
content="http://www.georgewbush.com/">
32
Observations on DC-dot applied to
George W. Bush home page
The home page has several meta tags:
<META NAME="TITLE" CONTENT="George W. Bush for
President"> [The page has no html <title>]
<META NAME="CONTACT" CONTENT="George W Bush
Campaign, P. O. Box 1902, Austin, TX 78767, Phone: (512) 637-2000">
<META NAME="DESCRIPTION" CONTENT="George W. Bush is
running for President of the United States to keep the country
prosperous.">
<META NAME="KEYWORDS" CONTENT="George W. Bush, Bush,
George Bush, President, republican, 2000 election and more
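
Comparing these meta tags with the automatic record suggests that the DESCRIPTION tag was copied into DC.Description and the comma-separated KEYWORDS were turned into a semicolon-separated DC.Subject. The sketch below reproduces that mapping. The mapping table is inferred from the record shown on the slides and is an assumption, not DC-dot's documented behaviour; the class name `MetaParser` is hypothetical and the keyword list is shortened.

```python
from html.parser import HTMLParser

# Inferred from the automatic record on the slides (an assumption).
META_TO_DC = {"description": "DC.Description", "keywords": "DC.Subject"}


class MetaParser(HTMLParser):
    """Copy selected <meta> tags into Dublin Core fields."""

    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute names arrive lower-cased; values keep their case.
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content") or ""
        if name in META_TO_DC and content:
            if name == "keywords":
                keywords = [k.strip() for k in content.split(",")]
                content = "; ".join(k for k in keywords if k)
            self.record[META_TO_DC[name]] = content


page = (
    '<META NAME="TITLE" CONTENT="George W. Bush for President">'
    '<META NAME="DESCRIPTION" CONTENT="George W. Bush is running for '
    'President of the United States to keep the country prosperous.">'
    '<META NAME="KEYWORDS" CONTENT="George W. Bush, Bush, George Bush, '
    'President">'
)
parser = MetaParser()
parser.feed(page)
parser.close()
record = parser.record
```

The TITLE tag is deliberately left unmapped, matching the automatic record above, which has no DC.Title. Adding `"title": "DC.Title"` to the table would be a natural extension.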
33
Collection-level metadata
Several of the most difficult fields to extract automatically are
the same across all pages in a web site.
Therefore create a collection record manually and combine it with
automatic extraction of other fields at item level.
For the CS 430 home page, collection-level metadata:
<meta name="DC.Publisher" content="Cornell University">
<meta name="DC.Creator" content="William Y. Arms">
<meta name="DC.Rights" content="William Y. Arms, 2001">
See: Jenkins and Inman
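
Combining the two levels can be sketched as a merge in which a collection-level value replaces an automatically extracted value for the same field and scheme, and all other item-level values are kept. The merge rule and the function name `combine` are assumptions for illustration; the values are those shown for the CS 430 home page, with DC.Subject left out for brevity.

```python
# (name, scheme, content) rows produced automatically for one page.
ITEM = [
    ("DC.Title", "", "CS 430: Information Discovery"),
    ("DC.Publisher", "", "Cornell University"),
    ("DC.Date", "W3CDTF", "2001-02-07"),
    ("DC.Type", "DCMIType", "Text"),
    ("DC.Format", "", "text/html"),
    ("DC.Format", "", "5781 bytes"),
    ("DC.Identifier", "", "http://www.cs.cornell.edu/courses/cs430/2001sp/"),
]

# Rows created manually, once, for the whole web site.
COLLECTION = [
    ("DC.Publisher", "", "Cornell University"),
    ("DC.Creator", "", "William Y. Arms"),
    ("DC.Rights", "", "William Y. Arms, 2001"),
]


def combine(item_rows, collection_rows):
    """Merge the records; the last value marks collection-level rows."""
    overridden = {(name, scheme) for name, scheme, _ in collection_rows}
    combined = [
        (name, scheme, content, False)
        for name, scheme, content in item_rows
        if (name, scheme) not in overridden
    ]
    combined += [
        (name, scheme, content, True)
        for name, scheme, content in collection_rows
    ]
    return combined


combined = combine(ITEM, COLLECTION)
```

Letting the manual value win matters most for fields where automatic extraction is unreliable, such as publisher when it is guessed from the owner of an IP address.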
34
Collection-level metadata
Compare:
(a) Metadata extracted automatically by DC-dot
(b) Collection-level record
(c) Combined item-level record (DC-dot plus collection-level)
(d) Manual record
35
36
Metadata extracted automatically by
DC-dot
D.C. Field    Qualifier     Content
title                       Digital Libraries and the Problem of Purpose
subject                     not included in this slide
publisher                   Corporation for National Research Initiatives
date          W3CDTF        2000-05-11
type          DCMIType      Text
format                      text/html
format                      27718 bytes
identifier                  http://www.dlib.org/dlib/january00/01levy.html
37
Collection-level record
D.C. Field    Qualifier     Content
publisher                   Corporation for National Research Initiatives
type                        article
type          resource      work
relation      rel-type      InSerial
relation      serial-name   D-Lib Magazine
relation      issn          1082-9873
language                    English
rights                      Permission is hereby given for the material
                            in D-Lib Magazine to be used for ...
38
Combined item-level record
(DC-dot plus collection-level)
D.C. Field    Qualifier     Content
title                       Digital Libraries and the Problem of Purpose
publisher                   (*) Corporation for National Research Initiatives
date          W3CDTF        2000-05-11
type                        (*) article
type          resource      (*) work
type          DCMIType      Text
format                      text/html
format                      27718 bytes
(*) indicates collection-level metadata
continued on next slide
39
Combined item-level record
(DC-dot plus collection-level)
D.C. Field    Qualifier     Content
relation      rel-type      (*) InSerial
relation      serial-name   (*) D-Lib Magazine
relation      issn          (*) 1082-9873
language                    (*) English
rights                      (*) Permission is hereby given for the material
                            in D-Lib Magazine to be used for ...
identifier                  http://www.dlib.org/dlib/january00/01levy.html
(*) indicates collection-level metadata
40
Manually created record
D.C. Field    Qualifier     Content
title                       Digital Libraries and the Problem of Purpose
creator                     (+) David M. Levy
publisher                   Corporation for National Research Initiatives
date          publication   January 2000
type                        article
type          resource      work
(+) entry that is not in the automatically generated records
continued on next slide
41
Manually created record
D.C. Field    Qualifier     Content
relation      rel-type      InSerial
relation      serial-name   D-Lib Magazine
relation      issn          1082-9873
relation      volume        (+) 6
relation      issue         (+) 1
identifier    DOI           (+) 10.1045/january2000-levy
identifier    URL           http://www.dlib.org/dlib/january00/01levy.html
language                    English
rights                      (+) Copyright (c) David M. Levy
(+) entry that is not in the automatically generated records
42