CS/INFO 430 Information Retrieval Metadata 2 Lecture 15

CS/INFO 430
Information Retrieval
Lecture 15
Metadata 2
Course Administration
Discussion Class on October 19
This class will be held in Phillips Hall 213
Course Administration
Assignment 2 and Midterm
Grades will be mailed soon.
Midterm
This was well done.
Average grade was 26.
Range was from 21 to 30.
Course Administration
Midterm: Common Mistake 1
Question: Define precision-recall graph
Answer: The basic precision-recall graph applies to the results
of a single query using ranked searching. For each value of
r, from one to the number of hits returned, it plots precision
against recall for the first r hits.
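As a minimal sketch of this definition, the points of the graph can be computed from a ranked list of relevance judgments. The function name, the sample judgments, and the assumption that the total number of relevant documents is known (as in a test collection) are illustrative and not part of the course material.

```python
def precision_recall_points(ranked_relevance, total_relevant):
    """Return one (recall, precision) pair for the first r hits, r = 1..n.

    ranked_relevance: list of booleans in rank order (True = relevant hit).
    total_relevant:   number of relevant documents in the whole collection
                      (assumed to be known in advance).
    """
    points = []
    relevant_so_far = 0
    for r, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
        precision = relevant_so_far / r            # relevant hits / hits examined
        recall = relevant_so_far / total_relevant  # relevant hits / all relevant
        points.append((recall, precision))
    return points


# Hypothetical query: 5 hits returned, 4 relevant documents exist in total.
points = precision_recall_points([True, False, True, True, False], total_relevant=4)
```

Plotting the pairs in `points` in rank order gives the graph for that single query.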
Course Administration
Midterm: Common Mistake 2
Question: With latent semantic indexing: The dotted lines are
described as, "The dotted cone represents the region whose
points are within a cosine of 0.9 from the query q." All the
documents labeled c1-c5 are within this cone, but none of the
documents labeled m1-m4. What does this imply?
Answer: This implies that each of c1-c5 is similar to q, but
that m1-m4 are not, using the similarity measure discussed
in the paper. It does not imply that c1-c5 are relevant,
though clearly that is the hope.
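The cone test in this answer can be sketched as a cosine threshold. This assumes "within a cosine of 0.9" means a cosine similarity of at least 0.9; the two-dimensional vectors are made up for illustration and are not the coordinates from the paper.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def within_cone(query, doc, threshold=0.9):
    """True if doc lies inside the cone around query (similar, not necessarily relevant)."""
    return cosine(query, doc) >= threshold


q = [1.0, 0.2]      # hypothetical query vector in the reduced space
c_doc = [0.9, 0.3]  # hypothetical document pointing in nearly the same direction
m_doc = [0.1, 1.0]  # hypothetical document pointing in a different direction
```

Here `within_cone(q, c_doc)` holds and `within_cone(q, m_doc)` does not, which says only that the first document is similar to the query under this measure.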
Examples of Non-textual Materials
Content                  Attribute
maps                     lat. and long., content
photograph               subject, date and place
bird songs and images    field mark, bird song
software                 task, algorithm
data set                 survey characteristics
video                    subject, date, etc.
Possible Approaches to Information
Discovery for Non-text Materials
Human indexing
Manually created metadata records
Automated information retrieval
Automatically created metadata records (e.g., image recognition)
Context: associated text, links, etc. (e.g., Google image search)
Multimodal: combine information from several sources
User expertise
Browsing: user interface design
Catalog Records for Non-Textual
Materials
• General metadata standards, such as Dublin Core and
MARC, can be used to create a textual catalog record of non-textual items.
• Subject-based metadata standards apply to specific
categories of materials, e.g., FGDC for geospatial materials.
• Text-based searching methods can be used to search these
catalog records.
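A minimal sketch of text-based searching over catalog records follows. The record fields and values are invented for illustration, and the matching rule (every query term must appear somewhere in the record) is one simple choice among many.

```python
import re


def tokenize(text):
    """Lower-case a string and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(records, query):
    """Return the records whose combined field text contains every query term."""
    terms = tokenize(query)
    hits = []
    for record in records:
        words = set()
        for value in record.values():
            words |= tokenize(value)
        if terms <= words:  # all query terms are present in the record
            hits.append(record)
    return hits


# Hypothetical catalog records describing non-textual items.
records = [
    {"id": "photo-1", "title": "Threshing crew at work",
     "subject": "Agriculture; North Dakota", "type": "photograph"},
    {"id": "map-1", "title": "Map of Santa Barbara County",
     "subject": "Topography; California", "type": "map"},
]
```

For example, `search(records, "north dakota")` returns only the photograph record, even though the photograph itself contains no text.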
Example 1: Photographs
Photographs in the Library of Congress's American
Memory collections
In American Memory, each photograph is described by a
MARC record.
The photographs are grouped into collections, e.g., The
Northern Great Plains, 1880-1920: Photographs from the Fred
Hultstrand and F.A. Pazandak Photograph Collections
Information discovery is by:
• searching the catalog records
• browsing the collections
Photographs: Cataloguing Difficulties
Automatic
• Image recognition methods are very primitive
Manual
• Photographic collections can be very large
• Many photographs may show the same subject
• Photographs have little or no internal metadata (no
title page)
• The subject of a photograph may not be known
(Who are the people in a picture? Where is the
location?)
Photographs: Difficulties for Users
Searching
• Often difficult to narrow the selection down by searching; browsing is required
• Criteria may be different from those in catalog (e.g., graphical
characteristics)
Browsing
• Offline. Handling many photographs is tedious. Photographs
can be damaged by repeated handling.
• Online. Viewing many images can be tedious. Screen quality
may be inadequate.
Example 2: Geospatial Information
Example: Alexandria Digital Library at the University of
California, Santa Barbara
• Funded by the NSF Digital Libraries Initiative since 1994.
• Collections include any data referenced by a geographical
footprint: terrestrial maps, aerial and satellite photographs,
astronomical maps, databases, related textual information
• Program of research with practical implementation at the
university's map library
Alexandria User Interface
Alexandria: Computer Systems and
User Interfaces
Computer systems
• Digitized maps and geospatial information -- large files
• Wavelets provide multi-level decomposition of image
-> first level is a small coarse image
-> extra levels provide greater detail
User interfaces
• Small size of computer displays
• Slow performance of Internet in delivering large files
-> retain state throughout a session
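The multi-level decomposition idea can be illustrated with a small sketch. It uses the Haar wavelet on a single row of pixel values; both choices are assumptions made for brevity, and the slides do not say which wavelet Alexandria uses.

```python
def haar_step(signal):
    """Split an even-length list into pairwise averages and pairwise differences."""
    averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return averages, details


def decompose(signal, levels):
    """Return the coarse signal plus the detail list saved at each level."""
    coarse = list(signal)
    details_per_level = []
    for _ in range(levels):
        coarse, details = haar_step(coarse)
        details_per_level.append(details)
    return coarse, details_per_level


def refine(coarse, details):
    """Add one level of detail back, doubling the resolution."""
    finer = []
    for avg, diff in zip(coarse, details):
        finer.extend([avg + diff, avg - diff])
    return finer


row = [8, 6, 2, 4, 5, 5, 1, 3]  # hypothetical row of pixel values
coarse, details = decompose(row, levels=2)
medium = refine(coarse, details[1])  # first refinement
full = refine(medium, details[0])    # second refinement restores the row
```

A client can display `coarse` immediately and request the detail levels only when the user zooms in, which fits the small displays and slow networks listed above.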
Alexandria: Information Discovery
Metadata for information discovery
Coverage: geographical area covered, such as the city of
Santa Barbara or the Pacific Ocean.
Scope: varieties of information, such as topographical
features, political boundaries, or population density.
Latitude and longitude provide basic metadata for maps
and for geographical features.
Gazetteer
Gazetteer: database and a set of procedures that translate
representations of geospatial references:
place names, geographic features, coordinates,
postal codes, census tracts
Search engine tailored to peculiarities of searching for place
names.
Research is making steady progress at feature extraction,
using automatic programs to identify objects in aerial
photographs or printed maps -- topic for long-term research.
Gazetteers
The Alexandria Digital Library (ADL): geolibrary at University
of California at Santa Barbara where a primary attribute of objects
is location on Earth (e.g., map, satellite photograph).
Geographic footprint: latitude and longitude values that
represent a point, a bounding box, a linear feature, or a complete
polygonal boundary.
Gazetteer: list of geographic names, with geographic locations
and other descriptive information.
Geographic name: proper name for a geographic place or feature
(e.g., Santa Barbara County, Mount Washington, St. Francis
Hospital, and Southern California)
Use of a Gazetteer
• Answers the "Where is" question; for example, "Where is
Santa Barbara?"
• Translates between geographic names and locations. A
user can find objects by matching the footprint of a
geographic name to the footprints of the collection
objects.
• Locates particular types of geographic features in a
designated area. For example, a user can draw a box
around an area on a map and find the schools, hospitals,
lakes, or volcanoes in the area.
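The "Where is" lookup and the box query described above can be sketched as follows. The class names, the sample entries, and their approximate decimal coordinates are hypothetical; footprints are simplified to points, and boxes that cross the 180-degree meridian are not handled.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    feature_type: str
    lat: float  # decimal degrees, north positive
    lon: float  # decimal degrees, east positive


@dataclass
class Box:
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        return self.south <= lat <= self.north and self.west <= lon <= self.east


def where_is(gazetteer, name):
    """Translate a geographic name into the locations recorded for it."""
    return [(e.lat, e.lon) for e in gazetteer if e.name.lower() == name.lower()]


def features_in_box(gazetteer, box, feature_type):
    """Find the names of features of one type inside a user-drawn box."""
    return [e.name for e in gazetteer
            if e.feature_type == feature_type and box.contains(e.lat, e.lon)]


# Hypothetical gazetteer entries.
gazetteer = [
    Entry("Tulsa", "populated place", 36.15, -95.99),
    Entry("Example Heliport", "airport", 36.08, -95.87),
    Entry("Far Away Airport", "airport", 40.0, -100.0),
]
box = Box(south=36.0, west=-96.1, north=36.3, east=-95.8)
```

With this data, `where_is(gazetteer, "tulsa")` answers the "Where is" question, and `features_in_box(gazetteer, box, "airport")` returns only the airport inside the drawn box.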
Alexandria Gazetteer: Example from
a search on "Tulsa"
Feature name                              State  County  Type     Latitude  Longitude
Tulsa                                     OK     Tulsa   pop pl   360914N   0955933W
Tulsa Country Club                        OK     Osage   locale   360958N   0960012W
Tulsa County                              OK     Tulsa   civil    360600N   0955400W
Tulsa Helicopters Incorporated Heliport   OK     Tulsa   airport  360500N   0955205W
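The coordinate strings in the Tulsa example appear to be degrees, minutes, and seconds followed by a hemisphere letter (for example, 360914N read as 36 degrees, 9 minutes, 14 seconds north). That reading is an assumption, since the slide does not define the format. Under it, a sketch of the conversion to decimal degrees is:

```python
def parse_dms(value):
    """Convert a string such as '360914N' or '0955933W' to decimal degrees.

    Assumes the last character is the hemisphere (N, S, E, or W), the two
    digits before it are seconds, the two before those are minutes, and the
    remaining leading digits are degrees.
    """
    hemisphere = value[-1]
    digits = value[:-1]
    seconds = int(digits[-2:])
    minutes = int(digits[-4:-2])
    degrees = int(digits[:-4])
    decimal = degrees + minutes / 60 + seconds / 3600
    if hemisphere in ("S", "W"):
        decimal = -decimal  # south and west are negative by convention
    return decimal


tulsa_lat = parse_dms("360914N")
tulsa_lon = parse_dms("0955933W")
```

Decimal degrees are convenient for the comparisons needed when matching footprints.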
Challenges for the Alexandria Gazetteer
Content standard: A standard conceptual schema for
gazetteer information.
Feature types: A type scheme to categorize individual
features that is rich in term variants and extensible.
Temporal aspects: Geographic names and attributes change
through time.
"Fuzzy" footprints: Extent of a geographic feature is often
approximate or ill-defined (e.g., Southern California).
Challenges for the Alexandria
Gazetteer (continued)
Quality aspects:
(a) Indicate the accuracy of latitude and longitude data.
(b) Ensure that the reported coordinates agree with the other
elements of the description.
Spatial extents:
(a) Points do not represent the extent of the geographic
locations and are therefore only minimally useful.
(b) Bounding boxes often include too much territory (e.g., the
bounding box for California also includes Nevada).
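The bounding-box problem can be shown with a small sketch that compares a box test with a polygon test. The triangular region is a made-up stand-in for a real boundary, and the ray-casting test is one standard way to check polygon membership, not necessarily the method used in Alexandria.

```python
def bounding_box(polygon):
    """Smallest axis-aligned box (min_x, min_y, max_x, max_y) around a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(point, box):
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(point, polygon):
    """Ray casting: count boundary crossings of a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # this edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


region = [(0, 0), (4, 0), (0, 4)]  # hypothetical triangular region
box = bounding_box(region)
neighbour = (3, 3)                 # inside the box, outside the region
interior = (1, 1)                  # inside both
```

The point `neighbour` passes the box test but fails the polygon test, which is the same effect as Nevada falling inside California's bounding box.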
Alexandria Thesaurus: Example
canals
A feature type category for places such as the Erie Canal.
Used for:
The category canals is used instead of any of the following.
canal bends
ditches
canalized streams
drainage canals
ditch mouths
drainage ditches
Broader Terms:
Canals is a sub-type of hydrographic structures.
... more ...
Alexandria Thesaurus: Example
(continued)
canals (continued)
Related Terms:
The following is a list of other categories related to canals (non-hierarchical relationships).
channels
locks
transportation features
tunnels
Scope Note:
Manmade waterway used by watercraft or for drainage, irrigation,
mining, or water power. » Definition of canals.
Alexandria Gazetteer
Alexandria Digital Library
Linda L. Hill, James Frew, and Qi Zheng, Geographic Names:
The Implementation of a Gazetteer in a Georeferenced Digital
Library. D-Lib Magazine, 5: 1, January 1999.
http://www.dlib.org/dlib/january99/hill/01hill.html
Cataloguing Online Materials:
Dublin Core
Dublin Core is an attempt to apply cataloguing methods to online
materials, notably the Web.
History
It was anticipated that the methods of full text indexing that were
used by the early Web search engines, such as Lycos, would not
scale up.
"... [automated] indexes are most useful in small collections within
a given domain. As the scope of their coverage expands, indexes
succumb to problems of large retrieval sets and problems of cross
disciplinary semantic drift. Richer records, created by content
experts, are necessary to improve search and retrieval."
Weibel 1995
Dublin Core
Simple set of metadata elements for online information
• 15 basic elements
• intended for all types and genres of material
• all elements optional
• all elements repeatable
Developed by an international group chaired by Stuart Weibel
since 1995.
(Diane Hillmann of Cornell has been very active in this group.)
Dublin Core record for the Dublin
Core Web Site
contributor: Dublin Core Metadata Initiative
description: The Dublin Core Metadata Initiative is an open
forum engaged in the development of interoperable online
metadata standards that support a broad range of purposes and
business models...
title: Dublin Core Metadata Initiative (DCMI) Home Page
date: 2004-10-05
format: text/html (MIME type)
language: en (English)
Dublin Core elements
Element Name: Title
Definition: A name given to the resource.
Comment: Typically, Title will be a name by which the
resource is formally known.
Element Name: Creator
Definition: An entity primarily responsible for making the
content of the resource.
Comment: Examples of Creator include a person, an
organization, or a service. Typically, the name of a
Creator should be used to indicate the entity.
Dublin Core elements
Element Name: Subject
Definition: A topic of the content of the resource.
Comment: Typically, Subject will be expressed as keywords, key
phrases or classification codes that describe a topic of the
resource. Recommended best practice is to select a value from
a controlled vocabulary or formal classification scheme.
Element Name: Description
Definition: An account of the content of the resource.
Comment: Examples of Description include, but are not limited
to: an abstract, table of contents, reference to a graphical
representation of content or a free-text account of the content.
Dublin Core elements
Element Name: Publisher
Definition: An entity responsible for making the resource available.
Comment: Examples of Publisher include a person, an organization,
or a service. Typically, the name of a Publisher should be used to
indicate the entity.
Element Name: Contributor
Definition: An entity responsible for making contributions to the
content of the resource.
Comment: Examples of Contributor include a person, an
organization, or a service. Typically, the name of a Contributor
should be used to indicate the entity.
Dublin Core elements
Element Name: Date
Definition: A date of an event in the lifecycle of the resource.
Comment: Typically, Date will be associated with the creation or
availability of the resource. Recommended best practice for
encoding the date value is defined in a profile of ISO 8601
[W3CDTF] and includes (among others) dates of the form YYYY-MM-DD.
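A cataloguing tool might check Date values before accepting them. The sketch below covers only the date-only granularities (YYYY, YYYY-MM, YYYY-MM-DD) and ignores the time-of-day forms that the profile also allows; the function name is hypothetical.

```python
import re
from datetime import date

# Year, optionally followed by month, optionally followed by day.
_DATE_PATTERN = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_w3cdtf_date(value):
    """True if value is a valid YYYY, YYYY-MM, or YYYY-MM-DD date."""
    match = _DATE_PATTERN.fullmatch(value)
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # Missing month or day defaults to 1 so the calendar check still runs.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True
```

For example, `is_w3cdtf_date("2004-10-05")` is accepted, while a value with a misplaced hyphen or an impossible month is rejected.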
Dublin Core elements
Element Name: Type
Definition: The nature or genre of the content of the resource.
Comment: Type includes terms describing general categories,
functions, genres, or aggregation levels for content.
Recommended best practice is to select a value from a controlled
vocabulary (for example, the DCMI Type Vocabulary [DCT1]). To
describe the physical or digital manifestation of the resource, use
the FORMAT element.
Dublin Core elements
Element Name: Format
Definition: The physical or digital manifestation of the resource.
Comment: Typically, Format may include the media-type or
dimensions of the resource. Format may be used to identify the
software, hardware, or other equipment needed to display or
operate the resource. Examples of dimensions include size and
duration. Recommended best practice is to select a value from a
controlled vocabulary (for example, the list of Internet Media
Types [MIME] defining computer media formats).
Dublin Core elements
Element Name: Identifier
Definition: An unambiguous reference to the resource within a
given context.
Comment: Recommended best practice is to identify the resource
by means of a string or number conforming to a formal
identification system. Formal identification systems include but
are not limited to the Uniform Resource Identifier (URI)
(including the Uniform Resource Locator (URL)), the Digital
Object Identifier (DOI) and the International Standard Book
Number (ISBN).
Dublin Core elements
Element Name: Source
Definition: A reference to a resource from which the present
resource is derived.
Comment: The present resource may be derived from the Source
resource in whole or in part. Recommended best practice is to
identify the referenced resource by means of a string or number
conforming to a formal identification system.
Dublin Core elements
Element Name: Language
Definition: A language of the intellectual content of the resource.
Comment: Recommended best practice is to use RFC 3066
[RFC3066] which, in conjunction with ISO 639 [ISO639], defines
two- and three-letter primary language tags with optional subtags.
Examples include "en" or "eng" for English, "akk" for Akkadian,
and "en-GB" for English used in the United Kingdom.
Element Name: Relation
Definition: A reference to a related resource.
Comment: Recommended best practice is to identify the referenced
resource by means of a string or number conforming to a formal
identification system.
Dublin Core elements
Element Name: Coverage
Definition: The extent or scope of the content of the resource.
Comment: Typically, Coverage will include spatial location (a place
name or geographic coordinates), temporal period (a period label,
date, or date range) or jurisdiction (such as a named administrative
entity). Recommended best practice is to select a value from a
controlled vocabulary (for example, the Thesaurus of Geographic
Names [TGN]) and to use, where appropriate, named places or
time periods in preference to numeric identifiers such as sets of
coordinates or date ranges.
Dublin Core elements
Element Name: Rights
Definition: Information about rights held in and over the resource.
Comment: Typically, Rights will contain a rights management
statement for the resource, or reference a service providing such
information. Rights information often encompasses Intellectual
Property Rights (IPR), Copyright, and various Property Rights. If
the Rights element is absent, no assumptions may be made about
any rights held in or over the resource.
Qualifiers
A qualifier refines the element name to add specificity
Example: element qualifier
Example: Date
DC.Date.Created      1997-11-01
DC.Date.Issued       1997-11-15
DC.Date.Available    1997-12-01/1998-06-01
DC.Date.Valid        1998-01-01/1998-06-01
Qualifiers
Example: value qualifiers
Example: Subject
DC.Subject.DDC     509.123 (Dewey Decimal Classification)
DC.Subject.LCSH    Digital libraries-United States (Library of Congress Subject Heading)
Dumbing Down Principle
"The theory behind this principle is that consumers
of metadata should be able to strip off qualifiers and
return to the base form of a property. ... this principle
makes it possible for client applications to ignore
qualifiers in the context of more coarse-grained,
cross-domain searches."
Lagoze 2001
Dumbing Down Principle
Qualified version
DC.Date.Created    1997-11-01
DC.Subject.LCSH    Digital libraries-United States
Dumbed-down version
DC.Date            1997-11-01 (a valid date)
DC.Subject         Digital libraries-United States (a valid subject description)
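Dumbing down can be sketched as stripping everything after the element name. The sketch assumes the dotted naming used in the examples above (DC.Element.Qualifier) and stores a record as a list of pairs, because Dublin Core elements are repeatable; the function name is hypothetical.

```python
def dumb_down(record):
    """Reduce each qualified name to its base element, keeping the value.

    'DC.Date.Created' becomes 'DC.Date'; unqualified names are unchanged.
    """
    simple = []
    for name, value in record:
        parts = name.split(".")
        simple.append((".".join(parts[:2]), value))
    return simple


qualified = [
    ("DC.Date.Created", "1997-11-01"),
    ("DC.Subject.LCSH", "Digital libraries-United States"),
]
simple = dumb_down(qualified)
```

A client that does not understand the qualifiers can still search `simple` by date and subject, which is the coarse-grained use the principle is meant to support.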
Dublin Core with qualifiers
See the next two slides for an example
of a Dublin Core record for a web site
prepared by a professional cataloguer at
the Library of Congress.
Note that the record does not follow the
principle of dumbing-down.