CS/INFO 430 Information Retrieval Metadata 3 Lecture 21

Course Administration
Automatic extraction of catalog data
Strategies
• Manual by trained cataloguers and indexers
- high quality records, but expensive and time consuming
• Entirely automatic
- fast, almost zero cost, but poor quality
• Automatic, followed by human editing
- cost and quality depend on the amount of editing
• Manual collection level record, automatic item level record
- moderate quality, moderate cost
DC-dot
DC-dot is a Dublin Core metadata editor for Web pages, created
by Andy Powell at UKOLN
http://www.ukoln.ac.uk/metadata/dcdot/
DC-dot has two parts:
(a) A skeleton Dublin Core record is created automatically
from clues in the web page
(b) A user interface is provided for cataloguers to edit the
record
Automatic record for CS 430 home page
DC-dot applied to http://www.cs.cornell.edu/courses/cs430/2001sp/
DC.Title       CS 430: Information Discovery
DC.Subject     wya@cs.cornell.edu; Course Structure; Readings and
               references; Slides; Basic Information; William Y. Arms;
               Information Retrieval Data Structures and Algorithms;
               cs430@cs.cornell.edu; Assignments; Syllabus; Text
               Book; Laptop computers; Assumed Background;
               Nomadic Computing Experiment; Notices; Course
               Description; Code of practice; Assignments and Grading;
               Last changed: February 6, 2001
continued on next slide
Automatic record for CS 430 home page
(continued)
DC.Publisher   Cornell University
DC.Date        qualifier W3CDTF, content 2001-02-07
DC.Type        qualifier DCMIType, content Text
DC.Format      text/html
DC.Format      5781 bytes
DC.Identifier  http://www.cs.cornell.edu/courses/cs430/2001sp/
Observations on DC-dot applied to
CS430 home page
DC.Title is a copy of the html <title> field
DC.Publisher is the owner of the IP address where the page was
stored
DC.Subject is a list of headings and noun phrases presented for
editing
DC.Date is taken from the Last-Modified field in the http header
DC.Type and DC.Format are taken from the MIME type of the http
response
DC.Identifier was supplied by the user as input
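The observations above suggest how a skeleton record might be assembled from clues in a page. The following is a minimal sketch in Python, not DC-dot's actual code: the names ClueParser and skeleton_record are hypothetical, the publisher lookup (owner of the IP address) is omitted, and the noun-phrase extraction is reduced to collecting headings and any keywords meta tag.

```python
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser


class ClueParser(HTMLParser):
    """Collects the <title> text, named <meta> tags, and heading text."""

    HEADINGS = ("h1", "h2", "h3")

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.headings = []
        self._open = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "title" or tag in self.HEADINGS:
            self._open = tag
            if tag != "title":
                self.headings.append("")

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open == "title":
            self.title += data
        elif self._open in self.HEADINGS:
            self.headings[-1] += data


def skeleton_record(html, headers, url):
    """Build a draft Dublin Core record for a cataloguer to edit."""
    parser = ClueParser()
    parser.feed(html)
    keywords = [k.strip() for k in parser.meta.get("keywords", "").split(",")]
    headings = [h.strip() for h in parser.headings]
    record = {
        "DC.Title": parser.title.strip(),                  # html <title>
        "DC.Subject": [s for s in keywords + headings if s],
        "DC.Description": parser.meta.get("description", ""),
        "DC.Format": headers.get("Content-Type", "").split(";")[0].strip(),
        "DC.Identifier": url,                              # supplied by the user
    }
    if "Last-Modified" in headers:
        # Convert the http date to the W3CDTF form YYYY-MM-DD.
        modified = parsedate_to_datetime(headers["Last-Modified"])
        record["DC.Date"] = modified.date().isoformat()
    return record
```

The function takes the page text and response headers as arguments rather than fetching them, so the sketch can be tried on saved pages. A page with no html <title> simply yields an empty DC.Title.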
Observations on DC-dot applied to
George W. Bush home page
The home page has several meta tags:
<META NAME="TITLE" CONTENT="George W. Bush for
President"> [The page has no html <title>]
<META NAME="CONTACT" CONTENT="George W Bush
Campaign, P. O. Box 1902, Austin, TX 78767, Phone: (512) 6372000">
<META NAME="DESCRIPTION" CONTENT="George W. Bush is
running for President of the United States to keep the country
prosperous.">
<META NAME="KEYWORDS" CONTENT="George W. Bush, Bush,
George Bush, President, republican, 2000 election and more
Automatic record for George W. Bush
home page
DC-dot applied to http://www.georgewbush.com/
DC.Subject      George W. Bush; Bush; George Bush;
                President; republican; 2000 election; election;
                presidential election; George; B2K; Bush for
                President; Junior; Texas; Governor; taxes;
                technology; education; agriculture; health care;
                environment; society; social security;
                medicare; income tax; foreign policy; defense;
                government
DC.Description  George W. Bush is running for President of the
                United States to keep the country prosperous.
continued on next slide
Automatic record for George W. Bush
home page (continued)
DC.Publisher   Concentric Network Corporation
DC.Date        qualifier W3CDTF, content 2001-01-12
DC.Type        qualifier DCMIType, content Text
DC.Format      text/html
DC.Format      12223 bytes
DC.Identifier  http://www.georgewbush.com/
Metadata extracted automatically by
DC-dot
DC Field     Qualifier   Content
title                    Digital Libraries and the Problem of Purpose
subject                  not included in this slide
publisher                Corporation for National Research Initiatives
date         W3CDTF      2000-05-11
type         DCMIType    Text
format                   text/html
format                   27718 bytes
identifier               http://www.dlib.org/dlib/january00/01levy.html
Use of Machine Learning for
Automatic Metadata Extraction
In the previous example, several fields were not recognized that are
easily identified by a human, e.g., author, and information about the
serial in which the article appeared.
iVia is the most advanced of several systems that apply machine
learning algorithms to extract metadata automatically. See:
http://ivia.ucr.edu/projects/.
The state-of-the-art is:
• automatic extraction followed by human editing
• expert guided Web crawling.
Automatic Extraction and Search
Engine Spam
D-Lib Magazine
Web pages created for users, with good quality control and no
attempt to impress search engines. (The editor originally
trained as a librarian.)
The site lends itself to automatic indexing.
Political Web Sites (Bush and Gore)
Web pages created for marketing, with little consistency,
designed to impress search engines. (The editors are
specialists in public relations.)
The sites are difficult to index automatically.
Metatest
Metatest is a research project led by Liz Liddy at Syracuse with
participation from the Human Computer Interaction group at
Cornell.
The aim is to compare the effectiveness, as perceived by the
user, of indexing based on:
(a) Manually created Dublin Core
(b) Automatically created Dublin Core (higher quality than
DC-dot)
(c) Full text indexing
Preliminary results suggest remarkably little difference in
effectiveness.
Non-textual Materials: Examples
Content                  Attribute
maps                     lat. and long., content
photograph               subject, date and place
bird songs and images    field mark, bird song
software                 task, algorithm
data set                 survey characteristics
video                    subject, date, etc.
Possible Approaches to Information
Discovery for Non-text Materials
Human indexing
- Manually created metadata records
Automated information retrieval
- Automatically created metadata records (e.g., image recognition)
- Context: associated text, links, etc. (e.g., Google image search)
- Multimodal: combine information from several sources
User expertise
- Browsing: user interface design
Catalog Records for Non-Textual
Materials
• General metadata standards, such as Dublin Core and
MARC, can be used to create a textual catalog record of
non-textual items.
• Subject based metadata standards apply to specific
categories of materials, e.g., FGDC for geospatial materials.
Text-based searching methods can be used to search these
catalog records.
Example 1: Photographs
Photographs in the Library of Congress's American
Memory collections
In American Memory, each photograph is described by a
MARC record.
The photographs are grouped into collections, e.g., The
Northern Great Plains, 1880-1920: Photographs from the Fred
Hultstrand and F.A. Pazandak Photograph Collections
Information discovery is by:
• searching the catalog records
• browsing the collections
Photographs: Cataloguing Difficulties
Automatic
• Image recognition methods are very primitive
Manual
• Photographic collections can be very large
• Many photographs may show the same subject
• Photographs have little or no internal metadata (no
title page)
• The subject of a photograph may not be known
(Who are the people in a picture? Where is the
location?)
Photographs: Difficulties for Users
Searching
• Often difficult to narrow the selection down by searching;
browsing is required
• Criteria may be different from those in catalog (e.g., graphical
characteristics)
Browsing
• Offline. Handling many photographs is tedious. Photographs
can be damaged by repeated handling.
• Online. Viewing many images can be tedious. Screen quality
may be inadequate.
Example 2: Geospatial Information
Example: Alexandria Digital Library at the University of
California, Santa Barbara
• Funded by the NSF Digital Libraries Initiative since 1994.
• Collections include any data referenced by a geographical
footprint.
terrestrial maps, aerial and satellite photographs,
astronomical maps, databases, related textual information
• Program of research with practical implementation at the
university's map library
Alexandria User Interface
Alexandria: Computer Systems and
User Interfaces
Computer systems
• Digitized maps and geospatial information -- large files
• Wavelets provide multi-level decomposition of image
-> first level is a small coarse image
-> extra levels provide greater detail
User interfaces
• Small size of computer displays
• Slow performance of Internet in delivering large files
-> retain state throughout a session
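The multi-level idea can be illustrated on a single row of pixel values. This is a minimal sketch that assumes a Haar wavelet (the slide does not say which wavelet Alexandria uses) and hypothetical function names. The coarse row can be delivered first, and each detail level sent afterwards doubles the resolution.

```python
def haar_decompose(signal, levels):
    """Split a row of values into a coarse row plus detail levels.

    Returns (coarse, details); details[0] refines the coarse row,
    details[1] refines that result, and so on.
    """
    coarse = list(signal)
    details = []
    for _ in range(levels):
        if len(coarse) < 2 or len(coarse) % 2:
            break  # cannot halve the row any further
        pairs = list(zip(coarse[0::2], coarse[1::2]))
        # Each pair is replaced by its average; the half-difference
        # is kept as the detail needed to undo the averaging.
        details.insert(0, [(a - b) / 2 for a, b in pairs])
        coarse = [(a + b) / 2 for a, b in pairs]
    return coarse, details


def haar_refine(coarse, detail):
    """Apply one detail level, doubling the length of the row."""
    out = []
    for avg, diff in zip(coarse, detail):
        out.extend([avg + diff, avg - diff])
    return out


row = [8, 6, 2, 4, 5, 5, 1, 3]
coarse, details = haar_decompose(row, levels=2)
picture = coarse              # small coarse version, shown first
for detail in details:        # extra levels arrive later
    picture = haar_refine(picture, detail)
```

After both detail levels have been applied, picture holds the original eight values again. A real image would be decomposed along rows and columns.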
Alexandria: Information Discovery
Metadata for information discovery
Coverage: geographical area covered, such as the city of
Santa Barbara or the Pacific Ocean.
Scope: varieties of information, such as topographical
features, political boundaries, or population density.
Latitude and longitude provide basic metadata for maps
and for geographical features.
Gazetteer
Gazetteer: database and a set of procedures that translate
representations of geospatial references:
place names, geographic features, coordinates
postal codes, census tracts
Search engine tailored to peculiarities of searching for place
names.
Research is making steady progress at feature extraction,
using automatic programs to identify objects in aerial
photographs or printed maps -- topic for long-term research.
Gazetteers
The Alexandria Digital Library (ADL): geolibrary at University
of California at Santa Barbara where a primary attribute of objects
is location on Earth (e.g., map, satellite photograph).
Geographic footprint: latitude and longitude values that
represent a point, a bounding box, a linear feature, or a complete
polygonal boundary.
Gazetteer: list of geographic names, with geographic locations
and other descriptive information.
Geographic name: proper name for a geographic place or feature
(e.g., Santa Barbara County, Mount Washington, St. Francis
Hospital, and Southern California)
Use of a Gazetteer
• Answers the "Where is" question; for example, "Where is
Santa Barbara?"
• Translates between geographic names and locations. A
user can find objects by matching the footprint of a
geographic name to the footprints of the collection
objects.
• Locates particular types of geographic features in a
designated area. For example, a user can draw a box
around an area on a map and find the schools, hospitals,
lakes, or volcanoes in the area.
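The three uses above can be sketched with bounding-box footprints. This is an illustration only, not the Alexandria implementation: the class and function names are hypothetical, the gazetteer entries and coordinates are made-up approximations, and boxes that cross the 180 degree meridian are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding-box footprint in decimal degrees."""
    south: float
    west: float
    north: float
    east: float

    def overlaps(self, other):
        return (self.south <= other.north and other.south <= self.north
                and self.west <= other.east and other.west <= self.east)


# Hypothetical gazetteer: name -> (feature type, footprint).
GAZETTEER = {
    "Santa Barbara": ("populated place", Box(34.39, -119.78, 34.46, -119.64)),
    "Example Hospital": ("hospital", Box(34.43, -119.73, 34.43, -119.73)),
    "Example Lake": ("lake", Box(34.58, -119.98, 34.59, -119.90)),
}

# Hypothetical collection objects, each with its own footprint.
COLLECTION = [
    ("Street map of downtown Santa Barbara", Box(34.40, -119.72, 34.44, -119.68)),
    ("Satellite image of the Pacific near Hawaii", Box(18.0, -161.0, 23.0, -154.0)),
]


def where_is(name):
    """Answer the "Where is" question with the footprint of a name."""
    return GAZETTEER[name][1]


def find_objects(name):
    """Match the footprint of a name to the footprints of collection objects."""
    footprint = where_is(name)
    return [title for title, box in COLLECTION if box.overlaps(footprint)]


def features_in(area, feature_type):
    """List features of one type inside a box drawn by the user."""
    return sorted(name for name, (kind, box) in GAZETTEER.items()
                  if kind == feature_type and box.overlaps(area))
```

For example, find_objects("Santa Barbara") returns only the street map, and features_in(Box(34.0, -120.0, 35.0, -119.0), "hospital") returns the one hospital entry.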
Alexandria Gazetteer: Example from
a search on "Tulsa"
Feature name                             State  County  Type     Latitude  Longitude
Tulsa                                    OK     Tulsa   pop pl   360914N   0955933W
Tulsa Country Club                       OK     Osage   locale   360958N   0960012W
Tulsa County                             OK     Tulsa   civil    360600N   0955400W
Tulsa Helicopters Incorporated Heliport  OK     Tulsa   airport  360500N   0955205W
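The latitude and longitude values in this example appear to be packed degrees, minutes, and seconds followed by a hemisphere letter (DDMMSS for latitude, DDDMMSS for longitude). Assuming that reading, a small helper (the name dms_to_decimal is hypothetical) converts them to signed decimal degrees for footprint matching:

```python
def dms_to_decimal(text):
    """Convert 'DDMMSSN' or 'DDDMMSSW' to signed decimal degrees.

    South and west values are returned as negative numbers.
    """
    hemisphere = text[-1:].upper()
    digits = text[:-1]
    if hemisphere not in ("N", "S", "E", "W") or not digits.isdigit() \
            or len(digits) not in (6, 7):
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees = int(digits[:-4])
    minutes = int(digits[-4:-2])
    seconds = int(digits[-2:])
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


# The entry for Tulsa (pop pl): 36 deg 09' 14" N, 95 deg 59' 33" W.
tulsa = (dms_to_decimal("360914N"), dms_to_decimal("0955933W"))
```

Under this assumption the Tulsa entry is at roughly 36.1539 degrees north and 95.9925 degrees west.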
Challenges for the Alexandria Gazetteer
Content standard: A standard conceptual schema for
gazetteer information.
Feature types: A type scheme to categorize individual
features that is rich in term variants and extensible.
Temporal aspects: Geographic names and attributes change
through time.
"Fuzzy" footprints: Extent of a geographic feature is often
approximate or ill-defined (e.g., Southern California).
Challenges for the Alexandria
Gazetteer (continued)
Quality aspects:
(a) Indicate the accuracy of latitude and longitude data.
(b) Ensure that the reported coordinates agree with the other
elements of the description.
Spatial extents:
(a) Points do not represent the extent of the geographic
locations and are therefore only minimally useful.
(b) Bounding boxes often include too much territory (e.g., the
bounding box for California also includes Nevada).
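The bounding-box problem can be shown with a small test. This is a sketch under loud assumptions: the polygon is a very coarse, hand-made outline of California (eight vertices, for illustration only), the city coordinates are approximate, and the point-in-polygon test is a standard ray-casting check rather than anything taken from the Alexandria system.

```python
# Very coarse outline of California as (lat, lon) vertices; illustrative only.
CALIFORNIA = [
    (42.0, -124.4), (42.0, -120.0), (39.0, -120.0), (35.0, -114.6),
    (32.7, -114.6), (32.5, -117.1), (34.4, -120.5), (40.4, -124.4),
]


def bounding_box(polygon):
    """Return (south, west, north, east) enclosing the polygon."""
    lats = [lat for lat, _ in polygon]
    lons = [lon for _, lon in polygon]
    return min(lats), min(lons), max(lats), max(lons)


def in_box(lat, lon, box):
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


def in_polygon(lat, lon, polygon):
    """Ray casting: count edge crossings along a ray from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        if (lat1 > lat) != (lat2 > lat):  # edge spans the point's latitude
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside


box = bounding_box(CALIFORNIA)
carson_city = (39.16, -119.77)   # Nevada, approximate
sacramento = (38.58, -121.49)    # California, approximate
```

With these values, Carson City falls inside the bounding box but outside the polygon, while Sacramento is inside both. This is the extra territory the slide refers to.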
Alexandria Thesaurus: Example
canals
A feature type category for places such as the Erie Canal.
Used for:
The category canals is used instead of any of the following.
canal bends
ditches
canalized streams
drainage canals
ditch mouths
drainage ditches
Broader Terms:
Canals is a sub-type of hydrographic structures.
... more ...
Alexandria Thesaurus: Example
(continued)
canals (continued)
Related Terms:
The following is a list of other categories related to canals
(non-hierarchical relationships).
channels
locks
transportation features
tunnels
Scope Note:
Manmade waterway used by watercraft or for drainage, irrigation,
mining, or water power. » Definition of canals.
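The thesaurus entry for canals can be represented as simple lookup tables. This is a minimal sketch using only the terms shown in the example; the table and function names are hypothetical and the real thesaurus is much larger.

```python
# "Used for": variant term -> preferred category.
USED_FOR = {
    "canal bends": "canals",
    "canalized streams": "canals",
    "ditch mouths": "canals",
    "ditches": "canals",
    "drainage canals": "canals",
    "drainage ditches": "canals",
}

# "Broader terms": category -> the category it is a sub-type of.
BROADER = {"canals": "hydrographic structures"}

# "Related terms": non-hierarchical relationships.
RELATED = {
    "canals": ["channels", "locks", "transportation features", "tunnels"],
}


def preferred_term(term):
    """Map a user's term to the category used in the gazetteer."""
    term = term.strip().lower()
    return USED_FOR.get(term, term)


def expand_query(term):
    """Return the preferred category followed by all of its variants."""
    preferred = preferred_term(term)
    variants = sorted(v for v, p in USED_FOR.items() if p == preferred)
    return [preferred] + variants
```

A search for "Drainage Ditches" is then routed to the category canals, and BROADER allows the search to be widened to hydrographic structures if too few features are found.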
Alexandria Gazetteer
Alexandria Digital Library
Linda L. Hill, James Frew, and Qi Zheng, Geographic Names:
The Implementation of a Gazetteer in a Georeferenced Digital
Library. D-Lib Magazine, 5: 1, January 1999.
http://www.dlib.org/dlib/january99/hill/01hill.html