CS 430 / INFO 430 Information Retrieval Metadata 5 Lecture 18

Course Administration
Effective Information Discovery
Before Digital Information
Searching
(a) Resources separated into categories of related materials.
Each category organized, indexed and searched separately.
(b) Catalogs and indexes built on tightly controlled metadata
standards, e.g., MARC, MeSH headings, etc.
(c) Search engines used Boolean operators and fielded
searching.
(d) Query languages and search interfaces assumed a trained
user.
(e) Resources were physical items.
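As a minimal sketch of item (c), fielded Boolean searching can be modelled as set operations over an inverted index keyed by (field, term). The records, field names, and the helper names `build_index` and `lookup` are invented for illustration and are not taken from any particular catalog system.

```python
from collections import defaultdict

# Hypothetical catalog records; the fields and values are made up.
RECORDS = {
    1: {"title": "volcanoes of the pacific", "subject": "geology"},
    2: {"title": "bird song recordings", "subject": "ornithology"},
    3: {"title": "earthquakes and volcanoes", "subject": "geology"},
}

def build_index(records):
    """Map each (field, term) pair to the set of record ids containing it."""
    index = defaultdict(set)
    for rec_id, fields in records.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index[(field, term)].add(rec_id)
    return index

def lookup(index, field, term):
    """Return ids of records whose `field` contains `term` (empty set if none)."""
    return index.get((field, term.lower()), set())

index = build_index(RECORDS)

# title:volcanoes AND subject:geology
both = lookup(index, "title", "volcanoes") & lookup(index, "subject", "geology")
# title:volcanoes OR title:bird
either = lookup(index, "title", "volcanoes") | lookup(index, "title", "bird")
# subject:geology AND NOT title:earthquakes
without = lookup(index, "subject", "geology") - lookup(index, "title", "earthquakes")
```

Each Boolean operator maps to one set operation (AND to intersection, OR to union, AND NOT to difference). The result is an unranked set, which is why such systems assumed a user trained to refine the query.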
Effective Information Discovery
With Homogeneous Digital Information
Comprehensive metadata with Boolean retrieval
Can be excellent for well-understood categories of material, but
requires standardized metadata and relatively homogeneous
content (e.g., MARC catalog).
Full text indexing with ranked retrieval
Can be excellent, but methods developed and validated for
relatively homogeneous textual material (e.g., TREC ad hoc
track).
Mixed Content
Examples: NSDL-funded collections at Cornell
Atlas. Data sets of earthquakes, volcanoes, etc.
Reuleaux. Digitized kinematics models from the nineteenth
century.
Laboratory of Ornithology. Sound recordings, images, and videos
of birds and other animals.
Nuprl. Logic-based tools to support programming and to
implement formal computational mathematics.
Mixed Metadata: the Chimera of
Standardization
Technical reasons
(a) Characteristics of formats and genres
(b) Differing user needs
Social and cultural reasons
(a) Economic factors
(b) Installed base
Information Discovery in a Messy World
Building blocks
Brute force computation
The expertise of users -- human in the loop
Methods
(a) Better understanding of how and why users seek
information
(b) Relationships and context information
(c) Multi-modal information discovery
(d) User interfaces for exploring information
Understanding How and Why Users
Seek Information
Homogeneous content
All documents are assumed equal
Criterion is relevance (binary measure)
Goal is to find all relevant documents (high recall)
Hits ranked in order of similarity to query
Mixed content
Some documents are more important than others
Goal is to find most useful documents on a topic and then
browse
Hits ranked in order that combines importance and similarity
to query
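The slide does not say how importance and similarity are combined. One simple possibility, shown here purely as an assumption, is a weighted sum of the two scores. The function name `combined_score`, the weight `alpha`, and the sample hits are all hypothetical.

```python
def combined_score(similarity, importance, alpha=0.7):
    """Weighted mix of query similarity and document importance, both in [0, 1]."""
    return alpha * similarity + (1 - alpha) * importance

# Hypothetical hits: (document id, similarity to query, importance).
hits = [
    ("doc_a", 0.90, 0.10),
    ("doc_b", 0.70, 0.90),
    ("doc_c", 0.40, 0.95),
]

# Homogeneous-content ranking: similarity to the query only.
by_similarity = sorted(hits, key=lambda h: h[1], reverse=True)
# Mixed-content ranking: importance and similarity together.
by_combined = sorted(hits, key=lambda h: combined_score(h[1], h[2]), reverse=True)

similarity_order = [h[0] for h in by_similarity]
combined_order = [h[0] for h in by_combined]
```

With these made-up numbers the important document `doc_b` moves ahead of the slightly more similar `doc_a`. The choice of `alpha` is a tuning decision that the slides leave open.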
Case Study
Information discovery in the National Science Foundation's
National Science Digital Library (NSDL).
The goal of the NSDL is to be a digital library for all aspects of
science education, where science and education are very broadly
defined.
http://nsdl.org
Why Technology in Education?
Why a Digital Library for Education?
Higher Education. U.S. higher education is the best in the
world, but it is very expensive.
How can we keep the quality while lowering the cost?
K-12. The best K-12 education in the U.S. is excellent, but much
is mediocre or worse.
How can the best be made available to all?
Technology-enhanced education offers a way to increase the
productivity of the skilled people who teach in both higher
and K-12 education.
Why a Digital Library for
Science Education?
Excellent teaching materials have been developed...
but they are not being used effectively.
The NSDL provides organization and access for
teachers and students
• Preservation and reuse.
• Searching and browsing.
• Links between teaching materials and their
educational use.
The NSDL Architecture
Educational materials are scattered across the Internet.
[Figure: example sources: State standards, Math Forum, NASA, Scientific American, Ask a Scientist.]
The NSDL Architecture:
Basic Assumptions
• The NSDL is a partnership of organizations who
manage collections and provide educational and
library services.
• There is a central team to integrate the parts and
provide central services.
• The central team does not manage any collections
and does not create any metadata.
Architectural Assumptions:
One Library, Many Portals
Different Groups of Users Need Different Views of the Library
• Central portal for general users.
• Development portal for library developers.
• Pathways portals by discipline (e.g., mathematics) and
educational level (e.g., middle school)
Architectural Assumptions:
A Spectrum of Interoperability
The Problem
Conventional approaches require partners to support
agreements (technical, content, and business)
But NSDL needs thousands of very different partners
... most of whom are not directly part of the NSDL
program
The challenge is to create incentives for independent
digital libraries to adopt agreements
Approaches to interoperability
The conventional approach
• Wise people develop standards: protocols, formats, etc.
• Everybody implements the standards.
• This creates an integrated, distributed system.
Unfortunately ...
• Standards are expensive to adopt.
• Concepts are continually changing.
• Systems are continually changing.
• Different people have different ideas.
Interoperability is about agreements
Technical agreements cover formats, protocols, security
systems, etc., so that messages can be exchanged.
Content agreements cover the data and metadata, and include
semantic agreements on the interpretation of the messages.
Organizational agreements cover the ground rules for access,
for changing collections and services, payment, authentication,
etc.
The challenge is to create incentives for independent digital
libraries to adopt agreements
Function versus cost of acceptance
[Figure: cost of acceptance plotted against function, with regions labelled "Few adopters" and "Many adopters".]
Example: security
[Figure: cost of acceptance plotted against function for three options: public key infrastructure, login ID and password, IP address.]
Example: metadata standards
[Figure: cost of acceptance plotted against function for three options: MARC, Dublin Core, free text.]
NSDL: The Spectrum of Interoperability
Level        Agreements                                                               Example
Federation   Strict use of standards (syntax, semantic, and business)                 AACR, MARC, Z39.50
Harvesting   Digital libraries expose metadata; simple protocol and registry          Open Archives metadata harvesting
Gathering    Digital libraries do not cooperate; services must seek out information   Web crawlers and search engines
Architecture: the NSDL Repository
The Repository holds information about every collection and item
known to the NSDL.
Standards Implemented in the
NSDL Repository Phase 1
Object model
Collection: collection metadata, URL
Items: item metadata, URL
Metadata: Dublin Core with educational extensions
Ingest and redistribution: Open Archives Initiative
Protocol for Metadata Harvesting (OAI-PMH)
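To make the harvesting step concrete, the sketch below builds an OAI-PMH `ListRecords` request and parses unqualified Dublin Core from a response. The verb, the `oai_dc` metadata prefix, and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications. The base URL, the sample record, and the helper names `list_records_url` and `parse_records` are invented for illustration, and the NSDL educational extensions are not shown.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for the given metadata format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query

# Invented response fragment standing in for what a collection might return.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:item-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Earthquake data set</dc:title>
          <dc:identifier>http://example.org/item-1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def parse_records(xml_text):
    """Extract the OAI identifier, title, and item URL from each record."""
    root = ET.fromstring(xml_text)
    items = []
    for record in root.iterfind(".//oai:record", NS):
        items.append({
            "oai_id": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": record.findtext(".//dc:title", namespaces=NS),
            "url": record.findtext(".//dc:identifier", namespaces=NS),
        })
    return items

request_url = list_records_url("http://example.org/oai")
harvested = parse_records(SAMPLE)
```

A real harvester would also fetch the URL, follow resumption tokens for long result lists, and handle protocol errors. Those steps are omitted here to keep the example short.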
The NSDL Search Service
Full Text or Metadata?
Full text indexing is excellent, but is not possible for all
materials (non-textual, no access for indexing).
Comprehensive metadata is available for very few of the
materials.
What Architecture to Use?
Few collections support an established search protocol (e.g.,
Z39.50).
NSDL Search Service: Phase 1
The search service combines metadata from the Repository and full
text from the collections.
[Figure: the Search Service connected to the NSDL Repository and to the Collections; a link is labelled http.]
NSDL Search Service: Phase 1
Approach
(a) Collections map metadata to Dublin Core and provide it via
the Open Archives protocol.
(b) Search service augments Dublin Core metadata with
indexing of full text where available.
(c) The search engine is Lucene (tf.idf weighting).
(d) User interface returns snippets derived from the
metadata, links to full content and to metadata.
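The approach above can be sketched as a small tf.idf ranker that indexes metadata together with full text where it exists. This is not Lucene's scoring formula. It uses one common textbook variant (term frequency times log(N/df)), and the items and the function names `tokens`, `build`, and `search` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical items: Dublin Core-style metadata plus full text where available.
ITEMS = {
    "item1": {"metadata": "volcano eruption data", "full_text": "volcano volcano lava flows"},
    "item2": {"metadata": "bird song recordings", "full_text": None},  # no full text
    "item3": {"metadata": "earthquake data", "full_text": "seismic data volcano"},
}

def tokens(item):
    """Combine metadata with full text (if any) into one list of terms."""
    text = item["metadata"] + " " + (item["full_text"] or "")
    return text.lower().split()

def build(items):
    """Return per-item term counts and per-term document frequencies."""
    term_counts = {doc_id: Counter(tokens(item)) for doc_id, item in items.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return term_counts, doc_freq

def search(query, term_counts, doc_freq):
    """Rank items by the sum of tf * log(N / df) over the query terms."""
    n_docs = len(term_counts)
    scores = {}
    for doc_id, counts in term_counts.items():
        score = sum(
            counts[t] * math.log(n_docs / doc_freq[t])
            for t in query.lower().split()
            if t in counts
        )
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

term_counts, doc_freq = build(ITEMS)
volcano_hits = search("volcano", term_counts, doc_freq)
bird_hits = search("bird", term_counts, doc_freq)
```

Here `item3` matches "volcano" only through its full text, so a snippet built from its metadata alone would not show the matching term.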
NSDL Search Service: Phase 1
Weaknesses
(a) Ranking by similarity to query not sufficient (e.g., no
ranking by grade level)
(b) Snippets do not indicate why item was returned (e.g.,
terms in full text but not in metadata).
(c) Dublin Core records provide limited information.
(d) Browsing environment limited.
(e) Many users begin their search with a Web search engine
(e.g., Google or Yahoo).
NSDL and the Web
Many people will find NSDL materials through Web search
engines. Therefore the NSDL must be indexed by them.
[Figure: the NSDL Repository and the Collections, with links labelled http.]
NSDL Search Service:
Second Phase Developments
Metadata
(a) Accept any metadata that is available in a range of
formats
(b) System for reviews and annotations, with reputation
management
Search system
(a) Multimodal retrieval and ranking
(b) Dynamic generation of snippets by search engine
This work is currently in progress. The first stage is to
reimplement the Repository to manage relationships among
resources.
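As an illustration of item (b) under "Search system", a dynamically generated snippet can be a window of words around the first query term found in the indexed text, so the user sees why the item matched. The slides do not describe the method actually planned. The function name `make_snippet`, the window size, and the sample text are assumptions.

```python
def make_snippet(text, query, width=3):
    """Return a window of words around the first query-term match, or None."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    for i, word in enumerate(words):
        # Ignore surrounding punctuation when comparing with query terms.
        if word.lower().strip(".,;:()") in terms:
            start = max(0, i - width)
            end = min(len(words), i + width + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return None

# Made-up full text for a single item.
full_text = (
    "This collection holds digitized kinematic models built in "
    "the nineteenth century for teaching."
)
snippet = make_snippet(full_text, "kinematic")
no_match = make_snippet(full_text, "volcano")
```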
NSDL Search Service:
Second Phase Developments (cont.)
Usability and human factors
(a) Wider range of browsing tools (e.g., collection
visualization)
(b) Filters by education level and education quality,
where known
Web compatibility
(a) Expose records for Web crawlers to index
(b) Browser bookmarklet to add NSDL information to
Web pages
Relationship and Contextual Information
Methods for capturing context
Analysis of citations and links (e.g., PageRank)
Mining usage logs (e.g., customers who buy the same
product)
Reviews (e.g., reputation management)
Structural relationships (e.g., domain names)
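As one example of the link-analysis method listed above, here is a minimal power-iteration sketch of PageRank. The damping factor 0.85 is the value commonly quoted for the algorithm. The link graph and the fixed iteration count are invented, and this is not presented as the NSDL's implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    Every link target must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph among four resources.
LINKS = {
    "a": ["c"],
    "b": ["c"],
    "c": ["d"],
    "d": ["c"],
}
ranks = pagerank(LINKS)
top = max(ranks, key=ranks.get)
```

Resource `c` is linked to by three others and ends up with the highest score, which is the kind of importance signal that could be combined with similarity to the query.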
Acknowledgements
The NSDL is a program of the National Science
Foundation's Directorate for Education and Human
Resources, Division of Undergraduate Education.
The NSDL Core Integration is a collaboration between the
University Corporation for Atmospheric Research, Columbia
University, and Cornell University.
The initial version of the Search Service was developed by
James Allan and colleagues at the University of
Massachusetts, Amherst. The current version was
developed by Naomi Dushay and colleagues at Cornell
University.