Metadata Repository

CS 430: Information Discovery
Lecture 13
Case Study: the NSDL
Course Administration
The NSDL
The National Science (SMETE) Digital Library
Funded by the National Science Foundation
Directorate for Education and Human Resources
Division of Undergraduate Education
The NSDL Library Project
1996 Vision articulated by NSF's Division of Undergraduate
Education
1997 National Research Council workshop
1998 Preliminary grants through Digital Libraries Initiative 2
1998 SMETE-Lib workshop
1999 NSDL Solicitation
2000 6 Core Integration System projects + 23 others funded
2001 1 very large Core Integration System project
Collections and Services
Scientific and technical
information
Materials used
in education
Materials tailored
to education
Core Partners
All Partners
NSDL Components
Funded by the NSF
•
Core Integration System
•
Collection Projects
•
Service Projects
Other
Any digital collection or service that is relevant to
science education, very broadly defined.
Official start date is December 2002.
How Big might the NSDL be?
The NSDL aims to be comprehensive—all branches of
science, all levels of education, very broadly defined.
Five year targets
1,000,000 different users
10,000,000 digital objects
100,000 independent sites
Requires
Low-cost, scalable technology
Automated collection-building and maintenance
A User's Wish List
To discover materials and services:
• Good science
• Comprehensible to students -- effective for teaching
• Stable -- will not change or disappear
Through services that are appropriate to the user's needs.
The Dilemma
No uniform catalog or index to everything
Mixture of for-profit and open access information
Collections vary:
Format:    text, images, datasets, etc.
Metadata:  extensive, minimal, or none;
           Dublin Core, other standard, or local scheme
Protocols: HTTP, SQL, Z39.50, etc.
Access:    open access or restricted
Methods studied in this course have been for
homogeneous sets of documents.
The Challenge of Interoperability
Technical agreements cover formats, protocols, security systems, etc.,
so that messages can be exchanged.
Content agreements cover the data and metadata, and include
semantic agreements on the interpretation of the messages.
Organizational agreements cover the ground rules for access,
for changing collections and services, payment, authentication,
etc.
The challenge is to create incentives for independent digital
libraries to adopt these agreements.
Levels of Interoperability
Level        Agreements                                  Example
Federation   Strict use of standards (syntax,            AACR, MARC, Z39.50
             semantic, and business)
Harvesting   Digital libraries supply basic metadata;    Open Archives
             simple protocol and registry
Gathering    Digital libraries do not cooperate;         Web crawlers and
             services must seek out information          search engines
The General Catalog (Metadata Repository)
[Diagram labels: user portals, metadata repository, distributed collections]
Metadata Harvesting (Open Archives Initiative)
[Diagram labels: central services, metadata collections, etc.; central data; metadata to harvest; distributed collections]
Metadata Harvesting
Collections must support:
Unqualified Dublin Core
Collections may support:
IMS
FGDC
or one of seven recognized metadata sets
Simple XML tagged format -- protocol derived from Dienst
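As an illustration of the "simple XML tagged format", the sketch below reads one unqualified Dublin Core record into a Python dictionary. It assumes the namespace URIs of the oai_dc format from OAI-PMH 2.0, which may differ from the protocol version current when these slides were written. The sample record and the function name parse_dublin_core are hypothetical, invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set (assumed: the OAI-PMH 2.0 oai_dc layout).
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"

# Hypothetical record, invented for illustration only.
SAMPLE_RECORD = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Introduction to Plate Tectonics</dc:title>
  <dc:creator>Example Author</dc:creator>
  <dc:subject>geology</dc:subject>
  <dc:subject>earth science</dc:subject>
  <dc:identifier>http://example.org/items/42</dc:identifier>
</oai_dc:dc>
"""


def parse_dublin_core(xml_text):
    """Return a dict mapping each Dublin Core element name to its list of values."""
    root = ET.fromstring(xml_text.strip())
    record = {}
    for child in root:
        # ElementTree tags look like "{namespace}name"; split the two parts.
        namespace, _, name = child.tag[1:].partition("}")
        if namespace == DC_NAMESPACE:
            # Every Dublin Core element is optional and repeatable, so keep lists.
            record.setdefault(name, []).append((child.text or "").strip())
    return record


record = parse_dublin_core(SAMPLE_RECORD)
print(record["title"], record["subject"])
```

Values are kept as lists because every unqualified Dublin Core element may repeat or be absent, which is part of why the metadata is described here as minimal.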
The Information Discovery System
Items are stored in (usually) independent repositories.
Surrogates for items and resources are stored in a central metadata
repository.
Items and surrogates become part of the library by way of
gathering, harvesting and federated services.
A search service allows items in the library to be discovered.
The metadata repository and search service may be distributed.
The big question: How can we have effective information
discovery with such minimal and diverse metadata?
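The arrangement described above can be sketched as a small in-memory model: items stay in their own repositories, and the central repository holds only surrogates that point back to them. This is a minimal sketch under stated assumptions, not the NSDL design. The names Surrogate and MetadataRepository, the field names, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Surrogate:
    """A metadata record standing in for an item held in a remote repository."""
    identifier: str   # where the item itself lives
    source: str       # how it arrived: "federation", "harvesting", or "gathering"
    metadata: dict = field(default_factory=dict)


class MetadataRepository:
    """Hypothetical central store of surrogates; the items are never copied here."""

    def __init__(self):
        self._surrogates = {}

    def add(self, surrogate):
        self._surrogates[surrogate.identifier] = surrogate

    def search(self, term):
        """Return identifiers of surrogates whose metadata mentions the term."""
        term = term.lower()
        hits = []
        for surrogate in self._surrogates.values():
            text = " ".join(str(v) for v in surrogate.metadata.values()).lower()
            if term in text:
                hits.append(surrogate.identifier)
        return sorted(hits)


repo = MetadataRepository()
repo.add(Surrogate("http://example.org/a", "harvesting",
                   {"title": "Plate Tectonics", "subject": "geology"}))
repo.add(Surrogate("http://example.org/b", "gathering",
                   {"title": "Fractions Worksheet"}))
print(repo.search("geology"))
```

The second surrogate has only a title, which shows the problem posed by the big question: a search over sparse, uneven metadata can only match what the record happens to contain.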
The InQuery Retrieval Engine
Developed by Bruce Croft and colleagues at the
University of Massachusetts, Amherst
Used in:
• Infoseek
• Library of Congress -- Thomas, American Memory
• White House
• and many more
Highly rated in TREC experiments
InQuery: Advanced Features
Ranked output: Combines evidence in the text of the document
and the corpus as a whole.
Passage-based retrieval: The probability of relevance is based
both on the entire content of a document and the best matching
passage in the document.
Simple and complex queries: e.g., simple word-based queries,
Boolean queries, phrase-based queries or a combination.
Field-based retrieval: e.g., bill number and type.
Flexible and efficient indexing: Incorporates a variety of
document structures (e.g. HTML, MARC, etc.)
Tools for query processing and query expansion
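To make ranked output and passage-based retrieval concrete, the sketch below scores each document twice, once over its whole text and once over its best fixed-size passage, then combines the two. This is not InQuery's actual model. The tf-idf weighting, the 50/50 weight, the passage size, and all function names are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip(".,;:!?()").lower() for w in text.split())
    return [w for w in words if w]


def idf_table(docs):
    """Smoothed inverse document frequency: evidence from the corpus as a whole."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    return {t: math.log((len(docs) + 1) / (c + 1)) + 1.0 for t, c in df.items()}


def score(tokens, query, idf, length):
    """Sum of length-normalized term frequency times idf over the query terms."""
    counts = Counter(tokens)
    return sum(counts[q] / length * idf.get(q, 0.0) for q in query)


def passages(tokens, size):
    """Overlapping windows of about `size` tokens, stepping by half a window."""
    step = max(1, size // 2)
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]


def rank(texts, query_text, weight=0.5, passage_size=8):
    """Rank documents by whole-document score combined with best-passage score."""
    docs = {doc_id: tokenize(t) for doc_id, t in texts.items()}
    idf = idf_table(docs)
    query = tokenize(query_text)
    results = []
    for doc_id, tokens in docs.items():
        whole = score(tokens, query, idf, max(1, len(tokens)))
        best = max((score(p, query, idf, passage_size)
                    for p in passages(tokens, passage_size)), default=0.0)
        results.append((doc_id, weight * whole + (1 - weight) * best))
    return sorted(results, key=lambda r: r[1], reverse=True)


texts = {
    "bill": "The bill amends the clean water act to fund watershed research.",
    "other": "A recipe for bread and soup.",
}
ranked = rank(texts, "clean water")
print(ranked)
```

A long document that mentions the query terms only in one dense passage keeps a competitive score through the passage term, which is the motivation for combining the two kinds of evidence.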
How Search Services Fit into the NSDL
[Diagram labels: Portal (x3), SDLIP?, Search and Discovery Services, OAI, Metadata Repository, http?, Content]
Note: Services use both metadata and automatic indexing of (textual) content
Goals of Information Retrieval Service for First Year
Basic metadata search
• e.g., card catalogue
Basic content search
• Provided content is textual
• If content is publicly readable
Combining metadata and content
• e.g., content search restricted by metadata
What service is not provided
• SQL-like access to metadata repository
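A content search restricted by metadata might look like the sketch below. The item layout, the sample data, and the function name restricted_search are hypothetical. Content set to None stands for an item that is non-textual or not publicly readable, so only its metadata can be searched.

```python
# Hypothetical items, invented for illustration only.
ITEMS = [
    {"id": "a", "metadata": {"type": "lesson plan"},
     "content": "photosynthesis in green plants"},
    {"id": "b", "metadata": {"type": "dataset"},
     "content": "photosynthesis rate measurements"},
    {"id": "c", "metadata": {"type": "lesson plan"},
     "content": None},  # no indexable text available
]


def restricted_search(items, content_terms, metadata_filter=None):
    """Return ids of items that match every metadata value and every content term."""
    metadata_filter = metadata_filter or {}
    terms = [t.lower() for t in content_terms]
    hits = []
    for item in items:
        meta = item["metadata"]
        # Metadata restriction: every requested field must match exactly.
        if any(meta.get(key) != value for key, value in metadata_filter.items()):
            continue
        if terms:
            content = item.get("content")
            if content is None:
                continue  # content search is impossible without readable text
            words = set(content.lower().split())
            if not all(t in words for t in terms):
                continue
        hits.append(item["id"])
    return hits


print(restricted_search(ITEMS, ["photosynthesis"], {"type": "lesson plan"}))
```

With no content terms the function reduces to a basic metadata search, and with no filter it reduces to a basic content search, so one routine covers the first three goals listed above.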
Future Possible Directions for Information Retrieval Services
Integration of hierarchies
• content-based search for entries in hierarchies
Browsing capabilities
• by metadata values
• by “concepts” automatically extracted from the content
• by hierarchies
Feedback capabilities
• “more like this” while browsing retrieval results
Use of thesaurus
• allowing user to add vocabulary terms
Clustering/grouping
• show/find strongly related items across the repository
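One simple way to picture a "more like this" feature is cosine similarity between term-count vectors, sketched below. The slides do not say which method the NSDL would use, so the technique is an assumption. The sample texts and the function names are hypothetical.

```python
import math
from collections import Counter


def vector(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def more_like_this(seed_id, texts, k=2):
    """Return up to k other item ids, most similar to the seed item first."""
    seed = vector(texts[seed_id])
    scored = [(cosine(seed, vector(text)), doc_id)
              for doc_id, text in texts.items() if doc_id != seed_id]
    # Highest similarity first; ties broken by id so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical texts, invented for illustration only.
texts = {
    "a": "volcano eruption lava",
    "b": "lava flow volcano hazards",
    "c": "algebra fractions worksheet",
}
print(more_like_this("a", texts, k=1))
```

The same pairwise similarity could feed the clustering/grouping direction, since items with high mutual similarity are candidates for the same group.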