CS 430: Information Discovery
Lecture 13
Case Study: the NSDL

Course Administration

The NSDL: The National Science SMETE Digital Library
Funded by the National Science Foundation
Directorate for Education and Human Resources
Division of Undergraduate Education

The NSDL Library Project
1996  Vision articulated by NSF's Division of Undergraduate Education
1997  National Research Council workshop
1998  Preliminary grants through Digital Libraries Initiative 2
1998  SMETE-Lib workshop
1999  NSDL Solicitation
2000  6 Core Integration System projects + 23 others funded
2001  1 very large Core Integration System project

Collections and Services
• Scientific and technical information
• Materials used in education
• Materials tailored to education

Core Partners

All Partners

NSDL Components
Funded by the NSF
• Core Integration System
• Collection Projects
• Service Projects
Other
Any digital collection or service that is relevant to science education, very broadly defined.
Official start date is December 2002.

How Big might the NSDL be?
The NSDL aims to be comprehensive -- all branches of science, all levels of education, very broadly defined.
Five year targets
• 1,000,000 different users
• 10,000,000 digital objects
• 100,000 independent sites
Requires
• Low-cost, scalable technology
• Automated collection-building and maintenance

A User's Wish List
To discover materials and services:
• Good science
• Comprehensible to students -- effective for teaching
• Stable -- will not change or disappear
Through services that are appropriate to the user's needs.

The Dilemma
• No uniform catalog or index to everything
• Mixture of for-profit and open access information
Collections vary:
• Format: text, images, datasets, etc.
• Metadata: extensive, minimal, or none; Dublin Core, other standard, or local scheme
• Protocols: HTTP, SQL, Z39.50, etc.
• Access: open access or restricted
Methods studied in this course have been for homogeneous sets of documents.

The Challenge of Interoperability
Technical agreements cover formats, protocols, and security systems, so that messages can be exchanged.
Content agreements cover the data and metadata, and include semantic agreements on the interpretation of the messages.
Organizational agreements cover the ground rules for access, for changing collections and services, payment, authentication, etc.
The challenge is to create incentives for independent digital libraries to adopt agreements.

Levels of Interoperability
Level: Federation
  Agreements: Strict use of standards (syntax, semantic, and business)
  Example: AACR, MARC, Z39.50
Level: Harvesting
  Agreements: Digital libraries supply basic metadata; simple protocol and registry
  Example: Open Archives
Level: Gathering
  Agreements: Digital libraries do not cooperate; services must seek out information
  Example: Web crawlers and search engines

The General Catalog (Metadata Repository)
[Diagram: User portals; Metadata Repository; Distributed collections]

Metadata Harvesting (Open Archives Initiative)
[Diagram: Central services, metadata collections, etc.; Central data; Metadata to harvest; Distributed collections]

Metadata Harvesting
Collections must support:
• Unqualified Dublin Core
Collections may support:
• IMS, FGDC, or one of seven recognized metadata sets
Simple XML tagged format -- protocol derived from Dienst

The Information Discovery System
• Items are stored in (usually) independent repositories.
• Surrogates for items and resources are stored in a central metadata repository.
• Items and surrogates become part of the library by way of gathering, harvesting and federated services.
• A search service allows items in the library to be discovered.
• The metadata repository and search service may be distributed.
The big question: How can we have effective information discovery with such minimal and diverse metadata?
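
As a rough illustration of the harvesting model in the preceding slides, the sketch below reads a simple XML-tagged record carrying unqualified Dublin Core elements and files it in a central metadata repository keyed by item identifier. It is a minimal sketch under stated assumptions, not the Open Archives wire format: the `<record>` wrapper, the sample values, the `example.org` URL, and the function names `parse_dc_record` and `harvest` are all hypothetical. Only the Dublin Core element namespace is the standard one.

```python
import xml.etree.ElementTree as ET

# Standard namespace for the unqualified Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record as a collection might expose it for harvesting.
# The <record> wrapper is an assumption made for this sketch.
SAMPLE_RECORD = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Introduction to Plate Tectonics</dc:title>
  <dc:creator>Example Author</dc:creator>
  <dc:subject>geology</dc:subject>
  <dc:subject>earth science</dc:subject>
  <dc:identifier>http://collection.example.org/items/42</dc:identifier>
</record>"""


def parse_dc_record(xml_text):
    """Return a dict mapping each Dublin Core element name to a list of values.

    Every Dublin Core element is optional and repeatable, so values are
    always collected into lists and missing elements are simply absent.
    """
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        # ElementTree spells namespaced tags as "{namespace}localname".
        if child.tag.startswith("{" + DC_NS + "}"):
            name = child.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields


def harvest(records_xml):
    """Build a central metadata repository: identifier -> surrogate record."""
    repository = {}
    for xml_text in records_xml:
        fields = parse_dc_record(xml_text)
        identifiers = fields.get("identifier")
        if identifiers:  # a record without an identifier cannot be filed
            repository[identifiers[0]] = fields
    return repository


if __name__ == "__main__":
    repo = harvest([SAMPLE_RECORD])
    for identifier, surrogate in repo.items():
        print(identifier, surrogate["title"], surrogate["subject"])
```

Because every element may be missing, a search service built on such a repository cannot rely on any particular field being present, which is one concrete form of the "minimal and diverse metadata" problem raised above.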

The InQuery Retrieval Engine
Developed by Bruce Croft and colleagues at the University of Massachusetts, Amherst
Used in:
• Infoseek
• Library of Congress -- Thomas, American Memory
• White House
• and many more
Highly rated in TREC experiments

InQuery: Advanced Features
• Ranked output: Combines evidence in the text of the document and the corpus as a whole.
• Passage-based retrieval: The probability of relevance is based both on the entire content of a document and the best matching passage in the document.
• Simple and complex queries: e.g., simple word-based queries, Boolean queries, phrase-based queries or a combination.
• Field-based retrieval: e.g., bill number and type.
• Flexible and efficient indexing: Incorporates a variety of document structures (e.g., HTML, MARC, etc.)
• Tools for query processing and query expansion

How Search Services Fit into the NSDL
[Diagram labels: Portal, Portal, Portal; SDLIP?; OAI; Search and Discovery Services; Metadata Repository; http?; Content]
Note: Services use both metadata and automatic indexing of (textual) content

Goals of Information Retrieval Service for First Year
• Basic metadata search
    e.g., card catalogue
• Basic content search
    Provided content is textual
    If content is publicly readable
• Combining metadata and content
    e.g., content search restricted by metadata
What service is not provided
• SQL-like access to metadata repository

Future Possible Directions for Information Retrieval Services
• Integration of hierarchies
    content-based search for entries in hierarchies
• Browsing capabilities
    by metadata values
    by "concepts" automatically extracted from the content
    by hierarchies
• Feedback capabilities
    "more like this" while browsing retrieval results
• Use of thesaurus
    allowing user to add vocabulary terms
• Clustering/grouping
    show/find strongly related items across the repository
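
The first-year goal of "content search restricted by metadata" can be pictured with a small sketch: filter items on metadata values first, then rank the survivors by how often the query terms occur in their text. This is an illustrative assumption only. It uses plain term counts rather than InQuery's probabilistic ranking, and the `ITEMS` data, the metadata field names, and the functions `tokenize` and `search` are all hypothetical.

```python
from collections import Counter

# Hypothetical items: each pairs a metadata surrogate with textual content.
ITEMS = [
    {"id": "a",
     "metadata": {"subject": "physics", "level": "undergraduate"},
     "content": "Newton's laws of motion describe force and motion"},
    {"id": "b",
     "metadata": {"subject": "biology", "level": "undergraduate"},
     "content": "Cell motion and the forces inside the cell"},
    {"id": "c",
     "metadata": {"subject": "physics", "level": "high school"},
     "content": "Energy and heat"},
]


def tokenize(text):
    """Lowercase the text and split on whitespace, trimming edge punctuation."""
    return [word.strip(".,;:'\"").lower() for word in text.split()]


def search(items, query, **metadata_filter):
    """Rank items by query-term frequency, keeping only metadata matches.

    metadata_filter holds exact-match constraints such as subject="physics".
    An item lacking a constrained field is treated as a non-match.
    """
    query_terms = tokenize(query)
    results = []
    for item in items:
        # Metadata restriction: every constraint must match exactly.
        if any(item["metadata"].get(field) != value
               for field, value in metadata_filter.items()):
            continue
        # Content score: total occurrences of the query terms in the text.
        counts = Counter(tokenize(item["content"]))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            results.append((score, item["id"]))
    # Highest score first; ties broken by id for a stable order.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in results]


if __name__ == "__main__":
    print(search(ITEMS, "motion"))                     # content only
    print(search(ITEMS, "motion", subject="physics"))  # restricted by metadata
```

Treating a missing metadata field as a non-match is a design choice of this sketch. With minimal or absent metadata it silently hides items, so a real service might instead fall back to content-only search for such items.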