CS 430 / INFO 430 Information Retrieval
Lecture 18: Metadata 5

Course Administration

Effective Information Discovery Before Digital Information

Searching
(a) Resources separated into categories of related materials. Each category organized, indexed and searched separately.
(b) Catalogs and indexes built on tightly controlled metadata standards, e.g., MARC, MeSH headings, etc.
(c) Search engines used Boolean operators and fielded searching.
(d) Query languages and search interfaces assumed a trained user.
(e) Resources were physical items.

Effective Information Discovery With Homogeneous Digital Information

Comprehensive metadata with Boolean retrieval
Can be excellent for well-understood categories of material, but requires standardized metadata and relatively homogeneous content (e.g., MARC catalog).

Full text indexing with ranked retrieval
Can be excellent, but the methods were developed and validated for relatively homogeneous textual material (e.g., TREC ad hoc track).

Mixed Content

Examples: NSDL-funded collections at Cornell
Atlas. Data sets of earthquakes, volcanoes, etc.
Reuleaux. Digitized kinematics models from the nineteenth century.
Laboratory of Ornithology. Sound recordings, images, videos of birds and other animals.
Nuprl. Logic-based tools to support programming and to implement formal computational mathematics.
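The two retrieval styles contrasted for homogeneous collections (Boolean retrieval over fielded metadata, and ranked retrieval over text) can be illustrated with a minimal sketch. The records, field names, and the term-count score below are hypothetical and chosen only for illustration; real catalogs and ranking functions are far richer.

```python
from collections import Counter

# Hypothetical catalog records with fielded metadata (illustrative only).
RECORDS = [
    {"id": 1, "title": "introduction to information retrieval",
     "subject": "information retrieval"},
    {"id": 2, "title": "birds of north america", "subject": "ornithology"},
    {"id": 3, "title": "retrieval of bird sound recordings",
     "subject": "ornithology"},
]


def fielded_and(records, **terms):
    """Boolean AND over fields: a record matches only if every named
    field contains its term as a whole word. Results are unranked."""
    return [
        r["id"]
        for r in records
        if all(term in r[field].split() for field, term in terms.items())
    ]


def ranked(records, query):
    """Ranked retrieval with a crude similarity: the number of query-term
    occurrences in the title. Partial matches are returned, best first."""
    query_terms = query.split()
    scored = []
    for r in records:
        counts = Counter(r["title"].split())
        score = sum(counts[t] for t in query_terms)
        if score > 0:
            scored.append((score, r["id"]))
    # Highest score first; ties broken by record id for determinism.
    return [rid for score, rid in sorted(scored, key=lambda p: (-p[0], p[1]))]


boolean_hits = fielded_and(RECORDS, title="retrieval", subject="ornithology")
ranked_hits = ranked(RECORDS, "bird retrieval")
```

The Boolean query returns only records that satisfy every fielded condition, which suits a trained user and consistent metadata. The ranked query also returns partial matches, ordered by score, which is more forgiving but depends on the text being comparable across documents.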

Mixed Metadata: the Chimera of Standardization

Technical reasons
(a) Characteristics of formats and genres
(b) Differing user needs

Social and cultural reasons
(a) Economic factors
(b) Installed base

Information Discovery in a Messy World

Building blocks
Brute force computation
The expertise of users: human in the loop

Methods
(a) Better understanding of how and why users seek information
(b) Relationships and context information
(c) Multi-modal information discovery
(d) User interfaces for exploring information

Understanding How and Why Users Seek Information

Homogeneous content
All documents are assumed equal
Criterion is relevance (binary measure)
Goal is to find all relevant documents (high recall)
Hits ranked in order of similarity to query

Mixed content
Some documents are more important than others
Goal is to find the most useful documents on a topic and then browse
Hits ranked in an order that combines importance and similarity to query

Case Study

Information discovery in the National Science Foundation's National Science Digital Library (NSDL).
The goal of the NSDL is to be a digital library for all aspects of science education, where science and education are very broadly defined.
http://nsdl.org

Why Technology in Education? Why a Digital Library for Education?

Higher Education. U.S. higher education is the best in the world, but it is very expensive. How can we keep the quality while lowering the cost?
K-12. The best K-12 education in the U.S. is excellent, but much is mediocre or worse. How can the best be made available to all?
Technology-enhanced education offers a way to increase the productivity of the skilled people who teach in both higher and K-12 education.

Why a Digital Library for Science Education?

Excellent teaching materials have been developed... but they are not being used effectively.
The NSDL provides organization and access for teachers and students
• Preservation and reuse.
• Searching and browsing.
• Links between teaching materials and their educational use.

The NSDL Architecture

Educational materials are scattered across the Internet
[Figure: examples of scattered sources: State standards, Math Forum, NASA, Scientific American, Ask a Scientist]

The NSDL Architecture: Basic Assumptions

• The NSDL is a partnership of organizations that manage collections and provide educational and library services.
• There is a central team to integrate the parts and provide central services.
• The central team does not manage any collections and does not create any metadata.

Architectural Assumptions: One Library, Many Portals

Different Groups of Users Need Different Views of the Library
• Central portal for general users.
• Development portal for library developers.
• Pathways portals by discipline (e.g., mathematics) and educational level (e.g., middle school).

Architectural Assumptions: A Spectrum of Interoperability

The Problem
Conventional approaches require partners to support agreements (technical, content, and business).
But the NSDL needs thousands of very different partners ... most of whom are not directly part of the NSDL program.
The challenge is to create incentives for independent digital libraries to adopt agreements.

Approaches to Interoperability

The conventional approach
Wise people develop standards: protocols, formats, etc.
Everybody implements the standards.
This creates an integrated, distributed system.

Unfortunately ...
Standards are expensive to adopt.
Concepts are continually changing.
Systems are continually changing.
Different people have different ideas.

Interoperability is about agreements

Technical agreements cover formats, protocols, security systems, etc., so that messages can be exchanged.
Content agreements cover the data and metadata, and include semantic agreements on the interpretation of the messages.
Organizational agreements cover the ground rules for access, for changing collections and services, payment, authentication, etc.
The challenge is to create incentives for independent digital libraries to adopt agreements.

Function versus cost of acceptance

[Figure: function plotted against cost of acceptance; labels: few adopters, many adopters]

Example: security

[Figure: function versus cost of acceptance; labels: public key infrastructure, login ID and password, IP address]

Example: metadata standards

[Figure: function versus cost of acceptance; labels: MARC, Dublin Core, free text]

NSDL: The Spectrum of Interoperability

Level       Agreements                                                              Example
Federation  Strict use of standards (syntax, semantic, and business)                AACR, MARC, Z39.50
Harvesting  Digital libraries expose metadata; simple protocol and registry         Open Archives metadata harvesting
Gathering   Digital libraries do not cooperate; services must seek out information  Web crawlers and search engines

Architecture: the NSDL Repository

The Repository holds information about every collection and item known to the NSDL.

Standards Implemented in the NSDL Repository: Phase 1

[Figure: object model; labels: Collection, collection metadata, URL, Items, item metadata, URL]
Metadata: Dublin Core with educational extensions
Ingest and redistribution: Open Archives Initiative Protocol for Metadata Harvesting

The NSDL Search Service

Full Text or Metadata?
Full text indexing is excellent, but is not possible for all materials (non-textual, no access for indexing).
Comprehensive metadata is available for very few of the materials.

What Architecture to Use?
Few collections support an established search protocol (e.g., Z39.50).

NSDL Search Service: Phase 1

[Figure: NSDL Repository, Search Service, and Collections, connected over http]
The search service combines metadata from the Repository and full text from the collections.

NSDL Search Service: Phase 1

Approach
(a) Collections map metadata to Dublin Core and provide it via the Open Archives protocol.
(b) Search service augments Dublin Core metadata with indexing of full text where available.
(c) The search engine is Lucene (tf.idf weighting).
(d) User interface returns snippets derived from the metadata, links to full content and to metadata.

NSDL Search Service: Phase 1

Weaknesses
(a) Ranking by similarity to query is not sufficient (e.g., no ranking by grade level).
(b) Snippets do not indicate why an item was returned (e.g., terms in full text but not in metadata).
(c) Dublin Core records provide limited information.
(d) Browsing environment is limited.
(e) Many users begin their search with a Web search engine (e.g., Google or Yahoo).

NSDL and the Web

Many people will find NSDL materials through Web search engines. Therefore the NSDL must be indexed by them.
[Figure: NSDL Repository and Collections, reached over http]

NSDL Search Service: Second Phase Developments

Metadata
(a) Accept any metadata that is available, in a range of formats
(b) System for reviews and annotations, with reputation management

Search system
(a) Multimodal retrieval and ranking
(b) Dynamic generation of snippets by search engine

This work is currently in progress. The first stage is to reimplement the Repository to manage relationships among resources.

NSDL Search Service: Second Phase Developments (cont.)

Usability and human factors
(a) Wider range of browsing tools (e.g., collection visualization)
(b) Filters by education level and education quality, where known

Web compatibility
(a) Expose records for Web crawlers to index
(b) Browser bookmarklet to add NSDL information to Web pages

Relationship and Contextual Information

Methods for capturing context
Analysis of citations and links (e.g., PageRank)
Mining usage logs (e.g., customers who buy the same product)
Reviews (e.g., reputation management)
Structural relationships (e.g., domain names)

Acknowledgements

The NSDL is a program of the National Science Foundation's Directorate for Education and Human Resources, Division of Undergraduate Education.
The NSDL Core Integration is a collaboration between the University Corporation for Atmospheric Research, Columbia University, and Cornell University.
The initial version of the Search Service was developed by James Allan and colleagues at the University of Massachusetts, Amherst.
The current version was developed by Naomi Dushay and colleagues at Cornell University.
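
Supplementary sketch. For mixed content, the slides call for hits ranked in an order that combines importance and similarity to the query, and they note that the Phase 1 search engine uses tf.idf weighting. The following is a minimal sketch of one way such a combination could look, not the NSDL or Lucene implementation. The toy documents, the precomputed `importance` values, the linear mix, and the weight `alpha` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection. "importance" stands in for any external
# signal (links, reviews, usage); the values are assumed, not measured.
DOCS = {
    "a": {"text": "volcano data sets and earthquake data", "importance": 0.9},
    "b": {"text": "volcano volcano eruption images", "importance": 0.1},
    "c": {"text": "bird sound recordings", "importance": 0.5},
}


def tfidf_vectors(docs):
    """Return per-document tf.idf vectors and the idf table."""
    n = len(docs)
    term_counts = {d: Counter(v["text"].split()) for d, v in docs.items()}
    # Document frequency: number of documents containing each term.
    df = Counter(t for counts in term_counts.values() for t in counts)
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = {
        d: {t: tf * idf[t] for t, tf in counts.items()}
        for d, counts in term_counts.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(docs, query, alpha=0.5):
    """Order documents by alpha * similarity + (1 - alpha) * importance."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.split()).items()}
    scores = {
        d: alpha * cosine(q, vec) + (1 - alpha) * docs[d]["importance"]
        for d, vec in vectors.items()
    }
    return sorted(scores, key=lambda d: (-scores[d], d))


by_similarity = rank(DOCS, "volcano", alpha=1.0)
combined = rank(DOCS, "volcano", alpha=0.5)
```

With `alpha=1.0` the ordering follows similarity alone, as in the homogeneous case. Lowering `alpha` lets an important document overtake one that merely repeats the query term more often. How to weight the two signals is a design choice that the slides leave open.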