CS 430 / INFO 430 Information Retrieval Lecture 13 Architecture of Information Retrieval Systems 1 Course Administration Assignment 2 Deadline changed to midnight on Sunday, October 9 There is major electrical work in Upson Hall on Saturday and most of the computers and labs will not be available. 2 Course Administration Midterm Examination Wednesday, October 12, 7:30 to 9:00, Upson B17 The topics to be are examined are all lectures and discussion class readings before the midterm break. See the Web site for a sample paper from a previous year. See the Web site for instructions about laptop computers. 3 Course Administration Discussion Class on October 19 This class will be held in Philips Hall 213 4 Notation Documents file Docs File of catalog records Catalog User interface Human action Physical objects User interface service UI 5 Searchable index Index Automatic process Single Homogeneous Collection: Full Text Indexing • Documents and indexes are held on a single computer system (may be several computers). • Information retrieval uses a full text index, which may be tuned to the specific corpus. Build index Search Index Examples: SMART, Lucene 6 Docs Single Homogeneous Collection: Use of Catalog Records • Documents may be digital or physical objects, e.g., books. • Documents are described by catalog records generated manually (or sometimes automatically). • Information retrieval uses an index of catalog records Build index Search Index Example: Library catalog 7 Create catalog Catalog Docs Several Similar Collections: One Computer System • Several more or less similar collections are held on a single computer system. • Each collection is indexed separately using the same software, procedures, algorithms, etc. (but tuned for each collection, e.g., different stoplists). Build indexes Search Index Index Index Example: PubMed 8 Docs Docs Docs Distributed Architecture: Standard Search Protocols Index 1 Index 2 9 Strict adherence to standards allows any user interface to search any conforming search service. Standard Search Protocols Example: Z 39.50 Family of Standards for Searching Library Catalogs The Z 39.50 family of standards has proved successful in a tightly knit community, where: • There is a strong tradition of standardization, with many professionally trained people. • The categories of material change gradually, allowing a slow-moving standardization process. The standardization approach has failed where these two criteria are not met. Historic note: WAIS was based on an early version of Z39.50. 10 Z39.50 principles • Servers store a set of databases with searchable indexes • Interactions are based on a session • The client opens a connection with the server(s), carries out a sequence of interactions and then closes the connection. • During the course of the session, both the server and the client remember the state of their interaction. 11 State Z39.50 • The server carries out the search and builds a results set • Server saves the results set. • Subsequent message from the client can reference the result set. • Thus the client can modify a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching entire database. 12 Standards Z 39.50 Family of Standards for Searching Library Catalogs Content: Anglo American Cataloging Rules Structure of Content: MARC Encoding Rules: Base Encoding Rules (character sets, separators, etc.) Message Passing Protocol: Z 39.50 Query Format: Bib 1 (Boolean), Type 102 (full text) In addition, there are the underlying network standards, e.g. the Internet suite of protocols. 13 Distributed Architecture: Meta-search (Broadcast Search) • A user interface service broadcasts a query to several indexes and merges the results. • Can be used with full text or catalogs. Searches Index 1 Search UI User interface service Index 2 Index n Example: Dienst 14 Distributed Architecture: Broadcast Search Interface Service: Can be a separate server (e.g., CGI), or run on the user's computer (e.g., applet). Protocols: In the simple version, each collection must support the same standards and protocols (e.g., Z 39.50, http). 15 Distributed Architecture: Broadcast Search Problems with Broadcast Search • Performance: If any collection does not respond, the Interface Server waits for a time out. • Recall: If any collection does not respond, documents in that collection are not found. • Ranking and duplicates: There are great difficulties in reconciling ranked lists from different collections. Broadcast searching is as bad as its weakest link! Conclusion: Broadcast search does not scale beyond about five or ten collections, even with strict standardization. 16 Union Catalog • Catalog records from several libraries are merged into a single union catalog • Information retrieval uses an index of the records in the union catalog Create catalog records Build Docs index Search Index to Union Catalog Union Catalog Example: Harvard University's Hollis system 17 Docs Use of Union Catalogs Search Index to Union Catalog Retrieve Union Catalog Docs 18 Docs Batch indexing: Metadata about all items is accumulated in a central system. Real-time searching: The user (a) searches the central index, (b) retrieves catalog records, (c) retrieves documents from collections. Building Union Catalogs Harvesting • Each collection makes a copy of its metadata (catalog records) available from a sever associated with the collection. • A search service harvests metadata from all collections on a regular cycle and builds a central search system. Advantages ... • Can index material from databases without explicit URLs. • Allows authentication and selection of material. but ... 19 • Requires that collections have metadata and support harvesting protocol (e.g., Open Archives Initiative Protocol for Metadata Harvesting). OAI Verbs • • • • • • 20 Identify – repository characteristics ListMetadataFormats – DC required ListSets – repository partitioning ListRecords – (selectively) harvest metadata ListIdentifiers – (selectively) harvest metadata identifiers GetRecord – known item retrieval OAI-PMH Key technical features • • • • • • 21 Simple HTTP encoding Built on of established XML standards Multiple metadata formats, but Dublin Core required Repository partitioning (sets) Selective harvesting (sets and dates) Clean partition between core and implementation-specific extensions – Multiple item-level metadata – Collection level metadata Open Archives Initiative Protocol for Metadata Harvesting See: http://www.openarchives.org/ Herbert Van de Sompel and Carl Lagoze, "The Santa Fe Convention of the Open Archives Initiative." D-Lib Magazine, 6(2), 2000 http://www.dlib.org/dlib/february00/vandesompeloai/02vandesompel-oai.html 22 Web Searching: Architecture • Documents stored on many Web servers are indexed in a single central index. (This is similar to a union catalog.) • The central index is implemented as a single system on a very large number of computers Build index Docs Search on Web Index to all Web pages Examples: Google, Yahoo! 23 server Docs on Web server Use of Web Search Service Search Index to all Web pages Retrieve Docs on Web server Docs on Web server 24 Batch indexing: Each Web page is brought to the central location and indexed. Real-time searching: The user (a) searches the central index, (b) retrieves documents (Web pages) from original location. Web Searching: Building the Index Documents are Web pages Each document is: • identified by Web Crawling • copied to a central location • indexed and added to the central index After indexing the documents are usually discarded, but a cached copy may be retained. Web searching is the topic of Lectures 19-21 and Discussion Classes 9 and 10. 25 Web Crawling Advantages of Web crawling • Entirely automatic, low cost. Highly efficient at gathering very large amounts of material. but ... • Can only gather openly accessible materials. • Cannot gather material in databases unless explicit URLs are known. • Cannot easily make use of metadata provided by collections. 26 Standardization: Function Versus Cost of Acceptance Cost of acceptance Few adopters Many adopters 27 Function Example: Textual Mark-up Cost of acceptance SGML XML HTML ASCII 28 Function