CS 430 / INFO 430 Information Retrieval Lecture 13 Architecture of Information Retrieval Systems 1 Course Administration Next Week Midterm examination on Wednesday evening No office hours or lecture on Thursday. 2 Course Administration Midterm Examination • See the Examinations page on the Web site • On Wednesday October 11, 7:30-9:00 p.m., Phillips 203, instead of the discussion class. • The topics to be are examined are all lectures and discussion class readings before the midterm break. See the Web site for a sample paper. • Laptops may be used to store course materials and your notes, but for no other purposes. Hand calculators are allowed. No communication. No other electronic devices. 3 Follow up on Discussion 6 Crawl and Stop • Crawler runs until K pages have been downloaded • Ideal crawl would download the K most highly ranked pages, R. • Actual crawl, downloads M pages from R, where M < K. the ratio M/K is a measure of the performance of the crawler. • With random crawling, probability that a given page in R is K/T, where T is total number of pages. Since K pages are downloaded, the expected number of pages that are from R is K x K/T. 4 Follow up on Discussion 6 Optimal refresh frequency assuming page changes follow a Poisson process and their change frequencies are static (i.e., do not change over time). 5 The graph does not monotonically increase over change frequency. The optimal is to refresh pages less often if the pages change too often. Notation Documents file Docs File of catalog records Catalog User interface Human action Physical objects User interface service UI 6 Searchable index Index Automatic process Single Homogeneous Collection: Full Text Indexing • Documents and indexes are held on a single computer system (may be several computers). • Information retrieval uses a full text index, which may be tuned to the specific corpus. Build index Search Index Examples: SMART, Lucene 7 Docs Lucene Apache Lucene • High-performance, full-featured text search engine library. • Written entirely in Java. • Suitable for nearly any application that requires fulltext search, especially cross-platform. • An open source project available for free download. 8 Single Homogeneous Collection: Use of Catalog Records • Documents may be digital or physical objects, e.g., books. • Documents are described by catalog records generated manually (or sometimes automatically). • Information retrieval uses an index of catalog records Build index Search Create catalog Docs Index Example: Library catalog 9 Catalog Several Similar Collections: One Computer System • Several more or less similar collections are held on a single computer system. • Each collection is indexed separately using the same software, procedures, algorithms, etc. (but tuned for each collection, e.g., different stoplists). Build indexes Search Index Index Index Example: PubMed 10 Docs Docs Docs Distributed Architecture: Standard Search Protocols • A user interface is configured so that the user can search several different indexes, one at a time User interface 1 User interface 2 Index 1 Index 2 Strict adherence to standards allows any user interface to search any conforming search service. 11 Standards for Searching Standard Query Languages • Example: Common Query Language Protocols for Distributed Searching • Example: SRW: Search/Retrieve Web Service http://www.loc.gov/standards/sru/srw/ • Example: Z39.50 Clifford A. Lynch, "The Z39.50 Information Retrieval Standard, Part I: A Strategic View of Its Past, Present and Future", D-Lib Magazine, April 1997. http://www.dlib.org/dlib/april97/04lynch.html 12 Standard Search Protocols Example: Z 39.50 Family of Standards for Searching Library Catalogs The Z 39.50 family of standards has proved successful in a tightly knit community, where: • There is a strong tradition of standardization, with many professionally trained people. • The categories of material change gradually, allowing a slow-moving standardization process. The standardization approach has failed where these two criteria are not met. Historical note: WAIS was based on an early version of Z39.50. 13 Z39.50: Principles • Servers store a set of databases with searchable indexes • Interactions are based on a session • The client opens a connection with the server(s), carries out a sequence of interactions and then closes the connection. • During the course of the session, both the server and the client remember the state of their interaction. 14 Z39.50: State • The server carries out the search and builds a results set • Server saves the results set. • Subsequent message from the client can reference the result set. • Thus the client can modify a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching entire database. 15 Z 39.50 Family of Standards for Searching Library Catalogs Content: Anglo American Cataloging Rules Structure of Content: MARC Encoding Rules: Base Encoding Rules (character sets, separators, etc.) Message Passing Protocol: Z 39.50 Query Format: Bib 1 (Boolean), Type 102 (full text) In addition, there are the underlying network standards, e.g. the Internet suite of protocols. 16 Distributed Architecture: Meta-search (Broadcast Search) • A user interface service broadcasts a query to several indexes and merges the results. • Can be used with full text or catalogs. Searches Index 1 Search UI User interface service Index 2 Index n Example: Dienst 17 Distributed Architecture: Broadcast Search Interface Service: Can be a separate server (e.g., CGI), or run on the user's computer (e.g., applet). Protocols: In the simple version, each collection must support the same standards and protocols (e.g., Z 39.50, http). 18 Distributed Architecture: Broadcast Search Problems with Broadcast Search • Performance: If any collection does not respond, the Interface Server waits for a time out. • Recall: If any collection does not respond, documents in that collection are not found. • Ranking and duplicates: There are great difficulties in reconciling ranked lists from different collections. Broadcast searching is as bad as its weakest link! Conclusion: Broadcast search does not scale beyond about five or ten collections, even with strict standardization. 19 Union Catalog • Catalog records from several libraries are merged into a single union catalog • Information retrieval uses an index of the records in the union catalog Create catalog records Build Docs index Search Index to Union Catalog Union Catalog Example: National Science Digital Library 20 Docs Use of Union Catalogs Search Index to Union Catalog Retrieve Union Catalog Docs 21 Docs Batch indexing: Metadata about all items is accumulated in a central system. Real-time searching: The user (a) searches the central index, (b) retrieves catalog records, (c) retrieves documents from collections. Building Union Catalogs Harvesting • Each collection makes a copy of its metadata (catalog records) available from a sever associated with the collection. • A search service harvests metadata from all collections on a regular cycle and builds a central search system. Advantages ... • Can index material from databases without explicit URLs. • Allows authentication and selection of material. but ... 22 • Requires that collections have metadata and support harvesting protocol (e.g., Open Archives Initiative Protocol for Metadata Harvesting). Open Archives Initiative Protocol for Metadata Harvesting See: http://www.openarchives.org/ Herbert Van de Sompel and Carl Lagoze, "The Santa Fe Convention of the Open Archives Initiative." D-Lib Magazine, 6(2), 2000 http://www.dlib.org/dlib/february00/vandesompeloai/02vandesompel-oai.html 23 Web Searching: Architecture • Documents stored on many Web servers are indexed in a single central index. (This is similar to a union catalog.) • The central index is implemented as a single system on a very large number of computers Search Build index Index to all Web pages Examples: Google, Yahoo! 24 Crawl Web pages retrieved by crawler Web servers with Web pages Use of Web Search Service Search Index to all Web pages Retrieve Docs on Web servers 25 Batch indexing: Each Web page is brought to the central location and indexed. Real-time searching: The user (a) searches the central index, (b) retrieves documents (Web pages) from original location. Web Searching: Building the Index Documents are Web pages Each document is: • identified by Web Crawling • copied to a central location • indexed and added to the central index After indexing the documents may be discarded, but a copy may be retained, for use by the user interface. Web searching is the topic of Lectures 15-18 and Discussion Classes 6, 7, and 8. 26 Web Crawling Advantages of Web crawling • Entirely automatic, low cost. Highly efficient at gathering very large amounts of material. but ... • Can only gather openly accessible materials. • Cannot gather material in databases unless explicit URLs are known. • Cannot easily make use of metadata provided by collections. 27 Standardization: Function Versus Cost of Acceptance Cost of acceptance Few adopters Many adopters 28 Function Example: Textual Mark-up Cost of acceptance SGML XML HTML ASCII 29 Function