CS 430 / INFO 430 Information Retrieval Architecture of Information Retrieval Systems

Lecture 13
Course Administration
Next Week
Midterm examination on Wednesday evening
No office hours or lecture on Thursday.
Course Administration
Midterm Examination
• See the Examinations page on the Web site
• On Wednesday October 11, 7:30-9:00 p.m., Phillips 203,
instead of the discussion class.
• The topics to be examined are all lectures and discussion
class readings before the midterm break. See the Web site
for a sample paper.
• Laptops may be used to store course materials and your
notes, but for no other purposes. Hand calculators are
allowed. No communication. No other electronic
devices.
Follow up on Discussion 6
Crawl and Stop
• Crawler runs until K pages have been downloaded
• Ideal crawl would download the K most highly ranked
pages, R.
• The actual crawl downloads M pages from R, where M < K.
The ratio M/K is a measure of the performance of the
crawler.
• With random crawling, the probability that a given downloaded
page is in R is K/T, where T is the total number of pages. Since
K pages are downloaded, the expected number of pages that are
from R is K x K/T, so the expected performance M/K is K/T.
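The arithmetic for the random crawler can be checked with a short Python sketch. The function names and the sample numbers are illustrative assumptions, not part of the discussion reading.

```python
def expected_hits(k: int, t: int) -> float:
    """Expected number of the K top-ranked pages found when a
    random crawler downloads K pages out of T."""
    return k * k / t


def expected_performance(k: int, t: int) -> float:
    """Expected value of M/K for a random crawler."""
    return expected_hits(k, t) / k


# Illustrative numbers only: crawl 1,000 pages of a 100,000-page Web.
print(expected_hits(1_000, 100_000))         # 10.0
print(expected_performance(1_000, 100_000))  # 0.01
```

A random crawler is therefore a weak baseline: its performance is simply the fraction of the Web that it downloads.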
Follow up on Discussion 6
[Graph: optimal refresh frequency as a function of change
frequency, assuming page changes follow a Poisson process and
change frequencies are static (i.e., do not change over time).]
The optimal refresh frequency does not increase monotonically
with change frequency. When a page changes too often, the
optimal policy is to refresh it less often.
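One way to see why very fast-changing pages are poor candidates for frequent refreshing is the standard single-page freshness result for the Poisson model: a page with change rate lam that is refreshed at regular intervals 1/f is up to date a fraction (1 - e^(-lam/f)) / (lam/f) of the time. The Python sketch below only evaluates that expression. It does not reproduce the optimization behind the graph, and the function name and sample rates are assumptions.

```python
import math


def expected_freshness(change_rate: float, refresh_rate: float) -> float:
    """Time-averaged probability that the local copy is current.

    Assumes changes follow a Poisson process with rate `change_rate`
    and the page is re-downloaded at regular intervals 1/refresh_rate.
    """
    if change_rate == 0:
        return 1.0  # a page that never changes is always fresh
    r = change_rate / refresh_rate
    return (1.0 - math.exp(-r)) / r


# With a fixed refresh rate, faster-changing pages are fresh less of the time.
for lam in (0.1, 1.0, 10.0):
    print(lam, round(expected_freshness(lam, refresh_rate=1.0), 3))
```

Each download of a rapidly changing page buys very little freshness, so a crawler with a fixed download budget does better by spending it elsewhere.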
Notation
Symbols used in the diagrams (graphics not reproduced here):
• Docs: documents file
• Catalog: file of catalog records
• Index: searchable index
• UI: user interface service
• Other symbols: user interface, human action, physical objects,
automatic process
Single Homogeneous Collection: Full
Text Indexing
• Documents and indexes are held on a single computer
system (may be several computers).
• Information retrieval uses a full text index, which may
be tuned to the specific corpus.
[Diagram: Docs -> Build index -> Index -> Search]
Examples: SMART, Lucene
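The build-index and search steps can be sketched as a toy inverted index in Python. This is a minimal illustration, not how SMART or Lucene is implemented. The stoplist, tokenizer, and sample documents are assumptions.

```python
from collections import defaultdict
import re

STOPLIST = {"the", "a", "of", "and", "is"}  # assumed; would be tuned to the corpus


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, and drop stoplist words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPLIST]


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Boolean AND search: documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "d1": "The architecture of information retrieval systems",
    "d2": "Information about Web crawling",
    "d3": "Retrieval of catalog records",
}
index = build_index(docs)
print(sorted(search(index, "information retrieval")))  # ['d1']
```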
Lucene
Apache Lucene
• High-performance, full-featured text search engine
library.
• Written entirely in Java.
• Suitable for nearly any application that requires full-text search, especially cross-platform.
• An open source project available for free download.
Single Homogeneous Collection:
Use of Catalog Records
• Documents may be digital or physical objects, e.g., books.
• Documents are described by catalog records generated
manually (or sometimes automatically).
• Information retrieval uses an index of catalog records
[Diagram: Docs -> Create catalog -> Catalog -> Build index -> Index -> Search]
Example: Library catalog
Several Similar Collections: One
Computer System
• Several more or less similar collections are held on a
single computer system.
• Each collection is indexed separately using the same
software, procedures, algorithms, etc. (but tuned for
each collection, e.g., different stoplists).
[Diagram: three Docs collections -> Build indexes -> three Indexes -> Search]
Example: PubMed
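Indexing each collection separately with the same procedure but a collection-specific stoplist might look like the Python sketch below. The collections, documents, and stoplists are hypothetical and chosen only to show that a word stopped in one collection can still be indexed in another.

```python
import re
from collections import defaultdict


def build_index(docs: dict[str, str], stoplist: set[str]) -> dict[str, set[str]]:
    """Same indexing procedure for every collection; only the stoplist varies."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            if term not in stoplist:
                index[term].add(doc_id)
    return dict(index)


# Hypothetical collections and stoplists, for illustration only.
collections = {
    "medicine": ({"m1": "Patient study of aspirin", "m2": "Study of patient recall"},
                 {"of", "study", "patient"}),
    "law": ({"l1": "Study of contract law"},
            {"of", "law"}),
}

# One index per collection, all held on the same system.
indexes = {name: build_index(docs, stoplist)
           for name, (docs, stoplist) in collections.items()}

print("study" in indexes["medicine"])  # False: stopped in this collection
print("study" in indexes["law"])       # True: indexed in this one
```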
Distributed Architecture: Standard
Search Protocols
• A user interface is configured so that the user can
search several different indexes, one at a time
[Diagram: User interface 1 and User interface 2 can each search Index 1 or Index 2]
Strict adherence to standards allows any
user interface to search any conforming
search service.
Standards for Searching
Standard Query Languages
• Example: Common Query Language
Protocols for Distributed Searching
• Example: SRW: Search/Retrieve Web Service
http://www.loc.gov/standards/sru/srw/
• Example: Z39.50
Clifford A. Lynch, "The Z39.50 Information Retrieval
Standard, Part I: A Strategic View of Its Past, Present and
Future", D-Lib Magazine, April 1997.
http://www.dlib.org/dlib/april97/04lynch.html
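As a concrete illustration of a standard query language carried over a standard protocol, the Python sketch below builds a request URL for SRU, the URL-based companion to SRW, with the query written in Common Query Language. The server address is hypothetical, and the parameter names follow SRU version 1.1 as an assumption; a real service should be checked against the version it supports.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://sru.example.org/catalog"  # hypothetical server


def sru_search_url(cql_query: str, start: int = 1, maximum: int = 10) -> str:
    """Build an SRU searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,        # the CQL query, URL-encoded below
        "startRecord": start,      # position of the first record wanted
        "maximumRecords": maximum, # how many records to return
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = sru_search_url('dc.title = "information retrieval"')
print(url)

# Decoding the URL recovers the original CQL query.
print(parse_qs(urlsplit(url).query)["query"][0])
```

Because the request format is fixed by the standard, the same function could address any conforming search service by changing only the base URL.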
Standard Search Protocols
Example: Z 39.50 Family of Standards for Searching Library
Catalogs
The Z 39.50 family of standards has proved successful in a
tightly knit community, where:
• There is a strong tradition of standardization, with
many professionally trained people.
• The categories of material change gradually, allowing
a slow-moving standardization process.
The standardization approach has failed where these two criteria
are not met.
Historical note: WAIS was based on an early version of Z39.50.
Z39.50: Principles
• Servers store a set of databases with searchable
indexes
• Interactions are based on a session
• The client opens a connection with the server(s),
carries out a sequence of interactions and then closes
the connection.
• During the course of the session, both the server and
the client remember the state of their interaction.
Z39.50: State
• The server carries out the search and builds a result set.
• The server saves the result set.
• Subsequent messages from the client can reference the
result set.
• Thus the client can narrow a large set by increasingly
precise requests, or can request a presentation of any
record in the set, without searching the entire database again.
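The value of server-side state can be shown with a toy model in Python. This is not the Z39.50 wire protocol: the class, method names, and sample records are assumptions made for illustration. The `scanned` counter shows that refining a saved result set examines far fewer records than a fresh search.

```python
class ToySession:
    """Toy model of a stateful search session (not the Z39.50 wire protocol)."""

    def __init__(self, database: dict[int, str]):
        self.database = database
        self.result_sets: dict[str, list[int]] = {}  # state kept by the server
        self.scanned = 0  # records examined so far, to show the saving

    def search(self, name: str, term: str) -> int:
        """Search the whole database and save the hits as a named result set."""
        hits = []
        for rec_id, text in self.database.items():
            self.scanned += 1
            if term in text.lower():
                hits.append(rec_id)
        self.result_sets[name] = hits
        return len(hits)

    def refine(self, source: str, name: str, term: str) -> int:
        """Narrow an existing result set without rescanning the database."""
        hits = []
        for rec_id in self.result_sets[source]:
            self.scanned += 1
            if term in self.database[rec_id].lower():
                hits.append(rec_id)
        self.result_sets[name] = hits
        return len(hits)

    def present(self, name: str, position: int) -> str:
        """Return one record from a saved result set (1-based position)."""
        return self.database[self.result_sets[name][position - 1]]


db = {1: "Digital libraries", 2: "Digital audio", 3: "Library catalogs", 4: "Web crawling"}
session = ToySession(db)
print(session.search("A", "digital"))      # 2
print(session.refine("A", "B", "librar"))  # 1
print(session.present("B", 1))             # Digital libraries
```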
Z 39.50 Family of Standards for
Searching Library Catalogs
Content: Anglo-American Cataloguing Rules
Structure of Content: MARC
Encoding Rules: Basic Encoding Rules (character sets,
separators, etc.)
Message Passing Protocol: Z 39.50
Query Format: Bib 1 (Boolean), Type 102 (full text)
In addition, there are the underlying network standards,
e.g. the Internet suite of protocols.
Distributed Architecture: Meta-search
(Broadcast Search)
• A user interface service broadcasts a query to several
indexes and merges the results.
• Can be used with full text or catalogs.
[Diagram: Search -> UI (user interface service) -> Searches -> Index 1, Index 2, ..., Index n]
Example: Dienst
Distributed Architecture: Broadcast
Search
Interface Service: Can be a separate server (e.g., CGI), or run
on the user's computer (e.g., applet).
Protocols: In the simple version, each collection must support
the same standards and protocols (e.g., Z 39.50, HTTP).
Distributed Architecture: Broadcast
Search
Problems with Broadcast Search
• Performance: If any collection does not respond, the
Interface Server waits for a time out.
• Recall: If any collection does not respond, documents in
that collection are not found.
• Ranking and duplicates: There are great difficulties in
reconciling ranked lists from different collections.
Broadcast searching is as bad as its weakest link!
Conclusion: Broadcast search does not scale beyond about
five or ten collections, even with strict standardization.
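The performance and recall problems can be seen in a small Python simulation. The collections here are local functions with an artificial delay standing in for remote servers, and the merge step is deliberately naive. All names, documents, and timings are assumptions of this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def make_collection(name, docs, delay=0.0):
    """Return a search function standing in for one remote collection."""
    def search(query):
        time.sleep(delay)  # simulated network and server latency
        return [(doc, name) for doc in docs if query in doc.lower()]
    return search


def broadcast_search(collections, query, timeout):
    """Send the query to every collection; keep only answers that arrive in time."""
    pool = ThreadPoolExecutor(max_workers=len(collections))
    futures = {pool.submit(search, query): name for name, search in collections.items()}
    done, not_done = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block on slow collections
    hits = [hit for f in done for hit in f.result()]
    missing = sorted(futures[f] for f in not_done)
    # Merging is naive: remove duplicates by title, no cross-collection ranking.
    merged = sorted({doc for doc, _ in hits})
    return merged, missing


collections = {
    "fast": make_collection("fast", ["Web crawling", "Web search"]),
    "mirror": make_collection("mirror", ["Web search"]),
    "slow": make_collection("slow", ["Web archives"], delay=1.0),
}
results, missing = broadcast_search(collections, "web", timeout=0.3)
print(results)  # ['Web crawling', 'Web search']
print(missing)  # ['slow']
```

The user waits for the full timeout because of one slow collection, and the document held only by that collection is never found. Reconciling ranked lists is avoided here entirely by sorting titles alphabetically.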
Union Catalog
• Catalog records from several libraries are merged into a
single union catalog
• Information retrieval uses an index of the records in the
union catalog
[Diagram: Docs (several collections) -> Create catalog records -> Union Catalog -> Build index -> Index to Union Catalog -> Search]
Example: National Science Digital Library
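Merging catalog records and indexing the result can be sketched in Python as follows. The record layout, the shared `id` field used to recognize the same item in two libraries, and the sample records are all assumptions of this sketch; real union catalogs face much harder matching problems.

```python
def build_union_catalog(library_catalogs):
    """Merge records from several libraries into one union catalog.

    Records describing the same item (same `id`) are combined, and the
    union record lists every library that holds the item.
    """
    union = {}
    for library, records in library_catalogs.items():
        for record in records:
            entry = union.setdefault(
                record["id"], {"title": record["title"], "holdings": []}
            )
            entry["holdings"].append(library)
    return union


def build_index(union):
    """Index union catalog records by the words in their titles."""
    index = {}
    for rec_id, record in union.items():
        for word in record["title"].lower().split():
            index.setdefault(word, set()).add(rec_id)
    return index


# Hypothetical records; the shared `id` field is an assumption of this sketch.
catalogs = {
    "Library A": [{"id": "isbn:1", "title": "Digital Libraries"}],
    "Library B": [{"id": "isbn:1", "title": "Digital Libraries"},
                  {"id": "isbn:2", "title": "Web Search"}],
}
union = build_union_catalog(catalogs)
index = build_index(union)
print(union["isbn:1"]["holdings"])  # ['Library A', 'Library B']
print(sorted(index["digital"]))     # ['isbn:1']
```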
Use of Union Catalogs
[Diagram: Search -> Index to Union Catalog; Retrieve -> Union Catalog and Docs]
Batch indexing: Metadata about all items is accumulated in a
central system.
Real-time searching: The user (a) searches the central index,
(b) retrieves catalog records, (c) retrieves documents from
collections.
Building Union Catalogs
Harvesting
• Each collection makes a copy of its metadata (catalog records)
available from a server associated with the collection.
• A search service harvests metadata from all collections on a regular
cycle and builds a central search system.
Advantages ...
• Can index material from databases without explicit URLs.
• Allows authentication and selection of material.
but ...
• Requires that collections have metadata and support a harvesting
protocol (e.g., Open Archives Initiative Protocol for Metadata
Harvesting).
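A harvesting cycle begins with one request to each collection. The Python sketch below builds those requests using the OAI-PMH `ListRecords` verb with the `oai_dc` metadata format. The repository addresses and the harvest date are hypothetical, and the sketch only constructs URLs; fetching and parsing the XML responses is left out.

```python
from urllib.parse import urlencode

# Hypothetical repository base URLs, one per collection.
REPOSITORIES = {
    "collection-a": "http://a.example.org/oai",
    "collection-b": "http://b.example.org/oai",
}


def list_records_url(base_url, from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # A resumptionToken continues a partial list; it is sent with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if from_date is not None:
            params["from"] = from_date  # harvest only records changed since then
    return f"{base_url}?{urlencode(params)}"


def harvest_cycle_urls(last_harvest):
    """First request to send to each collection in one harvesting cycle."""
    return {name: list_records_url(url, from_date=last_harvest)
            for name, url in REPOSITORIES.items()}


for name, url in harvest_cycle_urls("2006-10-01").items():
    print(name, url)
```

Passing the date of the previous harvest keeps each cycle incremental, so the central system only processes records that are new or changed.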
Open Archives Initiative Protocol for
Metadata Harvesting
See: http://www.openarchives.org/
Herbert Van de Sompel and Carl Lagoze, "The Santa Fe
Convention of the Open Archives Initiative." D-Lib Magazine,
6(2), 2000.
http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html
Web Searching: Architecture
• Documents stored on many Web servers are indexed in a
single central index. (This is similar to a union catalog.)
• The central index is implemented as a single system on a
very large number of computers
[Diagram: Web servers with Web pages -> Crawl -> Web pages retrieved by crawler -> Build index -> Index to all Web pages -> Search]
Examples: Google, Yahoo!
Use of Web Search Service
[Diagram: Search -> Index to all Web pages; Retrieve -> Docs on Web servers]
Batch indexing: Each Web page is brought to the central
location and indexed.
Real-time searching: The user (a) searches the central index,
(b) retrieves documents (Web pages) from their original location.
Web Searching: Building the Index
Documents are Web pages
Each document is:
• identified by Web Crawling
• copied to a central location
• indexed and added to the central index
After indexing the documents may be discarded, but a
copy may be retained, for use by the user interface.
Web searching is the topic of Lectures 15-18 and
Discussion Classes 6, 7, and 8.
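The three steps (identify by crawling, copy centrally, index) can be sketched in Python against a tiny in-memory stand-in for the Web. The URLs, page texts, and function name are made up for illustration, and a dictionary lookup takes the place of an HTTP fetch.

```python
from collections import deque

# A tiny in-memory "Web": URL -> (page text, outgoing links). Entirely made up.
WEB = {
    "http://a.example/": ("information retrieval course",
                          ["http://a.example/1", "http://b.example/"]),
    "http://a.example/1": ("web crawling notes", ["http://a.example/"]),
    "http://b.example/": ("web search architecture", []),
}


def crawl_and_index(seed, max_pages):
    """Identify pages by crawling, copy them centrally, then index them."""
    central_copy = {}            # pages brought to the central location
    index = {}                   # term -> set of URLs
    queue, seen = deque([seed]), {seed}
    while queue and len(central_copy) < max_pages:
        url = queue.popleft()
        text, links = WEB[url]   # stands in for an HTTP fetch
        central_copy[url] = text
        for term in text.split():
            index.setdefault(term, set()).add(url)
        for link in links:       # newly discovered pages join the queue
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return index, central_copy


index, cache = crawl_and_index("http://a.example/", max_pages=10)
print(sorted(index["web"]))  # ['http://a.example/1', 'http://b.example/']
print(len(cache))            # 3
```

Here the central copies are retained in `cache`; a service that discards documents after indexing would keep only `index`.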
Web Crawling
Advantages of Web crawling
• Entirely automatic, low cost. Highly efficient at gathering very
large amounts of material.
but ...
• Can only gather openly accessible materials.
• Cannot gather material in databases unless explicit URLs are
known.
• Cannot easily make use of metadata provided by collections.
Standardization: Function Versus Cost of Acceptance
[Figure: axes are function and cost of acceptance; regions labeled "Few adopters" and "Many adopters"]
Example: Textual Mark-up
[Figure: axes are function and cost of acceptance; points labeled SGML, XML, HTML, ASCII]