CS 430 / INFO 430 Information Retrieval Architecture of Information Retrieval Systems

advertisement
CS 430 / INFO 430
Information Retrieval
Lecture 13
Architecture of Information Retrieval
Systems
1
Course Administration
Assignment 2
Deadline changed to midnight on Sunday, October 9
There is major electrical work in Upson Hall on Saturday and
most of the computers and labs will not be available.
2
Course Administration
Midterm Examination
Wednesday, October 12, 7:30 to 9:00, Upson B17
The topics to be are examined are all lectures and
discussion class readings before the midterm break.
See the Web site for a sample paper from a previous
year.
See the Web site for instructions about laptop
computers.
3
Course Administration
Discussion Class on October 19
This class will be held in Philips Hall 213
4
Notation
Documents
file
Docs
File of catalog
records
Catalog
User
interface
Human
action
Physical
objects
User interface
service
UI
5
Searchable
index
Index
Automatic
process
Single Homogeneous Collection: Full
Text Indexing
• Documents and indexes are held on a single computer
system (may be several computers).
• Information retrieval uses a full text index, which may
be tuned to the specific corpus.
Build
index
Search
Index
Examples: SMART, Lucene
6
Docs
Single Homogeneous Collection:
Use of Catalog Records
• Documents may be digital or physical objects, e.g., books.
• Documents are described by catalog records generated
manually (or sometimes automatically).
• Information retrieval uses an index of catalog records
Build
index
Search
Index
Example: Library catalog
7
Create
catalog
Catalog
Docs
Several Similar Collections: One
Computer System
• Several more or less similar collections are held on a
single computer system.
• Each collection is indexed separately using the same
software, procedures, algorithms, etc. (but tuned for
each collection, e.g., different stoplists).
Build
indexes
Search
Index
Index
Index
Example: PubMed
8
Docs
Docs
Docs
Distributed Architecture: Standard
Search Protocols
Index 1
Index 2
9
Strict adherence to
standards allows any user
interface to search any
conforming search
service.
Standard Search Protocols
Example: Z 39.50 Family of Standards for Searching
Library Catalogs
The Z 39.50 family of standards has proved successful in a
tightly knit community, where:
• There is a strong tradition of standardization, with
many professionally trained people.
• The categories of material change gradually, allowing
a slow-moving standardization process.
The standardization approach has failed where these two
criteria are not met.
Historic note: WAIS was based on an early version of Z39.50.
10
Z39.50 principles
• Servers store a set of databases with searchable
indexes
• Interactions are based on a session
• The client opens a connection with the server(s),
carries out a sequence of interactions and then closes
the connection.
• During the course of the session, both the server and
the client remember the state of their interaction.
11
State
Z39.50
• The server carries out the search and builds a results set
• Server saves the results set.
• Subsequent message from the client can reference the
result set.
• Thus the client can modify a large set by increasingly
precise requests, or can request a presentation of any
record in the set, without searching entire database.
12
Standards
Z 39.50 Family of Standards for Searching Library
Catalogs
Content: Anglo American Cataloging Rules
Structure of Content: MARC
Encoding Rules: Base Encoding Rules (character sets,
separators, etc.)
Message Passing Protocol: Z 39.50
Query Format: Bib 1 (Boolean), Type 102 (full text)
In addition, there are the underlying network standards,
e.g. the Internet suite of protocols.
13
Distributed Architecture: Meta-search
(Broadcast Search)
• A user interface service broadcasts a query to several
indexes and merges the results.
• Can be used with full text or catalogs.
Searches
Index 1
Search
UI
User interface
service
Index 2
Index n
Example: Dienst
14
Distributed Architecture: Broadcast
Search
Interface Service: Can be a separate server (e.g., CGI), or run
on the user's computer (e.g., applet).
Protocols: In the simple version, each collection must support
the same standards and protocols (e.g., Z 39.50, http).
15
Distributed Architecture: Broadcast
Search
Problems with Broadcast Search
• Performance: If any collection does not respond, the
Interface Server waits for a time out.
• Recall: If any collection does not respond, documents in
that collection are not found.
• Ranking and duplicates: There are great difficulties in
reconciling ranked lists from different collections.
Broadcast searching is as bad as its weakest link!
Conclusion: Broadcast search does not scale beyond about
five or ten collections, even with strict standardization.
16
Union Catalog
• Catalog records from several libraries are merged into a
single union catalog
• Information retrieval uses an index of the records in the
union catalog
Create catalog
records
Build
Docs
index
Search
Index to
Union
Catalog
Union
Catalog
Example: Harvard University's Hollis system
17
Docs
Use of Union Catalogs
Search
Index to
Union
Catalog
Retrieve
Union
Catalog
Docs
18
Docs
Batch indexing:
Metadata about all
items is accumulated
in a central system.
Real-time searching:
The user (a) searches
the central index, (b)
retrieves catalog
records, (c) retrieves
documents from
collections.
Building Union Catalogs
Harvesting
• Each collection makes a copy of its metadata (catalog records)
available from a sever associated with the collection.
• A search service harvests metadata from all collections on a regular
cycle and builds a central search system.
Advantages ...
• Can index material from databases without explicit URLs.
• Allows authentication and selection of material.
but ...
19
• Requires that collections have metadata and support harvesting
protocol (e.g., Open Archives Initiative Protocol for Metadata
Harvesting).
OAI Verbs
•
•
•
•
•
•
20
Identify – repository characteristics
ListMetadataFormats – DC required
ListSets – repository partitioning
ListRecords – (selectively) harvest metadata
ListIdentifiers – (selectively) harvest metadata identifiers
GetRecord – known item retrieval
OAI-PMH Key technical features
•
•
•
•
•
•
21
Simple HTTP encoding
Built on of established XML standards
Multiple metadata formats, but Dublin Core required
Repository partitioning (sets)
Selective harvesting (sets and dates)
Clean partition between core and implementation-specific
extensions
– Multiple item-level metadata
– Collection level metadata
Open Archives Initiative Protocol for
Metadata Harvesting
See: http://www.openarchives.org/
Herbert Van de Sompel and Carl Lagoze, "The Santa Fe
Convention of the Open Archives Initiative." D-Lib Magazine,
6(2), 2000 http://www.dlib.org/dlib/february00/vandesompeloai/02vandesompel-oai.html
22
Web Searching: Architecture
• Documents stored on many Web servers are indexed in a
single central index. (This is similar to a union catalog.)
• The central index is implemented as a single system on a
very large number of computers
Build index
Docs
Search
on Web
Index to
all Web
pages
Examples: Google, Yahoo!
23
server
Docs
on Web
server
Use of Web Search Service
Search
Index to
all Web
pages
Retrieve
Docs
on Web
server
Docs
on Web
server
24
Batch indexing: Each
Web page is brought
to the central location
and indexed.
Real-time searching:
The user (a) searches
the central index, (b)
retrieves documents
(Web pages) from
original location.
Web Searching: Building the Index
Documents are Web pages
Each document is:
• identified by Web Crawling
• copied to a central location
• indexed and added to the central index
After indexing the documents are usually discarded, but a cached
copy may be retained.
Web searching is the topic of Lectures 19-21 and Discussion
Classes 9 and 10.
25
Web Crawling
Advantages of Web crawling
• Entirely automatic, low cost. Highly efficient at gathering very
large amounts of material.
but ...
• Can only gather openly accessible materials.
• Cannot gather material in databases unless explicit URLs are
known.
• Cannot easily make use of metadata provided by collections.
26
Standardization: Function Versus Cost
of Acceptance
Cost of acceptance
Few
adopters
Many
adopters
27
Function
Example: Textual Mark-up
Cost of acceptance
SGML
XML
HTML
ASCII
28
Function
Download