CS 430: Information Discovery. Lecture 26: Automated Information Retrieval

Course Administration
Information Discovery
People have many reasons to look for information:
• Known item: Where will I find the wording of the US Copyright Act?
• Facts: What is the capital of Barbados?
• Introduction or overview: How do diesel engines work?
• Related information: Is there a review of this article?
• Comprehensive search: What is known of the effects of global warming on hurricanes?
Types of Information Discovery
[Figure: types of information discovery. Labels: media type (text; image, video, audio, etc.); searching, browsing, linking; statistical; natural language processing (CS 474); catalogs, indexes (metadata); user-in-loop; by user; no human effort; CS 502.]
Automated information discovery
Creating catalog records manually is labor intensive
and hence expensive.
The aim of automatic indexing is to build indexes
and retrieve information without human intervention.
The aim of automated information discovery is for
users to discover information without using skilled
human effort to build indexes.
Resources for automated information discovery
Computer power
• brute force computing
• ranking methods
• automatic generation of metadata
The intelligence of the user
• browsing
• relevance feedback
• information visualization
Brute force computing
Few people really understand Moore's Law
-- Computing power doubles every 18 months
-- Increases 100 times in 10 years
-- Increases 10,000 times in 20 years
Simple algorithms + immense computing power
may outperform human intelligence
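The growth figures above can be checked with a short calculation. This is a minimal sketch that assumes only the 18-month doubling period stated on the slide; the function name `growth_factor` is made up for illustration.

```python
# Growth implied by a fixed doubling period (18 months, as stated on the slide).
DOUBLING_MONTHS = 18


def growth_factor(years, doubling_months=DOUBLING_MONTHS):
    """Return how many times computing power multiplies over `years`."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings


for years in (10, 20):
    print(f"{years} years: about {growth_factor(years):,.0f} times")
```

The results come out slightly above 100 and 10,000, which is consistent with the rounded figures on the slide.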
Problems with (old-fashioned) Boolean searching
With Boolean retrieval, a document either matches a query
exactly or not at all
• Encourages short queries
• Requires precise choice of index terms (professional indexing)
• Requires precise formulation of queries (professional training)
Relevance and Ranking
Classical methods assume that a document is either relevant to a
query or not relevant.
Often a user will consider a document to be partially relevant.
Ranking methods: measure the degree of similarity between a
query and a document.
[Figure: requests and documents connected by a similarity measure.]
Similar: How similar is a document to a request?
Contrast with (old-fashioned) Boolean searching
With Boolean retrieval, a document either matches a query exactly
or not at all
• Encourages short queries
• Requires precise choice of index terms
• Requires precise formulation of queries (professional training)
With retrieval using similarity measures, similarities range from 0
to 1 for all documents
• Encourages long queries (to have as many dimensions as possible)
• Benefits from large numbers of index terms
• Permits queries with many terms, not all of which need
match the document
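The contrast above can be illustrated with a toy example. This is a sketch only: the three documents, the query, and the function names are hypothetical, and the similarity used here is the cosine of binary term vectors, which is one possible measure in the range 0 to 1.

```python
import math

# Toy collection: each document is a set of index terms (hypothetical data).
docs = {
    "d1": {"diesel", "engine", "fuel"},
    "d2": {"diesel", "engine", "history", "design"},
    "d3": {"hurricane", "warming"},
}


def boolean_and(query):
    """Boolean AND: a document matches only if it contains every query term."""
    return sorted(name for name, terms in docs.items() if query <= terms)


def ranked(query):
    """Rank all documents by cosine similarity of binary term vectors."""
    scores = []
    for name, terms in docs.items():
        overlap = len(query & terms)
        score = overlap / math.sqrt(len(query) * len(terms))
        scores.append((name, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


query = {"diesel", "engine", "design", "efficiency"}
print(boolean_and(query))  # no document contains all four terms
print(ranked(query))       # every document gets a score; partial matches rank high
```

With this long query the Boolean search returns nothing, while the ranked search still places the closest partial match first.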
SMART System
An experimental system for automatic information retrieval
• automatic indexing to assign terms to documents and queries
• identify documents to be retrieved by calculating similarities between documents and queries
• collect related documents into common subject classes
• procedures for producing an improved search query based on information obtained from earlier searches
Gerard Salton and colleagues
Harvard 1964-1968
Cornell 1968-1988
The index term vector space
The space has as many dimensions as there are terms in the word list.
[Figure: document vectors d1 and d2 in a space with term axes t1, t2, t3.]
Vector similarity computation
Documents in a collection are assigned terms from a set of n terms.
The term assignment array T is defined as follows:
if term j does not occur in document i, $t_{ij} = 0$
if term j occurs in document i, $t_{ij}$ is greater than zero
(the value of $t_{ij}$ is called the weight of term j in document i)
Similarity between $d_i$ and $d_j$ is defined as
$$\cos(d_i, d_j) = \frac{\sum_{k=1}^{n} t_{ik} \, t_{jk}}{|d_i| \, |d_j|}$$
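The similarity defined above can be computed directly from two rows of the term assignment array. This is a minimal sketch; the example vectors are made up, and returning 0 for a document with no terms is a convention assumed here rather than something the slides specify.

```python
import math


def cosine(d_i, d_j):
    """Cosine of the angle between two term-weight vectors of equal length n."""
    dot = sum(t_ik * t_jk for t_ik, t_jk in zip(d_i, d_j))
    norm_i = math.sqrt(sum(t * t for t in d_i))
    norm_j = math.sqrt(sum(t * t for t in d_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # assumed convention for a document with no terms
    return dot / (norm_i * norm_j)


d1 = [1, 1, 0]  # terms 1 and 2 occur in document 1
d2 = [1, 0, 1]  # terms 1 and 3 occur in document 2
print(cosine(d1, d2))
```

The two example documents share one of their two terms, so the similarity is about 0.5.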
Term weighting
Zipf's Law: If the words, w, in a collection are ranked, r(w),
by their frequency, f(w), they roughly fit the relation:
r(w) * f(w) = c
This suggests that some terms are more effective than others in
retrieval.
In particular, relative frequency is a useful measure. It
identifies terms that occur with substantial frequency in some
documents but with relatively low overall collection
frequency.
Term weights are functions that are used to quantify these
concepts.
Term Frequency
Concept
A term that appears many times within a document is
likely to be more important than a term that appears
only once.
Inverse Document Frequency
Concept
A term that occurs in a few documents is likely to be a
better discriminator than a term that appears in most or all
documents.
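The two concepts, term frequency and inverse document frequency, are often combined into a single weight. The sketch below assumes one common form, tf * log(N / df); the slides do not give a specific formula, and the toy documents are hypothetical.

```python
import math
from collections import Counter

# Toy collection of tokenized documents (hypothetical data).
docs = [
    "diesel engine fuel diesel".split(),
    "engine design history".split(),
    "hurricane warming engine".split(),
]


def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


weights = tf_idf(docs)
for w in weights:
    print({term: round(value, 3) for term, value in w.items()})
```

In this example "engine" occurs in every document, so its weight is zero, while "diesel" occurs twice in one document and nowhere else, so it receives the highest weight.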
Ranking -- Practical Experience
1. Basic method is inner (dot) product with no weighting
2. Cosine (dividing by product of lengths) normalizes for vectors of different lengths
3. Term weighting using frequency of terms in document usually improves ranking
4. Term weighting using an inverse function of terms in the entire collection improves ranking (e.g., IDF)
5. Weightings for document structure improve ranking
6. Relevance weightings after initial retrieval improve ranking
Effectiveness of methods depends on characteristics of the
collection. In general, there are few improvements beyond
simple weighting schemes.
PageRank Algorithm (Google)
Concept:
The rank of a web page is higher if many pages link to it.
Links from highly ranked pages are given greater weight
than links from less highly ranked pages.
Google PageRank Model
A user:
1. Starts at a random page on the web
2a. With probability p, selects any random page and jumps to it
2b. With probability 1-p, selects a random hyperlink from the
current page and jumps to the corresponding page
3. Repeats Step 2a and 2b a very large number of times
Pages are ranked according to the relative frequency with
which they are visited.
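The random user model above can be sketched in a few lines. Rather than simulating single jumps, this version repeatedly updates the expected fraction of visits to each page, which converges to the same relative frequencies. The jump probability p = 0.15, the four-page web, and the handling of pages without outgoing links are assumptions made for illustration.

```python
def pagerank(links, p=0.15, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # Step 1: start at a random page
    for _ in range(iterations):               # Step 3: repeat many times
        # Step 2a: with probability p, jump to any page at random.
        new_rank = {page: p / n for page in pages}
        for page in pages:
            targets = links[page]
            if targets:
                # Step 2b: with probability 1-p, follow a random hyperlink.
                share = (1 - p) * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links jumps to a random page.
                for target in pages:
                    new_rank[target] += (1 - p) * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page C has the most incoming links and ranks highest, and page D, which no page links to, ranks lowest. Note that the ranking uses no query.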
Compare TF.IDF to PageRank
With TF.IDF, documents are ranked depending on how well they
match a specific query.
With PageRank, the pages are ranked in order of importance, with
no reference to a specific query.
Latent Semantic Indexing
Objective
Replace indexes that use sets of index terms by indexes
that use concepts.
Approach
Map the index term vector space into a lower
dimensional space, using singular value decomposition.
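The mapping into a lower dimensional space can be sketched with NumPy's singular value decomposition. This is an illustration under stated assumptions: the term-document matrix is a made-up toy example, k = 2 concept dimensions is an arbitrary choice, and NumPy is assumed to be installed.

```python
import numpy as np

# Rows are terms, columns are documents 0..4 (hypothetical toy counts).
terms = ["car", "automobile", "engine", "hurricane", "storm"]
A = np.array([
    [1, 0, 1, 0, 0],  # car
    [0, 1, 1, 0, 0],  # automobile
    [0, 0, 1, 0, 0],  # engine
    [0, 0, 0, 1, 1],  # hurricane
    [0, 0, 0, 1, 0],  # storm
], dtype=float)

k = 2  # number of concept dimensions to keep (an assumption)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# One k-dimensional concept vector per document.
doc_concepts = (np.diag(s[:k]) @ Vt[:k, :]).T


def cosine(u, v):
    """Cosine similarity between two NumPy vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


print("term space, doc 0 vs doc 1:   ", cosine(A[:, 0], A[:, 1]))
print("concept space, doc 0 vs doc 1:", cosine(doc_concepts[0], doc_concepts[1]))
print("concept space, doc 0 vs doc 3:", cosine(doc_concepts[0], doc_concepts[3]))
```

Documents 0 and 1 share no index term, so their similarity in term space is zero, yet both co-occur with document 2 and end up close together in the concept space. Document 3 stays unrelated to them.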
Use of Concept Space: Term Suggestion
Non-Textual Materials
Content                  Attribute
maps                     lat. and long., content
photograph               subject, date and place
bird songs and images    field mark, bird song
software                 task, algorithm
data set                 survey characteristics
video                    subject, date, etc.
Direct Searching of Content
Sometimes it is possible to match a query against the content
of a digital object. The effectiveness varies from field to field.
Examples
• Images -- crude characteristics of color, texture, shape, etc.
• Music -- optical recognition of score
• Bird song -- spectral analysis of sounds
• Fingerprints
Image Retrieval: Blobworld
Automated generation of metadata
• Vector methods are for textual material
only.
• Metadata is needed for non-textual
materials. (Vector methods can be applied
to textual metadata.)
• Automated extraction of metadata is still
weak because of the semantic knowledge
needed.
Surrogates for non-textual materials
[Figure: a photograph and its surrogate, a textual catalog record about the non-textual item.]
Text-based methods of information retrieval can search a
surrogate for a photograph.
Library of Congress catalog record
CREATED/PUBLISHED: [between 1925 and 1930?]
SUMMARY: U. S. President Calvin Coolidge sits at a desk and
signs a photograph, probably in Denver, Colorado. A group of
unidentified men look on.
NOTES: Title supplied by cataloger. Source: Morey Engle.
SUBJECTS:
Coolidge, Calvin,--1872-1933.
Presidents--United States--1920-1930.
Autographing--Colorado--Denver--1920-1930.
Denver (Colo.)--1920-1930.
Photographic prints.
MEDIUM: 1 photoprint ; 21 x 26 cm. (8 x 10 in.)
Photographs: Cataloguing Difficulties
Automatic
• Image recognition methods are very primitive
Manual
• Photographic collections can be very large
• Many photographs may show the same subject
• Photographs have little or no internal metadata (no
title page)
• The subject of a photograph may not be known
(Who are the people in a picture? Where is the
location?)
Automatic record for George W. Bush home page
DC-dot applied to http://www.georgewbush.com/
<link rel="schema.DC" href="http://purl.org/dc">
<meta name="DC.Subject" content="George W. Bush; Bush;
George Bush; President; republican; 2000 election; election;
presidential election; George; B2K; Bush for President; Junior;
Texas; Governor; taxes; technology; education; agriculture;
health care; environment; society; social security; medicare;
income tax; foreign policy; defense; government">
<meta name="DC.Description" content="George W. Bush is
running for President of the United States to keep the country
prosperous.">
<meta name="DC.Publisher" content="Concentric Network
Corporation">
<meta name="DC.Date" scheme="W3CDTF" content="2001-01-12">
<meta name="DC.Type" scheme="DCMIType" content="Text">
<meta name="DC.Format" content="text/html">
<meta name="DC.Format" content="12223 bytes">
<meta name="DC.Identifier"
content="http://www.georgewbush.com/">
Informedia: the need for metadata
A video sequence is awkward for information discovery:
• Textual methods of information retrieval cannot be applied
• Browsing requires the user to view the sequence. Fast skimming
is difficult.
• Computing requirements are demanding (MPEG-1 requires 1.2
Mbits/sec).
Surrogates are required
Multi-Modal Information Discovery
The multi-modal approach to information retrieval
• computer programs to analyze video materials for clues, e.g., changes of scene.
• methods from artificial intelligence, e.g., speech recognition, natural language processing, image recognition.
• analysis of video track, sound track, closed captioning if present, any other information.
Each mode gives imperfect information. Therefore use
many approaches and combine the evidence.
Informedia Library Creation
[Figure: Informedia library creation. Video, audio, and text are processed by speech recognition, image extraction, natural language interpretation, and segmentation to produce segments with derived metadata.]
Harnessing the intelligence of the user
• Relevance feedback
• Support for browsing
• Information visualization
The Human in the Loop
[Figure: the human in the loop. The user searches an index, which returns hits, and browses a repository, which returns objects.]
Informedia: Information Discovery
[Figure: Informedia information discovery. The user queries via natural language and browses via multimedia surrogates; the library of segments with derived metadata returns the requested segments and metadata.]
MIRA
Evaluation Frameworks for Interactive Multimedia
Information Retrieval Applications
• Information Retrieval techniques are beginning to be
used in complex goal and task oriented systems whose
main objectives are not just the retrieval of information.
• New original research in IR is being blocked or
hampered by the lack of a broader framework for
evaluation.
European study, 1996-99
MIRA Aims
• Bring the user back into the evaluation process.
• Understand the changing nature of IR tasks and their evaluation.
• 'Evaluate' traditional evaluation methodologies.
• Consider how evaluation can be prescriptive of IR design.
• Move towards a balanced approach (system versus user).
• Understand how interaction affects evaluation.
• Support the move from static to dynamic evaluation.
• Understand how new media affects evaluation.
• Make evaluation methods more practical for smaller groups.
• Spawn new projects to develop new evaluation frameworks.
Feedback in the Vector Space Model
Document vectors as points on a surface
• Normalize all document vectors to be of length 1
• Then the ends of the vectors all lie on a surface with unit radius
• For similar documents, we can represent parts of this surface as a flat region
• Similar documents are represented as points that are close together on this surface
Relevance feedback (concept)
[Figure: hits from the original search plotted as points. x marks documents identified as non-relevant, o marks documents identified as relevant. The figure also marks the original query and the reformulated query.]
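One way to compute a reformulated query from the user's judgments is a Rocchio-style update, a standard technique that the slides do not name. The sketch below is illustrative only: the weights alpha, beta, and gamma and the example vectors are assumptions, as is the clipping of negative weights to zero.

```python
def reformulate(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non-relevant ones."""

    def centroid(vectors):
        # Average vector of a group of documents (zero vector if the group is empty).
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    pos = centroid(relevant)
    neg = centroid(non_relevant)
    updated = [alpha * q + beta * r - gamma * s for q, r, s in zip(query, pos, neg)]
    # Negative weights are clipped to zero, a convention assumed here.
    return [max(0.0, w) for w in updated]


original = [1.0, 0.0, 0.0]                      # query uses term 1 only
relevant = [[1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]   # documents marked o
non_relevant = [[0.0, 0.0, 1.0]]                # documents marked x
new_query = reformulate(original, relevant, non_relevant)
print(new_query)
```

The reformulated query gains weight on term 2, which appears in the relevant documents, and keeps zero weight on term 3, which appears only in the non-relevant one.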
Document clustering (concept)
[Figure: documents plotted as points (x) that fall into groups.]
Document clusters are a form of automatic classification.
A document may be in several clusters.
Browsing in Information Space
[Figure: documents plotted as points (x), with a marked starting point for browsing.]
Effectiveness depends on
(a) Starting point
(b) Effective feedback
(c) Convenience
User Interface Concepts
Users need a variety of ways to search and browse, depending
on the task being carried out and preferred style of working
• Visual icons
one-line headlines
film strip views
video skims
transcript following of audio track
• Collages
• Semantic zooming
• Results set
• Named faces
• Skimming
Alexandria User Interface
Information Visualization: Tilebars
The figure represents a set of hits
from a text search.
Each large rectangle represents a
document or section of text.
Each row represents a search term or
subquery.
The density of each small square
indicates the frequency with which a
term appears in a section of a
document.
Hearst 1995
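The data behind such a display is a small grid of counts, one row per search term and one column per section. The sketch below shows how that grid could be built; it is not Hearst's implementation, and the sections, terms, and text rendering are hypothetical.

```python
def tilebar(sections, terms):
    """Return one row per search term and one column per section of text."""
    grid = []
    for term in terms:
        # Count how often the term appears in each section.
        row = [section.lower().split().count(term) for section in sections]
        grid.append(row)
    return grid


sections = [
    "diesel engines burn diesel fuel",
    "engine design history",
    "fuel prices",
]
search_terms = ["diesel", "fuel"]
grid = tilebar(sections, search_terms)

# Crude text rendering: more '#' characters stand for a denser square.
for term, row in zip(search_terms, grid):
    print(term.ljust(8), " ".join("#" * count or "." for count in row))
```

Each row shows where in the document a term is concentrated, which is the information the density of the small squares conveys.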
Information Visualization: Dendrogram
[Figure: dendrogram with a vertical scale from 1 to 6 and leaves labeled alpha, delta, golf, bravo, echo, charlie, foxtrot.]
Information Visualization: Self-Organizing Maps (SOM)
Google has proved ...
For a very wide range of users, entirely automated selection,
indexing, and ranking, combined with searching by untrained users
and online browsing, is a very effective form of information discovery.
Searching
Changing users, changing user interfaces
From                         To
Trained user or librarian    Untrained user
Controlled vocabulary        Natural language
Fielded searching            Unfielded text
Manually created records     Full text
Boolean algorithms           Ranking methods
Stateful protocols           Stateless protocols
Information Discovery: 1991 and 2001
                     1991          2001
Content              print         online
Computing            expensive     inexpensive
Choice of content    selective     comprehensive
Index creation       human         automatic
Frequency            one time      monthly
Vocabulary           controlled    not controlled
Query                Boolean       ranked retrieval
Users                trained       untrained