Multimedia Information Retrieval

Information Retrieval (IR)
Deals with the representation, storage
and retrieval of unstructured data
Topics of interest: systems, languages,
retrieval, user interfaces, data
visualization, distributed data sets
Classical IR deals mainly with text
The evolution of multimedia databases
and of the web has given new interest
to IR
E.G.M Petrakis
Multimedia Information
Retrieval
1
Multimedia IR
A multimedia information system that
can store and retrieve:
attributes
text
2D grey-scale and color images
1D time series
digitized voice or music
video
Applications
Financial, marketing: stock prices, sales etc.
e.g., find companies whose stock prices move similarly
Scientific databases: sensor data,
weather, geological, environmental data
Office automation, electronic encyclopedias,
electronic books
Medical databases: X-rays, CT, MRI scans
Criminal investigation: suspects, fingerprints
Personal archives: text and color images
Queries
Formulation of user’s information need
Free-text query
By example document (e.g., text, image)
e.g., in a collection of color photos find
those showing a tree close to a house
Retrieval is based on the understanding
of the content of documents and of
their components
Goals of Retrieval
Accuracy: retrieve documents that the
user expects in the answer
With as few incorrect answers as possible
All relevant answers are retrieved
Speed: retrieval has to be fast
The system responds in real time
Accuracy of Retrieval
Depends on query criteria, query
complexity and specificity
Attributes / Content criteria?
Depends on what is matched with the
query
How document content is represented
Typically as feature vectors
Depends also on the matching function
e.g., Euclidean distance
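As a minimal sketch of matching by feature vectors, the snippet below computes the Euclidean distance between a query vector and each stored document vector and picks the closest one. The document names and vector values are made-up illustrative data, not taken from the slides.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical 3-dimensional feature vectors (illustrative values only).
documents = {"Doc1": [1.0, 0.0, 2.0], "Doc2": [4.0, 4.0, 2.0]}
query = [1.0, 0.0, 0.0]

# The lower the distance, the more similar the document is to the query.
distances = {name: euclidean_distance(query, vec) for name, vec in documents.items()}
best_match = min(distances, key=distances.get)
```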
Speed of Retrieval
The query is compared (matched) with all
stored documents
The definition of similarity criteria (a similarity
or distance function between documents) is an
important issue
Matching has to be fast: document matching has to be
computationally efficient and sequential searching
must be avoided
Indexing: search only the documents that are likely to
match the query
Main Idea
Descriptions are extracted and stored
Same or separate storage with documents
Queries address the stored descriptions
rather than the documents themselves
Images & video: the majority of stored
data
Solution: retrieval by text
Text retrieval is a well researched area
Most systems work with text, attributes
Database
[Figure: a query goes through an index to a database holding descriptions (Doc1-FeatureVector, Doc2-FeatureVector, ..., DocN-FeatureVector) and documents (Doc1, Doc2, ..., DocN)]
Problem
Text descriptions for images and video
contained in documents are not available
e.g., the text in a web site is not always
descriptive of every particular image
contained in the web site
Two approaches
human annotations
feature extraction
Human Annotations
Humans insert attributes, captions for
images and videos
inconsistent or subjective over time and
users (different users do not give the
same descriptions)
retrievals will fail if queries are
formulated using different keywords or
descriptions
time consuming process, expensive or
impossible for large databases
Feature Extraction
Features are extracted from audio,
image and video
cheaper and replicable approach
consistent descriptions but
inexact
low-level features (patterns, colors etc.)
difficult to extract meaning
different techniques for different data
Architecture of an IR System for Images
[Figure from Furht et al., 96]
Multimedia IR in Practice
Relies on text descriptions
Non-text IR
there is nothing similar to text-based
retrieval systems
automatic IR is sometimes impossible
requires human intervention at some level
significant progress has been made
domain specific interpretations
Ranking of IR Methods
Ranking in terms of complexity and accuracy:
attributes
text
audio
image
video
[Figure: the list above is annotated with two arrows labeled accuracy and complexity]
Combined IR based on two or more data types
for more accurate retrievals
Complex data types (e.g., video) are rich
sources of information
Similarity Searching
IR is not an exact process
two documents are never identical!
they can be “similar”
searching must be “approximate”
The effectiveness of IR depends on the
types and correctness of descriptions used
types of queries allowed
user uncertainty as to what they are looking for
efficiency of search techniques
Approximate IR
A query is specified and all documents
up to a pre-specified degree of
similarity are retrieved and presented
to the user ordered by similarity
Two common types of similarity queries:
range queries: retrieve all documents up to
a distance “threshold” T
nearest-neighbor queries: retrieve the k
best matches
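The two query types can be sketched as follows, assuming documents are stored as numeric feature vectors compared by Euclidean distance. The function names, the tiny collection and the threshold are illustrative assumptions; both functions scan every document, so they show the semantics of the queries rather than an indexed search.

```python
import heapq
import math


def range_query(query, documents, threshold):
    """All documents within distance `threshold` of the query, closest first."""
    hits = ((math.dist(query, vec), name) for name, vec in documents.items())
    return sorted((d, name) for d, name in hits if d <= threshold)


def nearest_neighbor_query(query, documents, k):
    """The k documents closest to the query, closest first."""
    scored = ((math.dist(query, vec), name) for name, vec in documents.items())
    return heapq.nsmallest(k, scored)


# Hypothetical 2-dimensional feature vectors (illustrative values only).
documents = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (6.0, 8.0)}
query = (0.0, 0.0)

within_t = range_query(query, documents, threshold=5.0)
top_1 = nearest_neighbor_query(query, documents, k=1)
```

`math.dist` requires Python 3.8 or later.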
Distance - Similarity
Decide whether two documents are
similar
distance: the lower the distance, the more
similar the documents are
similarity: the higher the similarity, the
more similar the documents are
Key issue for successful retrieval: the
more accurate the descriptions are, the
more accurate the retrieval is
Retrieval Quality Criteria
Retrieval with as few errors as possible
Two types of errors:
false dismissals or misses: qualifying but
not retrieved documents
false positives or false drops: retrieved
but not qualifying documents
A good method minimizes both
Ranking quality: retrieve qualifying
documents before non-qualifying ones
Evaluation of Retrieval
“Given a collection of documents, a set of
queries and human expert’s responses to the
above queries, the ideal system will retrieve
exactly what the human dictated”
The deviations from the above are measured
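\[ \text{precision} = \frac{\text{number of retrieved relevant documents in the answer}}{\text{number of retrieved documents}} \]
\[ \text{recall} = \frac{\text{number of retrieved relevant documents in the answer}}{\text{total number of relevant documents in the collection}} \]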
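A minimal sketch of these two measures, assuming the answer and the expert's relevant set are given as collections of document ids. The function name and the example ids are illustrative assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved AND relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical answer set and expert judgments (illustrative ids only).
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were retrieved (recall 2/3).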
Precision-Recall Diagram
[Figure: precision (vertical axis, 0 to 1) plotted against recall (horizontal axis, 0 to 1); the ideal method is marked]
Measure precision-recall for 1, 2, ..., N answers
High precision means few false alarms
High recall means few false dismissals
Harmonic Mean
A single measure combining precision and recall
\[ F = \frac{2}{\frac{1}{r} + \frac{1}{p}} \]
r: recall, p: precision
F takes values in [0,1]
F → 1 as more retrieved documents are relevant
F → 0 as few retrieved documents are relevant
F is high when both precision and recall are high
F expresses a compromise between precision and
recall
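As a small sketch, the harmonic mean can be computed directly from its definition. Returning 0 when either value is 0 is an assumption made here to avoid division by zero; it matches the limit of the formula.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, a value in [0, 1]."""
    if precision == 0 or recall == 0:
        return 0.0  # assumed convention: the limit of the formula
    return 2 / (1 / recall + 1 / precision)
```

For example, `f_measure(1.0, 0.5)` is 2/3, lower than the arithmetic mean 0.75: the harmonic mean is pulled toward the weaker of the two values.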
Ranking Quality
The higher the Rnorm, the better the ability of
a method to retrieve correct answers before
incorrect ones
\[ R_{norm} = \begin{cases} \frac{1}{2}\left(1 + \frac{S^{+} - S^{-}}{S^{+}_{max}}\right) & \text{if } S^{+}_{max} > 0 \\ 1 & \text{otherwise} \end{cases} \]
A human expert evaluates the answer set
Then take answers in pairs: (relevant, irrelevant)
S+: pairs ranked correctly (the relevant entry
is retrieved before the irrelevant one)
S-: pairs ranked incorrectly
Smax+: total number of ranked pairs
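A sketch of the Rnorm computation, assuming the answer is given as a list of expert judgments in retrieval order (True for relevant, False for irrelevant) and that there are no ties, so every (relevant, irrelevant) pair is ranked and Smax+ = S+ + S-. The function name and input format are illustrative assumptions.

```python
def rnorm(ranked_relevance):
    """Rnorm for a ranked answer; `ranked_relevance` holds booleans in rank order."""
    s_plus = s_minus = 0
    relevant_seen = irrelevant_seen = 0
    for is_relevant in ranked_relevance:
        if is_relevant:
            # This relevant answer comes after every irrelevant one seen so far.
            s_minus += irrelevant_seen
            relevant_seen += 1
        else:
            # Every relevant answer seen so far is ranked before this one.
            s_plus += relevant_seen
            irrelevant_seen += 1
    s_max = s_plus + s_minus  # assumed: no ties, all pairs are ranked
    if s_max == 0:
        return 1.0
    return 0.5 * (1 + (s_plus - s_minus) / s_max)
```

A perfect ranking such as `[True, True, False, False]` gives 1.0, the reversed ranking gives 0.0, and `[True, False, True, False]` gives 0.75.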
Searching Method
Sequential scanning: the query is
matched with all stored documents
slow if the database is large or
if matching is slow
The documents must be indexed
hashing, inverted files, B-trees, R-trees
different data types are indexed and
searched separately
Goals of IR
Summarizing, there are two general
goals common to all IR systems
effectiveness: IR must be accurate
(retrieves what the user expects to see in
the answer)
efficiency: IR must be fast (faster than
sequential scanning)
Query Formulation
Must be flexible and convenient
SQL queries: constraints on all
attributes and data types may become
very complex
Queries by example: e.g., by providing an
example document or image
Browsing: display headers, summaries,
miniatures etc., also for refining the
retrieved results
Access Methods for Text
Access methods for text are
interesting for at least 3 reasons
multimedia documents contain text, e.g.,
images often have captions
text retrieval has several applications in
itself (library automation, web search etc.)
text retrieval research has led to useful
ideas such as the vector space model, information
filtering, relevance feedback etc.
Text Queries
Single or multiple keyword queries
context queries: phrases, word proximity
boolean: keywords with and, or, not
Natural language: free text queries
Structured search also takes text
structure into account and can be:
flat or hierarchical for searching in titles,
paragraphs, sections, chapters
hypertext: combines content and connectivity
Keyword Matching
The whole collection is searched
no preprocessing
no space overhead
updates are easy
The user specifies a string (regular
expression) and the text is parsed using
a finite state automaton
KMP, BMH algorithms
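As an illustration of scanning without any index, here is a minimal sketch of the Boyer-Moore-Horspool (BMH) idea for a plain string pattern (not a regular expression): on a mismatch, the window is shifted according to the text character aligned with the last pattern position. The function name is an illustrative assumption.

```python
def horspool_search(text, pattern):
    """Index of the first occurrence of `pattern` in `text`, or -1 if absent."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Shift for each character: distance from its rightmost occurrence
    # (excluding the last pattern position) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos + m <= n:
        if text[pos:pos + m] == pattern:
            return pos
        # Characters that do not occur in the pattern allow a full-length shift.
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

No preprocessing of the text is needed; only the small shift table for the pattern is built per query.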
Error Tolerant Methods
Methods that can tolerate errors
Scan text one character at a time by
keeping track of matched characters
and retrieve strings within a desired
editing distance from query
Wu, Manber 92; Baeza-Yates, Gonnet 92
Extension: regular expressions built-up
by strings and operators: “pro (blem |
tein) (s | ε) | (0 | 1 | 2)”
Error Counting Methods
Retrieve similar words or phrases e.g.,
misspelled words or words with
different pronunciation
editing distance: a numerical estimate
of the similarity between 2 strings
phonetic coding: search words with
similar pronunciation (Soundex, Phonix)
N-grams: count common N-length
substrings
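A small sketch of the N-gram idea, assuming that each distinct N-gram is counted once (a set-based variant; counting repeated N-grams is an equally plausible reading). The function names are illustrative assumptions.

```python
def ngrams(word, n=2):
    """Set of distinct substrings of length n in `word`."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def common_ngrams(a, b, n=2):
    """Number of distinct N-length substrings shared by the two strings."""
    return len(ngrams(a, n) & ngrams(b, n))
```

For instance, `cordis` and `codis` share the bigrams `co`, `di` and `is`, so `common_ngrams("cordis", "codis")` is 3.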
Editing Distance
Minimum number of edit operations that
are needed to transform the first
string to the second
edit operation: insert, delete, substitute
and (sometimes) transposition of
characters
d(si,tj): distance between characters
usually d(si,tj) = 1 if si ≠ tj and 0 otherwise
edit(cordis, codis) = 1: r is deleted
edit(cordis, codris) = 1: transposition of rd
Damerau-Levenshtein Algorithm
Basic recurrence relation (DP algorithm)
edit(0,0) = 0; edit(i,0) = i; edit(0,j) = j;
edit(i,j) = min{ edit(i-1,j) + 1, edit(i,j-1) + 1,
edit(i-1,j-1) + d(si,tj),
edit(i-2,j-2) +
d(si,tj-1) + d(si-1,tj) + 1
}
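The recurrence can be sketched in Python as below, assuming unit costs (d is 0 for equal characters and 1 otherwise). The transposition term is only considered when both prefixes have at least two characters; the function name is an illustrative assumption.

```python
def damerau_levenshtein(s, t):
    """Edit distance with insert, delete, substitute and adjacent transposition."""
    def d(a, b):
        return 0 if a == b else 1

    n, m = len(s), len(t)
    edit = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        edit[i][0] = i  # delete all i characters of s
    for j in range(m + 1):
        edit[0][j] = j  # insert all j characters of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(
                edit[i - 1][j] + 1,                          # delete s_i
                edit[i][j - 1] + 1,                          # insert t_j
                edit[i - 1][j - 1] + d(s[i - 1], t[j - 1]),  # substitute
            )
            if i >= 2 and j >= 2:
                # Transposition term: s_i vs t_(j-1) and s_(i-1) vs t_j.
                best = min(
                    best,
                    edit[i - 2][j - 2] + d(s[i - 1], t[j - 2]) + d(s[i - 2], t[j - 1]) + 1,
                )
            edit[i][j] = best
    return edit[n][m]
```

With this sketch, `damerau_levenshtein("cordis", "codis")` and `damerau_levenshtein("cordis", "codris")` both evaluate to 1.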
Matching Algorithm
Compute D(A,B)
#A, #B: lengths of A, B
0: null symbol
R: cost of an edit operation
D(0,0) = 0
for i = 1 to #A: D(i,0) = D(i-1,0) + R(A[i] -> 0);
for j = 1 to #B: D(0,j) = D(0,j-1) + R(0 -> B[j]);
for i = 1 to #A
for j = 1 to #B {
m1 = D(i,j-1) + R(0 -> B[j]);
m2 = D(i-1,j) + R(A[i] -> 0);
m3 = D(i-1,j-1) + R(A[i] -> B[j]);
D(i,j) = min{m1, m2, m3};
}
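Example: cost matrix for the strings abbcdd (index i, columns) and aaabcccd (index j, rows)

j \ i |  0  1  2  3  4  5  6
      |     a  b  b  c  d  d
------+---------------------
0     |  0  1  2  3  4  5  6
1  a  |  1  0  1  2  3  4  5
2  a  |  2  1  1  2  3  4  5
3  a  |  3  2  2  2  3  4  5
4  b  |  4  3  2  2  3  4  5
5  c  |  5  4  3  3  2  3  4
6  c  |  6  5  4  4  3  3  4
7  c  |  7  6  5  5  4  4  4
8  d  |  8  7  6  6  5  4  4

initialization cost: row 0 and column 0
total cost: the bottom-right entry, 4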
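The matching algorithm can be sketched in Python with a pluggable cost function R. The unit-cost function below (0 for equal symbols, 1 for any insertion, deletion or substitution) is an assumption, and `None` stands in for the null symbol; the function names are illustrative.

```python
def unit_cost(x, y):
    """Assumed cost R(x -> y): 0 if the symbols are equal, 1 otherwise."""
    return 0 if x == y else 1


def matching_cost_matrix(a, b, r=unit_cost):
    """Full DP matrix D, where D[i][j] is the cost of matching a[:i] with b[:j]."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = D[i - 1][0] + r(a[i - 1], None)      # delete a[i]
    for j in range(1, len(b) + 1):
        D[0][j] = D[0][j - 1] + r(None, b[j - 1])      # insert b[j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            m1 = D[i][j - 1] + r(None, b[j - 1])           # insert b[j]
            m2 = D[i - 1][j] + r(a[i - 1], None)           # delete a[i]
            m3 = D[i - 1][j - 1] + r(a[i - 1], b[j - 1])   # substitute or match
            D[i][j] = min(m1, m2, m3)
    return D


total_cost = matching_cost_matrix("abbcdd", "aaabcccd")[-1][-1]
```

For the strings abbcdd and aaabcccd the total cost under unit costs is 4. Passing a different `r` gives operation-specific or symbol-specific costs without changing the algorithm.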
Classical IR Models
Boolean model
simple, based on set theory
queries as Boolean expressions
Vector space model
queries and documents as vectors in term
space
Probabilistic model
a probabilistic approach
Text Indexing
A document is represented by a set of
index terms that summarize document
contents
adjectives, adverbs, connectives are less
useful
mainly nouns (lexicon look-up)
requires text pre-processing (off-line)
Indexing
[Figure: an index over the data repository]
Text Preprocessing
Extract index terms from text
Processing stages
word separation, sentence splitting
change terms to a standard form (e.g.,
lowercase)
eliminate stop-words (e.g. and, is, the, …)
reduce terms to their base form (e.g.,
eliminate prefixes, suffixes)
construct mapping between terms and
documents (indexing)
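The stages above can be sketched as a small pipeline. The stop-word list and the suffix stripper are deliberately tiny, hypothetical stand-ins (a real system would use a full stop list and a proper stemmer), and all names are illustrative assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"and", "is", "the", "a", "of", "in"}


def strip_suffix(term):
    """Very crude base-form reduction: drop a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[:-len(suffix)]
    return term


def index_terms(text):
    """Word separation, lowercasing, stop-word removal, suffix stripping."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [strip_suffix(w) for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Mapping from each index term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index
```

For example, `index_terms("The images and the Indexing of texts")` yields `["image", "index", "text"]`.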
Text Preprocessing Chart
[Figure from Baeza-Yates & Ribeiro-Neto, 1999]
Text Indexing Methods
Main indexing methods:
inverted files
signature files
bitmaps
Size of index
may exceed size of actual collection
compress index to reduce storage
typically 40% of size of collection
Inverted Files
Dictionary: terms stored alphabetically
Posting list: pointers to documents
containing the term
Posting info: document id, frequency of
occurrence, location in document, etc.
dictionary indexed by B-trees, tries or
binary searched
Pros: fast and easy to implement
Cons: large space overhead (up to 300%)
Inverted Index
[Figure: an index of alphabetically ordered terms (άγαλμα "statue", αγάπη "love", …, δουλειά "work", …, πρωί "morning", …, ωκεανός "ocean"), each pointing to a posting list such as (1,2)(3,4), (4,3)(7,5), …, (10,3), whose entries point to documents 1 to 11]
Query Processing
1. Parse query and extract query terms
2. Dictionary lookup
   binary search
3. Get postings from inverted file
   one term at a time
4. Accumulate postings
   record matching document ids, frequencies, etc.
   compute weights
5. Rank answers by weights (e.g., frequencies)
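The five steps can be sketched with a small in-memory inverted file whose postings are (document id, frequency) pairs. A Python dict replaces the sorted dictionary with binary search, and the weight of a document is simply the sum of the query-term frequencies; both are simplifying assumptions, as are the function names and the sample documents.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """term -> {doc_id: frequency of the term in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return index


def process_query(query, index):
    """Return (doc_id, weight) pairs ranked by decreasing weight."""
    scores = Counter()
    for term in query.lower().split():          # 1. parse query terms
        postings = index.get(term, {})          # 2-3. lookup, fetch postings
        for doc_id, freq in postings.items():   # 4. accumulate postings
            scores[doc_id] += freq
    # 5. rank by weight; ties are broken by document id
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical mini collection (illustrative only).
docs = {1: "image retrieval image", 2: "text retrieval", 3: "video"}
ranking = process_query("image retrieval", build_inverted_index(docs))
```

Document 1 matches both terms (weight 3) and is ranked before document 2 (weight 1); document 3 is never touched.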
Signature Files
A filter for eliminating most of the non-qualifying documents
signature: short hash-coded representation
of document or query
the signatures are stored and searched
sequentially
search returns all qualifying documents plus
some false alarms
the answer is searched to eliminate false
alarms
Superimposed Coding
Each word yields a bit pattern of size F with m
bits set to 1 and the rest left as 0
size of F affects the false drop rate
bit patterns are OR-ed to form the doc signature
find documents having 1 in the same bits as the
query

word                 signature
data                 001 000 110 010
base                 000 010 101 001
document signature   001 010 111 011
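A minimal sketch of superimposed coding with Python integers as bit patterns. The word patterns in the usage lines are the ones from the slide; the hash-based `word_signature` is a hypothetical way of choosing m bit positions (SHA-256 is an arbitrary choice here) and all names are illustrative assumptions.

```python
import hashlib


def word_signature(word, size=12, bits=4):
    """Hypothetical hashing scheme: set `bits` distinct positions out of `size`."""
    signature, counter = 0, 0
    while bin(signature).count("1") < bits:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        signature |= 1 << (int.from_bytes(digest[:4], "big") % size)
        counter += 1
    return signature


def superimpose(signatures):
    """OR the word signatures together to form the document signature."""
    result = 0
    for signature in signatures:
        result |= signature
    return result


def may_contain(doc_signature, query_signature):
    """True if the document has 1 in every bit set in the query (may be a false drop)."""
    return (doc_signature & query_signature) == query_signature


# Bit patterns for "data" and "base" as given on the slide.
data, base = 0b001000110010, 0b000010101001
doc_signature = superimpose([data, base])
```

A query pattern whose bits all happen to be covered by other words still passes the test, which is why the answer must be checked afterwards to eliminate false drops.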
Bitmaps
For every term, a bit vector is stored
each bit corresponds to a document
set to 1 if term appears in document, 0
otherwise
e.g., “word” → 10100: the “word” appears in
documents 1 and 3 in a collection of 5
documents
Very efficient for Boolean queries
retrieve bit-vectors for each query term
combine bit vectors with Boolean operators
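A small sketch of bitmap indexing, using a Python integer as the bit vector of each term (bit k stands for the (k+1)-th document in id order). The sample collection and function names are illustrative assumptions.

```python
def build_bitmaps(docs):
    """term -> integer bit vector; bit k is set if the term occurs in the k-th document."""
    bitmaps = {}
    for position, doc_id in enumerate(sorted(docs)):
        for term in set(docs[doc_id].lower().split()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << position)
    return bitmaps


def matching_docs(bits, doc_ids):
    """Translate a bit vector back into the list of matching document ids."""
    return [doc_id for position, doc_id in enumerate(doc_ids) if (bits >> position) & 1]


# Hypothetical collection of 5 documents; "word" appears in documents 1 and 3.
docs = {1: "word image", 2: "image", 3: "word video", 4: "video", 5: "text"}
doc_ids = sorted(docs)
bitmaps = build_bitmaps(docs)
all_docs = (1 << len(doc_ids)) - 1  # mask with one bit per document

word_and_video = matching_docs(bitmaps["word"] & bitmaps["video"], doc_ids)
word_or_image = matching_docs(bitmaps["word"] | bitmaps["image"], doc_ids)
not_word = matching_docs(all_docs & ~bitmaps["word"], doc_ids)
```

Each Boolean operator maps to a single bitwise operation over the whole collection, which is what makes bitmaps efficient for Boolean queries.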
Comparison of Text Indexing Methods
Bitmaps:
pros: intuitive, easy to implement and to process
cons: space overhead
Signature files:
pros: easy to implement, compact index, no lexicon
cons: many unnecessary accesses to documents
Inverted files:
pros: used by most search engines
cons: large space overhead, index can be
compressed
References
“Searching Multimedia Databases by Content”, C. Faloutsos, Kluwer Academic Publishers, 1996
“Modern Information Retrieval”, R. Baeza-Yates, B. Ribeiro-Neto, Addison Wesley, 1999
“Automatic Text Processing”, G. Salton, Addison Wesley, 1989
Information Retrieval Links:
http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.html