Temple University – CIS Dept. CIS661 – Principles of Data

Spatial and Temporal Data Mining
V. Megalooikonomou
Generic Multimedia Indexing
(slides are based on notes by C. Faloutsos)
General Overview

Multimedia Indexing

Spatial Access Methods (SAMs)






k-d trees
Point Quadtrees
MX-Quadtree
z-ordering
R-trees
Generic Multimedia Indexing
Multimedia Indexing – Detailed outline

Generic Multimedia Indexing
  Problem definition
  Distance function
  Similarity queries – Types
  Requirements (ideal method)
  Basic idea, Lower-bounding
  GEMINI approach
  Applications
    1-D Time sequences
    2-D Color images
Generic Multimedia Indexing problem


Given a database of multimedia objects
Design fast search algorithms that locate
objects that match a query object, exactly or
approximately

Objects:






1-d time sequences
Digitized voice or music
2-d color images
2-d or 3-d gray scale medical images
Video clips
E.g.: “Find companies whose stock prices
move similarly”
Generic Multimedia Indexing problem

1st step: provide a measure for the
distance between two objects

Distance function D():

Given two objects OA, OB, the distance (= dissimilarity) of the two objects is denoted by
D(OA, OB)
E.g., Euclidean distance (square root of the sum of squared
differences) of two equal-length time series
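As a minimal sketch of such a distance function, the snippet below computes the Euclidean distance of two equal-length sequences. The function name `euclidean_distance` and the sample values are illustrative assumptions, not part of the slides.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two short, made-up "time series" of equal length.
s1 = [1.0, 2.0, 3.0, 4.0]
s2 = [1.0, 2.0, 3.0, 8.0]
print(euclidean_distance(s1, s2))  # 4.0
```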
Types of Similarity Queries
[Figure: time sequences S1, …, Sn (days 1 to 365) mapped to points F(S1), …, F(Sn) in a feature space with axes avg and std]
Similarity queries are classified into:

Whole match queries:
Given a collection of N objects O1,…, ON and a query object Q, find the data objects that are within distance ε from Q
Sub-pattern match:
Given a collection of N objects O1,…, ON, a query (sub-)object Q, and a tolerance ε, identify the parts of the data objects that match the query Q
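The two query types can be sketched with a plain sequential scan. This is only a baseline for illustration: the helper names (`whole_match`, `subpattern_match`), the use of Euclidean distance, and the sliding-window reading of "parts of the data objects" are assumptions.

```python
import math

def dist(a, b):
    """Euclidean distance of two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def whole_match(objects, query, eps):
    """Indices of objects within distance eps of the query (same length)."""
    return [i for i, obj in enumerate(objects) if dist(obj, query) <= eps]

def subpattern_match(objects, query, eps):
    """(object index, offset) of every window of len(query) within eps."""
    m = len(query)
    hits = []
    for i, obj in enumerate(objects):
        for off in range(len(obj) - m + 1):
            if dist(obj[off:off + m], query) <= eps:
                hits.append((i, off))
    return hits

data = [[1, 2, 3, 4], [5, 5, 5, 5], [1, 2, 3, 5]]
print(whole_match(data, [1, 2, 3, 4], 1.0))  # [0, 2]
print(subpattern_match(data, [2, 3], 0.5))   # [(0, 1), (2, 1)]
```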
Types of Similarity Queries
[Figure: time sequences S1, …, Sn (days 1 to 365) mapped to points F(S1), …, F(Sn) in a feature space with axes avg and std]
Additional types of queries:

K-Nearest Neighbor queries:
Given a collection of N objects O1,…, ON and a query object Q, find the K most similar data objects to Q
All pairs queries (or ‘spatial joins’):
Given a collection of N objects O1,…, ON, find all pairs of objects that are within distance ε from each other
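A brute-force sketch of these two query types follows. It assumes Euclidean distance and uses hypothetical helper names (`k_nearest`, `all_pairs`); an indexed method would return the same answers faster.

```python
import heapq
import math
from itertools import combinations

def dist(a, b):
    """Euclidean distance of two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(objects, query, k):
    """Indices of the k objects closest to the query, nearest first."""
    return heapq.nsmallest(k, range(len(objects)),
                           key=lambda i: dist(objects[i], query))

def all_pairs(objects, eps):
    """All index pairs (i, j), i < j, whose objects are within eps."""
    return [(i, j) for i, j in combinations(range(len(objects)), 2)
            if dist(objects[i], objects[j]) <= eps]

data = [[0, 0], [1, 0], [5, 5], [5, 6]]
print(k_nearest(data, [0.9, 0], 2))  # [1, 0]
print(all_pairs(data, 1.0))          # [(0, 1), (2, 3)]
```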
Ideal method – requirements

Fast: sequential scanning, with a distance calculation against each and every object, is too slow for large databases
“Correct”: No false dismissals. False alarms are acceptable. Why?
Small space overhead
Dynamic: easy to insert, delete, and update objects
Approach Outline


Use k feature-extraction functions to map objects into k-dimensional space (applying a mapping F())
Use highly fine-tuned database SAMs (Spatial Access Methods) like R-trees to accelerate the search (by pruning out large portions of the database that are not promising)…
Basic idea

Focus on ‘whole match’ queries
Given a collection of N objects O1,…, ON, a distance/dissimilarity function D(Oi, Oj), and a query object Q, find the data objects that are within distance ε from Q
Sequential scanning?
May be too slow, for the following reasons:
Distance computation is expensive (e.g., edit distance in DNA strings)
The database size N may be huge
Faster alternative?
Basic idea

Faster alternative:



Step 1: a ‘quick and dirty’ test to discard quickly
the vast majority of non-qualifying objects
Step 2: use of SAMs to achieve faster than
sequential searching
Example:
Database of yearly stock price movements
Euclidean distance function: $D(S, Q) \equiv \left( \sum_{i=1}^{365} (S[i] - Q[i])^2 \right)^{1/2}$
Characterize each sequence with a single number (‘feature’)
Or use two or more features
Basic idea - illustration
[Figure: each sequence S1, …, Sn (days 1 to 365) is mapped to a point F(S1), …, F(Sn) in a feature space with axes Feature1 and Feature2]
A query with tolerance ε becomes a sphere with radius ε
Basic idea – caution!






The mapping F() from objects to k-d points
should not distort the distances
D(): distance of two objects
Df(): distance of the corresponding feature
vectors
Ideally, perfect preservation of distances
In practice, a guarantee of no false dismissals
How? If the distance in f-space matches or
underestimates the distance between two
objects in the original space
Basic idea – Lower bounding

Let O1, O2 be two objects with distance function D(), and let F(O1), F(O2) be their feature vectors with distance function Df().
To guarantee no false dismissals for whole match queries, the feature extraction function F() should satisfy:
Df(F(O1), F(O2)) ≤ D(O1, O2)
for every pair of objects O1, O2
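The lower-bounding condition can be spot-checked numerically. The sketch below assumes Euclidean distance as D() and the sequence average as the single feature; the sqrt(n) scaling in `feature_dist` is a choice made here to tighten the bound (the unscaled difference of averages is also a lower bound). A random check like this illustrates the inequality, it does not replace a proof.

```python
import math
import random

def dist(a, b):
    """Actual distance D(): Euclidean distance of equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def feature(seq):
    """One-number feature F(): the average of the sequence."""
    return sum(seq) / len(seq)

def feature_dist(fa, fb, n):
    """Feature distance Df(): difference of averages, scaled by sqrt(n)."""
    return math.sqrt(n) * abs(fa - fb)

def lower_bound_holds(trials=1000, n=32, seed=0):
    """True if Df <= D on every randomly generated pair of sequences."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.uniform(-10, 10) for _ in range(n)]
        b = [rng.uniform(-10, 10) for _ in range(n)]
        # Small tolerance guards against floating-point rounding.
        if feature_dist(feature(a), feature(b), n) > dist(a, b) + 1e-9:
            return False
    return True

print(lower_bound_holds())  # True
```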
Lower bounding - Proof






Let Q be the query object, O a qualifying object, and ε the tolerance.
Prove: if object O qualifies, it will be retrieved by a range query in the f-space
That is, D(Q, O) ≤ ε ⇒ Df(F(Q), F(O)) ≤ ε
This holds because Df(F(Q), F(O)) ≤ D(Q, O) ≤ ε
What about ‘all-pairs’? (‘spatial join’ on f-space)
What about ‘nearest-neighbor’ queries?
GEneric Multimedia object INdexIng

GEMINI approach:
1. Determine distance function D()
2. Find one or more numerical feature-extraction functions (to provide a ‘quick and dirty’ test)
3. Prove that Df() lower-bounds D() to guarantee no false dismissals
4. Use a SAM (e.g., R-tree) to store and retrieve k-d feature vectors
!!! The methodology focuses on the speed of search only, not on the quality of the results, which relies on the distance function
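The four steps can be sketched end to end as a filter-and-refine range query. Several things here are assumptions for illustration: Euclidean distance as D(), a scaled average as the only feature, and a plain list scanned linearly in place of a real SAM such as an R-tree. The point is only that the refined result equals the sequential-scan result.

```python
import math
import random

def dist(a, b):
    # Step 1: the actual distance D() (Euclidean here).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def feature(seq):
    # Step 2: a 1-d feature vector, the average scaled by sqrt(n) so that
    # distances between features lower-bound dist() (step 3).
    return (math.sqrt(len(seq)) * sum(seq) / len(seq),)

def build_index(objects):
    # Step 4 stand-in: a plain list of feature vectors instead of an R-tree.
    return [feature(obj) for obj in objects]

def range_query(objects, index, query, eps):
    fq = feature(query)
    # 'Quick and dirty' filter in feature space; may contain false alarms.
    candidates = [i for i, f in enumerate(index) if dist(f, fq) <= eps]
    # Refinement with the actual distance removes the false alarms.
    hits = [i for i in candidates if dist(objects[i], query) <= eps]
    return candidates, hits

rng = random.Random(1)
db = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(200)]
q = db[0]
candidates, hits = range_query(db, build_index(db), q, eps=4.0)
scan = [i for i, obj in enumerate(db) if dist(obj, q) <= 4.0]
print(hits == scan, len(candidates), len(hits))
```

The number of candidates relative to the number of hits shows how many false alarms the single feature lets through; more informative features would shrink that gap.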
Generic Multimedia Object Indexing

Applications:



1-d time sequences
2-d color images
Problems to solve:



How to apply the lower-bounding lemma
‘Curse of Dimensionality’ (time sequences)
‘Cross-talk’ of features (color images)
1-D Time Sequences


Distance function: Euclidean distance
Find features that:






Preserve/lower-bound the distance
Carry as much information as possible (reduce false alarms)
If we are allowed to use only one feature, what would this be? The average.
… extending it…
The average of the 1st half, of the 2nd half, of the 1st quarter, etc.
Coefficients of the Fourier transform (DFT), wavelet transform, etc.
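One way to read "extending the average" is to use the averages of k equal-length segments as k features. The sketch below is a variant chosen for illustration, not necessarily the exact scheme intended on the slide; the function name `segment_means` and the sqrt(segment length) scaling (which keeps the feature distance a lower bound of the Euclidean distance) are assumptions.

```python
import math

def dist(a, b):
    """Euclidean distance of two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def segment_means(seq, k):
    """Averages of k equal-length segments, each scaled by sqrt(segment length)."""
    n = len(seq)
    if n % k != 0:
        raise ValueError("k must divide the sequence length")
    m = n // k
    return [math.sqrt(m) * sum(seq[j * m:(j + 1) * m]) / m for j in range(k)]

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 2, 2, 2, 6, 6, 6, 6]
print("actual distance:", dist(x, y))
for k in (1, 2, 4, 8):
    # Each feature distance stays at or below the actual distance.
    print(k, dist(segment_means(x, k), segment_means(y, k)))
```

With k equal to the sequence length the features are the sequence itself, so the feature distance equals the actual distance.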
1-D Time Sequences



Show that the distance in feature space lower-bounds the
actual distance
What about DFT?
Parseval’s Theorem: DFT preserves the energy of the signal as well as the distances between two signals.
D(x,y) = D(X,Y)
where X and Y are the Fourier transforms of x and y
If we keep the first k ≤ n coefficients of DFT we lower-bound the actual distance
$D_f^2(F(x), F(y)) = \sum_{f=0}^{k-1} |X_f - Y_f|^2 \le \sum_{f=0}^{n-1} |X_f - Y_f|^2 = \sum_{i=0}^{n-1} |x_i - y_i|^2 = D^2(x, y)$
1-D Time Sequences









Response time improves as the transform concentrates more of the energy of the signal in the first few coefficients
DFT concentrates the energy for a large class of signals, the colored noises
Colored noises: skewed energy spectrum that drops as O(f^-b)
The energy spectrum (or power spectrum) of a signal is the square of the amplitude |Xf| as a function of the frequency f
b = 2: random walks or brown noise (very predictable)
b > 2: black noises
b = 1: pink noise
b = 0: white noise (completely unpredictable)
Colored noises appear even in images (photographs)
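A rough way to see the energy concentration is to compare a random walk with white noise. The sketch below is an illustration under assumptions: the sequence length (128), the seed, and the choice of the first 4 coefficients are arbitrary, and the helper names are hypothetical.

```python
import cmath
import random

def energy_spectrum(x):
    """Squared amplitude |X_f|^2 of the orthonormal DFT, for each frequency f."""
    n = len(x)
    return [abs(sum(x[i] * cmath.exp(-2j * cmath.pi * f * i / n)
                    for i in range(n))) ** 2 / n
            for f in range(n)]

def low_freq_fraction(x, k):
    """Fraction of the total energy carried by the first k DFT coefficients."""
    spec = energy_spectrum(x)
    return sum(spec[:k]) / sum(spec)

rng = random.Random(42)
white = [rng.gauss(0, 1) for _ in range(128)]  # white noise (b = 0)

walk = []                                       # random walk (b = 2)
level = 0.0
for step in white:
    level += step
    walk.append(level)

print("white noise:", low_freq_fraction(white, 4))
print("random walk:", low_freq_fraction(walk, 4))
```

The random walk should show a much larger fraction than the white noise, which is why a few DFT coefficients make useful features for such signals.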
2-D color images

Image features for Content Based Image
Retrieval (CBIR):

Low Level:






Color – color histograms
Texture – directionality, granularity, contrast
Shape – turning angle, moments of inertia, pattern
spectrum
Position – 2D strings method
…etc
Object Level:

Regions
2-D color images – Color histograms





Each color image – a 2-d array of pixels
Each pixel – 3 color components (R,G,B)
h colors – each color denoting a point in 3-d color space (as high as 2^24 colors)
For each image compute the h-element color histogram – each component is the percentage of pixels that are most similar to that color
The histogram of image I is defined as:
For a color Ci, HCi(I) represents the number of pixels of color Ci in image I
OR:
For any pixel in image I, HCi(I) represents the probability of that pixel having color Ci.
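A minimal sketch of the normalized (probability) form of the histogram follows. The uniform quantization of each 8-bit component into `levels` bins, giving h = levels^3 colors, is an assumption made for illustration; real systems typically pick representative colors differently.

```python
def color_histogram(pixels, levels=4):
    """Normalized color histogram of an image given as (R, G, B) tuples.

    Each 8-bit component is quantized into `levels` bins, so the histogram
    has h = levels**3 entries that sum to 1.
    """
    hist = [0.0] * (levels ** 3)
    for r, g, b in pixels:
        # Map each component from 0..255 to a bin index 0..levels-1.
        ri, gi, bi = r * levels // 256, g * levels // 256, b * levels // 256
        hist[ri * levels * levels + gi * levels + bi] += 1
    total = len(pixels)
    return [count / total for count in hist]

# A made-up 4-pixel image: two reds, one blue, one black.
image = [(255, 0, 0), (250, 5, 5), (0, 0, 255), (0, 0, 0)]
hist = color_histogram(image)
print(len(hist), hist[48], hist[3], hist[0])  # 64 0.5 0.25 0.25
```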
2-D color images – Color histograms



Usually cluster similar colors together and choose one
representative color for each ‘color bin’
Most commercial CBIR systems include color histogram as one
of the features (e.g., QBIC of IBM)
No spatial information
Color histograms - distance

One method to measure the distance between two histograms x and y is:
$d_h^2(x, y) = (x - y)^t A (x - y) = \sum_{i}^{h} \sum_{j}^{h} a_{ij} (x_i - y_i)(x_j - y_j)$
where the color-to-color similarity matrix A has entries aij that describe the similarity between color i and color j
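The quadratic-form distance can be sketched directly from the double sum. The 3-bin similarity matrix below is a made-up example in which bins 0 and 1 are treated as similar colors (a_01 = 0.5) and bin 2 as unrelated to both.

```python
def quadratic_form_distance_sq(x, y, A):
    """d_h^2(x, y) = (x - y)^t A (x - y) for histograms x, y and matrix A."""
    d = [xi - yi for xi, yi in zip(x, y)]
    h = len(d)
    # The double loop visits every cross term a_ij * d_i * d_j.
    return sum(A[i][j] * d[i] * d[j] for i in range(h) for j in range(h))

A = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
x = [1.0, 0.0, 0.0]  # all pixels in bin 0
y = [0.0, 1.0, 0.0]  # all pixels in the similar bin 1
z = [0.0, 0.0, 1.0]  # all pixels in the unrelated bin 2
print(quadratic_form_distance_sq(x, y, A))  # 1.0
print(quadratic_form_distance_sq(x, z, A))  # 2.0
```

The cross terms make x closer to y than to z, which a plain Euclidean distance on the histograms would not capture.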
Color histograms – lower bounding

Two obstacles for using color histograms as feature vectors in GEMINI:
‘Dimensionality curse’ (h is large, e.g., 64 or 128)
Distance function is quadratic
It involves all cross terms (‘cross-talk’ among features)
- expensive to compute
- precludes the use of SAMs
[Figure: histograms x and q over color bins such as bright red, pink, and orange (e.g., 64 colors)]
Color histograms – lower bounding



1st step: define the distance function between two color images D()=dh()
2nd step: find numerical features (one or more) whose Euclidean distance lower-bounds dh()
If we are allowed to use only one numerical feature to describe the color image, what should it be?
Avg. amount for each color component (R,G,B)
$\bar{x} = (R_{avg}, G_{avg}, B_{avg})^t$
where $R_{avg} = (1/P) \sum_{p=1}^{P} R(p)$, and similarly for G and B
Here P is the number of pixels in the image and R(p) is the red component (intensity) of the p-th pixel
Color histograms – lower bounding

Given the average color vectors $\bar{x}$ and $\bar{y}$ of two images, we define davg() as the Euclidean distance between the 3-d average color vectors
$d_{avg}^2(\bar{x}, \bar{y}) = (\bar{x} - \bar{y})^t (\bar{x} - \bar{y}) = \sum_{i=1}^{3} (\bar{x}_i - \bar{y}_i)^2$
3rd step: prove that the feature distance davg() lower-bounds the actual distance dh()
Main idea of approach:
First a filtering using the average (R,G,B) color,
then a more accurate matching using the full h-element histogram
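The filtering step can be sketched as follows. The helper names and the tiny images are made up, and the tolerance `eps` is assumed to be expressed on the scale of davg(); how davg() relates to dh() (the lower-bounding proof) is not covered by this sketch. Images that survive the filter would then be compared with the full histogram distance.

```python
import math

def average_color(pixels):
    """(R_avg, G_avg, B_avg) of an image given as a list of (R, G, B) tuples."""
    P = len(pixels)
    return tuple(sum(p[c] for p in pixels) / P for c in range(3))

def d_avg(x, y):
    """Euclidean distance between two 3-d average color vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def filter_by_average(images, query, eps):
    """Indices of images whose average color is within eps of the query's."""
    q = average_color(query)
    return [i for i, img in enumerate(images)
            if d_avg(average_color(img), q) <= eps]

images = [
    [(255, 0, 0), (255, 0, 0)],   # pure red
    [(0, 0, 255), (0, 0, 255)],   # pure blue
    [(250, 10, 0), (254, 0, 0)],  # slightly different red
]
print(average_color(images[2]))                    # (252.0, 5.0, 0.0)
print(filter_by_average(images, images[0], 10.0))  # [0, 2]
```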
Color auto-correlogram



Pick any pixel p1 of color Ci in the image I
At distance k away from p1 pick another pixel p2
What is the probability that p2 is also of color Ci?
[Figure: image I with a pixel P1 and a pixel P2 at distance k from it; is P2 also red?]
Color auto-correlogram

The auto-correlogram of image I for color Ci, distance k:
$\gamma_{C_i}^{(k)}(I) \equiv \Pr[\, |p_1 - p_2| = k,\ p_2 \in I_{C_i} \mid p_1 \in I_{C_i} \,]$
Integrates both color information and spatial information.
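A brute-force sketch of one auto-correlogram entry is given below. It takes the reading "pick a pixel of color Ci, pick another pixel at distance exactly k, how often does it share the color", measures pixel distance as the maximum of the row and column offsets (the chessboard distance), and represents an image as a list of rows of color indices. These choices and the two tiny test images are assumptions for illustration.

```python
def autocorrelogram(image, color, k):
    """Fraction of pixel pairs at chessboard distance exactly k, starting
    from a pixel of `color`, whose second pixel also has `color`."""
    rows, cols = len(image), len(image[0])
    same = total = 0
    for y in range(rows):
        for x in range(cols):
            if image[y][x] != color:
                continue
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    if max(abs(dy), abs(dx)) != k:
                        continue  # keep only the ring at distance exactly k
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < rows and 0 <= nx < cols:
                        total += 1
                        same += image[ny][nx] == color
    return same / total if total else 0.0

# Two 4x4 images with identical histograms (8 pixels of each color).
solid = [[0, 0, 1, 1] for _ in range(4)]
checker = [[(x + y) % 2 for x in range(4)] for y in range(4)]
print(autocorrelogram(solid, 0, 1))    # 32/42, about 0.762
print(autocorrelogram(checker, 0, 1))  # 18/42, about 0.429
```

The two images are indistinguishable by their color histograms, yet their auto-correlogram values differ, which is the spatial information the histogram lacks.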
Color auto-correlogram
Implementations

Pixel Distance Measures
Use D8 distance (also called chessboard distance):
$D_8(p, q) = \max(|p_x - q_x|, |p_y - q_y|)$
Choose distance k = 1, 3, 5, 7
Computation complexity:
Histogram: $\Theta(n^2)$
Correlogram: $\Theta(134 \cdot n^2)$
Implementations

Features Distance Measures:
D( f(I1) - f(I2) ) is small ⇒ I1 and I2 are similar.
Example: f(a)=1000, f(a’)=1050; f(b)=100, f(b’)=150
For histogram:
$|I - I'|_h = \sum_{i \in [m]} \frac{|h_{C_i}(I) - h_{C_i}(I')|}{1 + h_{C_i}(I) + h_{C_i}(I')}$
For correlogram:
$|I - I'|_\gamma = \sum_{i \in [m],\, k \in [d]} \frac{|\gamma_{C_i}^{(k)}(I) - \gamma_{C_i}^{(k)}(I')|}{1 + \gamma_{C_i}^{(k)}(I) + \gamma_{C_i}^{(k)}(I')}$
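The effect of the denominator can be seen on the slide's example values. The sketch below assumes the example is meant to show that the same absolute difference (50) counts for less when the feature values are large; the function name `relative_l1` is hypothetical.

```python
def relative_l1(f1, f2):
    """Sum over components of |a - b| / (1 + a + b)."""
    return sum(abs(a - b) / (1 + a + b) for a, b in zip(f1, f2))

# Same absolute difference of 50, very different relative difference.
print(relative_l1([1000], [1050]))  # 50 / 2051, about 0.024
print(relative_l1([100], [150]))    # 50 / 251, about 0.199
```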
Color Histogram vs Correlogram

If there is no difference between the query
and the target images, both methods have
good performance.
[Figure: a query image (512 colors) and the images ranked 1st to 5th by the correlogram method and by the histogram method]
Color Histogram vs Correlogram

The correlogram method is more stable to
color change than the histogram method.
[Figure: query and target images. Rank of the target: correlogram method 1st, histogram method 48th]
Color Histogram vs Correlogram

The correlogram method is more stable to
large appearance change than the histogram
method
[Figure: query and target images. Rank of the target: correlogram method 1st, histogram method 31st]
Color Histogram vs Correlogram

The correlogram method is more stable to
contrast & brightness change than the
histogram method.
Rank of the target image for each query (C: correlogram method, H: histogram method):
Query      C        H
Query 1    178th    230th
Query 2    1st      1st
Query 3    1st      3rd
Query 4    5th      18th
Color Histogram vs Correlogram



The color correlogram describes the global
distribution of local spatial correlations of colors.
It’s easy to compute
It’s more stable than the color histogram method
Multimedia Indexing – Conclusions



GEMINI is a popular method
Whole matching problem
Should pay attention to:





Distance functions
Feature Extraction functions
Lower Bounding
Particular application
Sub-pattern matching?