GEMINI GEneric Multimedia INdexIng









distance measure
Sub-pattern Match
‘quick and dirty’ test
Lower bounding lemma
1-D Time Sequences
Color histograms
Color auto-correlogram
Shapes
GEneric Multimedia INdexIng


Given a database of multimedia objects
Design fast search algorithms that locate objects that
match a query object, exactly or approximately

Objects:
• 1-d time sequences
• Digitized voice or music
• 2-d color images
• 2-d or 3-d gray scale medical images
• Video clips
E.g.: “Find companies whose stock prices move
similarly”
Applications

time series:
• financial, marketing (click-streams!), ECGs,
sound;

images:
• medicine, digital libraries, education, art

higher-d signals:
• scientific databases (e.g., astrophysics), medicine (MRI scans), entertainment (video)
Sample queries
• Find medical cases similar to Smith's
• Find pairs of stocks that move in sync
• Find pairs of documents that are similar (plagiarism?)
• Find faces similar to ‘Tiger Woods’

[Figure: stock price ($price) vs. day (1 to 365) for several stocks. The distance function is chosen by a domain expert (e.g., Euclidean distance).]
Generic Multimedia Indexing

1st step: provide a measure for the
distance between two objects

Distance function d():
• Given two objects O1, O2 the distance (=dissimilarity) of the two objects is denoted by
d(O1, O2)
E.g., Euclidean distance (square root of the sum of squared differences) of two equal-length time series
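As a minimal sketch of such a distance function d(), the Euclidean distance between two equal-length sequences can be written as below. The function name `euclidean_distance` is an illustrative choice, not something defined in the slides.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    # Square root of the sum of squared element-wise differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For example, `euclidean_distance([0, 0], [3, 4])` evaluates the familiar 3-4-5 triangle.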
ε-Similarity query

Given a query object Q, find all objects Oi in the database that are ε-similar to Q (identical for ε = 0):

$$\{\, O_i \in DB \mid d(Q, O_i) \le \varepsilon \,\}$$
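A minimal sketch of the ε-similarity query answered by a plain sequential scan, assuming Euclidean distance and an inclusive tolerance (so that ε = 0 returns exact matches). The names `dist` and `range_query` are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_query(db, q, eps, d=dist):
    """Return the indices i of all objects with d(q, db[i]) <= eps.

    This is the baseline sequential scan: one distance computation
    per database object.
    """
    return [i for i, obj in enumerate(db) if d(q, obj) <= eps]
```

This baseline is what any faster method has to reproduce exactly: the same answer set, obtained with fewer full distance computations.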
Types of Similarity Queries

Whole match queries:
• Given a collection of S objects O1,…, Os and a query
object Q find data objects that are within distance ε
from Q
Types of Similarity Queries

Sub-pattern Match:
• Given a collection of S objects O1,…, OS and a query
(sub-) object Q and a tolerance ε identify the parts of the
data objects that match the query Q
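A sub-pattern match can be sketched, in its naive form, as a sliding window over every data sequence. This assumes 1-d sequences and Euclidean distance; `subpattern_matches` is a hypothetical name, and the slides do not prescribe this particular procedure.

```python
import math


def subpattern_matches(db, q, eps):
    """Return (object index, offset) pairs for every window of length
    len(q) whose Euclidean distance to q is at most eps."""
    w = len(q)
    hits = []
    for i, seq in enumerate(db):
        # Slide a window of the query's length over the sequence;
        # sequences shorter than the query yield no windows.
        for off in range(len(seq) - w + 1):
            window = seq[off:off + w]
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, q)))
            if d <= eps:
                hits.append((i, off))
    return hits
```

Each hit identifies the part of a data object (object index and starting offset) that matches the query.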
Ideal method – requirements

Fast: sequential scanning, with a distance calculation against each and every object, is too slow for large databases

Dynamic: easy to insert, delete, and
update objects
Basic idea

Focus on ‘whole match’ queries
• Given a collection of S objects O1,…, Os, a
distance/dis-similarity function d(Oi, Oj), and a
query object Q find data objects that are within
distance ε from Q

Sequential scanning?
May be too slow, for the following reasons:
• Distance computation is expensive (e.g., edit distance in DNA strings)
• The database size S may be huge

Faster alternative?
GEneric Multimedia INdexIng
Christos Faloutsos, QBIC 1994
• A feature extraction function maps the high-dimensional objects into a low-dimensional space
• Objects that are very dissimilar in the feature space are also very dissimilar in the original space
Basic idea

Faster alternative:
Step 1: a ‘quick and dirty’ test to quickly discard the vast majority of non-qualifying objects
Step 2: use of spatial access methods (SAMs, e.g., R-trees, Hilbert curves) to search faster than a sequential scan


Example:

Database of yearly stock price movements
• Euclidean distance function
• Characterize with a single number (‘feature’)
• Or use two or more features
Basic idea - illustration
[Figure: each sequence S1, …, Sn (price vs. day, days 1 to 365) is mapped by F() to a point F(S1), …, F(Sn) in the (Feature1, Feature2) plane]
A query with tolerance ε becomes a sphere with radius ε
Basic idea – caution!

The mapping F() from objects to k-dim. points
should not distort the distances
• d(): distance of two objects
• dfeature(): distance of their corresponding feature vectors


Ideally, perfect preservation of distances


In practice, a guarantee of no false dismissals
How?

Objects represented by vectors that are very
dissimilar in the feature space are expected to
be very dissimilar in the original space

If the distances in the feature space are always smaller than or equal to the distances in the original space, a bound that is valid in both spaces can be determined

The distance between similar objects is smaller than or equal to ε in the original space and, consequently, it is smaller than or equal to ε in the feature space as well
Lower bounding lemma

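If the distance between similar objects is smaller than or equal to ε in the original space, then it is also smaller than or equal to ε in the feature space:

$$d_{feature}(F(O_1), F(O_2)) \le d(O_1, O_2) \le \varepsilon$$

o.k.: $d(O_1, O_2) \le \varepsilon \;\Rightarrow\; d_{feature}(F(O_1), F(O_2)) \le \varepsilon$

WRONG!: $d_{feature}(F(O_1), F(O_2)) \le \varepsilon \;\Rightarrow\; d(O_1, O_2) \le \varepsilon$

The converse can fail because $d_{feature}(F(O_1), F(O_2)) \le \varepsilon \le d(O_1, O_2) \le\ ?$ is also possible.
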

No qualifying object will be missed by the search in the feature space (no false dismissals)
Some of the retrieved objects will not be similar in the original space (false hits/alarms)

That means that we are guaranteed to have
selected all the objects we wanted plus some
additional false hits in the feature space

In the second step, false hits have to be filtered
from the set of the selected objects through
comparison in the original space
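The two steps can be sketched as a filter-and-refine range query. This is an illustration under stated assumptions, not the slides' implementation: the single feature is the sum of the sequence divided by sqrt(n), which lower-bounds the Euclidean distance by the Cauchy-Schwarz inequality, and a linear pass over the feature vectors stands in for the spatial access method. The names `feature` and `gemini_range_query` are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def feature(seq):
    """Hypothetical 1-d feature: sum(seq) / sqrt(n), i.e. the mean scaled
    by sqrt(n). Its distance never exceeds the original distance."""
    return [sum(seq) / math.sqrt(len(seq))]


def gemini_range_query(db, q, eps):
    """Return (candidates, answers) as lists of database indices."""
    fq = feature(q)
    # Step 1 (filter): cheap test in feature space. Because the feature
    # distance lower-bounds the true distance, no qualifying object is lost.
    candidates = [i for i, obj in enumerate(db)
                  if dist(feature(obj), fq) <= eps]
    # Step 2 (refine): discard false hits using the original distance.
    answers = [i for i in candidates if dist(db[i], q) <= eps]
    return candidates, answers
```

In a real system the feature vectors would be computed once at insertion time and stored in the index, so that step 1 does not touch the full objects at all.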
Time sequences
[Figure: white noise and brown noise sequences, with their Fourier spectra plotted in log-log scale]
Time sequences

Conclusion: colored noises are well
approximated by their first few Fourier
coefficients

Colored noises appear in nature
GEMINI
Important:
Q: how to extract features?
A: “if I have only one number to describe
my object, what should this be?”
1-D Time Sequences
Distance function: Euclidean distance
Find features that:
• Preserve/lower-bound the distance
• Carry as much information as possible (reduce false alarms)
If we are allowed to use only one feature, what would it be? The average.
Extending it:
• The average of the 1st half, of the 2nd half, of the 1st quarter, etc.
• Coefficients of the Fourier transform (DFT), wavelet transform, etc.
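A hedged sketch of the "averages of segments" idea: split a length-n sequence into k equal segments and keep each segment mean, scaled by the square root of the segment length. The scaling is an assumption added here so that the Euclidean distance between feature vectors provably lower-bounds the Euclidean distance between the sequences; the function name `segment_means` is hypothetical, and the sketch assumes k divides n.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def segment_means(seq, k):
    """Means of k equal-length segments, each scaled by sqrt(segment length).

    With this scaling, dist(segment_means(x, k), segment_means(y, k))
    is never larger than dist(x, y).
    """
    n = len(seq)
    if k <= 0 or n % k != 0:
        raise ValueError("k must be positive and divide the sequence length")
    w = n // k
    scale = math.sqrt(w)
    return [scale * sum(seq[j * w:(j + 1) * w]) / w for j in range(k)]


if __name__ == "__main__":
    random.seed(0)
    x = [random.gauss(0, 1) for _ in range(16)]
    y = [random.gauss(0, 1) for _ in range(16)]
    for k in (1, 2, 4, 8):
        # The feature distance should approach the true distance as k grows.
        print(k, dist(segment_means(x, k), segment_means(y, k)), dist(x, y))
```

With k = 1 this is the single "average" feature; larger k carries more information and therefore produces fewer false alarms.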
Feature extracting function
1. Define a distance function
2. Find a feature extraction function F() that satisfies the lower bounding lemma
Example:
• The Discrete Fourier Transform (DFT) preserves Euclidean distances between signals (Parseval's theorem)
• F() = DFT, keeping only the first few coefficients of the transform
1-D Time Sequences
Show that the distance in feature space lower-bounds the actual
distance
DFT?
Parseval’s Theorem: DFT preserves the energy of the signal as
well as the distances between two signals
d(x,y) = d(X,Y)
where X and Y are the Fourier transforms of x and y
If we keep the first k ≤ n coefficients of the DFT, we lower-bound the actual distance:

$$d_{feature}^2(F(x), F(y)) = \sum_{f=0}^{k-1} |X_f - Y_f|^2 \le \sum_{f=0}^{n-1} |X_f - Y_f|^2 = \sum_{i=0}^{n-1} |x_i - y_i|^2 \equiv d^2(x, y)$$
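The lower bound can be checked numerically with a small sketch. It assumes the unitary DFT (normalized by 1/sqrt(n)), which is the normalization under which Parseval's theorem preserves distances exactly; the direct O(n^2) transform and the names `dft`, `dft_feature`, and `cdist` are illustrative choices.

```python
import cmath
import math
import random


def dft(x):
    """Unitary DFT: X_f = (1/sqrt(n)) * sum_i x_i * exp(-2*pi*j*f*i/n)."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * cmath.pi * f * i / n) for i in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]


def dft_feature(x, k):
    """Feature vector F(x): the first k DFT coefficients."""
    return dft(x)[:k]


def cdist(a, b):
    """Euclidean distance that also works for complex-valued vectors."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))


if __name__ == "__main__":
    random.seed(0)
    x = [random.gauss(0, 1) for _ in range(32)]
    y = [random.gauss(0, 1) for _ in range(32)]
    print("d(x, y) =", cdist(x, y))
    print("d(X, Y) =", cdist(dft(x), dft(y)))  # equal up to rounding
    for k in (1, 2, 3):
        print(k, cdist(dft_feature(x, k), dft_feature(y, k)))
```

Dropping coefficients only removes non-negative terms from the sum, which is why the feature distance can never exceed the true distance.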
Time sequences - results
• keep the first 2-3 Fourier coefficients
• faster than seq. scan
• no false dismissals
[Figure: total time, cleanup time, and R-tree time vs. # of coefficients kept]
Time sequences – improvements:
• could use wavelets or DCT
• could use segment averages

Images - color
what is an image?
A: 2-d array
2-D color images – Color histograms





• Each color image – a 2-d array of pixels
• Each pixel – 3 color components (R,G,B)
• h colors – each color denoting a point in 3-d color space (as many as 2^24 colors)
• For each image compute the h-element color histogram – each component is the percentage of pixels that are most similar to that color
The histogram of image I is defined as:
For a color Ci, HCi(I) represents the number of pixels of color Ci in image I
OR:
For any pixel in image I, HCi(I) represents the probability of that pixel having color Ci.
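A minimal sketch of the normalized (probability) form of the histogram. As an assumption made for brevity, colors are binned by uniform quantization of each 8-bit channel rather than by nearest representative color; `color_histogram` and `bins_per_channel` are hypothetical names.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized color histogram with bins_per_channel**3 bins.

    pixels: iterable of (r, g, b) tuples with components in 0..255.
    Returns a list whose entries sum to 1; entry i is the fraction of
    pixels that fall into color bin i.
    """
    b = bins_per_channel
    counts = [0] * (b ** 3)
    total = 0
    for r, g, bl in pixels:
        # Map each 8-bit channel to a bin index in 0..b-1.
        idx = (r * b // 256) * b * b + (g * b // 256) * b + (bl * b // 256)
        counts[idx] += 1
        total += 1
    if total == 0:
        raise ValueError("image has no pixels")
    return [c / total for c in counts]
```

Multiplying each entry by the pixel count gives back the count form of the histogram.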
2-D color images – Color histograms



• Usually, similar colors are clustered together and one representative color is chosen for each ‘color bin’
• Most commercial CBIR systems include the color histogram as one of their features (e.g., QBIC of IBM)
• No spatial information
Color histograms - distance

One method to measure the distance between two
histograms x and y is:
$$d_h^2(x, y) = (x - y)^t A (x - y) = \sum_{i=1}^{h} \sum_{j=1}^{h} a_{ij} (x_i - y_i)(x_j - y_j)$$
where the color-to-color similarity matrix A has entries
aij that describe the similarity between color i and color j
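The quadratic-form distance can be sketched directly from its double-sum definition. The matrix A is supplied by the caller as a list of rows; the example matrices in the usage note are made up for illustration, and `quadratic_form_distance` is a hypothetical name.

```python
import math


def quadratic_form_distance(x, y, A):
    """Histogram distance d_h(x, y) = sqrt((x - y)^t A (x - y)).

    x, y: h-element histograms; A: h x h color-similarity matrix
    given as a list of rows.
    """
    diff = [xi - yi for xi, yi in zip(x, y)]
    h = len(diff)
    total = sum(A[i][j] * diff[i] * diff[j]
                for i in range(h) for j in range(h))
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(total, 0.0))
```

With the identity matrix this reduces to the Euclidean distance. Giving two colors a positive cross-similarity, e.g. `a01 = a10 = 0.5`, shrinks the distance between histograms that differ only by moving mass between those two colors.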
Color histograms – lower bounding



1st step: define the distance function between two color images, d() = dh()
2nd step: find numerical features (one or more) whose Euclidean distance lower-bounds dh()
If we are allowed to use only one numerical feature to describe the color image, what should it be?

The average amount of each color component (R,G,B):

$$\bar{x} = (R_{avg}, G_{avg}, B_{avg})^t$$

where $R_{avg} = \frac{1}{P} \sum_{p=1}^{P} R(p)$, and similarly for G and B. Here P is the number of pixels in the image and R(p) is the red component (intensity) of the p-th pixel.
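Computing the average color vector is a single pass over the pixels. A minimal sketch, assuming the image is given as an iterable of (r, g, b) tuples; `average_color` is a hypothetical name.

```python
def average_color(pixels):
    """Return (R_avg, G_avg, B_avg) over all pixels of an image.

    pixels: iterable of (r, g, b) tuples.
    """
    sums = [0, 0, 0]
    count = 0
    for r, g, b in pixels:
        sums[0] += r
        sums[1] += g
        sums[2] += b
        count += 1
    if count == 0:
        raise ValueError("image has no pixels")
    return tuple(s / count for s in sums)
```

The result is a 3-d feature vector, small enough to be indexed with a spatial access method.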
Color histograms – lower bounding

Given the average color vectors $\bar{x}$ and $\bar{y}$ of two images, we define davg() as the Euclidean distance between the 3-d average color vectors:

$$d_{avg}^2(\bar{x}, \bar{y}) = (\bar{x} - \bar{y})^t (\bar{x} - \bar{y}) = \sum_{i=1}^{3} (\bar{x}_i - \bar{y}_i)^2$$

3rd step: prove that the feature distance davg() lower-bounds the actual distance dh()
• By the ``Quadratic Distance Bounding'' theorem, the distance dh() between the histograms of two images is guaranteed to be greater than or equal to the distance davg() between their average color vectors. The proof of the theorem is based on an unconstrained minimization problem using Lagrange multipliers.
Main idea of approach:


First a filtering using the average (R,G,B) color,
then a more accurate matching using the full h-element histogram
Images - color
[Figure: performance, response time vs. selectivity, for sequential scan and for filtering with avg RGB]
Color auto-correlogram



pick any pixel p1 of color Ci in the image I
at distance k away from p1 pick another pixel p2
what is the probability that p2 is also of color Ci ?
[Figure: image I with a red pixel P1 and a pixel P2 at distance k from it; is P2 also red?]
Color auto-correlogram

The auto-correlogram of image I for color Ci and distance k:

$$\gamma_{C_i}^{(k)}(I) \equiv \Pr[\, |p_1 - p_2| = k,\; p_2 \in I_{C_i} \mid p_1 \in I_{C_i} \,]$$

Integrates both color information and spatial information
Color auto-correlogram
Implementations

Pixel Distance Measures

Use D8 distance (also called chessboard distance):

$$d_{max}(p, q) = \max(|p_x - q_x|, |p_y - q_y|)$$

Choose distance k = 1, 3, 5, 7
Computation complexity:
• Histogram: $O(n^2)$
• Correlogram: $O(134 \cdot n^2)$
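A brute-force sketch of the auto-correlogram under the D8 (chessboard) distance. It assumes the image has already been quantized to color indices and, as a simplification chosen here, normalizes by the number of neighbors that actually lie inside the image, so border pixels are not penalized. The name `auto_correlogram` is hypothetical, and this is not the optimized algorithm behind the complexity figures above.

```python
def auto_correlogram(img, ks=(1, 3, 5, 7)):
    """Estimate, for each color c and distance k, the probability that a
    pixel at D8 distance exactly k from a pixel of color c also has color c.

    img: 2-d list (rows) of color indices.
    Returns a dict {(c, k): probability}; a key is present only if at
    least one in-image neighbor pair exists for it.
    """
    h, w = len(img), len(img[0])
    same, total = {}, {}
    for y in range(h):
        for x in range(w):
            c = img[y][x]
            for k in ks:
                key = (c, k)
                # Visit the ring of pixels at chessboard distance exactly k.
                for dy in range(-k, k + 1):
                    for dx in range(-k, k + 1):
                        if max(abs(dx), abs(dy)) != k:
                            continue
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            total[key] = total.get(key, 0) + 1
                            if img[yy][xx] == c:
                                same[key] = same.get(key, 0) + 1
    return {key: same.get(key, 0) / total[key] for key in total}
```

A uniformly colored image yields probability 1 for every available distance, while an isolated pixel of some color yields probability 0 for that color; two images with identical histograms can therefore still have very different auto-correlograms.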
Implementations

Features Distance Measures:

If D( f(I1) - f(I2) ) is small, I1 and I2 are considered similar
m = R, G, B; k = distance

For histogram:

$$|I - I'|_h \equiv \sum_{i \in [m]} \frac{|h_{C_i}(I) - h_{C_i}(I')|}{1 + h_{C_i}(I) + h_{C_i}(I')}$$

For correlogram:

$$|I - I'|_\gamma \equiv \sum_{i \in [m],\, k \in [d]} \frac{|\gamma_{C_i}^{(k)}(I) - \gamma_{C_i}^{(k)}(I')|}{1 + \gamma_{C_i}^{(k)}(I) + \gamma_{C_i}^{(k)}(I')}$$
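Both feature distances share one form: a sum of absolute differences, each divided by one plus the sum of the two values. A minimal sketch, assuming the features of each image have been flattened into equal-length lists of non-negative numbers in the same order; `relative_l1_distance` is a hypothetical name.

```python
def relative_l1_distance(f1, f2):
    """Sum over components of |a - b| / (1 + a + b).

    f1, f2: equal-length lists of non-negative feature values
    (histogram bins, or correlogram entries over all colors and distances).
    """
    if len(f1) != len(f2):
        raise ValueError("feature vectors must have equal length")
    return sum(abs(a - b) / (1 + a + b) for a, b in zip(f1, f2))
```

The denominator damps the contribution of components where both images already have large values, so differences in rare colors weigh relatively more.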
Color Histogram vs Correlogram
[Figure: a query image and the top five results (1st to 5th) returned by the correlogram method and by the histogram method (512 colors)]
Color Histogram vs Correlogram
[Figure: query image and target image] Rank of the target: correlogram method 1st, histogram method 48th
[Figure: query image and target image] Rank of the target: correlogram method 1st, histogram method 31st
Color Histogram vs Correlogram
Rank of the target image for each query (C: correlogram method, H: histogram method):
Query 1: C 178th, H 230th
Query 2: C 1st, H 1st
Query 3: C 1st, H 3rd
Query 4: C 5th, H 18th
The correlogram method is more stable to contrast &
brightness change than the histogram method.
Color Histogram vs Correlogram



• The color correlogram describes the global distribution of local spatial correlations of colors
• It’s easy to compute
• It’s more stable than the color histogram method
Images - shapes

Distance function: Euclidean, on the area

Q: how to do dim. reduction?

A: Karhunen-Loeve (PCA)
Images - shapes

Performance: ~10x faster
[Figure: log(# of I/Os) vs. # of features kept, with ‘all kept’ as the reference]
Multimedia Indexing – Conclusions
• GEMINI is a popular method
• Whole matching problem
• Should pay attention to:
  • Distance functions
  • Feature extraction functions
  • Lower bounding
  • Particular application
Conclusions
• GEMINI works for any setting (time sequences, images, etc.)
• uses a ‘quick and dirty’ filter
• faster than seq. scan