GEMINI: GEneric Multimedia INdexIng

Outline:
• distance measures
• sub-pattern match
• the 'quick and dirty' test
• the lower bounding lemma
• 1-D time sequences
• color histograms
• color auto-correlograms
• shapes

GEneric Multimedia INdexIng
Given a database of multimedia objects, design fast search algorithms that locate the objects that match a query object, exactly or approximately.
Objects:
• 1-d time sequences
• digitized voice or music
• 2-d color images
• 2-d or 3-d gray-scale medical images
• video clips
E.g.: "Find companies whose stock prices move similarly."

Applications
• time series: financial, marketing (click-streams!), ECGs, sound
• images: medicine, digital libraries, education, art
• higher-d signals: scientific databases (e.g., astrophysics), medicine (MRI scans), entertainment (video)

Sample queries
• Find medical cases similar to Smith's
• Find pairs of stocks that move in sync
• Find pairs of documents that are similar (plagiarism?)
• Find faces similar to 'Tiger Woods'
[Figure: three yearly stock-price curves, $price vs. day (1-365); the distance function is chosen by a domain expert, e.g., the Euclidean distance]

Generic Multimedia Indexing
1st step: provide a measure for the distance between two objects.
Distance function d(): given two objects O1 and O2, their distance (= dissimilarity) is denoted by d(O1, O2). E.g., the Euclidean distance (square root of the sum of squared differences) between two equal-length time series.
ε-similarity query: given a query object Q, find all objects Oi in the database that are ε-similar (identical for ε = 0) to Q:

    { Oi ∈ DB | d(Q, Oi) ≤ ε }

Types of Similarity Queries
Whole match: given a collection of S objects O1, ..., OS and a query object Q, find the data objects that are within distance ε of Q.
Sub-pattern match: given a collection of S objects O1, ..., OS, a query (sub-)object Q, and a tolerance ε, identify the parts of the data objects that match the query Q.

Ideal method – requirements
• Fast: sequential scanning, with a distance calculation against each and every object, is too slow for large databases.
• Dynamic: it should be easy to insert, delete, and update objects.

Basic idea
Focus on 'whole match' queries: given a collection of S objects O1, ..., OS, a distance/dissimilarity function d(Oi, Oj), and a query object Q, find the data objects that are within distance ε of Q.
Sequential scanning? It may be too slow, for two reasons:
• the distance computation can be expensive (e.g., the edit distance on DNA strings)
• the database size S may be huge
Faster alternative?

GEneric Multimedia INdexIng (Christos Faloutsos; QBIC, 1994)
• A feature extraction function maps the high-dimensional objects into a low-dimensional feature space.
• Objects that are very dissimilar in the feature space are also very dissimilar in the original space.

Basic idea
Faster alternative:
• Step 1: a 'quick and dirty' test to quickly discard the vast majority of non-qualifying objects.
• Step 2: use of SAMs (spatial access methods: R-trees, Hilbert curves, ...) to achieve faster-than-sequential searching.
Example: a database of yearly stock-price movements under the Euclidean distance function. Characterize each sequence with a single number (a 'feature'), or with two or more features.

Basic idea - illustration
[Figure: each 365-day sequence S1, ..., Sn maps to a point F(S1), ..., F(Sn) in the (Feature1, Feature2) plane]
A query with tolerance ε becomes a sphere with radius ε. A sketch of the resulting two-step search follows.
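To make the two-step scheme concrete, here is a minimal Python sketch (illustrative only, not the authors' code): the feature map F, the object distance d, and the feature distance d_feature are assumed to be supplied by the application, and the feature-space scan stands in for a range query on a SAM such as an R-tree.

def gemini_whole_match(db, q, eps, F, d, d_feature):
    """Filter-and-refine whole-match search (sketch)."""
    f_q = F(q)
    # Step 1: 'quick and dirty' filter in the low-dimensional feature space;
    # with the feature points stored in an R-tree this becomes a range query
    # with radius eps around F(q)
    candidates = [o for o in db if d_feature(F(o), f_q) <= eps]
    # Step 2: cleanup - compare the few survivors in the original space
    return [o for o in candidates if d(o, q) <= eps]

The filter may produce false alarms (removed in step 2) but must never produce false dismissals; that requirement is exactly the caution discussed next.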
Basic idea – caution!
The mapping F() from objects to k-dimensional points should not distort the distances.
• d(): the distance between two objects
• d_feature(): the distance between their corresponding feature vectors
Ideally: perfect preservation of the distances. In practice: a guarantee of no false dismissals. How?

Objects represented by vectors that are very dissimilar in the feature space are expected to be very dissimilar in the original space. If the distances in the feature space are always smaller than or equal to the distances in the original space, a bound that is valid in both spaces can be determined: the distance between two similar objects is at most ε in the original space and, consequently, it is at most ε in the feature space as well.

Lower bounding lemma
Let F() satisfy d_feature(F(O1), F(O2)) ≤ d(O1, O2) for all objects O1, O2. If the distance between two similar objects is at most ε in the original space, then it is at most ε in the feature space as well:

    d(O1, O2) ≤ ε   ==>   d_feature(F(O1), F(O2)) ≤ d(O1, O2) ≤ ε      o.k.

The converse direction is WRONG:

    d_feature(F(O1), F(O2)) ≤ ε   =/=>   d(O1, O2) ≤ ε
    (from d_feature(F(O1), F(O2)) ≤ ε and d_feature ≤ d, nothing follows about d(O1, O2))

Consequently, no qualifying object will be missed in the feature space (no false dismissals), but the feature-space answer may contain objects that are not similar in the original space (false hits/alarms).

That means we are guaranteed to select all the objects we want, plus some additional false hits, in the feature space. In a second step, the false hits have to be filtered out of the selected set through comparisons in the original space.

Time sequences
[Figure: a white-noise and a brown-noise sequence, with their Fourier spectra plotted in log-log scale]
Colored noises appear in nature. Conclusion: colored noises are well approximated by their first few Fourier coefficients.

Time sequences
E.g., GEMINI. The important question: how to extract features?
A: ask "if I have only one number to describe my object, what should it be?"

1-D Time Sequences
Distance function: the Euclidean distance. Find features that:
• preserve/lower-bound the distance
• carry as much information as possible (to reduce false alarms)
If we are allowed to use only one feature, what would it be? The average. Extending the idea:
• the average of the 1st half, of the 2nd half, of the 1st quarter, etc.
• the coefficients of the Fourier transform (DFT), of a wavelet transform, etc.

Feature extracting function
1. Define a distance function.
2. Find a feature extraction function F() that satisfies the lower bounding lemma.
Example: the Discrete Fourier Transform (DFT) preserves the Euclidean distances between signals (Parseval's theorem). Take F() = the DFT, keeping only the first few coefficients of the transform.

1-D Time Sequences
Show that the distance in feature space lower-bounds the actual distance. Why the DFT? Parseval's theorem: the DFT preserves the energy of a signal, and hence the distance between two signals:

    d(x, y) = d(X, Y), where X and Y are the Fourier transforms of x and y.

If we keep only the first k ≤ n coefficients of the DFT, we lower-bound the actual distance:

    d_feature(F(x), F(y))^2 = sum_{f=0..k-1} |X_f - Y_f|^2
                            ≤ sum_{f=0..n-1} |X_f - Y_f|^2
                            = sum_{i=0..n-1} |x_i - y_i|^2      (Parseval)
                            = d(x, y)^2

Time sequences - results
Keep the first 2-3 Fourier coefficients: faster than sequential scan, with no false dismissals.
[Figure: total time = cleanup time + R-tree time, vs. the number of coefficients kept]

Time sequences - improvements
• could use wavelets or the DCT
• could use segment averages
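As a sanity check of this derivation, a small Python sketch (numpy assumed; the orthonormal DFT, norm="ortho", is used so that Parseval's theorem holds exactly): the Euclidean distance on the first k coefficients never exceeds the true distance, and equals it when all n coefficients are kept.

import numpy as np

def dft_features(x, k):
    # first k (complex) coefficients of the orthonormal DFT
    return np.fft.fft(x, norm="ortho")[:k]

rng = np.random.default_rng(0)
x, y = rng.standard_normal(365), rng.standard_normal(365)
true_dist = np.linalg.norm(x - y)

for k in (2, 3, 8, 365):
    feat_dist = np.linalg.norm(dft_features(x, k) - dft_features(y, k))
    assert feat_dist <= true_dist + 1e-9   # the lower-bounding lemma in action
    print(f"k={k:3d}  d_feature={feat_dist:7.3f}  <=  d={true_dist:7.3f}")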
Images - color
Q: what is an image? A: a 2-d array of pixels.

2-D color images – Color histograms
• Each color image is a 2-d array of pixels.
• Each pixel has 3 color components (R, G, B).
• A palette of h colors, each denoting a point in the 3-d color space (up to 2^24 distinct colors).
For each image compute the h-element color histogram: each component is the percentage of pixels that are most similar to that color.
The histogram of image I is defined as follows: for a color Ci, H_Ci(I) is the number of pixels of color Ci in image I; or, normalized: H_Ci(I) is the probability that a randomly chosen pixel of image I has color Ci.

2-D color images – Color histograms
Usually similar colors are clustered together and one representative color is chosen for each 'color bin'. Most commercial CBIR systems include the color histogram as one of their features (e.g., QBIC of IBM). Note: the histogram carries no spatial information.

Color histograms - distance
One method to measure the distance between two histograms x and y is the quadratic form

    d_h^2(x, y) = (x - y)^t · A · (x - y) = sum_{i=1..h} sum_{j=1..h} a_ij (x_i - y_i)(x_j - y_j)

where the color-to-color similarity matrix A has entries a_ij that describe the similarity between color i and color j.

Color histograms – lower bounding
1st step: define the distance function between two color images: d() = d_h().
2nd step: find numerical features (one or more) whose Euclidean distance lower-bounds d_h(). If we are allowed one numerical feature per color component to describe the image, what should it be? The average amount of each component (R, G, B):

    x = (R_avg, G_avg, B_avg)^t, with R_avg = (1/P) sum_{p=1..P} R(p), and similarly for G and B,

where P is the number of pixels in the image and R(p) is the red component (intensity) of the p-th pixel.

Color histograms – lower bounding
Given the average color vectors x and y of two images, define d_avg() as the Euclidean distance between the 3-d average color vectors:

    d_avg^2(x, y) = (x - y)^t · (x - y) = sum_{i=1..3} (x_i - y_i)^2

3rd step: prove that the feature distance d_avg() lower-bounds the actual distance d_h(). By the "Quadratic Distance Bounding" theorem, the full histogram distance d_h() between two images is guaranteed to be greater than or equal to the (suitably scaled) distance d_avg() between their average-color vectors. The proof of the theorem is based on an unconstrained minimization problem solved with Lagrange multipliers.
Main idea of the approach: first a filtering step using the average (R, G, B) color, then a more accurate matching step using the full h-element histogram.

Images - color
[Figure: response time vs. selectivity - sequential scan compared with filtering on the average RGB feature]
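A minimal Python sketch of this average-color filter (numpy assumed; an image is an H x W x 3 array of R, G, B values; the names are illustrative). The 3-d average vector is the feature; only images whose average color lies within ε of the query's are compared with the full, expensive histogram distance d_h().

import numpy as np

def avg_color(img):
    # the feature vector (R_avg, G_avg, B_avg): per-component mean over all P pixels
    return img.reshape(-1, 3).mean(axis=0)

def d_avg(img1, img2):
    # Euclidean distance between the two 3-d average-color vectors
    return np.linalg.norm(avg_color(img1) - avg_color(img2))

In the full pipeline these 3-d feature points would be stored in a SAM, and d_h() would be evaluated only on the survivors of the filter.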
Color auto-correlogram
Pick any pixel p1 of color Ci in the image I, and another pixel p2 at distance k from p1: what is the probability that p2 is also of color Ci?
[Figure: image I with a pixel p1 and a pixel p2 at distance k; is p2 also red?]

Color auto-correlogram
The auto-correlogram of image I for color Ci and distance k is defined as

    gamma_Ci^(k)(I) := Pr[ |p1 - p2| = k, p2 ∈ I_Ci | p1 ∈ I_Ci ]

i.e., the probability that a pixel p2 at distance k from a pixel p1 of color Ci also has color Ci. It integrates both color information and spatial information.

Implementations - pixel distance measure
Use the D8 distance (also called the chessboard distance):

    d_max(p, q) = max(|p_x - q_x|, |p_y - q_y|)

and choose the distances k = 1, 3, 5, 7.
Computational complexity:
• histogram: Θ(n^2)
• correlogram: Θ(134 · n^2)

Implementations - feature distance measures
I1 and I2 are similar when D(f(I1) - f(I2)) is small, where f is the extracted feature; below, i ∈ [m] ranges over the color bins and k ∈ [d] over the chosen distances.
For histograms:

    |I - I'|_h := sum_{i ∈ [m]} |h_Ci(I) - h_Ci(I')| / (1 + h_Ci(I) + h_Ci(I'))

For correlograms:

    |I - I'|_gamma := sum_{i ∈ [m], k ∈ [d]} |gamma_Ci^(k)(I) - gamma_Ci^(k)(I')| / (1 + gamma_Ci^(k)(I) + gamma_Ci^(k)(I'))

Color Histogram vs Correlogram
[Figure: top-5 retrieval results for a query image (512 colors), correlogram method vs. histogram method]

Color Histogram vs Correlogram
[Figure: query/target pair - the target is ranked 1st by the correlogram method but only 48th by the histogram method]

Color Histogram vs Correlogram
[Figure: query/target pair - the target is ranked 1st by the correlogram method but only 31st by the histogram method]

Color Histogram vs Correlogram
Rank of the same target for four different queries (C = correlogram, H = histogram):
• Query 1: C 178th, H 230th
• Query 2: C 1st, H 1st
• Query 3: C 1st, H 3rd
• Query 4: C 5th, H 18th
The correlogram method is more stable to contrast and brightness changes than the histogram method.

Color Histogram vs Correlogram
The color correlogram describes the global distribution of the local spatial correlations of colors. It is easy to compute, and it is more stable than the color histogram method.

Images - shapes
Distance function: Euclidean, on the area.
Q: how to do the dimensionality reduction? A: Karhunen-Loève (PCA).
Performance: ~10x faster.
[Figure: log(# of I/Os) vs. the number of features kept, compared with keeping all features]

Multimedia Indexing – Conclusions
GEMINI is a popular method for the whole-match problem. One should pay attention to:
• the distance function
• the feature extraction function
• lower bounding
• the particular application

Conclusions
GEMINI works in any setting (time sequences, images, etc.): it uses a 'quick and dirty' filter and is faster than sequential scan.

GEneric Multimedia INdexIng (recap)
• distance measures
• sub-pattern match
• 'quick and dirty' test
• lower bounding lemma
• 1-D time sequences
• color histograms
• color auto-correlogram
• shapes
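Appendix: a direct, unoptimized Python sketch of the auto-correlogram and the relative-L1 distance defined above (numpy assumed; the image is taken to be already quantized to a small number of color bins; all names are illustrative). The ring at chessboard distance k around a pixel contains 8k pixels, which is what makes the cost grow with the number of distances considered.

import numpy as np

def ring(y, x, k):
    # all pixel coordinates at chessboard (D8) distance exactly k from (y, x)
    for dx in range(-k, k + 1):
        yield y - k, x + dx          # top edge of the ring
        yield y + k, x + dx          # bottom edge
    for dy in range(-k + 1, k):
        yield y + dy, x - k          # left edge (corners already covered)
        yield y + dy, x + k          # right edge

def auto_correlogram(img, n_colors, ks=(1, 3, 5, 7)):
    # img: 2-d array of color-bin indices in [0, n_colors)
    h, w = img.shape
    gamma = np.zeros((n_colors, len(ks)))
    for c in range(n_colors):
        ys, xs = np.nonzero(img == c)            # all pixels of color c
        for ki, k in enumerate(ks):
            same = total = 0
            for y, x in zip(ys, xs):
                for y2, x2 in ring(y, x, k):
                    if 0 <= y2 < h and 0 <= x2 < w:
                        total += 1
                        same += int(img[y2, x2] == c)
            gamma[c, ki] = same / total if total else 0.0
    return gamma

def d_correlogram(g1, g2):
    # the relative-L1 distance |I - I'|_gamma from the slides
    return float(np.sum(np.abs(g1 - g2) / (1.0 + g1 + g2)))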