Image Retrieval Basics
Uichin Lee, KAIST KSE
Slides based on "Relevance Models for Automatic Image and Video Annotation & Retrieval" by R. Manmatha (UMASS)

How do we retrieve images?
• Using Content-based Image Retrieval (CBIR) systems
  – Hard to represent information needs with abstract image features such as color percentages, color layout, and textures.

How do we retrieve images?
• IBM QBIC system: example of retrieval by color

How do we retrieve images?
• Use Google image search!
  – Google uses filenames and surrounding text, and ignores the contents of the images.

How do we retrieve images?
• Using manual annotations
  – Libraries, museums
  – Manual annotation is expensive.
• Example record (picture from the Library of Congress American Memory Collections):
  CREATED/PUBLISHED: 1940 August
  NOTES: Store or cafe with soft drink signs: Coca-Cola, Orange-Crush, Royal Crown, Double Cola and Dr. Pepper.
  SUBJECTS: Carbonated beverages -- Advertisements -- Restaurants -- United States--Mississippi--Natchez -- Slides--Color
  CALL NUMBER: LC-USF35-115

How to retrieve images/videos?
• Retrieval based on similarity search of visual features (similar to traditional IR, with visterms)
  – Doesn't support textual queries
  – Doesn't capture "semantics"
• Alternative: automatically annotate images, then retrieve based on the textual annotations
  – Example annotations: Tiger, grass

Content-based image retrieval
• [Diagram: features are extracted from the image database and from the query; similarity between the two feature sets is computed and the images are ranked.]

Visterms: image vocabulary
• Can we represent all images with a finite set of symbols?
  – Text documents consist of words
  – Images consist of visterms (e.g., V123, V89, V988, V4552, V12336, V2, V765, V9887)

Construction of visterms
1. Segment images into visual segments (e.g., Blobworld, normalized-cuts algorithm)
2. Extract features from each segment
3. Cluster similar segments (e.g., with k-means)
4. Each cluster is a visterm
• Images → segments → visterms (= blob-tokens), e.g., V1, V3, V2, V4, V1, V5, V6, …
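A minimal sketch of steps 2–4 above (feature extraction and clustering), assuming the segments from step 1 are already available as RGB numpy arrays. The color-histogram feature, the use of scikit-learn's KMeans, and the function names (extract_color_histogram, build_visterms, assign_visterm) are illustrative choices, not taken from the original slides or papers.

# Sketch: turn image segments into discrete visterms (steps 2-4 above).
# Assumes segments are already extracted (step 1) as RGB numpy arrays.
import numpy as np
from sklearn.cluster import KMeans

def extract_color_histogram(segment, bins=8):
    """Concatenated per-channel color histogram, normalized to sum to 1."""
    feats = []
    for c in range(3):                               # R, G, B channels
        h, _ = np.histogram(segment[..., c], bins=bins, range=(0, 256))
        feats.append(h)
    feats = np.concatenate(feats).astype(float)
    return feats / feats.sum()

def build_visterms(segments, n_visterms=500):
    """Cluster segment features with k-means; each cluster id is a visterm.
    n_visterms should be much smaller than the number of training segments."""
    X = np.stack([extract_color_histogram(s) for s in segments])
    km = KMeans(n_clusters=n_visterms, n_init=10, random_state=0).fit(X)
    return km                                        # km.labels_[i] = visterm of segment i

def assign_visterm(km, segment):
    """Map a new segment to its visterm (nearest cluster centroid)."""
    return int(km.predict(extract_color_histogram(segment)[None, :])[0])

A real system would use Blobworld or normalized-cuts regions and richer color/texture/shape features, but the clustering step that turns continuous features into a discrete visterm vocabulary is the same.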
Segmentation
• Segment images into parts (tiles or regions)
  – Tiling: break the image down into simple geometric shapes, e.g., (a) 5 tiles, (b) 9 tiles
  – Regioning: break the image down into visually coherent areas, e.g., (c) 5 regions, (d) 9 regions

Image features
• Information about color, texture, or shape extracted from an image (or part of an image) is known as an image feature
• Features
  – Color (e.g., red), texture (e.g., sandy), shape
  – SIFT (Scale-Invariant Feature Transform)*
  – …
• [Figure: color histogram with Red and Orange bins; texture example]
* David G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," IJCV 2004

Discrete visterms
• Segmentation vs. rectangular partition (regioning vs. tiling)
• Result: rectangular partition performs better than segmentation!
  – The visterm model is learned over many images,
  – whereas segmentation operates on one image at a time.

Automatic annotation & retrieval
• Automatically annotate unseen images
  – Start from a training set of annotated images (we do not know which word corresponds to which part of an image)
  – Compute visterms (based on image features)
  – Learn a model and annotate a set of test images
  – Learn all annotations at the same time
• Retrieval based on the annotation output
  – Use a query likelihood language model
  – Rank test images according to the likelihoods

Correspondence (matching)
• Example training images (annotation words → visterms):
  – Tiger, grass → V1, V3
  – Maui, People, Dance → V2, V4, V6
  – Sea, Sand, Sea_Lion → V5, V12, V321
• Now we want to find the relationship between visterms and words
  – e.g., P(Tiger | V1), P(V1 | Tiger), P(Maui | V3, V4)

Correspondence models
• Co-occurrence model
• Translation model
• Normalized & regularized model
• Cross-media relevance model
• Continuous relevance model
• Multiple Bernoulli model
• …

Co-occurrence models
• Mori et al. 1999
• Create a co-occurrence table of visterms and words using a training set of annotated images
• Tends to annotate with high-frequency words
• Context is ignored
  – Needs joint probability models

        w1   w2   w3   w4
  V1    12    2    0    1
  V2    32   40   13   32
  V3    13   12    0    0
  V4    65   43   12    0

  P(w1 | V1) = 12 / (12+2+0+1) = 0.8
  P(V3 | w2) = 12 / (2+40+12+43) ≈ 0.12
  (a short code sketch at the end of this deck reproduces these values)

Cross-media relevance models
• Estimating the relevance model: the joint distribution of words and visterms
  – P(w | I) ≈ P(w | b1...bm) = P(w, b1...bm) / Σ_w' P(w', b1...bm)
• Training:
  – The joint distribution is computed as an expectation over the training images J
  – P(w, b1, ..., bm) = Σ_J P(J) P(w, b1, ..., bm | J)
• Annotation:
  – Compute P(w | I) for each word w
  – Annotate the image with every word in the vocabulary and its associated probability (or pick the top k words)
• Retrieval:
  – Given a query Q, compute the probability of drawing Q from image I: P(Q | I)
  – Rank images according to this probability

J. Jeon, V. Lavrenko and R. Manmatha, "Automatic Image Annotation and Retrieval using Cross-Media Relevance Models," In Proc. SIGIR '03.
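For the co-occurrence table above, the conditional probabilities are just row or column normalizations of the count matrix. A small sketch (numpy assumed) that reproduces the two example values:

# Sketch: conditional probabilities from the visterm/word co-occurrence table above.
import numpy as np

words = ["w1", "w2", "w3", "w4"]
visterms = ["V1", "V2", "V3", "V4"]
counts = np.array([[12,  2,  0,  1],    # V1
                   [32, 40, 13, 32],    # V2
                   [13, 12,  0,  0],    # V3
                   [65, 43, 12,  0]],   # V4
                  dtype=float)

p_w_given_v = counts / counts.sum(axis=1, keepdims=True)  # normalize each visterm row
p_v_given_w = counts / counts.sum(axis=0, keepdims=True)  # normalize each word column

print(p_w_given_v[0, 0])   # P(w1 | V1) = 12/15  = 0.8
print(p_v_given_w[2, 1])   # P(V3 | w2) = 12/97 ≈ 0.12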
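And a sketch of CMRM annotation and query-likelihood retrieval as outlined on the last slide. It assumes, as in the Jeon et al. paper, that the word and the visterms of an image are drawn independently given a training image J, with a uniform prior P(J); the paper's smoothed estimates of P(w|J) and P(b|J) are replaced here with simple add-one smoothing, and the data structures (training_set as a list of (annotation words, visterm list) pairs) are purely illustrative.

# Sketch: Cross-Media Relevance Model annotation + retrieval.
# training_set: list of (annotation_words, visterm_list) pairs for training images J.
# Add-one smoothing stands in for the smoothed estimates used in the paper,
# so that probabilities stay non-zero.
from collections import Counter

def p_given_J(item, items_of_J, vocab_size):
    counts = Counter(items_of_J)
    return (counts[item] + 1.0) / (len(items_of_J) + vocab_size)   # add-one smoothing

def joint_p(w, blobs, training_set, word_vocab, blob_vocab):
    """P(w, b1..bm) = sum_J P(J) * P(w|J) * prod_i P(bi|J)."""
    total = 0.0
    p_J = 1.0 / len(training_set)                  # uniform prior over training images
    for ann_words, vis in training_set:
        p = p_given_J(w, ann_words, len(word_vocab))
        for b in blobs:
            p *= p_given_J(b, vis, len(blob_vocab))
        total += p_J * p
    return total

def annotate(blobs, training_set, word_vocab, blob_vocab, k=5):
    """P(w | I) ~ P(w, b1..bm) / sum_w' P(w', b1..bm); return the top-k words."""
    joints = {w: joint_p(w, blobs, training_set, word_vocab, blob_vocab)
              for w in word_vocab}
    norm = sum(joints.values())
    return sorted(((w, p / norm) for w, p in joints.items()),
                  key=lambda x: -x[1])[:k]

def rank_images(query_words, test_images, training_set, word_vocab, blob_vocab):
    """Query likelihood: P(Q | I) = prod_{w in Q} P(w | I); test_images is a
    list of (image_id, visterm_list) pairs."""
    ranked = []
    for img_id, blobs in test_images:
        joints = {w: joint_p(w, blobs, training_set, word_vocab, blob_vocab)
                  for w in word_vocab}
        norm = sum(joints.values())
        score = 1.0
        for w in query_words:
            score *= joints.get(w, 0.0) / norm
        ranked.append((img_id, score))
    return sorted(ranked, key=lambda x: -x[1])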