Multimedia Information Retrieval
Modern Information Retrieval Course
Computer Engineering Department, Sharif University of Technology
Spring 2006

Outline
• Introduction
• Text-Based MMIR
• Content-Based Retrieval
• Multimedia IR Model
• Image Retrieval
• Audio Retrieval
• Video Retrieval
• Conclusions

Sharif University, Modern Information Retrieval Course, Spring 2006

Support variety of data
Different kinds of media:
• Image: graph, …
• Audio: music, speech, …
• Video

MMIR Motivations
• Content, content, and more content … How to get what is needed?
• Increasing availability of multimedia information
• Difficult to find, select, filter, manage AV content
• More and more situations where it is necessary to have ‘information about the content’

Key Issues in MMIR

Goals
• Want to make multimedia content searchable like text information, because the value of content depends on how easy it is to find, filter, manage, and use it.
• Need a content description method beyond simple text annotation

MMIR Approaches
• Text-Based MMIR
• Content-Based MMIR

Text-Based
Retrieval based on text associated with the file:
• URL: http://www.host.com/animals/dogs/poodle.gif
• Alt text: <img src=URL alt="picture of poodle">
• Hyperlink text: <a href=URL>Sally the poodle</a>

Text-based Search Engines
• Indexing based on text in the container webpage
• http://www.google.com
• http://www.ditto.com
• …

Keyword-based System
[Diagram labels: User, Information Need, Keyword; Video Database, Automatic Annotation (including filename, video title, caption, related web page)]

Why does this happen?
• Most of these search engines are keyword based
• You have to represent your idea in keywords
• These keywords are expected to appear in the filename or the corresponding webpage

Image: The Google Approach
How does image search work?
Google analyzes the text on the page adjacent to the image, the image caption and dozens of other factors to determine the image content. Google also uses sophisticated algorithms to remove duplicates and ensure that the highest quality images are presented first in your results.
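As a rough illustration of text-based indexing as described above, the sketch below gathers index words for an image from the three sources named on the Text-Based slide: the URL, the alt text, and the hyperlink text. The class name `ImageTextCollector` and the tokenisation rule are assumptions made for this example, and every `<a href>` is treated as if it pointed at an image; real engines use many more signals.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import re


class ImageTextCollector(HTMLParser):
    """Hypothetical helper: collect words associated with each image URL."""

    def __init__(self):
        super().__init__()
        self.terms = {}            # image URL -> list of associated words
        self._link_target = None   # href of the <a> element currently open

    def _add(self, url, text):
        # Lower-case the text and keep alphanumeric runs as index words.
        words = re.findall(r"[a-z0-9]+", text.lower())
        self.terms.setdefault(url, []).extend(words)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            url = attrs["src"]
            self._add(url, urlparse(url).path)       # words from the URL path
            self._add(url, attrs.get("alt") or "")   # words from the alt text
        elif tag == "a" and attrs.get("href"):
            self._link_target = attrs["href"]

    def handle_endtag(self, tag):
        if tag == "a":
            self._link_target = None

    def handle_data(self, data):
        if self._link_target:                        # hyperlink text
            self._add(self._link_target, data)


page = """
<img src="http://www.host.com/animals/dogs/poodle.gif" alt="picture of poodle">
<a href="http://www.host.com/animals/dogs/poodle.gif">Sally the poodle</a>
"""
collector = ImageTextCollector()
collector.feed(page)
index_terms = collector.terms["http://www.host.com/animals/dogs/poodle.gif"]
```

A keyword query such as "poodle" would then match this image through any of the three text sources, which is also why such a system cannot answer queries about visual properties that nobody wrote down.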
Example queries: Campanile tcd; Cliffs of Moher
• Recall may not be great…

Google Image Search

Problems with Text-Based
• The text in the ALT tag has to be written manually
    • Expensive
    • Time consuming
• It is incomplete and subjective
• Some features are difficult to define in text, such as texture or object shape

Therefore…
• Unable to handle semantic meaning of images
• Unable to handle visual position
• Unable to handle time information
• Unable to use images as query
• …

So…
• Better for simple concepts, e.g. a picture of a giraffe
• Doesn’t work for complex queries, e.g. a picture of a brick home with black shutters and white pillars, with a pickup truck in front of it

Architecture for Multimedia Retrieval
[Diagram labels: feature extraction; AV description (manual / automatic); storage; encoding (for transmission); transmission; decoding (for transmission); search / query (pull); browse; filter (push); conf. points; human or machine]

Query-retrieval matrix
[Figure: matrix of query types (text, stills, sketch, speech, sound, humming, examples) against document types (text, video, images, speech, music, sketches, multimedia). Example cells: conventional text retrieval; type “floods” and get BBC radio news; you roar and get a wildlife documentary; hum a tune and get a music piece]

Main Components
• Feature Extraction & Analysis
• Description Schemes
• Searching & Filtering
Examples:
• IBM’s Query By Image Content (QBIC)
• Virage’s VIR Image Engine
• Online: http://collage.nhil.com/

Internal representation
• Using attributes is not sufficient
• Feature: information extracted from objects
• A multimedia object is represented as a set of features
• Features can be assigned manually, automatically, or using a hybrid approach

Features for MMIR
• High-level features: words and phrases from text, speech recognition
• Medium-level features: face detector, region classifiers, outdoor etc.
• Low-level features: Fourier transforms, wavelet decomposition, texture histograms, colour histograms, shape primitives, filter primitives

Internal representation
• Values of some specific features are assigned to an object by comparing the object with some previously classified objects
• Feature extraction cannot be precise
• A weight is usually assigned to each feature value, representing the uncertainty of assigning such a value to that feature
    • e.g. 80% sure that a shape is a square
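The weighted internal representation above can be sketched as follows: each object carries a set of features, and each feature value has a confidence weight. All names here (`FeatureValue`, `MultimediaObject`, `score`) are hypothetical, and multiplying the weights of several conditions is an independence assumption made only for this sketch, not something the slides prescribe.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureValue:
    value: str
    confidence: float  # weight in [0, 1]: how sure the extractor is


@dataclass
class MultimediaObject:
    name: str
    features: dict = field(default_factory=dict)  # feature name -> FeatureValue

    def match(self, feature, wanted):
        """Degree in [0, 1] to which the object satisfies feature == wanted."""
        fv = self.features.get(feature)
        if fv is None or fv.value != wanted:
            return 0.0
        return fv.confidence


def score(obj, conditions):
    """Combine several conditions by multiplying their degrees of match
    (assumed independence; other combination rules are possible)."""
    result = 1.0
    for feature, wanted in conditions.items():
        result *= obj.match(feature, wanted)
    return result


img = MultimediaObject("img1", {
    "shape": FeatureValue("square", 0.8),   # "80% sure that a shape is a square"
    "colour": FeatureValue("red", 0.5),     # made-up second feature
})
square_score = score(img, {"shape": "square"})
red_square_score = score(img, {"shape": "square", "colour": "red"})
```

Because the match is a degree rather than a yes/no answer, results can be ranked instead of filtered, which is the behaviour a semantic query needs.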
MMIR Model’s Main Components
• Query Language
• Indexing and Searching

Query languages
In designing a multimedia query language, two main aspects require attention:
• How the user enters his/her request to the system
• Which conditions on multimedia objects can be specified in the user request

Request specification
• Interfaces
    • Browsing and navigation
    • Specifying the conditions the objects of interest must satisfy, by means of queries
• Queries can be specified in two different ways
    • Using a specific query language
    • Query by example: using actual data (object example)

Conditions on multimedia data
Query predicates:
• Attribute predicates
    • Concern the attributes for which an exact value is supplied for each object
    • Exact-match retrieval
• Structural predicates
    • Concern the structure of multimedia objects
    • Can be answered by metadata and information about the database schema
    • “Find all multimedia objects containing at least one image and a video clip”
• Semantic predicates
    • Concern the semantic content of the required data, depending on the features that have been extracted and stored for each multimedia object
    • “Find all the red houses”
    • Exact match cannot be applied

Indexing and searching
• Searching similar patterns
• Distance function: given two objects, O1 and O2, the distance (= dissimilarity) of the two objects is denoted by D(O1, O2)
• Similarity queries
    • Whole match
    • Sub-pattern match
    • Nearest neighbors
    • All pairs

Spatial access methods
• Map objects into points in f-D space, and use multiattribute access methods (also referred to as spatial access methods or SAMs) to cluster them and to search for them
• Methods
    • R*-trees and the rest of the R-tree family
    • Linear quadtrees
    • Grid-files
• Linear quadtrees and grid files explode exponentially with the dimensionality

R-tree
• Represents a spatial object by its minimum bounding rectangle (MBR)
• Data rectangles are grouped to form parent nodes (recursively grouped)
• The MBR of a parent node completely contains the MBRs of its children
• MBRs are allowed to overlap
• Nodes of the tree correspond to disk pages

Visual Features
• Texture
• Colour
• Shape
• ...

Histograms
Greyscale histogram of image A, assuming 256 intensity levels:
• $h_A(l)$, $l = 1, \dots, 256$
• $h_A(l) = \#\{(i,j) \mid A(i,j) = l,\ i = 1, \dots, m,\ j = 1, \dots, n\}$
• i.e. a count of the number of pixels at each level

Colour Histogram
• Describes the colours and their percentages in an image.
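The greyscale histogram defined above is just a per-level pixel count, and dividing by the number of pixels gives the percentages a colour histogram records. A minimal sketch, assuming the image is a list of rows of integer intensities and that levels are numbered 0..255 (the slide numbers them 1..256); the function names are illustrative:

```python
def grey_histogram(image, levels=256):
    """h[l] = number of pixels (i, j) with image[i][j] == l."""
    h = [0] * levels
    for row in image:
        for pixel in row:
            h[pixel] += 1
    return h


def normalised(h):
    """Turn counts into fractions of the image that sum to 1."""
    total = sum(h)
    return [count / total for count in h]


# A tiny 2 x 3 test image (m = 2 rows, n = 3 columns).
A = [
    [0, 0, 255],
    [0, 128, 255],
]
h_A = grey_histogram(A)   # counts per intensity level
p_A = normalised(h_A)     # fraction of pixels per intensity level
```

For a colour image the same counting can be applied per channel or over quantised colour values; the histogram discards all spatial layout, which is one reason similar histograms do not guarantee similar images.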
Written as a set of (colour value, percentage) pairs:
$$f_c = \{(I_j, P_j) \mid I_j \in \text{ColorValue},\ 0 \le P_j \le 1,\ \sum_{1 \le j \le N} P_j = 1,\ \text{and } 1 \le j \le N\}$$

Texture Matching
• Texture characterizes small-scale regularity
    • Color describes pixels, texture describes regions
• Described by several types of features
    • e.g., smoothness, periodicity, directionality
• Perform weighted vector space matching
    • Usually in combination with a color histogram

Texture Test Patterns

Image Retrieval using low level features
• See IBM demos at:
    • http://wwwqbic.almaden.ibm.com/
    • http://mp7.watson.ibm.com/ (video)
• Hermitage Museum: www.hermitagemuseum.org

Berkeley Blobworld

But…
• Low-level features don’t work in all cases

Solution: Regional Low-level Image Feature
• Segmentation into objects
• Extract low-level features from each region

Solution: High-level Image Feature
• Objects: persons, roads, cars, skies…
• Scenes: indoors, outdoors, cityscape, landscape, water, office, factory…
• Events: parade, explosion, picnic, playing soccer…
• Generated from low-level features

Audio Genres
Important types of audio data:
• Speech-centered: radio programs, telephone conversations, recorded meetings
• Music-centered: instrumental, vocal
• Other sources: alarms, instrumentation, surveillance, …

Speech-based Documents
• Radio/TV news retrieval
• Search archival radio/news broadcasts
• Video and audio email
• Knowledge management: transfer of tacit knowledge to others
• Search audio archives of meetings, lectures, etc.

Preamble
• Two utterances of the same words by the same person under the same conditions generate very different waveforms.
• Loudness, pitch, brightness, bandwidth, harmonicity, and other sources of variation are all continuous variables; they are the audio equivalent of color and texture in images.
Detectable Speech Features
• Content: phonemes, one-best word recognition, n-best
• Identity: speaker identification, speaker segmentation
• Language: language, dialect, accent
• Other measurable parameters: time, duration, channel, environment

How Speech Recognition Works
Three stages:
• What sounds were made?
    • Convert from waveform to subword units (phonemes)
• How could the sounds be grouped into words?
    • Identify the most probable word segmentation points
• Which of the possible words were spoken?
    • Based on likelihood of possible multiword sequences
All three stages are learned from training data
• Using hill climbing (a “Hidden Markov Model”)

Speech Recognition
[Diagram labels: stages Phoneme Detection, Word Construction, Word Selection; resources phoneme n-grams, phoneme transcription dictionary, word n-gram language model; outputs one-best phoneme transcription, N-best phoneme sequences, phoneme lattices, words, one-best word transcript]

Music and audio analysis
• Music is a large and extremely variable audio class.
• The range of sounds is large, from music genres to animal cries to synthesizer samples.
• Any of the above can and will occur in combination.

Audio retrieval-by-content
• Requires some measure of audio similarity.
• Most approaches to general audio retrieval take a perceptual approach, using measures such as loudness.
• A neural net can map a sound clip to a text description; an obvious drawback is the subjective nature of audio description.

Sample system: Muscle Fish
• Analyzes sound files for a specific set of psychoacoustic features.
• This results in a vector of attributes that include loudness, pitch, bandwidth and harmonicity.
• Given enough training samples, a Gaussian classifier can be constructed; for retrieval, a Euclidean distance is used as a measure of similarity.
• For retrieval, the distance is computed between a given sound example and all other sound examples (about 400 in the demonstration).
• Sounds are ranked by distance, with the closer ones being more similar.

Music and MIDI retrieval
• Uses archives of MIDI files, which are score-like representations of music intended for musical synthesizers or sequencers.
• Given a melodic query, the MIDI files can be searched for similar melodies.

Polyphonic Music Indexing
• Technique
    • n-grams encode music as text strings using pitch and onsets
    • Index the text words with a text search engine
    • Process the query in the same way
• Application: e.g. Query by Humming

Monophonic pitch n-gramming
Interval: 0 +7 0 +2 0 -2 0 -2 0
• [0 +7 0 +2] → ZGZB
• [+7 0 +2 0] → GZBZ
• [0 +2 0 -2] → ZBZb
Example: musical strings with interval-only representation

Application
• Increasing demand for visual information retrieval
• Retrieve useful information from databases
• Sharing and distributing video data through computer networks
• Example: BBC
    • The BBC archive has 500k+ queries plus 1M new items per year
    • From the BBC:
        • Police car with blue light flashing
        • Government plan to improve reading standards
        • Two shot of Kenneth Clarke and William Hague
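Looking back at the monophonic pitch n-gramming slide, the interval-to-word encoding can be sketched as below. The letter scheme (0 becomes Z, +k the k-th uppercase letter, -k the k-th lowercase letter) is an assumption inferred from the three words shown on the slide, and the function names are illustrative.

```python
def intervals(pitches):
    """Semitone differences between consecutive MIDI pitch numbers."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def encode(interval):
    """Map one interval to a letter: 0 -> 'Z', +k -> k-th uppercase letter,
    -k -> k-th lowercase letter (assumed scheme, usable for |k| <= 25)."""
    if interval == 0:
        return "Z"
    if not 1 <= abs(interval) <= 25:
        raise ValueError("interval outside the range of this toy alphabet")
    letter = chr(ord("A") + abs(interval) - 1)
    return letter if interval > 0 else letter.lower()


def ngram_words(interval_seq, n=4):
    """Slide a window of n intervals over the melody and emit one word each."""
    return ["".join(encode(i) for i in interval_seq[k:k + n])
            for k in range(len(interval_seq) - n + 1)]


melody_intervals = [0, +7, 0, +2, 0, -2, 0, -2, 0]   # sequence from the slide
words = ngram_words(melody_intervals)
# The first three words are ZGZB, GZBZ, ZBZb, matching the slide.
```

Since the words are plain strings, both the music collection and a hummed query (after pitch tracking) can be passed through the same encoder and handed to an ordinary text search engine, which is the point of the technique.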
Video Search
• Active research area

Video Search: Features
• Color
    • Robust to background
    • Independent of size, orientation
    • Color histogram [Swain & Ballard]
    • “Sensitive to noise and sparse”: cumulative histograms [Stricker & Orengo]
    • Color moments
    • Color sets: map RGB color space to Hue Saturation Value, and quantize [Smith, Chang]
    • Color layout: local color features by dividing the image into regions
    • Color autocorrelograms
• Texture
    • One of the earliest image features [Haralick et al., 70s]
    • Co-occurrence matrix: orientation and distance on grayscale pixels
    • Contrast, inverse difference moment, and entropy [Gotlieb & Kreyszig]
    • Human visual texture properties: coarseness, contrast, directionality, line-likeness, regularity and roughness [Tamura et al.]
    • Wavelet transforms [90s]
    • [Smith & Chang] extracted mean and variance from wavelet subbands
    • Gabor filters
    • And so on
• Region segmentation
    • Partition the image into regions
    • Strong segmentation: object segmentation is difficult
    • Weak segmentation: region segmentation based on some homogeneity criteria
• Scene segmentation
    • Shot detection, scene detection
    • Look for changes in color, texture, brightness
    • Context-based scene segmentation applied to certain categories such as broadcast news

Video Search: Features
• Shape
    • Outer boundary based vs. region based
    • Fourier descriptors
    • Moment invariants
    • Finite element method (stiffness matrix: how each point is connected to others; eigenvectors of the matrix)
    • Turning function based (similar to Fourier descriptors), for convex/concave polygons [Arkin et al.]
    • Wavelet transforms leverage multiresolution [Chuang & Kuo]
    • Chamfer matching for comparing 2 shapes (linear dimension rather than area)
    • 3-D object representations using similar invariant features
    • Well-known edge detection algorithms
• Face
    • Face detection is highly reliable
        • Neural networks [Rowley]
        • Wavelet-based histograms of facial features [Schneiderman]
    • Face recognition for video is still a challenging problem
        • EigenFaces: extract eigenvectors and use them as the feature space
• OCR
    • OCR is a fairly successful technology
    • Accurate, especially with good matching vocabularies
    • Script recognition is still an open problem
• ASR
    • Automatic speech recognition is fairly accurate for medium to large vocabulary broadcast-type data
    • Large number of available speech vendors
    • Still open for free conversational speech in noisy conditions

Video Structures
• Image structure
    • Absolute positioning, relative positioning
• Object motion
    • Translation, rotation
• Camera motion
    • Pan, zoom, perspective change
• Shot transitions
    • Cut, fade, dissolve, …

Typical Retrieval Framework
• User: provides query information that represents his information needs
• Database: stores a large collection of video data
• Goal: find the most relevant shots from the database
• Shots: the “paragraphs” of video, typically 20–40 seconds, which are the basic unit of video retrieval

Bridging the Gap
[Diagram labels: Video Database, User, Result]

Automatically Structure Video Data
• The first step for video retrieval: video “programmes” are structured into logical scenes and physical shots
• If dealing with text, then the structure is obvious: paragraph, section, topic, page, etc.
• All text-based indexing, retrieval, linking, etc. builds upon this structure
• Automatic shot boundary detection and selection of representative keyframes is usually the first step

Typical automatic structuring of video
• A video document
• A set of shots
• Keyframe browser combined with transcript or object-based search

Ideal solution
[Diagram labels: Video Database, Video Structure; User, Information Need; understanding the semantic meaning and retrieve; Result]
However:
1. Hard to represent the query in natural language and for the computer to understand
2. Computers have no experience
3. Other representation restrictions, like position and time

Alternative Solution
[Diagram labels: Video Database, Video Structure; User, Information Need; provide evidence of relevant information (text, image, audio); match and combine; Result]

Evidence-based Retrieval System
• General framework for current video retrieval systems
• Video retrieval based on the evidence from both users and database, including
    • Text information
    • Image information
    • Motion information
    • Audio information
• Return a relevance score for each piece of evidence
• Combination of the scores

Keyword-based System
[Diagram labels: Video Database, Video Structure, Automatic Annotation (including filename, video title, caption, related web page), Manual Annotation; User, Information Need, Keyword]
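The score-combination step of the evidence-based framework above can be illustrated with a simple weighted sum. This is only one possible fusion rule: the min-max normalisation, the weights, and all scores below are made up for the example, and the slides do not say which combination method a given system uses.

```python
def min_max(scores):
    """Rescale one modality's scores to [0, 1] so modalities are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {shot: 0.0 for shot in scores}
    return {shot: (s - lo) / (hi - lo) for shot, s in scores.items()}


def combine(evidence, weights):
    """Rank shots by the weighted sum of their normalised per-modality scores."""
    combined = {}
    for modality, scores in evidence.items():
        for shot, s in min_max(scores).items():
            combined[shot] = combined.get(shot, 0.0) + weights[modality] * s
    return sorted(combined, key=combined.get, reverse=True)


evidence = {   # hypothetical relevance scores per shot and modality
    "text":  {"shot1": 2.0, "shot2": 0.5, "shot3": 0.0},
    "image": {"shot1": 0.1, "shot2": 0.9, "shot3": 0.3},
    "audio": {"shot1": 0.2, "shot2": 0.2, "shot3": 0.8},
}
weights = {"text": 0.5, "image": 0.3, "audio": 0.2}   # assumed weights
ranking = combine(evidence, weights)   # best shot first
```

Normalising before summing matters because text, image, and audio matchers typically produce scores on unrelated scales; the weights then express how much each kind of evidence is trusted.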
Manual Annotation
• Manually creating annotation/keywords for image / video data
• Examples: Gettyimage.com (image retrieval)
• Pros: represents the semantic meaning of video
• Cons
    • Time-consuming, labor-intensive
    • Keywords are not enough to represent the information need

Speech and OCR transcription
[Diagram labels: Video Database, Video Structure, Annotation, Speech Transcription, OCR Transcription; User, Information Need, Keyword]

Query using speech/OCR information
• Query: Find pictures of Harry Hertz, Director of the National Quality Program, NIST
• Speech: We’re looking for people that have a broad range of expertise that have business knowledge that have knowledge on quality management on quality improvement and in particular …
• OCR: H,arry Hertz a Director aro 7 wa,i,,ty Program ,Harry Hertz a Director

What do we lack?
[Diagram labels: Video Database, Video Structure, Annotation, Speech Transcription, OCR Transcription, Image Information; User, Information Need, Keyword]

Image-based Retrieval
[Diagram labels: Video Database, Video Structure, Text Information, Image Feature (low-level feature, high-level feature); User, Information Need, Keyword, Query Images]

More Evidence in Video Retrieval
[Diagram labels: Video Database, Video Structure, Text Information, Image Information, Motion Information, Audio Information; User, Information Need, Keyword, Query Images, Motion, Audio]

MPEG-7: The Objective
• Standardize object-based description tools for various types of audiovisual information, allowing fast and efficient content searching, filtering and identification, and addressing a large range of applications.
• New objective for MPEG:
    • MPEG-1, -2 and -4 represent the content itself (‘the bits’)
    • MPEG-7 should represent information about the content (‘the bits about the bits’)

Scope of MPEG-7
[Diagram labels: description creation, description, description consumption; only the description is marked “This is the scope of MPEG-7”]
• Not the description creation
• Not the description consumption
• Just the description!
• The goal is to define the minimum that enables interoperability.

MPEG-7 Terminology: Descriptor
• Descriptor (D): a Descriptor is a representation of a Feature.
• A Descriptor defines the syntax and the semantics of the Feature representation.
Examples (Feature: Descriptor):
• Color: histogram of Y, U, V components
• Shape: ART moments
• Motion: motion field, coefficients of a model
• Audio frequency: average frequency components
• Title: text
• Annotation: text
• Genre: text, index in a thesaurus

Conclusions
• Simple image retrieval is commercially available
    • Color histograms, texture, limited shape information
• Segmentation-based retrieval is still in the lab
    • Keep an eye on the Berkeley group
• Limited audio indexing is practical now
    • Audio feature matching, answering machine detection

Conclusions
• Multimedia IR
    • Text: good solutions exist
    • Video, image, sound: a lot of work to do

Conclusions
• The goal of content-based video retrieval is to build a more intelligent video retrieval engine via semantic meaning
• Many applications in daily life
• Combine evidence from different aspects
• Hot research topic, few business systems
• State-of-the-art performance is still unacceptable for normal users; there is room to improve

Conclusions
Problems with Content-Based MMIR:
• Must have an example image
• Example image is 2-D
    • Hence only that view of the object will be returned
• Large amount of image data
• Similar colour histogram does not equal similar image
• Usually the best results come from a combination of both text and content searching

Conclusions
Combination of multi-modal results:
• Different characteristics between multimodal information
    • Text-based information: better for middle and high level queries
    • Image-based information: better for low and middle level queries
• Combination of multi-modal information

Conclusions
• Challenging research questions
• Draws on computer vision, audio processing, natural language analysis, unstructured document analysis, information retrieval, information visualisation, computer human interaction, artificial intelligence