Content Based Multimedia Signal Processing
Yu Hen Hu
University of Wisconsin – Madison

Outline
• Multimedia Content Description Interface (MPEG-7)
• Video content features
• Spoken content features
• Multimedia indexing and retrieval
• Multimedia summary, filtering
• Other applications

MPEG-7 Overview
• Large amounts of digital content are available
• Easy to create, digitize, and distribute audiovisual content
• Family album syndrome
  – Need to organize, index, and retrieve
• Information overload
  – Need filtering
• MPEG-7 objective
  – Provide interoperability among systems and applications used in the generation, management, distribution, and consumption of audio-visual content descriptions.
  – Help users identify, retrieve, or filter audio-visual information.

Potential Applications of MPEG-7
• Summary
  – Generation of multimedia program guides or content summaries
  – Generation of content descriptions of A/V archives to allow seamless exchange among content creators, aggregators, and consumers
• Filtering
  – Filter and transform multimedia streams in resource-limited environments by matching user preferences, available resources, and content descriptions
• Retrieval
  – Recall music using samples of tunes
  – Recall pictures using sketches of shape, color, movement, or a description of the scenario
• Recommendation
  – Recommend program materials by matching user preferences (profile) to program content
• Indexing
  – Create a family photo or video library index

Content descriptions
• Descriptors (Ds)
  – MPEG-7 contains standardized descriptors for audio, visual, and generic content.
  – It standardizes how these content features are characterized, but not how they are extracted.
  – Different levels of syntactic and semantic description are available.
• Description Schemes (DSs)
  – Specify the structure of and relations among different A/V descriptors
• Description Definition Language (DDL)
  – Standardized language based on XML (eXtensible Markup Language) for defining new Ds and DSs, and for extending or modifying existing Ds and DSs.

Visual Color Descriptors
• Color space: HSV (hue-saturation-value)
  – Scalable color descriptor (SCD): color histogram (uniform, 256 bins) of an image in HSV, encoded by a Haar transform.
• Color layout descriptor:
  – Spatial distribution of color in an arbitrarily shaped region.
• Dominant color descriptor (DCD):
  – Colors are clustered first.
• Color structure descriptor (CSD):
  – Scan the image with a sliding 8x8 window and count how often a particular color appears in the window.
• Group of Frames/Group of Pictures color descriptor

Visual Texture Descriptor
• Texture browsing D.
  – Regularity
    • 0: irregular; 3: periodic
  – Directionality
    • Up to 2 directions
    • 1-6 in 30° increments
  – Coarseness
    • 0: fine; 3: coarse
• Edge histogram D.
  – 16 sub-images
  – 5 (edge direction) bins/sub-image
• Homogeneous texture D. (HTD)
  – Divide frequency space into 30 bins (5 radial, 6 angular)
  – 2D Gabor filter bank applied to each bin
  – Energy and energy deviation in each bin are computed to form the descriptor.

Visual Shape Descriptor
• 3D shape D.
  – Shape spectrum
  – Histogram (100 bins, 12 bits/bin) of a shape index, computed over the 3D surface.
  – Each shape index measures local convexity.
• Region-based D.: ART
  – Angular radial transform
  – Shape analysis based on moments
  – ART basis: $V_{nm}(\rho, \theta) = \exp(jm\theta)\, R_n(\rho)$, with $R_n(\rho) = 2\cos(\pi n \rho)$ for $n \neq 0$ and $R_n(\rho) = 1$ for $n = 0$
• Contour-based shape descriptor
  – Curvature scale space (CSS)
  – N points/curve, successively smoothed by [0.25 0.5 0.25] until the curve becomes convex.
  – The curvature at each point forms a curvature function at that scale.
  – Peaks at each scale are used as features
• 2D/3D descriptors
  – Use multiple 2D descriptors to describe a 3D shape

Visual Motion Descriptor
[Figure: video segment and mosaic; labels: camera, motion activity, motion region trajectory, warping parameter]
• Motion activity D.
  – Intensity
  – Direction of activity
  – Spatial distribution of activity
  – Temporal distribution of motion activity
• Camera motion
  – Panning
  – Booming (lift up)
  – Tracking
  – Tilting
  – Zooming
  – Rolling (around image center)
  – Dollying (backward)
• Warping (w.r.t.
mosaic)
• Motion trajectory
• Parametric motion

MPEG-7 Audio Content Descriptors
• 4 classes of audio signals
  – Pure music
  – Pure speech
  – Pure sound effect
  – Arbitrary sound track
• Audio descriptors
  – Silence Ds: silencetype
  – Sound effect Ds:
    • Audio spectrum
    • Sound effect features
  – Spoken content Ds:
    • Speaker type
    • Link type
    • Extraction info type
    • Confusion info type
  – Timbre Ds:
    • Instrument
    • Harmonic instrument
    • Percussive instrument
  – Melody contour Ds:
    • Contour
    • Meter
    • Beat

Spoken content description
[Figure: blocks Speech waveform, Audio processing, ASR, MPEG-7 Encoder; outputs: Spoken content header, Spoken content lattice]
• Goal: support potentially erroneous decodings extracted by an automatic speech recognition (ASR) system, for robust retrieval.
• Header
  – Word lexicon (vocabulary)
  – Phone lexicon:
    • IPA (International Phonetic Association alphabet)
    • SAMPA (Speech Assessment Methods Phonetic Alphabet)
  – Phone confusion statistics
  – Speaker
• Spoken content lattice (word or phone)
  – Lattice node
  – Word and phone link
[Figure: example lattice with links BORE (P=0.6), IS (P=0.7), HIS (P=0.3)]

Use of Content Features
• Multimedia information retrieval
  – Create searchable archives of A/V materials, e.g. albums, digital libraries
  – Real-world examples:
    • Call routing
    • Technical support
    • On-line manual
    • Shopping
    • Multimedia on demand
• Filtering
  – Automated email sorter
  – Personalized information portal
• Enhance low-level signal processing
  – Coding and trans-coding
  – Post-processing

Content-based Retrieval
[Figure: block diagram with Input Module, Query Module, and Retrieval Module; blocks: Feature extraction (twice), Feature comparison, Feature Database, Image Database, Interactive Query Formation, Browsing & Feedback; endpoints: Multimedia data, User, Output]

Multimedia CBR System Design Issues
• Requirement analysis
  – How the multimedia materials are to be used
  – Determines what set of features is needed
• Archiving
  – How should individual objects be stored? At what granularity?
• Indexing (query) and retrieving
  – With multi-dimensional indices, what is an effective and efficient retrieval method?
  – What is a suitable, perceptually consistent similarity measure?
• User interface
  – Modality? Text, spoken language, or others?
  – Interactive or batch? Will dialogue be available?

Multimedia Archiving
• Facts:
  – Content is often in compressed format and needs large storage space
  – The content index will also occupy storage space
• Issues:
  – Granularity must match the underlying file system
  – Logical versus physical segmentation
  – File allocation on the file system must support multiple-stream access and low latency

Indexing and Retrieving
• Index
  – A very high-dimensional binary vector
  – Encoding of content features
  – Text-based content can be represented with term vectors
  – A/V content features can be either Boolean vectors or term vectors
• Retrieval
  – Retrieval is a pattern classification problem
  – Use the index vector as the feature vector
  – Classify each object as relevant or irrelevant to a query vector (template)
  – A perceptually consistent similarity measure is essential

Term Vector Query
• Each document is represented by a specific term vector
• A term is a keyword or a phrase
• A term vector is a vector of terms. Each dimension of the vector corresponds to a term.
• Dimension of a term vector = total number of distinct terms.
• Example:
  – Set of terms = [tree, cake, happy, cry, mother, father, big, small]
  – Documents = “Father gives me a big cake. I am so happy”, “mother planted a small tree”
  – Term vectors: [0, 1, 1, 0, 0, 1, 1, 0], [1, 0, 0, 0, 1, 0, 0, 1]

Inverse Term Frequency Vector
– A probabilistic term vector representation.
– Relative term frequency (within a document): tf(t,d) = count of term t / # of terms in document d
– Inverse document frequency: df(t) = total count of documents / # of documents containing t
– Weighted term frequency: d_t = tf(t,d) · log[df(t)]
– Inverse document frequency term vector: D = [d_1, d_2, …]

ITF Vector Example
Document 1: The weather is great these days.
Document 2: These are great ideas
Document 3: You look great
Eliminate: the, is, these, are, you

Term      tf(t,1)  tf(t,2)  tf(t,3)  df(t)  D1    D2    D3
weather   1/6      0        0        3      0.08  0.00  0.00
great     1/6      1/4      1/3      1      0.00  0.00  0.00
day       1/6      0        0        3      0.08  0.00  0.00
idea      0        1/4      0        3      0.00  0.12  0.00
look      0        0        1/3      3      0.00  0.00  0.16

(The weights use the base-10 logarithm, e.g. (1/6) · log10(3) ≈ 0.08; the tf denominators count all words in a document, including the eliminated ones.)

Human Computer Interface
• HCI is a match-maker: matching the needs of humans and computers
[Figure: HCI between human and computer. Command: voice, gesture, push button/key, expression, eye. Data: sensation (visual, audio, pressure, smell), virtual environment]

Basic HCI Design Principles
• Consistency: the same command means the same thing
• Intuition: use metaphors that are familiar to the user
• Adaptability: adapt to the user’s skill and style
• Economy: use minimum effort to achieve a goal
• Non-intrusive: do not decide for the user without asking
• Structure: present only relevant information to the user, in a simple manner

User Models
• User profiles:
  – Categorize users using features relevant to tasks
  – Static features: age, sex, etc.
  – Dynamic features: activity logs, etc.
  – Derived features: skill levels, preferences, etc.
• Use of profiles for HCI
  – Adaptation: customize HCI for different categories of users
  – Better understanding of users’ needs

Principles of Dialogue Design
• Feedback: always acknowledge the user’s input
• Status: always inform users where they are in the system
• Escape: provide a graceful way to exit halfway
• Minimal work: minimize the amount of input the user must provide
• Default: provide default values to minimize work
• Help: context-sensitive help
• Undo: allow the user to correct unintentional mistakes
• Consistency

Performance Evaluation
• The document retrieval problem is a hypothesis testing problem:
  – H0: di is relevant to q (r=1)
  – H1: di is irrelevant to q (r=0)
• Type I error (Pe1 = P{r=0|H0}): relevant but not retrieved.
• Type II error (Pe2 = P{r=1|H1}): irrelevant but retrieved.
• Contingency table for evaluating retrieval:

              Retrieved   Not retrieved
  Relevant    w           x
  Irrelevant  y           z

• Precision-recall curve
  – P(recision) = w/(w+y) is a measure of the specificity of the result
  – R(ecall) = w/(w+x) is an indicator of the completeness of the result
• Operating curve
  – Pe1 = x/(w+x) = 1 - R
  – Pe2 = y/(y+z) = F(allout)
• Expected search length = average # of documents that need to be examined to retrieve a given number of relevant documents.
• Subjective criteria

Example: MetaSEEk
• MetaSEEk: a meta-search engine
  – Purpose: retrieving images
  – Method: select and interface with multiple on-line image search engines
  – Search principle: the performance of the search engines and their search options for different classes of queries

A. B. Benitez, M. Beigi, and S.-F. Chang, Using Relevance Feedback in Content-Based Image Metasearch, IEEE Internet Computing, Vol. 2, No. 4, pp. 59-69, July/August 1998

Basic idea of MetaSEEk
• Classify the user queries into different clusters by their visual content.
• Rank the different search engines according to their performance for the different classes of user queries
• Select the search engines and search options according to their rank for the specific query cluster
• Display the search results to the user
• Update these performance measures according to the user feedback

[Figure: Overview of the basic components of a meta-search engine]

Content-Based Visual Query (1)
• Advantage
  – Ease of creating, capturing, and collecting digital imagery
• Approaches
  – Extract significant features (color, texture, shape, structure)
  – Organize feature vectors
  – Compute the closeness of the feature vectors
  – Retrieve matched or most similar images

Content-Based Visual Query (2): Improve Efficiency
• Keyword-based search
  – Match images with particular subjects and narrow down the search scope
• Clustering
  – Classify images into various categories based on their contents
• Indexing
  – Applied to the image feature vectors to support efficient access to the database

Cluster the visual data
• K-means algorithm
  – Simplicity
  – Reduced computation
• Tamura algorithm (for texture)
• For color, feature vectors are calculated using the color histogram
• Using Euclidean distance

[Figure: Conceptual structure of the meta-search database]

Multimedia summary and filtering
• Summary
  – Text: email reading
  – Image: caption generation
  – Video: highlights, storyboard
• Issues:
  – Segmentation
  – Clustering of segments
  – Labeling clusters
  – Association with syntactic and semantic labels
• Filtering
  – Same as retrieval: filter out irrelevant objects based on a given criterion (query)
  – Often needs to be performed based on content features
    • E.g.
      filtering traffic accidents or law violations from traffic monitoring videos

Content based Coding and Post-processing
• Different coding decisions based on low-level content features
  – Coding mode (inter/intra selection)
  – Motion estimation
• Object-based coding
  – Encoding different regions (video object planes, VOPs) separately
  – Using different coders for different types of regions
• Multiple abstraction layer coding
  – An analysis/synthesis approach
  – Synthesize low-level content from higher-level abstraction
    • E.g. texture synthesis
• Content-based post-processing
  – Identify content types and then synthesize low-level content
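
Worked example: weighted term vectors and retrieval metrics

The weighted term frequency from the ITF Vector Example slide and the precision, recall, and fallout formulas from the Performance Evaluation slide can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the method of any particular retrieval system: the function names, the pre-tokenized documents (plurals already reduced to "day" and "idea"), the base-10 logarithm, and the contingency counts passed in at the end are all assumptions made for this example.

```python
import math


def weighted_term_vectors(docs, terms):
    """Build one weighted term vector per document.

    Follows the slide definitions:
      tf(t, d) = count of term t in d / number of words in d
      df(t)    = number of documents / number of documents containing t
      d_t      = tf(t, d) * log10(df(t))   (base 10 is an assumption)
    """
    n_docs = len(docs)
    # Number of documents that contain each term at least once.
    containing = {t: sum(1 for d in docs if t in d) for t in terms}
    vectors = []
    for d in docs:
        vec = []
        for t in terms:
            if containing[t] == 0:
                vec.append(0.0)  # term absent from the whole collection
                continue
            tf = d.count(t) / len(d)
            vec.append(tf * math.log10(n_docs / containing[t]))
        vectors.append(vec)
    return vectors


def retrieval_metrics(w, x, y, z):
    """Precision, recall, and fallout from the contingency table.

    w: relevant and retrieved      x: relevant, not retrieved
    y: irrelevant but retrieved    z: irrelevant, not retrieved
    Assumes w + y, w + x, and y + z are all non-zero.
    """
    precision = w / (w + y)
    recall = w / (w + x)
    fallout = y / (y + z)
    return precision, recall, fallout


# Hypothetical tokenization of the three example documents.
TERMS = ["weather", "great", "day", "idea", "look"]
DOCS = [
    ["the", "weather", "is", "great", "these", "day"],
    ["these", "are", "great", "idea"],
    ["you", "look", "great"],
]

if __name__ == "__main__":
    for vec in weighted_term_vectors(DOCS, TERMS):
        print([round(v, 2) for v in vec])
    # Made-up counts, only to exercise the formulas.
    print(retrieval_metrics(w=30, x=10, y=20, z=40))
```

Rounded to two decimals, the three vectors reproduce the D1, D2, and D3 columns of the example table (0.08, 0.12, and 0.16 for the non-zero entries). The term "great" gets weight 0 in every document because it appears in all three, so log10(df) = log10(1) = 0.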