MPEG-7 STANDARD


TOWARDS INTELLIGENT AUDIO-VISUAL INFORMATION HANDLING

MPEG-7, formally named “Multimedia Content Description Interface”, is a standard for describing multimedia content data in a way that supports some degree of interpretation of the information’s meaning, and that can be passed to, or accessed by, a device or computer code.

MPEG-7 is not aimed at any one application in particular; rather, the elements that MPEG-7 standardizes support as broad a range of applications as possible.

Applications of MPEG-7

The elements that MPEG-7 standardizes provide support to a broad range of applications (for example, multimedia digital libraries, broadcast media selection, multimedia editing, home entertainment devices, etc.). MPEG-7 will also make the web as searchable for multimedia content as it is searchable for text today. This would apply especially to large content archives, which are being made accessible to the public, as well as to multimedia catalogues enabling people to identify content for purchase.

OBJECTIVES OF MPEG-7 STANDARD

The MPEG-7 standard aims at providing standardized core technologies allowing description of audiovisual data content in multimedia environments. Audiovisual data content that has MPEG-7 data associated with it may include: still pictures, graphics, 3D models, audio, speech, video, and composition information about how these elements are combined in a multimedia presentation (scenarios). Special cases of these general data types may include facial expressions and personal characteristics.

APPLICATION AREAS OF MPEG-7

Broadcast media selection (e.g., radio channel, TV channel).

Cultural services (history museums, art galleries, etc.).

Digital libraries (e.g., image catalogue, musical dictionary, film, video and radio archives).

E-Commerce (e.g., personalised advertising, on-line catalogues, directories of e-shops).

Education (e.g., repositories of multimedia courses, multimedia search for support material).

Home Entertainment (e.g., systems for the management of personal multimedia collections, including manipulation of content, e.g. home video editing, searching a game, karaoke).

Investigation services (e.g., human characteristics recognition, forensics).

Journalism (e.g. searching speeches of a certain politician using his name, his voice or his face).

Multimedia directory services (e.g. yellow pages, Tourist information, Geographical information systems).

Multimedia editing (e.g., personalised electronic news service, media authoring).

Remote sensing (e.g., cartography, ecology, natural resources management).

Shopping (e.g., searching for clothes that you like).

Social (e.g. dating services).

Surveillance (e.g., traffic control, surface transportation, non-destructive testing in hostile environments).

MPEG-7 Description Tools allow the creation of descriptions of content that may include:

Information describing the creation and production processes of the content (director, title, short feature movie)

Information related to the usage of the content (copyright pointers, usage history, broadcast schedule)

Information of the storage features of the content (storage format, encoding)

Structural information on spatial, temporal or spatio-temporal components of the content (scene cuts, segmentation in regions, region motion tracking)

Information about low level features in the content (colors, textures, sound timbres, melody description)

Conceptual information of the reality captured by the content (objects and events, interactions among objects).

All these descriptions are coded in an efficient way for searching, filtering, etc.

APPLICATION MODEL

Parts of the standard

MPEG-7 Visual – the Description Tools dealing with Visual descriptions.

MPEG-7 Audio – the Description Tools dealing with Audio descriptions

MPEG-7 Multimedia Description Schemes - the Description Tools dealing with generic features and multimedia descriptions.

MPEG-7 Description Definition Language - the language defining the syntax of the MPEG-7 Description Tools and for defining new Description Schemes.

Structure of the descriptions

The main elements of the MPEG-7 standard are:

Descriptors (D): representations of features that define the syntax and the semantics of each feature representation.

Description Schemes (DS): specify the structure and semantics of the relationships between their components. These components may be both Descriptors and Description Schemes.

Description Definition Language (DDL): allows the creation of new Description Schemes and, possibly, Descriptors, and allows the extension and modification of existing Description Schemes.

System tools: support multiplexing of descriptions, synchronization issues, transmission mechanisms, coded representations (both textual and binary formats) for efficient storage and transmission, management and protection of intellectual property in MPEG-7 descriptions, etc.

MPEG-7 Description Definition Language (DDL)

The DDL defines the syntactic rules to express and combine Description Schemes and Descriptors.

DDL is a schema language to represent the results of modeling audiovisual data, i.e. DSs and Ds.

It was decided to adopt the XML Schema Language from W3C as the MPEG-7 DDL. The DDL will require some specific extensions to XML Schema.

WHAT IS XML ?

Extensible Markup Language (XML) is a subset of SGML (an ISO standard). Its goal is to enable generic SGML to be processed on the Web in the same way that is now possible with HTML. XML has been designed for ease of implementation compared to SGML.

XML defines document structure and embeds it directly within the document through the use of markup. Markup is composed of two kinds of tags which encapsulate data: open tags and close tags. XML is similar to HTML, but the tags can be defined by the user. The definition of valid document structure is expressed in a language called DTD (Document Type Definition).

SIMPLE EXAMPLE of an XML DOCUMENT:

<letter>

<header>

<name>Mr John Smith</name>

<address>

<street>15 rue Lacepede</street>

<city>Paris</city>

</address>

</header>

<text>Dear Mr Doe, .....</text>

</letter>
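The letter document above can be processed with any XML parser; as a quick illustration, here is a sketch using Python's standard xml.etree.ElementTree. The element names come from the example itself, nothing here is MPEG-7-specific.

```python
import xml.etree.ElementTree as ET

# The letter example, embedded as a string for a self-contained demo.
doc = """
<letter>
  <header>
    <name>Mr John Smith</name>
    <address>
      <street>15 rue Lacepede</street>
      <city>Paris</city>
    </address>
  </header>
  <text>Dear Mr Doe, .....</text>
</letter>
"""

letter = ET.fromstring(doc)
# Navigate the tree using element paths.
print(letter.find("header/name").text)          # Mr John Smith
print(letter.find("header/address/city").text)  # Paris
```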

XML Schema: Structures

XML Schema consists of a set of structural schema components which can be divided into three groups. The primary components are:

The Schema – the wrapper around the definitions and declarations;

Simple type definitions;

Complex type definitions;

Attribute declarations;

Element declarations.

The secondary components are:

Attribute group definitions;

Identity-constraint definitions;

Model group definitions;

Notation declarations.

The third group are the "helper" components which contribute to the other components and cannot stand alone:

Annotations;

Model groups;

Particles;

Wildcards.

Simple example of a DTD file:

<!DOCTYPE letter [

<!ELEMENT letter (header, text)>

<!ELEMENT header (name, address)>

<!ELEMENT address (street, city)>

<!ELEMENT name (#PCDATA)>

<!ELEMENT street (#PCDATA)>

<!ELEMENT city (#PCDATA)>

<!ELEMENT text (#PCDATA)>

]>

What is an XML schema ?

The purpose of an XML schema is almost the same as that of a DTD, except that it goes beyond the current functionality of a DTD and allows more precise datatype definitions and easier reuse of structure definitions. A schema can be seen as an extended DTD. Even more important is that an XML schema is itself an XML document.

XML Schema Overview

The DDL can be broken down into the following logical normative components:

XML Schema Structural components;

XML Schema Datatype components;

MPEG-7 Extensions to XML Schema

The MPEG-7 DDL is basically the XML Schema Language, but with some MPEG-7-specific extensions such as array and matrix datatypes.

The DDL allows the definition of complexTypes and simpleTypes. complexTypes specify structural constraints, while simpleTypes express datatype constraints.

MPEG-7 Extensions to XML Schema

Parameterized array sizes;

Typed references ;

Built-in array and matrix datatypes;

Enumerated datatypes for MimeType, CountryCode, RegionCode, CurrencyCode and CharacterSetCode.

XML Schema Language parsers available:

XSV - Open Source Edinburgh Schema Validator (written in Python)

XML Spy - Validating XML Editor

Xerces - Open source XML Parsers in Java and C++

EXAMPLE

<simpleType name="6bitInteger" base="nonNegativeInteger">

<minInclusive value="0"/>

<maxInclusive value="63"/>

</simpleType>

A complex type definition is a set of attribute declarations and a content type, applicable to the attributes and children of an element declared to be of this complex type

<complexType name="Organization">

<element name="OrganizationName" type="string"/>

<element name="ContactPerson" type="Individual" minOccurs="0" maxOccurs="unbounded"/>

<element name="Address" type="Place" minOccurs="0"/>

<attribute name="id" type="ID" use="required"/>

</complexType>

XML Schema Built-in Primitive Datatypes:

string;

boolean;

float;

double;

decimal;

timeDuration [ISO 8601];

recurringDuration;

binary;

uriReference;

ID;

IDREF;

ENTITY;

NOTATION;

QName.

MPEG-7 Structural Extensions

Defining Arrays and Matrices

<simpleType name="IntegerMatrix3x4" base="integer" derivedBy="list">

<mpeg7:dimension value="3 4" />

</simpleType>

<element name='IntegerMatrix3x4' type='IntegerMatrix3x4'/>

<IntegerMatrix3x4>

5 8 9 4

6 7 8 2

7 1 3 5

</IntegerMatrix3x4>

VECTORS

<!-- Definition of "Vector of integers" -->

<simpleType name="listOfInteger" base="integer" derivedBy="list"/>

<complexType name="VectorI" base="listOfInteger" derivedBy="extension">

<attribute ref="mpeg7:dim"/>

</complexType>

<!-- Definition of "Vector of reals" -->

<simpleType name="listOfFloat" base="float" derivedBy="list"/>

<complexType name="VectorR" base="listOfFloat" derivedBy="extension">

<attribute ref="mpeg7:dim"/>

</complexType>

MPEG-7 Visual

MPEG-7 Visual Description Tools included in the standard consist of basic structures and Descriptors that cover the following basic visual features: Color, Texture, Shape, Motion, Localization, and Face recognition. Each category consists of elementary and sophisticated Descriptors.

Visual Descriptors

Basic Descriptors

There are five visual-related basic structures: the Grid layout, the Time series, the Multiple view, the Spatial 2D coordinates, and the Temporal interpolation.

Color Descriptors

There are seven Color Descriptors: Color space, Color quantization, Dominant Colors, Scalable Color, Color Layout, Color-Structure, and GoF/GoP Color.

VISUAL DESCRIPTORS AND DESCRIPTION SCHEMES

Descriptors: representations of features that define the syntax and the semantics of each feature representation.

Description Schemes: specify the structure and semantics of the relationships between their components, which may be both Descriptors and Description Schemes.

BASIC STRUCTURES

GRID LAYOUT

The grid layout is a splitting of the image into a set of rectangular regions, so that each region can be described separately. Each region of the grid can be described in terms of other descriptors such as color or texture.

DDL representation syntax

<element name="GridLayout">

<complexType content="empty">

<attribute name="PartNumberH" datatype="positiveInteger"/>

<attribute name="PartNumberV" datatype="positiveInteger"/>

</complexType>

</element>

PartNumberH (16 bits)

This field contains the number of horizontal partitions in the grid over the image.

PartNumberV (16 bits)

This field contains the number of vertical partitions in the grid over the image.
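A minimal sketch of the grid-layout idea in Python: split an image (represented here as a plain list of rows) into PartNumberH x PartNumberV rectangular regions so that each region can be described separately. The helper name and the toy image are illustrative only, not part of the standard.

```python
def grid_layout(image, parts_h, parts_v):
    """Split a row-major image into parts_h x parts_v rectangular regions."""
    h, w = len(image), len(image[0])
    regions = []
    for gy in range(parts_v):
        for gx in range(parts_h):
            # Integer partition boundaries cover the image without gaps.
            y0, y1 = gy * h // parts_v, (gy + 1) * h // parts_v
            x0, x1 = gx * w // parts_h, (gx + 1) * w // parts_h
            regions.append([row[x0:x1] for row in image[y0:y1]])
    return regions

image = [[x + 10 * y for x in range(8)] for y in range(4)]  # toy 8x4 "image"
cells = grid_layout(image, parts_h=4, parts_v=2)
print(len(cells))   # 8 regions
print(cells[0])     # top-left 2x2 block: [[0, 1], [10, 11]]
```

Each region in `cells` could then be handed to a color or texture descriptor, as the slide describes.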

COLOR

Color space: several color spaces are supported:

- RGB

- YCbCr

- HSV

- HMMD

- Linear transformation matrix with reference to RGB

- Monochrome

DDL representation syntax

<element name="ColorSpace">

<complexType>

<choice>

<element name="RGB" type="emptyType"/>

<element name="YCbCr" type="emptyType"/>

<element name="HSV" type="emptyType"/>

<element name="HMMD" type="emptyType"/>

<element name="LinearMatrix">

<complexType base="IntegerMatrix" derivedBy="restriction">

<!-- matrix element as 16-bit unsigned integer -->

<minInclusive value="0"/>

<maxInclusive value="65535"/>

<attribute name="sizes" use="fixed" value="3 3"/>

</complexType>

</element>

<element name="Monochrome" type="emptyType"/>

</choice>

</complexType>

</element>
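As a small illustration of moving between the color spaces listed above, Python's stdlib colorsys converts RGB to HSV (all values normalized to [0, 1]). This is generic color math, not MPEG-7 code.

```python
import colorsys

# Pure red: hue 0, full saturation, full value.
h, s, v = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
print(h, s, v)   # 0.0 1.0 1.0

# Pure green lands a third of the way around the hue circle.
h, s, v = colorsys.rgb_to_hsv(0.0, 1.0, 0.0)
print(h)         # 0.333...
```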

[Figure: HMMD space representation, a double cone with axes Hue, Sum, and Diff (derived from Max and Min), spanning from black to white.]

Color quantization

This descriptor defines the quantization of a color space. The following quantization types are supported: uniform, subspace_uniform, subspace_nonuniform and lookup_table.

Dominant color

This descriptor specifies a set of dominant colors in an arbitrarily-shaped region. It targets content-based retrieval for color, either for the whole image or for an arbitrary region (rectangular or irregular).

DDL representation syntax

<element name="DominantColor">

<complexType>

<element ref="ColorSpace"/>

<element ref="ColorQuantization"/>

<element name="DomColorValues" minOccursPar="DomColorsNumber">

<complexType>

<element name="Percentage" type="unsigned5"/>

<element name="ColorValueIndex">

<simpleType base="unsigned12" derivedBy="list">

<length valuePar="ColorSpaceDim"/>

</simpleType>

</element>

<element name="ColorVariance" minOccurs="0" maxOccurs="1">

<simpleType base="unsigned1" derivedBy="list">

<length valuePar="ColorSpaceDim"/>

</simpleType>

</element>

</complexType>

</element>

</complexType>

</element>

Descriptor semantics

DomColorsNumber

This element specifies the number of dominant colors in the region.

The maximum allowed number of dominant colors is 8, the minimum number of dominant colors is 1.

VariancePresent

This is a flag used only in binary representation that signals the presence of the color variances in the descriptor.

SpatialCoherency

The image spatial variance (coherency) per dominant color captures whether or not a given dominant color is coherent and appears to be a solid color in the given image region.
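To make the semantics concrete, here is a rough, hypothetical sketch of dominant-color extraction: quantize each pixel's color, count the quantized cells, and report up to 8 colors with their percentage of the region. Real MPEG-7 extraction clusters colors in the chosen color space; this counting version is only illustrative, and the function name and parameters are invented.

```python
from collections import Counter

def dominant_colors(pixels, levels=4, max_colors=8):
    """Return up to max_colors (color, fraction) pairs for an RGB pixel list."""
    step = 256 // levels
    # Quantize each channel into `levels` uniform bins.
    quantized = [(r // step, g // step, b // step) for r, g, b in pixels]
    counts = Counter(quantized).most_common(max_colors)
    total = len(pixels)
    # Map each bin back to a representative color value.
    return [((r * step, g * step, b * step), n / total) for (r, g, b), n in counts]

region = [(250, 10, 10)] * 75 + [(10, 10, 240)] * 25  # 75% red, 25% blue
for color, pct in dominant_colors(region):
    print(color, pct)   # (192, 0, 0) 0.75 then (0, 0, 192) 0.25
```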

[Figure: examples of non-coherent and coherent regions.]

DESCRIPTORS ALREADY DEFINED FOR THE FOLLOWING ATTRIBUTES:

COLOR (COLOR SPACE, QUANTIZATION, DOMINANT COLOR, SCALABLE COLOR, COLOR LAYOUT, COLOR STRUCTURE, COLOR HISTOGRAM FOR GROUP OF FRAMES)

TEXTURE (HOMOGENEOUS TEXTURE, TEXTURE BROWSING, EDGE HISTOGRAM)

SHAPE (REGION SHAPE, CONTOUR SHAPE)

MOTION (CAMERA MOTION, MOTION TRAJECTORY, PARAMETRIC MOTION, MOTION ACTIVITY)

LOCALIZATION (REGION LOCATOR, SPATIO-TEMPORAL LOCATOR, WHICH INCLUDES FigureTrajectory AND ParameterTrajectory)

TEXTURE

Homogeneous texture

This descriptor provides similarity based image-to-image matching for texture image databases. In order to describe the image texture, energy and energy deviation feature values are extracted from a frequency layout and are used to constitute a texture feature vector for similarity-based retrieval.

[Figure: the frequency layout partitioned into channels C_i, indexed by channel number i, over the angular and radial directions.]

30 ANGULAR CHANNELS FOR FREQUENCY LAYOUT

THE ENERGY FUNCTION IS DEFINED AS FOLLOWS:

$$p_i = \sum_{\omega = 0^{+}}^{1} \sum_{\theta = 0^{\circ}}^{360^{\circ}} \left[ G_{s,r}(\omega, \theta) \cdot P(\omega, \theta) \right]^2$$

$P(\omega, \theta)$ is the Fourier transform of an image represented in the polar frequency domain, and $G_{s,r}$ is the Gaussian weighting function of the channel:

$$G_{s,r}(\omega, \theta) = \exp\left(-\frac{(\omega - \omega_s)^2}{2\sigma_{\omega_s}^2}\right) \exp\left(-\frac{(\theta - \theta_r)^2}{2\sigma_{\theta_r}^2}\right)$$

The energy $e_i$ in channel $i$ is then

$$e_i = \log_{10}(1 + p_i)$$
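The channel-energy computation can be sketched directly in Python: weight samples of the polar spectrum P(ω, θ) by the channel's Gaussian, sum the squares to get p_i, and take e_i = log10(1 + p_i). The toy spectrum samples and parameter values below are invented for illustration; a real extractor obtains P from a 2-D Fourier transform of the image.

```python
import math

def channel_energy(spectrum, w_s, theta_r, sigma_s, sigma_r):
    """e_i = log10(1 + p_i) for one channel centered at (w_s, theta_r)."""
    p = 0.0
    for (w, theta), value in spectrum.items():
        # Gaussian channel weighting in frequency (w) and angle (theta).
        g = math.exp(-(w - w_s) ** 2 / (2 * sigma_s ** 2)) * \
            math.exp(-(theta - theta_r) ** 2 / (2 * sigma_r ** 2))
        p += (g * value) ** 2
    return math.log10(1 + p)

# Toy spectrum sampled at a few (frequency, angle-in-degrees) points.
spectrum = {(0.5, 30.0): 2.0, (0.6, 40.0): 1.0, (0.9, 150.0): 3.0}
e = channel_energy(spectrum, w_s=0.5, theta_r=30.0, sigma_s=0.1, sigma_r=15.0)
print(round(e, 3))
```

A channel centered near a strong spectral component yields a larger energy than one centered far from it, which is what makes the 30-element feature vector discriminative.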

DDL representation syntax

<element name="HomogeneousTexture">

<complexType>

<attribute name="FeatureType" type="boolean"/>

<element name="AverageFeatureValue" type="unsigned8"/>

<element name="StandardDeviationFeatureValue" type="unsigned8"/>

<element name="EnergyComponents">

<simpleType base="unsigned8" derivedBy="list">

<length value="30"/>

</simpleType>

</element>

<element name="EnergyDeviationComponents" minOccurs="0" maxOccurs="1">

<simpleType base="unsigned8" derivedBy="list">

<length value="30"/>

</simpleType>

</element>

</complexType>

</element>

Texture browsing

This descriptor characterises texture perceptually, similar to a human characterisation, in terms of regularity, coarseness and directionality. This representation is useful for browsing applications and coarse classification of textures. We refer to it as the Perceptual Browsing Component (PBC).


RegularityComponent

This element represents the texture's regularity. A texture is said to be regular if it is a periodic pattern with clear directionalities and of uniform scale.

DirectionComponent

This element represents the dominant direction characterising the texture directionality.

ScaleComponent

This element represents the coarseness of the texture associated with the corresponding dominant orientation specified in the DirectionComponent.

Edge histogram

The edge histogram descriptor represents the spatial distribution of five types of edges, namely four directional edges and one non-directional edge: a) vertical edge, b) horizontal edge, c) 45-degree edge, d) 135-degree edge, e) non-directional edge.

DDL representation syntax

<element name="EdgeHistogram">

<complexType>

<element name="BinCounts">

<simpleType base="unsigned8" derivedBy="list">

<length value="80"/>

</simpleType>

</element>

</complexType>

</element>
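A sketch of the per-block classification behind this descriptor: compute five edge strengths on a 2x2 pixel block and keep the strongest one above a threshold. The 2x2 filter coefficients below are the ones commonly cited for this descriptor, and the threshold is an arbitrary choice; accumulating the resulting counts into the 80 histogram bins over 16 sub-images is omitted for brevity.

```python
import math

# Five 2x2 edge filters, applied to a block laid out as (a b / c d).
FILTERS = {
    "vertical":       (1, -1, 1, -1),
    "horizontal":     (1, 1, -1, -1),
    "diagonal_45":    (math.sqrt(2), 0, 0, -math.sqrt(2)),
    "diagonal_135":   (0, math.sqrt(2), -math.sqrt(2), 0),
    "nondirectional": (2, -2, -2, 2),
}

def edge_type(a, b, c, d, threshold=1.0):
    """Classify a 2x2 block, or return None if no edge strength passes."""
    strengths = {name: abs(a * f0 + b * f1 + c * f2 + d * f3)
                 for name, (f0, f1, f2, f3) in FILTERS.items()}
    best = max(strengths, key=strengths.get)
    return best if strengths[best] >= threshold else None

print(edge_type(10, 0, 10, 0))   # vertical
print(edge_type(10, 10, 0, 0))   # horizontal
print(edge_type(5, 5, 5, 5))     # None (flat block, no edge)
```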

SHAPE

Region shape

The shape of an object may consist of either a single connected region or a set of disjoint regions, and may contain holes.

[Figure: example shapes.]

The region-based shape descriptor utilizes a set of ART (Angular Radial Transform) coefficients. ART is a 2-D complex transform defined on a unit disk in polar coordinates:

$$F_{nm} = \langle V_{nm}(\rho, \theta), f(\rho, \theta) \rangle = \int_0^{2\pi} \int_0^1 V_{nm}^{*}(\rho, \theta)\, f(\rho, \theta)\, \rho\, d\rho\, d\theta$$

$f(\rho, \theta)$ is an image function in polar coordinates, and $V_{nm}(\rho, \theta)$ is the ART basis function. The ART basis functions are separable along the angular and radial directions, i.e.,

$$V_{nm}(\rho, \theta) = A_m(\theta)\, R_n(\rho)$$

The angular and radial basis functions are defined as follows:

$$A_m(\theta) = \frac{1}{2\pi} \exp(jm\theta)$$

$$R_n(\rho) = \begin{cases} 1 & n = 0 \\ 2\cos(\pi n \rho) & n \neq 0 \end{cases}$$

[Figure: ART basis functions.]

CONTOUR SHAPE

The object contour shape descriptor describes a closed contour of a 2D object or region in an image or video sequence

The object contour-based shape descriptor is based on the Curvature Scale Space (CSS) representation of the contour.

HOW IS THE CONTOUR CALCULATED?

N equidistant points are selected on the contour, starting from an arbitrary point on the contour and following the contour clockwise. The x-coordinates of the selected N points are grouped together and the y-coordinates are also grouped together into two series X, Y. The contour is then gradually smoothed by repetitive application of a low-pass filter with the kernel (0.25,0.5,0.25) to X and Y coordinates of the selected N contour points
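The smoothing step described above can be sketched in a few lines: one pass of the (0.25, 0.5, 0.25) low-pass kernel over a coordinate series, with wrap-around indexing since the contour is closed. The toy series is invented for illustration.

```python
def smooth(series):
    """One pass of the (0.25, 0.5, 0.25) kernel over a closed-contour series."""
    n = len(series)
    return [0.25 * series[(i - 1) % n] + 0.5 * series[i] + 0.25 * series[(i + 1) % n]
            for i in range(n)]

x = [0.0, 4.0, 0.0, 4.0]   # a jagged x-coordinate series
for _ in range(3):          # repeated application gradually smooths the contour
    x = smooth(x)
print(x)                    # [2.0, 2.0, 2.0, 2.0]
```

In CSS extraction the same filtering is applied to both the X and Y series, and the contour's curvature zero-crossings are tracked as the smoothing progresses.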

GlobalCurvatureVector

This element specifies global parameters of the contour, namely the Eccentricity and the Circularity.

FOR A CIRCLE, CIRCULARITY IS

$$\text{circularity} = \frac{\text{perimeter}^2}{\text{area}}, \qquad C_{circle} = \frac{(2\pi r)^2}{\pi r^2} = 4\pi \approx 12.57$$

$$\text{eccentricity} = \sqrt{\frac{i_{20} + i_{02} + \sqrt{i_{20}^2 + i_{02}^2 - 2 i_{20} i_{02} + 4 i_{11}^2}}{i_{20} + i_{02} - \sqrt{i_{20}^2 + i_{02}^2 - 2 i_{20} i_{02} + 4 i_{11}^2}}}$$

where the central moments of the contour points $(x, y)$ about the centroid $(x_c, y_c)$ are:

$$i_{02} = \sum (y - y_c)^2, \qquad i_{20} = \sum (x - x_c)^2, \qquad i_{11} = \sum (x - x_c)(y - y_c)$$
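Both global contour parameters can be checked numerically. The sketch below computes circularity from a perimeter and area, and eccentricity from the central moments of a list of contour points; the inner square root uses the algebraically equivalent form (i20 - i02)^2 + 4*i11^2, which cannot go negative from rounding. Function names are illustrative.

```python
import math

def circularity(perimeter, area):
    return perimeter ** 2 / area

def eccentricity(points):
    xc = sum(x for x, _ in points) / len(points)
    yc = sum(y for _, y in points) / len(points)
    # Central moments about the centroid.
    i20 = sum((x - xc) ** 2 for x, _ in points)
    i02 = sum((y - yc) ** 2 for _, y in points)
    i11 = sum((x - xc) * (y - yc) for x, y in points)
    root = math.sqrt((i20 - i02) ** 2 + 4 * i11 ** 2)
    return math.sqrt((i20 + i02 + root) / (i20 + i02 - root))

# A circle of radius 3: circularity = (2*pi*r)^2 / (pi*r^2) = 4*pi.
print(round(circularity(2 * math.pi * 3.0, math.pi * 9.0), 4))   # 12.5664

circle = [(math.cos(t), math.sin(t))
          for t in [2 * math.pi * k / 100 for k in range(100)]]
print(round(eccentricity(circle), 3))   # ~1.0 for a circle
```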

MOTION

Camera motion

This descriptor characterizes 3-D camera motion parameters. It is based on 3-D camera motion parameter information, which can be automatically extracted or generated by capture devices

[Figure: the basic camera operations: pan (left/right), tilt (up/down), boom (up/down), track (left/right), dolly (forward/backward), and roll.]

Motion trajectory

Motion Trajectory is a high-level feature associated with a moving region, defined as a spatio-temporal localization of one of its representative points (such as the centroid).

Parametric motion

This descriptor addresses the motion of objects in video sequences, as well as global motion

Motion activity

The activity descriptor captures the intuitive notion of “intensity of action” or “pace of action” in a video segment. Examples of high activity include scenes such as “goal scoring in a soccer match”, “scoring in a baseball game”, or “a high speed car chase”. On the other hand, scenes such as “news reader shot”, “an interview scene”, or “a still shot” are perceived as low-action shots.

Localization

Region locator

This descriptor enables localization of regions within images or frames by specifying them with a brief and scalable representation of a Box or a Polygon

Spatio-temporal locator

The SpatioTemporalLocator describes spatio-temporal regions in a video sequence and provides localization functionality especially for hypermedia applications.

It consists of FigureTrajectory and ParameterTrajectory.

[Figure: a spatio-temporal region built from reference regions and their motion.]

FigureTrajectory

FigureTrajectory describes a spatio-temporal region by trajectories of the representative points of a reference region. Reference regions are represented by three kinds of figures: rectangles, ellipses and polygons. The trajectories of the representative points are described by the TemporalInterpolation descriptor.

ParameterTrajectory

[Figure: motion parameters a1, a2, a3, a4 as functions of time, described by the TemporalInterpolation descriptor.]

ParameterTrajectory describes a spatio-temporal region by a reference region and trajectories of motion parameters. Reference regions are described using the RegionLocator descriptor. The motion parameters and a parametric motion model specify a mapping from the reference region to a region of an arbitrary frame.

AUDIO DESCRIPTORS

Audio Framework. The main hook into a description for all audio description schemes and descriptors

Spoken Content DS. A DS representing the output of Automatic Speech Recognition (ASR).

Timbre Description. A collection of descriptors describing the perceptual features of instrument sounds

Audio Independent Components. A DS containing an Independent Component Analysis (ICA) of audio.

EXAMPLES

AudioPowerType describes the temporally-smoothed instantaneous power

<!-- definition of "AudioPowerType" -->

<complexType name="AudioPowerType" base="mpeg7:AudioSampledType" derivedBy="extension">

<element name="Value" type="mpeg7:SeriesOfScalarType" maxOccurs="unbounded"/>

</complexType>
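The idea behind AudioPower can be sketched as instantaneous power x[n]^2 smoothed by averaging over short frames. The frame size and the step-shaped test signal are arbitrary choices for illustration; a real extractor would follow the descriptor's sampling parameters.

```python
def audio_power(samples, frame_size):
    """Mean squared amplitude per non-overlapping frame."""
    return [sum(s * s for s in samples[i:i + frame_size]) / frame_size
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

signal = [0.5] * 8 + [1.0] * 8              # a step in amplitude
print(audio_power(signal, frame_size=8))    # [0.25, 1.0]
```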

AudioSpectrumCentroidType describes the center of gravity of the log-frequency power spectrum

<!-- Center of gravity of log-frequency power spectrum -->

<complexType name="AudioSpectrumCentroidType" base="mpeg7:AudioSampledType" derivedBy="extension">

<element name="Value" type="mpeg7:SeriesOfScalarType" maxOccurs="unbounded"/>

</complexType>
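The spectrum centroid can be sketched as a power-weighted mean over a log-frequency axis. The toy "power spectrum" below is hand-made; a real extractor would derive it from an FFT of the audio frame, and the base-2 log axis is an illustrative choice.

```python
import math

def spectrum_centroid(freqs_hz, powers):
    """Power-weighted mean of log2(frequency)."""
    logf = [math.log2(f) for f in freqs_hz]   # log-frequency axis
    total = sum(powers)
    return sum(lf * p for lf, p in zip(logf, powers)) / total

freqs = [250.0, 500.0, 1000.0, 2000.0]
powers = [0.0, 1.0, 1.0, 0.0]                 # equal power at 500 and 1000 Hz
c = spectrum_centroid(freqs, powers)
print(2 ** c)   # centroid back on the Hz axis: the geometric mean, ~707 Hz
```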

THERE ARE QUITE A FEW AUDIO Ds:

AudioDescriptorType

AudioSampledType

AudioWaveformEnvelopeType

AudioSpectrumEnvelopeType

AudioPowerType

AudioSpectrumCentroidType

AudioSpectrumSpreadType

AudioFundamentalFrequencyType

AudioHarmonicityType

SPOKEN CONTENT DESCRIPTORS

Spoken Content DS consists of combined word and phone lattices for each speaker in an audio stream

The DS can be used for two broad classes of retrieval scenario: indexing into and retrieval of an audio stream, and indexing of multimedia objects annotated with speech

EXAMPLE APPLICATIONS

a) Recall of audio/video data by memorable spoken events. An example would be a film or video recording where a character or person spoke a particular word or sequence of words. The source media would be known, and the query would return a position in the media.

b) Spoken Document Retrieval. In this case, there is a database consisting of separate spoken documents. The result of the query is the relevant documents, and optionally the position in those documents of the matched speech.

A lattice structure for a hypothetical (combined phone and word) decoding of the expression “Taj Mahal drawing …”. It is assumed that the name ‘Taj Mahal’ is out of the vocabulary of the ASR system.

<!-- Definition of the SpokenContentHeader -->

<!-- The header consists of the following components: -->

<!-- 1. The speakers which comprise the audio. There must be at least one speaker. -->

<!-- 2. The phone lexicons used to represent the speech. -->

<!-- 3. The word lexicons used to represent the speech. -->

<!-- Note: -->

<!-- a) A word or phone lexicon may be used by more than one speaker. -->

<!-- b) Although there must be at least one word or phone lexicon. -->

TIMBRE DESCRIPTOR

Timbre Descriptors aim at describing perceptual features of instrument sounds. Timbre is currently defined in the literature as the perceptual features that make two sounds having the same pitch and loudness sound different. The aim of the Timbre DS is to describe these perceptual features with a reduced set of descriptors. The descriptors relate to notions such as “attack”, “brightness” or “richness” of a sound.

<DSType name='TimbreDS'>

<SubDSof='AudioDS'/>

<DtypeRef='LogAttackTimeD' minoccurs='0' maxoccurs='1'/>

<DtypeRef='HarmonicSpectralCentroidD' minoccurs='0' maxoccurs='1'/>

<DtypeRef='HarmonicSpectralDeviationD' minoccurs='0' maxoccurs='1'/>

<DtypeRef='HarmonicSpectralStdD' minoccurs='0' maxoccurs='1'/>

<DtypeRef='SpectralCentroidD' minoccurs='0' maxoccurs='1'/>

<DtypeRef='HarmonicSpectralVariationD' minoccurs='0' maxoccurs='1'/>

<DtypeRef='TemporalCentroidD' minoccurs='0' maxoccurs='1'/>

</DSType>

DEFINITIONS ARE GIVEN. EXAMPLE:

LOG-ATTACK-TIME

$$lat = \log_{10}(T_1 - T_0)$$

where

T0 is the time the signal starts;

T1 is the time the signal reaches its sustained part.

[Figure: signal envelope(t) over time t, with the attack between T0 and T1.]
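The log-attack-time is easy to compute once T0 and T1 are located on the envelope. In this sketch T0 is where the envelope first exceeds a small threshold and T1 is where it reaches its maximum; both conventions, the threshold, and the toy envelope are assumptions for illustration only.

```python
import math

def log_attack_time(envelope, times, start_threshold=0.02):
    """lat = log10(T1 - T0) from a sampled amplitude envelope."""
    # T0: first time the envelope rises above the start threshold.
    t0 = next(t for e, t in zip(envelope, times) if e >= start_threshold)
    # T1: time of the envelope maximum (taken as the start of the sustain).
    t1 = times[envelope.index(max(envelope))]
    return math.log10(t1 - t0)

times = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05]
envelope = [0.0, 0.05, 0.40, 0.90, 1.00, 0.95]   # attack, then sustain
print(log_attack_time(envelope, times))          # log10(0.04 - 0.01) ~ -1.52
```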

ESTIMATION OF SOUND TIMBRE DESCRIPTORS
