TOWARDS INTELLIGENT
AUDIO-VISUAL
INFORMATION HANDLING
MPEG-7, formally named “Multimedia Content
Description Interface”, is a standard for describing the multimedia content data that supports some degree of interpretation of the information’s meaning, which can be passed onto, or accessed by, a device or a computer code.
MPEG-7 is not aimed at any one application in particular; rather, the elements that MPEG-7 standardizes support as broad a range of applications as possible (for example, multimedia digital libraries, broadcast media selection, multimedia editing, and home entertainment devices). MPEG-7 will also make the web as searchable for multimedia content as it is for text today. This applies especially to large content archives being made accessible to the public, as well as to multimedia catalogues enabling people to identify content for purchase.
OBJECTIVES OF MPEG-7 STANDARD
The MPEG-7 standard aims at providing standardized core technologies allowing the description of audiovisual data content in multimedia environments. Audiovisual data content that has MPEG-7 data associated with it may include: still pictures, graphics, 3D models, audio, speech, video, and composition information about how these elements are combined in a multimedia presentation (scenarios). Special cases of these general data types may include facial expressions and personal characteristics.
APPLICATION AREAS OF MPEG-7
Broadcast media selection (e.g., radio channel, TV channel).
Cultural services (history museums, art galleries, etc.).
Digital libraries (e.g., image catalogue, musical dictionary, film, video and radio archives).
E-Commerce (e.g., personalised advertising, on-line catalogues, directories of e-shops).
Education (e.g., repositories of multimedia courses, multimedia search for support material).
Home Entertainment (e.g., systems for the management of personal multimedia collections, including manipulation of content, e.g. home video editing, searching a game, karaoke).
Investigation services (e.g., human characteristics recognition, forensics).
Journalism (e.g. searching speeches of a certain politician using his name, his voice or his face).
Multimedia directory services (e.g. yellow pages, Tourist information, Geographical information systems).
Multimedia editing (e.g., personalised electronic news service, media authoring).
Remote sensing (e.g., cartography, ecology, natural resources management).
Shopping (e.g., searching for clothes that you like).
Social (e.g. dating services).
Surveillance (e.g., traffic control, surface transportation, non-destructive testing in hostile environments).
MPEG-7 Description Tools allow the creation of descriptions of content that may include:
Information describing the creation and production processes of the content (director, title, short feature movie)
Information related to the usage of the content (copyright pointers, usage history, broadcast schedule)
Information of the storage features of the content (storage format, encoding)
Structural information on spatial, temporal or spatio-temporal components of the content (scene cuts, segmentation in regions, region motion tracking)
Information about low level features in the content (colors, textures, sound timbres, melody description)
Conceptual information of the reality captured by the content
(objects and events, interactions among objects)
All these descriptions are coded in an efficient way for searching, filtering, etc.
APPLICATION MODEL
MPEG-7 Visual – the Description Tools dealing with Visual descriptions.
MPEG-7 Audio – the Description Tools dealing with Audio descriptions
MPEG-7 Multimedia Description Schemes - the Description Tools dealing with generic features and multimedia descriptions.
MPEG-7 Description Definition Language - the language defining the syntax of the MPEG-7 Description Tools and for defining new Description Schemes.
Structure of the descriptions
The main elements of the MPEG-7 standard are:
• Descriptors (D): representations of features that define the syntax and the semantics of each feature representation;
• Description Schemes (DS): specify the structure and semantics of the relationships between their components; these components may be both Descriptors and Description Schemes;
• Description Definition Language (DDL): allows the creation of new Description Schemes and, possibly, Descriptors, and allows the extension and modification of existing Description Schemes;
• System tools: support multiplexing of descriptions, synchronization issues, transmission mechanisms, coded representations (both textual and binary formats) for efficient storage and transmission, management and protection of intellectual property in MPEG-7 descriptions, etc.
MPEG-7 Description Definition Language (DDL)
The DDL defines the syntactic rules to express and combine Description Schemes and Descriptors. The DDL is a schema language used to represent the results of modeling audiovisual data, i.e. DSs and Ds. It was decided to adopt the XML Schema Language from W3C as the MPEG-7 DDL. The DDL requires some MPEG-7-specific extensions to XML Schema.
WHAT IS XML ?
Extensible Markup Language (XML) is a subset of SGML (an ISO standard). Its goal is to enable generic SGML to be processed on the Web in the same way that is now possible with HTML. XML has been designed for ease of implementation compared to SGML.
XML defines document structure and embeds it directly within the document through the use of markup. Markup is composed of two kinds of tags which encapsulate data: open tags and close tags. XML is similar to HTML, but the tags can be defined by the user. The definition of valid document structure is expressed in a language called DTD (Document Type Definition).
SIMPLE EXAMPLE of an XML
DOCUMENT :
<letter>
<header>
<name>Mr John Smith</name>
<address>
<street>15 rue Lacepede</street>
<city>Paris</city>
</address>
</header>
<text>Dear Mr Doe, .....</text>
</letter>
XML Schema: Structures
XML Schema consists of a set of structural schema components which can be divided into three groups. The primary components are:
• The Schema – the wrapper around the definitions and declarations;
• Simple type definitions;
• Complex type definitions;
• Attribute declarations;
• Element declarations.
The secondary components are:
• Attribute group definitions;
• Identity-constraint definitions;
• Model group definitions;
• Notation declarations.
The third group are the "helper" components which contribute to the other components and cannot stand alone:
• Annotations;
• Model groups;
• Particles;
• Wildcards.
Simple example of a DTD file:
<!DOCTYPE letter [
<!ELEMENT letter (header, text)>
<!ELEMENT header (name, address)>
<!ELEMENT address (street, city)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT text (#PCDATA)>
]>
What is an XML schema ?
The purpose of an XML schema is almost the same as that of a DTD, except that it goes beyond the current functionality of a DTD and allows more precise datatype definitions and easier reuse of structure definitions. A schema can be seen as an extended DTD. Even more important, an XML schema is itself an XML document.
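As an illustration of this point, the letter DTD from the earlier example could be written as an XML Schema document. This is a hedged sketch using the W3C XML Schema namespace; the framing is illustrative, not part of MPEG-7 itself:

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema">
  <element name="letter">
    <complexType>
      <sequence>
        <element name="header">
          <complexType>
            <sequence>
              <element name="name" type="string"/>
              <element name="address">
                <complexType>
                  <sequence>
                    <element name="street" type="string"/>
                    <element name="city" type="string"/>
                  </sequence>
                </complexType>
              </element>
            </sequence>
          </complexType>
        </element>
        <element name="text" type="string"/>
      </sequence>
    </complexType>
  </element>
</schema>
```

Being itself an XML document, such a schema can be parsed and validated with ordinary XML tools.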
XML Schema Overview
The DDL can be broken down into the following logical normative components:
• XML Schema structural components;
• XML Schema datatype components;
• MPEG-7 extensions to XML Schema.
The MPEG-7 DDL is basically the XML Schema Language, but with some MPEG-7-specific extensions such as array and matrix datatypes.
The DDL allows the definition of complexTypes and simpleTypes. complexTypes specify structural constraints, while simpleTypes express datatype constraints.
MPEG-7 Extensions to XML Schema
• Parameterized array sizes;
• Typed references;
• Built-in array and matrix datatypes;
• Enumerated datatypes for MimeType, CountryCode, RegionCode, CurrencyCode and CharacterSetCode.
XML Schema Language parsers available:
XSV - Open Source Edinburgh Schema
Validator (written in Python)
XML Spy - Validating XML Editor
Xerces - Open source XML Parsers in Java and C++
EXAMPLE
<simpleType name="6bitInteger" base="nonNegativeInteger">
<minInclusive value="0"/>
<maxInclusive value="63"/>
</simpleType>
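A hedged usage sketch: an element declared with this simple type, and an instance value that satisfies both facets (the element name is invented for illustration):

```xml
<!-- declaration (hypothetical element name) -->
<element name="ChannelIndex" type="6bitInteger"/>

<!-- instance: any integer from 0 to 63 validates -->
<ChannelIndex>42</ChannelIndex>
```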
A complex type definition is a set of attribute declarations and a content type, applicable to the attributes and children of an element declared to be of this complex type.
<complexType name="Organization">
<element name="OrganizationName" type="string"/>
<element name="ContactPerson" type="Individual" minOccurs="0" maxOccurs="unbounded"/>
<element name="Address" type="Place" minOccurs="0"/>
<attribute name="id" type="ID" use="required"/>
</complexType>
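An instance document conforming to this complex type might look as follows (the name and id value are invented for illustration; the optional children are omitted because their types are defined elsewhere):

```xml
<Organization id="org-01">
  <OrganizationName>MPEG Committee</OrganizationName>
  <!-- ContactPerson (type Individual, optional, repeatable) omitted -->
  <!-- Address (type Place, optional) omitted -->
</Organization>
```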
XML Built-in Primitive Datatypes
XML Schema datatypes:
string;
boolean;
float;
double;
decimal;
timeDuration [ISO 8601];
recurringDuration;
binary;
uriReference;
ID;
IDREF;
ENTITY;
NOTATION;
QName.
MPEG-7 Structural Extensions
Defining Arrays and Matrices
<simpleType name="IntegerMatrix3x4" base="integer" derivedBy="list">
<mpeg7:dimension value="3 4" />
</simpleType>
<element name='IntegerMatrix3x4' type='IntegerMatrix3x4'/>
<IntegerMatrix3x4>
5 8 9 4
6 7 8 2
7 1 3 5
</IntegerMatrix3x4>
VECTORS
<!-- Definition of "Vector of integers" -->
<simpleType name="listOfInteger" base="integer" derivedBy="list"/>
<complexType name="VectorI" base="listOfInteger" derivedBy="extension">
<attribute ref="mpeg7:dim"/>
</complexType>
<!-- Definition of "Vector of reals" -->
<simpleType name="listOfFloat" base="float" derivedBy="list"/>
<complexType name="VectorR" base="listOfFloat" derivedBy="extension">
<attribute ref="mpeg7:dim"/>
</complexType>
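A hedged instance sketch, assuming the mpeg7:dim attribute carries the vector length (the element name and values are invented):

```xml
<!-- hypothetical element of type VectorR -->
<Coefficients mpeg7:dim="4">0.5 1.25 -3.0 0.75</Coefficients>
```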
MPEG-7 Visual Description Tools included in the standard consist of basic structures and
Descriptors that cover the following basic visual features: Color, Texture, Shape, Motion,
Localization, and Face recognition. Each category consists of elementary and sophisticated Descriptors.
Basic Descriptors
There are five visual-related basic structures: the Grid layout, Time series, Multiple view, Spatial 2D coordinates, and Temporal interpolation.
Color Descriptors
There are seven Color Descriptors: Color space,
Color Quantization, Dominant Colors, Scalable Color,
Color Layout, Color-Structure, and GoF/GoP Color.
VISUAL DESCRIPTORS AND
DESCRIPTION SCHEMES
Descriptors: representations of features that define the syntax and the semantics of each feature representation.
Description Schemes: specify the structure and semantics of the relationships between their components, which may be both
Descriptors and Description Schemes
BASIC STRUCTURES
GRID LAYOUT
The grid layout is a splitting of the image into a set of rectangular regions, so that each region can be described separately. Each region of the grid can be described in terms of other descriptors such as color or texture.
DDL representation syntax
<element name="GridLayout">
<complexType content="empty">
<attribute name="PartNumberH" datatype="positiveInteger"/>
<attribute name="PartNumberV" datatype="positiveInteger"/>
</complexType>
</element>
PartNumberH (16 bit)
This field contains the number of horizontal partitions in the grid over the image.
PartNumberV (16 bit)
This field contains the number of vertical partitions in the grid over the image.
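For example, an instance splitting an image into a 4 by 3 grid would be (values illustrative):

```xml
<GridLayout PartNumberH="4" PartNumberV="3"/>
```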
COLOR
Several color spaces are supported:
- RGB
- YCbCr
- HSV
- HMMD
- Linear transformation matrix with reference to RGB
- Monochrome
DDL representation syntax
<element name="ColorSpace">
<complexType>
<choice>
<element name="RGB" type="emptyType"/>
<element name="YCbCr" type="emptyType"/>
<element name="HSV" type="emptyType"/>
<element name="HMMD" type="emptyType"/>
<element name="LinearMatrix">
<complexType base="IntegerMatrix" derivedBy="restriction">
<!-- matrix element as 16-bit unsigned integer -->
<minInclusive value="0"/>
<maxInclusive value="65535"/>
<attribute name="sizes" use="fixed" value="3 3"/>
</complexType>
</element>
<element name="Monochrome" type="emptyType"/>
</choice>
</complexType>
</element>
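Two illustrative instances of this choice: a simple named color space, and a linear transform with reference to RGB (the matrix values are invented for illustration):

```xml
<ColorSpace><HSV/></ColorSpace>

<ColorSpace>
  <LinearMatrix sizes="3 3">
    77 150 29
    131 110 21
    25 49 182
  </LinearMatrix>
</ColorSpace>
```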
[Figure: HMMD space representation, showing the Hue, Max, Min, Diff and Sum components between the black and white color points.]
HMMD SPACE REPRESENTATION
Color quantization
This descriptor defines the quantization of a color space. The following quantization types are supported: uniform, subspace_uniform, subspace_nonuniform and lookup_table.
Dominant color
This descriptor specifies a set of dominant colors in an arbitrarily-shaped region. It targets content-based retrieval by color, either for the whole image or for an arbitrary region (rectangular or irregular).
DDL representation syntax
<element name="DominantColor">
<complexType>
<element ref="ColorSpace"/>
<element ref="ColorQuantization"/>
<element name="DomColorValues" minOccursPar="DomColorsNumber">
<complexType>
<element name="Percentage" type="unsigned5"/>
<element name="ColorValueIndex">
<simpleType base="unsigned12" derivedBy="list">
<length valuePar="ColorSpaceDim"/>
</simpleType>
</element>
<element name="ColorVariance" minOccurs="0" maxOccurs="1">
<simpleType base="unsigned1" derivedBy="list">
<length valuePar="ColorSpaceDim"/>
</simpleType>
</element>
</complexType>
</element>
</complexType>
</element>
Descriptor semantics
DomColorsNumber
This element specifies the number of dominant colors in the region.
The maximum allowed number of dominant colors is 8, the minimum number of dominant colors is 1.
VariancePresent
This is a flag used only in binary representation that signals the presence of the color variances in the descriptor.
SpatialCoherency
The image spatial variance (coherency) per dominant color captures whether or not a given dominant color is coherent and appears to be a solid color in the given image region.
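A hedged instance sketch for a region with two dominant colors (all values are invented; Percentage is a 5-bit value and ColorValueIndex carries one entry per color-space dimension):

```xml
<DominantColor>
  <ColorSpace><RGB/></ColorSpace>
  <!-- ColorQuantization element omitted for brevity -->
  <DomColorValues>
    <Percentage>19</Percentage>
    <ColorValueIndex>200 30 30</ColorValueIndex>
  </DomColorValues>
  <DomColorValues>
    <Percentage>12</Percentage>
    <ColorValueIndex>10 10 120</ColorValueIndex>
  </DomColorValues>
</DominantColor>
```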
NON-COHERENT AND COHERENT REGIONS
DESCRIPTORS ALREADY DEFINED FOR THE FOLLOWING ATTRIBUTES:
COLOR (COLOR SPACE, QUANTIZATION, DOMINANT COLOR, SCALABLE COLOR, COLOR LAYOUT, COLOR STRUCTURE, COLOR HISTOGRAM FOR GROUP OF FRAMES)
TEXTURE (HOMOGENEOUS, TEXTURE BROWSING, EDGE HISTOGRAM)
SHAPE (REGION SHAPE, CONTOUR SHAPE)
MOTION (CAMERA MOTION, MOTION TRAJECTORY, PARAMETRIC MOTION, MOTION ACTIVITY)
LOCALIZATION (REGION LOCATOR, SPATIO-TEMPORAL LOCATOR (INCLUDES FigureTrajectory, ParameterTrajectory))
TEXTURE
Homogeneous texture
This descriptor provides similarity based image-to-image matching for texture image databases. In order to describe the image texture, energy and energy deviation feature values are extracted from a frequency layout and are used to constitute a texture feature vector for similarity-based retrieval.
[Figure: the polar frequency plane (ω, θ) partitioned into numbered channels C_i for the frequency layout.]
30 ANGULAR CHANNELS FOR FREQUENCY LAYOUT
THE ENERGY FUNCTION IS DEFINED AS FOLLOWS:

p_i = \sum_{\omega=0}^{1} \sum_{\theta=0^{\circ}}^{360^{\circ}} \left[ G_{P_{s,r}}(\omega,\theta) \cdot P(\omega,\theta) \right]^2
P(ω,θ) is the Fourier transform of an image represented in the polar frequency domain
G_{P_{s,r}} is a Gaussian function:

G_{P_{s,r}}(\omega,\theta) = \exp\left(-\frac{(\omega-\omega_s)^2}{2\sigma_{\omega_s}^2}\right) \exp\left(-\frac{(\theta-\theta_r)^2}{2\sigma_{\theta_r}^2}\right)

e_i is the energy in the i-th channel:

e_i = \log_{10}\left(1 + p_i\right)
DDL representation syntax
<element name="HomogeneousTexture">
<complexType>
<attribute name="FeatureType" type="boolean"/>
<element name="AverageFeatureValue" type="unsigned8"/>
<element name="StandardDeviationFeatureValue" type="unsigned8"/>
<element name="EnergyComponents">
<simpleType base="unsigned8" derivedBy="list">
<length value="30"/>
</simpleType>
</element>
<element name="EnergyDeviationComponents" minOccurs="0" maxOccurs="1">
<simpleType base="unsigned8" derivedBy="list">
<length value="30"/>
</simpleType>
</element>
</complexType>
</element>
Texture browsing
This descriptor relates to a perceptual characterisation of texture, similar to a human characterisation, in terms of regularity, coarseness and directionality. This representation is useful for browsing applications and coarse classification of textures. We refer to it as the Perceptual Browsing Component (PBC).
[Figure: example textures of increasing regularity, with RegularityComponent values 00, 01, 10, 11.]
RegularityComponent
This element represents texture’s regularity. A texture is said to be regular if it is a periodic pattern with clear directionalities and of uniform scale
DirectionComponent
This element represents the dominant direction characterising the texture directionality
ScaleComponent
This element represents the coarseness of the texture associated with the corresponding dominant orientation specified in the
DirectionComponent
Edge histogram
The edge histogram descriptor represents the spatial distribution of five types of edges: four directional edges and one non-directional edge, namely (a) vertical edge, (b) horizontal edge, (c) 45-degree edge, (d) 135-degree edge, and (e) non-directional edge.
DDL representation syntax
<element name="EdgeHistogram">
<complexType>
<element name="BinCounts">
<simpleType base="unsigned8" derivedBy="list">
<length value="80"/>
</simpleType>
</element>
</complexType>
</element>
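The length of 80 follows from the layout of the descriptor: the image is divided into 4x4 sub-images, and each sub-image contributes one histogram bin per edge type:

```latex
\underbrace{4 \times 4}_{\text{sub-images}} \;\times\; \underbrace{5}_{\text{edge types}} \;=\; 80 \ \text{bins}
```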
SHAPE
Region shape
The shape of an object may consist of either a single connected region or a set of disjoint regions, and may also contain holes.
SHAPES
The region-based shape descriptor utilizes a set of ART (Angular
Radial Transform) coefficients. ART is a 2-D complex transform defined on a unit disk in polar coordinates,
F_{nm} = \int_{0}^{2\pi} \int_{0}^{1} V_{nm}^{*}(\rho,\theta)\, f(\rho,\theta)\, \rho\, d\rho\, d\theta

where f(\rho,\theta) is an image function in polar coordinates and V_{nm}(\rho,\theta) is the ART basis function. The ART basis functions are separable along the angular and radial directions, i.e.,

V_{nm}(\rho,\theta) = A_m(\theta)\, R_n(\rho)
The angular and radial basis functions are defined as follows:

A_m(\theta) = \frac{1}{2\pi} \exp(jm\theta)

R_n(\rho) = \begin{cases} 1, & n = 0 \\ 2\cos(\pi n \rho), & n \neq 0 \end{cases}
ART BASIS FUNCTIONS
CONTOUR SHAPE
The object contour shape descriptor describes a closed contour of a 2D object or region in an image or video sequence
The object contour-based shape descriptor is based on the Curvature Scale Space
(CSS) representation of the contour
HOW IS THE CONTOUR CALCULATED?
N equidistant points are selected on the contour, starting from an arbitrary point on the contour and following the contour clockwise. The x-coordinates of the selected N points are grouped together and the y-coordinates are also grouped together into two series X, Y. The contour is then gradually smoothed by repetitive application of a low-pass filter with the kernel (0.25,0.5,0.25) to X and Y coordinates of the selected N contour points
GlobalCurvatureVector
This element specifies global parameters of the contour, namely the
Eccentricity and Circularity
FOR A CIRCLE, CIRCULARITY IS

\text{circularity} = \frac{\text{perimeter}^2}{\text{area}} = \frac{(2\pi r)^2}{\pi r^2} = 4\pi
ECCENTRICITY is defined in terms of the central moments of the contour points:

i_{11} = \sum_i (x_i - x_c)(y_i - y_c), \qquad
i_{20} = \sum_i (x_i - x_c)^2, \qquad
i_{02} = \sum_i (y_i - y_c)^2

\text{eccentricity} = \sqrt{\frac{i_{20} + i_{02} + \sqrt{i_{20}^2 + i_{02}^2 - 2\, i_{20}\, i_{02} + 4\, i_{11}^2}}{i_{20} + i_{02} - \sqrt{i_{20}^2 + i_{02}^2 - 2\, i_{20}\, i_{02} + 4\, i_{11}^2}}}
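As a quick check of the circularity measure, compare a circle with a square of side a:

```latex
\text{circle:}\quad \frac{(2\pi r)^2}{\pi r^2} = 4\pi \approx 12.57
\qquad
\text{square:}\quad \frac{(4a)^2}{a^2} = 16
```

The circle attains the minimum value 4π; less compact shapes score higher.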
MOTION
Camera motion
This descriptor characterizes 3-D camera motion parameters. It is based on 3-D camera motion parameter information, which can be automatically extracted or generated by capture devices.
The camera motion types distinguished are: pan (left/right), tilt (up/down), boom (up/down), track (left/right), dolly (forward/backward), and roll.
Motion trajectory
Motion Trajectory is a high-level feature associated with a moving region, defined as the spatio-temporal localization of one of its representative points (such as its centroid).
Parametric motion
This descriptor addresses the motion of objects in video sequences, as well as global motion
Motion activity
The activity descriptor captures the intuitive notion of “intensity of action” or “pace of action” in a video segment. Examples of high activity include scenes such as “goal scoring in a soccer match”, “scoring in a baseball game”, or “a high-speed car chase”. On the other hand, scenes such as “news reader shot”, “an interview scene”, or “a still shot” are perceived as low-action shots.
Localization
Region locator
This descriptor enables localization of regions within images or frames by specifying them with a brief and scalable representation of a Box or a Polygon
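A hedged instance sketch of a box-based localization (the child element name and the coordinate values are invented for illustration; the actual MPEG-7 syntax for Box coordinates is defined in the standard):

```xml
<RegionLocator>
  <!-- hypothetical box: left, top, right, bottom pixel coordinates -->
  <Box>24 36 128 96</Box>
</RegionLocator>
```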
Spatio-temporal locator
The SpatioTemporalLocator describes spatio-temporal regions in a video sequence and provides localization functionality especially for hypermedia applications.
It consists of FigureTrajectory and ParameterTrajectory.
[Figure: a reference region and its motion tracked across successive frames.]
FigureTrajectory
FigureTrajectory describes a spatio-temporal region by trajectories of the representative points of a reference region. Reference regions are represented by three kinds of figures: rectangles, ellipses and polygons. The trajectory of each representative point is described by the TemporalInterpolation descriptor.
ParameterTrajectory
[Figure: motion parameters a_1, a_2, a_3, a_4 of a reference region plotted over time; each parameter trajectory is described by the TemporalInterpolation descriptor.]
ParameterTrajectory describes a spatio-temporal region by a reference region and trajectories of motion parameters. Reference regions are described using the RegionLocator descriptor. The motion parameters and parametric motion model specify a mapping from the reference region to a region of an arbitrary frame.
AUDIO DESCRIPTORS
Audio Framework. The main hook into a description for all audio description schemes and descriptors
Spoken Content DS. A DS representing the output of
Automatic Speech Recognition (ASR).
Timbre Description. A collection of descriptors describing the perceptual features of instrument sounds
Audio Independent Components. A DS containing an
Independent Component Analysis (ICA) of audio
EXAMPLES
AudioPowerType describes the temporally-smoothed instantaneous power
<!-- definition of "AudioPowerType" -->
<complexType name="AudioPowerType" base="mpeg7:AudioSampledType" derivedBy="extension">
<element name="Value" type="mpeg7:SeriesOfScalarType" maxOccurs="unbounded"/>
</complexType>
AudioSpectrumCentroidType describes the center of gravity of the log-frequency power spectrum
<!-- Center of gravity of log-frequency power spectrum -->
<complexType name="AudioSpectrumCentroidType" base="mpeg7:AudioSampledType" derivedBy="extension">
<element name="Value" type="mpeg7:SeriesOfScalarType" maxOccurs="unbounded"/>
</complexType>
THERE ARE QUITE A FEW AUDIO Ds:
AudioDescriptorType
AudioSampledType
AudioWaveformEnvelopeType
AudioSpectrumEnvelopeType
AudioPowerType
AudioSpectrumCentroidType
AudioSpectrumSpreadType
AudioFundamentalFrequencyType
AudioHarmonicityType
SPOKEN CONTENT DESCRIPTORS
Spoken Content DS consists of combined word and phone lattices for each speaker in an audio stream
The DS can be used for two broad classes of retrieval scenario: indexing into and retrieval of an audio stream, and indexing of multimedia objects annotated with speech
EXAMPLE APPLICATIONS
a) Recall of audio/video data by memorable spoken events. An example would be a film or video recording where a character or person spoke a particular word or sequence of words. The source media would be known, and the query would return a position in the media.
b) Spoken Document Retrieval. In this case, there is a database consisting of separate spoken documents. The result of the query is the relevant documents, and optionally the position in those documents of the matched speech.
A lattice structure for a hypothetical (combined phone and word) decoding of the expression “Taj Mahal drawing …”. It is assumed that the name ‘Taj Mahal’ is out of the vocabulary of the ASR system.
<!-- Definition of the SpokenContentHeader -->
<!-- The header consists of the following components: -->
<!-- 1. The speakers which comprise the audio. There must be at least -->
<!--    one speaker. -->
<!-- 2. The phone lexicons used to represent the speech. -->
<!-- 3. The word lexicons used to represent the speech. -->
<!-- Note: -->
<!-- a) A word or phone lexicon may be used by more than one speaker. -->
<!-- b) There must be at least one word or phone lexicon. -->
TIMBRE DESCRIPTOR
Timbre Descriptors aim at describing perceptual features of instrument sounds. Timbre is currently defined in the literature as the perceptual features that make two sounds having the same pitch and loudness sound different. The aim of the Timbre DS is to describe these perceptual features with a reduced set of descriptors. The descriptors relate to notions such as “attack”, “brightness” or “richness” of a sound.
<DSType name='TimbreDS'>
<SubDSof='AudioDS'/>
<DtypeRef='LogAttackTimeD' minoccurs='0' maxoccurs='1'/>
<DtypeRef='HarmonicSpectralCentroidD' minoccurs='0' maxoccurs='1'/>
<DtypeRef='HarmonicSpectralDeviationD' minoccurs='0' maxoccurs='1'/>
<DtypeRef='HarmonicSpectralStdD' minoccurs='0' maxoccurs='1'/>
<DtypeRef='SpectralCentroidD' minoccurs='0' maxoccurs='1'/>
<DtypeRef='HarmonicSpectralVariationD' minoccurs='0' maxoccurs='1'/>
<DtypeRef='TemporalCentroidD' minoccurs='0' maxoccurs='1'/>
</DSType>
EXAMPLE: LOG-ATTACK-TIME

lat = \log_{10}(T_1 - T_0)

where
T_0 is the time the signal starts;
T_1 is the time the signal reaches its sustained part.
[Figure: signal envelope over time, with the attack phase between T0 and T1.]
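A worked example with invented times: if the signal starts at T_0 = 0.1 s and reaches its sustained part at T_1 = 0.2 s, then

```latex
lat = \log_{10}(0.2 - 0.1) = \log_{10}(0.1) = -1
```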
ESTIMATION OF SOUND TIMBRE DESCRIPTORS