Evaluating MPEG-4 Video Decoding Complexity for an
Alternative Video Complexity Verifier Model
J. Valentim, P. Nunes, and F. Pereira
Instituto Superior Técnico (IST) – Instituto de Telecomunicações
Av. Rovisco Pais, 1049-001 Lisboa, Portugal
Phone: + 351 21 841 8460; Fax: + 351 21 841 8472
e-mail: {joao.valentim, paulo.nunes, fernando.pereira}@lx.it.pt

Abstract – MPEG-4 is the first object-based audiovisual coding
standard. To control the minimum decoding complexity resources
required at the decoder, the MPEG-4 Visual standard defines the so-called Video Buffering Verifier mechanism, which includes three
virtual buffer models, among them the Video Complexity Verifier
(VCV). This paper proposes an alternative VCV model, based on a
set of macroblock (MB) relative decoding complexity weights
assigned to the various MB coding types used in MPEG-4 video
coding. The new VCV model allows a more efficient use of the
available decoding resources by preventing the over-evaluation of
the decoding complexity of certain MB types and thus making
it possible to encode scenes (for the same profile@level decoding
resources) which would otherwise be considered too demanding.
Index Terms – Decoding complexity, MPEG-4 standard,
profiles and levels, VCV model.
I. INTRODUCTION
Recognizing that audiovisual content should be created and
represented using a framework that is able to give the user
as many as possible real-world like capabilities, such as
interaction and manipulation, MPEG decided, in 1993, to
launch a new project, well known as MPEG-4 [1]. Since
human beings do not want to interact with abstract entities,
such as pixels, but rather with meaningful entities that are part
of the audiovisual scene, the concept of object is central to
MPEG-4. MPEG-4 models an audiovisual scene as a
composition of audiovisual objects with specific
characteristics and behavior, notably in space and time. The
object composition approach makes it possible to support new
functionalities, such as object-based interaction, manipulation
and hyper-linking, as well as to improve already available
functionalities, such as coding efficiency, by using for each
type of object the most adequate coding tools and parameters.
The MPEG-4 audiovisual coding standard has been
designed to be generic in the sense that it does not target a
particular application but instead includes many coding tools
that can be used for a wide variety of applications, under
different situations, notably in terms of bit rate, type of channel
and storage media, and delay constraints [2]. This toolbox
approach provides the mechanisms to cover a wide range of
audiovisual applications, from mobile multimedia
communications to studio and interactive TV [2, 3].
Since it is not reasonable that all MPEG-4 visual terminals
support the whole MPEG-4 visual toolbox, subsets of the
MPEG-4 Visual standard tools [4] have been defined, using
the concept of profiling, to address classes of applications with
similar functional and operational requirements [5]. A similar
approach has been applied to the audio tools as well as to the
systems tools. This approach allows manufacturers to
implement only the subsets of the standard – profiles – that
they need to achieve particular functionalities, while
maintaining interoperability with other MPEG-4 devices built
under the same conditions, while also restricting the
computational resources required by the given terminals. A
subset of the syntax and semantics corresponding to a subset of
the tools of the MPEG-4 Visual standard [4] defines a visual
profile, while sets of restrictions within each visual profile,
e.g., in terms of computational resources and memory, define
the various levels for a profile [5]. Moreover, since a scene
may include visual objects encoded using different tools,
object types define the syntax of the bitstream for one single
object that can represent a meaningful entity in the
(audiovisual) scene. Note that object types correspond to a set
of tools, in this case applied to each object and not to the scene
as a whole. Following the definition of audio and visual object
types, audio and visual profiles are defined as sets of audio and
visual object types, respectively.
In order that a particular set of visual data bitstreams
building a scene may be considered compliant with a given
MPEG-4 visual profile@level combination, it must not contain
any disallowed syntax element for that profile and additionally
it must not violate the Video Buffering Verifier mechanism
constraints [4]. This mechanism consists of three normative
models, each one defining a set of rules and limits to verify
that the amount of a specific type of decoding resource
required is within the values allowed by the corresponding
profile and level specifications:
 Video Rate Buffer Verifier (VBV) – This model is used
to verify that the bitstream memory required at the
decoder(s) does not exceed the values specified for the
corresponding profile and level. The model is defined in
terms of the VBV buffer sizes for all the Video Object
Layers (VOLs) corresponding to the objects building the
scene. Each VBV buffer size corresponds to the maximum
amount of bits that the decoder can store in the bitstream
memory for the corresponding VOL. There is also a
limitation on the sum of the VOL VBV buffer sizes. The
bitstream memory is the memory where the decoder stores
the bits received for a VOL while they wait to be decoded.
 Video Complexity Verifier (VCV) – This model is used
to verify that the computational power (processing speed)
required at the decoder, and defined in terms of MB/s,
does not exceed the values specified for the corresponding
profile and level. The model is defined in terms of the
VCV MB/s decoding rate and VCV buffer size and is
applied to all MBs in the scene. If arbitrarily shaped
Video Objects (VOs) exist in the scene, an additional
VCV buffer and VCV decoding rate are also defined, to
be applied only to the boundary MBs.
 Video Reference Memory Verifier (VMV) – This
model is used to verify that the picture memory required
at the decoder for the decoding of a given scene does not
exceed the values specified for the corresponding profile
and level. The model is defined in terms of the VMV
buffer size, which is the maximum number of decoded
MBs that the decoder can store during the decoding
process of all VOLs corresponding to the scene.
This paper evaluates the decoding complexity of various
MB coding types included in the Simple and Core object types
(and thus profiles) based on the MB decoding times obtained
with an optimized version of the MPEG-4 reference software
[6]. Following this decoding complexity evaluation, an
alternative model to the current MPEG-4 VCV model [4]
exploiting the relative decoding complexity of various MB
coding types used in MPEG-4 video coding is presented. This
model is based on a closer estimate of the actual decoding
complexity of the various video objects composing a scene,
thus allowing a much better use of the video decoding
resources which may be a critical factor in applications
environments where resources are scarce and expensive, such
as mobile and smart card applications.
II. MPEG-4 VCV MODEL
The MPEG-4 VCV model defines, for each profile@level
combination, a set of rules and limits which when respected at
the encoder ensure that the required decoding computational
power is always available at the decoder (which also respects
the same limits) [4]. The computational power of the decoder
is defined by two buffers and the corresponding MB decoding
rates, measured in MB/s, which specify the drain rate of the
two buffers:
 Boundary-VCV (B-VCV) – The B-VCV buffer keeps
track of the number of boundary MBs.
 VCV – The VCV buffer keeps track of the number of all
MBs without distinction.
Compliance regarding the VCV model can only be
guaranteed if these buffers never overflow. For each of these
buffers, the buffer size and the decoding rate are specified for
each profile@level combination. The buffer size and the
decoding rate are defined in terms of MBs and MB/s, without
any differentiation in terms of MB types. The VCV and B-VCV
buffer sizes and decoding rates for the video
profile@level(s)1 studied in this paper are shown in Table I.
TABLE I
VCV AND B-VCV BUFFER SIZE AND DECODING RATE FOR THE SIMPLE AND
CORE VIDEO PROFILE@LEVEL(S)

Profile@Level   VCV/B-VCV buffer size (MB)   VCV decoding rate (MB/s)   B-VCV decoding rate (MB/s)
Simple@L1       99                           1485                       –
Simple@L2       396                          5940                       –
Simple@L3      396                          11880                      –
Core@L1         198                          5940                       2970
Core@L2         792                          23760                      11880
Since the current MPEG-4 VCV model [4] does not
distinguish the various MB coding types, besides the boundary
versus non-boundary distinction introduced by the B-VCV,
this means that the decoder must be able to decode any set of
MBs that does not overflow the VCV buffers for the given
profile@level, independently of the MB coding type.
Additionally, each Video Object Plane (VOP), i.e., a time
sample of the video object, must be available in the
composition memory for composition at the VOP composition
time plus a fixed delay, the VCV latency, which is the time
needed to decode a full VCV buffer [4]. Since the B-VCV has
half the VCV decoding rate, to fulfill the requirement above
the amount of boundary MBs for each decoding time (this
means for each VOP) cannot exceed 50% of the VCV
capacity. This implies that the decoder must be prepared to
deal with the worst-case scenario, i.e., the case where all MBs
are from the most complex coding type, while observing the
50% boundary MB limitation expressed above.
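The buffer dynamics just described behave as a leaky bucket: the MBs of each VOP enter the buffer at its decoding time, the buffer drains at the profile@level MB/s rate, and compliance requires that it never overflows. A minimal sketch of this check (the function, names, and scenario are our illustration, not the normative algorithm; the Core@L1 limits are taken from Table I):

```python
# Leaky-bucket sketch of the MPEG-4 VCV compliance check (illustrative only).
# MBs enter the buffer when their VOP arrives; the buffer drains at the
# profile@level decoding rate; an overflow means the scene is too demanding.

def vcv_overflows(vops, buffer_size_mb, decoding_rate_mbs):
    """vops: list of (decoding_time_s, num_mbs), sorted by decoding time."""
    occupancy = 0.0
    last_t = 0.0
    for t, num_mbs in vops:
        # Drain since the previous VOP arrival, never below empty.
        occupancy = max(0.0, occupancy - (t - last_t) * decoding_rate_mbs)
        occupancy += num_mbs
        last_t = t
        if occupancy > buffer_size_mb:
            return True  # VCV buffer overflow: non-compliant
    return False

# Core@L1 limits from Table I: 198 MB buffer, 5940 MB/s decoding rate.
qcif_vops = [(i / 30.0, 99) for i in range(1, 31)]  # 30 QCIF VOPs (99 MBs each) at 30 Hz
print(vcv_overflows(qcif_vops, 198, 5940))  # prints False: a single QCIF object fits
```

The B-VCV check for the boundary MBs follows the same pattern, with its own decoding rate and the same buffer size.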
However, many indicators highlight that the assumption
implicit in the MPEG-4 VCV model, namely that the decoding
complexity is equal for all MB types except the boundary
MBs, is not adequate. Based on these indicators, this paper
starts with a study of the MPEG-4 video decoding
complexity, according to an adequate complexity model, to
effectively conclude that this complexity is not the same for
the various MB coding types. Following this conclusion, this
paper proposes a new VCV model intended to improve the
usage of the decoding resources available for a certain
profile@level combination. The method used consists of
attributing different complexity weights to the various MB
types, reflecting their effective decoding complexity.
III. DECODING COMPLEXITY MODELING
The decoding complexity of the encoded video data can, in
a first approach, be related to the data rate that the decoder has
to process, i.e., can be related to the number of MBs per
second that the decoder has to decode. However, the
computational power required to decode each MB may largely
vary due to the many different MB types (e.g., shape
information: opaque, transparent and boundary MBs) and
coding modes (e.g., texture coding modes: Intra, Inter,
Inter4V, etc.) that can be used. The complexity measure
chosen to evaluate the effective decoding complexity of a
certain MB depends on the degree of approximation to the real
decoding complexity that is required; however, the closer the
model intends to get to the real decoding complexity, the more
difficult it is for it to remain generic, since the decoding
complexity also depends on implementation issues.

1 In the context of this paper, a video profile is an MPEG-4 visual profile
which only includes natural visual (video) object types; as such, a video
profile does not include synthetic visual objects such as 3D faces.
A careful analysis of the problem shows that there are
several ways to measure the decoding complexity of the
encoded video data, associated with the rate of any of the
following parameters [7]:
 Number of MBs.
 Number of MBs per shape type, e.g., opaque (completely
inside the VOP), boundary (part inside and part outside
the VOP) or transparent (completely outside the VOP)
(See Figure 1).
 Number of MBs per combination of texture and shape
coding types (Inter+NoUpdate, Inter4V+InterCAE, etc.).
 Number of arithmetic instructions and memory
Read/Write operations.
Figure 1 – MB shape types (opaque, boundary, and transparent MBs relative to the VOP)
The decoding complexity model proposed in this paper is
based on the number of MBs per combined coding type
(combination of texture and shape coding), which was found to
be the one best representing the major factors determining the
actual decoding complexity of the encoded video data, while
maintaining a certain level of independence regarding the
decoder implementation. While the first two measures fail to
express some determining factors in terms of decoding
complexity, the last one may become too specific to a certain
implementation. This means that the MB complexity types for
which the decoding complexity will be evaluated are
characterized by a combination of shape and texture coding
tools.
IV. MPEG-4 MACROBLOCK CLASSIFICATION
In the MPEG-4 Visual standard [4], a video object is
defined by its texture and shape data. Although video objects
can have arbitrary shapes, texture and shape coding relies on
an MB structure (16×16 luminance pixels), where texture coding
as well as motion estimation and compensation tools are
similar to those used in the previously available video coding
standards, e.g., MPEG-1 and H.263.
In this paper, six different MPEG-4 texture coding modes
are studied [4]:
 Intra – The MB is encoded independently from past or
future MBs.
 Inter – The MB is differentially encoded, using motion
compensation with one motion vector.
 Intra+Q – Intra MB with a modified quantization step.
 Inter+Q – Inter MB with a modified quantization step.
 Inter4V – Inter MB using motion compensation with four
motion vectors (one for each 8×8 luminance block).
 Skipped – MB with no texture update information to be
sent.
The texture coding modes above are the most basic
MPEG-4 MB texture coding modes; they exist in all object
types and thus in all profiles. However, some visual object
types may have more sophisticated texture coding modes that
were not considered in this paper, although the approach
followed here also applies to them. The results obtained with the
MB texture coding modes considered above are enough to
demonstrate the claim that a more efficient VCV model than
the one used by MPEG-4 can be developed based on the
principles described in this paper.
Shape data can be MPEG-4 encoded using seven different
coding modes [4]:
 NoUpdate && MVDS == 0 – The shape information for
the current MB is equal to the shape of the corresponding
MB in the past prediction VOP.
 NoUpdate && MVDS != 0 – The shape information for
the current MB is obtained from the past prediction VOP
after motion compensation.
 Opaque – All shape pixels in the MB belong to the object
support; the object support is defined as the pixels where
the shape value is higher than 0.
 Transparent – None of the shape pixels in the MB
belongs to the object support.
 IntraCAE – The shape is encoded using Context-based
Arithmetic Encoding (CAE) [4, 8] in Intra mode.
 InterCAE && MVDS == 0 – The shape is encoded
using CAE in Inter mode, without motion compensation.
 InterCAE && MVDS != 0 – The shape is encoded using
CAE in Inter mode, with motion compensation.
In order to reduce the number of MB decoding complexity
types, the MB coding types with obviously similar complexities
were grouped into the same complexity class as shown in Table
II. This is the case of Intra and Intra+Q as well as Inter and
Inter+Q MB texture coding types, where the quantization step
change does not cause a significant decoding complexity
difference; the same is true for the MB shape coding types
with and without MVDS (Motion Vector Difference for
Shape), since both types need a prediction, although from
different past spatial positions.
TABLE II
STUDIED TEXTURE AND SHAPE MB CODING TYPES

Texture coding types        Shape coding types
Intra (Intra & Intra+Q)     NoUpdate (MVDS == 0 & MVDS != 0)
Inter (Inter & Inter+Q)     Opaque
Inter4V                     Transparent
Skipped                     IntraCAE
                            InterCAE (MVDS == 0 & MVDS != 0)
Following the conclusion of Section III, the MB coding
types whose decoding complexity will be evaluated are the
combinations of the texture and shape coding types shown in
Table II.
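Each complexity type is thus one texture type paired with one shape type from Table II; a small sketch that enumerates the nominal combinations (illustrative only; not every pairing occurs in practice, e.g., Transparent MBs carry no texture data):

```python
# Enumerate the nominal MB complexity types: the cross product of the
# texture and shape coding types retained in Table II (sketch).
from itertools import product

TEXTURE_TYPES = ["Intra", "Inter", "Inter4V", "Skipped"]  # +Q variants folded in
SHAPE_TYPES = ["Opaque", "Transparent", "NoUpdate", "IntraCAE", "InterCAE"]

combined_types = [f"{texture}+{shape}"
                  for texture, shape in product(TEXTURE_TYPES, SHAPE_TYPES)]
print(len(combined_types))  # 20 nominal combinations before pruning
```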
V. MACROBLOCK COMPLEXITY EVALUATION
To evaluate the decoding complexity of the various MB
complexity types (combination of shape and texture coding
tools), it is necessary to establish a complexity criterion, i.e., a
complexity measure. In this paper, the proposed measure is the
decoding time of each MB complexity type obtained with an
optimized version of the MoMuSys MPEG-4 video decoder
[6] – the IST MPEG video compliant codec – using several
representative MPEG-4 test sequences and profile@level
combinations.
The MoMuSys codec is included in MPEG-4 Part 5:
Reference Software [6] and has been improved with the
implementation of the MPEG-4 Video Buffering Verifier
mechanism following the architecture proposed in [9]. With
the implementation of this mechanism, the codec allows the
user to compliantly encode the created video scenes by
choosing the appropriate object type for each video object and
the selected profile@level for the scene [10]. Additionally, the
coding structure of the IST MPEG-4 video codec has been
modified so that a basic principle is followed at each coding
time instant: all analysis processing is performed before
encoding. With this structure, it is possible to implement more
powerful rate control solutions, with the video analysis data,
reflecting the current characteristics of each VO (and not of
the previous VOP), being used to efficiently distribute the
available resources before really encoding each VOP. This is
especially useful when the scene or a particular VO quickly
changes its characteristics and thus the allocation and
distribution of resources should immediately reflect these
changes. Moreover, with this structure, the encoder can take
the adequate actions when facing an imminent violation of the
Video Buffering Verifier mechanism, such as skipping the
encoding of one or more VOPs when the incoming VOPs
exceed the VMV limits or the VCV limits (decoding
complexity). This is not so efficiently handled when only
statistics of the previous time instant are used as in the case of
the original MPEG-4 reference software implementation.
A. Complexity evaluation conditions
To measure the MB decoding time, the video objects were
encoded normally, i.e., using an encoder that chooses for each
MB the best way to encode it according to its characteristics
(and not a somehow predefined way). The decoding time for
each MB was measured individually. In this way, the decoding
times of all encoded MBs were measured always using the
most appropriate MB coding type. Since the decoding time of
one MB can be too small to be measured with the available
time functions, due to their limited precision, each MB was
decoded 1000 times and the total MB decoding time was
divided by 1000 to obtain the single MB decoding time.
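The repeated-decoding measurement can be sketched as follows (a minimal illustration; `decode_mb` is a hypothetical stand-in for the actual decoder call, which belongs to the reference software and is not shown here):

```python
import time

def measure_mb_decoding_time(decode_mb, mb_bitstream, repetitions=1000):
    """Decode the same MB many times and average, since a single decode
    is below the resolution of ordinary timing functions (sketch)."""
    start = time.perf_counter()
    for _ in range(repetitions):
        decode_mb(mb_bitstream)
    elapsed = time.perf_counter() - start
    return elapsed / repetitions  # approximate single-MB decoding time

# Hypothetical stand-in workload, for illustration only.
fake_decode = lambda bits: sum(bits)
single_time = measure_mb_decoding_time(fake_decode, list(range(256)))
print(single_time > 0.0)  # prints True
```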
The decoding time depends on the MB coding
characteristics, notably Intra or Inter, since this determines the
decoding operations to be performed:
 Intra MBs – The measured decoding time is the sum of
the shape header and the shape decoding time (only for
arbitrarily shaped objects), plus the texture header and
texture decoding time, and the time spent in the padding
process (only for boundary and transparent MBs) [4].
 Inter MBs – The measured decoding time is the sum of
the shape header and the shape decoding time (only for
arbitrarily shaped objects), plus the texture header and
texture decoding time, the motion vector decoding time,
the time spent in the motion compensation, and the time
spent in the padding process (only for boundary and
transparent MBs).
The sequences and conditions used for measuring the MB
decoding times and thus evaluating the decoding complexity
were2:
 Akiyo, 1 rect. VO, QCIF, Simple@L2 at 128 kbit/s and
Core@L1 at 384 kbit/s;
 News, 1 rect. VO, QCIF, Simple@L3 at 384 kbit/s and
Core@L2 at 2000 kbit/s;
 Stefan, 1 rect. VO, QCIF, Simple@L3 at 384 kbit/s and
Core@L2 at 2000 kbit/s with I-VOPs only;
 Children, 1 shaped VO, QCIF, Core@L1 at 384 kbit/s;
 Weather, 1 shaped VO, QCIF, Core@L1 at 384 kbit/s;
 Coastguard, 4 shaped VOs, QCIF, Core@L1 at 384 kbit/s
and Core@L1 at 384 kbit/s using I-VOPs only;
 News, 1 rect. VO, CIF, Simple@L3 at 384 kbit/s;
 Stefan, 2 shaped VOs, CIF, Core@L2 at 2000 kbit/s.
The set of cases above tries to cover the most relevant
conditions as much as possible, so that a statistically
meaningful number of MBs is encoded for each MB coding
type. This explains why some sequences were encoded with
Intra frames only: otherwise, too few Intra encoded MBs
would appear.
2 For the Core object type in the Core profile, only the MB coding types
presented in Table II were used.

The Simple profile only accepts objects of type Simple, and
was created with low complexity applications in mind. The
major applications are mobile audiovisual communications
and very low complexity video on the Internet. The Simple
object type comprises error resilient, rectangular video objects
of arbitrary height/width ratio, developed for low complexity
terminals; this object type uses relatively simple and
inexpensive coding tools, based on I (Intra) and P (Predicted)
VOPs. The Core profile accepts Core and Simple object types
and it is useful for higher quality interactive services,
combining good quality with limited complexity and
supporting arbitrary shape objects; also mobile broadcast
services could be supported by this profile. The Core object
type uses a tool superset of the Simple object type, giving
better compression efficiency, and including binary shapes.
This means that while the Simple profile only accepts
rectangular objects the Core profile accepts rectangular and
arbitrarily shaped objects.
B. Complexity evaluation results
Figures 2 to 11 show the MB decoding times for the various
MB coding types considered. The decoding times for the MB
coding types that do not have texture data (no DCT
coefficients) are presented using histograms, while the
decoding times for the MB types with DCT coefficients are
presented as a function of the number of DCT coefficients in
the MB. In these figures, each dot represents the decoding
time of one MB (thousands of MBs are represented).
1) Rectangular VOs
Figure 2 shows the MB decoding times for rectangular
video objects. Considering maximum values, for the same
number of DCT coefficients, Inter4V MBs take more time to
be decoded than Inter MBs and Intra MBs are decoded faster
than the other MB coding types.
Figure 3 shows the decoding time of Skipped MBs for
rectangular video objects. MBs of this type are decoded
very fast because they have neither texture nor shape
encoded data to process.
Figure 3 – Skipped MBs decoding time for rect. VOs
2) Arbitrarily shaped VOs
With arbitrarily shaped VOs, there are not only opaque MBs
to encode but also transparent and boundary MBs, and thus
shape coding comes into play.
Transparent MBs
As can be seen in Figure 4, Transparent MBs take less time
to be decoded than the other types of MBs because they do not
have any texture or shape data to be decoded. There are
however two types of Transparent MBs: the MBs that are far
away from the object border and the MBs that are next to the
object border to which a repetitive padding process has to be
applied [4]. This padding process is responsible for the
increase in the decoding time, leading to two distinct cases of
Transparent MBs as shown in Figure 4.
Figure 4 – Transparent MBs decoding time
Figure 2 – MB decoding time for rect. VOs
(Texture) Skipped MBs
MBs with skipped texture and opaque shape from arbitrarily
shaped objects take more time to decode than Skipped MBs
from rectangular objects due to the shape decoding time
(Figure 5). To decode the Skipped MBs with NoUpdate shape
(Skipped+NoUpdate), it is necessary to use the past VOP as
well as the shape header. For this MB type, there are two
distinct decoding cases: MBs that are padded and MBs without
padding (Figure 6). The results also show that the decoding
time increases when the shape is encoded with CAE. In this
case, IntraCAE MBs typically take less time to be decoded
than InterCAE MBs (Figure 7 and Figure 8, respectively).
Figure 5 – Skipped+Opaque MBs decoding time
Figure 6 – Skipped+NoUpdate MBs decoding time
Figure 7 – Skipped+IntraCAE MBs decoding time
Figure 8 – Skipped+InterCAE MBs decoding time

DCT Encoded MBs
The decoding time for the MBs whose texture is encoded
with DCT depends on the number of encoded DCT
coefficients, and increases linearly with that number
(Figures 9 to 11). For the same type of shape coding and
considering the same number of DCT coefficients, the
maximum decoding time increases with the texture coding
type in the following order: Intra, Inter, and Inter4V. However,
the differences are very small and it is difficult to establish a
clear relation over the full range of numbers of DCT
coefficients. For Intra (texture) MBs there are two distinct
cases, depending on whether AC prediction is used for the
DCT coefficients; the MBs that use AC prediction do in fact
take longer to be decoded.
If the same type of texture coding is considered, the
decoding time increases with the shape coding type in the
following order: Opaque, NoUpdate, IntraCAE and InterCAE.
Figure 9 – Opaque MBs decoding time
Figure 10 – (Shape) NoUpdate MBs decoding time
Figure 11 – IntraCAE and InterCAE MBs decoding time
VI. RELATIVE COMPLEXITY WEIGHTS
The previous section has shown that the decoding
complexity, measured in terms of the decoding time, varies
significantly according to the MB coding type and not only
according to the boundary and non-boundary distinction, as
assumed by the MPEG-4 VCV model [4]. This section
proposes MB relative complexity weights (obtained after
extensive measurements) that should model more effectively
the decoding complexity of an MPEG-4 encoded video object.
Taking into account that the MPEG-4 VCV model is
implicitly designed for the most complex MB coding type, the
complexity weights must be defined relatively to the most
complex MB coding type in the context of each profile; this
means that the maximum complexity weight is set to 1 for this
MB coding type and all the other weights are relative to this
one and thus smaller than 1. This solution allows the
implementation of a "trading system", where it is possible, for
example, to trade one of the most complex MBs for two MBs
with half the relative complexity, while still keeping the
bitstream decodable by a compliant decoder, that is, without
requiring higher decoding resources.
The relative complexity weight for each MB complexity
type, ki, is thus obtained as the ratio between the maximum
decoding time for that MB type (a conservative approach,
since most of the time the MBs of that type will be less
complex) and the highest maximum decoding time among all
the MB types relevant for the profile in question, for the cases
studied in this paper: the Inter4V+InterCAE MB type for the
Core profile and the Inter4V MB type for the Simple profile,
considering the MB coding types indicated in Table II3:

    ki = tmax(MBi) / tmax(Inter4V+InterCAE)    (Core profile)
    ki = tmax(MBi) / tmax(Inter4V)             (Simple profile)    (1)
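Equation (1) is a plain normalization of the measured maximum decoding times; a sketch using a few of the Core profile values reported in Table IV:

```python
# Relative complexity weights per equation (1): each maximum MB decoding
# time (ms) divided by that of the most complex Core type, Inter4V+InterCAE.
t_max = {
    "Inter4V+InterCAE": 1.82,  # reference: weight 1.00
    "Inter+IntraCAE": 1.60,
    "Skipped+Opaque": 0.22,
}

reference = t_max["Inter4V+InterCAE"]
weights = {mb_type: round(t / reference, 2) for mb_type, t in t_max.items()}
print(weights["Inter+IntraCAE"])  # 0.88
print(weights["Skipped+Opaque"])  # 0.12
```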
Table III and Table IV show the maximum decoding times
and the relative complexity weights for the various MB coding
types studied, respectively for the Simple profile and for the
Core profile, in the conditions mentioned above.
TABLE III
MAXIMUM DECODING TIME AND RELATIVE COMPLEXITY WEIGHT FOR THE
VARIOUS MB CODING TYPES IN THE SIMPLE PROFILE

MB coding type   Maximum time (ms)   Relative complexity weight (ki)
Skipped          0.16                0.13
Intra            1.08                0.89
Inter            1.06                0.88
Inter4V          1.21                1.00

3 For the Core profile not all MB coding types have been studied.

TABLE IV
MAXIMUM DECODING TIME AND RELATIVE COMPLEXITY WEIGHT FOR THE
STUDIED MB CODING TYPES IN THE CORE PROFILE

MB coding type            Maximum time (ms)   Relative complexity weight (ki)
Skipped (only rect. VO)   0.16                0.09
Intra (only rect. VO)     1.08                0.59
Inter (only rect. VO)     1.06                0.58
Inter4V (only rect. VO)   1.21                0.66
Transparent               0.21                0.12
Skipped+Opaque            0.22                0.12
Intra+Opaque              1.06                0.58
Inter+Opaque              1.04                0.57
Inter4V+Opaque            1.28                0.70
Skipped+NoUpdate          0.38                0.21
Intra+NoUpdate            0.97                0.53
Inter+NoUpdate            1.17                0.64
Inter4V+NoUpdate          1.38                0.76
Skipped+IntraCAE          0.58                0.32
Intra+IntraCAE            1.46                0.80
Inter+IntraCAE            1.60                0.88
Inter4V+IntraCAE          1.77                0.97
Skipped+InterCAE          0.73                0.40
Inter+InterCAE            1.73                0.95
Inter4V+InterCAE          1.82                1.00

Since there are some MB coding types whose decoding
times, and particularly the maximum decoding times, are very
similar, the VCV model to be proposed in the following can be
simplified by grouping these MB types into a single complexity
class as shown in Table V. The relative complexity weight
attributed to each class is the weight of the most complex MB
type included in that class (again a conservative approach).

TABLE V
RELATIVE COMPLEXITY WEIGHT FOR THE DEFINED MB DECODING
COMPLEXITY CLASSES

MB complexity   MB coding type                                Relative complexity weight (ki)
class                                                         Simple Profile   Core Profile
C1              Inter4V+InterCAE, Inter+InterCAE,
                Inter4V+IntraCAE                              –                1.00
C2              Inter+IntraCAE, Intra+IntraCAE                –                0.88
C3              Inter4V+NoUpdate                              –                0.76
C4              Inter+NoUpdate, Intra+NoUpdate,
                Inter4V+Opaque, Inter+Opaque, Intra+Opaque    –                0.70
C5              Skipped+InterCAE                              –                0.40
C6              Skipped+IntraCAE                              –                0.32
C7              Skipped+NoUpdate                              –                0.21
C8              Skipped+Opaque                                –                0.12
C9              Transparent                                   –                0.12
C10             Inter4V (only rect. VO)                       1.00             0.66
C11             Inter, Intra (only rect. VO)                  0.89             0.59
C12             Skipped (only rect. VO)                       0.13             0.09
The weights presented above have been defined in a rather
conservative way, by using the most complex case within each
MB complexity class, so that the weights remain valid even
if there is some decoding complexity variation across different
decoder implementation platforms.
VII. IST VCV MODEL: A MORE EFFICIENT SOLUTION
Exploiting the MB relative decoding complexity weights
presented above, this section proposes an alternative, more
efficient, VCV model: the IST VCV model. This model is
based on a single buffer with a single decoding rate using
different MB complexity weights for the various MB decoding
complexity classes [7]. The major characteristics of the IST
VCV model are:
 Complexity model based on the MB coding tools – The
distinction in terms of decoding complexity between the
various MBs is associated with the different MB texture and
shape coding tools used, i.e., the MB complexity classes
are related to a texture-shape tool combination for which
a relative complexity weight is measured.
 Single buffer with relative MB complexity weights – In
the proposed model, a single buffer stores all the encoded
MBs, but each MB is weighted according to its decoding
complexity class. Thus, the IST VCV buffer occupancy
corresponds to a weighted sum of the encoded MBs. The
IST and MPEG-4 VCV buffer sizes are made the same,
making it possible to compare the two models in a simple
way, since the decoding computational resources remain
the same.
 Single decoding rate – The use of a single buffer with
MB complexity weights implies a single decoding rate.
The IST and MPEG-4 VCV decoding rates are made the
same, making it possible to compare the two models in a
simple way, since the decoding computational resources
remain the same.
The main advantage of the IST VCV solution relative to
the MPEG-4 VCV solution is that it models more closely the real
decoding complexity of a given set of bitstreams building a
video scene: since the different types of MBs are distinguished
in terms of decoding complexity, decoding resources
are not wasted by the overly pessimistic assumption that all
MBs besides boundary MBs are equally and maximally difficult
(while there are actually large variations, as shown in Table V).
For the IST VCV model proposed in this paper, the number
of equivalent (to the most complex) MBs for a given VOP i,
Mi, that is added to the VCV buffer at each decoding time
instant, ti, is given by the following expression
Mi 
k
j
 Mc j
j 1
where kj is the relative complexity weight (1) associated with
the MB complexity class j, Mcj is the number of MBs in VOP i
belonging to complexity class j, and Nc is the number of
complexity classes: 3 for the Simple profile and 12 for the
Core profile (using the MB coding types in Table II).
The new VCV model assumes a full sharing of the available
decoding resources, which is in principle valid at least for
decoder software implementations.
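The filling rule above can be sketched in a few lines. This is a minimal illustration, not the paper's software: the class names and weight values below are hypothetical placeholders, not the measured weights of Tables V/VI.

```python
# Sketch of the IST VCV filling rule: the number of equivalent MBs Mi
# added to the buffer for one VOP is the weighted sum of the MB counts
# per complexity class.

def equivalent_mbs(mb_counts, weights):
    """Mi = sum over classes j of kj * Mcj."""
    return sum(weights[c] * n for c, n in mb_counts.items())

# Hypothetical Simple-profile-like classes with made-up weights kj.
weights = {"Inter4V": 1.0, "Intra": 0.9, "Skipped": 0.13}
vop_counts = {"Inter4V": 40, "Intra": 5, "Skipped": 54}   # Mcj for one VOP

Mi = equivalent_mbs(vop_counts, weights)   # 40*1.0 + 5*0.9 + 54*0.13
```

With these example numbers a VOP of 99 MBs counts as only about 52 equivalent MBs, whereas the MPEG-4 VCV would charge all 99 at maximum complexity.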
The adoption by MPEG-4 of an alternative VCV model
following the approach proposed in this paper (already
recognized by MPEG as more adequate) would imply
changing the MPEG-4 video decoding complexity model:
the B-VCV buffer would be removed while the VCV buffer
would keep the same parameters; moreover, the VCV filling
model would change from a simple addition of the number of
MBs to a weighted addition of the number of MBs, using the
relative complexity weights.
Since the IST VCV decoding rate and buffer size are
unchanged relative to the MPEG-4 VCV model for each
profile@level, a direct comparison between the two models
can easily be done, because the decoder computational
resources are maintained. A comparison between the two VCV
models will be presented in the next section.
VIII. IST AND MPEG-4 VCV MODELS: A COMPARISON
The ideal way to compare and validate the IST VCV model
relative to the MPEG-4 VCV model would be to decode
bitstreams which, for a given profile@level, violate the
MPEG-4 VCV model but not the IST VCV model, and to
show that these scenes could be decoded by a compliant
MPEG-4 decoder while fulfilling the required timing constraints.
This would show that the existing profile@level decoding
resources are sufficient for these non-compliant MPEG-4
bitstreams, and thus that the MPEG-4 VCV model wastes these
resources (due to complexity over-dimensioning) when it
prevents those bitstreams from being classified as compliant
with the profile@level in question. However, this comparison and
validation can only be done with a real-time decoder that was
not available at the time this work was done. Thus, the
comparison between the two VCV models is here done by
comparing the occupancy of the two VCV buffers and thus the
effects of the proposed approach under the assumption that the
measured relative complexity weights are valid.
To perform this comparison an encoder with only the VBV
(bit rate) rate control active was used. The feedback
mechanism that prevents the violation of the VCV
(complexity) and the VMV (memory) models has been
disabled in order to allow the visualization of the
corresponding buffer occupancy evolution even if it is above
100% occupancy. In this case, the encoded bitstreams are the
same for both models since there is no feedback control, but
the VCV fullness computation (MB decoding complexity
evaluation) is done differently, depending on the considered
model. This comparison methodology makes it possible to
verify that, in many situations, the MPEG-4 VCV model
exceeds the 100% buffer occupancy limit while the IST VCV
model does not. This means that the use of the IST VCV model
would allow, for a given profile@level, encoding in a
"compliant way" (i.e., using the same decoding resources)
video scenes that the MPEG-4 VCV model does not allow, due
to its clear over-dimensioning of the MB decoding complexity
for most MB coding types. This holds when similar spatial
resolutions and temporal VOP rates are used, e.g., no VOP
skipping is applied.
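The methodology above can be sketched as a small simulation: the same per-VOP MB counts are fed to both fullness rules, with identical decoding rate and buffer size, and only the fullness computation differs. This is an illustration, not the reference software; all numbers (rate, buffer size, MB mixes, weights) are made-up examples.

```python
# The MPEG-4 VCV counts every MB as maximally complex; the IST VCV adds
# the weighted equivalent-MB count. Both buffers drain at the same rate.

def simulate(vops, weight_fn, rate_mb_s, buffer_mb, vop_period_s):
    """Return per-VOP buffer occupancy (fraction) right after each VOP is added."""
    occupancy, fullness = [], 0.0
    for mb_counts in vops:
        fullness += weight_fn(mb_counts)                          # add this VOP's MBs
        occupancy.append(fullness / buffer_mb)
        fullness = max(0.0, fullness - rate_mb_s * vop_period_s)  # decode/drain
    return occupancy

ist_weights = {"Inter": 0.8, "Skipped": 0.1}   # hypothetical relative weights
vops = [{"Inter": 30, "Skipped": 69}] * 4      # 99 MBs per QCIF VOP

mpeg4_occ = simulate(vops, lambda c: sum(c.values()), 1485, 99, 1 / 15)
ist_occ = simulate(vops, lambda c: sum(ist_weights[t] * n for t, n in c.items()),
                   1485, 99, 1 / 15)
# The IST occupancy stays well below the MPEG-4 occupancy for every VOP.
```

With these numbers the MPEG-4 rule pins the buffer at 100% on every VOP, while the weighted rule leaves ample headroom, mirroring the behavior reported for the test sequences below.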
A. Scenes with one rectangular object
Figure 12 shows one original frame of the test sequence
Akiyo, rectangular, in QCIF format, which has been encoded at
15 fps with Simple@L1 at 64 kbit/s. Figure 13 shows the
MPEG-4 and IST VCV occupancies for this sequence.
Figure 12 – Sample of the Akiyo sequence
As can be seen in Figure 13, the MPEG-4 VCV occupancy
is always 100%, which makes the encoded bitstream MPEG-4
compliant for the considered profile@level. With the IST
VCV model, the sequence can be compliantly encoded with
the same profile@level, but the VCV occupancy is lower,
varying between 25% and 50% during the whole encoding
process. This means that, with the IST VCV model, it would be
possible to encode this sequence at a higher frame rate, for
example 25 fps, exploiting the fact that some MB types, e.g.,
Skipped, are in reality much less complex than the most
complex MB coding type (Inter4V). The term "compliant"
means here that, while maintaining the standardized decoding
resources for a certain profile@level, the bitstream could still
be decoded fulfilling the necessary timing constraints, since it
is not really more complex than other "officially" compliant
MPEG-4 bitstreams (for the relevant profile@level). The
non-compliant classification by the MPEG-4 VCV model is due
to the decoding complexity over-estimation of some MB types;
for rectangular objects, this over-estimation is mainly related
to the Skipped MBs, as can be seen in Table V.
Figure 13 – MPEG-4 and IST VCV occupancy: Akiyo, QCIF, 15 fps,
Simple@L1 at 64 kbit/s
B. Scenes with several arbitrarily shaped objects
To make a rigorous comparison between the MPEG-4 VCV
and the IST VCV in scenes with arbitrarily shaped objects, the
restriction imposed by the MPEG-4 B-VCV requiring that the
number of boundary MBs for each decoding time is not greater
than half the B-VCV capacity must be considered. This means
that, from a complexity point of view, the MPEG-4 VCV
worst-case scenario corresponds to the case where the MBs are
50% of the most complex non-boundary MB type
(Inter4V+Opaque) and 50% of the most complex boundary
MB type (Inter4V + InterCAE). To accommodate this case,
and only for the purpose of the comparison between the two
VCV models, the relative complexity weights have to be
recomputed using as reference the average of the maximum
decoding times of the Inter4V+InterCAE and Inter4V+Opaque
types, and not the Inter4V+InterCAE type alone, as proposed
in the IST VCV model and as should be used if the MPEG-4
B-VCV did not exist. In this circumstance, the "trading
system" is not simply referred to the Inter4V+InterCAE type,
because the decoder does not have to support the case where
all the MBs are of this type; instead, it is referred to the
average complexity between the most complex MB types of
rectangular and arbitrarily shaped objects (using the MB
coding types indicated in Table II). In this situation, it is
natural that the complexity weight of an Inter4V+InterCAE
MB is higher than 1, since this type is more complex than the
reference complexity.
Considering all these facts, to compare the two VCV
models, new relative decoding complexity weights have to be
computed for the profiles under study using the following
reference time:

tref = [tmax(Inter4V+InterCAE) + tmax(Inter4V+Opaque)] / 2

As a consequence, the relative complexity weight for each
MB coding type is obtained by:

ki = tmax(MBi) / tref
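The recomputation above is a one-liner per type. In this sketch the tmax values are made-up illustrative numbers (arbitrary time units), not the paper's measurements; only the two formulas come from the text.

```python
# tref is the average of the maximum decoding times of the
# Inter4V+InterCAE and Inter4V+Opaque types; each weight is
# ki = tmax(MBi) / tref.

def relative_weights(tmax):
    tref = (tmax["Inter4V+InterCAE"] + tmax["Inter4V+Opaque"]) / 2.0
    return {mb: t / tref for mb, t in tmax.items()}

# Made-up tmax values in arbitrary time units.
tmax = {"Inter4V+InterCAE": 2.34, "Inter4V+Opaque": 1.66, "Skipped+Opaque": 0.28}
k = relative_weights(tmax)
# Here tref = (2.34 + 1.66) / 2 = 2.0, so k["Inter4V+InterCAE"] = 1.17 > 1,
# as expected for a type more complex than the reference.
```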
The new relative complexity weights for the various MB
complexity classes for the Simple and Core profiles using the
MB coding types indicated in Table II are presented in Table
VI.
TABLE VI
RELATIVE COMPLEXITY WEIGHTS FOR THE STUDIED
MB CODING TYPES IN THE SIMPLE AND CORE PROFILES

MB complexity class – relative complexity weight (ki):
C1 – 1.17; C2 – 1.03; C3 – 0.89; C4 – 0.83; C5 – 0.47; C6 – 0.37;
C7 – 0.25; C8 – 0.14; C9 – 0.14; C10 – 0.78; C11 – 0.70; C12 – 0.10

MB coding types covered by these classes (in table order):
Inter4V+InterCAE; Inter+InterCAE; Inter4V+IntraCAE;
Inter+IntraCAE; Intra+IntraCAE; Inter4V+NoUpdate;
Inter+NoUpdate; Intra+NoUpdate; Inter4V+Opaque;
Inter+Opaque; Intra+Opaque; Skipped+InterCAE;
Skipped+IntraCAE; Skipped+NoUpdate; Skipped+Opaque;
Transparent; Inter4V, Inter, Intra, and Skipped (only rectangular VOs).

Figure 14 – Sample of the Coastguard sequence
Figure 14 shows one original frame of the test sequence
Coastguard, with arbitrarily shaped objects, in QCIF format,
which has been encoded at 30 fps with Core@L1 at
384 kbit/s. Figure 15,
presenting the MPEG-4 and IST VCV occupancies for this
sequence, shows that the MPEG-4 VCV buffer overflows, and
thus the bitstream is not compliant. The IST VCV model
occupancy shows that the scene is not too complex to be
encoded at the considered profile@level, since the VCV
occupancy is around 70% during most of the time of the
decoding process. The MPEG-4 VCV occupancy peak that can
be seen in Figure 15 is caused by a great number of transparent
MBs that appear in those VOPs (after the peak the total
number of MBs remains constant and hence the MPEG-4 VCV
becomes flat). Since, in the IST VCV model, Transparent MBs
have a low relative computational weight, the peak is
attenuated and the IST VCV buffer does not overflow. It is
important to notice that the number of transparent MBs in a
scene strongly influences the MPEG-4 VCV performance.
Figure 15 – MPEG-4 and IST VCV occupancy: Coastguard, QCIF, 30 fps,
Core@L1, 384 kbit/s
Figure 16 shows one original frame of the
Children_and_Flag test sequence, with 3 VOs, in QCIF
format, which has been encoded at 30 fps with Core@L1 at
384 kbit/s. The “Children” and “Flag” bounding boxes have a
high number of transparent MBs and overlap in the scene; this
contributes to the high MPEG-4 VCV occupancy as can be
seen in Figure 17. Notice that although this scene has only 3
video objects and the majority of the MBs are transparent
(58% transparent, 41% boundary, and 1% opaque), it cannot be
encoded with Core@L1 using the MPEG-4 VCV model, which
looks unrealistic even at first sight. The MPEG-4 VCV buffer
occupancy is always very high, due to the high number of
transparent MBs in the "Children" and "Flag" bounding boxes.
The occupancy increases when the "MPEG-4 Logo" object
appears, leading to a non-compliant set of bitstreams because
the VCV occupancy exceeds 100%. The same figure shows
that the IST VCV buffer occupancy is always under 50%,
since the Transparent MB complexity weight is rather low,
thus allowing the scene to be "compliantly" encoded with
Core@L1.
Figure 16 – Sample of the Children_and_Flag sequence
Figure 17 – MPEG-4 and IST VCV occupancy: Children_and_Flag, QCIF,
30 fps, Core@L1, 384 kbit/s
Another example that shows the weakness of the MPEG-4
VCV model in the presence of transparent MBs, and the
effectiveness of the IST VCV model, is the MPEG-4 test
sequence News (Figure 18), with 4 VOs, in CIF format,
encoded at 30 fps
with Core@L2 at 2000 kbit/s. Figure 19 shows the MPEG-4
and IST VCV occupancies for this sequence. The MPEG-4
VCV buffer capacity is largely exceeded during the coding
process. On the other hand, the IST VCV occupancy is always
around 50%, which shows that this scene can be encoded in
Core@L2. The influence of transparent MBs in the MPEG-4
VCV is easily verified in this example. Figure 20 shows the
number of transparent, opaque, and boundary MBs along the
scene. The number of boundary and opaque MBs stays
approximately constant along the scene, while the number of
transparent MBs oscillates between two (rather high and
similar) values. As can be seen in the Figures, when the
number of transparent MBs increases, there is a corresponding
increase in the MPEG-4 VCV occupancy, and when the
number of transparent MBs decreases, the MPEG-4 VCV
occupancy decreases. On the other hand, the IST VCV buffer
occupancy stays approximately constant, because of the low
complexity weight that Transparent MBs have in this model.
Figure 18 – Sample of the News sequence
Figure 19 – MPEG-4 and IST VCV occupancy: News, CIF, 30 fps, Core@L2,
2000 kbit/s
Figure 20 – Number of MBs per shape type: News sequence
IX. FINAL REMARKS
This paper proposes an alternative Video Complexity
Verifier model approach to the one specified in the MPEG-4
Visual standard, based on a set of MB relative decoding
complexity weights assigned to the MPEG-4 MB coding types
presented in Table II. These weights allow measuring the real
decoding complexity of a given MPEG-4 encoded scene more
precisely. Complexity measurements show that the MPEG-4
VCV model over-estimates the decoding complexity of some
scenes, notably because some MB types, such as the
Transparent and Skipped MBs, are over-evaluated (i.e., not
distinguished from the really complex ones) in terms of
decoding complexity. On the other hand, the IST VCV model
allows the encoding of many of the scenes considered too
complex by the MPEG-4 VCV model, for a given
profile@level. These scenes can be decoded by a compliant
decoder without changing the decoding resources, and thus
making a better use of these resources. The efficient use and
sharing of the available decoding resources is very important,
mainly in applications where they are scarce and expensive,
e.g., mobile terminals. Mobile applications should be among
the first where MPEG-4 will “explode”, as demonstrated by
the adoption of MPEG-4 video coding by 3GPP (3rd
Generation Partnership Project), responsible for the UMTS
specification.
REFERENCES
[1] F. Pereira, "MPEG-4: Why, What, How and When?", Signal Processing:
Image Communication, Tutorial Issue on the MPEG-4 Standard, vol. 15,
nº 4-5, pp. 271-279, December 1999.
[2] MPEG Requirements Group, "MPEG-4 Applications Document",
Doc. ISO/IEC JTC1/SC29/WG11/N2724, 47th MPEG meeting, Seoul,
March 1999.
[3] MPEG Requirements Group, "MPEG-4 Overview", Doc. ISO/IEC
JTC1/SC29/WG11/N3930, 55th MPEG meeting, Pisa, January 2001.
[4] ISO/IEC 14496-2:1999, "Information technology – Coding of
audio-visual objects – Part 2: Visual", December 1999.
[5] R. Koenen, "Profiles and Levels in MPEG-4: Approach and Overview",
Signal Processing: Image Communication, Tutorial Issue on the MPEG-4
Standard, vol. 15, nº 4-5, pp. 463-478, December 1999.
[6] ISO/IEC 14496-5:1999, "Information technology – Coding of
audio-visual objects – Part 5: Reference software", December 1999.
[7] P. Nunes, F. Pereira, "MPEG-4 Compliant Video Encoding: Analysis
and Rate Control Strategies", Proceedings of the ASILOMAR 2000
Conference, Pacific Grove, CA, USA, October 2000.
[8] N. Brady, "MPEG-4 Standardized Methods for the Compression of
Arbitrarily Shaped Video Objects", IEEE Transactions on Circuits and
Systems for Video Technology, vol. 9, nº 8, pp. 1170-1189,
December 1999.
[9] P. Nunes, F. Pereira, "Implementing the MPEG-4 Natural Visual
Profiles and Levels", Doc. M4878, 48th MPEG meeting, Vancouver,
July 1999.
[10] J. Valentim, P. Nunes, F. Pereira, "IST MPEG-4 Video Compliant
Framework", 3rd Conference on Telecommunications, Figueira da Foz,
Portugal, April 2001.