Evaluating MPEG-4 Video Decoding Complexity for an Alternative Video Complexity Verifier Model

J. Valentim, P. Nunes, and F. Pereira
Instituto Superior Técnico (IST) – Instituto de Telecomunicações
Av. Rovisco Pais, 1049-001 Lisboa, Portugal
Phone: +351 21 841 8460; Fax: +351 21 841 8472
e-mail: {joao.valentim, paulo.nunes, fernando.pereira}@lx.it.pt

Abstract – MPEG-4 is the first object-based audiovisual coding standard. To control the minimum decoding complexity resources required at the decoder, the MPEG-4 Visual standard defines the so-called Video Buffering Verifier mechanism, which includes three virtual buffer models, among them the Video Complexity Verifier (VCV). This paper proposes an alternative VCV model, based on a set of macroblock (MB) relative decoding complexity weights assigned to the various MB coding types used in MPEG-4 video coding. The new VCV model allows a more efficient use of the available decoding resources by preventing the over-evaluation of the decoding complexity of certain MB types, thus making it possible to encode scenes (for the same profile@level decoding resources) which would otherwise be considered too demanding.

Index Terms – Decoding complexity, MPEG-4 standard, profiles and levels, VCV model.

I. INTRODUCTION

Recognizing that audiovisual content should be created and represented using a framework able to give the user as many real-world-like capabilities as possible, such as interaction and manipulation, MPEG decided, in 1993, to launch a new project, well known as MPEG-4 [1]. Since human beings do not want to interact with abstract entities, such as pixels, but rather with meaningful entities that are part of the audiovisual scene, the concept of object is central to MPEG-4. MPEG-4 models an audiovisual scene as a composition of audiovisual objects with specific characteristics and behavior, notably in space and time.
The object composition approach makes it possible to support new functionalities, such as object-based interaction, manipulation and hyper-linking, as well as to improve already available functionalities, such as coding efficiency, by using for each type of object the most adequate coding tools and parameters. The MPEG-4 audiovisual coding standard has been designed to be generic in the sense that it does not target a particular application but instead includes many coding tools that can be used for a wide variety of applications, under different conditions, notably in terms of bit rate, type of channel and storage media, and delay constraints [2]. This toolbox approach provides the mechanisms to cover a wide range of audiovisual applications, from mobile multimedia communications to studio and interactive TV [2, 3]. Since it is not reasonable to require that all MPEG-4 visual terminals support the whole MPEG-4 visual toolbox, subsets of the MPEG-4 Visual standard tools [4] have been defined, using the concept of profiling, to address classes of applications with similar functional and operational requirements [5]. A similar approach has been applied to the audio tools as well as to the systems tools. This approach allows manufacturers to implement only the subsets of the standard – profiles – that they need to achieve particular functionalities, while maintaining interoperability with other MPEG-4 devices built under the same conditions, and also restricts the computational resources required by the corresponding terminals. A subset of the syntax and semantics corresponding to a subset of the tools of the MPEG-4 Visual standard [4] defines a visual profile, while sets of restrictions within each visual profile, e.g., in terms of computational resources and memory, define the various levels for that profile [5].
Moreover, since a scene may include visual objects encoded using different tools, object types define the syntax of the bitstream for one single object that can represent a meaningful entity in the (audiovisual) scene. Note that object types correspond to a set of tools, in this case applied to each object and not to the scene as a whole. Following the definition of audio and visual object types, audio and visual profiles are defined as sets of audio and visual object types, respectively. For a particular set of visual data bitstreams building a scene to be considered compliant with a given MPEG-4 visual profile@level combination, it must not contain any disallowed syntax element for that profile and, additionally, it must not violate the constraints of the Video Buffering Verifier mechanism [4]. This mechanism consists of three normative models, each one defining a set of rules and limits to verify that the amount required of a specific type of decoding resource is within the values allowed by the corresponding profile and level specifications:

Video Rate Buffer Verifier (VBV) – This model is used to verify that the bitstream memory required at the decoder(s) does not exceed the values specified for the corresponding profile and level. The model is defined in terms of the VBV buffer sizes for all the Video Object Layers (VOLs) corresponding to the objects building the scene. Each VBV buffer size corresponds to the maximum amount of bits that the decoder can store in the bitstream memory for the corresponding VOL. There is also a limitation on the sum of the VOL VBV buffer sizes. The bitstream memory is the memory where the decoder puts the bits received for a VOL while they wait to be decoded.

Video Complexity Verifier (VCV) – This model is used to verify that the computational power (processing speed) required at the decoder, defined in terms of MB/s, does not exceed the values specified for the corresponding profile and level.
The model is defined in terms of the VCV MB/s decoding rate and the VCV buffer size and is applied to all MBs in the scene. If arbitrarily shaped Video Objects (VOs) exist in the scene, an additional VCV buffer and VCV decoding rate are also defined, to be applied only to the boundary MBs.

Video Reference Memory Verifier (VMV) – This model is used to verify that the picture memory required at the decoder for the decoding of a given scene does not exceed the values specified for the corresponding profile and level. The model is defined in terms of the VMV buffer size, which is the maximum number of decoded MBs that the decoder can store during the decoding process of all VOLs corresponding to the scene.

This paper evaluates the decoding complexity of the various MB coding types included in the Simple and Core object types (and thus profiles) based on the MB decoding times obtained with an optimized version of the MPEG-4 reference software [6]. Following this decoding complexity evaluation, an alternative to the current MPEG-4 VCV model [4] is presented, exploiting the relative decoding complexity of the various MB coding types used in MPEG-4 video coding. This model is based on a closer estimate of the actual decoding complexity of the various video objects composing a scene, thus allowing a much better use of the video decoding resources, which may be a critical factor in application environments where resources are scarce and expensive, such as mobile and smart card applications.

II. MPEG-4 VCV MODEL

The MPEG-4 VCV model defines, for each profile@level combination, a set of rules and limits which, when respected at the encoder, ensure that the required decoding computational power is always available at the decoder (which also respects the same limits) [4].
The computational power of the decoder is defined by two buffers and the corresponding MB decoding rates, measured in MB/s, which specify the drain rate of the two buffers:

Boundary-VCV (B-VCV) – The B-VCV buffer keeps track of the number of boundary MBs.

VCV – The VCV buffer keeps track of the number of all MBs, without distinction.

Compliance regarding the VCV model can only be guaranteed if these buffers never overflow. For each of these buffers, the buffer size and the decoding rate are specified for each profile@level combination. The buffer size and the decoding rate are defined in terms of MBs and MB/s, without any differentiation in terms of MB types. The VCV and B-VCV buffer sizes and decoding rates for the video profile@level(s)1 studied in this paper are shown in Table I.

TABLE I
VCV AND B-VCV BUFFER SIZE AND DECODING RATE FOR THE SIMPLE AND CORE VIDEO PROFILE@LEVEL(S)

Profile@Level   VCV/B-VCV buffer size (MB)   VCV decoding rate (MB/s)   B-VCV decoding rate (MB/s)
Simple@L1       99                           1485                       –
Simple@L2       396                          5940                       –
Simple@L3       396                          11880                      –
Core@L1         198                          5940                       2970
Core@L2         792                          23760                      11880

Since the current MPEG-4 VCV model [4] does not distinguish the various MB coding types, besides the boundary versus non-boundary distinction introduced by the B-VCV, the decoder must be able to decode any set of MBs that does not overflow the VCV buffers for the given profile@level, independently of the MB coding type. Additionally, each Video Object Plane (VOP), i.e., a time sample of the video object, must be available in the composition memory at the VOP composition time plus a fixed delay, the VCV latency, which is the time needed to decode a full VCV buffer [4]. Since the B-VCV has half the VCV decoding rate, to fulfill the requirement above the number of boundary MBs for each decoding time (this means for each VOP) cannot exceed 50% of the VCV capacity.
This implies that the decoder must be prepared to deal with the worst-case scenario, i.e., the case where all MBs are of the most complex coding type, while observing the 50% boundary MB limitation expressed above. However, many indicators highlight the fact that the assumption that the decoding complexity is equal for all MB types except the boundary MBs, as implicit in the MPEG-4 VCV model, is not adequate. Based on these indicators, this paper starts by studying the MPEG-4 video decoding complexity according to an adequate complexity model, to effectively conclude that this complexity is not the same for the various MB coding types. Following this conclusion, this paper proposes a new VCV model intended to improve the usage of the decoding resources available for a certain profile@level combination. The method used consists of assigning different complexity weights to the various MB types, reflecting their effective decoding complexity.

III. DECODING COMPLEXITY MODELING

The decoding complexity of the encoded video data can, in a first approach, be related to the data rate that the decoder has to process, i.e., to the number of MBs per second that the decoder has to decode. However, the computational power required to decode each MB may vary largely due to the many different MB types (e.g., shape information: opaque, transparent and boundary MBs) and coding modes (e.g., texture coding modes: Intra, Inter, Inter4V, etc.) that can be used.

Footnote 1: In the context of this paper, a video profile is an MPEG-4 visual profile which only includes natural visual (video) object types; as such, a video profile does not include synthetic visual objects such as 3D faces.
The complexity measure chosen to evaluate the effective decoding complexity of a certain MB depends on the degree of approximation to the real decoding complexity that is required; however, the closer to the real decoding complexity the model intends to get, the more difficult it is to keep it generic, since the decoding complexity also depends on implementation issues. A careful analysis of the problem shows that there are several ways to measure the decoding complexity of the encoded video data, associated with the rate of any of the following parameters [7]:

Number of MBs.

Number of MBs per shape type, e.g., opaque (completely inside the VOP), boundary (part inside and part outside the VOP) or transparent (completely outside the VOP) (see Figure 1).

Number of MBs per combination of texture and shape coding types (Inter+NoUpdate, Inter4V+InterCAE, etc.).

Number of arithmetic instructions and memory Read/Write operations.

Figure 1 – MB shape types

The decoding complexity model proposed in this paper is based on the number of MBs per combined coding type (combination of texture and shape coding), which was found to be the one that best represents the major factors determining the actual decoding complexity of the encoded video data, while maintaining a certain level of independence from the decoder implementation. While the first two models fail to capture some determining factors in terms of decoding complexity, the last one may become too specific to a certain implementation. This means that the MB complexity types for which the decoding complexity will be evaluated are characterized by a combination of shape and texture coding tools.

IV. MPEG-4 MACROBLOCK CLASSIFICATION

In the MPEG-4 Visual standard [4], a video object is defined by its texture and shape data.
Although video objects can have arbitrary shapes, texture and shape coding rely on an MB structure (16×16 luminance pixels), where the texture coding as well as the motion estimation and compensation tools are similar to those used in previously available video coding standards, e.g., MPEG-1 and H.263. In this paper, six different MPEG-4 texture coding modes are studied [4]:

Intra – The MB is encoded independently from past or future MBs.

Inter – The MB is differentially encoded, using motion compensation with one motion vector.

Intra+Q – Intra MB with a modified quantization step.

Inter+Q – Inter MB with a modified quantization step.

Inter4V – Inter MB using motion compensation with four motion vectors (one for each 8×8 luminance block).

Skipped – MB with no texture update information to be sent.

The texture coding modes above are the most basic MPEG-4 MB texture coding modes; they exist in all object types and thus in all profiles. However, some visual object types may have more sophisticated texture coding modes that were not considered in this paper, although the approach followed here also applies to them. The results obtained with the MB texture coding modes considered above are enough to demonstrate the claim that a more efficient VCV model than the one used by MPEG-4 can be developed based on the principles described in this paper. Shape data can be MPEG-4 encoded using seven different coding modes [4]:

NoUpdate && MVDS == 0 – The shape information for the current MB is equal to the shape of the corresponding MB in the past prediction VOP.

NoUpdate && MVDS != 0 – The shape information for the current MB is obtained from the past prediction VOP after motion compensation.

Opaque – All shape pixels in the MB belong to the object support; the object support is defined as the pixels where the shape value is higher than 0.

Transparent – None of the shape pixels in the MB belongs to the object support.
IntraCAE – The shape is encoded using Context-based Arithmetic Encoding (CAE) [4, 8] in Intra mode.

InterCAE && MVDS == 0 – The shape is encoded using CAE in Inter mode, without motion compensation.

InterCAE && MVDS != 0 – The shape is encoded using CAE in Inter mode, with motion compensation.

In order to reduce the number of MB decoding complexity types, the MB coding types with obviously similar complexities were grouped in the same complexity class, as shown in Table II. This is the case of the Intra and Intra+Q as well as the Inter and Inter+Q MB texture coding types, where the quantization step change does not cause a significant decoding complexity difference; the same is true for the MB shape coding types with and without MVDS (Motion Vector Difference for Shape), since both types need a prediction, although from different past spatial positions.

TABLE II
STUDIED TEXTURE AND SHAPE MB CODING TYPES

Texture coding types        Shape coding types
Intra (Intra & Intra+Q)     NoUpdate (MVDS == 0 & MVDS != 0)
Inter (Inter & Inter+Q)     Opaque
Inter4V                     Transparent
Skipped                     IntraCAE
                            InterCAE (MVDS == 0 & MVDS != 0)

Following the conclusion of Section III, the MB coding types whose decoding complexity will be evaluated are the combinations of the texture and shape coding types shown in Table II.

V. MACROBLOCK COMPLEXITY EVALUATION

To evaluate the decoding complexity of the various MB complexity types (combinations of shape and texture coding tools), it is necessary to establish a complexity criterion, i.e., a complexity measure. In this paper, the proposed measure is the decoding time of each MB complexity type obtained with an optimized version of the MoMuSys MPEG-4 video decoder [6] – the IST MPEG-4 compliant video codec – using several representative MPEG-4 test sequences and profile@level combinations.
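The grouped coding modes of Table II make each MB complexity type simply a texture/shape pair; a minimal data-model sketch (identifiers are ours, not from the standard or the reference software):

```python
# Sketch: an MB complexity type as a (texture, shape) pair after grouping the
# near-identical modes of Table II. Names are illustrative only.
from enum import Enum

class Texture(Enum):
    INTRA = "Intra"        # groups Intra and Intra+Q
    INTER = "Inter"        # groups Inter and Inter+Q
    INTER4V = "Inter4V"
    SKIPPED = "Skipped"

class Shape(Enum):
    NO_UPDATE = "NoUpdate"     # groups MVDS == 0 and MVDS != 0
    OPAQUE = "Opaque"
    TRANSPARENT = "Transparent"
    INTRA_CAE = "IntraCAE"
    INTER_CAE = "InterCAE"     # groups MVDS == 0 and MVDS != 0

def complexity_type(texture, shape):
    """Name of the combined MB complexity type, e.g. 'Inter4V+InterCAE'."""
    return f"{texture.value}+{shape.value}"

print(complexity_type(Texture.INTER4V, Shape.INTER_CAE))  # Inter4V+InterCAE
```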
The MoMuSys codec is included in MPEG-4 Part 5: Reference Software [6] and has been improved with the implementation of the MPEG-4 Video Buffering Verifier mechanism, following the architecture proposed in [9]. With the implementation of this mechanism, the codec allows the user to compliantly encode the created video scenes by choosing the appropriate object type for each video object and the selected profile@level for the scene [10]. Additionally, the coding structure of the IST MPEG-4 video codec has been modified so that a basic principle is followed for each coding time instant: all analysis processing is performed before encoding. With this structure, it is possible to implement more powerful rate control solutions, with the video analysis data, reflecting the current characteristics of each VO (and not of the previous VOP), being used to efficiently distribute the available resources before actually encoding each VOP. This is especially useful when the scene or a particular VO quickly changes its characteristics and thus the allocation and distribution of resources should immediately reflect these changes. Moreover, with this structure, the encoder can take adequate actions when facing an imminent violation of the Video Buffering Verifier mechanism, such as skipping the encoding of one or more VOPs when the incoming VOPs exceed the VMV limits or the VCV limits (decoding complexity). This is not so efficiently handled when only statistics of the previous time instant are used, as in the case of the original MPEG-4 reference software implementation.

A. Complexity evaluation conditions

To measure the MB decoding time, the video objects were encoded normally, i.e., using an encoder that chooses for each MB the best way to encode it according to its characteristics (rather than some predefined way). The decoding time for each MB was measured individually.
In this way, the decoding times of all encoded MBs were measured always using the most appropriate MB coding type. Since the decoding time of one MB can be too small to be measured with the available time functions, due to their limited precision, each MB was decoded 1000 times and the total MB decoding time was divided by 1000 to obtain single MB decoding times. The decoding time depends on the MB coding characteristics, notably Intra or Inter, since this determines the decoding operations to be performed:

Intra MBs – The measured decoding time is the sum of the shape header and shape decoding time (only for arbitrarily shaped objects), plus the texture header and texture decoding time, and the time spent in the padding process (only for boundary and transparent MBs) [4].

Inter MBs – The measured decoding time is the sum of the shape header and shape decoding time (only for arbitrarily shaped objects), plus the texture header and texture decoding time, the motion vector decoding time, the time spent in motion compensation, and the time spent in the padding process (only for boundary and transparent MBs).

The sequences and conditions used for measuring the MB decoding times, and thus evaluating the decoding complexity, were2:

Akiyo, 1 rect. VO, QCIF, Simple@L2 at 128 kbit/s and Core@L1 at 384 kbit/s;

News, 1 rect. VO, QCIF, Simple@L3 at 384 kbit/s and Core@L2 at 2000 kbit/s;

Stefan, 1 rect. VO, QCIF, Simple@L3 at 384 kbit/s and Core@L2 at 2000 kbit/s with I-VOPs only;

Children, 1 shaped VO, QCIF, Core@L1 at 384 kbit/s;

Weather, 1 shaped VO, QCIF, Core@L1 at 384 kbit/s;

Coastguard, 4 shaped VOs, QCIF, Core@L1 at 384 kbit/s and Core@L1 at 384 kbit/s using I-VOPs only;

News, 1 rect. VO, CIF, Simple@L3 at 384 kbit/s;

Stefan, 2 shaped VOs, CIF, Core@L2 at 2000 kbit/s.

The set of cases above tries to cover as much as possible the most relevant conditions, so that a statistically meaningful number of MBs is encoded for each MB coding type.
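The repeat-and-divide timing procedure described above can be sketched as follows; `decode_mb` is a hypothetical stand-in for the actual MB decoding routine, which is not shown here:

```python
import time

def measure_decoding_time(decode_mb, repetitions=1000):
    """Average the cost of one call over many repetitions, since a single MB
    decode is shorter than the resolution of ordinary time functions.
    `decode_mb` is a placeholder for the real MB decoding routine."""
    start = time.perf_counter()
    for _ in range(repetitions):
        decode_mb()
    return (time.perf_counter() - start) / repetitions  # seconds per call

# Illustrative use with a dummy workload standing in for an MB decode:
t = measure_decoding_time(lambda: sum(x * x for x in range(100)))
print(t > 0.0)  # True
```

Averaging over repetitions also smooths out scheduler noise, at the cost of keeping the MB data cache-warm, which is one reason such measurements are best read as relative rather than absolute costs.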
This explains why some sequences were encoded with Intra frames only: otherwise, too few Intra encoded MBs would appear. The Simple profile only accepts objects of type Simple and was created with low complexity applications in mind. The major applications are mobile audiovisual communications and very low complexity video on the Internet. Simple object type objects are error resilient, rectangular video objects of arbitrary height/width ratio, developed for low complexity terminals; this object type uses relatively simple and inexpensive coding tools, based on I (Intra) and P (Predicted) VOPs. The Core profile accepts Core and Simple object types and is useful for higher quality interactive services, combining good quality with limited complexity and supporting arbitrarily shaped objects; mobile broadcast services could also be supported by this profile. The Core object type uses a tool superset of the Simple object type, giving better compression efficiency and including binary shapes. This means that while the Simple profile only accepts rectangular objects, the Core profile accepts both rectangular and arbitrarily shaped objects.

Footnote 2: For the Core object type in the Core profile, only the MB coding types presented in Table II were used.

B. Complexity evaluation results

Figures 2 to 11 show the MB decoding times for the various MB coding types considered. The decoding times for the MB coding types that do not have texture data (no DCT coefficients) are presented using histograms, while the decoding times for the MB types with DCT coefficients are presented as a function of the number of DCT coefficients in the MB. In these figures, each dot represents the decoding time of one MB (there are thousands of them).

1) Rectangular VOs

Figure 2 shows the MB decoding times for rectangular video objects.
Considering maximum values, for the same number of DCT coefficients, Inter4V MBs take more time to decode than Inter MBs, and Intra MBs are decoded faster than the other MB coding types.

Figure 2 – MB decoding time for rect. VOs (Texture)

Figure 3 shows the decoding time of Skipped MBs for rectangular video objects. MBs of this type are decoded very quickly because they have neither texture nor shape encoded data to process.

Figure 3 – Skipped MBs decoding time for rect. VOs

2) Arbitrarily shaped VOs

With arbitrarily shaped VOs, there are not only opaque MBs to encode but also transparent and boundary MBs, and thus shape coding comes into play.

Transparent MBs

As can be seen in Figure 4, Transparent MBs take less time to decode than the other types of MBs because they do not have any texture or shape data to decode. There are, however, two types of Transparent MBs: the MBs that are far away from the object border, and the MBs next to the object border, to which a repetitive padding process has to be applied [4]. This padding process is responsible for an increase in the decoding time, leading to the two distinct cases of Transparent MBs shown in Figure 4.

Figure 4 – Transparent MBs decoding time

Skipped MBs

MBs with skipped texture and opaque shape from arbitrarily shaped objects take more time to decode than Skipped MBs from rectangular objects, due to the shape decoding time (Figure 5). To decode the Skipped MBs with NoUpdate shape (Skipped+NoUpdate), it is necessary to use the past VOP as well as the shape header. For this MB type, there are two distinct decoding cases: MBs that are padded and MBs without padding (Figure 6). The results also show that the decoding time increases when the shape is encoded with CAE. In this case, IntraCAE MBs typically take less time to decode than InterCAE MBs (Figure 7 and Figure 8, respectively).
Figure 5 – Skipped+Opaque MBs decoding time

Figure 6 – Skipped+NoUpdate MBs decoding time

Figure 7 – Skipped+IntraCAE MBs decoding time

Figure 8 – Skipped+InterCAE MBs decoding time

DCT Encoded MBs

The decoding time for the MBs whose texture is encoded with DCT depends on the number of encoded DCT coefficients, and increases linearly with the number of coefficients (Figures 9 to 11). For the same type of shape coding and considering the same number of DCT coefficients, the maximum decoding time increases with the texture coding type in the following order: Intra, Inter, and Inter4V. However, the differences are very small and it is difficult to establish a clear relation over the full range of numbers of DCT coefficients. For Intra (texture) MBs there are two distinct cases, depending on the use or not of AC prediction for the DCT coefficients; the MBs that use AC prediction do in fact take longer to decode. If the same type of texture coding is considered, the decoding time increases with the shape coding type in the following order: Opaque, NoUpdate, IntraCAE and InterCAE.

Figure 9 – Opaque MBs decoding time

Figure 10 – (Shape) NoUpdate MBs decoding time

Figure 11 – IntraCAE and InterCAE MBs decoding time

VI. RELATIVE COMPLEXITY WEIGHTS

The previous section has shown that the decoding complexity, measured in terms of the decoding time, varies significantly according to the MB coding type, and not only according to the boundary versus non-boundary distinction assumed by the MPEG-4 VCV model [4]. This section proposes MB relative complexity weights (obtained after extensive measurements) that should model more effectively the decoding complexity of an MPEG-4 encoded video object.
Taking into account that the MPEG-4 VCV model is implicitly designed for the most complex MB coding type, the complexity weights must be defined relative to the most complex MB coding type in the context of each profile; this means that the maximum complexity weight is set to 1 for this MB coding type and all the other weights are relative to this one, and thus smaller than 1. This solution allows the implementation of a "trading system", where it is possible, for example, to trade one of the most complex MBs for two MBs with half the relative complexity, while still keeping the bitstream decodable by a compliant decoder, i.e., without requiring higher decoding resources. The relative complexity weight for each MB complexity type, ki, is thus obtained as the ratio between the maximum decoding time for that MB type (a conservative approach, since most of the time the MBs of that type will be less complex) and the highest maximum decoding time among all the MB types relevant for the profile in question: for the cases studied in this paper, the Inter4V+InterCAE MB type for the Core profile and the Inter4V MB type for the Simple profile, considering the MB coding types indicated in Table II3:

ki = tmax(MBi) / tmax(Inter4V+InterCAE)   (Core profile)
ki = tmax(MBi) / tmax(Inter4V)            (Simple profile)          (1)

Table III and Table IV show the maximum decoding times and the relative complexity weights for the various MB coding types studied, respectively for the Simple profile and for the Core profile, under the conditions mentioned above.

TABLE III
MAXIMUM DECODING TIME AND RELATIVE COMPLEXITY WEIGHT FOR THE VARIOUS MB CODING TYPES IN THE SIMPLE PROFILE

MB coding type   Maximum time (ms)   Relative complexity weight (ki)
Skipped          0.16                0.13
Intra            1.08                0.89
Inter            1.06                0.88
Inter4V          1.21                1.00

TABLE IV
MAXIMUM DECODING TIME AND RELATIVE COMPLEXITY WEIGHT FOR THE STUDIED MB CODING TYPES IN THE CORE PROFILE

MB coding type            Maximum time (ms)   Relative complexity weight (ki)
Skipped (only rect. VO)   0.16                0.09
Intra (only rect. VO)     1.08                0.59
Inter (only rect. VO)     1.06                0.58
Inter4V (only rect. VO)   1.21                0.66
Transparent               0.21                0.12
Skipped+Opaque            0.22                0.12
Intra+Opaque              1.06                0.58
Inter+Opaque              1.04                0.57
Inter4V+Opaque            1.28                0.70
Skipped+NoUpdate          0.38                0.21
Intra+NoUpdate            0.97                0.53
Inter+NoUpdate            1.17                0.64
Inter4V+NoUpdate          1.38                0.76
Skipped+IntraCAE          0.58                0.32
Intra+IntraCAE            1.46                0.80
Inter+IntraCAE            1.60                0.88
Inter4V+IntraCAE          1.77                0.97
Skipped+InterCAE          0.73                0.40
Inter+InterCAE            1.73                0.95
Inter4V+InterCAE          1.82                1.00

Footnote 3: For the Core profile, not all MB coding types have been studied.

Since there are some MB coding types whose decoding times, and particularly the maximum decoding times, are very similar, the VCV model to be proposed in the following can be simplified by grouping these MB types into a single complexity class, as shown in Table V. The relative complexity weight attributed to each class is the weight of the most complex MB type included in that class (again a conservative approach).

TABLE V
RELATIVE COMPLEXITY WEIGHT FOR THE DEFINED MB DECODING COMPLEXITY CLASSES

MB complexity class   MB coding types                                      Relative complexity weight
                                                                           Simple profile   Core profile
C1                    Inter4V+InterCAE, Inter+InterCAE, Inter4V+IntraCAE   –                1.00
C2                    Inter+IntraCAE, Intra+IntraCAE                       –                0.88
C3                    Inter4V+NoUpdate, Inter+NoUpdate, Intra+NoUpdate     –                0.76
C4                    Inter4V+Opaque, Inter+Opaque, Intra+Opaque           –                0.70
C5                    Skipped+InterCAE                                     –                0.40
C6                    Skipped+IntraCAE                                     –                0.32
C7                    Skipped+NoUpdate                                     –                0.21
C8                    Skipped+Opaque                                       –                0.12
C9                    Transparent                                          –                0.12
C10                   Inter4V (only rect. VO)                              1.00             0.66
C11                   Intra, Inter (only rect. VO)                         0.89             0.59
C12                   Skipped (only rect. VO)                              0.13             0.09

The weights presented above have been defined in a rather conservative way, by using the most complex case within each MB complexity class, so that the weights stay valid even if there is some decoding complexity variation across different decoder implementation platforms.

VII.
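Eq. (1) amounts to a simple normalization of the measured per-type maximum decoding times; the following sketch applies it to a subset of the Core-profile values from Table IV:

```python
# Sketch of Eq. (1): relative complexity weights obtained by normalizing the
# measured maximum decoding times by that of the most complex MB type
# (Inter4V+InterCAE for the Core profile). Times in ms, subset of Table IV.
max_time_ms = {
    "Inter4V+InterCAE": 1.82,   # reference: most complex Core-profile type
    "Inter+InterCAE": 1.73,
    "Skipped+Opaque": 0.22,
    "Transparent": 0.21,
}

t_ref = max_time_ms["Inter4V+InterCAE"]
weights = {mb: round(t / t_ref, 2) for mb, t in max_time_ms.items()}
print(weights)  # e.g. Transparent -> 0.12, Inter+InterCAE -> 0.95
```

Running the same normalization over all twenty measured types reproduces the weight column of Table IV.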
IST VCV MODEL: A MORE EFFICIENT SOLUTION

Exploiting the MB relative decoding complexity weights presented above, this section proposes an alternative, more efficient VCV model: the IST VCV model. This model is based on a single buffer with a single decoding rate, using different MB complexity weights for the various MB decoding complexity classes [7]. The major characteristics of the IST VCV model are:

Complexity model based on the MB coding tools – The distinction in terms of decoding complexity between the various MBs is associated with the different MB texture and shape coding tools used, i.e., the MB complexity classes are related to a texture-shape tools combination for which a relative complexity weight is measured.

Single buffer with relative MB complexity weights – In the proposed model, a single buffer stores all the encoded MBs, but each MB is weighted according to its decoding complexity class. Thus, the IST VCV buffer occupancy corresponds to a weighted sum of the encoded MBs. The IST and MPEG-4 VCV buffer sizes are made the same, making it possible to compare the two models in a simple way, since the decoding computational resources remain the same.

Single decoding rate – The use of a single buffer with MB complexity weights implies a single decoding rate. The IST and MPEG-4 VCV decoding rates are made the same, again making it possible to compare the two models in a simple way, since the decoding computational resources remain the same.

The main advantage of the IST VCV solution relative to the MPEG-4 VCV solution is that it models more closely the real decoding complexity of a given set of bitstreams building a video scene, since the different types of MBs are distinguished in terms of decoding complexity; decoding resources are thus not wasted through the penalizing assumption that all MBs besides boundary MBs are equally and maximally complex (while there are actually large variations, as shown in Table V).
For the IST VCV model proposed in this paper, the number of equivalent (to the most complex) MBs for a given VOP i, Mi, that is added to the VCV buffer at each decoding time instant, ti, is given by the following expression:

Mi = Σ (j = 1 to Nc) kj × Mcj

where kj is the relative complexity weight (1) associated with the MB complexity class j, Mcj is the number of MBs in VOP i belonging to the complexity class j, and Nc is the number of complexity classes: 3 for the Simple profile and 12 for the Core profile (using the MB coding types in Table II). The new VCV model assumes a full sharing of the available decoding resources, which is in principle valid at least for software decoder implementations. The adoption by MPEG-4 of the alternative VCV model following the approach proposed in this paper (already recognized by MPEG as more adequate) would imply changing the MPEG-4 video decoding complexity model, removing the B-VCV buffer while keeping the VCV buffer with the same parameters; moreover, the VCV filling model would change from a simple addition of the number of MBs to a weighted addition of the number of MBs, using the relative complexity weights as weights. Since the IST VCV decoding rate and buffer size are unchanged relative to the MPEG-4 VCV model for each profile@level, a direct comparison between the two models can easily be done, because the decoder computational resources are maintained. A comparison between the two VCV models is presented in the next section.

VIII. IST AND MPEG-4 VCV MODELS: A COMPARISON

The ideal way to compare and validate the IST VCV model relative to the MPEG-4 VCV model would be by decoding bitstreams which, for a given profile@level, would violate the MPEG-4 VCV model but not the IST VCV model, and showing that these scenes could be decoded by a compliant MPEG-4 decoder, fulfilling the required timing constraints.
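The weighted fill expression above can be sketched directly. The weights below are a small subset of the measured Core-profile relative weights, and the example VOP composition is illustrative, not taken from a real bitstream:

```python
# Sketch of the IST VCV buffer fill: each VOP adds M_i = sum_j k_j * Mc_j
# equivalent MBs instead of the raw MB count. Weights: subset of the measured
# Core-profile relative complexity weights; the VOP mix is illustrative.
relative_weight = {
    "Inter4V+InterCAE": 1.00,
    "Inter+Opaque": 0.57,
    "Skipped+Opaque": 0.12,
    "Transparent": 0.12,
}

def equivalent_mbs(mb_counts):
    """mb_counts maps a complexity class to the number of such MBs in the VOP."""
    return sum(relative_weight[c] * n for c, n in mb_counts.items())

# A 99-MB QCIF-sized VOP: the MPEG-4 VCV would add 99 MBs, the IST VCV far fewer.
vop = {"Inter4V+InterCAE": 10, "Inter+Opaque": 20,
       "Skipped+Opaque": 40, "Transparent": 29}
print(equivalent_mbs(vop))  # about 29.7 equivalent MBs instead of 99
```

This weighted fill is the only change with respect to the MPEG-4 model: the buffer size and drain rate stay exactly as in Table I, which is what makes the two models directly comparable.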
Such a demonstration would show that the existing profile@level decoding resources were enough for those (MPEG-4 non-compliant) bitstreams, and thus that the MPEG-4 VCV model wastes these resources (due to complexity over-dimensioning) when it prevents those bitstreams from being classified as compliant with the profile@level in question. However, this comparison and validation can only be done with a real-time decoder, which was not available at the time this work was done. Thus, the comparison between the two VCV models is done here by comparing the occupancy of the two VCV buffers, and thus the effects of the proposed approach, under the assumption that the measured relative complexity weights are valid.

To perform this comparison, an encoder with only the VBV (bit rate) rate control active was used. The feedback mechanism that prevents the violation of the VCV (complexity) and VMV (memory) models was disabled in order to allow the visualization of the corresponding buffer occupancy evolution even above 100% occupancy. In this case, the encoded bitstreams are the same for both models, since there is no feedback control, but the VCV fullness computation (MB decoding complexity evaluation) is done differently, depending on the considered model. This comparison methodology allows verifying that, in many situations, the MPEG-4 VCV model exceeds the 100% buffer occupancy while the IST VCV model does not. This means that the use of the IST VCV model would allow, for a given profile@level, encoding in a “compliant way” (i.e., using the same decoding resources) video scenes that the MPEG-4 VCV model does not allow, due to its clear over-dimensioning of the MB decoding complexity for most MB coding types. This holds when similar spatial resolutions and temporal VOP rates are used, e.g., no VOP skipping is applied.

A.
Scenes with one rectangular object

Figure 12 shows one original frame of the test sequence Akiyo, rectangular, in QCIF format, which has been encoded at 15 fps with Simple@L1 at 64 kbit/s; Figure 13 shows the MPEG-4 and IST VCV occupancies for this sequence.

Figure 12 – Sample of the Akiyo sequence

As can be seen in Figure 13, the MPEG-4 VCV occupancy is always 100%, which makes the encoded bitstream MPEG-4 compliant for the considered profile@level. With the IST VCV model, the sequence can also be compliantly encoded with the same profile@level, but the VCV occupancy is lower, varying between 25% and 50% during the whole encoding process. This means that, with the IST VCV model, it would be possible to encode this sequence at a higher frame rate, for example 25 fps, exploiting the fact that some MB types, e.g., Skipped, are in reality much less complex than the most complex MB coding type (Inter4V). The term “compliant” means here that, maintaining the standardized decoding resources for a certain profile@level, the bitstream could be decoded fulfilling the necessary time constraints, since it is not really more complex than other “officially” MPEG-4 compliant bitstreams (for the relevant profile@level). A non-compliant classification by the MPEG-4 VCV model would, in that case, be due only to the decoding complexity over-estimation of some MB types; for rectangular objects, this over-estimation is mainly related to Skipped MBs, as can be seen in Table V.

Figure 13 – MPEG-4 and IST VCV occupancy: Akiyo, QCIF, 15 fps, Simple@L1 at 64 kbit/s

B. Scenes with several arbitrarily shaped objects

To make a rigorous comparison between the MPEG-4 VCV and the IST VCV models in scenes with arbitrarily shaped objects, the restriction imposed by the MPEG-4 B-VCV, requiring that the number of boundary MBs at each decoding time instant is not greater than half the B-VCV capacity, must be considered.
This means that, from a complexity point of view, the MPEG-4 VCV worst-case scenario corresponds to the case where 50% of the MBs are of the most complex non-boundary MB type (Inter4V+Opaque) and 50% are of the most complex boundary MB type (Inter4V+InterCAE). To accommodate this case, and only for the purpose of the comparison between the two VCV models, the relative complexity weights have to be changed, using as reference the average of the maximum decoding times of the Inter4V+InterCAE and Inter4V+Opaque types, and not the Inter4V+InterCAE type proposed in the IST VCV model, which should be used if the MPEG-4 B-VCV did not exist. In this circumstance, the “trading system” is not simply referred to the Inter4V+InterCAE type, because the decoder does not have to support the case where all the MBs are of this type, but has to be referred to the average complexity between the most complex MB types of rectangular and arbitrarily shaped objects (using the MB coding types indicated in Table II). In this situation, it is natural that the complexity cost of an Inter4V+InterCAE MB is higher than 1, since this type is more complex than the reference complexity. Considering all these facts, to compare the two VCV models, new relative decoding complexity weights have to be computed for the profiles under study, using the following reference time:

t_ref = [t_max(Inter4V+InterCAE) + t_max(Inter4V+Opaque)] / 2

As a consequence, the relative complexity weight for each MB coding type is obtained by:

k_i = t_max(MB_i) / t_ref

The new relative complexity weights for the various MB complexity classes for the Simple and Core profiles, using the MB coding types indicated in Table II, are presented in Table VI.
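The re-normalization just described can be sketched as follows; the t_max values used here are hypothetical placeholders, not the measured decoding times:

```python
# Sketch of the weight re-normalization used for the model comparison:
# the reference time is the average of the worst-case boundary and
# non-boundary MB decoding times, and each weight is the class's maximum
# decoding time divided by that reference. The t_max values below are
# hypothetical, not the measured ones.

def reference_time(t_inter4v_intercae, t_inter4v_opaque):
    # t_ref = [t_max(Inter4V+InterCAE) + t_max(Inter4V+Opaque)] / 2
    return (t_inter4v_intercae + t_inter4v_opaque) / 2.0

def weight(t_max_mb, t_ref):
    # k_i = t_max(MB_i) / t_ref
    return t_max_mb / t_ref

t_ref = reference_time(1.40, 1.00)    # hypothetical times (e.g., in ms)
print(round(weight(1.40, t_ref), 2))  # Inter4V+InterCAE: above 1, as expected
print(round(weight(0.12, t_ref), 2))  # a cheap MB type: well below 1
```

Note that, by construction, the Inter4V+InterCAE weight comes out above 1, consistent with the observation that this type is more complex than the averaged reference.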
TABLE VI
RELATIVE COMPLEXITY WEIGHTS FOR THE STUDIED MB CODING TYPES IN THE SIMPLE AND CORE PROFILES

MB complexity class | Relative complexity weight (k_i)
C1  | 1.17
C2  | 1.03
C3  | 0.89
C4  | 0.83
C5  | 0.47
C6  | 0.37
C7  | 0.25
C8  | 0.14
C9  | 0.14
C10 | 0.78
C11 | 0.70
C12 | 0.10

MB coding types (grouped into the classes above as indicated in Table II): Inter4V+InterCAE; Inter+InterCAE; Inter4V+IntraCAE; Inter+IntraCAE; Intra+IntraCAE; Inter4V+NoUpdate; Inter+NoUpdate; Intra+NoUpdate; Inter4V+Opaque; Inter+Opaque; Intra+Opaque; Skipped+InterCAE; Skipped+IntraCAE; Skipped+NoUpdate; Skipped+Opaque; Transparent; Inter4V (rectangular VOs only); Inter (rectangular VOs only); Intra (rectangular VOs only); Skipped (rectangular VOs only).

Figure 14 – Sample of the Coastguard sequence

Figure 14 shows one original frame of the test sequence Coastguard, in QCIF format, which has been encoded at 30 fps with Core@L1 at 384 kbit/s. Figure 15, presenting the MPEG-4 and IST VCV occupancies for this sequence, shows that the MPEG-4 VCV buffer overflows and thus the bitstream is not compliant. The IST VCV occupancy shows that the scene is not too complex to be encoded at the considered profile@level, since it stays around 70% during most of the decoding process. The MPEG-4 VCV occupancy peak that can be seen in Figure 15 is caused by the large number of transparent MBs appearing in the corresponding VOPs (after the peak, the total number of MBs remains constant and hence the MPEG-4 VCV occupancy becomes flat). Since Transparent MBs have a low relative complexity weight in the IST VCV model, the peak is attenuated and the IST VCV buffer does not overflow. It is important to notice that the number of transparent MBs in a scene strongly influences the MPEG-4 VCV performance.

Figure 15 – MPEG-4 and IST VCV occupancy: Coastguard, QCIF, 30 fps, Core@L1, 384 kbit/s

Figure 16 shows one original frame of the Children_and_Flag test sequence, with 3 VOs, in QCIF format, which has been encoded at 30 fps with Core@L1 at 384 kbit/s.
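The attenuation of transparent-MB peaks can be made concrete with a small sketch; the transparent weight and the MB counts below are illustrative assumptions, not the measured Table VI values:

```python
# Why a burst of transparent MBs inflates the MPEG-4 VCV occupancy but
# barely moves the IST VCV one: MPEG-4 counts every MB as maximally
# complex, while the IST model gives transparent MBs a small weight.
# The weight value and MB counts below are illustrative assumptions.

W_TRANSPARENT = 0.14   # assumed small weight for the Transparent class

def added_load(n_other, n_transparent, w_other=1.0):
    mpeg4 = n_other + n_transparent                        # flat count
    ist = w_other * n_other + W_TRANSPARENT * n_transparent
    return mpeg4, ist

# A VOP where a bounding box suddenly brings in many transparent MBs.
mpeg4, ist = added_load(n_other=80, n_transparent=220)
print(mpeg4)            # 300 equivalent MBs under the flat rule
print(round(ist, 1))    # 110.8: the transparent burst is strongly attenuated
```

Here the transparent burst almost triples the flat MB count but adds only a modest weighted load, mirroring the Coastguard peak behavior described above.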
The “Children” and “Flag” bounding boxes have a high number of transparent MBs and overlap in the scene; this contributes to the high MPEG-4 VCV occupancy, as can be seen in Figure 17. Notice that although this scene has only 3 video objects and the majority of its MBs are transparent (58% transparent, 41% boundary, and 1% opaque), it cannot be encoded with Core@L1 using the MPEG-4 VCV model, which looks unrealistic even at first sight. The MPEG-4 VCV buffer occupancy is always very high, due to the high number of transparent MBs in the “Children” and “Flag” bounding boxes, and it increases when the “MPEG-4 Logo” object appears, leading to a non-compliant set of bitstreams because the VCV occupancy exceeds 100%. The same figure shows that the IST VCV buffer occupancy is always under 50%, since the Transparent MB complexity weight is rather low, thus allowing the scene to be “compliantly” encoded with Core@L1.

Figure 16 – Sample of the Children_and_Flag sequence

Figure 17 – MPEG-4 and IST VCV occupancy: Children_and_Flag, QCIF, 30 fps, Core@L1, 384 kbit/s

Another example that shows the weakness of the MPEG-4 VCV model in the presence of transparent MBs, and the effectiveness of the IST VCV model, is the MPEG-4 test sequence News (Figure 18), with 4 VOs, in CIF format, encoded at 30 fps with Core@L2 at 2000 kbit/s. Figure 19 shows the MPEG-4 and IST VCV occupancies for this sequence: the MPEG-4 VCV buffer capacity is largely exceeded during the coding process, while the IST VCV occupancy is always around 50%, which shows that this scene can be encoded in Core@L2. The influence of transparent MBs on the MPEG-4 VCV model is easily verified in this example. Figure 20 shows the number of transparent, opaque, and boundary MBs along the scene: the number of boundary and opaque MBs stays approximately constant, while the number of transparent MBs oscillates between two (rather high and similar) values.
As can be seen in Figures 19 and 20, when the number of transparent MBs increases, there is a corresponding increase in the MPEG-4 VCV occupancy, and when the number of transparent MBs decreases, the MPEG-4 VCV occupancy decreases as well. On the other hand, the IST VCV buffer occupancy stays approximately constant, because of the low complexity weight that Transparent MBs have in this model.

Figure 18 – Sample of the News sequence

Figure 19 – MPEG-4 and IST VCV occupancy: News, CIF, 30 fps, Core@L2, 2000 kbit/s

Figure 20 – Number of MBs per shape type: News sequence

IX. FINAL REMARKS

This paper proposes an alternative Video Complexity Verifier model to the one specified in the MPEG-4 Visual standard, based on a set of MB relative decoding complexity weights assigned to the MPEG-4 MB coding types presented in Table II. These weights allow measuring the real decoding complexity of a given MPEG-4 encoded scene more precisely. Complexity measurements show that the MPEG-4 VCV model over-estimates the decoding complexity of some scenes, notably because some MB types, such as Transparent and Skipped MBs, are over-evaluated (not distinguished from the really complex ones) in terms of decoding complexity. On the other hand, the IST VCV model allows the encoding of many of the scenes considered too complex by the MPEG-4 VCV model for a given profile@level. These scenes can be decoded by a compliant decoder without changing the decoding resources, thus making better use of these resources. The efficient use and sharing of the available decoding resources is very important, mainly in applications where they are scarce and expensive, e.g., mobile terminals. Mobile applications should be among the first where MPEG-4 will “explode”, as demonstrated by the adoption of MPEG-4 video coding by 3GPP (3rd Generation Partnership Project), responsible for the UMTS specification.

REFERENCES

[1] F. Pereira, “MPEG-4: Why, What, How and When?”, Signal Processing: Image Communication, Tutorial Issue on the MPEG-4 Standard, vol. 15, no. 4-5, pp. 271-279, December 1999.
[2] MPEG Requirements Group, “MPEG-4 Applications Document”, Doc. ISO/IEC JTC1/SC29/WG11/N2724, 47th MPEG meeting, Seoul, March 1999.
[3] MPEG Requirements Group, “MPEG-4 Overview”, Doc. ISO/IEC JTC1/SC29/WG11/N3930, 55th MPEG meeting, Pisa, January 2001.
[4] ISO/IEC 14496-2:1999, “Information Technology – Coding of Audio-Visual Objects – Part 2: Visual”, December 1999.
[5] R. Koenen, “Profiles and Levels in MPEG-4: Approach and Overview”, Signal Processing: Image Communication, Tutorial Issue on the MPEG-4 Standard, vol. 15, no. 4-5, pp. 463-478, December 1999.
[6] ISO/IEC 14496-5:1999, “Information Technology – Coding of Audio-Visual Objects – Part 5: Reference Software”, December 1999.
[7] P. Nunes, F. Pereira, “MPEG-4 Compliant Video Encoding: Analysis and Rate Control Strategies”, Proceedings of the ASILOMAR 2000 Conference, Pacific Grove, CA, USA, October 2000.
[8] N. Brady, “MPEG-4 Standardized Methods for the Compression of Arbitrarily Shaped Video Objects”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 8, pp. 1170-1189, December 1999.
[9] P. Nunes, F. Pereira, “Implementing the MPEG-4 Natural Visual Profiles and Levels”, Doc. M4878, 48th MPEG meeting, Vancouver, July 1999.
[10] J. Valentim, P. Nunes, F. Pereira, “IST MPEG-4 Video Compliant Framework”, 3rd Conference on Telecommunications, Figueira da Foz, Portugal, April 2001.