Chapter 7: MPEG-4
Digital Signal Processing Handbook, CRC Press, 1999

MPEG-4 Based Multimedia Information System

Ya-Qin Zhang
Microsoft Research, China
5F, Beijing Sigma Center, No. 49 Zhichun Road, Haidian District
Beijing 100080, PRC
yzhang@microsoft.com

ABSTRACT

The recent creation and finalization of the MPEG-4 international standard has provided a common platform and unified framework for multimedia information representation. In addition to providing highly efficient compression of both natural and synthetic audiovisual (AV) content such as video, audio, sound, texture maps, graphics, still images, MIDI, and animated structures, MPEG-4 enables greater capabilities for manipulating AV content in the compressed domain through its object-based representation. MPEG-4 is a natural result of the technological convergence of several fields: digital television, computer graphics, interactive multimedia, and the Internet. This tutorial chapter briefly discusses some example features and applications enabled by the MPEG-4 standard.

Key words: multimedia, MPEG-4, digital video, World Wide Web

INTRODUCTION

During the last decade, a spectrum of standards in digital video and multimedia has emerged for different applications. These standards include ISO JPEG for still images [JPEG-90]; ITU-T H.261 for video conferencing at rates from 64 kilobits per second (kbps) to 2 megabits per second (Mbps) [H261-91]; ITU-T H.263 for PSTN-based video telephony [H263-95]; ISO MPEG-1 for CD-ROM storage at VHS quality [MPEG1-92]; ISO MPEG-2 for digital TV [MPEG2-94]; and the recently completed ISO MPEG-4 international standard for multimedia representation and integration [MPEG4-98]. Two new ISO standards are under development to address next-generation still image coding (JPEG 2000) and content-based multimedia information description (MPEG-7). Several special issues of IEEE journals have been devoted to summarizing recent advances in digital image and video compression and advanced TV in terms of standards, algorithms, implementations, and applications [IEEE-95-2, IEEE-95-7, IEEE-97-2, IEEE-98-11].

The successful convergence and implementation of MPEG-1 and MPEG-2 became a catalyst propelling new digital consumer markets such as Video CD, digital TV, DVD, and DBS. While the MPEG-1 and MPEG-2 standards were primarily targeted at providing high compression efficiency for storage and transmission of pixel-based video and audio, MPEG-4 is intended to support a wide variety of multimedia applications and new functionalities for object-based audiovisual (AV) content. The recent completion of MPEG-4 Version 1 is expected to stimulate emerging multimedia applications in wireless networks, the Internet, and content creation.

The MPEG-4 effort was originally conceived in late 1992 to address very low bit rate (VLBR) video applications at below 64 kbps, such as PSTN-based videophone, video e-mail, security applications, and video over cellular networks. The main motivations for focusing MPEG-4 on VLBR applications were the following:

- Applications such as PSTN videophone and remote monitoring were important, but not adequately addressed by established or emerging standards; in fact, new products were being introduced to the market with proprietary schemes.
- The need for a standard at rates below 64 kbps was imminent.
- Research activities in VLBR video coding had intensified, some of which went beyond the boundary of the traditional statistics-based, pixel-oriented methodology.
- It was felt that a new breakthrough in video compression was possible within a five-year window. Such a "quantum leap" would likely make compressed video quality at below 64 kbps adequate for many applications such as videophone.

Based on these assumptions, a work plan was generated to complete the MPEG-4 Committee Draft (CD) in 1997 and to provide a generic audiovisual coding standard at very low bit rates. Several MPEG-4 seminars were held in parallel with the WG11 meetings, many workshops and special sessions were organized, and several special issues were devoted to these topics. However, as of the July 1994 WG11 meeting in Norway, there was still no clear evidence that a "quantum leap" in compression technology would happen within the MPEG-4 time frame. Meanwhile, ITU-T had embarked on an effort to define the H.263 standard for videophone applications over PSTN and mobile networks. The need for defining a pure compression standard at very low bit rates was therefore not entirely justified.

In light of this situation, a change of direction was called for, refocusing on new or improved functionalities and applications not addressed by existing and emerging standards. Examples include object-oriented features for content-based multimedia databases, error-robust communication in wireless networks, and hybrid natural and synthetic image authoring and rendering. With the technological convergence of digital video, computer graphics, and the Internet, MPEG-4 aims at providing an audiovisual coding standard allowing for interactivity, high compression, and/or universal accessibility, with a high degree of flexibility and extensibility. In particular, MPEG-4 intends to establish a flexible content-based audiovisual environment that can be customized for specific applications and that can be adapted in the future to take advantage of new technological advances. It is foreseen that this environment will be capable of addressing new application areas, ranging from conventional storage and transmission of audio and video to truly interactive AV services requiring content-based AV database access, e.g., video games or AV content creation. Efficient coding, manipulation, and delivery of AV information over the Internet will be key features of the standard.

MPEG-4 MULTIMEDIA SYSTEM

Figure 1 shows an architectural overview of MPEG-4. The standard defines a set of syntax to represent individual audiovisual objects, with both natural and synthetic content. These objects are first encoded independently into their own elementary streams. Scene description information is provided separately, defining the location of these objects in space and time; the objects are then composed into the final scene presented to the user. This representation includes support for user interaction and manipulation. The scene description uses a tree-based structure, following the Virtual Reality Modeling Language (VRML) design. Moving far beyond the capabilities of VRML, MPEG-4 scene descriptions can be dynamically constructed and updated, enabling much higher levels of interactivity.
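To make the separation between coded objects and the scene description more concrete, the following Python sketch models a toy scene tree. The class and field names (SceneNode, ObjectDescriptor, stream_id, and so on) are illustrative assumptions, not the normative MPEG-4 Systems syntax; the point is only to show how a tree of nodes can place independently coded elementary streams in space and time, and how the composition can change without touching the media itself. The toy descriptor here anticipates the object descriptor mechanism discussed next.

```python
# Illustrative sketch only; names are hypothetical, not normative MPEG-4 syntax.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ObjectDescriptor:
    # Binds a scene node to the elementary stream carrying its coded data.
    object_id: int
    stream_id: int            # elementary stream holding the coded AV data
    media_type: str           # e.g. "natural_video", "synthetic_texture", "audio"

@dataclass
class SceneNode:
    # One node of the tree-structured (VRML-like) scene description.
    name: str
    position: Tuple[float, float] = (0.0, 0.0)   # placement in the composed scene
    start_time: float = 0.0                      # when the object enters the scene
    descriptor: Optional[ObjectDescriptor] = None
    children: List["SceneNode"] = field(default_factory=list)

# A background object and a foreground speaker, each carried in its own
# elementary stream and positioned by the scene description.
scene = SceneNode("root", children=[
    SceneNode("background",
              descriptor=ObjectDescriptor(1, 101, "natural_video")),
    SceneNode("speaker", position=(120.0, 80.0), start_time=2.0,
              descriptor=ObjectDescriptor(2, 102, "natural_video")),
])

# Because the scene tree is separate from the coded media, an update can
# reposition an object without re-encoding any elementary stream.
scene.children[1].position = (200.0, 80.0)
```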
Object descriptors are used to associate scene description components with the actual elementary streams that contain the corresponding coded data. As shown in Figure 1, these components are encoded separately and transmitted to the receiver. The receiving terminal then has the responsibility of composing the individual objects for presentation and of managing user interaction.

Figure 1. MPEG-4 overview. Audiovisual objects, both natural and synthetic media, are independently coded and then combined according to scene description information (courtesy of the ISO MPEG-4 committee).

The following eight MPEG-4 functionalities, clustered into three classes, are defined [MPEG4-REQ]:

- Content-based interactivity
  - Content-based manipulation and bitstream editing
  - Content-based multimedia data access tools
  - Hybrid natural and synthetic data coding
  - Improved temporal access
- Compression
  - Improved coding efficiency
  - Coding of multiple concurrent data streams
- Universal access
  - Robustness in error-prone environments
  - Content-based scalability

Some of the applications enabled by these functionalities include:

- Video streaming over the Internet
- Multimedia authoring and presentation
- Viewing of video content at different resolutions, speeds, angles, and quality levels
- Storage and retrieval of multimedia databases over mobile links with high error rates and low channel capacity (e.g., a personal digital assistant)
- Multipoint teleconferencing with selective transmission, decoding, and display of the "interesting" parties
- Interactive home shopping, with customers selecting items from a video catalogue
- Stereo vision and multiview presentation of video content, e.g., sports
- "Virtual" conferences and classrooms
- Video e-mail, agents, and answering machines

AN EXAMPLE

Figure 2 shows an example of an object-based authoring tool for MPEG-4 AV content, recently developed by the Multimedia Technology Laboratory at Sarnoff Corporation in Princeton, New Jersey. The tool has the following features:

- compression and decompression of different visual objects into MPEG-4-compliant bitstreams
- drag-and-drop placement of video objects into a window, with resizing of the objects or adaptation to different frame rates, speeds, transparencies, and layers
- substitution of different backgrounds
- mixing of natural image and video objects with computer-generated synthetic textures and animated objects
- creation of metadata information for each visual object

Figure 2. An example multimedia authoring system using MPEG-4 tools and functionalities (courtesy of Sarnoff Corporation).

This set of authoring tools can be used for interactive Web design, digital studio production, and multimedia presentation. It empowers users to compose and interact with digital video at a higher semantic level.
REFERENCES

[JPEG-90] ISO/IEC 10918-1, "JPEG Still Image Coding Standard," 1990.
[H261-91] CCITT Recommendation H.261, "Video Codec for Audiovisual Services at 64 to 1920 kbps," 1990.
[H263-95] ITU-T/SG15/LBC, Draft Recommendation H.263, "Video Coding for Narrow Telecommunication Channels at below 64 kbps," May 1995.
[MPEG1-92] ISO/IEC 11172, "Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbps," 1992.
[MPEG2-94] ISO/IEC 13818, "Generic Coding of Moving Pictures and Associated Audio," 1994.
[IEEE-95-2] Y.-Q. Zhang, W. Li, and M. Liou, Eds., "Advances in Digital Image and Video Compression," special issue, Proceedings of the IEEE, Feb. 1995.
[IEEE-95-7] M. Kunt, Ed., "Digital Television," special issue, Proceedings of the IEEE, July 1995.
[IEEE-97-2] Y.-Q. Zhang, F. Pereira, T. Sikora, and C. Reader, Eds., "MPEG-4," special issue, IEEE Transactions on Circuits and Systems for Video Technology, Feb. 1997.
[IEEE-98-3] T. Chen, R. Liu, and A. Tekalp, Eds., "Multimedia Signal Processing," special issue, Proceedings of the IEEE, May 1998.
[IEEE-98-11] M.-T. Sun, K. Ngan, T. Sikora, and S. Panchanathan, Eds., "Representation and Coding of Images and Video," special issue, IEEE Transactions on Circuits and Systems for Video Technology, Nov. 1998.
[MPEG4-REQ] MPEG-4 Requirements Ad Hoc Group, "MPEG-4 Requirements," ISO/IEC JTC1/SC29/WG11, Maceio, Nov. 1996.
[MPEG4-98] ISO/IEC JTC1/SC29/WG11, "MPEG-4 Draft International Standard," Oct. 1998.