A Picture is Worth a Thousand Words
Milton Chen

What’s a Picture Worth?
• A thousand words - Descartes (1596-1650)
• A thousand bytes - modern translation
  – 1000 * 5 * 5 / 3 ≈ 8,000 bits
• 75,000 bytes - ATSC/MPEG-2
  – 20 M / 30 ≈ 600,000 bits

Frequency Response of the Eye
• Lens - low pass
• Photoreceptors - low pass
• Lateral inhibition - high pass
  – edge is important

Today’s Video Coding
YUV (lossy) -> Motion -> DCT -> Quantize (lossy) -> Order -> Entropy
Designed for natural scenes
=> Higher frequency DCT coefficients are quantized more
=> Sharp edges are not well preserved

What’s Wrong with Today’s Video Coding
• Poor performance for
  – text (channel logo, stock ticks)
  – graphics
  – anything with sharp edges

Desirable Features
• Postproduction support
• Personalized delivery / presentation
• Interactive
• Error resilience
• More compression
• Facilitate search / indexing (MPEG-7)

Outline
• Why MPEG-4
• Overview
• Systems Layer
• Visual Coding
  – Arbitrarily shaped video
  – Meshed video
  – Face and body

Goals of MPEG-4
• One content
  – convergence of DTV, computer graphics, and WWW
  – broadcast, internet, local
• User interactivity
• Higher compression rates
• Robustness in mobile environment

MPEG-4 Applications
• Interactive TV (broadcast)
  – Home-shopping, Interactive game show
• Virtual workspace (internet)
  – virtual meeting, collaborative design
• Infotainment
  – Virtual-City-Guide (local)

MPEG-4 Key Concepts
• Independent coding of objects
  – allow user interactivity (client & server)
  – higher compression rates
• Provide tools as well as solutions
  – allow content specific and user defined compression algorithms

MPEG-4 History
• Started in July 1993
• Originally for low-bit-rate applications
• Version 1 to be standardized by January 1999
• Continue work on version 2, etc.
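The two estimates on the "What’s a Picture Worth?" slide can be checked with a few lines of arithmetic. This is only a sketch: the slide does not label its factors, so reading "20 M / 30" as roughly 20 Mbit/s spread over 30 frames per second is an assumption, and the variable names are illustrative.

```python
# Reproduce the two estimates from the "What's a Picture Worth?" slide.
# The slide does not label its factors; the names below are only descriptive.

def bits_to_bytes(bits):
    """Convert a bit count to bytes (8 bits per byte)."""
    return bits / 8

# Expression exactly as written on the slide.
text_bits = 1000 * 5 * 5 / 3

# "20 M / 30" from the slide. Assumption: about 20 Mbit/s at 30 frames/s,
# giving an average bit budget per frame.
frame_bits = 20e6 / 30

print(f"thousand words: {text_bits:,.0f} bits = {bits_to_bytes(text_bits):,.0f} bytes")
print(f"one ATSC frame: {frame_bits:,.0f} bits = {bits_to_bytes(frame_bits):,.0f} bytes")
print(f"ratio: about {frame_bits / text_bits:.0f}x")
```

The unrounded results come out somewhat above the round numbers on the slide, which appear to be order-of-magnitude figures rather than exact values.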
MPEG-4 Standard
1) Systems (manage streams, composition)
2) Visual (natural and synthetic)
3) Audio (natural and synthetic)
4) Conformance Testing
5) Reference Software
6) Delivery Multimedia Integration Framework (medium abstraction layer)

[Figure: audiovisual presentation. Audiovisual objects (voice, sprite, 2D background, 3D objects) are placed in a scene coordinate system (x, y, z) and rendered by a video compositor onto a projection plane and by an audio compositor for a hypothetical viewer with display, speaker, and user input. User events, hierarchically multiplexed downstream control / data, and hierarchically multiplexed upstream control / data connect the viewer to the scene.]

[Figure: systems layer stack, top to bottom. Display and User Interaction; Audiovisual Interactive Scene; Composition and Rendering; Compression Layer (Scene Description Information, Object Descriptor, Primitive AV Objects, Return Channel Coding); Elementary Stream Interface (Elementary Streams); AccessUnit Layer (AL-Packetized Streams); Stream Multiplex Interface; FlexMux Layer (FlexMux Streams); TransMux Interface; TransMux Layer with (RTP) UDP IP, (PES) MPEG-2 TS, AAL2 ATM, H223 PSTN, DAB Mux (TransMux Streams); Transmission/Storage Medium.]

Previous Work in Object Coding
• Synthetic High System (Schreiber ’59)
• Contour-Texture Approach (Kocher & Kunt ’82)
• Object-Based Video Coder (Musmann et al. ’89)
• Talisman (Torborg & Kajiya ’96)
• Blue screen matting (Vlahos ’64)

Shape Coding
• Bitmap-based
  – 1 means in, 0 means out
  – Chroma-keying, GIF89a
  – G4 fax standard
• Contour-based
  – chain code
  – polygon/curve approximation
  – Fourier descriptor

Chain Code
• Follows the contour and encodes the direction of the next boundary pel
• 4 or 8 directions for an avg. of 1.2 or 1.4 bits per boundary pel
• Extensions
  – length
  – angular resolution

Polygon Approximation
• Add control points until maximum error is below threshold
• Threshold <= 1.4 pel for CIF (352*288) video
• Extension
  – curves of various order

Fourier Descriptor
• Translation, rotation, and scale invariant
• Sample contour -> ( x_i, y_i )
• Slope at sample i: ( y_{i+1} - y_i ) / ( x_{i+1} - x_i )
• Compute Fourier Series coefficients
• Good for recognition, but not an efficient shape coder

MPEG-4 Experiments
• Chroma-keying
  – color bleeding
  – need to decode whole frame to get shape
• Bitmap and contour-based coding are similar in:
  – error resilience
  – coding efficiency
• Bitmap-based is simpler for hardware due to regular memory access

MPEG-4 Shape Coding
• Three types of macroblocks
  – transparent, opaque, and object boundary
• Context-based arithmetic encoder
• Macroblocks can be subsampled
• Texture padded with 0 or mean value
• Transparency
  – constant: one 8 bit value
  – arbitrary: treat it like color

Meshed Video
• 2D mesh tessellates the video into patches
• Motion vector for each vertex
• Texture warped in each patch

Meshed Video - Motivation
• Motion Modeling
  – Translational-block motion does not model rotation, scaling, reflection, and shear
• Shape Modeling
  – Possible without depth

Meshed Video - Applications
• Compression
  – better motion compensation
  – transmit texture only at key frames
  – spatio-temporal interpolation (zooming, framerate up-conversion)
• Manipulation
  – augmented reality
  – transfiguration (replace billboards)
• Indexing / searching

Face
• Face object
  – Default face model with terminal
  – Facial Definition Parameter or user supplied model/texture
  – Facial Animation Parameter plus Amplification and Filters
  – Lip Shape Animation from phoneme

Facial Definition Parameter
[Figure: facial feature points on a head shown in X, Y, Z coordinates, numbered by group (2.x through 11.x), with detail views of the right eye, left eye, nose, teeth, mouth, and tongue. Legend: feature points affected by FAPs; other feature points.]

Facial Animation Parameter
[Figure: face annotated with the labels IRISD0, ES0, ENS0, MNS0, and MW0.]

Body
• Like the face

Ultimate Compression Technique
Computer Graphics ???
• Block based DCT (MPEG-1/2)
• Arbitrarily shaped video (MPEG-4)
• Meshed video (MPEG-4)
• Image based rendering
• Textured 3D graphics
• Geometry only 3D graphics
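As an illustration of the Chain Code slide, the following is a minimal sketch of an 8-direction chain code: it walks an ordered list of boundary pels and records the direction of each step. The direction numbering, the function names `chain_code` and `decode_chain`, and the sample square contour are assumptions made for this example and are not taken from the MPEG-4 experiments. The sketch stores one raw direction index per step (3 bits for 8 directions); the lower averages quoted on the slide presumably rely on further entropy coding, which is not shown here.

```python
# Minimal 8-direction chain code sketch (illustrative, not the MPEG-4 shape coder).
# Direction index -> (dx, dy), counter-clockwise starting from "east".
# This numbering is an assumption chosen for the example.
DIRECTIONS_8 = [
    (1, 0), (1, 1), (0, 1), (-1, 1),
    (-1, 0), (-1, -1), (0, -1), (1, -1),
]

def chain_code(contour):
    """Encode an ordered list of boundary pels as direction indices.

    Consecutive pels must be 8-connected neighbours; otherwise
    list.index raises ValueError.
    """
    codes = []
    for (x0, y0), (x1, y1) in zip(contour, contour[1:]):
        codes.append(DIRECTIONS_8.index((x1 - x0, y1 - y0)))
    return codes

def decode_chain(start, codes):
    """Rebuild the boundary pels from a start pel and direction indices."""
    points = [start]
    for code in codes:
        dx, dy = DIRECTIONS_8[code]
        x, y = points[-1]
        points.append((x + dx, y + dy))
    return points

# Hypothetical closed contour: the boundary of a 3x3 pel square.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1), (0, 0)]
codes = chain_code(square)
print(codes)
print(decode_chain(square[0], codes) == square)
```

Only the start pel and the direction indices need to be stored, since the decoder recovers every later pel by accumulating steps.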