A Picture is Worth a Thousand Words Milton Chen

What’s a Picture Worth?
• A thousand words - Descartes (1596-1650)
• A thousand bytes - modern translation
– 1000 * 5 * 5 / 3 ≈ 8,000 bits
• 75,000 bytes - ATSC/MPEG-2
– 20 Mbit/s / 30 frames/s ≈ 600,000 bits per frame
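The two estimates above can be checked with a few lines of arithmetic. This is only a sketch: the slide does not say what each factor means, so the constant names below (words, letters per word, bits per letter, a text compression ratio of 3) are an assumed reading, not something stated in the source.

```python
# Assumed reading of the slide's factors; the source gives only the numbers.
WORDS = 1000
LETTERS_PER_WORD = 5      # assumption
BITS_PER_LETTER = 5       # assumption
TEXT_COMPRESSION = 3      # assumption

text_bits = WORDS * LETTERS_PER_WORD * BITS_PER_LETTER / TEXT_COMPRESSION
text_bytes = text_bits / 8

# Roughly 20 Mbit/s shared by 30 frames per second.
CHANNEL_BITS_PER_SECOND = 20_000_000
FRAMES_PER_SECOND = 30

frame_bits = CHANNEL_BITS_PER_SECOND / FRAMES_PER_SECOND
frame_bytes = frame_bits / 8

print(f"text:  {text_bits:.0f} bits, {text_bytes:.0f} bytes")
print(f"frame: {frame_bits:.0f} bits, {frame_bytes:.0f} bytes")
```

The exact quotients are somewhat larger than the rounded figures on the slide (about 8,300 bits and about 667,000 bits), so the slide's numbers are best read as order-of-magnitude estimates.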
Frequency Response of the Eye
• Lens - low pass
• Photoreceptors - low pass
• Lateral inhibition - high pass
– edge is important
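A toy illustration of why a high-pass stage makes edges stand out: the sketch below convolves a 1D luminance step with a center-surround kernel. The kernel `[-1, 2, -1]` and the sample values are illustrative assumptions, not a model of the retina.

```python
import numpy as np

# Flat dark region followed by a flat bright region (one sharp edge).
luminance = np.array([10, 10, 10, 10, 50, 50, 50, 50], dtype=float)

# Center-surround kernel: each sample is excited by itself and
# inhibited by its two neighbors (hypothetical weights).
kernel = np.array([-1.0, 2.0, -1.0])

response = np.convolve(luminance, kernel, mode="valid")
print(response.tolist())
```

The response is zero inside both flat regions and nonzero only at the two samples next to the step, which is the sense in which the edge carries the information.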
Today’s Video Coding
YUV (lossy) -> Motion -> DCT -> Quantize (lossy) -> Order -> Entropy
Designed for natural scenes =>
Higher frequency DCT coefficients are quantized more =>
Sharp edges are not well preserved
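The chain of reasoning above can be reproduced on an 8-sample row. The sketch below uses a deliberately exaggerated quantizer (step 1 for the three lowest DCT coefficients, step 1000 for the rest); these step sizes are hypothetical and are not the MPEG-2 quantization matrix.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def quantize_roundtrip(signal, steps):
    """DCT, uniform quantization per coefficient, inverse DCT."""
    d = dct_matrix(len(signal))
    coeffs = d @ signal
    quantized = np.round(coeffs / steps) * steps
    return d.T @ quantized

n = 8
# Hypothetical quantizer: fine steps for low frequencies, very coarse for high.
steps = np.where(np.arange(n) < 3, 1.0, 1000.0)

# A smooth row (DC plus the lowest cosine) and a row with one sharp edge.
smooth = 128 + 100 * np.cos(np.pi * (2 * np.arange(n) + 1) / (2 * n))
edge = np.array([0, 0, 0, 0, 255, 255, 255, 255], dtype=float)

err_smooth = np.max(np.abs(quantize_roundtrip(smooth, steps) - smooth))
err_edge = np.max(np.abs(quantize_roundtrip(edge, steps) - edge))
print(f"max error, smooth row: {err_smooth:.3f}")
print(f"max error, edge row:   {err_edge:.3f}")
```

The smooth row survives almost unchanged because all of its energy sits in the finely quantized coefficients, while the edge row loses the high-frequency terms that define the step.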
What’s Wrong with Today’s Video Coding
• Poor performance for
– text (channel logo, stock tickers)
– graphics
– anything with sharp edges
Desirable Features
• Postproduction support
• Personalized delivery / presentation
• Interactive
• Error resilience
• More compression
• Facilitate search / indexing (MPEG-7)
Outline
• Why
• MPEG-4 Overview
• Systems Layer
• Visual Coding
– Arbitrarily shaped video
– Meshed video
– Face and body
Goals of MPEG-4
• One content
– convergence of DTV, computer graphics, and WWW
– broadcast, internet, local
• User interactivity
• Higher compression rates
• Robustness in mobile environment
MPEG-4 Applications
• Interactive TV (broadcast)
– Home-shopping, Interactive game show
• Virtual workspace (internet)
– virtual meeting, collaborative design
• Infotainment (local)
– Virtual-City-Guide
MPEG-4 Key Concepts
• Independent coding of objects
– allow user interactivity (client & server)
– higher compression rates
• Provide tools as well as solutions
– allow content-specific and user-defined compression algorithms
MPEG-4 History
• Started in July 1993
• Originally for low-bit-rate applications
• Version 1 to be standardized by January 1999
• Continue work on version 2, etc.
MPEG-4 Standard
1) Systems (manage streams, composition)
2) Visual (natural and synthetic)
3) Audio (natural and synthetic)
4) Conformance Testing
5) Reference Software
6) Delivery Multimedia Integration Framework (medium abstraction layer)
[Figure: MPEG-4 audiovisual scene. Audiovisual objects (voice, sprite, 2D background, 3D objects) are placed in a scene coordinate system (x, y, z) to form the audiovisual presentation. A video compositor projects the scene onto a projection plane and an audio compositor mixes the sound for a hypothetical viewer (display, speaker). User input and user events feed back into the scene. Downstream and upstream control / data are hierarchically multiplexed.]
[Figure: MPEG-4 Systems layer model, top to bottom.
Display and User Interaction
Audiovisual Interactive Scene
Composition and Rendering
Compression Layer: Scene Description Information, Object Descriptor, Primitive AV Objects, Return Channel Coding
Elementary Streams (Elementary Stream Interface)
AccessUnit Layer: AL blocks producing AL-Packetized Streams (Stream Multiplex Interface)
FlexMux Layer: FlexMux blocks producing FlexMux Streams (TransMux Interface)
TransMux Layer: (RTP) UDP IP; AAL2 ATM; H223 PSTN; DAB Mux; (PES) MPEG-2 TS; producing TransMux Streams
Transmission/Storage Medium]
Previous Work in Object Coding
• Synthetic High System (Schreiber ’59)
• Contour-Texture Approach (Kocher & Kunt ’82)
• Object-Based Video Coder (Musmann et al. ’89)
• Talisman (Torborg & Kajiya ’96)
• Blue screen matting (Vlahos ’64)
Shape Coding
• Bitmap-based
– 1 means in, 0 means out
– Chroma-keying, GIF89a
– G4 fax standard
• Contour-based
– chain code
– polygon/curve approximation
– Fourier descriptor
Chain Code
• Follows the contour and encodes the direction of the next boundary pel
• 4 or 8 directions for an avg. of 1.2 or 1.4 bits per boundary pel
• Extensions
– length
– angular resolution
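A minimal sketch of an 8-direction chain code, assuming the contour is already available as an ordered list of neighboring pels. The direction numbering in `DIRECTIONS` is an illustrative choice, and the code stores each direction as a plain integer; the averages of 1.2 or 1.4 bits per pel quoted above would need an additional entropy coding stage that is not shown here.

```python
# Direction code -> (dx, dy) step to the next boundary pel (assumed numbering).
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1),
              (-1, 0), (-1, -1), (0, -1), (1, -1)]

def chain_encode(contour):
    """Return the start pel and one direction code per contour step."""
    codes = []
    for (x0, y0), (x1, y1) in zip(contour, contour[1:]):
        codes.append(DIRECTIONS.index((x1 - x0, y1 - y0)))
    return contour[0], codes

def chain_decode(start, codes):
    """Rebuild the contour by walking the direction codes from the start pel."""
    points = [start]
    for code in codes:
        dx, dy = DIRECTIONS[code]
        x, y = points[-1]
        points.append((x + dx, y + dy))
    return points

# Closed contour of a small square, listed pel by pel.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2),
          (1, 2), (0, 2), (0, 1), (0, 0)]

start, codes = chain_encode(square)
print(start, codes)
```

Straight runs produce repeated codes, which is what a later entropy coder (or the length extension listed above) can exploit.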
Polygon Approximation
• Add control points until maximum error is below threshold
• Threshold <= 1.4 pel for CIF (352*288) video
• Extension
– curves of various order
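The refinement loop described above can be sketched as follows. This is a Douglas-Peucker-style greedy refinement used here for illustration; the slides do not say which point-selection rule the MPEG-4 experiments used, and the function names are hypothetical.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def approximate_polygon(points, threshold):
    """Add control points until the maximum error is at or below threshold."""
    control = [0, len(points) - 1]          # indices of control points
    while True:
        worst_err, worst_idx = 0.0, None
        for a, b in zip(control, control[1:]):
            for i in range(a + 1, b):
                err = point_segment_distance(points[i], points[a], points[b])
                if err > worst_err:
                    worst_err, worst_idx = err, i
        if worst_idx is None or worst_err <= threshold:
            return [points[i] for i in control]
        control.append(worst_idx)           # refine at the worst-fitting pel
        control.sort()

# An L-shaped open contour, sampled pel by pel.
contour = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
           (4, 1), (4, 2), (4, 3), (4, 4)]

vertices = approximate_polygon(contour, threshold=1.4)
print(vertices)
```

With the 1.4 pel threshold the nine contour pels collapse to three vertices, since the corner is the only pel that the initial chord misses by more than the threshold.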
Fourier Descriptor
• Translation, rotation, and scale invariant
• Sample contour -> (x_i, y_i)
• For each sample i: (y_{i+1} - y_i) / (x_{i+1} - x_i)
• Compute Fourier Series coefficients
• Good for recognition, but not an efficient shape coder
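The invariance properties can be demonstrated with a short sketch. Note that this uses the common complex-coordinate variant (each contour sample treated as x + iy) rather than the slope sequence mentioned above, and the normalization shown is one conventional choice, assumed here for illustration.

```python
import numpy as np

def fourier_descriptor(points, n_coeffs=4):
    """Normalized Fourier magnitudes of a closed contour."""
    z = np.array([complex(x, y) for x, y in points])
    spectrum = np.fft.fft(z)
    # Dropping the DC term removes translation; taking magnitudes removes
    # rotation and start-point phase; dividing by the first harmonic
    # removes scale.
    mags = np.abs(spectrum[1:])
    return (mags / mags[0])[:n_coeffs]

# Closed contour of a small square (last point connects back to the first).
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]

# The same shape rotated by 0.7 rad, scaled by 3, and shifted by (5, -2).
theta = 0.7
moved = [(3 * (x * np.cos(theta) - y * np.sin(theta)) + 5,
          3 * (x * np.sin(theta) + y * np.cos(theta)) - 2)
         for x, y in square]

print(fourier_descriptor(square))
print(fourier_descriptor(moved))
```

Both calls print the same values up to floating-point error, which is why such descriptors suit recognition. The descriptor discards phase, so by itself it cannot reconstruct the contour.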
MPEG-4 Experiments
• Chroma-keying
– color bleeding
– need to decode whole frame to get shape
• Bitmap and contour-based coding are similar in:
– error resilience
– coding efficiency
• Bitmap-based is simpler for hardware due to regular memory access
MPEG-4 Shape Coding
• Three types of macroblocks
– transparent, opaque, and object boundary
• Context-based arithmetic encoder
• Macroblocks can be subsampled
• Texture padded with 0 or mean value
• Transparency
– constant: one 8 bit value
– arbitrary: treat it like color
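The three macroblock types can be derived directly from a binary shape mask. The sketch below assumes 16x16 macroblocks and a mask whose dimensions are multiples of the block size; the function name and label strings are illustrative, not taken from the standard.

```python
import numpy as np

def classify_macroblocks(mask, size=16):
    """Label each size x size block of a binary mask (1 = inside the object)."""
    rows, cols = mask.shape[0] // size, mask.shape[1] // size
    labels = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = mask[r * size:(r + 1) * size, c * size:(c + 1) * size]
            if not block.any():
                row.append("transparent")   # no object pels: nothing to code
            elif block.all():
                row.append("opaque")        # all object pels: no shape detail
            else:
                row.append("boundary")      # mixed: shape must be coded
        labels.append(row)
    return labels

# One row of three macroblocks; the object covers columns 16..39.
mask = np.zeros((16, 48), dtype=np.uint8)
mask[:, 16:40] = 1

labels = classify_macroblocks(mask)
print(labels)
```

Only the boundary blocks need the shape coder, which is why this three-way split keeps the cost of shape information low for large objects.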
Meshed Video
• 2D mesh tessellates the video into patches
• Motion vector for each vertex
• Texture warped in each patch
Meshed Video - Motivation
• Motion Modeling
– Translational-block motion does not model
rotation, scaling, reflection, and shear
• Shape Modeling
– Possible without depth
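To see how per-vertex motion vectors capture motion that a single block vector cannot, the sketch below solves for the affine map of one triangular patch from its three vertex correspondences. The function names are hypothetical, and a real coder would warp texture samples rather than single points.

```python
import numpy as np

def affine_from_triangle(src, dst):
    """2x3 affine matrix M with [x', y'] = M @ [x, y, 1] for the 3 vertices."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    a = np.hstack([src, np.ones((3, 1))])   # one row [x, y, 1] per vertex
    return np.linalg.solve(a, dst).T

def warp_point(m, p):
    """Apply the affine map to a single point."""
    return m @ np.array([p[0], p[1], 1.0])

# Patch vertices in the reference frame.
src = [(0, 0), (1, 0), (0, 1)]
# Each vertex displaced by its own motion vector: a 90 degree rotation.
motion = [(0, 0), (-1, 1), (-1, -1)]
dst = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(src, motion)]

m = affine_from_triangle(src, dst)
print(m)
print(warp_point(m, (0.25, 0.25)))
```

The three motion vectors differ from one another, so no single translational vector could describe this patch, yet the affine map recovers the rotation exactly for every interior point.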
Meshed Video - Applications
• Compression
– better motion compensation
– transmit texture only at key frames
– spatio-temporal interpolation (zooming, frame-rate up-conversion)
• Manipulation
– augmented reality
– transfiguration (replace billboards)
• Indexing / searching
Face
• Face object
– Default face model provided with the terminal
– Facial Definition Parameter or user-supplied model/texture
– Facial Animation Parameter plus Amplification and Filters
– Lip Shape Animation from phonemes
Facial Definition Parameter
[Figure: facial feature points, numbered by group (2.x through 11.x) on head views with X, Y, Z axes, with detail views of the Right Eye, Left Eye, Nose, Teeth, Mouth, and Tongue. Legend: feature points affected by FAPs; other feature points.]
Facial Animation Parameter
[Figure: face annotated with the reference measurements IRISD0, ES0, ENS0, MNS0, and MW0.]
Body
• Like the face
Ultimate Compression Technique
Computer Graphics ???
• Block based DCT (MPEG-1/2)
• Arbitrary shaped video (MPEG-4)
• Meshed video (MPEG-4)
• Image based rendering
• Textured 3D graphics
• Geometry only 3D graphics