Pamela C. Cosman
1
Exploit spatial redundancy within frames (like JPEG: transforming, quantizing, variable length coding)
Exploit temporal redundancy between frames
Only the sun has changed position between these 2 frames
Previous Frame
Current Frame
2
Frame 0 (still image)
Difference frame 1 = Frame 1
– Frame 0
Difference frame 2 = Frame 2
– Frame 1
If no movement in the scene, all difference frames are 0.
Can be greatly compressed!
If movement, can see it in the difference images
0 1 2 3
3
Differences between two frames can be caused by
Camera motion: the outlines of background or stationary objects can be seen in the Diff Image
Object motion: the outlines of moving objects can be seen in the Diff Image
Illumination changes (sun rising, headlights, etc.)
Scene Cuts: Lots of stuff in the Diff Image
Noise
4
If the only difference between two frames is noise (nothing moved), then you won’t recognize anything in the Difference Image
But, if you can see something in the Diff
Image and recognize it, there’s still correlation in the difference image
Goal: remove the correlation by compensating for the motion
5
6
7
8
Frame n
Translation: simple movement of typically rigid objects
Camera pans vs. movement of objects
Frame n Frame n+1
Frame n+1
(Rotation)
Frame n+2
(Zoom)
Rotation: spinning about an axis
Camera versus object rotation
Zooms –in/out
Camera zoom vs. object zoom (movement in/out)
9
Translational
Move (object) from (x,y) to (x+dx,y+dy)
Rotational
Rotate (object) by (r rads) (counter/clockwise)
Zoom
Move (in/out) from (object) to increase its size by
(t times)
Which is easiest? Which are we most likely to encounter?
10
Determining parameters for the motion descriptions
For some portion of the frame, estimate its movement between 2 frames- the current frame and the reference frame
What is some portion?
Individual pixels (all of them)?
Lines/edges (have to find them first)
Objects (must define them)
Uniform regions (just chop up the frame)
11
For a region P
C region P
R in the current frame, find a in the search window in reference frame so that Error(P
R
,P
C
) is minimized
Issues: Error measures, search techniques, choice of search window, choice of reference frame, choice of region P
C
Current
Search window
Reference
Frame
Portion of interest
Frame
P
C
12
P
C is a block of pixels (in the current frame)
The search window is a rectangular segment
(in the reference frame)
T=1 (reference) T=2 (current)
13
A motion vector (MV) describes the offset between the location of the block being coded (in the current frame) and the location of the best-match block in the reference frame
T=1 (reference) T=2 (current)
14
The blocks being predicted are on a grid
1
9
5
6
2
14
13
10
3
7
15
11
4
8
12
1
5
9
13
2
6
10
14
3
7
11
8
12
15 16
16
The blocks used for prediction are NOT
4
15
1. Mean squared error
Select a block in the reference frame to minimize
Σ(b(B ref
)-b(B curr
)) 2
2. Mean abs. error
Select block to minimize
Σ|b(B ref
)-b(B curr
)|
Given error measure, how to efficiently determine best-match block in search window?
Full search: best results, most computation
Logarithmic search – heuristic, faster
Hierarchical motion estimation
16
Full search: Evaluate every position in the search window
Logarithmic Search: First examine positions marked 1.
Choose best of these (lowest error measure) and examine positions marked 2 surrounding it
Choose the best of these, and examine the positions marked 3
Final result = best of these
17
Use an averaging filter on the image, then downsample by a factor of 2
Conduct a search on the downsampled image (only ¼ of the size)
Given the results of the search on the downsampled image, return to the full resolution image and refine the search there
18
The standards do not specify HOW the encoder will find the motion vectors (MVs)
The encoder can use exhaustive/fast search,
MSE /MAE/other error metric, etc.
The standard DOES specify
The allowable syntax for specifying the MVs
What the decoder will do with them
What the decoder does is to grab the indicated block from reference frame, and glue it in place
19
The video compression standards define syntax and semantics for the bit stream between encoder and decoder
This allows future encoders of better performance to remain compatible with existing decoders.
ENCODER bit stream
DECODER not this
Standard defines not this this
Encoder is not specified by
MPEG except that it produces a compliant bit stream
Compliant decoder must interpret all legal MPEG bit streams
Also allows for commercially secret encoders to be compatible with standard decoders
Today’s Ho-Hum
Encoder
Tomorrow’s Nifty
Encoder
Very secret
Encoder
Today’s
Decoder
Today’s decoder still works!
20
Frame n-1
(0,0)
(0,0)
(-16,0)
(16,7)
(5,0)
(5,2)
(0,0)
(0,0)
(20,-24) (0,0) (-20,-18) (0,0)
Frame n
MOTION COMPENSATED
Frame n
21
Real moving objects will not coincide with boundaries of macroblocks background moving object
Prediction error
Moving object well encoded with motion vector
Background well encoded (no motion vector)
Prediction error
If encoder sends MV=(MotX,MotY), object well coded, but background poorly coded
If encoder sends MV=(0,0), background well coded, but moving object poorly coded
Either approach is valid
22
This glued together frame is called
the motion compensated frame
The encoder can also form the difference between the motion compensated frame and the actual frame.
This is called the motion compensated difference frame
This difference frame formed using MC should have less correlation between pixels than the difference frame formed without using MC
23
Suppose we are doing lossless coding
Encoder has sequence of frames: …, F(n-2), F(n-1)
Next: encode F(n)
Past frames have been losslessly encoded, so the decoder knows F(n-1) perfectly already
Encoder sends the motion vectors for frame F(n) relative to frame F(n-1), to form motion compensated frame M(n)
Encoder knows M(n), Decoder knows M(n)
24
F(n-1)
(0,0)
(0,0)
(-16,0)
(16,7)
(5,0)
(5,2)
(0,0)
(0,0)
(20,-24) (0,0) (-20,-18) (0,0)
F(n)
M(n)
MOTION COMPENSATED
Frame
25
Encoder forms motion compensated diff frame:
MCD(n) = F(n) – M(n)
Encoder losslessly encodes MCD(n)
Decoder can then do
F(n) = MCD(n) + M(n)
→ knows F(n) exactly
With no motion compensation encoder could do frame diff:
FD(n) = F(n) – F(n-1)
Encoder losslessly encodes FD(n)
Decoder can then do
F(n) = FD(n) + F(n-1)
→ knows F(n) exactly
If successive frames are very similar:
fewer bits to send Motion Vectors + MCD(n) instead of FD(n)
fewer bits to send FD(n) instead of F(n)
26
Decoder knows F(n-1) and, once you send the motion vectors, it knows M(n)
Send FD(n)
Reference Frame
F(n-1)
Original Frame
F(n)
Difference Image
FD(n)=F(n)-F(n-1)
Send Motion
Vectors
Motion compensated frame M(n)
Send MCD(n)
Motion compensated difference image
MCD(n) =F(n) – M(n)
27
But we are NOT doing lossless coding
Encoder has sequence of frames: …, F(n-2), F(n-1)
Next: encode F(n)
Past frames have been lossy encoded, so the decoder has versions …, G(n-2), G(n-1)
Encoder knows …, G(n-2), G(n-1) also
Encoder sends the motion vectors for frame F(n) relative to frame G(n-1), to form motion compensated frame M(n)
28
Encoder forms motion compensated difference frame: MCD(n) = F(n) – M(n)
Encoder lossy encodes MCD(n)
Call the decoder version MCD*(n)
If the decoder received MCD(n) exactly, could do: F(n) = MCD(n) + M(n)
But with MCD*(n), decoder can do
G(n) = MCD*(n) + M(n)
→ knows F(n) approximately
29
Goal of motion estimation is NOT to provide a careful analysis of the actual motion
Goal is to achieve a given quality of representation of the video while globally minimizing the bit rate required to send
The motion information
The prediction error information
Most of the time, for a given representation quality
fewer bits to send MV+MCD(n) instead of sending FD(n)
fewer bits to send FD(n) instead of sending F(n) itself.
30
Luminance is highly correlated, more so than chrominance
The “best” motion vectors are available by searching in the luminance plane
Motion vectors for chrominance are not computed separately, simply scaled as needed
31
At the encoder:
For each block in the frame being coded, examine the search window(s) in the reference frame to find the best match block (do this for luminance only)
Form the MC difference image = original image minus motion compensated image
Scale the motion vectors for the chrominance, form the motion compensated chrominance frames, and form chrominance difference image
32
At the decoder:
Decode the reference frames (Y,Cr,Cb)
For each block in a temporally coded Y frame, use the motion vector to select a block from the reference frame and glue it in place
Add the Y difference image
For each block in temporally coded Cr,Cb frames, first scale the motion vector, then do the previous
2 steps with Cr and Cb data
33
34
35
36
The reference frame need not occur before the temporally coded frames which use it
Why? Scene changes, allow better matches
37
1. Forward predicted blocks: the best-match block occurs in the reference frame before the block’s frame
2. Backward predicted blocks: the best-match block occurs in the reference frame after the block’s frame
3. Interpolatively predicted blocks: the best-match block is the average of the best-match blocks from reference frames before & after
The motion compensation direction can be selected independently for each block in a frame.
38
Intra (I) pictures: coded by themselves, as still images. No temporal coding. No motion vectors.
39
Forward Motion Compensated predicted (P) pictures – forward motion compensated from the previous I or P frame
40
Motion Compensated interpolated (B) pictures – forward, backward, and interpolatively motion compensated from previous/next I/P frames
41
How are the motion vectors actually encoded for transmission to the decoder?
Start by taking the difference between the current motion vector and the most recent previous one of the same type (forward/backward/interpolative)
Encode the difference using variable length coding
Horizontal and vertical components coded separately
42
A block contains 8x8 pixels
The DCT unit
A macroblock (MB) contains 4 blocks from the luminance, plus the corresponding chrominance blocks
4 blocks from each of Cr/Cb if 4:4:4 format
2 blocks from each of Cr/Cb if 4:2:2 format
1 block from each of Cr/Cb if 4:1:1 or 4:2:0 format
The motion compensation unit
43
A slice is a collection of macroblocks, tracing in a raster scan from upper left to lower right
The resynchronization unit
A picture is a frame, either progressive (noninterlaced) or interlaced
The primary coding unit
A Group of Pictures (GOP) contains ≥ 1 frame.
The unit for random access into the sequence
44
A Group of Pictures (GOP) may contain
All I pictures
I & P pictures only
I, P, & B Pictures
A common GOP format for 30 frames/sec:
I-picture spacing 15 frames (1/2 second)
P-picture spacing 3 frames (1/10 second)
I B B P B B P B B P B B P B B I
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
45
Display order (encoder input order):
B B I B B P B B P B B P B B P B B I
-1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
But consider coding dependencies:
Frame 2 (B) needs frame 4 (P) to be decoded first, etc.
So better transmit frame 4 before frame 2
I B B P B B P B B P B B P B B I B B
1 -1 0 4 2 3 7 5 6 10 8 9 13 11 12 16 14 15
46
What if the best-match block in the reference frame is a great match ?
Then the motion vector is all you need to send
What if it is a terrible match ?
Then don’t use the motion vector at all, just code the block by itself, with something like JPEG
(called intra mode coding)
What if it is a so-so match ?
Then you can send the MV, and also send the frame difference information for that macroblock
47
Inter coding refers to coding with motion vectors
Macro
Block
Previous
Frame
Motion Vector
Current
Frame
48
INTRA coding refers to coding without motion vectors
The MB is coded all by itself, in a manner similar to JPEG
Macro
Block
Previous
Frame
Current
Frame
49
Two possible coding modes for macroblocks in I-frames
Intra - code the 4 blocks with the current quantization parameters
Intra with modified quantization : scale the quantization matrix before coding this MB
All macroblocks in intra pictures are coded
Quantized DC coefficients are losslessly
DPCM coded, then Huffman as in JPEG
50
Use of macroblocks modifies block-scan order:
8 8
8
8
Quantized coefficients are zig-zag scanned and runlength/Huffman coded as in JPEG
Very similar to JPEG except (1) scaling Q matrix separately for each MB, and (2) order of blocks
51
Motion compensated coding: Motion Vector only
Motion compensated coding: MV plus difference macroblock
Motion
Vector
DCT
Motion compensation: MV
& difference MB with modified quant. scaling
MV = (0,0), just send difference block
MV=(0,0), just send diff block, with modified quantization scaling
Intra: the MB is coded with DCTs (no difference is computed)
Intra with modified quantization scaling
52
MPEG does not specify how to choose mode
Full search = try everything…the different possibilities will lead to different rate/distortion outcomes for that macroblock distortion •
•
MV, no difference
•
•
•
MV, plus difference
•
•
Intra • rate
53
Tree search: use a decision tree
For example:
First find the best-match block in the search window. If it’s a very good match, then use motion compensation.
Otherwise, don’t.
If you decided to use motion compensation, then need to decide whether or not to send the difference block as well. Make decision based on how good a match it is.
If you decided to send the difference block, then have to decide whether or not to scale the quantization parameter… check the current rate usage…
54
B pictures have even more possible modes:
Forward prediction MV, no difference block
Forward prediction MV, plus difference block
Backward prediction MV, no difference block
Backward prediction MV, plus difference block
Interpolative prediction MV, no difference block
Interpolative prediction MV, plus difference block
Intra coding
Some of above with modified Quant parameter
55
IIIII…: Every picture is intra-coded.
Fully decodable without reference to any other picture
Editing is straightforward
Requires about 2.5 more bit rate than bidirectional
IBBPBBPB…: Forward and bidirectional
Best compression factor
Needs large decoder memory
Hard to edit
Most useful for final delivery of post-produced material
(e.g., broadcast) because no editing requirement
56
IPPPPIPP…: Forward predicted only.
Needs less decoder memory
IBIBIB…: bidirectional compromise
Some of the bit rate advantage of bidirectional coding
Not nearly the full latency penalty of bidirectional
Editable with moderate processing.
For example, if the video after a B picture needs to be deleted, the B frame would not be decodable.
Solution is to decode the B frame first, re-encode it using forward prediction only. Some quality loss.
57