Adaptive Multiple Description Mode Selection for
Error Resilient Video Communications
by
Brian A. Heng
S.M., Massachusetts Institute of Technology (2001)
B.S., University of Minnesota (1999)
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy in Electrical Engineering and Computer Science
at the
Massachusetts Institute of Technology
June 2005
© 2005 Massachusetts Institute of Technology
All rights reserved
Signature of Author
Department of Electrical Engineering and Computer Science
June 29, 2005
Certified by
Jae S. Lim
Professor of Electrical Engineering
Thesis Supervisor
Accepted by
Arthur C. Smith
Chairman, Departmental Committee on Graduate Students
Adaptive Multiple Description Mode Selection for
Error Resilient Video Communications
by
Brian A. Heng
Submitted to the Department of Electrical Engineering and Computer Science
on June 29, 2005 in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy in Electrical Engineering and Computer Science
Abstract
Streaming video applications must be able to withstand the potentially harsh conditions present
on best-effort networks like the Internet, including variations in available bandwidth, packet
losses, and delay. Multiple description (MD) video coding is one approach that can be used to
reduce the detrimental effects caused by transmission over best-effort networks. In a multiple
description system, a video sequence is coded into two or more complementary streams in such a
way that each stream is independently decodable. The quality of the received video improves
with each received description, and the loss of any one of these descriptions does not cause
complete failure. A number of approaches have been proposed for MD coding, where each
provides a different tradeoff between compression efficiency and error resilience. How
effectively each method achieves this tradeoff depends on network conditions as well as on the
characteristics of the video itself.
This thesis proposes an adaptive MD coding approach that adapts to changing conditions through
the use of MD mode selection. The encoder in this system is able to accurately estimate the
expected end-to-end distortion, accounting for both compression and packet-loss-induced
distortions, as well as for the bursty nature of channel losses and the effective use of multiple
transmission paths. With this model of the expected end-to-end distortion, the encoder selects
between MD coding modes in a rate-distortion (R-D) optimized manner to most effectively
trade off compression efficiency for error resilience.
We show how this approach adapts to both the local characteristics of the video and to network
conditions and demonstrate the resulting gains in performance using an H.264-based adaptive
MD video coder. We also analyze the sensitivity of this system to imperfect knowledge of
channel conditions and explore the benefits of using such a system with both single and multiple
paths.
Thesis Supervisor: Jae S. Lim
Title: Professor of Electrical Engineering
Dedication
To Mom and Dad,
For Always Believing.
To Susanna,
For Her Encouragement,
Patience, and Love.
Acknowledgements
Many people have contributed to this thesis both directly and indirectly during these past few
years. I would like to take this opportunity to recognize these contributions and to thank all those
who have made this accomplishment possible.
I would like to start by thanking my thesis supervisor Professor Jae Lim for his guidance
and support during my time at MIT. I am very grateful to him for providing me a place in his lab
and for the extensive advice he has given me about both research and life. It was a great honor to
work with him these last six years. I would also like to express thanks to Dr. John
Apostolopoulos and Professor Vivek Goyal for serving on my thesis committee. They have both
spent many hours working with me to improve the quality of this research, and their comments
have always been useful and insightful. I would also like to acknowledge Hewlett-Packard and
the Advanced Telecommunications and Signal Processing (ATSP) group for their financial
support of this research.
My friends and colleagues in the ATSP group have made my time at MIT much more
enjoyable, and my interactions with each of them have been rewarding in many ways. Special
thanks to fellow Ph.D. students Wade Wan, Eric Reed, and David Baylon for making me feel
welcome, for helping me get started, and for continuing to assist me even after their time at
MIT had ended. I would like to thank the group's administrative assistant, Cindy LeBlanc, for making my
life here so much easier and for always looking out for me.
I am grateful to Jason Demas, Sherman Chen, and Jiang Fu for the opportunity to work
with them at Broadcom Corporation. My summers at Broadcom were very enjoyable, and the
knowledge I gained during this time has been immensely helpful. I would also like to thank
Davis Pan and Shiufun Cheung for giving me the chance to learn from them during my
internship at Compaq Computer Corporation.
I have been fortunate to have a number of supportive and loyal friends throughout my life.
I am very lucky to have met Wade Wan when I started at MIT. His guidance and advice have
been invaluable and I am grateful to have such a close friend. I would like to thank fellow Ph.D.
student Everest Huang for his companionship and for our many enjoyable conversations. To my
group of friends back home, including Neil Dizon, Steve Keu, Dao Yang, Mark Schermerhorn,
Efren Dizon, Chris Takase, and Nitin Jain, thank you for always being there.
I am privileged to have a wonderful family. I am especially thankful to my parents Mary
Jane and Duane Heng for their unending love and support. The opportunities they have
provided for me have made this accomplishment possible. I would also like to thank my brother
David for his encouragement and for always being a good friend. My family has suffered the
loss of my grandparents, James and Margaret Pribyl, while I have been away. I hope in my heart
I have made them proud on this day. I honor their memory and will never forget them.
Finally, I am very fortunate to have the love and support of my girlfriend, Susanna. It has
been difficult living apart all these years, and I am extremely grateful for her patience and
understanding. She has always been my source of strength, and her encouragement and love
have made this work possible.
Brian Heng
Cambridge, MA
June 29, 2005
Contents

1 Introduction
  1.1 Video Processing Terminology
  1.2 Multiple Description Video Coding
  1.3 Thesis Motivation and Overview

2 Multiple Description Video Coding
  2.1 Multiple Description Coding Techniques
    2.1.1 Multiple Description Quantization
    2.1.2 Spatial/Temporal MD Splitting
    2.1.3 MD Splitting in the Transform Domain
  2.2 Predictive Multiple Description Coding
  2.3 Applications of Multiple Description Video Coding

3 Adaptive MD Mode Selection
  3.1 Adaptive Mode Selection Systems
  3.2 Rate-Distortion Optimized Mode Selection
    3.2.1 Independent Rate-Distortion Optimization
    3.2.2 Effects of Dependencies on Rate-Distortion Optimization
  3.3 End-to-End R-D Optimized MD Mode Selection
    3.3.1 Lagrangian Optimization
    3.3.2 Rate-Distortion Optimization over Lossy Channels

4 Modeling End-to-End Distortion over Lossy Packet Networks
  4.1 Optimal Intra-Coding for Error Resilient Video Streams
  4.2 Recursive Optimal Per Pixel Estimate of Expected Distortion
  4.3 Multiple Description ROPE Model
  4.4 Extended ROPE Model

5 MD Mode Selection System
  5.1 MPEG4-AVC / H.264 Video Coding Standard
    5.1.1 Intra-Frame Prediction
    5.1.2 Hierarchical Block Motion Estimation
    5.1.3 Multiple Reference Frames
    5.1.4 Quarter Pixel Motion Vector Accuracy
    5.1.5 In-Loop Deblocking Filter
    5.1.6 Entropy Coding
    5.1.7 H.264 Performance
  5.2 MD System Implementation
    5.2.1 Examined MD Coding Modes
    5.2.2 Data Packetization
    5.2.3 Discussion of Modifications Not in Compliance with H.264 Standard

6 Experimental Results and Analysis
  6.1 Test Sequences
  6.2 Performance of Extended ROPE Algorithm
  6.3 MD Coding Adapted to Local Video Characteristics
  6.4 MD Coding Adapted to Network Conditions
    6.4.1 Variations in Average Packet Loss Rate
    6.4.2 Variations in Expected Burst Length
  6.5 End-to-End R-D Performance
  6.6 Unbalanced Paths and Time Varying Conditions
    6.6.1 Balanced versus Unbalanced Paths
    6.6.2 Time Varying Network Conditions
  6.7 Sensitivity Analysis
    6.7.1 Sensitivity to Packet Loss Rate
    6.7.2 Sensitivity to Burst Length
  6.8 Comparisons between Using Single and Multiple Paths

7 Conclusions
  7.1 Summary
  7.2 Future Research Directions

Bibliography
List of Figures

1.1 Scan Modes for Video Sequences
1.2 Gilbert Packet Loss Model
1.3 Two Stream Multiple Description System
1.4 Classic Depiction of an MD System
1.5 Example of Multiple Description Video Coding
1.6 Comparison between Scalable Video Coding and MD Coding
2.1 MD Coding of Audio
2.2 MD Scalar Quantization
2.3 MD Splitting of an Image
2.4 Transform Domain MD Splitting
2.5 Applications of MD Coding
3.1 Dynamic Programming Tree
3.2 Comparison between Lagrangian Optimization and Dynamic Programming
4.1 Conceptual Computation of First Moment Values in MD ROPE Approach
4.2 Gilbert Packet Loss Model
5.1 H.264 Encoder Architecture
5.2 H.264 Decoder Architecture
5.3 4x4 Intra-Prediction
5.4 Two of the Nine Available 4x4 Intra-Prediction Modes
5.5 16x16 Intra-Prediction Modes
5.6 Macroblock Partitions for Motion Estimation
5.7 Multiple Reference Frames
5.8 Six-Tap Filter used for Half Pixel Interpolation
5.9 In-Loop Deblocking Filter
5.10 Examined MD Coding Modes
5.11 Packetization of Data in MD Modes
6.1 Performance of ROPE Algorithm - Actual vs. Expected PSNR
6.2 Time Varying Packet Loss Rates
6.3 Performance of ROPE Algorithm with Time Varying Loss Rates
6.4 Bernoulli Losses versus Gilbert Losses with ROPE Algorithm
6.5 MD Coding Adapted to Local Video Characteristics
6.6 Distribution of Selected MD Modes - Foreman Sequence
6.7 Visual Results - Frame 5 Foreman Sequence
6.8 Visual Results - Frame 231 Foreman Sequence
6.9 PSNR versus Average Packet Loss Rate
6.10 PSNR versus Expected Burst Length
6.11 Effects of Expected Burst Length on the TS Mode
6.12 End-to-End R-D Performance at 5% Loss Rate with Burst Length of 3
6.13 R-D Optimized Quantization Levels versus Fixed Quantization Level
6.14 Distribution of Selected MD Modes - Time Varying Loss Rates
6.15 PSNR versus Frame with Time Varying Loss Rates
6.16 Sensitivity to Errors in Assumed Packet Loss Rate - Part 1
6.17 Sensitivity to Errors in Assumed Packet Loss Rate - Part 2
6.18 Sensitivity of ADAPT Relative to Non-Adaptive Methods
6.19 Sensitivity to Errors in Assumed Burst Length - Part 1
6.20 Sensitivity to Errors in Assumed Burst Length - Part 2
6.21 Comparison of Single Path vs. Multiple Paths
6.22 Multiple Paths vs. Single Path - Variations in Expected Burst Length
6.23 Multiple Paths vs. Single Path - Variations in Packet Loss Rate
6.24 Multiple Paths vs. Single Path - R-D Performance
List of Tables

5.1 Exponential-Golomb Codebook
5.2 List of MD Coding Modes
6.1 Test Sequences
6.2 Distribution of MD Modes at 0%, 5%, and 10% Packet Loss Rates
6.3 Distribution of MD Modes at Various Burst Lengths
6.4 Distribution of MD Modes with Unbalanced Paths
6.5 Percentage of Total Bandwidth in Each Stream for Unbalanced Paths
Chapter 1
Introduction
The transmission of video information over error prone channels poses a number of interesting
challenges. One would like to compress the video as much as possible in order to transmit it in a
timely manner and/or store it within a limited amount of space. Yet, by compressing a video
sequence, one tends to make it more susceptible to transmission losses and errors. Video
applications ranging from high definition television down to wireless video phones all face this
same tradeoff. However, best-effort networks like the Internet present a particularly harsh
environment for real-time streaming video applications. In this type of environment, applications
must be able to withstand inhospitable conditions including variations in available bandwidth,
packet losses, and delays. Those that are unable to adapt to these conditions can suffer serious
performance degradations each time the network becomes congested.
Multiple description (MD) video coding is one approach that can be used to reduce the
detrimental effects caused by packet loss on best-effort networks. In a multiple description
system, a video sequence is coded into two or more complementary streams in such a way that
each stream is independently decodable. The quality of the received video improves with each
received description, but the loss of any one of these descriptions does not cause complete
failure. If one of the streams is lost or delivered late, the video playback can continue with
hopefully only a slight reduction in overall quality.
There have been a number of proposals for MD video coding, each providing its own
tradeoff between compression efficiency and error resilience. Previous MD coding approaches
applied a single MD technique to an entire sequence. However, the optimal MD coding method
will depend on many factors including the amount of motion in the scene, the amount of spatial
detail, desired bitrates, error recovery capabilities of each technique, and current network
conditions. This thesis examines the adaptive use of multiple MD coding modes within a single
sequence. Specifically, this thesis proposes adaptive MD mode selection by allowing the encoder
to select among MD coding modes in an optimized manner as a function of local video
characteristics and network conditions.
The following section presents a brief introduction to video processing, establishing the terminology used throughout this thesis and providing background information necessary for discussing this work. Section 1.2 introduces multiple description video coding, and Section 1.3 discusses the motivation behind this research and presents an overview of this thesis.
1.1 Video Processing Terminology
A video frame is a picture made up of a two-dimensional discrete grid of pixels or picture
elements. A video sequence is a collection of frames, with equal dimensions, displayed at fixed
time intervals. The dimensions of each frame are referred to as the spatial resolution, and the
resolution along the temporal direction is known as the frame rate. The term macroblock is used
to describe a subdivision of a frame of size 16x16 pixels. For the purposes of this research, a
video stream will be defined as a sequence transmitted across the given network (e.g. the
Internet, wireless connections, etc.) and viewed in real-time. This differs from video file
transfer in which sequences are fully downloaded and playback only begins once the entire
video sequence has been received. Buffering is the process of storing up data at the receiver
before playback begins in the event that the network throughput drops temporarily. All streaming
applications use some amount of buffering in order to reduce the effect of variations in network
bandwidth and delay. The more buffering used, the longer it takes to initially fill that buffer, and
thus, the more delay experienced at the receiver. Video file transfer is essentially the same as
maximum buffering; the entire video sequence is stored at the receiver before playback begins.
The scan mode is the method in which the pixels of each frame are displayed. As shown in
Figure 1.1, video sequences can have one of two scan modes: progressive or interlaced. A
progressive scan sequence is one in which every line of the video is scanned in every frame. This
type of scanning is typically used in computer monitors, handheld devices, and high definition
television displays. An interlaced sequence is one in which the display alternates between
scanning the even lines and odd lines of the corresponding progressive frames. The term field is
used (rather than frame) to describe pictures scanned using interlaced scanning, with the even
field containing all the even lines of one frame and the odd field containing the odd lines.
Interlaced scanning is currently used in many standard television displays. The process of
interlaced to progressive conversion is known as deinterlacing.
Figure 1.1: Scan modes for video sequences. (a) In interlaced fields either the even or the
odd lines are scanned. The solid lines represent the field that is present in the current
frame. (b) In progressively scanned frames all lines are scanned in each frame.
The main focus of this research will be on real-time video streaming and the difficulties
presented when the network is unable to meet necessary time constraints. With this application
in mind, the sequences analyzed in this work are progressively scanned sequences since the vast
majority of computer and handheld displays use progressive scanning. However, it is sometimes
useful to process fields independently, which is where the concepts of interlacing and
deinterlacing become important. Also, the work described here could later be applied to
interlaced sequences in a fairly straightforward manner.
The extensive bandwidth required for transmission of raw video sequences is typically not
feasible, so most systems require the use of significant video compression to reduce the amount
of bandwidth needed. There can be a considerable amount of redundant information present in a
typical video sequence in both the spatial and temporal directions. Within a single frame, each
pixel is likely to be highly correlated with neighboring pixels since most frames contain
relatively large regions of smoothly varying intensity. Similarly, in the temporal direction two
frames are likely to be highly correlated since typical sequences do not change rapidly from one
frame to the next. There are many ways to take advantage of this redundancy in video coding. To
reduce correlation along the temporal direction, most video coders use some form of motion
estimation / motion compensation to predict the current frame from previously decoded frames.
In this approach, the encoder estimates the motion from one frame to the next, and uses this
model to generate a prediction of the next frame by compensating for the motion that has
occurred. Coded blocks that depend on other frames due to the use of motion compensated
prediction are referred to as inter-coded blocks; blocks that do not depend on any other frames
are referred to as intra-coded. Once the temporal redundancy has been exploited, most encoders
use the Discrete Cosine Transform (DCT), or some other decorrelating transform, to remove as
much remaining redundancy as possible from the spatial dimension.
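To make the prediction step concrete, the following sketch shows a minimal full-search block-matching motion estimator and the corresponding motion-compensated prediction in Python with NumPy. It illustrates only the general idea described above, not the hierarchical search used by practical encoders; the block size, search range, and function names are chosen for the example, and frame dimensions are assumed to be multiples of the block size.

    import numpy as np

    def motion_estimate(ref, cur, block=16, search=8):
        """Full-search block matching: for each block of the current frame,
        find the displacement into the reference frame that minimizes the
        sum of absolute differences (SAD)."""
        h, w = cur.shape
        vectors = np.zeros((h // block, w // block, 2), dtype=int)
        for by in range(0, h, block):
            for bx in range(0, w, block):
                target = cur[by:by + block, bx:bx + block].astype(int)
                best_sad, best_mv = None, (0, 0)
                for dy in range(-search, search + 1):
                    for dx in range(-search, search + 1):
                        y, x = by + dy, bx + dx
                        if y < 0 or x < 0 or y + block > h or x + block > w:
                            continue   # candidate block falls outside the frame
                        cand = ref[y:y + block, x:x + block].astype(int)
                        sad = np.abs(target - cand).sum()
                        if best_sad is None or sad < best_sad:
                            best_sad, best_mv = sad, (dy, dx)
                vectors[by // block, bx // block] = best_mv
        return vectors

    def motion_compensate(ref, vectors, block=16):
        """Predict the current frame by copying the matched blocks
        from the reference frame."""
        pred = np.zeros_like(ref)
        for by in range(vectors.shape[0]):
            for bx in range(vectors.shape[1]):
                dy, dx = vectors[by, bx]
                y, x = by * block + dy, bx * block + dx
                pred[by * block:(by + 1) * block,
                     bx * block:(bx + 1) * block] = ref[y:y + block, x:x + block]
        return pred

The difference between the actual frame and this prediction (the residual) is what an encoder would then transform, quantize, and entropy code.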
Despite efficient exploitation of the spatial and temporal redundancy present in typical
video sequences, the resulting bandwidth is typically not low enough to allow for lossless
transmission. For this reason, lossy compression algorithms are necessary for an effective
transmission scheme. For the purposes of this thesis, the distortion caused by losses during data
compression as well as losses during network transmission will be quantitatively measured using
the peak signal-to-noise ratio (PSNR). The PSNR for a given frame is defined as

$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{255^2}{\mathrm{MSE}}\right) \qquad (1.1)$$

where the mean square error (MSE) is the average squared difference between the original and distorted video frames, $F$ and $F_d$:

$$\mathrm{MSE} = \frac{1}{N_1 N_2} \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} \bigl(F[n_1,n_2] - F_d[n_1,n_2]\bigr)^2 \qquad (1.2)$$

Here the values $N_1$ and $N_2$ represent the horizontal and vertical dimensions of the frames, and the values $n_1$ and $n_2$ are used to index each pixel location. The value 255 is used as the peak
signal value since it is the maximum value encountered with 8-bit pixel representations. It should
be noted that PSNR and perceived quality are not always directly correlated. Higher PSNR does
not necessarily indicate better video, but the use of PSNR is a common practice and has been
found to be a useful estimate of video quality.
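As a concrete sketch, Equations (1.1) and (1.2) translate directly into a few lines of Python using NumPy, assuming 8-bit frames stored as two-dimensional arrays; the function name is chosen for the example.

    import numpy as np

    def psnr(original, distorted, peak=255.0):
        """PSNR of a distorted frame relative to the original,
        following Equations (1.1) and (1.2)."""
        f = original.astype(np.float64)
        fd = distorted.astype(np.float64)
        mse = np.mean((f - fd) ** 2)
        if mse == 0:
            return float('inf')   # frames are identical
        return 10.0 * np.log10(peak ** 2 / mse)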
In this thesis we have simulated network losses by using various probabilistic packet loss
models. In the Bernoulli loss model, the packet losses are independent and have equal
probability. Actual network losses tend to arrive in bursts, a behavior that is not captured by the
Bernoulli loss model and that has been shown to significantly affect video quality [2, 27]. We
use the Gilbert model to simulate the nature of bursty losses, where packet losses are more likely if the previous packet has been lost. This can be represented by the two-state Markov model shown in Figure 1.2, assuming $p_1 < p_0$.

Figure 1.2: Gilbert packet loss model, with State 1 (packet received) and State 0 (packet lost). Here $p_0$ is the probability that the current packet is lost given that the previous packet was lost, and $p_1$ is the probability that the current packet is lost given that the previous packet was received. The average packet loss rate is then $p_1 / (1 + p_1 - p_0)$ and the expected burst length is $1 / (1 - p_0)$. Assuming $p_1 < p_0$, there is a greater probability the current packet will be lost if the previous packet was lost. This causes bursty losses in the resulting stream.
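The following sketch shows one way such loss patterns can be generated in simulation, written in Python. The Gilbert generator is parameterized by the desired average loss rate and expected burst length using the relationships given with Figure 1.2; the function names and parameterization are illustrative only, not the exact simulator used in this thesis.

    import random

    def bernoulli_losses(n, p, seed=None):
        """Independent losses: each packet is lost with probability p."""
        rng = random.Random(seed)
        return [rng.random() < p for _ in range(n)]

    def gilbert_losses(n, loss_rate, burst_len, seed=None):
        """Two-state Gilbert model.  p0 = P(lost | previous lost) is set from
        the expected burst length (burst_len = 1 / (1 - p0)), and
        p1 = P(lost | previous received) is chosen so that the stationary
        loss rate p1 / (1 + p1 - p0) equals loss_rate.  The requested
        (loss_rate, burst_len) pair must be consistent, i.e. p1 <= 1."""
        rng = random.Random(seed)
        p0 = 1.0 - 1.0 / burst_len
        p1 = loss_rate * (1.0 - p0) / (1.0 - loss_rate)
        lost, prev_lost = [], False
        for _ in range(n):
            prev_lost = rng.random() < (p0 if prev_lost else p1)
            lost.append(prev_lost)
        return lost

    losses = gilbert_losses(10000, loss_rate=0.05, burst_len=3, seed=1)
    print(sum(losses) / len(losses))   # close to 0.05 on average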
1.2 Multiple Description Video Coding
The demand for streaming video applications has grown rapidly over recent years, and by all
indications this demand will continue to grow in the future. However, the majority of packet
networks, like the Internet, provide only best-effort service; there are no guarantees of minimum
bandwidth or delay [54]. Applications must be able to withstand changing conditions on the
network or they can suffer severe performance degradations.
For some applications, these problems can be reduced by using a suitable amount of
buffering at the receiver. However, buffering introduces an extra delay in the system that is
unacceptable for many applications such as video conferencing. This type of application requires
a high degree of interaction between opposite ends of the network and places stringent demands
on end-to-end delay. There exists a limit on the maximum amount of delay that can exist
between two users attempting to maintain a reasonable conversation. Once this limit is exceeded,
the two parties can no longer interact without significant effort. Therefore significant buffering is
not an option. Even in applications where some amount of buffering is acceptable, the amount of
buffering necessary in any situation is unknown ahead of time due to the time-varying properties
of the network. Occasionally network links fail altogether, and there may be some extended
period of time during which two nodes in the network cannot talk to one another at all. This type
of outage can underflow any reasonably-sized buffer. For these reasons, current approaches for real-time video streaming often suffer from severe glitches each time the network becomes congested.

Figure 1.3: Two stream multiple description system. The original video source is encoded into two complementary streams which are transmitted independently through the network. As long as both streams are not simultaneously lost, the remaining stream can still be decoded to achieve acceptable video quality.

Figure 1.4: Classic depiction of a two stream MD coding system. The central decoder (Decoder 0) uses both descriptions to reconstruct the highest quality video. The two side decoders (Decoders 1 and 2) use only one description to generate acceptable quality video.
Multiple description video coding is one method that can be used to reduce the detrimental
effects caused by this type of best-effort network. In a multiple description system, a video
sequence is encoded into two or more complementary streams in such a way that each stream is
independently decodable (see Figure 1.3). When combined, the streams provide the highest level
of quality, but even independently they are able to provide an acceptable level of quality. These
streams can then be sent along separate paths through the network to experience more or less
independent losses and delays. In the event that a portion of one of the streams is lost or delivered late, the video playback will not suffer a severe glitch or stop completely to allow for rebuffering. On the contrary, the remaining stream(s) will continue to be played out with only a slight reduction in overall quality. Conceptually, a two stream MD decoder can be thought of as three separate decoders, as shown in Figure 1.4. Here the central decoder (Decoder 0) is able to decode both descriptions resulting in the highest quality video. The two side decoders (Decoders 1 and 2) receive only one of the descriptions resulting in lower, but still acceptable, video quality.

Perhaps the simplest example of an MD video coding system is one where the original video sequence is partitioned along the temporal direction into even and odd frames that are then independently coded into two separate streams for transmission over the network. As shown in Figure 1.5, this approach generates two descriptions, where each has half the temporal resolution of the original video. In the event that both descriptions are received, the frames from each can be decoded and interleaved together to reconstruct the full sequence. In the event one stream is lost, the remaining stream can still be straightforwardly decoded and displayed, resulting in video at half the original frame rate.

Figure 1.5: One example of multiple description coding. The original sequence is partitioned along the temporal direction into even and odd frames. Even frames are predicted from even frames and odd from odd. If an even frame is lost (e.g. Frame 4), errors will propagate to other even frames, but the remaining description (the odd frames) can still be straightforwardly decoded, resulting in video at half the original frame rate.
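Ignoring compression and prediction entirely, the splitting and merging just described reduces to a few lines of Python; this is only a sketch of the data partitioning, with frames treated as opaque objects.

    def split_descriptions(frames):
        """Partition a frame sequence into two descriptions:
        even-indexed frames and odd-indexed frames."""
        return frames[0::2], frames[1::2]

    def merge_descriptions(even, odd):
        """Interleave the two descriptions back into one sequence.  If one
        description is missing (None), the other is played back on its own,
        at half the original frame rate."""
        if even is None:
            return list(odd)
        if odd is None:
            return list(even)
        merged = []
        for i in range(max(len(even), len(odd))):
            if i < len(even):
                merged.append(even[i])
            if i < len(odd):
                merged.append(odd[i])
        return merged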
Of course, this gain in robustness comes at a cost. Temporally sub-sampling the sequence
lowers the temporal correlation, thus reducing coding efficiency and increasing the number of
bits necessary to maintain the same level of quality per frame. Without losses, the total bit rate
necessary for this MD system to achieve a given distortion will in general be higher than the
corresponding rate for a single description (SD) encoder to achieve the same distortion. This is a
tradeoff between coding efficiency and robustness. However, in the type of application under
consideration, it is not so much a question of whether it is useful to give up some amount of
efficiency for an increase in reliability as it is a question of finding the most effective way to
achieve this tradeoff.
It should be noted here that multiple description coding is not the same as scalable video
coding. Similar to MD coding, a scalable coder encodes a sequence into multiple streams that are
referred to as layers. However, scalable coding makes use of a single independent base layer
followed by one or more dependent enhancement layers (see Figure 1.6). This allows some
receivers to receive basic video by decoding only the base layer, while others can decode the
base layer and one or more enhancement layers to achieve improved quality, spatial resolution,
and/or frame rate. However, unlike MD coding, the loss of the base layer renders the
enhancement layer(s) useless. In some sense, scalable coding is a special case approach to
multiple description coding where it is assumed that the base layer will be delivered with
absolute reliability.
Figure 1.6: A comparison between scalable video coding and multiple description
coding. In scalable coding the enhancement layer(s) are dependent on the base layer, and
therefore the enhancement layer alone is not useful. In multiple description coding, each
stream is equally important, so either Description 1 or Description 2 will still yield
acceptable video quality.
1.3 Thesis Motivation and Overview
There have been many approaches proposed for MD coding, each providing a different tradeoff
between compression efficiency and error resilience. How efficiently each method achieves this
tradeoff depends on the quality of video desired, the current network conditions, and the
characteristics of the video itself. Most prior work in MD coding applies a single MD method to
the entire sequence; this approach is taken so as to evaluate the performance of each MD
method. However, it would be more efficient to adaptively select the best MD method based on
the situation at hand [22]. Since the encoder in such a system has access to the original source, it
is possible to analyze the performance of each coding mode and adaptively select between them
in an optimized manner. That insight has provided the main motivation for this research.
Variations in both source material and network conditions make it highly unlikely that any single
MD approach will be most effective under all situations. By selecting between a small number of
complementary MD modes, it is possible for the system to more effectively adapt to all possible
video inputs and network conditions.
A number of adaptive MD approaches have been previously proposed [26, 33, 34, 47], but
the concept of adaptive MD mode selection has not been fully explored. In general, previous
adaptive approaches have used a single approach to MD coding, but have allowed the encoder to
adjust the amount of redundancy used to match source and/or channel characteristics.
Dynamically trading off compression efficiency for error resilience in this way can provide
significant improvements over a non-adaptive MD approach, but fundamentally each of these
systems uses a single MD method for an entire sequence. For instance, if the encoder in such a
system encounters a block that is particularly susceptible to errors, the response taken is to
increase redundancy and therefore increase the number of bits used to code this region.
However, it may be more effective to use an entirely different approach for this region, which
may allow the encoder to achieve the same error resilience without increasing the bitrate as
significantly, if at all.
The main goal of this thesis is to investigate the use of adaptive MD mode selection and
better understand its applicability to error resilient video streaming. There are many different
aspects of this idea that have not been fully explored. For instance, can we find a small set of
complementary MD modes that is able to adapt to a variety of video sources and network
conditions? If there are gains possible from adaptive mode selection, can these gains overcome
the overhead necessary for adaptive processing? Is it even possible for the encoder to make
mode selection choices in an optimized manner? We have previously suggested that the encoder
can analyze the performance of each MD method; however, the random nature of channel losses combined with spatial and temporal error propagation makes this quite a difficult task. These are
some of the questions that motivated this work.
In the second chapter of this thesis, we provide a more detailed introduction to multiple
description video coding and provide an overview of previous research in this field. The chapter
begins with a review of MD coding techniques followed by a discussion of some of the issues
that arise specifically when applying MD coding to video compression. The final section of
Chapter 2 discusses some applications that are particularly well suited for the use of MD coding.
Chapter 3 provides a more in-depth introduction to the concept of adaptive MD mode
selection. Section 3.1 reviews the role adaptive mode selection has played throughout the history
of video processing and describes some previous uses of adaptive mode selection. Section 3.2
discusses the process of optimal mode selection and provides a review of rate-distortion (R-D)
theory. Finally, Section 3.3 describes how these techniques can be used for adaptive MD mode
selection and also includes a discussion on R-D optimization for lossy packet networks.
The use of R-D optimization over lossy channels requires the use of some form of channel
modeling to estimate the effects potential losses will have on end-to-end distortion. Chapter 4
provides a review of previous attempts at this type of modeling and suggests one particular
approach that can quite effectively model end-to-end distortion taking into account both the
distortion due to quantization as well as the distortion due to channel losses.
Chapter 5 provides an overview of the system designed to investigate the concept of
adaptive MD mode selection. The system has been implemented based on the H.264 video
coding standard. The first portion of Chapter 5 reviews the H.264 codec to provide the necessary
background information for discussing this work. The remainder of Chapter 5 details the specific
implementation of the system we have used in this thesis.
The implementation described in Chapter 5 has been used to perform a number of different
simulations in order to evaluate the performance and behavior of the adaptive MD mode
selection system. The results of these experiments are provided in Chapter 6. We show how this
approach adapts to both the local characteristics of the video and to network conditions and
demonstrate the resulting gains in performance using our H.264-based adaptive MD video coder.
We also analyze the sensitivity of this system to imperfect knowledge of channel conditions and
explore the benefits of such a system when using both single and multiple paths.
Chapter 7 summarizes the main conclusions of this thesis and describes possible future
research directions.
Chapter 2
Multiple Description Video Coding
This chapter provides a more detailed introduction to multiple description video coding and
provides a summary of previous research in this area. The first section discusses several
techniques commonly used for multiple description coding and some background on the history
of the topic. Predictive coding is used in most video coding systems to remove the temporal
redundancy that exists in typical video sequences. This approach significantly increases the
efficiency of the overall system but also introduces the possibility of error propagation. Section 2.2 discusses some of the challenges introduced by the use of predictive coding in an MD system
and some of the approaches that have been used for addressing these issues. Finally, in Section
2.3, we discuss some applications that are particularly well suited for the use of MD video
coding.
2.1 Multiple Description Coding Techniques
The multiple description approach was originally introduced for audio coding through research
done at AT&T Bell Labs in the 1970s to increase the reliability of the telephone system. One
early approach was suggested by Jayant [24, 25]. Here audio is partitioned along the temporal
direction into even and odd samples in an attempt to improve the reliability of digital audio
communications (see Figure 2.1). In this approach, if either stream is lost, the remaining stream
can still be played at half the original sampling rate.
Figure 2.1: Multiple description coding of audio using even-odd sample splitting. Each sub-sampled audio stream is encoded independently and transmitted over the network. The temporary loss of either stream can be concealed by up-sampling the correctly received stream and interpolating the missing values [24].

Around the same time, the MD problem was introduced into the information theory community by Wyner, Witsenhausen, Wolf, and Ziv [52, 53]. This problem became very interesting from a theoretical point of view, and much work has been done to analyze the problem in depth. The main focus in the information theory community has been on characterizing the multiple description region, defined as the set of all achievable operating points, under various assumptions about the statistical properties of the source. Extensive work has been done to map out achievable rate-distortion regions using multiple description codes for channel splitting [14].
The problem has many variations including generalizations to more than two channels.
For some time, multiple description coding was viewed only as an interesting information
theory problem. Only in recent years has the value of MD coding become apparent. The
widespread use of packetized multimedia applications over best-effort networks has brought the
MD problem to forefront. Using multiple description coding for packetized data can provide a
powerful tool for providing error resilient packet streams. Many approaches have been suggested
for multiple description coding including correlating transforms [18, 34, 48, 49], MD-Forward
Error Correction (FEC) techniques [11, 32], as well as MD splitting in the spatial [43], temporal
[1, 2], and transform domains [6, 8, 9, 19, 36]. Some of these methods are further discussed in
the following sections. For a more in-depth review of the MD problem, see the overview by
Goyal [15].
2.1.1 Multiple Description Quantization
One of the early proposals for MD coding was multiple description quantization [41, 43]. Here
two or more complementary quantizers are used to compress the original source. A single
quantization gives a coarse reconstruction of the source. Any additional received quantizations
further refine this description. Given that quantizers are already an essential piece in any lossy compression system, making slight modifications to form MD quantizers can be an easy way to generate multiple descriptions of a source. One can design complementary quantizers that alone coarsely describe a single source, but when combined together provide a more refined description.

Figure 2.2: Multiple description scalar quantizer. Quantizers 1 and 2 independently describe the original source with 3 bits of accuracy. When combined together (by taking the average of the two reconstruction levels) they can provide 3.9 bits of accuracy.
As a simple example, consider Figure 2.2. Here the reconstruction levels from two uniform scalar quantizers independently divide the given space. Both are three-bit quantizers, and each can provide a coarse description of the original source. However, when both quantizations are received, the two reconstructions can be combined to generate the 15 reconstruction levels shown in Figure 2.2. The example shows how to use two complementary 3-bit quantizers to create a $\log_2(15) \approx 3.9$ bit combined quantizer.
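The following is a small numerical sketch of this idea, assuming a source confined to [0, 1) and two 8-level uniform quantizers whose reconstruction levels are offset by half a cell, so that averaging the two reconstructions yields the 15 combined levels of Figure 2.2. The level placement and helper names are illustrative and are not taken from [41, 43].

    import numpy as np

    # Two complementary 3-bit (8-level) scalar quantizers on [0, 1).  Their
    # reconstruction levels interleave, so the central decoder, which averages
    # the two reconstructions, sees 15 distinct levels (log2(15) ~ 3.9 bits).
    LEVELS1 = (2 * np.arange(8) + 1) / 16.0   # 1/16, 3/16, ..., 15/16
    LEVELS2 = (2 * np.arange(8)) / 16.0       # 0,    2/16, ..., 14/16

    def quantize(x, levels):
        """Index of the nearest reconstruction level."""
        return int(np.argmin(np.abs(levels - x)))

    def reconstruct(i1=None, i2=None):
        """Side decoders use one index; the central decoder averages both."""
        if i1 is not None and i2 is not None:
            return 0.5 * (LEVELS1[i1] + LEVELS2[i2])   # central decoder
        if i1 is not None:
            return LEVELS1[i1]                         # side decoder 1
        return LEVELS2[i2]                             # side decoder 2

    x = 0.41
    i1, i2 = quantize(x, LEVELS1), quantize(x, LEVELS2)
    print(reconstruct(i1, i2), reconstruct(i1=i1), reconstruct(i2=i2))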
As with any MD approach, the example above makes a tradeoff between compression
efficiency and error resilience. Using a single description coding approach with the same number
of bits, the encoder could have described this source with 6 bits of accuracy. However, in general
if this data had been lost there would be no way of reconstructing it. The MD approach sacrifices
2.1 bits of accuracy for an increase in error resilience. This is only one example of possible
quantizers. Through proper choice of reconstruction levels, systems with more or less
redundancy can be easily designed. This is one beneficial feature of MD quantization. The same
concept is extendable to vector quantization and trellis coded quantization as well [12, 16, 42,
45].
2.1.2 Spatial/Temporal MD Splitting
A straightforward method of creating multiple description streams is to sub-sample a sequence
along the spatial or temporal direction and encode each sub-sequence independently. The
significant redundancy in video or audio data, for example, can be used quite effectively to
reconstruct any missing descriptions.
Figure 2.1 is an illustration of this approach applied to audio coding, but the same idea can
be extended to video coding as well. The original video sequence could, for example, be
partitioned temporally into even and odd frames. As shown in Figure 1.5, this approach
generates two descriptions, where each has half the frame rate of the original video. In the event
that both descriptions are received, the frames from each can be interleaved to reconstruct the
full sequence. In the event one stream is lost, the other stream can still be straightforwardly
decoded and displayed, resulting in video at half the original frame rate.
One such approach has been suggested by Apostolopoulos [1]. Here the author develops a
novel approach for repairing a damaged description by using a clean description through the use
of sophisticated motion compensated temporal interpolation. The wealth of information present
in correctly received previous and future frames can be used to accurately estimate missing
frames. By filtering the motion vector fields from neighboring frames, an estimate of the motion
vectors from the current frame can be obtained. Then, the data from the missing frame is
estimated by interpolating along these motion vectors, while accounting for covered and
uncovered regions.
It has been shown that this approach can accurately reconstruct missing frames. However,
this gain comes at a cost. In order to maintain two separate prediction loops, motion
compensated prediction cannot be used with directly adjacent frames; even frames must be
predicted from even frames and odd from odd. Since temporal prediction decreases in efficiency
as the distance between two frames increases, these two streams are coded less efficiently than
when they are coded as a single stream.
Figure 2.3: Spatial splitting of an image. The original image is low-pass filtered using
four shifted averaging filters. The outputs are then sub-sampled and independently JPEG
encoded. After transmission, the loss of any one stream can be concealed quite
accurately given the significant correlation with the remaining streams [46].
One approach for splitting data in the spatial direction was suggested by Wang and Chung
for image coding [46]. Their algorithm creates four sub-images by filtering an image with an
averaging filter and its shifted variants (see Figure 2.3). They found that this approach was
extremely robust, but correspondingly very inefficient. The correlation between the four streams
allows for very accurate reconstruction when one description is missing. This also greatly
reduces coding efficiency, since the encoder cannot make use of this correlation to help reduce
the bit rate. In the end, their encoder required nearly double the bit rate to achieve the same
distortion as the single stream case in the absence of losses.
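A rough sketch of this style of spatial splitting is shown below, assuming a 2x2 averaging filter applied at four shifts followed by 2:1 sub-sampling in each direction, with a lost description concealed as the average of the remaining three. It only illustrates the polyphase structure; the exact filters and concealment method of [46] differ, and even frame dimensions are assumed.

    import numpy as np

    def split_four(image):
        """Create four sub-images: low-pass filter with a 2x2 averaging
        filter and sub-sample by two in each direction at four phases.
        Each sub-image becomes a separate description."""
        img = image.astype(np.float64)
        p = np.pad(img, ((0, 1), (0, 1)), mode='edge')   # handle the borders
        low = 0.25 * (p[:-1, :-1] + p[:-1, 1:] + p[1:, :-1] + p[1:, 1:])
        return [low[dy::2, dx::2] for dy in (0, 1) for dx in (0, 1)]

    def conceal_and_merge(subs, lost):
        """Estimate a lost sub-image as the average of the other three,
        then re-interleave the four sub-images into a full-size image."""
        subs = list(subs)
        received = [s for i, s in enumerate(subs) if i != lost]
        subs[lost] = sum(received) / len(received)
        h, w = subs[0].shape
        out = np.zeros((2 * h, 2 * w))
        for i, (dy, dx) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
            out[dy::2, dx::2] = subs[i]
        return out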
2.1.3 MD Splitting in the Transform Domain
Given the inefficiencies of spatial or temporal domain splitting, many have suggested making
use of the compression efficiency of decorrelating transforms, like the DCT, prior to partitioning
the sequence (see Figure 2.4). By decorrelating the data first, a significant gain in compression
efficiency is obtained. However, this gain comes at the cost of reconstruction quality since
transformed coefficients are, by design, less correlated and thus more difficult to predict from one another.

Figure 2.4: Transform domain multiple description splitting. Use of the decorrelating transform (e.g. DCT) prior to partitioning allows the MD encoder to take advantage of the significant spatial correlation present in a video sequence. The transformed coefficients are then partitioned, quantized, and independently entropy coded [6].
In image and video coding, the multiple description quantizers presented in Section 2.1.1
are essentially transform domain splitting techniques. Strictly speaking, they do not need to be
used in the transform domain, and can work quite effectively in spatial/temporal domains, as is
done in speech applications. In image and video coding, the transform domain is where
quantization takes place, and thus, MD quantization is one approach for transform domain
splitting. The only reason MD quantization appears separately in this chapter is that it was not
historically developed specifically for use in the transform domain.
The use of correlating transforms is another approach for transform domain splitting. In
general, there exists extensive correlation between neighboring pixels of an image or video
frame. In image or video coding the purpose of decorrelating transforms, like the DCT, is exactly
that: to decorrelate the input variables and to reduce spatial correlation. This allows for much
more efficient coding and significant bit rate reduction. However, by removing this correlation
between transformed coefficients, it becomes very difficult to estimate missing coefficients in
the event that one of the descriptions is lost. One method to help solve this problem is the use of
correlating transforms [17, 49]. These transforms add back correlation between coefficients by
introducing statistical redundancy. The variance of the resulting coefficients, conditioned on
correctly receiving other descriptions, can be significantly reduced and can allow much more
accurate estimation.
Consider the following example. Let

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \qquad (2.1)$$

where $x_1$ and $x_2$ are zero-mean independent Gaussian random variables with variances $\sigma_1^2$ and $\sigma_2^2$ respectively.

$$E[y_1 y_2] = E\bigl[(A x_1 + B x_2)(C x_1 + D x_2)\bigr] = AC\,\sigma_1^2 + BD\,\sigma_2^2 \qquad (2.2)$$

Given that the correlation between $x_1$ and $x_2$ was 0 by definition, any appropriate choice for $A$, $B$, $C$, and $D$ will increase the correlation between $y_1$ and $y_2$ relative to the correlation between $x_1$ and $x_2$. At this point, assuming $y_2$ has been lost, $y_1$ can be used to estimate $x_1$ and $x_2$. Depending on whether $y_1$ or $y_2$ was correctly received, and given that the random variables are jointly Gaussian, the optimal estimators are

$$\hat{x}_1 = \frac{A\,\sigma_1^2}{A^2\sigma_1^2 + B^2\sigma_2^2}\,y_1, \qquad \hat{x}_2 = \frac{B\,\sigma_2^2}{A^2\sigma_1^2 + B^2\sigma_2^2}\,y_1 \qquad (2.3)$$

$$\hat{x}_1 = \frac{C\,\sigma_1^2}{C^2\sigma_1^2 + D^2\sigma_2^2}\,y_2, \qquad \hat{x}_2 = \frac{D\,\sigma_2^2}{C^2\sigma_1^2 + D^2\sigma_2^2}\,y_2 \qquad (2.4)$$

The corresponding average mean squared error distortions are

$$\frac{(A^2 + B^2)\,\sigma_1^2\sigma_2^2}{2\,(A^2\sigma_1^2 + B^2\sigma_2^2)} \text{ given } y_1, \qquad \text{or} \qquad \frac{(C^2 + D^2)\,\sigma_1^2\sigma_2^2}{2\,(C^2\sigma_1^2 + D^2\sigma_2^2)} \text{ given } y_2. \qquad (2.5)$$

With appropriate choices for $A$, $B$, $C$, and $D$, these expected distortions can be made lower than the expected distortion using only $x_1$ and $x_2$, namely $\sigma_1^2/2$ or $\sigma_2^2/2$. As always, this gain comes at a cost. The increased correlation between $y_1$ and $y_2$ will decrease the relative efficiency of
entropy coding and will increase the bit rate of the stream. Also, and perhaps more important for
image and video coding, is that this type of approach can be highly inefficient when most of the
quantized coefficients are equal to zero. Most image/video coders use run-length encoding
(encoding the number of consecutive zeros, not each individual zero value) to take advantage of
this sparse nature of quantized coefficient data. The use of correlating transforms will generally
increase the number of nonzero coefficients, which decreases the effectiveness of run-length
coding and can be very costly.
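To make this tradeoff concrete, the following sketch simulates the pairwise transform of Equations (2.1) through (2.5) for one illustrative choice of A, B, C, and D (a 45-degree rotation) and compares the resulting side distortion against the unbalanced distortions obtained without the transform. The variances, sample size, and seed are arbitrary choices for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma1, sigma2 = 2.0, 1.0            # standard deviations of x1 and x2
    n = 200000

    x1 = rng.normal(0.0, sigma1, n)
    x2 = rng.normal(0.0, sigma2, n)

    # Pairwise correlating transform of Eq. (2.1); a 45-degree rotation is one
    # illustrative choice of A, B, C, D.
    A, B = 1 / np.sqrt(2), 1 / np.sqrt(2)
    C, D = -1 / np.sqrt(2), 1 / np.sqrt(2)
    y1, y2 = A * x1 + B * x2, C * x1 + D * x2

    # Suppose y2 is lost: estimate x1 and x2 from y1 alone using Eq. (2.3).
    den = A**2 * sigma1**2 + B**2 * sigma2**2
    x1_hat = (A * sigma1**2 / den) * y1
    x2_hat = (B * sigma2**2 / den) * y1
    mse_side = 0.5 * (np.mean((x1 - x1_hat) ** 2) + np.mean((x2 - x2_hat) ** 2))

    # Eq. (2.5) predicts the same value analytically.
    mse_theory = (A**2 + B**2) * sigma1**2 * sigma2**2 / (2 * den)

    # Without the transform the two side distortions are unbalanced:
    # losing x1 costs sigma1^2 / 2 = 2.0 while losing x2 costs sigma2^2 / 2 = 0.5.
    # The rotation balances both side distortions at roughly 0.8, at the price
    # of correlated (and therefore less compressible) coefficients.
    print(mse_side, mse_theory, sigma1**2 / 2, sigma2**2 / 2)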
In contrast to methods like the correlating transforms suggested above, which insert artificial redundancy into the transformed coefficients, a number of techniques have been developed that split the transformed coefficients directly.
coefficients in the DCT domain is one option. However, DCT coefficients are highly
uncorrelated and any attempt at reconstruction when one description is missing can leave a great
deal of visual distortion. In [8] and [9], this idea is modified by using a lapped orthogonal
transform. The overlapping nature of this particular transform introduces redundancy and allows
for easier reconstructions in the event of lost descriptions. Bajic and Woods suggest using subband wavelet transforms rather than a DCT allowing for more accurate reconstruction by using
interpolation in the lowest frequency bands [6].
2.2 Predictive Multiple Description Coding
In a typical video sequence there exists a significant amount of redundancy between one frame
and the next. Thus, coding efficiency can be considerably improved by using some form of
predictive coding (specifically most video coders use motion compensated temporal prediction).
Predictive coding is based on the assumption that the encoder and the decoder are able to
maintain the same state, meaning that the frames they use for prediction are identical. However,
transmission losses can cause errors in frames at the decoder resulting in a mismatch in states
between the encoder and the decoder. This state mismatch can lead to significant error
propagation into subsequent frames, even if those frames are correctly received. This section
discusses some of the issues predictive MD coding presents since predictive coding is an
essential piece of most video coding systems. For an in-depth review of MD video coding see
the overview by Wang, Reibman, and Lin [50].
In the strictest sense, each MD stream should be independently decodable and losses in one
description should not affect any other descriptions. Given the use of predictive coding,
accomplishing this requirement can be somewhat difficult. There are a number of approaches to
predictive MD coding; some accomplish this strict independence constraint while others relax or
ignore this constraint to some extent. In [50], the authors partition predictive MD coders into
three useful classes. We use these same classes here since they provide a convenient means of
understanding this topic.
Predictors from the first class, Class A, achieve complete independence through the use of
less efficient predictors. For instance, the system proposed in [1] uses two independent
prediction loops; even frames are predicted from even frames and odd frames are predicted from
odd frames. This prevents losses in one description from propagating to other descriptions (e.g.
the loss of an even frame will only propagate to future even frames). Another approach is to use
a single prediction loop, but only predict from information known to be present in both streams
[7].
Each of the approaches from Class A trades off some amount of prediction efficiency in
order to maintain independence between the descriptions. The second class, Class B,
relaxes the independence constraint in favor of using the most efficient predictors possible. In
this case each prediction is generated in the same way as in a single description coding scheme,
resulting in greater coding efficiency. However, with this approach, losses in one description can
propagate to the remaining descriptions. Some systems using this approach also code the residual
error to reduce the effect of mismatch; others do not.
The final class of predictors, Class C, uses some combination of the first two. They trade
off some of the efficiency of Class B for the increased resilience of Class A. There will be some
amount of mismatch in this type of approach, but presumably less than when using only the most
effective predictors (Class B). In addition, predictors from this class are often able to adapt
between the two extremes, gaining more error resilience where it is most needed. Some
examples of this type of system are [26] and [34]. Depending on the particular modes used, the
adaptive MD mode selection proposed in this thesis can use any one of these three approaches.
By using an end-to-end rate-distortion optimized framework, the approach proposed in this
thesis can most effectively trade off efficiency for resilience to optimize the expected quality at
the receiver. The particular implementation described in Chapter 5 is an example of a Class C
predictor.
2.3 Applications of Multiple Description Video Coding
Multiple description coding can be useful for a wide range of video streaming applications. This
section discusses a few examples where MD coding can make a significant impact on overall
performance.
MD coding can certainly be used to improve standard point-to-point video communications
over a single path, see Figure 2.5 (a). This approach cannot handle a total outage of the single
path, yet the susceptibility to packet loss may be reduced relative to single description coding. If
packet losses along this path are approximately independent (Bernoulli), then any particular
subset of packets sent along the path will also be lost independently of all other subsets. With this
in mind, each description is lost or received independently of all other descriptions. However,
packet losses are often bursty in nature. To remain effective, the MD coding approach relies on
the assumption that it is unlikely that losses will occur on both descriptions. Bursty packet losses
along a single transmission path can cause losses in both descriptions, which can significantly
reduce the effectiveness of the MD approach. Interleaving (reordering the sequence of
transmitted packets) is often used to reduce the effect of bursty packet losses. However, the
delay constraints of real-time systems limit the extent to which this is possible.
While MD codes can be used to improve transmission over a single path, they are
particularly well suited for use with multiple paths, see Figure 2.5 (b) and (c). In this type of
approach, each description is sent along an independent path through the network to the receiver.
Even if the channel experiences bursty losses along one path, path diversity makes it unlikely
that both descriptions will be lost. There are a number of approaches for transmitting over
multiple paths. For instance, standard point-to-point transmissions over the Internet can be
modified to include multi-path routing, as in Figure 2.5 (b). The sender can explicitly route
packets along separate paths by directing them to intermediate routers on their way to the
receiver. Another approach is to use a streaming media content delivery network (CDN) to
stream complementary descriptions from multiple senders as shown in Figure 2.5 (c) [3, 5]. Even
with this type of multiple path approach, it is often difficult to generate completely independent
paths through the network. Eventually, the paths are likely to converge, resulting in two paths
that are partially independent and partially shared. In [4], the authors provide a useful model for
evaluating the performance of path diversity and multiple description streaming along partially
independent, partially shared paths. They use this model to show the benefits of MD
[Figure 2.5 graphic: five network diagrams, panels (a)-(e), described in the caption below.]
Figure 2.5: Applications of multiple description coding. (a) Traditional point-to-point
communications. (b) Point-to-point communications using multiple paths. (c) Multiple
senders via Content Delivery Networks (CDN). (d) Wireless communication via
multiple base stations. (e) Ad-hoc peer-to-peer wireless networks.
coding in situations ranging from fully independent paths to fully dependent paths. These models
also enable one to select the best paths [3, 5].
The use of MD coding also has significant potential in wireless applications. Individual
links often fail due to interference from the environment or from other wireless devices. In
addition, a single link may not be able to support the necessary bandwidth for video
transmission. Thus, transmission using multiple paths in wireless applications is particularly
attractive. For instance, packets could be routed through two different base stations on their way
to the handheld device as shown in Figure 2.5 (d). If one of the links begins to fail due to
interference, multiple description coding can allow graceful degradation in quality, allowing
time for the device to initiate communications with a third base station or wait for the
interference to clear. The same approach can be used with ad-hoc peer-to-peer wireless devices
as shown in Figure 2.5 (e). Individual devices enter and exit the network sporadically due to the
movement of each device, interference with other devices, or simply from being turned on/off.
The use of MD coding allows the system to be more resilient to this type of dynamic network
topology and to maintain reasonable video quality.
Chapter 3
Adaptive MD Mode Selection
Each approach to MD coding trades off some amount of compression efficiency for an increase
in error resilience. How efficiently each method achieves this tradeoff depends on the quality of
video desired, the current network conditions, and the characteristics of the video itself. Most
prior research in MD coding involved the design and analysis of novel MD coding techniques,
where a single MD method is applied to the entire sequence. This approach is taken so as to
evaluate the performance of each MD method. However, it would be more effective to
adaptively select the best MD method based on the situation at hand. Since the encoder in this
type of adaptive MD mode selection system has access to the original source, it is possible to
analyze the performance of each coding mode and select between MD modes in an R-D
optimized manner.
This chapter introduces the concept of adaptive MD mode selection in more detail and
presents some of the tools that can be used to achieve it. The first section of this chapter
discusses the essential role adaptive mode selection plays in video coding, and the second
introduces R-D optimization techniques that can be used to accomplish optimized mode
selection. The final section discusses how this thesis has applied these ideas to adaptive MD
mode selection.
3.1 Adaptive Mode Selection Systems
Adaptive mode selection (AMS) has played a vital role throughout the history of video coding.
Even the earliest video coding standards made use of hybrid inter/intra coding which is
fundamentally an AMS approach. This adaptation between inter-coded blocks, which are
predicted from previously coded frames, and intra-coded blocks, which are coded independently
of any other frames, has been shown to greatly improve video compression efficiency.
The benefits of AMS should be fairly clear; adaptive processing allows the encoder to
adjust to local regions of the video in order to increase its overall effectiveness. However, there
are two main tradeoffs when using adaptive mode selection. First, any implementation of AMS
will require some form of additional overhead, or side information, since it is necessary for the
encoder to convey to the decoder which particular mode has been used for each individual region
or block. With a small number of modes, this overhead can be minor, perhaps just one or two
bits per block, yet if the number of modes grows quite large, this overhead could increase
significantly. Secondly, the use of multiple modes typically increases the complexity of the
encoder. The encoder must somehow evaluate the performance of each available mode, which
generally means that the encoder attempts each one and analyzes the results. The usefulness of
any particular AMS approach is determined by comparing the gain in performance from adaptive
processing with the costs of additional overhead and the increase in complexity. For some
approaches, there may be little or no benefit to adaptation, so the additional overhead and
complexity would be wasted. Yet in many situations the gain in performance can significantly
outweigh these costs.
It is interesting to point out that, in general, the complexity of the decoder is not increased
significantly by the use of AMS. The decoder must understand how to decode each possible
mode, but only needs to use one particular mode per block. For this reason, AMS is often quite
useful in a situation where the encoding is done only once and the decoding is done many times
or in an application where there may be a smaller number of relatively expensive encoders and a
vast number of inexpensive decoders.
Adaptive mode selection can be used to achieve any number of different goals, but perhaps
the two most common uses in video compression are to improve compression efficiency and to
increase error resilience. One example of an AMS system that has shown significant gains in
compression efficiency is the intra/inter hybrid video coding approach mentioned above. In a
video sequence there often exists significant temporal correlation between neighboring frames.
By using motion estimation/compensation for inter-frame prediction, the encoder can often
generate a fairly accurate prediction of the next frame. Encoding the residual difference between
the prediction and the original is often significantly more efficient than coding the original data.
However, during scene changes or periods of significant motion, motion compensation can fail
to provide an accurate prediction. In this type of situation inter-frame coding can actually be less
efficient than using intra-frame coding to code the original data itself. By adaptively switching
between the two modes, the encoder can adapt to the situation at hand, and coding efficiency can
be greatly improved.
AMS is often used for increasing error resilience as well. The hybrid intra/inter coding
approach happens to be one example that can be used for both purposes. In the presence of
channel errors or losses, the use of inter-frame encoding can lead to the spatial and temporal
propagation of errors. The use of intra-frame coding stops this error propagation and can
significantly improve the error resilience of the system. However, the goal of increasing error
resilience is at odds with the goal of increasing compression efficiency. At one extreme,
exclusively using intra-coding would lead to the highest error resilience, yet the worst coding
efficiency. Most hybrid intra/inter coding systems are designed to make some tradeoff between
compression efficiency and error resilience.
The latest video codec, H.264, includes many examples of AMS including the availability
of multiple intra prediction modes, variable motion compensation block sizes, adaptive
frame/field coding, and so on. The H.264 standard is discussed in further detail in Chapter 5.
The adaptive format conversion (AFC) system discussed in [44] is an example of AMS for
scalable video coding and was in some ways the inspiration for the current work. As discussed in
the previous chapter, scalable coding and MD coding are closely related, scalable coding being a
special case of MD coding. The approach taken in [44] focuses on improving compression
efficiency rather than error resilience, but many of the concepts are the same as those discussed
in this thesis. In this AFC approach, the base layer provides interlaced video and the
enhancement layer provides progressive video. The encoder in this system adaptively selects
from among a small set of deinterlacing modes in order to use the most effective mode for each
block. It then transmits this mode selection information as enhancement layer data. In many
situations, this small amount of enhancement information can significantly improve performance
over non-adaptive deinterlacing and this approach is often found to be more efficient than
encoding the residual data itself.
3.2 Rate-Distortion Optimized Mode Selection
The main question that arises with any adaptive mode selection system is how exactly to choose
one particular mode for each region. The main goal is often to minimize some distortion metric
subject to a bit rate constraint. The modes resulting in the lowest distortion often use the greatest
number of bits, so this rate constraint forces the encoder to make certain tradeoffs. It must select
the best modes possible while still keeping the rate below a fixed level. It is the goal of optimal
mode selection to determine which modes most effectively accomplish this tradeoff.
The following section provides a brief summary of rate-distortion (R-D) optimization
techniques that can be used for the purpose of optimal mode selection.
3.2.1 Independent Rate-Distortion Optimization
One common assumption in R-D optimization for video coding is that the distortion D_i and rate
R_i can be determined independently for each block and that the decisions made for each block
will not affect any other blocks. The inter- and/or intra-frame prediction techniques used in video
codecs invalidate this assumption for the most part, but nonetheless this is a very common
approach and it greatly simplifies this discussion. Section 3.2.2 will discuss the problems these
prediction dependencies introduce and argue why the above assumption is necessary for a
practical implementation of R-D optimization.
The adaptive mode selection problem can be summarized as a budget-constrained
allocation problem: for a given total rate R_{total}, select modes for each individual block i that
minimize the distortion metric

\sum_i D_i     (3.1)

subject to the bitrate constraint

\sum_i R_i \le R_{total}.     (3.2)

Here R_i is the total number of bits necessary to code block i and D_i is the resulting distortion.
The distortion metric we have used for this problem is the mean square error (MSE) introduced
in Chapter 1.
The classic solution to this problem is the discrete version of Lagrangian optimization [31].
In this approach the Lagrangian cost function J_i(\lambda) is calculated as

J_i(\lambda) = D_i + \lambda R_i     (3.3)

where \lambda is a non-negative real number used to define the acceptable tradeoff between rate and
distortion. Setting \lambda to zero effectively ignores the rate and results in minimizing the distortion.
Setting \lambda arbitrarily high effectively ignores the distortion and results in minimizing the bitrate.
Lagrangian optimization theory states that if a particular set of modes minimizes

J(\lambda) = \sum_i \left( D_i + \lambda R_i \right)     (3.4)

then the same set of modes will also solve the above budget-constrained allocation problem for
the particular case where the total rate budget is R_{total} = \sum_i R_i^*, where R_i^* denotes the rate of the
mode selected for block i. If we assume that each block is independent, then minimizing

\sum_i J_i(\lambda)     (3.5)

can be rewritten as

\min \sum_i J_i(\lambda) = \sum_i \min\left( J_i(\lambda) \right).     (3.6)
Therefore, the total sum can be minimized by minimizing the Lagrangian cost for each
individual block. The encoder in the system computes the Lagrangian cost for each mode and the
mode that minimizes this cost is selected. By doing so, the encoder guarantees that the resulting
MSE is the smallest possible for that particular rate. The encoder can then adjust the value of \lambda
to achieve various operating points.
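As a concrete illustration of this selection rule, the sketch below (Python; the per-block lists of candidate (rate, distortion) pairs and the subsequent adjustment of λ are illustrative assumptions, not part of any reference encoder) evaluates J_i(λ) = D_i + λR_i for every candidate mode of every block and keeps the cheapest one.

def select_modes(block_rd, lam):
    """block_rd[i] is a list of (rate, distortion) pairs, one per candidate mode.
    Returns the chosen mode index per block and the resulting totals."""
    modes, total_rate, total_dist = [], 0.0, 0.0
    for candidates in block_rd:
        # Independent Lagrangian choice: minimize D + lambda * R for this block.
        best = min(range(len(candidates)),
                   key=lambda m: candidates[m][1] + lam * candidates[m][0])
        modes.append(best)
        total_rate += candidates[best][0]
        total_dist += candidates[best][1]
    return modes, total_rate, total_dist

# The value of lambda would then be adjusted (e.g. by bisection) until the
# resulting total rate meets the budget R_total.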
[Figure 3.1 graphic: a binary tree of cumulative (rate, distortion) pairs over successive stages, from (R_0, D_0) and (R_1, D_1) after Stage 0 through (R_000, D_000), ..., (R_111, D_111) after Stage 3.]
Figure 3.1: Dynamic programming tree generation for a two-mode AMS system. At
each stage, the encoder calculates the rate and distortion possible with each available
mode. If two paths result in the same cumulative rate, the path with the highest distortion
is pruned from the tree. If a path results in a cumulative rate over the allowed budget,
that path is pruned from the tree. Once the entire tree is generated, the remaining path
with the lowest distortion is selected.
Another approach to solving this problem is the use of dynamic programming. In this case,
the encoder creates a trellis or tree of all possible outcomes. For example, consider the case of a
two-mode AMS system, see Figure 3.1. The encoder has two possible choices for the first block
or stage. The resulting rates and distortions for these two outcomes are computed and stored.
After the second stage there are four possible outcomes, eight after the third stage, and so on. If
two or more paths result in the same cumulative rate at any time, those with the higher distortion
are pruned from the tree. If a path results in a cumulative rate higher than the available budget, it
is pruned from the tree. After the entire tree is generated, the remaining path with the lowest
distortion is selected and the set of modes used to travel along that path are used.
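The following minimal sketch (Python) illustrates the tree-building and pruning procedure described above; it assumes, purely for illustration, that rates are integers so that paths with equal cumulative rate can be compared and merged.

def dp_mode_selection(block_rd, rate_budget):
    """block_rd[i]: list of (rate, distortion) pairs per mode; rates are integers.
    Returns (modes, rate, distortion) of the best surviving path within the budget."""
    # Each survivor maps a cumulative rate -> (cumulative distortion, mode path).
    survivors = {0: (0.0, [])}
    for candidates in block_rd:
        nxt = {}
        for rate, (dist, path) in survivors.items():
            for m, (r, d) in enumerate(candidates):
                new_rate, new_dist = rate + r, dist + d
                if new_rate > rate_budget:
                    continue                                  # prune: over budget
                if new_rate not in nxt or new_dist < nxt[new_rate][0]:
                    nxt[new_rate] = (new_dist, path + [m])    # prune: keep lowest distortion
        survivors = nxt
    best_rate = min(survivors, key=lambda r: survivors[r][0])
    dist, path = survivors[best_rate]
    return path, best_rate, dist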
[Figure 3.2 graphic: operational R-D points P_0, P_1, and P_2 plotted as distortion versus rate, with the budget R_{Total} marked on the rate axis.]
Figure 3.2: Comparison between Lagrangian optimization and dynamic programming.
Given the total rate constraint R_{Total}, operational point P_1 would be optimal. However,
Lagrangian optimization will yield the suboptimal result P_0 since it can only achieve
points on the convex hull of the rate-distortion characteristic. Dynamic programming
considers all possible solutions and will select point P_1.
Compared to Lagrangian optimization, there is one main advantage of dynamic
programming. Lagrangian optimization can only achieve points that lie on the convex hull of the
rate-distortion curve shown in Figure 3.2. Consider the three operational points labeled P_0, P_1,
and P_2, and the budget constraint R_{Total} in Figure 3.2. Given this budget constraint, the optimal
operational point would be P_1. Lagrangian optimization can achieve points P_0 and P_2 since these
lie on the convex hull of the rate-distortion curve. Point P_2 is above the allowed rate, so it is
unacceptable. Even though point P_1 is under the allowed rate and has a lower distortion than P_0,
Lagrangian optimization will not find this solution. Dynamic programming will achieve the
optimal operational point since it will fill its tree with every possible solution, including point P_1.
However, the complexity of the dynamic programming approach is significantly higher
than using Lagrangian optimization. Even when just two modes are used, the tree can grow
extremely quickly and the memory requirements for this approach can be quite unreasonable. In
addition, if there are many operational points on the convex hull, there is little or no benefit to
dynamic programming.
3.2.2 Effects of Dependencies on Rate-Distortion Optimization
As mentioned in the previous section, it is common to assume that the decisions made in current
blocks will have no effect on future blocks. However, in reality this is not the case. Motion
compensated prediction introduces a clear dependency between frames that is not accounted for
with this assumption. Similarly, most video coding standards make use of some form of intra-frame prediction (motion vector prediction, differentially coded quantization parameters, intra-pixel prediction, etc.) that will introduce dependencies within a frame as well.
As mentioned above, Lagrangian optimization is a greedy approach in that it optimizes
coding decisions for the current block alone. As demonstrated in Figure 3.2, it is possible to
make slightly sub-optimal decisions for the current block that may leave more bits remaining for
future blocks, resulting in a total solution with better performance. This same problem is
compounded by the introduction of prediction dependencies.
For example, it may be possible to encode the current frame with more bits than would be
optimal according to Lagrangian optimization. This lowers the distortion in the current frame,
but leaves fewer bits for the next frame. However, since the next frame is predicted from the
current frame and the current frame has a lower distortion, perhaps it is easier to code the next
frame. The total result might use the same number of bits but have a lower distortion in both
frames.
As before, dynamic programming can be used to solve this problem in an optimal manner.
Since every possible solution is considered, the effects of prediction dependencies are taken into
account. However, given that prediction dependencies introduce memory into the system, fewer
paths may be trimmed from the tree. Those paths that happen to merge to the same rate now also
need to be in the same memory state before they can be trimmed. Nonetheless, it is possible for
dynamic programming to find the optimal solution.
3.3 End-to-End R-D Optimized MD Mode Selection
The previous sections in this chapter have provided motivation for using adaptive mode selection
and discussed some of the optimization tools that can be used to accomplish this task. This
section discusses how these tools have been applied to the current problem of adaptive MD
mode selection.
3.3.1 Lagrangian Optimization
Despite the benefits of using dynamic programming that have been mentioned in Section 3.2,
this thesis will use Lagrangian optimization to perform adaptive MD mode selection. We have
made this decision primarily for the three reasons listed here:
* Complexity: The sheer complexity of a dynamic programming approach makes it
unreasonable for an actual implementation. Even a simple 2-mode system used
for a small frame size of 100 blocks results in 2^100 different branches by the end of
a single frame. A few of these might be trimmed, but in general this approach
requires too many computations and far too much memory.
* Delay: In order to use dynamic programming to fully account for inter-frame
dependencies, it is necessary for the encoder to wait until it has received and
encoded every single frame. This introduces a delay into the system that is
unacceptable for many applications.
* Slight Reduction in Performance: Ignoring prediction dependencies,
Lagrangian results would be fairly close to the results with dynamic programming
since, in the current application, the population of operating points on the R-D
curve is fairly dense (see Section 3.2.1). In terms of prediction dependencies, it
has been shown that optimized dependent solutions can often be approximated
using an independent approach with little loss in performance [29].
3.3.2 Rate-Distortion Optimization over Lossy Channels
The Lagrangian optimization techniques presented above can be used to minimize distortion
subject to a bitrate constraint. However, this approach assumes the encoder has full knowledge
of the end-to-end distortion experienced by the decoder. When transmitted over a lossy channel,
the end-to-end distortion consists of two terms: 1) known distortion from quantization and 2)
unknown distortion from random packet loss. The unknown distortion from losses can only be
determined in expectation due to the random nature of losses.
Modifying the Lagrangian cost function to account for the total end-to-end distortion gives
qunE[Doss
J ( A)= D,"" + E
] + 2R .
1
(I"D+,R.'37
Here R, is the total number of bits necessary to code region i
,
(3.7)
Du""" is the distortion due to
quantization, and D"O"is a random variable representing the distortion due to packet losses.
Thus, the expected distortion experienced by the decoder can be minimized by coding each
region with all available modes and choosing the mode that minimizes this Lagrangian cost.
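In sketch form, the selection rule implied by (3.7) can be written as follows (Python; the dictionary keys and the source of the expected-loss term are illustrative, with the latter supplied by the distortion model developed in Chapter 4).

def select_md_mode(candidates, lam):
    """candidates: list of dicts with keys 'rate', 'd_quant', 'e_d_loss'
    (bits, quantization distortion, and expected loss distortion for one mode).
    Returns the index of the mode minimizing J = D_quant + E[D_loss] + lambda * R."""
    return min(range(len(candidates)),
               key=lambda m: candidates[m]['d_quant']
                             + candidates[m]['e_d_loss']
                             + lam * candidates[m]['rate'])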
Calculating the expected end-to-end distortion is not a straightforward task. The
quantization distortion D_i^{quant} and bitrate R_i are easily determined at the encoder. However, the
channel distortion D_i^{loss} is difficult to calculate due to spatial and temporal error propagation. In
the next chapter we discuss approaches for modeling expected end-to-end distortion and the
extensions necessary to apply these concepts to the current problem of MD coding over multiple
paths with Gilbert (bursty) losses.
Chapter 4
Modeling End-to-End Distortion
over Lossy Packet Networks
As mentioned in the previous chapter, random packet losses force the encoder to model the
network channel and estimate the expected end-to-end distortion including both quantization
distortion and distortion due to channel losses. With an accurate model of expected distortion the
encoder can make optimized decisions to improve the quality of the reconstructed video at the
decoder. A number of approaches have been suggested in the past to estimate end-to-end
distortion. This chapter discusses some of the previous work in this area and the extensions
necessary to apply this work to the current problem of multiple description coding over multiple
transmission paths with bursty packet losses.
4.1 Optimal Intra-Coding for Error Resilient Video Streams
The problem of optimal mode selection over lossy packet networks was originally considered for
optimal intra/inter coding decisions in single description streams as a means of combating
temporal error propagation. This problem has received considerable attention in the error
resilience community and is also applicable to the current problem of optimal MD mode
selection.
Besides the obvious use of intra-coded frames (I-frames) as starting points for random
access into video bitstreams, I-frames resynchronize the video prediction loop and stop error
propagation. In the extreme case, a sequence consisting of all I-frames would prevent error
propagation entirely. However, the use of intra-coding is inefficient given the extensive temporal
correlation present in typical video sequences. Thus the intra/inter coding decision presents a
tradeoff between compression efficiency and error resilience.
The simplest approach for re-synchronizing the motion compensated prediction loop is the
use of periodic replenishment [21]. This can be performed in any number of ways from
periodically intra-coding entire frames, to periodically coding rows of macroblocks, or even
using a pseudo-random pattern of intra-coded blocks. However, this type of approach does not
take into account the characteristics of the video source. Certain blocks (e.g. stationary
background regions) may not need intra-coding if they can be well reconstructed even if packet
losses occur. One early attempt to estimate which blocks are most susceptible is the concept of
conditional replenishment presented in [21]. Here the encoder simulates the loss of each block
and calculates the resulting mean squared error. If this distortion exceeds a certain threshold the
block is intra-coded. This approach is essentially a rough attempt to estimate the expected end-to-end distortion and provide content-aware intra-coding. Other approaches for this type of
conditional replenishment have been proposed including more complex sensitivity metrics and
decisions that take into account packet loss rates on the network, e.g. [28]. Depending on the
application and the availability of a feedback channel, it may be possible for the decoder to
inform the encoder when certain loss events occur and perhaps allow the encoder to compensate
for these known error events through some form of error tracking as is done in [37].
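A rough sketch of conditional replenishment in the spirit of [21] follows (Python; the frame-copy concealment and the specific threshold test are illustrative assumptions rather than details taken from the reference).

import numpy as np

def conditional_replenishment(curr_block, prev_block, threshold):
    """Simulate the loss of the current block: assume the decoder conceals it by
    copying the co-located block from the previous frame, and force intra-coding
    if the resulting MSE would exceed the threshold."""
    concealment_mse = np.mean((curr_block.astype(float) - prev_block.astype(float)) ** 2)
    return concealment_mse > threshold      # True -> intra-code this block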
Some early approaches to solving this problem in an R-D optimized framework appear in
[23] and [10]. In [23] the authors solve the problem using Lagrangian optimization under certain
simplifications; most notably they assume all motion vectors are zero. In [10] a weighted
average is used in the Lagrangian cost function as an estimate of expected distortion.
J(\lambda) = (1 - p)\,D_q + p\,D_c + \lambda R     (4.1)

Here D_q is the distortion due only to quantization, D_c is the distortion due to error concealment,
and R is the rate. The variable p represents the probability of loss. This weighted average is a
reasonable estimate of expected end-to-end distortion; however, it assumes that previous frames
have been properly decoded and effectively ignores error propagation.
Accurately estimating the expected end-to-end distortion in a video frame is quite difficult
due to spatial and temporal error propagation. Each of the above approaches provides a rough
estimate of expected distortion and is able to provide significant improvements over content-unaware periodic intra-coding. However, none of these approaches truly provides an accurate
measure of end-to-end distortion. The algorithm described in the next section has demonstrated
significant improvement in performance and allows for accurate estimation of expected
distortion on a pixel-by-pixel basis.
4.2 Recursive Optimal Per Pixel Estimate of Expected Distortion
In [56] the authors suggest a recursive optimal per-pixel estimate (ROPE) for optimal intra/inter
mode selection. Here, the expected distortion for any pixel location is calculated recursively as
follows. Suppose f_n^i represents the original pixel value at location i in frame n, and \tilde{f}_n^i
represents the reconstruction of the same pixel at the decoder. The expected distortion d_n^i at that
location can then be written as

d_n^i = E\left[ \left( f_n^i - \tilde{f}_n^i \right)^2 \right] = \left( f_n^i \right)^2 - 2 f_n^i\, E\left[ \tilde{f}_n^i \right] + E\left[ \left( \tilde{f}_n^i \right)^2 \right].     (4.2)

At the encoder, the value f_n^i is known and the value \tilde{f}_n^i is a random variable. So, the expected
distortion at each location can be determined by calculating the first and second moments of the
random variable \tilde{f}_n^i.
If we assume the encoder uses full-pixel motion estimation, each correctly received pixel
value can be written as \tilde{f}_n^i = \hat{e}_n^i + \tilde{f}_{n-1}^j, where \tilde{f}_{n-1}^j represents the pixel value in the previous frame
that has been used for motion compensated prediction and \hat{e}_n^i represents the quantized residual
(in the case of intra pixels, the prediction is zero and the residual is just the quantized pixel
value). The first moment of each received pixel can then be recursively calculated by the encoder
as follows

E\left[ \tilde{f}_n^i \,\middle|\, \text{received} \right] = \hat{e}_n^i + E\left[ \tilde{f}_{n-1}^j \right].     (4.3)

If we assume the decoder uses frame-copy error concealment, each lost pixel is reconstructed by
copying the pixel at the same location in the previous frame. Thus, the first moment of each lost
pixel is

E\left[ \tilde{f}_n^i \,\middle|\, \text{lost} \right] = E\left[ \tilde{f}_{n-1}^i \right].     (4.4)

The total expectation can then be calculated as

E\left[ \tilde{f}_n^i \right] = P(\text{received})\, E\left[ \tilde{f}_n^i \,\middle|\, \text{received} \right] + P(\text{lost})\, E\left[ \tilde{f}_n^i \,\middle|\, \text{lost} \right].     (4.5)
The calculations necessary for computing the second moment of \tilde{f}_n^i can be derived in a similar
recursive fashion.
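A minimal sketch of this first-moment recursion follows (Python; it assumes integer-pel motion, frame-copy concealment, and a Bernoulli loss probability p, following (4.3)-(4.5); the array names are illustrative).

import numpy as np

def rope_first_moment(e_hat, mv_expect, prev_expect, p_loss):
    """e_hat: quantized residual for each pixel of frame n.
    mv_expect: E[f~_{n-1}] at the motion-compensated location of each pixel.
    prev_expect: E[f~_{n-1}] at the co-located pixel (frame-copy concealment).
    Returns E[f~_n] per pixel under a Bernoulli loss model; the second moment
    E[f~_n^2] would be propagated by an analogous recursion."""
    e_received = e_hat + mv_expect                            # eq. (4.3)
    e_lost = prev_expect                                      # eq. (4.4)
    return (1.0 - p_loss) * e_received + p_loss * e_lost      # eq. (4.5)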
4.3 Multiple Description ROPE Model
In [33] the ROPE model presented in the previous section has been extended to a two-stream
multiple description system by recognizing the four possible loss scenarios for each frame: both
descriptions are received, one or the other description is lost, or both descriptions are lost. For
notational convenience, we will refer to these outcomes as 11, 10, 01, and 00 respectively. The
conditional expectations of each of these four possible outcomes are recursively calculated and
multiplied by the probability of each occurring to calculate the total expectation.
E\left[ \tilde{f}_n^i \right] = P(11)\, E\left[ \tilde{f}_n^i \,\middle|\, 11 \right] + P(10)\, E\left[ \tilde{f}_n^i \,\middle|\, 10 \right] + P(01)\, E\left[ \tilde{f}_n^i \,\middle|\, 01 \right] + P(00)\, E\left[ \tilde{f}_n^i \,\middle|\, 00 \right]     (4.6)
Graphically, this can be depicted as shown in Figure 4.1 (a). The first moments of the
random variables \tilde{f}_{n-1}^i as calculated in the previous frame are used to calculate the four
intermediate expected outcomes that are then combined together using equation (4.6) and stored
for future frames. Again, the second moment calculations can be computed in a similar manner.
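In sketch form, the combination step of (4.6) is simply a probability-weighted sum (Python; the four conditional moments are assumed to be produced by per-outcome recursions analogous to (4.3)-(4.5)).

def md_rope_first_moment(cond_expect, outcome_prob):
    """cond_expect: maps each outcome '11', '10', '01', '00' to E[f~_n | outcome]
    (per-pixel arrays); outcome_prob: maps the same keys to P(outcome).
    Returns the total expectation E[f~_n] of equation (4.6)."""
    return sum(outcome_prob[o] * cond_expect[o] for o in ('11', '10', '01', '00'))

# With independent (Bernoulli) losses at rates p1 and p2 on the two paths:
# outcome_prob = {'11': (1-p1)*(1-p2), '10': (1-p1)*p2,
#                 '01': p1*(1-p2),     '00': p1*p2}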
4.4 Extended ROPE Model
These previous methods have assumed a Bernoulli independent packet loss model where the
probability that a packet is lost is independent of any other packet. However, the idea can be
modified for a channel with bursty packet losses as well. Recent work has identified the
importance of burst length in characterizing error resilience schemes, and that examining
performance as a function of burst length is an important feature for comparing the relative
merits of different error resilient coding methods [2, 4, 27].
[Figure 4.1 graphic: (a) recursion with Bernoulli packet loss model; (b) recursion with Gilbert packet loss model.]
Figure 4.1: Conceptual computation of first moment values in MD ROPE approach. (a)
Bernoulli case: the moment values from the previous frame are used to compute the
expected values in each of the four possible outcomes that are then combined to find the
moment values for the current frame. (b) Gilbert losses: due to the Gilbert model, the
probability of transitioning from any one outcome at time n- 1 to any other outcome at
time n changes depending on which outcome is currently being considered. Thus, the
four expected outcomes cannot be combined into one single value as was done in the
Bernoulli case. Each of these four values must be stored separately for future
calculations.
[Figure 4.2 graphic: two-state Markov chain with State 1 (packet received) and State 0 (packet lost). Average packet loss rate = p_1 / (1 + p_1 - p_0); expected burst length = 1 / (1 - p_0).]
Figure 4.2: Gilbert packet loss model. Assuming p_1 < p_0, there is a greater probability
the current packet will be lost if the previous packet was lost. This causes bursty losses
in the resulting stream.
For this thesis we have extended the MD ROPE approach to account for bursty packet loss.
Here we use a 2-state Gilbert loss model, but the same approach could be used for any multi-state loss model, including those with fixed burst lengths. We use the Gilbert model to simulate
the nature of bursty losses, where packet losses are more likely if the previous packet has been
lost. This can be represented by the Markov model shown in Figure 4.2, assuming p_1 < p_0.
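A small simulation sketch of the two-state Gilbert channel (Python; the values of p_1 and p_0 are illustrative) confirms the closed-form loss rate and burst length given in Figure 4.2.

import numpy as np

def simulate_gilbert(p1, p0, n, rng=np.random.default_rng(0)):
    """Simulate n packet outcomes (True = lost) from a two-state Gilbert channel.
    p1 = P(loss | previous received), p0 = P(loss | previous lost)."""
    losses, prev_lost = [], False
    for _ in range(n):
        p = p0 if prev_lost else p1
        prev_lost = rng.random() < p
        losses.append(prev_lost)
    return np.array(losses)

p1, p0 = 0.02, 0.5
loss = simulate_gilbert(p1, p0, 200_000)
print(loss.mean(), p1 / (1 + p1 - p0))   # empirical vs. closed-form average loss rate
# The expected burst length is 1 / (1 - p0); here two packets.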
The expected value of any outcome in a multi-state packet loss model can be calculated
by computing the expectation conditioned on transitioning from one outcome to another
multiplied by the probability of making that transition. For the two-state Gilbert model, this idea
can be roughly depicted as shown in Figure 4.1 (b). For example, assume T_A^B represents the
event of transitioning from outcome A at time n-1 to outcome B at time n, and
P(T_A^B) represents the probability of making this transition. Then the expected value conditioned
on outcome 11 can be computed as shown in (4.7).

E\left[ \tilde{f}_n^i \,\middle|\, 11 \right] = P\left(T_{11}^{11}\right) E\left[ \tilde{f}_n^i \,\middle|\, T_{11}^{11} \right] + P\left(T_{10}^{11}\right) E\left[ \tilde{f}_n^i \,\middle|\, T_{10}^{11} \right] + P\left(T_{01}^{11}\right) E\left[ \tilde{f}_n^i \,\middle|\, T_{01}^{11} \right] + P\left(T_{00}^{11}\right) E\left[ \tilde{f}_n^i \,\middle|\, T_{00}^{11} \right]     (4.7)
The remaining three outcomes can be computed in a similar manner. Due to the Gilbert
model, the probability of transitioning from any outcome at time n-1 to any other outcome at
time n changes depending on which outcome is currently being considered. For instance, when
computing the expected value conditioned on outcome 00, the result when both streams are lost,
the probability that the previous outcome was 10, 01, or 00 is much higher than when computing
the expected value conditioned on outcome 11. Since the transitional probabilities vary from
outcome to outcome, it is not possible to combine the four expected outcomes into one value as
can be done in the Bernoulli case. The four values must be stored separately for future use as
shown in Figure 4.1 (b). Once again, the second moment values can be computed using a similar
approach.
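The key structural change relative to the Bernoulli case is that the four conditional moments are carried forward rather than collapsed into a single value. A sketch of one update step follows (Python; trans_prob and update_fn are illustrative names: the former holds the probability that frame n-1 was in outcome a given that frame n is in outcome b, and the latter applies the per-outcome prediction or concealment rule of the encoder's model).

OUTCOMES = ('11', '10', '01', '00')

def gilbert_md_rope_step(prev_cond, trans_prob, update_fn):
    """prev_cond[a]: E[f~_{n-1} | a] for each outcome a of frame n-1.
    trans_prob[(a, b)]: P(frame n-1 outcome was a | frame n outcome is b), as in eq. (4.7).
    update_fn(prev_expect, b): per-outcome prediction/concealment rule for outcome b.
    Returns the four new conditional first moments for frame n."""
    new_cond = {}
    for b in OUTCOMES:
        new_cond[b] = sum(trans_prob[(a, b)] * update_fn(prev_cond[a], b)
                          for a in OUTCOMES)
    return new_cond      # all four kept for frame n+1 (cf. Figure 4.1 (b))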
The above discussion assumed full pixel motion vectors and frame copy error concealment,
but it is possible to extend this approach to sub-pixel motion vector accuracy and more
complicated error concealment schemes. As discussed in [56] the main difficulty with this arises
when computing the second moment of pixel values that depend on a linear combination of
previous pixels. The second moment depends on the correlations between each of these previous
pixels and is difficult to compute in a recursive manner. We have modified the above approach
in order to apply it to the H.264 video coding standard with quarter-pixel motion vector accuracy
and more sophisticated error concealment methods by using the techniques proposed in [55] for
estimation of cross-correlation terms. Specifically, each correlation term E[XY] is estimated by
E[XY] = E[X]\, \frac{E\left[ Y^2 \right]}{E[Y]}.     (4.8)
Chapter 5
MD Mode Selection System
This chapter discusses the implementation of an adaptive MD mode selection system we have
used to examine the concept of adaptive MD mode selection. These details are provided to allow
for a better understanding of the results presented in Chapter 6. However, it should be noted that
this is simply one approach that we have used to illustrate the concept of adaptive MD mode
selection. We make no claim that this is the optimal implementation of such a system, and in fact
that is highly unlikely.
The system used for this work has been based on the H.264 video coding standard. This
choice was made since, at the time of publication, H.264 was the most advanced video coding
standard. H.264 makes use of state-of-the-art video coding techniques and has been shown to
significantly increase coding efficiency relative to previous coding standards. In addition, by
using H.264, the results presented in this thesis can more easily be compared against current and
future work.
The first section of this chapter presents an overview of H.264, providing a short history of
the standard and the details necessary for discussing the implementation of the adaptive MD
mode selection system. The second section provides a detailed explanation of the specific MD
mode selection system developed for this thesis.
5.1 MPEG4-AVC / H.264 Video Coding Standard
The H.264 video coding standard has been developed as a joint effort between the ITU and
ISO/IEC standards committees. Originally referred to as H.26L, H.264 began as a long-term
effort by the ITU in 1998 with the goal of doubling the compression efficiency of previous video
coding standards. In 2001, the ITU and ISO joined together to form the Joint Video Team (JVT)
and developed the standard formally referred to as MPEG4-Advanced Video Coding (AVC) or
[Figure 5.1 graphic: block diagram of the encoder, with the digital video input, control data, quantized video data, and motion vector data multiplexed into the H.264 bit stream.]
Figure 5.1: H.264 video encoder architecture.
[Figure 5.2 graphic: block diagram of the decoder, with the H.264 bit stream demultiplexed into control data, quantized video data, and motion vector data to produce the decoded video.]
Figure 5.2: H.264 video decoder architecture.
[Figure 5.3 graphic: a 4x4 block of pixels a-p with the neighboring reconstructed pixels A-H above, I-L to the left, and M at the upper-left corner.]
Figure 5.3: 4x4 Intra-prediction modes. The encoder can use pixels A-M to predict the
current 4x4 block of data, pixels a-p.
Mode 0: Vertical
Mode 1: Horizontal
Figure 5.4: Two of the nine available 4x4-intra prediction modes. In the vertical mode
the pixels above are copied downward to predict the current block. In the horizontal
mode, the pixels to the left are copied instead. The remaining seven modes copy and/or
average neighboring pixels in various orientations.
H.264. Previous joint efforts between the ISO and ITU have been very successful including the
JPEG standard used extensively for image compression and the MPEG2 video coding standard
used in a vast number of applications including DVD and HDTV.
At its core, the H.264 codec remains very similar to previous coding standards (MPEG2,
MPEG4, H.261, H.263, etc...). It is a block-transform hybrid video coding approach using
motion estimation/compensation to reduce temporal redundancy, and a DCT to reduce spatial
redundancy. Furthermore, entropy coding is used to represent the remaining information with as
few bits as possible. The following sections highlight some of the main aspects of the H.264
video codec specifically focusing on the differences between H.264 and previous video coding
standards. Figures 5.1 and 5.2 provide an overview of the structure of the H.264 video encoder
and decoder.
[Figure 5.5 graphic: a 16x16 macroblock of pixels with the 16 reconstructed pixels above and the 16 to the left (shown as black dots) used as predictors.]
Figure 5.5: 16x16 Intra-prediction modes. The encoder can use the 32 pixels represented
by the black dots above to predict the current 16x16 pixel macroblock.
5.1.1 Intra-Frame Prediction
One significant change in H.264 relative to previous standards is the use of extensive intra-frame
prediction. The use of intra-prediction allows the encoder to predict intra-coded blocks in the
current frame from previously coded pixels in the same frame. MPEG2 and MPEG4 both had
simple intra-prediction (DC value prediction, etc...), but the intra-prediction in H.264 is quite
extensive. There are 13 different intra prediction modes available in H.264; nine 4x4 prediction
modes and four 16x16 modes. The nine 4x4 modes use the 13 neighboring pixels (pixels A-M in
Figure 5.3) in various manners to predict the current 4x4 block of data (pixels a-p). As an
example, two of the nine 4x4 modes are shown in Figure 5.4. Since there is likely a strong
correlation between the modes used in the current block and those used in neighboring blocks,
each of these 4x4 modes is predicted from neighboring blocks to further improve coding
efficiency.
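As a sketch of the two modes shown in Figure 5.4 (Python; the arrays of neighboring pixels are assumed to be already reconstructed):

import numpy as np

def intra4x4_vertical(above):
    """Mode 0: copy the four reconstructed pixels above (A-D) down each column."""
    return np.tile(np.asarray(above).reshape(1, 4), (4, 1))

def intra4x4_horizontal(left):
    """Mode 1: copy the four reconstructed pixels to the left (I-L) across each row."""
    return np.tile(np.asarray(left).reshape(4, 1), (1, 4))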
In a similar manner, the four 16x16 modes use the 32 neighboring pixels shown in Figure
5.5 to predict each 16x16 block of data. These four modes are mainly intended for the prediction
of smooth regions of the video containing little or no detail.
[Figure 5.6 graphic: macroblock divisions 16x16, 16x8, 8x16, and 8x8; 8x8 block subdivisions 8x8, 8x4, 4x8, and 4x4.]
Figure 5.6: Macroblock partitions for block motion estimation. Each macroblock may be
partitioned into each of the four patterns shown on the top row. Furthermore, those blocks
of size 8x8 may be further partitioned into the four patterns shown on the bottom row.
5.1.2 Hierarchical Block Motion Estimation
As with most prior video coding standards, H.264 uses motion estimation/compensation to
remove a significant amount of temporal redundancy from the original sequence. As a hybrid
video coder, the encoder can select between intra and inter coding on a block by block basis.
This decision is represented by the switch in Figure 5.1. Those blocks which are intra coded are
predicted using intra-frame prediction mentioned in the previous section. The remaining blocks
are inter-coded using motion compensated prediction from previous frames.
For motion estimation/compensation, the H.264 encoder can choose from the four
different macroblock divisions shown in the top half of Figure 5.6. Any 8x8 sized blocks may be
further subdivided into any of the four partitions shown in the bottom half of Figure 5.6. The
encoder must assign a translational motion vector to each of these partitions and transmit these
motion vectors to the decoder. The decoder then selects data from previously decoded frames
and offsets it in accordance with the received motion vectors to generate its prediction of the
current block. Partitioning a macroblock into smaller blocks (e.g. 4x4 blocks) obviously allows
[Figure 5.7 graphic: frames n-3, n-2, and n-1 available as reference frames for predicting frame n.]
Figure 5.7: The H.264 encoder can use multiple reference frames to generate motion
compensated predictions of the current frame.
much more flexibility than larger partitions and will likely lead to better prediction. However,
the use of smaller partitions requires the encoder to transmit many more motion vectors, which
may or may not outweigh any gains in performance.
5.1.3 Multiple Reference Frames
Typically, it is most efficient for the encoder to predict the current frame from the directly
previous frame. However, this is not always the case. Occasionally, it can be more efficient to
predict from 2 or more frames back. In H.264, more than one reference frame can be used for
motion compensated prediction. This idea is illustrated in Figure 5.7. Every block within a MB
(16x16, 8x16, 16x8, 8x8) can use its own reference frame. Blocks smaller than 8x8 (4x8, 8x4, 4x4)
all use the same reference frame.
The use of multiple reference frames especially helps to solve the "uncovered-background"
problem where a background region may have been visible two or more frames back but was
temporarily hidden by a moving object in the previous frame. If the encoder only considers the
previous frame it will not likely find a good match for this uncovered region since it was hidden
at that point. However, if the encoder looks back more than one frame, this background region
would again be visible.
[Figure 5.8 graphic: the six-tap half-pel interpolation filter; in H.264 the tap weights are (1, -5, 20, 20, -5, 1) with a division by 32.]
Figure 5.8: Six-tap interpolation filter used to generate half pixel locations
Using multiple reference frames generally leads to an increase in efficiency, but increases
the complexity of the motion vector search and requires a significant amount of memory for
frame storage. The encoder in H.264 can use up to 16 previous frames for reference, although
using all 16 is not a requirement. The encoder signals to the decoder how many frames will be
used at the beginning of the sequence so the decoder knows how much memory it will need to
make available. The use of multiple reference frames may also be used by the encoder for error
resilience purposes. For instance, the temporal splitting MD approach presented in Figure 1.5
may be implemented simply by forcing the encoder to exclusively use two-back temporal
prediction.
5.1.4 Quarter Pixel Motion Vector Accuracy
Each motion vector in H.264 uses sub-pixel motion vector accuracy; specifically H.264 uses
quarter pixel accuracy for each motion vector. This allows the encoder to more effectively
compensate for non-integer pixel motion in the sequence. The six-tap interpolation filter shown
in Figure 5.8 is used to generate half pixel samples, and a two-point averaging filter is used to
generate quarter pixel locations.
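A sketch of the half-pel step follows (Python; the tap weights (1, -5, 20, 20, -5, 1)/32 are the standard H.264 luma filter, while the function name and the clipping details here are illustrative).

import numpy as np

HALF_PEL_TAPS = np.array([1, -5, 20, 20, -5, 1])

def half_pel_row(row):
    """Interpolate the horizontal half-pixel positions of one row of integer pixels.
    Position i+1/2 is filtered from the six surrounding integer pixels."""
    row = np.asarray(row, dtype=int)
    out = []
    for i in range(2, len(row) - 3):
        val = int(np.dot(HALF_PEL_TAPS, row[i - 2:i + 4]))
        out.append(min(255, max(0, (val + 16) >> 5)))   # divide by 32 with rounding
    return np.array(out)

# Quarter-pixel positions are then obtained by averaging neighboring
# integer- and half-pixel samples with the two-point filter.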
[Figure 5.9 graphic: panels (a) and (b), before and after deblocking.]
Figure 5.9: (a) Coded frame prior to applying the deblocking filter. (b) Resulting frame
after deblocking. Adaptive filtering is applied at block boundaries to reduce the
appearance of blocking artifacts.
5.1.5 In-Loop Deblocking Filter
At lower bitrates, block-based compression can lead to blocking artifacts in coded frames. If the
coefficients are too coarsely quantized, discontinuities can appear at boundaries of blocks in
regions that should have been smoothly varying, see Figure 5.9 (a). The H.264 codec uses an in-loop deblocking filter to help remove these blocking artifacts. The filtering is done using a one-dimensional filter along block edges, where the strength of the filtering is adapted to account for
the quantization levels used in that region and for the local activity in the neighborhood of the
boundary. The results after filtering Figure 5.9 (a) are shown in (b).
This process is referred to as in-loop filtering since it is actually used within the motion
compensated prediction loop, see Figure 5.1. By using filtered frames for motion compensation
the encoder is often able to produce more efficient predictions resulting in further improvements
in coding efficiency. Typical results show bitrate reductions of 5-10% from using this filter at a
fixed quality level [40].
Symbol    Codeword
0         1
1         010
2         011
3         00100
4         00101
5         00110
6         00111
7         0001000
8         0001001
9         0001010
10        0001011
11        0001100
12        0001101
13        0001110
14        0001111
Table 5.1: Exponential-Golomb codebook for encoding all syntax elements not encoded
using CAVLC or CABAC encoding.
5.1.6 Entropy Coding
After motion compensation, DCT transformation, and quantization of residual transform
coefficients, the compressed video information must be converted to specific codewords (strings
of 1s and 0s) before being placed into the output bitstream. Entropy coding is used to represent
this compressed video information (motion vectors, quantized residual data, etc...) with as few
bits as possible. The H.264 coding standard provides two different entropy coding options,
universal variable length coding (UVLC) or context-adaptive binary arithmetic coding
(CABAC).
The less complex UVLC approach uses exponential Golomb codes for all syntax elements
except for transform coefficients. Each syntax element is assigned an index or symbol number
and associated codeword, with more likely outcomes assigned to shorter codewords. The first
few entries in the exp-Golomb codebook are shown in Table 5.1. Each codeword has N zeros
followed by a '1' followed by N bits of data, allowing for a very simple decoding structure. The
residual transform coefficients are encoded using context-adaptive variable length coding
(CAVLC). These adaptive variable length codes are similar to Huffman codes that adapt to
previously transmitted coefficients from earlier blocks so as to more closely match the statistical
properties of this particular video frame and more efficiently code the resulting data.
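The structure of Table 5.1 is easy to reproduce; a sketch of the mapping from symbol number to codeword (Python):

def exp_golomb(symbol):
    """Return the exp-Golomb codeword for a non-negative symbol number:
    N zeros, a '1', then the N low-order bits of (symbol + 1)."""
    value = symbol + 1
    n = value.bit_length() - 1
    return '0' * n + format(value, 'b')

print([exp_golomb(k) for k in range(7)])
# ['1', '010', '011', '00100', '00101', '00110', '00111']  (matches Table 5.1)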
The second entropy coding mode in H.264 uses context-adaptive binary arithmetic coding
(CABAC). This approach provides increased efficiency relative to the CAVLC approach, yet
requires significantly higher complexity. Arithmetic coding in a sense allows for joint encoding
of all the syntax elements from a frame allowing the encoder to assign non-integer length
codewords to syntax elements, including the possibility of using less than one bit per syntax
element. CABAC encoding is also used on a much wider range of syntax elements, unlike
CAVLC, which is used only on residual transform coefficients. For further details on CABAC
encoding see [20, 30, 39].
5.1.7 H.264 Performance
The original goal for H.264 was to improve coding efficiency by a factor of two over previous
coding standards. Subsequently, a number of performance comparisons have been made to
estimate how successful the H.264 standard has been in achieving this goal [13, 35, 39, 40, 51].
In [35], the authors have encoded a set of nine CIF (Common Intermediate Format - 288 lines x
352 columns) and QCIF (Quarter Common Intermediate Format - 144 lines x 176 columns)
resolution sequences with H.264, MPEG4, H.263, and MPEG2 at a number of different bitrates.
While maintaining the same quantitative quality level (PSNR), the results with H.264 show an
average bitrate reduction of 39% relative to MPEG4, 49% relative to H.263, and 64% relative to
MPEG2.
Given that quantitative measures are not perfectly correlated with human perceptions of
video quality, a number of subjective tests have been performed as well. Perceptual tests
reported in [13] with a large test set ranging from QCIF to HD resolutions indicate that in
roughly 65% of the cases H.264 is able to increase compression efficiency by a factor of two or
more.
The results of these performance evaluations and others like them have piqued the interest
of a number of industries. H.264/AVC is currently being integrated into a number of applications
including video phones, video conferencing, streaming video, HD DVD, satellite TV, and
broadcast HDTV (in some countries). It is our belief that H.264 will be widespread in the near
future which is why we have elected to use H.264 for this research.
5.2 MD System Implementation
The previous section has introduced the portions of the H.264 standard that are most relevant to
our discussion of adaptive MD mode selection. This section provides further details on the
modifications we have made to implement MD mode selection. The system developed for this
thesis uses H.264 reference software version 8.6 with necessary modifications to support
adaptive mode selection. The adaptive mode selection is performed on a macroblock-by-macroblock basis using the Lagrangian optimization techniques discussed in Chapter 3 along
with the expected distortion modeling from Chapter 4. Note that this optimization is performed
simultaneously for both traditional coding decisions (e.g. inter versus intra coding) as well as for
selecting one of the possible MD modes. Due to the in-loop deblocking filter in H.264, output
pixels from the current macroblock will depend on neighboring macroblocks, including blocks
that have yet to be coded. This dependence on pixels that have yet to be coded presents a
causality problem for the encoder. The deblocking filter has been turned off in our experiments
to remove this causality issue and simplify the problem.
The system uses UVLC entropy coding with quarter pixel motion vector accuracy and all
available intra- and inter-prediction modes. However, we have used an option in H.264 referred
to as constrained intra-prediction. When used, constrained intra-prediction prevents the encoder
from using inter-coded blocks for intra-frame prediction. That is to say intra-macroblocks are
predicted only from other intra-macroblocks. By using this option, the process of intra-prediction
is less efficient; however this prevents intra-frame error propagation. Due to the use of motion
compensated prediction, errors in previous frames can propagate into inter-coded blocks in the
current frame. If these inter-coded blocks are then used for intra-frame prediction, these errors
will then also propagate spatially throughout intra-coded blocks of the frame. It would be
possible to modify the ROPE model to account for unconstrained error propagation by adding a
second random term to the intra-pixel calculations. However, the use of intra-coding in error-resilient video streaming is mainly intended for restoring the prediction loop and ending any
error propagation, and by allowing intra-frame error propagation we would essentially defeat this
purpose. For this reason, we have elected to use constrained intra-prediction.
[Figure 5.10 graphic: panels (a) Single Description Coding (SD), (b) Temporal Splitting (TS), (c) Spatial Splitting (SS) with even and odd lines, and (d) Repetition Coding (RC).]
Figure 5.10: Examined MD coding methods: (a) Single Description Coding: each frame
is predicted from the previous frame in a standard manner to maximize compression
efficiency. (b) Temporal Splitting: even frames are predicted from even frames and odd
from odd. (c) Spatial Splitting: even lines are predicted from even lines and odd from
odd. (d) Repetition Coding: all coded data repeated in both streams.
5.2.1 Examined MD Coding Modes
This thesis explores the concept of adaptive MD mode selection in which the encoder switches
between different coding modes within a sequence in an intelligent manner. To illustrate this
idea, the system discussed here uses a combination of four simple MD modes: single description
coding (SD), temporal splitting (TS), spatial splitting (SS), and repetition coding (RC); see
Figure 5.10. This section describes each of these methods and their relative advantages and
disadvantages.
Single description (SD) coding represents the typical coding approach where each frame is
predicted from the previous frame in an attempt to remove as much temporal redundancy as
possible. Of all the methods presented here, SD coding has the highest coding efficiency and the
lowest resilience to packet losses. On the other extreme, repetition coding (RC) is similar to the
SD approach except the data is transmitted once in each description. This obviously leads to poor
coding efficiency, but greatly improves the overall error resilience. As long as both descriptions
of a frame are not lost simultaneously, there will be no effect on decoded video quality. The
remaining two modes provide additional tradeoffs between error resilience and coding
efficiency. The temporal splitting (TS) mode effectively partitions the sequence along the
temporal dimension into even and odd frames. Even frames are predicted from even frames and
odd frames from odd frames. Similarly, in spatial splitting (SS), the sequence is partitioned along
the spatial direction into even and odd lines. Even lines are predicted from even lines and odd
from odd. Table 5.2 presents an overview of the relative advantages and disadvantages of each
mode.
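Before discussing the rationale for this particular set, the prediction structure implied by each mode can be summarized as in the sketch below, which maps a mode and frame index to the frame it predicts from. It is a simplified illustration (SS field handling is reduced to a parity note), not part of the encoder.

    # Simplified view of the temporal prediction source for each MD mode
    # (frame indices start at 0; the function name is illustrative only).
    def prediction_source(md_mode, frame_idx):
        if md_mode == "SD":   # predict from the immediately previous frame
            return frame_idx - 1, "all lines"
        if md_mode == "TS":   # even frames from even frames, odd from odd
            return frame_idx - 2, "all lines"
        if md_mode == "SS":   # even lines from even lines, odd from odd
            return frame_idx - 1, "same-parity lines only"
        if md_mode == "RC":   # same prediction as SD, but data repeated in both streams
            return frame_idx - 1, "all lines"
        raise ValueError("unknown MD mode: " + md_mode)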
We chose to examine these particular modes for the following reasons. First, these methods
tend to complement each other well with one method strong in situations where another method
is weak, and vice versa. This attribute will be further illustrated in Chapter 6. Secondly, each MD
mode makes a different tradeoff between compression efficiency and error resilience. This set of
modes examines a wide range of the compression efficiency/error resilience spectrum, from
most efficient single description coding to most resilient repetition coding. Finally, these
approaches are all fairly simple both conceptually and from a complexity standpoint.
Conceptually, it is possible to quickly understand where each one of these modes might be most
or least effective, and in terms of complexity, the decoder in this system is not much more
complicated than the standard video decoder. It is important to note that additional MD modes of
interest may be straightforwardly incorporated into the adaptive MD encoding framework and
the associated models for determining the optimized MD mode selection. In addition, it is
possible to account for improved MD decoder processing that may lead to reduced distortion
from losses (e.g. improved methods of error recovery where a damaged description is repaired
by using an undamaged description [1, 2]), and thereby affect the end-to-end distortion
estimation performed as part of the adaptive MD encoding.
Note that when coded in a non-adaptive fashion, each method (SD, TS, SS, RC) is still
performed in an R-D optimized manner as mentioned above. All of the remaining coding
decisions, including inter versus intra coding, are made to minimize the end-to-end distortion.
For instance, the RC mode is not simply a straightforward replica of the SD mode. The system
recognizes the improved reliability of the RC mode and elects to use far less intra-coding,
allowing more intelligent allocation of the available bits.

SD - Single Description Coding
    Advantages:    Highest coding efficiency of all methods.
    Disadvantages: Least resilience to errors.

TS - Temporal Splitting
    Advantages:    Good coding efficiency with better error resilience than SD coding. Works well in regions with little or no motion.
    Disadvantages: Increased temporal distance reduces the effectiveness of temporal prediction, leading to a decrease in coding efficiency.

SS - Spatial Splitting
    Advantages:    High resilience to errors with better coding efficiency than RC. Works well in regions with some amount of motion.
    Disadvantages: Field coding leads to decreased coding efficiency, typically lower than that of the TS mode.

RC - Repetition Coding
    Advantages:    Highest resilience to errors of all the methods. The loss of either stream has no effect on decoded quality.
    Disadvantages: Repetition of data is costly, leading to low coding efficiency.

Table 5.2: List of MD coding modes along with their relative advantages and disadvantages.
5.2.2 Data Packetization
The H.264 reference software has an integrated RTP packetization structure. We have made the
assumption that a single frame is placed entirely within one packet per stream. The
experiments we have performed each use QCIF resolution video (see Section 6.1), which
consists of relatively small coded frames, so this assumption is reasonable.
The packetization of data differs slightly for each mode (see Figure 5.11). In both the SD
and TS approaches, all data for a frame is placed into a single packet. The even frames are then
sent along one stream and the odd frames along the other. In the SS and RC approaches, by contrast,
data from a single frame is coded into packets placed in both streams: for SS, even lines are
sent in one stream and odd lines in the other, while for RC all data is repeated in both
streams. Therefore, for SD and TS each frame is coded into one large packet that is sent in
alternating streams, while for SS and RC each frame is coded into two smaller packets and one
small packet is sent in each stream. Since the adaptive approach (ADAPT) is a combination of
these four methods, there is typically one slightly larger packet and one smaller packet per frame,
and these alternate between streams from frame to frame.
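The per-mode packet assignment just described can be summarized by the following sketch, which maps a coded frame to the packet (if any) carried by each of the two streams. It is an illustrative sketch under the one-packet-per-frame-per-stream assumption above, not the actual RTP packetization code.

    # Sketch of per-frame packetization for each MD mode (illustrative names).
    def packetize(md_mode, frame_idx, coded_frame, coded_even_lines=None, coded_odd_lines=None):
        """Return (packet for stream 0, packet for stream 1); None means nothing is sent."""
        if md_mode in ("SD", "TS"):
            # Whole frame in one packet, alternating streams from frame to frame.
            return (coded_frame, None) if frame_idx % 2 == 0 else (None, coded_frame)
        if md_mode == "SS":
            # Even lines in one stream, odd lines in the other.
            return coded_even_lines, coded_odd_lines
        if md_mode == "RC":
            # All coded data repeated in both streams.
            return coded_frame, coded_frame
        raise ValueError("unknown MD mode: " + md_mode)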
Figure 5.11: Packetization of data in MD modes. a.) SD and TS: Data sent along one
path alternating between frames. b.) SS and RC: Data spread across both streams. c.)
ADAPT: Combination of the two resulting in one slightly larger packet and one slightly
smaller.
If a frame is lost in either the TS or SD method, no data exists in the opposite stream at the
same time instant, so the missing data is estimated by directly copying the associated pixels from
the previous frame. Note that here we copy from the previous frame in either description, not the
previous frame in the same description. In the SS method, if only one description is lost the
decoder estimates the missing lines in the frame using linear interpolation, and if both are lost it
estimates the missing frame by copying the previous frame. Similarly for RC, if only one
description is lost the decoder can use the data in the opposite stream, while if both are lost it
copies the previous frame.
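A sketch of these decoder-side concealment rules is given below. The array handling (NumPy frames and a simple border clamp in the interpolation) is our own illustration rather than the decoder implementation; only the dispatch logic mirrors the rules just described.

    import numpy as np

    def interpolate_missing_field(known_rows, known_parity):
        """Rebuild a full frame from one field by linearly interpolating the missing lines."""
        k, w = known_rows.shape
        frame = np.zeros((2 * k, w), dtype=float)
        frame[known_parity::2] = known_rows
        if known_parity == 0:   # odd lines missing: average the even lines above and below
            below = np.vstack([known_rows[1:], known_rows[-1:]])   # clamp at the border
            frame[1::2] = 0.5 * (known_rows + below)
        else:                   # even lines missing: average the odd lines above and below
            above = np.vstack([known_rows[:1], known_rows[:-1]])   # clamp at the border
            frame[0::2] = 0.5 * (above + known_rows)
        return frame

    def conceal(md_mode, pkt0, pkt1, prev_frame):
        """Reconstruct a frame from the packets received on each stream (None = lost)."""
        if md_mode in ("SD", "TS"):
            data = pkt0 if pkt0 is not None else pkt1   # only one stream carries this frame
            return prev_frame.copy() if data is None else data
        if md_mode == "SS":
            if pkt0 is None and pkt1 is None:
                return prev_frame.copy()
            if pkt1 is None:
                return interpolate_missing_field(pkt0, known_parity=0)
            if pkt0 is None:
                return interpolate_missing_field(pkt1, known_parity=1)
            full = np.zeros_like(prev_frame, dtype=float)
            full[0::2], full[1::2] = pkt0, pkt1
            return full
        if md_mode == "RC":
            if pkt0 is None and pkt1 is None:
                return prev_frame.copy()
            return pkt0 if pkt0 is not None else pkt1
        raise ValueError("unknown MD mode: " + md_mode)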
5.2.3 Discussion of Modifications Not in Compliance with H.264 Standard
Many of the changes discussed above are fully compliant with the H.264 standard; others are
not. It would be possible to implement a fully standard-compliant system (for instance, the
temporal splitting mode was implemented using standard-compliant reference frame selection
available in H.264), but that was not our main concern at this point. There were essentially three
main changes that are not in compliance with the H.264 standard: macroblock-level adaptive
interlaced coding, prevention of intra-frame error propagation between MD modes, and the
redefinition of skipped macroblocks for the TS mode. These are discussed below.
The first significant modification was the ability to support macroblock-level adaptive
interlaced coding in order to accommodate the spatial splitting (SS) mode. The H.264 standard
allows for adaptive frame/field coding, but only on a macroblock-pair basis. In H.264
macroblock-adaptive frame/field (MBAFF) coding, each vertical pair of macroblocks is either
coded in frame or field mode. This macroblock-pair processing prevented the use of macroblock-level MD mode selection. In addition, it was not possible to use MBAFF coding with QCIF
resolution video since QCIF video contains 9 macroblock rows, a number that cannot be evenly
divided into pairs.
Secondly, a few modifications were made to prevent intra-frame propagation of errors
between MD modes. Consider the case of an RC block surrounded by SD blocks. The intention
of repetition coding is that this data can be properly decoded even if one of the streams is lost.
However, the H.264 codec contains many intra-frame predictions including motion vector
prediction and intra-prediction. If the RC block is predicted in any manner from the surrounding
SD blocks, the loss of one stream could alter this surrounding data and propagate these errors
into the RC block. The same situation is true for SS coding as well: if one of the streams is lost,
the field in the opposite stream should still be decodable. However, if this field is predicted from
neighboring blocks, errors in these neighbors can propagate into the current field. For this reason
we have only allowed macroblocks to be predicted from other macroblocks using the same MD
coding mode. Each of the neighboring macroblocks which use a different MD mode is
considered unavailable for prediction to prevent this propagation of errors between modes. Note
that this approach does not prevent propagation of errors from previous frames. For example, it
is still possible for an inter-coded RC block to be corrupted by errors that occurred in previous
frames. Additionally, for the SS approach we have prevented blocks in the bottom field from
being predicted from blocks in the top field to ensure errors in the top field do not propagate to
the bottom field.
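One way to picture this restriction is as a modified availability check applied when gathering neighbors for intra prediction and motion vector prediction, as in the sketch below. The macroblock attributes (md_mode, field, coded) are hypothetical names used for illustration, not fields of the reference software's data structures.

    # Sketch of the neighbor-availability rule: a neighboring macroblock may only
    # be used for intra-frame prediction if it exists, is coded, uses the same MD
    # mode, and (for SS) does not cross from the top field into the bottom field.
    def neighbor_available(current_mb, neighbor_mb):
        if neighbor_mb is None or not neighbor_mb.coded:
            return False
        if neighbor_mb.md_mode != current_mb.md_mode:
            return False   # prevents intra-frame error propagation between MD modes
        if (current_mb.md_mode == "SS" and current_mb.field == "bottom"
                and neighbor_mb.field == "top"):
            return False   # top-field errors must not reach the bottom field
        return True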
The last modification was a slight redefinition of skipped macroblocks for the TS mode.
This modification was only used when the TS mode was used by itself. When a macroblock is
skipped, or not coded, the decoder typically copies a 16x16 block of data from the previous
frame. However, the main concept behind the TS mode was to code even and odd frames
separately. With this in mind, the skip mode in the TS approach has been redefined such that the
decoder copies a block of data from two frames back instead and thereby maintains this
separation.
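In effect, the redefinition changes a single reference index for skipped blocks, as the following minimal sketch (with illustrative names) suggests.

    # Sketch of the redefined skip-mode copy used when TS is coded by itself:
    # a skipped macroblock copies its pixels from two frames back rather than
    # from the previous frame, preserving the even/odd prediction separation.
    def skip_reference_frame(md_mode, frame_idx):
        return frame_idx - 2 if md_mode == "TS" else frame_idx - 1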
Chapter 6
Experimental Results and Analysis
This chapter presents a number of results that have been obtained using the modified H.264
codec described in Chapter 5. It is again important to point out that this is one particular
implementation of an adaptive MD mode selection system and that different realizations of such
a system could yield vastly different results. The results presented here are intended to
demonstrate that there can be significant gains from using adaptive MD mode selection.
However with a different set of MD modes, it is possible that these gains could increase or
possibly diminish. To estimate the actual distortion experienced at the decoder, we have
simulated bursty packet losses with packet loss rates and expected burst lengths as specified in
each section below. Unless otherwise stated, we have run each simulation with 300 different
packet loss traces and have averaged the resulting squared-error distortion. The same packet loss
traces were used throughout a single experiment to allow for meaningful comparisons across the
different MD coding methods.
Each path in the system is assumed to carry 30 packets per second where the packet losses
on each path are modeled as a Gilbert process. For wired networks, the probability of packet loss
is generally independent of packet size so the variation in sizes should not generally affect the
results or the fairness of this comparison. When the two paths are balanced or symmetric, the
optimization automatically sends half the total bitrate across each path. For unbalanced paths, the
adaptive system results in a slight redistribution of bandwidth as is discussed in Section 6.6.
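For readers wishing to reproduce this kind of simulation, the sketch below generates a bursty loss trace for one path with a two-state Gilbert model. It assumes the common parameterization in which the expected burst length is the reciprocal of the bad-to-good transition probability and the stationary loss probability equals the average packet loss rate; the function name and interface are our own.

    import random

    def gilbert_trace(num_packets, loss_rate, burst_length, seed=None):
        """Generate a bursty packet loss trace from a two-state Gilbert model.

        loss_rate is the average packet loss probability and burst_length the
        expected number of consecutive losses. Returns a list of booleans where
        True means the packet was lost. (Sketch under an assumed parameterization.)
        """
        rng = random.Random(seed)
        p_bad_to_good = 1.0 / burst_length                              # leave the loss state
        p_good_to_bad = p_bad_to_good * loss_rate / (1.0 - loss_rate)   # enter the loss state
        lost = rng.random() < loss_rate                                 # start in the stationary state
        trace = []
        for _ in range(num_packets):
            trace.append(lost)
            stay_or_enter = (1.0 - p_bad_to_good) if lost else p_good_to_bad
            lost = rng.random() < stay_or_enter
        return trace

    # Example: one path carrying a 400-packet trace at 5% loss and burst length 3.
    # trace = gilbert_trace(400, 0.05, 3.0, seed=1)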
In order for the ROPE model to estimate expected end-to-end distortion the encoder needs
some knowledge of current network conditions. We assume a feedback channel exists so that the
receiver can transmit information (e.g. notification of packet losses) back to the sender, allowing
the sender to form an approximation of the current network conditions. Unless otherwise
stated, each of these experiments assumes the encoder has perfect knowledge of both the average
packet loss rate and expected burst length. The effects of imperfect channel knowledge are
further explored in Section 6.7.
In each of these experiments, the encoder is run in one of two different modes: constant
bitrate encoding (CBR) or variable bitrate encoding (VBR). In the CBR mode, the quantizer and
the associated lambda value are adjusted on a macroblock basis in an attempt to keep the number of
bits used in each frame approximately constant. Keeping the bitrate constant allows a number of
useful comparisons between methods on a frame by frame basis such as those presented in
Figure 6.5. Unfortunately, the changes in quantizer level must be communicated along both
streams in the adaptive approach, which leads to somewhat significant overhead. While this
signaling information is included in the bitstream, the amount of signaling overhead is not
incorporated in the R-D optimization decision process, hence leading to potentially sub-optimal
decisions with the adaptive approach. We mention this because, if all of the overhead were
accounted for in the R-D optimized rate control, the performance of the adaptive method would be
slightly better than shown here; the CBR results are therefore lower
bounds on the achievable CBR performance. In the VBR mode, the quantizer level is held fixed
to provide constant quality. In this case there is no quantizer overhead and this approach yields
results closer to the optimal performance. Since the rates of each mode may vary when in VBR
mode (where the quantizer is held fixed), it is not possible to make a fair comparison between
different modes at a given bit rate. Therefore, in experiments where we try to make fair
comparisons among different approaches at the same bit rate per frame we operate in CBR
mode, e.g. Figure 6.5, and we use VBR mode to compute rate-distortion curves, like those
shown in Figure 6.12.
6.1 Test Sequences
The following two test sequences have been used for providing the experimental results
presented in subsequent sections. We have done these experiments on other sequences as well
with similar results but present only these two in the interest of saving space.
Both of these sequences are progressively scanned video sequences with QCIF resolution
(144 lines x 176 columns) at a frame rate of 30 frames per second. The Foreman sequence
consists of 400 frames and the Carphone sequence consists of 382 frames. The first frame from
both sequences is shown in Table 6.1.
Sequence Name:   Foreman        Carphone
Scan Mode:       Progressive    Progressive
Frames:          400            382
Rows:            144            144
Columns:         176            176
Frame Rate:      30 fps         30 fps

Table 6.1: Test Sequences
The QCIF resolution was selected since much of the prior work in the error resilience field
has focused on QCIF video which allows this thesis to be more easily compared with other
research in the field. The small frame size also allows for quicker experimentation and analysis
since encoding/decoding times are significantly reduced. However, the results presented in this
thesis are directly applicable to higher resolution video as well. With QCIF video, we have made
the assumption that one frame fits within one packet per stream. When considering higher
resolution video sequences, it may be appropriate to split coded frames into multiple packets,
which would also allow for more advanced intra- and inter-frame error concealment techniques.
Figure 6.1: Comparison between actual and expected end-to-end PSNR. (a) Foreman
sequence. (b) Carphone sequence. This figure demonstrates the ability of the model to
track the actual end-to-end distortion, since the actual values line up quite closely with
the expected values. Also shown on this figure is the quantization-only distortion, which
shows the distortion from compression and without any packet loss.
6.2 Performance of Extended ROPE Algorithm
This section analyzes the performance of the extended ROPE algorithm. As discussed in Chapter
3, performing R-D optimization over lossy packet networks requires an accurate estimate of the
end-to-end distortion experienced at the decoder. The extended ROPE model presented in
Chapter 4 provides one approach for achieving this result. It is important for the model of
expected distortion to accurately estimate the actual distortion since each of the encoder's
decisions will be made based on this modeling. If the model accurately estimates the distortion
then minimizing the expected distortion as calculated by the model will have the effect of
minimizing the expected actual distortion experienced at the decoder. In this section we examine
the capability of the extended ROPE algorithm to estimate the expected end-to-end distortion
experienced at the decoder accounting for local characteristics of the video as well as network
conditions on multiple paths.
The results of the first experiment are shown in Figure 6.1. To generate this figure we have
coded the Foreman and Carphone test sequences at approximately 0.4 bits per pixel (bpp) with
the H.264 video codec using the SD approach mentioned in Chapter 5. The channel has been
modeled as a two path channel, where the paths are symmetric with Gilbert losses at an average
packet loss rate of 5% and expected burst length of 3 packets. The expected distortion as
calculated at the encoder using the above model has been plotted relative to the actual distortion
experienced by the decoder. This actual distortion was calculated by using 1,200 different packet
loss traces and averaging the resulting squared error distortion for each frame before computing
the PSNR per frame. As shown in both of these sequences, the proposed model is able to track
the end-to-end expected distortion quite accurately. Also shown in this figure for reference is the
quantization only distortion (with no packet loss).
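The per-frame averaging used for the actual-distortion curves can be written compactly as in the sketch below, assuming the decoded frames from each loss trace and the original source frames are available as arrays; the function name and interface are illustrative.

    import numpy as np

    def per_frame_psnr(original_frames, decoded_runs, peak=255.0):
        """Average the squared error over all loss traces for each frame, then take PSNR.

        original_frames: list of H x W arrays (the uncompressed source frames).
        decoded_runs: one list of decoded frames per simulated packet loss trace.
        Assumes the averaged squared error is nonzero for every frame.
        """
        psnr = []
        for i, ref in enumerate(original_frames):
            mse = np.mean([np.mean((run[i].astype(float) - ref) ** 2) for run in decoded_runs])
            psnr.append(10.0 * np.log10(peak ** 2 / mse))
        return psnr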
The second example demonstrates the performance of the model under changing network
conditions. In this experiment, the Foreman sequence was coded with the TS approach presented
in Chapter 5 and the packet loss rate was varied as shown in Figure 6.2. The average packet loss
rate jumped up to 10% for frames 50-150 on Path 0 and similarly for frames 250-350 on Path 1.
The expected burst length was held constant at 3 packets during the intervals when losses
occurred. Again, the actual distortion was calculated by using 1,200 different packet loss traces
and averaging the resulting squared error distortion. Figure 6.3 (a) demonstrates how the
expected outcome, as calculated by the ROPE approach, follows the actual result quite closely.
Figure 6.2: Time varying packet loss rates to examine the performance of the proposed
system under time varying conditions. Here the packet loss rate jumps up from 0% to
10% first on one stream and then on the other. The expected burst length is 3 packets.
With the TS approach, when even frames are lost it only affects even frames and when odd
frames are lost it only affects odd frames. For instance, when the loss rate jumps up on Path 0 the
quality of the odd frames drops but the quality of the even frames remains unaffected. When the
loss rate on Path 1 jumps up the quality of the even frames drops but the odd frames are
unaffected. This characteristic causes the rapid fluctuation shown in Figure 6.3 (a). Figure 6.3 (b)
shows the same result with the even and odd frames plotted separately to make the figure easier
to read.
The final example presented in this section illustrates the effect of burst length on the
resulting distortion and provides some motivation as to why it was important to compensate for
bursty packet losses in the ROPE algorithm. Figure 6.4 (a) shows the Foreman sequence encoded
with the SD method and (b) with the TS method, both at 0.4 bpp and a fixed 5% average packet
loss rate. In this experiment the encoder made the assumption that the packet losses follow a
Bernoulli model where each packet is lost independently of any other packet. If this assumption
happens to be accurate, then the model is again able to estimate the actual result relatively well.
If however the actual packet losses are bursty in nature then there can be significant deviations
between the model and the actual result. The red lines in Figure 6.4 show the actual result at the
decoder if the losses actually had an expected burst length of 6 packets. Given the errors between
the estimate and the actual result, it is quite possible for the encoder to make less than optimal
decisions that would ultimately affect the performance of the overall system. It is of particular
interest to notice how bursty losses tend to have a negative effect on the SD approach while the
opposite is true for the TS approach. If the encoder had been deciding between TS and SD
coding, this error could severely affect the decision process. This result is one of the main
reasons we felt it was important to account for bursty packet losses in the extended ROPE
algorithm.
Figure 6.3: Comparison between actual and expected end-to-end PSNR with time
varying packet loss rates. (a) Foreman sequence encoded at 0.4 bpp with the TS
approach. (b) Same result with PSNR for even and odd frames plotted separately for
easier understanding. The model again matches the actual result quite closely.
Figure 6.4: Comparison between actual result with Bernoulli losses and actual result
with bursty losses. Foreman sequence coded at 0.4 bpp with balanced paths and 5%
average packet loss rate. (a) SD coding (b) TS coding
6.3 MD Coding Adapted to Local Video Characteristics
Having shown the capability of the ROPE algorithm to accurately model the expected end-to-end distortion experienced at the decoder, we next examine the system's ability to adapt to the
characteristics of the video source. The channel in this experiment was simulated with two
balanced paths each having 5% average packet loss rate and expected burst length of 3 packets.
The video was coded in CBR mode at approximately 0.4 bits per pixel (bpp). Figure 6.5
demonstrates the resulting distortion in each frame averaged over 300 packet loss traces for the
adaptive MD method and each of its non-adaptive MD counterparts.
The Foreman sequence contains a significant amount of motion from frames 250 to 350
and is fairly stationary from frame 350 to 399. Notice how the SS/RC methods work better
during periods of significant motion while the SD/TS methods work better as the video becomes
stationary. The adaptive method intelligently switches between the two, maintaining at least the
best performance of any non-adaptive approach. Since the adaptive approach adapts on a
macroblock level, it is often able to do even better than the best non-adaptive case by selecting
different MD modes within a frame as well. Similar results can be seen with the Carphone
sequence. The best performing non-adaptive approach varies from frame to frame depending on
the characteristics of the video. The adaptive approach generally provides the best performance
of each of these.
Also shown in Figure 6.5 are the results from a typical video coding approach that we will
refer to as standard video coding (STD). Here R-D optimization is only performed with respect
to quantization distortion, not the end-to-end R-D optimization used in the other approaches.
Instead of making inter/intra coding decisions in an end-to-end R-D optimized manner as
performed by SD, it periodically intra updates one line of macroblocks in every other frame to
combat error propagation (this update rate was chosen as the optimal intra refresh rate [38], which
is typically approximately 1/p, where p is the packet loss rate).
The adaptive MD approach is able to outperform optimized SD coding by up to 2 dB for
the Foreman sequence, depending on the amount of motion present at the time. Note that by
making intelligent decisions through end-to-end R-D optimization, the SD method examined
here is able to outperform the conventional STD method by as much as 4 or 5 dB for the
Figure 6.5: Average distortion in each frame for ADAPT versus each non-adaptive
approach. Coded at 0.4 bpp with balanced paths and 5% average packet loss rate and
expected burst length of 3. (a) Foreman sequence. (b) Carphone sequence.
Figure 6.6: Distribution of selected MD modes used in the adaptive method for each
frame of the Foreman sequence illustrating how mode selection adapts to the video
characteristics. 5% average packet loss rate, expected burst length 3.
Foreman sequence. The adaptive MD approach outperforms optimized SD coding by up to 1 dB
with the Carphone sequence, and optimized SD coding outperforms the conventional STD
approach by up to approximately 3 dB. Again, the use of a different set of MD modes could
certainly increase or decrease these gains.
In Figure 6.6 we illustrate how the mode selection varies as a function of the characteristics
of the video source. Specifically we show the percentage of macroblocks using each MD mode
in each frame of the Foreman sequence. From this distribution of MD modes, one can roughly
segment the Foreman sequence into three distinct regions: almost exclusively SD/TS in the last
50 frames, mostly SS/RC from frames 250-350, and a combination of the two during the first
half. This matches up with the characteristics of the video which contains some amount of
motion at the beginning, a fast camera scan in the middle, and is nearly stationary at the end.
The next two pages provide some visual results from the same experiment. The results
presented in this and subsequent sections focus on quantitative performance as measured by
PSNR. However, as was mentioned in the first chapter of this thesis, PSNR values are not
always directly correlated with human perceptions of video quality. While no formal perceptual
testing was performed for this thesis, a number of informal qualitative assessments have been
performed to verify that the conclusions we have derived from PSNR measurements can also be
confirmed with visual analysis. With the addition of random channel losses it becomes difficult
to assess the quality of any one method by a single realization of the channel. Any one particular
packet loss trace may favor one method over another. The quantitative results presented in this
thesis are averaged over many realizations of the channel to provide a more accurate and fair
comparison. That being said, we provide the following sets of images to facilitate a discussion of
some of the properties exhibited by each of these modes, and not as a means of fairly judging
performance. These results are from the Foreman sequence, but similar results can be shown
with the Carphone sequence as well.
Figure 6.7 shows frame 5 from the Foreman sequence encoded with each approach. A burst
of losses occurred along one path, affecting frames 1 and 3. As can be seen in Figure 6.7, the SD
and STD are quite distorted due to propagation of errors into the 5th frame. These two
approaches generally demonstrate about the same distortion immediately after a loss; however,
the SD approach tends to recover faster due to more intelligent intra-coding. The SS method is
also visually distorted with many jagged edges appearing due to spatial interpolation of the
missing fields. The RC mode was unaffected by this loss since it occurred on only one stream.
The TS method is slightly corrupted, but not as severely as some of the others. However, one
visual artifact that occurs with the TS modes yet does not appear in still images is the rapid
fluctuation between high and low distortion frames (even and odd frames). This fluctuation
causes very visible flicker that can be quite irritating despite the fact that half of the frames have
high PSNR. By intelligent selection of MD modes the adaptive approach is able to more
effectively protect itself against these losses and is barely distorted in this example.
Figure 6.8 shows frame 231 from the same sequence. In most cases the results are fairly
similar to the first set of images, except notice the corruption that occurs with the RC mode. In
this example losses occurred earlier in the sequence that affected both descriptions
simultaneously. Because this outcome was relatively unlikely, the RC mode used far less intra-coding. However, in the event this type of simultaneous loss does occur, it can propagate through
many frames in the RC approach as shown in this example.
Figure 6.7: Frame 5 of the Foreman Sequence. A burst of losses along one path affected
frames 1 and 3. Most of the approaches are fairly distorted except for ADAPT and RC
approaches that are relatively unaffected.
Figure 6.8: Frame 231 of the Foreman Sequence. Two bursts of losses earlier in the
sequence affected both streams simultaneously, causing severe distortions in the RC
approach.
6.4 MD Coding Adapted to Network Conditions
In this section we examine how the system performs under various network conditions;
specifically, we look at variations in average packet loss rate and expected burst length. In later
sections we also explore the behavior of the system with unbalanced paths, where one path has a
higher packet loss rate than the other, and under time varying network conditions.
6.4.1 Variations in Average Packet Loss Rate
The main purpose of these experiments was to examine the effect of variations in average packet
loss rate on the resulting performance. The channel in this experiment was simulated with two
balanced paths each with expected burst length of 3 packets. The video was coded in CBR mode
at approximately 0.4 bits per pixel (bpp) and the average packet loss rate was varied from 0 to
10%. Figure 6.9 demonstrates the resulting distortion in the sequence for the adaptive MD
method and each of its non-adaptive MD counterparts. These results were computed by first
calculating the mean-squared error distortion by averaging across all the frames in the sequence
and across the 300 packet loss traces, and then computing the PSNR.
By choosing the most efficient modes possible, the adaptive approach achieves a
performance similar to the SD approach when no losses occur, and yet the performance does not
fall off as quickly as the average packet loss rate is increased. Near the 10% loss rate, the
adaptive method adjusts for the unreliable channel and has a performance closer to the RC mode.
Note that the intra update rate for the STD method was adjusted in the experiment to be as close
as possible to 1/p, where p is the packet loss rate, as an approximation of the optimal intra
update frequency. Since this update rate could only be adjusted in an integer manner, the STD
curves in Figure 6.9 tend to have some jagged fluctuations and in some cases the curves are not
even monotonically decreasing. As an example, an update rate of 1/p would imply that one
should update one line of macroblocks every 2.22 frames at 5% loss and every 1.85 frames at
6% loss. These two cases have both been rounded to an update of one line of macroblocks every
2 frames, resulting in the slightly irregular curves.
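The rounding that produces these irregular curves can be made explicit with a small sketch, assuming the 9 macroblock rows of a QCIF frame and a target of refreshing the entire frame once every 1/p frames; the function name is illustrative.

    # Integer intra-update period for the STD method: refresh the whole frame
    # roughly once every 1/p frames, i.e. one of the 9 QCIF macroblock rows
    # every (1/p)/9 frames, rounded to the nearest integer.
    def intra_update_period(loss_rate, mb_rows=9):
        return max(1, round((1.0 / loss_rate) / mb_rows))

    # e.g. 5% loss: 20/9 = 2.22 -> period 2;  6% loss: 16.7/9 = 1.85 -> period 2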
Table 6.2 shows the distribution of MD modes in the adaptive approach at 0%, 5%, and
10% average packet loss rates. As the loss rate increases the system responds by switching from
Figure 6.9: PSNR versus average packet loss rate. (a) Foreman sequence. (b) Carphone
sequence. Video coded at approximately 0.4 bpp. The average packet loss rate for this
experiment was varied from 0-10%, and the expected burst length was held constant at 3
packets.
Foreman Sequence
MD Mode    0% Loss    5% Loss    10% Loss
SD         70.87%     50.73%     43.57%
TS         18.13%     21.22%     18.97%
SS         10.81%     10.69%     10.15%
RC          0.19%     17.36%     27.31%
(a)

Carphone Sequence
MD Mode    0% Loss    5% Loss    10% Loss
SD         67.49%     61.92%     57.51%
TS         17.29%     22.53%     20.68%
SS         15.13%      9.57%      9.19%
RC          0.09%      5.98%     12.63%
(b)

Table 6.2: Comparing the distribution of MD modes in the adaptive approach at 0%, 5%, and 10% average packet loss rates. (a) Foreman sequence. (b) Carphone sequence.
lower redundancy methods (SD) to higher redundancy methods (RC) in an attempt to provide
more protection against losses. It is interesting to point out that even at 0% loss the system does
not choose 100% SD coding. The adaptive approach recognizes that occasionally it can be more
efficient to predict from two frames ago than from the prior frame, so it chooses TS coding.
Occasionally it can be more efficient to code the even and odd lines of a macroblock separately,
so it chooses SS coding. The fact that it selects any RC at a 0% loss rate is a little counterintuitive,
but this results from the fact that coding a macroblock using RC changes the prediction dependencies
between macroblocks. The H.264 codec contains many intra-frame predictions including motion
vector prediction and intra-prediction. In order for the RC mode to be correctly decoded even
when one stream is lost, the adaptive system must not allow RC blocks to be predicted in any
manner from non-RC blocks. (If RC blocks had been predicted from SD blocks, for example, the
loss of one stream would affect the SD blocks that would consequently alter the RC data as
well.) Occasionally, prediction methods like motion vector prediction may not help and can
actually reduce the coding efficiency for certain blocks. If this is extreme enough, it can actually
be more efficient to use RC, where the prediction would not be used, even though the data is
then unnecessarily repeated in both descriptions. This happens in approximately 1 out of every
500 to 1000 macroblocks in our experiments.
6.4.2 Variations in Expected Burst Length
Next, a similar experiment was run to examine the effects of expected burst length on the
resulting PSNR. The channel in this experiment was simulated with two balanced paths each
with average packet loss rates of 5%. The video was coded in CBR mode at approximately 0.4
bits per pixel (bpp) and the expected burst length was varied from a little over 1 packet
(corresponding to Bernoulli losses) up to 10 packets. Figure 6.10 demonstrates the resulting
distortion for the adaptive MD method and each of its non-adaptive MD counterparts.
Burst length has a number of interesting effects on the performance of each of
the methods. The adaptive approach again demonstrates higher performance than any of the non-adaptive approaches across the whole range of burst lengths. Also as expected, the repetition
coding approach is generally unaffected by burst length since it is only affected by simultaneous
losses on both streams. The probability that packets from both streams are simultaneously lost
does not change with burst length since the channel has been simulated with independent paths
and the average loss rate was held constant. The SD and STD approaches both demonstrate a
decrease in quality with higher burst lengths. As explored in [27], generally the loss of a burst of
frames leads to an increase in distortion relative to the same number of isolated losses. The
correlation between errors in two sequential frames can increase the total distortion and often
results in a significant increase in distortion.
Interestingly, the TS and SS approaches both demonstrate better performance for longer
expected burst lengths. Conceptually, this can be explained by the following example. Consider
the case when a burst of L consecutive losses occurs along one path when using the TS
approach. Each lost frame will be reconstructed by an unaffected frame from the opposite
stream, so assume for the sake of this example that the distortion resulting from each of these
reconstructions is the same for each frame and is equal to D_loss. This is a fairly reasonable
assumption, given that each frame is reconstructed using a "clean" frame from the opposite
stream, yet is in sharp contrast to the SD case where each subsequent frame is reconstructed
from a frame that has also been distorted by propagating losses. Once the burst is finished after
L frames, any remaining error will propagate into future frames. Suppose the sum of the
distortion caused by this propagation of errors is equal to D_prop. Then, the total distortion due to
this burst would be

    L · D_loss + D_prop                                                            (6.1)
Figure 6.10: PSNR versus expected burst length. (a) Foreman sequence. (b) Carphone
sequence. Video coded at approximately 0.4 bpp. The expected burst length for this
experiment was varied from slightly above 1 packet (corresponding to Bernoulli losses)
to 10 packets while the average packet loss rate was held constant at 5%.
Figure 6.11: Effects of burst length on the TS mode. (a) A burst of L frames are lost
along one path and are reconstructed using unaffected frames from the opposite path. (b)
Distortion resulting from burst of L frames. (c) Distortion resulting from two shorter
bursts. In general, with the TS or SS approaches, longer burst lengths tend to cause less
distortion than a number of shorter bursts with an equal number of total losses.
Now, if there were instead two burst losses of length L/2, the total distortion would be
approximately

    2 · (L/2 · D_loss + D_prop) = L · D_loss + 2 · D_prop                          (6.2)
(6.2)
Therefore, a larger number of shorter bursts will tend to lead to more distortion in the TS
approach assuming the same total number of losses. A similar argument can be used to explain
the increase in performance with the SS method, since each lost field is reconstructed by an
unaffected "clean" field from the opposite stream. In [2] it was shown that a TS-based MD
scheme with multiple paths can be quite effective against burst losses.
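As a quick numerical illustration of (6.1) and (6.2), the sketch below plugs in assumed values D_loss = 1 and D_prop = 5 (arbitrary units of summed squared error) to show that the same number of losses spread over more bursts produces more total distortion under the TS mode.

    # Total distortion for a given number of bursts, each of the given length,
    # using the simple model of (6.1) and (6.2). D_loss and D_prop are assumed
    # illustrative values, not measured quantities.
    def burst_distortion(num_bursts, losses_per_burst, d_loss=1.0, d_prop=5.0):
        return num_bursts * (losses_per_burst * d_loss + d_prop)

    # One burst of 6 losses: 6*1 + 5 = 11.  Two bursts of 3 losses: 2*(3*1 + 5) = 16.
    assert burst_distortion(1, 6) < burst_distortion(2, 3)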
Foreman Sequence
MD Mode    Bernoulli Losses    Burst Length 5    Burst Length 10
SD         53.17%              49.26%            48.23%
TS         20.65%              22.01%            22.59%
SS          9.14%              11.20%            11.79%
RC         17.04%              17.52%            17.39%
(a)

Carphone Sequence
MD Mode    Bernoulli Losses    Burst Length 5    Burst Length 10
SD         63.83%              60.52%            58.81%
TS         22.19%              23.11%            23.90%
SS          8.17%              10.55%            11.25%
RC          5.81%               5.82%             6.05%
(b)
Table 6.3: Comparing the distribution of MD modes in the adaptive approach at a range
of expected burst lengths ranging from Bernoulli losses up to burst lengths of 10 packets.
(a.) Foreman sequence. (b) Carphone sequence.
Table 6.3 shows the distribution of MD modes in the adaptive approach at three different
burst lengths: 1.053 (Bernoulli), 5, and 10 packets. In agreement with the results shown in Figure
6.10, as the burst length increases the system begins increasing the use of the TS and SS methods
to more effectively match the conditions on the network.
6.5 End-to-End R-D Performance
Having demonstrated the ability of adaptive MD mode selection to adapt to network conditions
and to the characteristics of the video source, we next analyze the performance of this system at
a number of different bitrates. Figure 6.12 shows the end-to-end R-D performance curves of the
adaptive approach and each of its non-adaptive counterparts. This experiment was run in VBR
mode with fixed quantization levels. To generate each point on these curves, the resulting
distortion was averaged across all 300 packet loss simulations, as well as across all frames of the
sequence. The same calculation was then conducted at various quantizer levels to generate each
R-D curve.
Figure 6.12: End-to-end R-D performance of ADAPT and non-adaptive methods. 5%
packet loss rate, expected burst length 3. (a) Foreman sequence. (b) Carphone sequence.
With the current system, ADAPT is able to outperform optimized SD coding by up to 1 dB
for the Foreman sequence and about 0.5 dB for the Carphone sequence by switching between
MD methods. The ADAPT method is able to outperform the STD coding approach by as much
as 4.5 dB with the Foreman sequence and up to 3 dB with the Carphone sequence. ADAPT is
able to outperform TS, which more or less performs the second best overall, by as much as 0.5
dB.
One interesting result here is how well RC performs in these experiments. Keep in mind
that this is an R-D optimized RC approach, not simply the half-bitrate SD method repeated
twice. The amount of intra coding used in RC is significantly reduced relative to SD coding as
the encoder recognizes the increased resilience of the RC method and chooses a more efficient
allocation of bits.
Lagrangian optimization is often performed by selecting a fixed λ value appropriate for
the desired bitrate and jointly choosing both the mode and quantizer level that minimizes the
Lagrangian cost function. However, the results presented in Figure 6.12 have been obtained by
using a single fixed quantization level to generate each point. In H.264 there are 51 available
quantization levels, and therefore trying all possible quantization levels would increase the
complexity of the system by a factor of 51. This proved to be an unacceptable increase in
complexity, so we have made the decision to use a fixed quantization level throughout this
experiment. To justify this decision we have used the first 25 frames from each sequence to
regenerate the R-D curves from Figure 6.12 using a fully optimized system that considers all
available quantization levels. The R-D curves for the ADAPT and SD approaches using both
optimal quantization and fixed quantization are shown in Figure 6.13. The gains from using this
fully optimized approach are only about 0.1 to 0.3 dB, which is not insignificant but is relatively
small when considering the large increase in complexity.
6.6 Unbalanced Paths and Time Varying Conditions
We next analyze the performance of the adaptive method when used with a channel containing
unbalanced paths. First, in Section 6.6.1 we analyze the situation where one path is more reliable
than the other. Then, in Section 6.6.2, we introduce a momentary rise in packet loss rate on one
stream to simulate a temporary jump in congestion.
Figure 6.13: Optimal quantization level selection versus fixed quantization level. The
optimal curves have been calculated by allowing the encoder to choose a quantization
level in an R-D optimized manner. The fixed curves have been calculated using a fixed
quantization level. Only the first 25 frames of each sequence have been used in this
experiment. (a) Foreman sequence. (b) Carphone sequence.
Foreman Sequence
MD Mode    Even Frames (More Reliable Path)    Odd Frames (Less Reliable Path)
SD         54.7%                               48.3%
TS         26.5%                               16.5%
SS          7.4%                               12.9%
RC         11.4%                               22.4%

Carphone Sequence
MD Mode    Even Frames (More Reliable Path)    Odd Frames (Less Reliable Path)
SD         64.6%                               59.8%
TS         26.9%                               18.0%
SS          5.9%                               12.5%
RC          2.6%                                9.7%

Table 6.4: Percentage of macroblocks using each MD mode in the adaptive approach when sending over unbalanced paths.

Foreman Sequence     Balanced Paths    Unbalanced Paths
Stream 1             50.5%             55.4%
Stream 2             49.5%             44.6%

Carphone Sequence    Balanced Paths    Unbalanced Paths
Stream 1             50.1%             55.9%
Stream 2             49.9%             44.1%

Table 6.5: Percentage of total bitrate in each stream for both balanced and unbalanced paths.
6.6.1 Balanced versus Unbalanced Paths
This section explores the behavior of the adaptive system when one path is more reliable than the
other. The channel used in these experiments consisted of one path with 3% average packet loss
rate and another with 7%, both with expected burst lengths of 3 packets. The video in this
experiment was coded at approximately 0.4 bpp in CBR mode. Table 6.4 shows the distribution
of MD modes in even frames of the sequence versus odd frames. The even frames are those
where the larger packet is sent along the more reliable path and the smaller packet is sent along
the less reliable path (see Figure 5.11 regarding the packetization of MD data). The opposite is
true for the odd frames. It is also interesting to compare the results from Table 6.4 with those
from Table 6.2 at 5% balanced loss. The average of the even and odd frames from Table 6.4
matches closely with the values from the balanced case in Table 6.2.
As shown in Table 6.4, the system uses more SS and RC in the less reliable odd frames.
These more redundant methods allow the system to provide additional protection for those
frames which are more likely to be lost. By doing so, the adaptive system is effectively moving
data from the less reliable path into the more reliable path. Table 6.5 shows the bit rate sent
along each path in the balanced versus unbalanced cases. In this situation, the system is shifting
between 5 and 6% of its total rate into the more reliable stream to compensate for conditions on the
network. Since the non-adaptive methods are forced to send approximately half their total rate
along each path, it is difficult to make a fair comparison across methods in this unbalanced
situation. However, it is quite interesting that the end-to-end R-D optimization is able to adjust to
this situation in such a manner. In the future, it might be interesting to use a similar approach to
optimally distribute bandwidth across multiple paths.
The purpose of this research was to investigate the use of adaptive MD mode selection, and
consequently this system was not necessarily designed to provide optimal bandwidth
distribution. The results presented above are an interesting side-effect and may lead to useful
future research along these lines, but these results are not an example of optimal bandwidth
distribution. For instance, in the extreme case where 100% of packets arrive along one path and
0% of packets arrive along the other path, the optimal distribution would be to send all of the
data along the reliable path and none along the unreliable path. However, the alternating nature
of the SD and TS approaches in the current system (see Figures 5.10 and 5.11) prevents the
encoder from sending all of the data along the more reliable path. An optimal bandwidth
distribution system would need to allow for this type of flexibility. In addition, it may be
appropriate to impose additional constraints on this type of system such as a per-path rate
constraint (a limit on the number of bits sent along each path) rather than a total rate constraint as
used in the current system.
6.6.2 Time Varying Network Conditions
The second set of experiments in this section was designed to analyze the behavior of the
adaptive system under changing network conditions. Here we have used packet loss rates that
vary over time as shown in Figure 6.2. The packet loss rate temporarily increases to 10% first on
one path for frames 50-150 and then on the other for frames 250-350. This type of situation is
particularly well suited for the use of adaptive MD mode selection, and may actually be quite
likely in practice as well. There may be long periods of time where all packets are delivered
followed by short periods of congestion on one path or the other. During the periods of little or
no packet losses the adaptive system makes decisions to maximize coding efficiency. Once the
packet loss rate jumps up, the system can adapt in a number of different ways. As shown in the
previous section, frames sent along the less reliable path can be better protected using SS or RC
Figure 6.14: Distribution of selected MD modes used in the adaptive method for each
frame of the Foreman sequence illustrating how mode selection adapts to time varying
network conditions. Average packet loss rate varies as shown in Figure 6.2. The expected
burst length is 3 packets.
coding. For the frames sent along the more reliable path the system can elect to use TS coding to
prevent errors from propagating into the more reliable stream. Of course the system can also
elect to use more intra-coding to prevent error propagation as well. The adaptive mode selection
allows the encoder to analyze each of these different approaches and choose the most effective
approach for each block.
Figure 6.14 shows the distribution of MD modes in the adaptive approach for the Foreman
sequence. Specifically we show the percentage of macroblocks using each MD mode in each
frame of the video. Here the sequence has been coded at about 0.4 bpp, the packet loss rate
varies as shown in Figure 6.2, and the expected burst length was 3 packets (although during the
periods with 0% probability of loss the notion of burst length is meaningless). From this
distribution it is easy to notice the changes that occur as the packet loss rate jumps up. The
repetition mode in particular significantly increases during these periods of congestion. One can
also notice much more fluctuation when the packet loss rate increases to 10%. Here the
unbalanced nature of the two paths leads to different processing of the even and odd frames.
Those frames sent along the less reliable path are handled differently from those sent along the
more reliable path resulting in the noticeable fluctuation.
Figure 6.15 shows the resulting PSNR from each frame of the Foreman sequence and
compares the adaptive approach against each of its non-adaptive counterparts. Since the TS
method possesses the property that the loss of even frames affect only even frames and similarly
for odd frames, the increased loss rate on one path only reduces the PSNR for every other frame.
This results in severe fluctuation that would reduce the legibility of Figure 6.15. Therefore the
even and odd frames have been plotted separately here. Figure 6.15 (a) contains only the even
frames and Figure 6.15 (b) contains only the odd frames.
In this example, the odd frames are sent along the less reliable path during the first burst
and the even frames are sent along the less reliable path during the second burst. Consequently
the PSNR drops sharply for the TS approach during the first burst for the odd frames and during
the second burst for the even frames. Due to error propagation, both the even and odd frames
drop in quality during both bursts with the SD and to a lesser extent the SS approach. The RC
method is unaffected by either burst of losses since both streams are never simultaneously lost.
The adaptive approach is able to select modes that most effectively match the current network
conditions and maintains good quality throughout both loss events.
6.7 Sensitivity Analysis
The previous experiments have each assumed the encoder had perfect knowledge of the channel
conditions. In practice, the problem of accurately characterizing network conditions is
challenging, since conditions on the network can change rapidly, making it difficult to get an
accurate measure at any one point in time. Therefore, it is unlikely that the encoder will have
perfect knowledge of the channel. The purpose of this section is to analyze the effectiveness of
the adaptive MD mode selection system given imperfect knowledge of channel conditions. We
analyze the sensitivity of the system to packet loss rate in section 6.7.1 and sensitivity to burst
length in section 6.7.2. As in the previous sections, results presented here apply only to the
specific realization of the MD mode selection system used in this thesis.
Figure 6.15: Average distortion in each frame for ADAPT versus each non-adaptive
approach. Foreman sequence coded at 0.4 bpp with time varying packet loss rates and
expected burst length of 3 packets. (a) Even frames. (b) Odd frames.
Other such systems may be more or less sensitive to inaccurate knowledge of channel conditions, depending on the particular MD modes used and other aspects of the system design.
6.7.1 Sensitivity to Packet Loss Rate
The purpose of this experiment was to analyze the sensitivity of the system to errors in assumed
packet loss rate (PLR). The channel in this experiment was simulated with two balanced paths
each with expected burst length of 3 packets. The video was coded with the adaptive approach in
CBR mode at approximately 0.4 bits per pixel (bpp) and the assumed packet loss rate was varied
from 0 to 10%. Figure 6.16 demonstrates the resulting PSNR when the actual and assumed PLR
do not match. These results were computed by first calculating the mean-squared error distortion
by averaging across all the frames in the sequence and across the 300 packet loss traces, and then
computing the PSNR.
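Note the order of operations in this calculation: distortions are averaged in the mean-squared-error domain first and converted to PSNR only at the end. The following sketch (in Python) illustrates the calculation; the array shape, the peak value of 255 for 8-bit video, and the random example data are assumptions made for illustration rather than values taken from these experiments.

    import numpy as np

    def average_psnr(mse_per_trace_and_frame, peak=255.0):
        # mse_per_trace_and_frame: (num_traces, num_frames) array of per-frame MSE
        # values measured at the decoder under each simulated packet loss trace.
        avg_mse = np.mean(mse_per_trace_and_frame)      # average over traces and frames first
        return 10.0 * np.log10(peak ** 2 / avg_mse)     # then convert the average to PSNR (dB)

    # Hypothetical example: 300 loss traces of a 400-frame sequence
    mse = np.random.uniform(20.0, 60.0, size=(300, 400))
    print(round(average_psnr(mse), 2))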
Each line in Figure 6.16 represents a constant actual packet loss rate, as shown in the
legend on the right, and each point on these lines represents a different assumed packet loss rate.
These results demonstrate the effect of incorrect assumptions about packet loss rate. For
example, in the case of the Foreman sequence, if the actual packet loss rate is 0% and the
encoder is aware of this, the resulting PSNR is about 36 dB as is shown by the first point on the
blue line. If however the encoder incorrectly assumes the packet loss rate is 10% and the actual
loss rate remains at 0%, the encoder is wasting bits providing unnecessary error resilience, and
the PSNR drops to about 33 dB, as shown by the last point on the blue line. Similarly, if the
actual packet loss rate is 10% and the encoder makes an accurate assumption, the PSNR is about
31 dB as is shown by the last point on the red line. However, if the encoder incorrectly assumes
the packet loss rate is 0% and the actual loss rate remains at 10%, the encoder is not providing
nearly enough error resilience and the PSNR drops to about 23 dB.
Two main conclusions can be drawn from these figures. First, each of these lines peaks at roughly the point where the assumed and actual rates match (e.g. the maximum PSNR from the 5% actual PLR line occurs at 5% assumed PLR). This indicates that the optimization is appropriately
matching the coding to the channel conditions. Secondly, except for the 0% loss case, which
peaks at 0%, each of these lines is relatively flat to the right side and drops off fairly rapidly to
the left. At least with this particular implementation, it appears that if one fails to match the amount of error resilience to the actual channel conditions, it is more costly to underestimate the packet loss rate than to overestimate it.
Figure 6.16: Sensitivity of adaptive MD mode selection system to errors in assumed
packet loss rate. ADAPT approach coded at 0.4 bpp with balanced paths and expected
burst length of 3 packets. Each line represents a different actual packet loss rate. The
assumed packet loss rate varies as shown along the x-axis. (a) Foreman sequence. (b)
Carphone sequence.
For example, if one uses too many intra-blocks, some bits are wasted on inefficient intra-coding and the encoder must use coarser quantization to compensate, but the end result is only a small decrease in overall quality. If on
the other hand one uses too few intra-blocks, losses which occur cause significant drops in
quality and propagate through many frames. Again, other implementations of this type of system
may certainly give different results. However, these results would seem to suggest a certain
amount of conservatism in assumed packet loss rates. For instance, at least with this particular system, if there were some uncertainty in one's estimate of the actual packet loss rate, it would be better to err toward overestimating rather than underestimating it.
Figure 6.17 shows the same data, only this time each line represents a different assumed
packet loss rate and the actual packet loss rate varies as shown along the x-axis. This figure too
illustrates the decreased sensitivity to loss rates at higher assumed packet loss rates. As the
assumed packet loss rate increases, the slope of these lines decreases. As shown by the red line, if the encoder assumes the packet loss rate is 0%, the slightest increase in actual packet loss rate can
cause a significant drop in quality. If one knew a probability distribution of packet loss rates, for example if the packet loss rate were uniformly distributed between 0% and 1%, one should choose the line from Figure 6.17 that maximizes the expected quality, i.e. the integral of that line weighted by this probability distribution. The 0% line may have the maximum value at the left edge of this range, but its integral over the probability distribution may be lower given the steep drop in the red curve.
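This choice can be phrased as a small expected-value calculation over the sensitivity curves of Figure 6.17. The sketch below illustrates the idea; the curve values, the loss-rate grid, and the uniform distribution are hypothetical stand-ins for the measured data, not numbers read from the figure.

    import numpy as np

    def best_assumed_plr(psnr_curves, plr_pdf):
        # psnr_curves: maps each assumed PLR to an array of PSNR values, one per
        # actual PLR in the distribution's support (one curve from Figure 6.17).
        # plr_pdf: probability of each actual loss rate (sums to 1).
        expected = {assumed: float(np.dot(curve, plr_pdf))
                    for assumed, curve in psnr_curves.items()}
        return max(expected, key=expected.get)

    # Hypothetical example: actual PLR uniform over {0%, 0.5%, 1%}
    pdf = np.full(3, 1.0 / 3.0)
    curves = {0.00: np.array([36.0, 32.0, 29.0]),   # assumed 0%: highest peak, steep drop
              0.01: np.array([35.0, 34.5, 34.0])}   # assumed 1%: slightly lower but flatter
    print(best_assumed_plr(curves, pdf))            # the flatter 1% curve wins on average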
Figure 6.18 shows the sensitivity of the ADAPT approach relative to each of its non-adaptive counterparts. Here the actual packet loss rate is held fixed at 4% throughout the
experiment and the assumed packet loss rate varies as is shown on the x-axis. Once the encoder
has inaccurate knowledge of the actual channel conditions, the ROPE model will inaccurately
estimate expected end-to-end distortion leading to suboptimal decisions that are no longer
appropriate given the actual conditions. These results suggest that the adaptive approach is the
most sensitive to inaccurate channel knowledge. While the other methods mainly use the ROPE
model for intra/inter decisions, the adaptive approach also uses this model to select between MD
modes, so it seems reasonable that the adaptive approach would be the most sensitive to
inaccurate knowledge. As also demonstrated in Figures 6.16 and 6.17, this particular system is not as sensitive to errors in the positive direction, which again suggests a conservative approach.
Figure 6.17: Sensitivity of adaptive MD mode selection system to errors in assumed packet loss rate. ADAPT approach coded at 0.4 bpp with balanced paths and expected burst length of 3 packets. Each line represents a different assumed packet loss rate. The actual packet loss rate varies as shown along the x-axis. (a) Foreman sequence. (b) Carphone sequence.
Figure 6.18: Sensitivity of adaptive MD mode selection system to errors in assumed
packet loss rate relative to each of its non-adaptive counterparts. Each sequence has been coded at 0.4 bpp with balanced paths and expected burst length of 3 packets.
The assumed packet loss rate varies as shown along the x-axis and the actual packet loss
rate is 4%. (a) Foreman sequence. (b) Carphone sequence.
6.7.2 Sensitivity to Burst Length
In a similar manner, we next analyzed the sensitivity of the system to errors in assumed expected
burst length. Figures 6.19 and 6.20 show the results from these experiments. Each line in Figure
6.19 shows a constant actual expected burst length while the assumed burst length varies along
the x-axis. Figure 6.20 shows the same results only the assumed expected burst length is held
constant for each line while the actual burst length varies along the x-axis.
The first thing to note is the scale of these figures: the PSNR does not change nearly as dramatically as it does with variations in average packet loss rate. Errors in assumed burst length can lead to a drop in PSNR on the order of 0.6 dB (for these sequences), but it is apparent that this particular system is not as sensitive to errors in assumed burst length as it is to errors in average packet loss rate. The quality is again maximized when the assumed burst length matches the actual value; however, this time it is difficult to make any generalizations as to whether it is better to
assume longer or shorter burst lengths given an amount of uncertainty. The results from the
Foreman sequence would seem to suggest that it might pay off to assume slightly longer burst
lengths, but the same result does not appear in the Carphone sequence.
Figure 6.19: Sensitivity of adaptive MD mode selection system to errors in assumed
burst length. ADAPT approach coded at 0.4 bpp with balanced paths and average packet
loss rate of 5%. Each line represents a different actual burst length. The assumed burst
length varies as shown along the x-axis. (a) Foreman sequence. (b) Carphone sequence.
Figure 6.20: Sensitivity of adaptive MD mode selection system to errors in assumed
burst length. ADAPT approach coded at 0.4 bpp with balanced paths and average packet
loss rate of 5%. Each line represents a different assumed burst length. The actual burst
length varies as shown along the x-axis. (a) Foreman sequence. (b) Carphone sequence.
Figure 6.21: Comparison between (a) a single path and (b) multiple paths.
6.8 Comparisons between Using Single and Multiple Paths
The final experiments conducted in this chapter provide some insight into the benefits of using
multiple description coding with multiple paths (MP). Each of the previous experiments has
assumed the use of two independent paths, where burst losses on a single path affect only that
path. The experiments in this section compare this multiple path approach with an approach
using only a single path (SP); see Figure 6.21. If all losses are Bernoulli, where each packet loss event is independent and identically distributed, then the use of multiple paths is irrelevant, since all packet losses are independent regardless of whether they are sent along one or two paths. However,
with the introduction of bursty losses, using multiple paths with MD coding can be quite
beneficial. In this section we consider both MP and SP for three different approaches: 1) multiple
description coding, represented by the ADAPT approach, 2) optimized single description coding,
represented by the SD approach, and 3) standard single description coding, represented by the
STD approach. The use of standard single description coding on a single path is the approach
used most often in applications today and therefore provides a baseline of comparison for the
results presented in this thesis.
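The bursty channels used in these experiments can be viewed as two-state (Gilbert-style) loss processes parameterized by an average packet loss rate and an expected burst length. The sketch below shows one way such a loss trace could be generated; the derivation of the transition probabilities from these two parameters is standard, but the function itself is an illustrative assumption rather than the exact simulator used for these results.

    import random

    def gilbert_loss_trace(num_packets, loss_rate, expected_burst_len, seed=None):
        # Two-state model: every packet sent while in the 'bad' state is lost.
        # The mean stay in the bad state is expected_burst_len packets, and the
        # steady-state probability of being in the bad state equals loss_rate.
        rng = random.Random(seed)
        q = 1.0 / expected_burst_len            # P(bad -> good)
        p = loss_rate * q / (1.0 - loss_rate)   # P(good -> bad), so that p / (p + q) = loss_rate
        lost, bad = [], False
        for _ in range(num_packets):
            bad = (rng.random() < p) if not bad else (rng.random() >= q)
            lost.append(bad)
        return lost

    # Example: 5% average loss rate with an expected burst length of 3 packets
    trace = gilbert_loss_trace(100000, loss_rate=0.05, expected_burst_len=3, seed=1)
    print(sum(trace) / len(trace))   # close to 0.05

Setting expected_burst_len to 1/(1 - loss_rate) makes a loss equally likely whether or not the previous packet was lost, which recovers the independent (Bernoulli) case referred to above.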
In all previous experiments, care was taken to ensure that the same packet loss traces were
used throughout an experiment. This ensures a fair comparison between different methods.
However, when comparing SP and MP it is no longer possible to use the same packet loss traces.
By running a large number of realizations of each channel (300 realizations) we believe the
results presented below still provide a reasonable comparison, but we felt that it was important to
note this distinction.
The experiments in this section are similar to the experiments presented in Sections 6.4 and
6.5. The first experiment examines the effects of expected burst length on the resulting PSNR
and compares SP with MP. The multiple path channel in this experiment was simulated with two
balanced paths each with average packet loss rates of 5%. The single path channel also had an
average packet loss rate of 5%. The video was coded in CBR mode at approximately 0.4 bits per
pixel (bpp) and the expected burst length was varied from a little over 1 packet (Bernoulli) up to
10 packets. Figure 6.22 shows the outcome of this experiment. These results were obtained by
first calculating the mean-squared error distortion by averaging across all the frames in the
sequence and across the 300 packet loss traces, and then computing the PSNR.
Figure 6.22 shows the resulting PSNR for the ADAPT, SD, and STD approaches using
both single path (SP) and multiple paths (MP). The MP results are identical to those presented in
Section 6.4.2. As shown in Figure 6.22, burst length has a very significant effect on performance when using a single path, much more so than with multiple paths. With Bernoulli
losses, the two cases are identical, but the performance falls off rapidly with SP as the burst
length increases. One assumption made with MD coding is that it is relatively unlikely that both
descriptions will be lost, yet bursts of losses along a single path cause losses in both descriptions.
There can still be some gains from using MD coding with SP, but these gains are certainly not as
significant as with MP.
The second experiment shows the effect variations in packet loss rates have on SP versus
MP. The multiple path channel in this experiment was simulated with two balanced paths each
with expected burst length of 3 packets. The single path channel was also simulated with
expected burst lengths of 3 packets. The video was coded in CBR mode at approximately 0.4 bits
per pixel (bpp) and the average packet loss rate was varied from 0 to 10%. These results were
again computed by first calculating the mean-squared error distortion by averaging across all the
frames in the sequence and across the 300 packet loss traces, and then computing the PSNR.
Figure 6.22: PSNR versus expected burst length comparing the benefits of using
multiple paths (MP) versus only a single path (SP). Foreman and Carphone sequences coded at
approximately 0.4 bpp. The expected burst length for this experiment was varied from
Bernoulli to 10 packets, and the average packet loss rate was held constant at 5%. (a)
Foreman sequence. (b) Carphone sequence.
Figure 6.23 demonstrates the resulting PSNR for each of the cases. The MP results are identical
to those presented in Section 6.4.1.
This experiment again shows the benefits of using MD coding with MP. The performance
of each approach drops much more sharply with SP and the gains from MD coding are much
greater when combined with MP. At 0% no losses occur, so the results for SP and MP are identical; however, the PSNR drops off much more rapidly for SP as the loss rate increases. As indicated by the results of
the previous experiment, this drop in performance is mainly a result of the bursty losses. With
Bernoulli losses, these results would be identical for SP and MP. At burst lengths of 10 packets,
the differences would be even more apparent.
The final experiment here demonstrates the benefits of MP at a number of different
bitrates. Figure 6.24 shows the end-to-end R-D performance curves of both SP and MP
approaches. This experiment was run in VBR mode with fixed quantization levels. To generate
each point on these curves, the resulting distortion was averaged across all 300 packet loss
simulations, as well as across all frames of the sequence. The same calculation was then
conducted at various quantizer levels to generate each R-D curve. The MP results are identical to
those presented in Section 6.5.
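In outline, each end-to-end R-D curve is built by sweeping the quantizer, recording the resulting rate, and averaging the end-to-end distortion before converting to PSNR, following the same convention as the earlier sketch. The snippet below only illustrates that loop; encode_and_simulate is a hypothetical placeholder for running the coder at a fixed quantizer over all frames and all 300 loss traces.

    import math

    def rd_curve(encode_and_simulate, quantizer_levels):
        # encode_and_simulate(qp) -> (bits_per_pixel, average_mse), where the MSE is
        # already averaged over all frames and all simulated packet loss traces.
        points = []
        for qp in quantizer_levels:
            bpp, avg_mse = encode_and_simulate(qp)
            psnr = 10.0 * math.log10(255.0 ** 2 / avg_mse)   # convert after averaging
            points.append((bpp, psnr))
        return sorted(points)   # one (rate, PSNR) point per quantizer level

    # Toy stand-in for the real coder: coarser quantizers give lower rate and higher MSE
    def fake_coder(qp):
        return 0.5 / qp, 10.0 * qp
    print(rd_curve(fake_coder, [1, 2, 4, 8]))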
By using MP with MD coding, the resulting PSNR is increased by approximately 1 dB relative to MD coding with SP. The corresponding gain for SD coding from using multiple paths is about 0.5 dB. As
seen in the previous experiments the benefits of MD coding are much more significant relative to
SD coding when used with MP. The STD approach with SP provides a baseline of comparison
since this is the type of approach used in most applications today. MD coding with MP is able to
outperform this standard approach by about 2 dB at lower bitrates and by as much as 4-5 dB at
the higher bitrates.
Figure 6.23: PSNR versus average packet loss rate comparing the benefits of using
multiple paths (MP) versus only a single path (SP). Foreman and Carphone sequences coded at
approximately 0.4 bpp. The average packet loss rate for this experiment was varied from
0% to 10%, and the expected burst length was held constant at 3 packets. (a) Foreman
sequence. (b) Carphone sequence.
Figure 6.24: End-to-end R-D performance of SP versus MP. 5% packet loss rate,
expected burst length 3 packets. (a) Foreman sequence. (b) Carphone sequence.
Chapter 7
Conclusions
7.1 Summary
The transmission of video sequences over lossy packet networks involves a fundamental tradeoff
between compression efficiency and error resilience. Raw video contains an immense amount of
data that must be stored in a limited amount of space and/or transmitted in a finite amount of
time, thus demanding the use of extremely efficient video compression algorithms. Fortunately,
raw video sequences also contain a significant amount of redundancy that allows encoders to
perform considerable compression without significantly distorting the resulting video. The
redundancy present in the original sequence provides a significant amount of error resilience, but
the amount of bandwidth required for real-time transmission of uncompressed video is not
reasonable. Ideally one would like to remove all the redundancy from the video bit stream and
compress the data down to the smallest number of bits possible. However, by doing so, each bit
increases in importance and any losses that occur can have a much more significant impact on
the resulting video quality.
To this end, a number of error resilient video compression algorithms have been developed.
These approaches are essentially joint source-channel coders that trade off some amount of
compression efficiency for an increase in error resilience. Multiple description coding is one
such approach. A multiple description encoder codes a single sequence into two or more
complementary streams and transmits these independently across the network. In the event one
stream is lost, the remaining stream can still be straightforwardly decoded resulting in only a
slight reduction in video quality.
This thesis proposed end-to-end rate-distortion optimized adaptive MD mode selection for
multiple description coding. This approach makes use of multiple MD coding modes within a
single sequence, making optimized decisions using a model for predicting the expected end-to-end distortion. The extended ROPE model presented in Chapter 4 is used to predict the distortion
experienced at the decoder taking into account both bursty packet losses and the use of multiple
transmission paths. This allows the encoder in this system to make optimized mode selections
using Lagrangian optimization techniques to minimize the expected end-to-end distortion.
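In outline, this per-block decision evaluates a Lagrangian cost for each candidate MD mode and keeps the cheapest one. The sketch below is a simplified illustration of that selection loop; the mode list, the expected_distortion and rate callables, and the numerical values are placeholders standing in for the quantities produced by the extended ROPE model and the H.264 encoder, not the actual implementation.

    def select_md_mode(block_id, modes, expected_distortion, rate, lagrange_multiplier):
        # Choose the MD mode minimizing the Lagrangian cost J = D + lambda * R, where
        # expected_distortion(block, mode) stands in for the ROPE-style estimate of
        # end-to-end distortion and rate(block, mode) for the bits needed by that mode.
        costs = {mode: expected_distortion(block_id, mode)
                       + lagrange_multiplier * rate(block_id, mode)
                 for mode in modes}
        return min(costs, key=costs.get)

    # Hypothetical distortion/rate numbers for one macroblock
    D = {"SD": 40.0, "TS": 30.0, "SS": 25.0, "RC": 18.0}
    R = {"SD": 100.0, "TS": 120.0, "SS": 150.0, "RC": 200.0}
    best = select_md_mode(0, list(D), lambda b, m: D[m], lambda b, m: R[m], 0.1)
    print(best)   # with these numbers, RC gives the lowest D + 0.1 * R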
We began by examining the performance of the extended ROPE algorithm. It was
originally unknown how well the ROPE algorithm would work with the H.264 codec and in
particular with the use of quarter-pixel motion vector accuracy. We have shown that the ROPE
model is able to accurately track the expected distortion experienced at the decoder even when
used with H.264. We have shown how it accurately takes into account characteristics of the
video source, packet loss rates, bursty packet losses and the use of multiple transmission paths.
We have then shown how one such adaptive MD mode selection system based on H.264 is
able to adapt to local characteristics of the video and to network conditions on multiple paths and
have shown the potential for this adaptive approach, which selects among a small number of
simple complementary MD modes, to significantly improve video quality. The results presented
in this thesis demonstrate how this system accounts for the characteristics of the video source,
e.g. using more redundant modes in regions particularly susceptible to losses, and how it adapts
to conditions on the network, e.g. switching from more bitrate-efficient methods to more resilient
methods as the loss rate increases.
Other experiments have explored the use of adaptive MD mode selection with unbalanced
paths, where one path is more reliable than the other, and with time varying network conditions.
One particularly interesting result from these experiments was the realization that this particular
system was able to intelligently redistribute a portion of the bits to the more reliable path in this
type of unbalanced situation. Since the non-adaptive approaches were unable to redistribute bits in the same way, permitting the adaptive system to do so did not allow for an entirely fair comparison. However, it was an interesting result, and in the future it may be possible to use a
similar approach for optimally distributing packets across multiple unbalanced paths.
Most of the experiments in this thesis assumed the encoder had accurate knowledge of the
current packet loss rate and expected burst length on the network. However, it is unlikely that the
encoder will have such perfect knowledge of the time varying conditions that exist on packet
networks. To this end, we performed a number of experiments to evaluate the sensitivity of the
system to inaccurate knowledge of network conditions. In general the results indicated that this
particular system is most sensitive to underestimating the packet loss rate, which suggested that
it may be wise to develop a conservative scheme and encode for slightly higher loss rates than
expected. The results also showed that the system was not as sensitive to overestimation of
packet loss rate or to errors made in assumed burst length.
Finally we have evaluated the gains in performance when MD coding is combined with the
use of multiple paths. The most common approach in video streaming today is the use of single
description coding sent over a single path, while the majority of this thesis has focused on the use
of multiple description coding with multiple paths. As a baseline of comparison, we have
performed experiments to evaluate the performance of SD and MD coding along both single and
multiple paths at various packet loss rates, expected burst lengths, and bitrates. The results of
these experiments demonstrated the significant benefits of combining MD coding with the use of
multiple paths.
While the results presented in this work are specific to this particular realization of such a
system, these results demonstrate that the adaptive combination of a small set of MD modes
combined with intelligent mode selection can significantly improve video quality. Overall, the
results with adaptive MD mode selection are quite promising, and we believe MD mode
selection can be a useful tool for the reliable delivery of video streams.
7.2 Future Research Directions
The results of this research indicate that adaptive MD mode selection can be an effective error
resilience tool by allowing the encoder to adapt to current network conditions and to the
characteristics of the video source. However, there are a number of areas that would need further
exploration before a practical application of this concept could be implemented, the most
obvious being the issue of complexity. This research has assumed the encoder would be able to
maintain real-time processing in order for it to adapt to current network conditions. The H.264
codec alone is quite computationally demanding, not to mention the complexity added by the ROPE algorithm and R-D optimized adaptive mode selection. A real-time implementation of this approach at the present time would be a challenging task involving tradeoffs between performance and complexity, although this goal should become attainable as the underlying technology continues to evolve.
One could test the system in an actual network environment by integrating the encoder with
an algorithm for estimating current network characteristics. The problem of estimating packet
loss rates and expected burst lengths is a difficult task given rapidly changing network conditions
and is similar to trying to hit a moving target. The sensitivity analysis in Section 6.7 has provided
some insight into how this particular system is affected by imperfect knowledge of channel
conditions, but it would be interesting to see how this type of approach would perform in a full
end-to-end system.
The MD modes used in this thesis provided a convenient means of introducing the concept
of adaptive MD mode selection, but other interesting modes could also be considered, many of
which could potentially increase the performance of the system dramatically. Because this
system uses MD modes in an adaptive fashion, it is not necessary for each MD approach to work
well under all possible situations. Even if a mode works poorly on its own, it can still be quite
useful in an adaptive system as long as it performs well in certain special cases. One could
reevaluate the performance of previously developed methods and/or develop new MD methods
keeping this fact in mind.
The current results have focused on QCIF video resolution. It may also be interesting to
study the use of this adaptive mode selection on larger video sequences of CIF or even HD
resolution. The use of larger frames would force a single frame to be split across multiple
packets, which also offers some interesting possibilities for more advanced error concealment
techniques. Integrating more advanced techniques into the system, such as repairing one description from another (which has shown gains of up to 3 dB [1]), could offer some significant
improvements in video quality. Several of the components used in the current implementation
could also be individually studied and improved. For instance, improved rate control algorithms,
R-D optimization techniques, or models of expected distortion could be beneficial and could be
easily integrated into the current architecture.
Besides improvements to the existing system, the results from this thesis also point in a
number of other interesting directions as well. For instance, the use of adaptive MD mode
selection on unbalanced paths has led to some interesting results regarding the distribution of
bandwidth across asymmetrical paths. It would be interesting to use a ROPE-like algorithm to
study optimal distribution of bandwidth along unbalanced paths subject to bandwidth constraints
and various packet loss rates. This thesis has mainly studied two-description MD coding along
two paths, but it is not well known how to optimally distribute M descriptions along N paths
(e.g. two descriptions with three available paths). By modeling the expected end-to-end
distortion, it may be possible to gain a better understanding of this more general problem and
develop useful insights. In addition to considering situations where each path has a different
reliability, it may also be useful to explore other types of unbalanced circumstances, for instance
where each path has a different bitrate constraint.
The use of look-ahead (analysis of future frames) in error resilient video coding is another
topic that has not been thoroughly studied. This thesis has assumed that look-ahead was not an
option due to the additional delay involved. However, the use of one or two frames of look-ahead
might provide significant improvements in performance for applications that are able to support
a small amount of additional delay.
Overall, we believe that adaptive MD mode selection using an R-D optimized algorithm
that minimizes the expected end-to-end distortion while accounting for characteristics of the
video source and current network conditions is a promising approach for reliable delivery of
video streams over lossy packet networks.
Bibliography
[1] J. Apostolopoulos, "Error-resilient video compression through the use of multiple states," in Proceedings of the IEEE International Conference on Image Processing, vol. 3, pp. 352-355, September 2000.
[2] J. Apostolopoulos, "Reliable video communication over lossy packet networks using multiple state encoding and path diversity," in Proceedings of the SPIE: Visual Communications and Image Processing, vol. 4310, pp. 392-409, January 2001.
[3] J. Apostolopoulos, W. Tan, and S. Wee, "Performance of a multiple description streaming media content delivery network," in Proceedings of the IEEE International Conference on Image Processing, vol. 2, pp. 189-192, September 2002.
[4] J. Apostolopoulos, W. Tan, S. Wee, and G. Wornell, "Modeling path diversity for multiple description video communication," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 2161-2164, May 2002.
[5] J. Apostolopoulos, T. Wong, W. Tan, and S. Wee, "On multiple description streaming with content delivery networks," in INFOCOM 2002, vol. 3, pp. 1736-1745, June 2002.
[6] I. Bajic and J. Woods, "Domain-based multiple description coding of images and video," IEEE Transactions on Image Processing, vol. 12, pp. 1211-1225, October 2003.
[7] N. Boulgouris, K. Zachariadis, A. Leontaris, and M. Strintzis, "Drift-free multiple description coding of video," in Proceedings of the IEEE International Workshop on Multimedia Signal Processing, pp. 105-110, October 2001.
[8] D. Chung and Y. Wang, "Multiple description image coding using signal decomposition and reconstruction based on lapped orthogonal transforms," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, pp. 895-908, September 1999.
[9] D. Chung and Y. Wang, "Lapped orthogonal transforms designed for error-resilient image coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, pp. 752-764, September 2002.
[10] G. Côté and F. Kossentini, "Optimal intra coding of blocks for robust video communication over the Internet," Signal Processing: Image Communication, vol. 15, pp. 25-34, September 1999.
[11] T. Cover and J. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[12] S. Diggavi, N. Sloane, and V. Vaishampayan, "Asymmetric multiple description lattice vector quantizers," IEEE Transactions on Information Theory, vol. 48, pp. 174-191, January 2002.
[13] C. Fenimore, V. Baroncini, T. Oelbaum, and T. K. Tan, "Subjective testing methodology in MPEG video verification," in Proceedings of the SPIE: Applications of Digital Image Processing XXVII, vol. 5558, pp. 503-511, August 2004.
[14] A. El Gamal and T. Cover, "Achievable rates for multiple descriptions," IEEE Transactions on Information Theory, vol. 28, pp. 851-857, November 1982.
[15] V. Goyal, "Multiple description coding: compression meets the network," IEEE Signal Processing Magazine, vol. 18, pp. 74-93, September 2001.
[16] V. Goyal, J. Kelner, and J. Kovacevic, "Multiple description vector quantization with a coarse lattice," IEEE Transactions on Information Theory, pp. 781-788, March 2002.
[17] V. Goyal and J. Kovacevic, "Optimal multiple description transform coding of Gaussian vectors," in Proceedings of the Data Compression Conference, pp. 388-397, March 1998.
[18] V. Goyal and J. Kovacevic, "Generalized multiple description coding with correlating transforms," IEEE Transactions on Information Theory, vol. 47, pp. 2199-2224, September 2001.
[19] V. Goyal, J. Kovacevic, R. Arean, and M. Vetterli, "Multiple description transform coding of images," in Proceedings of the IEEE International Conference on Image Processing, vol. 1, pp. 674-678, October 1998.
[20] ITU-T Rec. H.264, "Advanced video coding for generic audiovisual services," March 2003.
[21] P. Haskell and D. Messerschmitt, "Resynchronization of motion compensated video affected by ATM cell loss," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 545-548, March 1992.
[22] B. Heng, J. Apostolopoulos, and J. Lim, "End-to-end rate-distortion optimized mode selection for multiple description video coding," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, pp. 905-908, March 2005.
[23] R. Hinds, T. Pappas, and J. Lim, "Joint block-based video source/channel coding for packet-switched networks," in Proceedings of the SPIE: Visual Communications and Image Processing, vol. 3309, pp. 124-133, January 1998.
[24] N. Jayant, "Subsampling of a DPCM speech channel to provide two 'self-contained' half-rate channels," Bell Sys. Tech. Journal, vol. 60, pp. 501-509, April 1981.
[25] N. Jayant and S. Christensen, "Effects of packet losses in waveform coded speech and improvements due to an odd-even sample-interpolation procedure," IEEE Transactions on Communications, vol. 29, pp. 101-109, February 1981.
[26] C.-S. Kim, R.-C. Kim, and S.-U. Lee, "Robust transmission of video sequence using double-vector motion compensation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 1011-1021, September 2001.
[27] Y. Liang, J. Apostolopoulos, and B. Girod, "Analysis of packet loss for compressed video: Does burst-length matter?," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, pp. 684-687, April 2003.
[28] J. Liao and J. Villasenor, "Adaptive intra update for video coding over noisy channels," in Proceedings of the IEEE International Conference on Image Processing, vol. 3, pp. 763-766, September 1996.
[29] L.-J. Lin and A. Ortega, "Bit-rate control using piecewise approximated rate-distortion characteristics," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, pp. 446-459, August 1998.
[30] D. Marpe, H. Schwarz, and T. Wiegand, "Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 620-636, July 2003.
[31] A. Ortega and K. Ramchandran, "Rate-distortion methods for image and video compression," IEEE Signal Processing Magazine, vol. 15, pp. 23-50, November 1998.
[32] R. Puri and K. Ramchandran, "Multiple description source coding using forward error correction codes," in Asilomar Conference on Signals, Systems, and Computers, vol. 1, pp. 342-346, October 1999.
[33] A. Reibman, "Optimizing multiple description video coders in a packet loss environment," in Proceedings of the Packet Video Workshop, April 2002.
[34] A. Reibman, H. Jafarkhani, Y. Wang, M. Orchard, and R. Puri, "Multiple-description video coding using motion-compensated temporal prediction," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, pp. 193-204, March 2002.
[35] R. Schafer, T. Wiegand, and H. Schwarz, "The emerging H.264/AVC standard," EBU Technical Review, January 2003.
[36] S. Servetto, K. Ramchandran, V. Vaishampayan, and K. Nahrstedt, "Multiple description wavelet based image coding," IEEE Transactions on Image Processing, vol. 9, pp. 813-826, May 2000.
[37] E. Steinbach, N. Farber, and B. Girod, "Standard compatible extension of H.263 for robust video transmission in mobile environments," IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, pp. 872-881, December 1997.
[38] K. Stuhlmüller, N. Farber, M. Link, and B. Girod, "Analysis of video transmission over lossy channels," IEEE Journal on Selected Areas in Communications, vol. 18, pp. 1012-1032, June 2000.
[39] G. Sullivan, A. Luthra, and P. Topiwala, "The H.264/AVC advanced video coding standard: overview and introduction to the fidelity range extensions," in Proceedings of the SPIE: Applications of Digital Image Processing XXVII, vol. 5558, pp. 454-474, August 2004.
[40] G. Sullivan and T. Wiegand, "Video compression - from concepts to the H.264/AVC standard," Proceedings of the IEEE, vol. 93, pp. 18-31, January 2005.
[41] V. Vaishampayan, "Design of multiple description scalar quantizers," IEEE Transactions on Information Theory, vol. 39, pp. 821-834, May 1993.
[42] V. Vaishampayan, A. Calderbank, and J. Batllo, "On reducing granular distortion in multiple description quantization," in Proceedings of the IEEE International Symposium on Information Theory, p. 98, August 1998.
[43] V. Vaishampayan and J. Domaszewicz, "Design of entropy-constrained multiple-description scalar quantizers," IEEE Transactions on Information Theory, vol. 40, pp. 245-250, January 1994.
[44] W. Wan and J. Lim, "Adaptive format conversion for scalable video coding," in Proceedings of the SPIE: Applications of Digital Image Processing XXIV, vol. 4472, pp. 390-401, December 2001.
[45] X. Wang and M. Orchard, "Multiple description coding using trellis coded quantization," in Proceedings of the IEEE International Conference on Image Processing, vol. 1, pp. 391-394, September 2000.
[46] Y. Wang and D. Chung, "Robust image coding and transport in wireless networks using nonhierarchical decomposition," in International Workshop on Mobile Multimedia Communications, pp. 285-282, September 1996.
[47] Y. Wang and S. Lin, "Error-resilient video coding using multiple description motion compensation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, pp. 438-452, June 2002.
[48] Y. Wang, M. Orchard, and A. Reibman, "Multiple description image coding for noisy channels by pairing transform coefficients," in IEEE Workshop on Multimedia Signal Processing, pp. 419-424, June 1997.
[49] Y. Wang, M. Orchard, V. Vaishampayan, and A. Reibman, "Multiple description coding using pairwise correlating transforms," IEEE Transactions on Image Processing, vol. 10, pp. 351-366, March 2001.
[50] Y. Wang, A. Reibman, and S. Lin, "Multiple description coding for video communications," Proceedings of the IEEE, vol. 93, pp. 57-70, January 2005.
[51] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 560-576, July 2003.
[52] H. Witsenhausen and A. Wyner, "Source coding for multiple descriptions II: a binary source," Bell Sys. Tech. Journal, vol. 60, pp. 2281-2292, December 1981.
[53] J. Wolf, A. Wyner, and J. Ziv, "Source coding for multiple descriptions," Bell Sys. Tech. Journal, vol. 59, pp. 1417-1426, October 1980.
[54] D. Wu, Y. Hou, W. Zhu, Y. Zhang, and J. Peha, "Streaming video over the Internet: approaches and directions," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 282-300, March 2001.
[55] H. Yang and K. Rose, "Recursive end-to-end distortion estimation with model-based cross-correlation approximation," in Proceedings of the IEEE Conference on Image Processing, vol. 3, pp. 469-472, September 2003.
[56] R. Zhang, S. Regunathan, and K. Rose, "Video coding with optimal inter/intra-mode switching for packet loss resilience," IEEE Journal on Selected Areas in Communications, vol. 18, pp. 966-976, June 2000.