Adaptive Multiple Description Mode Selection for Error Resilient Video Communications

by Brian A. Heng

S.M., Massachusetts Institute of Technology (2001)
B.S., University of Minnesota (1999)

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

June 2005

© 2005 Massachusetts Institute of Technology. All rights reserved.

Signature of Author: Department of Electrical Engineering and Computer Science, June 29, 2005
Certified by: Jae S. Lim, Professor of Electrical Engineering, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Departmental Committee on Graduate Students

Adaptive Multiple Description Mode Selection for Error Resilient Video Communications

by Brian A. Heng

Submitted to the Department of Electrical Engineering and Computer Science on June 29, 2005 in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Electrical Engineering and Computer Science

Abstract

Streaming video applications must be able to withstand the potentially harsh conditions present on best-effort networks like the Internet, including variations in available bandwidth, packet losses, and delay. Multiple description (MD) video coding is one approach that can be used to reduce the detrimental effects caused by transmission over best-effort networks. In a multiple description system, a video sequence is coded into two or more complementary streams in such a way that each stream is independently decodable. The quality of the received video improves with each received description, and the loss of any one of these descriptions does not cause complete failure.

A number of approaches have been proposed for MD coding, where each provides a different tradeoff between compression efficiency and error resilience. How effectively each method achieves this tradeoff depends on network conditions as well as on the characteristics of the video itself. This thesis proposes an adaptive MD coding approach that adapts to changing conditions through the use of MD mode selection. The encoder in this system is able to accurately estimate the expected end-to-end distortion, accounting for both compression and packet-loss-induced distortions, as well as for the bursty nature of channel losses and the effective use of multiple transmission paths. With this model of the expected end-to-end distortion, the encoder selects between MD coding modes in a rate-distortion (R-D) optimized manner to most effectively trade off compression efficiency for error resilience. We show how this approach adapts to both the local characteristics of the video and to network conditions and demonstrate the resulting gains in performance using an H.264-based adaptive MD video coder. We also analyze the sensitivity of this system to imperfect knowledge of channel conditions and explore the benefits of using such a system with both single and multiple paths.

Thesis Supervisor: Jae S. Lim
Title: Professor of Electrical Engineering

Dedication

To Mom and Dad, For Always Believing.
To Susanna, For Her Encouragement, Patience, and Love.

Acknowledgements

Many people have contributed to this thesis both directly and indirectly during these past few years.
I would like to take this opportunity to recognize these contributions and to thank all those who have made this accomplishment possible. I would like to start by thanking my thesis supervisor Professor Jae Lim for his guidance and support during my time at MIT. I am very grateful to him for providing me a place in his lab and for the extensive advice he has given me about both research and life. It was a great honor to work with him these last six years. I would also like to express thanks to Dr. John Apostolopoulos and Professor Vivek Goyal for serving on my thesis committee. They have both spent many hours working with me to improve the quality of this research, and their comments have always been useful and insightful. I would also like to acknowledge Hewlett-Packard and the Advanced Telecommunications and Signal Processing (ATSP) group for their financial support of this research. My friends and colleagues in the ATSP group have made my time at MIT much more enjoyable, and my interactions with each of them have been rewarding in many ways. Special thanks to fellow Ph.D. students Wade Wan, Eric Reed, and David Baylon for making me feel welcome, for helping me get started, and for continuing to assist me even after their careers at MIT. I would like to thank the group's administrative assistant, Cindy LeBlanc, for making my life here so much easier and for always looking out for me. I am grateful to Jason Demas, Sherman Chen, and Jiang Fu for the opportunity to work with them at Broadcom Corporation. My summers at Broadcom were very enjoyable, and the knowledge I gained during this time has been immensely helpful. I would also like to thank Davis Pan and Shiufun Cheung for giving me the chance to learn from them during my internship at Compaq Computer Corporation. I have been fortunate to have a number of supportive and loyal friends throughout my life. I am very lucky to have met Wade Wan when I started at MIT. His guidance and advice have been invaluable and I am grateful to have such a close friend. I would like to thank fellow Ph.D. -7- student Everest Huang for his companionship and for our many enjoyable conversations. To my group of friends back home, including Neil Dizon, Steve Keu, Dao Yang, Mark Schermerhorn, Efren Dizon, Chris Takase, and Nitin Jain, thank you for always being there. I am privileged to have a wonderful family. I am especially thankful to my parents Mary Jane and Duane Heng for their unending source of love and support. The opportunities they have provided for me have made this accomplishment possible. I would also like to thank my brother David for his encouragement and for always being a good friend. My family has suffered the loss of my grandparents, James and Margaret Pribyl, while I have been away. I hope in my heart I have made them proud on this day. I honor their memory and will never forget them. Finally, I am very fortunate to have the love and support of my girlfriend, Susanna. It has been difficult living apart all these years, and I am extremely grateful for her patience and understanding. She has always been my source of strength, and her encouragement and love have made this work possible. Brian Heng Cambridge, MA June 29, 2005 -8- Contents 1 2 3 Introduction.................................................................................................................. 19 1.1 Video Processing Terminology ........................................................................ 
20
1.2 Multiple Description Video Coding  23
1.3 Thesis Motivation and Overview  27

2 Multiple Description Video Coding  31
2.1 Multiple Description Coding Techniques  31
2.1.1 Multiple Description Quantization  32
2.1.2 Spatial/Temporal MD Splitting  34
2.1.3 MD Splitting in the Transform Domain  35
2.2 Predictive Multiple Description Coding  38
2.3 Applications of Multiple Description Video Coding  40

3 Adaptive MD Mode Selection  43
3.1 Adaptive Mode Selection Systems  43
3.2 Rate-Distortion Optimized Mode Selection  46
3.2.1 Independent Rate-Distortion Optimization  46
3.2.2 Effects of Dependencies on Rate-Distortion Optimization  50
3.3 End-to-End R-D Optimized MD Mode Selection  51
3.3.1 Lagrangian Optimization  51
3.3.2 Rate-Distortion Optimization over Lossy Channels  52

4 Modeling End-to-End Distortion over Lossy Packet Networks  53
4.1 Optimal Intra-Coding for Error Resilient Video Streams  53
4.2 Recursive Optimal Per Pixel Estimate of Expected Distortion  55
4.3 Multiple Description ROPE Model  56
4.4 Extended ROPE Model  56

5 MD Mode Selection System  61
5.1 MPEG4-AVC / H.264 Video Coding Standard  61
5.1.1 Intra-Frame Prediction  64
5.1.2 Hierarchical Block Motion Estimation  65
5.1.3 Multiple Reference Frames  66
5.1.4 Quarter Pixel Motion Vector Accuracy  67
5.1.5 In-Loop Deblocking Filter  68
5.1.6 Entropy Coding  69
5.1.7 H.264 Performance  70
5.2 MD System Implementation  71
5.2.1 Examined MD Coding Modes  72
5.2.2 Data Packetization  74
5.2.3 Discussion of Modifications Not in Compliance with H.264 Standard  75

6 Experimental Results and Analysis  77
6.1 Test Sequences  78
6.2 Performance of Extended ROPE Algorithm  81
6.3 MD Coding Adapted to Local Video Characteristics  85
6.4 MD Coding Adapted to Network Conditions  91
6.4.1 Variations in Average Packet Loss Rate  91
6.4.2 Variations in Expected Burst Length  94
6.5 End-to-End R-D Performance  97
6.6 Unbalanced Paths and Time Varying Conditions  99
6.6.1 Balanced versus Unbalanced Paths  101
6.6.2 Time Varying Network Conditions  102
6.7 Sensitivity Analysis  104
6.7.1 Sensitivity to Packet Loss Rate  106
6.7.2 Sensitivity to Burst Length  111
6.8 Comparisons between Using Single and Multiple Paths  114

7 Conclusions  121
7.1 Summary  121
7.2 Future Research Directions  123

Bibliography  127

List of Figures

1.1 Scan Modes for Video Sequences  21
1.2 Gilbert Packet Loss Model  23
1.3 Two Stream Multiple Description System  24
1.4 Classic Depiction of an MD System  24
1.5 Example of Multiple Description Video Coding  25
1.6 Comparison between Scalable Video Coding and MD Coding  27
2.1 MD Coding of Audio  32
2.2 MD Scalar Quantization  33
2.3 MD Splitting of an Image  35
2.4 Transform Domain MD Splitting  36
2.5 Applications of MD Coding  41
3.1 Dynamic Programming Tree  48
3.2 Comparison between Lagrangian Optimization and Dynamic Programming  49
4.1 Conceptual Computation of First Moment Values in MD ROPE Approach  57
4.2 Gilbert Packet Loss Model  58
5.1 H.264 Encoder Architecture  62
5.2 H.264 Decoder Architecture  62
5.3 4x4 Intra-Prediction  63
5.4 Two of the Nine Available 4x4 Intra-Prediction Modes  63
5.5 16x16 Intra-Prediction Modes  64
5.6 Macroblock Partitions for Motion Estimation  65
5.7 Multiple Reference Frames  66
5.8 Six-Tap Filter used for Half Pixel Interpolation  67
5.9 In-Loop Deblocking Filter  68
5.10 Examined MD Coding Modes  72
5.11 Packetization of Data in MD Modes  75
6.1 Performance of ROPE Algorithm - Actual vs. Expected PSNR  80
6.2 Time Varying Packet Loss Rates  82
6.3 Performance of ROPE Algorithm with Time Varying Loss Rates  83
6.4 Bernoulli Losses versus Gilbert Losses with ROPE Algorithm  84
6.5 MD Coding Adapted to Local Video Characteristics  86
6.6 Distribution of Selected MD Modes - Foreman Sequence  87
6.7 Visual Results - Frame 5 Foreman Sequence  89
6.8 Visual Results - Frame 231 Foreman Sequence  90
6.9 PSNR versus Average Packet Loss Rate  92
6.10 PSNR versus Expected Burst Length  95
6.11 Effects of Expected Burst Length on the TS Mode  96
6.12 End-to-End R-D Performance at 5% Loss Rate with Burst Length of 3  98
6.13 R-D Optimized Quantization Levels versus Fixed Quantization Level  100
6.14 Distribution of Selected MD Modes - Time Varying Loss Rates  103
6.15 PSNR versus Frame with Time Varying Loss Rates  105
6.16 Sensitivity to Errors in Assumed Packet Loss Rate - Part 1  107
6.17 Sensitivity to Errors in Assumed Packet Loss Rate - Part 2  109
6.18 Sensitivity of ADAPT Relative to Non-Adaptive Methods  110
6.19 Sensitivity to Errors in Assumed Burst Length - Part 1  112
6.20 Sensitivity to Errors in Assumed Burst Length - Part 2  113
6.21 Comparison of Single Path vs. Multiple Paths  114
6.22 Multiple Paths vs. Single Path - Variations in Expected Burst Length  116
6.23 Multiple Paths vs. Single Path - Variations in Packet Loss Rate  118
6.24 Multiple Paths vs. Single Path - R-D Performance  119

List of Tables

5.1 Exponential-Golomb Codebook  69
5.2 List of MD Coding Modes  74
6.1 Test Sequences  79
6.2 Distribution of MD Modes at 0%, 5%, and 10% Packet Loss Rates  93
6.3 Distribution of MD Modes at Various Burst Lengths  97
6.4 Distribution of MD Modes with Unbalanced Paths  101
6.5 Percentage of Total Bandwidth in Each Stream for Unbalanced Paths  101

Chapter 1

Introduction

The transmission of video information over error prone channels poses a number of interesting challenges. One would like to compress the video as much as possible in order to transmit it in a timely manner and/or store it within a limited amount of space. Yet, by compressing a video sequence, one tends to make it more susceptible to transmission losses and errors. Video applications ranging from high definition television down to wireless video phones all face this same tradeoff. However, best-effort networks like the Internet present a particularly harsh environment for real-time streaming video applications. In this type of environment, applications must be able to withstand inhospitable conditions including variations in available bandwidth, packet losses, and delays. Those that are unable to adapt to these conditions can suffer serious performance degradations each time the network becomes congested.

Multiple description (MD) video coding is one approach that can be used to reduce the detrimental effects caused by packet loss on best-effort networks. In a multiple description system, a video sequence is coded into two or more complementary streams in such a way that each stream is independently decodable. The quality of the received video improves with each received description, but the loss of any one of these descriptions does not cause complete failure. If one of the streams is lost or delivered late, the video playback can continue with hopefully only a slight reduction in overall quality. There have been a number of proposals for MD video coding, each providing its own tradeoff between compression efficiency and error resilience.

Previous MD coding approaches applied a single MD technique to an entire sequence. However, the optimal MD coding method will depend on many factors including the amount of motion in the scene, the amount of spatial detail, desired bitrates, error recovery capabilities of each technique, and current network conditions. This thesis examines the adaptive use of multiple MD coding modes within a single sequence. Specifically, this thesis proposes adaptive MD mode selection by allowing the encoder to select among MD coding modes in an optimized manner as a function of local video characteristics and network conditions.

The following section presents a brief introduction to video processing, establishing the terminology used throughout this thesis and providing background information necessary for discussing this work.
In the second section, we discuss the motivation behind this research and present an overview of this thesis. 1.1 Video Processing Terminology A video frame is a picture made up of a two-dimensional discrete grid of pixels or picture elements. A video sequence is a collection of frames, with equal dimensions, displayed at fixed time intervals. The dimensions of each frame are referred to as the spatial resolution, and the resolution along the temporal direction is known as the frame rate. The term macroblock is used to describe a subdivision of a frame of size 16x16 pixels. For the purposes of this research, a video stream will be defined as a sequence transmitted across the given network (e.g. the Internet, wireless connections, etc...) and viewed in real-time. This differs from video file transfer in which sequences are fully downloaded and playback only begins once the entire video sequence has been received. Buffering is the process of storing up data at the receiver before playback begins in the event that the network throughput drops temporarily. All streaming applications use some amount of buffering in order to reduce the effect of variations in network bandwidth and delay. The more buffering used, the longer it takes to initially fill that buffer, and thus, the more delay experienced at the receiver. Video file transfer is essentially the same as maximum buffering; the entire video sequence is stored at the receiver before playback begins. The scan mode is the method in which the pixels of each frame are displayed. As shown in Figure 1.1, video sequences can have one of two scan modes: progressive or interlaced. A progressive scan sequence is one in which every line of the video is scanned in every frame. This type of scanning is typically used in computer monitors, handheld devices, and high definition television displays. An interlaced sequence is one in which the display alternates between scanning the even lines and odd lines of the corresponding progressive frames. The termfield is used (rather than frame) to describe pictures scanned using interlaced scanning, with the even field containing all the even lines of one frame and the odd field containing the odd lines. Interlaced scanning is currently used in many standard television displays. The process of interlaced to progressive conversion is known as deinterlacing. -20- Introduction Chapter I Interlaced Progressive n-1 n-1 n n field frame (a) (b) Figure 1.1: Scan modes for video sequences. (a) In interlaced fields either the even or the odd lines are scanned. The solid lines represent the field that is present in the current frame. (b) In progressively scanned frames all lines are scanned in each frame. The main focus of this research will be on real-time video streaming and the difficulties presented when the network is unable to meet necessary time constraints. With this application in mind, the sequences analyzed in this work are progressively scanned sequences since the vast majority of computer and handheld displays use progressive scanning. However, it is sometimes useful to process fields independently, which is where the concepts of interlacing and deinterlacing become important. Also, the work described here could later be applied to interlaced sequences in a fairly straightforward manner. The extensive bandwidth required for transmission of raw video sequences is typically not feasible, so most systems require the use of significant video compression to reduce the amount of bandwidth needed. 
There can be a considerable amount of redundant information present in a typical video sequence in both the spatial and temporal directions. Within a single frame, each pixel is likely to be highly correlated with neighboring pixels since most frames contain relatively large regions of smoothly varying intensity. Similarly, in the temporal direction two frames are likely to be highly correlated since typical sequences do not change rapidly from one frame to the next.

There are many ways to take advantage of this redundancy in video coding. To reduce correlation along the temporal direction, most video coders use some form of motion estimation / motion compensation to predict the current frame from previously decoded frames. In this approach, the encoder estimates the motion from one frame to the next, and uses this model to generate a prediction of the next frame by compensating for the motion that has occurred. Coded blocks that depend on other frames due to the use of motion compensated prediction are referred to as inter-coded blocks; blocks that do not depend on any other frames are referred to as intra-coded. Once the temporal redundancy has been exploited, most encoders use the Discrete Cosine Transform (DCT), or some other decorrelating transform, to remove as much remaining redundancy as possible from the spatial dimension.

Despite efficient exploitation of the spatial and temporal redundancy present in typical video sequences, the resulting bandwidth is typically not low enough to allow for lossless transmission. For this reason, lossy compression algorithms are necessary for an effective transmission scheme. For the purposes of this thesis, the distortion caused by losses during data compression as well as losses during network transmission will be quantitatively measured using the peak signal-to-noise ratio (PSNR). The PSNR for a given frame is defined as

PSNR = 10 \log_{10}\left( \frac{255^2}{\mathrm{MSE}} \right)    (1.1)

where the mean square error (MSE) is the average squared difference between the original and distorted video frames, F and F_d:

\mathrm{MSE} = \frac{1}{N_1 N_2} \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} \left( F[n_1, n_2] - F_d[n_1, n_2] \right)^2    (1.2)

Here the values N_1 and N_2 represent the horizontal and vertical dimensions of the frames, and the values n_1 and n_2 are used to index each pixel location. The value 255 is used as the peak signal value since it is the maximum value encountered with 8-bit pixel representations. It should be noted that PSNR and perceived quality are not always directly correlated. Higher PSNR does not necessarily indicate better video, but the use of PSNR is a common practice and has been found to be a useful estimate of video quality.

In this thesis we have simulated network losses by using various probabilistic packet loss models. In the Bernoulli loss model, the packet losses are independent and have equal probability. Actual network losses tend to arrive in bursts, a behavior that is not captured by the Bernoulli loss model and that has been shown to significantly affect video quality [2, 27]. We use the Gilbert model to simulate the nature of bursty losses, where a packet loss is more likely if the previous packet has been lost.

Figure 1.2: Gilbert packet loss model. State 1: packet received; State 0: packet lost. With p_0 = P(current packet lost | previous packet lost) and p_1 = P(current packet lost | previous packet received), the average packet loss rate is p_1 / (1 + p_1 - p_0) and the expected burst length is 1 / (1 - p_0). Assuming p_1 < p_0, there is a greater probability the current packet will be lost if the previous packet was lost, which causes bursty losses in the resulting stream.
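As a concrete illustration of the two quantities just introduced, the short sketch below computes the PSNR of a frame following Equations (1.1) and (1.2) and draws a packet loss trace from the Gilbert model. This is an illustrative sketch only, not part of the software developed for this thesis; the function names and the example transition probabilities (chosen so that the average loss rate is roughly 5% and the expected burst length roughly 3 packets) are assumptions made here for demonstration.

```python
import numpy as np

def frame_psnr(original, distorted):
    """PSNR of an 8-bit frame, following Equations (1.1) and (1.2)."""
    diff = original.astype(np.float64) - distorted.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(255.0 ** 2 / mse)

def gilbert_loss_trace(num_packets, p0, p1, rng=None):
    """Packet loss trace from the two-state Gilbert model.

    p0 = P(current packet lost | previous packet lost)
    p1 = P(current packet lost | previous packet received)
    Returns a boolean array in which True marks a lost packet.
    """
    rng = np.random.default_rng() if rng is None else rng
    lost = np.zeros(num_packets, dtype=bool)
    prev_lost = False
    for i in range(num_packets):
        lost[i] = rng.random() < (p0 if prev_lost else p1)
        prev_lost = lost[i]
    return lost

# The empirical loss rate should approach p1 / (1 + p1 - p0) (about 5% here)
# and the mean burst length should approach 1 / (1 - p0) (about 3 packets).
trace = gilbert_loss_trace(100_000, p0=0.67, p1=0.017)
print("simulated average loss rate:", trace.mean())
```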
This can be represented by the Markov model shown in Figure 1.2 assuming p, < po. 1.2 Multiple Description Video Coding The demand for streaming video applications has grown rapidly over recent years, and by all indications this demand will continue to grow in the future. However, the majority of packet networks, like the Internet, provide only best-effort service; there are no guarantees of minimum bandwidth or delay [54]. Applications must be able to withstand changing conditions on the network or they can suffer severe performance degradations. For some applications, these problems can be reduced by using a suitable amount of buffering at the receiver. However, buffering introduces an extra delay in the system that is unacceptable for many applications such as video conferencing. This type of application requires a high degree of interaction between opposite ends of the network and places stringent demands on end-to-end delay. There exists a limit on the maximum amount of delay that can exist between two users attempting to maintain a reasonable conversation. Once this limit is exceeded, the two parties can no longer interact without significant effort. Therefore significant buffering is not an option. Even in applications where some amount of buffering is acceptable, the amount of buffering necessary in any situation is unknown ahead of time due to the time-varying properties of the network. Occasionally network links fail altogether, and there may be some extended period of time during which two nodes in the network cannot talk to one another at all. This type of outage can underflow any reasonably-sized buffer. For these reasons, current approaches for -23- Introduction Chapter I Packet Network Original MD Video Encoder MD Packet Stream 1 Reconstructed Decoder Packet Stream 2 Figure 1.3: Two stream multiple description system. The original video source is encoded into two complementary streams which are transmitted independently through the network. As long as both streams are not simultaneously lost, the remaining stream can still be decoded to achieve acceptable video quality. Reconstructed Video Original _ Video MD Decoder 1 - Good Quality Decoder 0 -+ Best Quality Decoder 2 -+ Good Quality EncoderI Figure 1.4: Classic depiction of a two stream MD coding system. The central decoder (Decoder 0) uses both descriptions to reconstruct the highest quality video. The two side decoders (Decoders 1 and 2) use only one description to generate acceptable quality video. real-time video streaming often suffer from severe glitches each time the network becomes congested. Multiple description video coding is one method that can be used to reduce the detrimental effects caused by this type of best-effort network. In a multiple description system, a video sequence is encoded into two or more complementary streams in such a way that each stream is independently decodable (see Figure 1.3). When combined, the streams provide the highest level of quality, but even independently they are able to provide an acceptable level of quality. These streams can then be sent along separate paths through the network to experience more or less -24- Introduction Chapter I ... *4 4 6 8 1.. Figure 1.5: One example of multiple description coding. Original sequence is partitioned along the temporal direction into even and odd frames. Even frames are predicted from even frames and odd from odd. If an even frame is lost (e.g. 
Frame 4), errors will propagate to other even frames, but the remaining description (the odd frames) can still be straightforwardly decoded, resulting in video at half the original frame rate. independent losses and delays. In the event that a portion of one of the streams is lost or delivered late, the video playback will not suffer a severe glitch or stop completely to allow for rebuffering. On the contrary, the remaining stream(s) will continue to be played out with only a slight reduction in overall quality. Conceptually, a two stream MD decoder can be thought of as three separate decoders, as shown in Figure 1.4. Here the central decoder (Decoder 0) is able to decode both descriptions resulting in the highest quality video. The two side decoders (Decoders 1 and 2) receive only one of the descriptions resulting in lower, but still acceptable, video quality. Perhaps the simplest example of an MD video coding system is one where the original video sequence is partitioned along the temporal direction into even and odd frames that are then independently coded into two separate streams for transmission over the network. As shown in Figure 1.5, this approach generates two descriptions, where each has half the temporal resolution of the original video. In the event that both descriptions are received, the frames from each can be decoded and interleaved together to reconstruct the full sequence. In the event one stream is lost, the remaining stream can still be straightforwardly decoded and displayed, resulting in video at half the original frame rate. Of course, this gain in robustness comes at a cost. Temporally sub-sampling the sequence lowers the temporal correlation, thus reducing coding efficiency and increasing the number of bits necessary to maintain the same level of quality per frame. Without losses, the total bit rate necessary for this MD system to achieve a given distortion will in general be higher than the -25- Introduction Chapter I corresponding rate for a single description (SD) encoder to achieve the same distortion. This is a tradeoff between coding efficiency and robustness. However, in the type of application under consideration, it is not so much a question of whether it is useful to give up some amount of efficiency for an increase in reliability as it is a question of finding the most effective way to achieve this tradeoff. It should be noted here that multiple description coding is not the same as scalable video coding. Similar to MD coding, a scalable coder encodes a sequence into multiple streams that are referred to as layers. However, scalable coding makes use of a single independent base layer followed by one or more dependent enhancement layers (see Figure 1.6). This allows some receivers to receive basic video by decoding only the base layer, while others can decode the base layer and one or more enhancement layers to achieve improved quality, spatial resolution, and/or frame rate. However, unlike MD coding, the loss of the base layer renders the enhancement layer(s) useless. In some sense, scalable coding is a special case approach to multiple description coding where it is assumed that the base layer will be delivered with absolute reliability. 
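To make the even/odd example of Figure 1.5 concrete before moving on, the sketch below splits a sequence into two temporal descriptions and reconstructs it at the receiver. This is an illustration only, not the coder used in this thesis: frames are treated as raw arrays rather than compressed streams, and a lost description is concealed by simple frame repetition, whereas far more accurate motion-compensated interpolation is possible (see Chapter 2).

```python
import numpy as np

def split_even_odd(frames):
    """Partition a sequence into two temporal descriptions (cf. Figure 1.5)."""
    return frames[0::2], frames[1::2]  # even frames, odd frames

def reconstruct(even, odd, even_ok=True, odd_ok=True):
    """Central/side reconstruction: interleave whatever was received.

    If one description is lost, the surviving one plays back at half the
    original frame rate; here each surviving frame is simply shown twice.
    """
    if even_ok and odd_ok:
        merged = []
        for e, o in zip(even, odd):
            merged.extend([e, o])
        return merged
    survivors = even if even_ok else odd
    return [f for f in survivors for _ in range(2)]

frames = [np.full((288, 352), i, dtype=np.uint8) for i in range(8)]
even, odd = split_even_odd(frames)
print(len(reconstruct(even, odd)))                 # 8 frames: central decoder
print(len(reconstruct(even, odd, odd_ok=False)))   # 8 frames, from even frames only
```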
-26- Introduction Chapter I Scalable Video Coding Reconstructed Video Base Layer Decoder Original Video Scalable Encoder Enhancement Layer Decoder P Good Quality - Best Quality Enhancement Layer Multiple Description Coding Reconstructed Video Decoder 1 Original Video MD 0 - Best Quality Decoder 2 - Good Quality Decoder Encoder Good Quality Description 2 Figure 1.6: A comparison between scalable video coding and multiple description coding. In scalable coding the enhancement layer(s) are dependent on the base layer, and therefore the enhancement layer alone is not useful. In multiple description coding, each stream is equally important, so either Description 1 or Description 2 will still yield acceptable video quality. 1.3 Thesis Motivation and Overview There have been many approaches proposed for MD coding, each providing a different tradeoff between compression efficiency and error resilience. How efficiently each method achieves this tradeoff depends on the quality of video desired, the current network conditions, and the characteristics of the video itself. Most prior work in MD coding apply a single MD method to the entire sequence; this approach is taken so as to evaluate the performance of each MD method. However, it would be more efficient to adaptively select the best MD method based on -27- Introduction Chapter I the situation at hand [22]. Since the encoder in such a system has access to the original source, it is possible to analyze the performance of each coding mode and adaptively select between them in an optimized manner. That insight has provided the main motivation for this research. Variations in both source material and network conditions make it highly unlikely that any single MD approach will be most effective under all situations. By selecting between a small number of complementary MD modes, it is possible for the system to more effectively adapt to all possible video inputs and network conditions. A number of adaptive MD approaches have been previously proposed [26, 33, 34, 47], but the concept of adaptive MD mode selection has not been fully explored. In general, previous adaptive approaches have used a single approach to MD coding, but have allowed the encoder to adjust the amount of redundancy used to match source and/or channel characteristics. Dynamically trading off compression efficiency for error resilience, in such a way, can provide significant improvements over a non-adaptive MD approach, but fundamentally each of these systems use a single MD method for an entire sequence. For instance, if the encoder in such a system encounters a block that is particularly susceptible to errors, the response taken is to increase redundancy and therefore increase the number of bits used to code this region. However, it may be more effective to use an entirely different approach for this region, which may allow the encoder to achieve the same error resilience without increasing the bitrate as significantly, if at all. The main goal of this thesis is to investigate the use of adaptive MD mode selection and better understand its applicability to error resilient video streaming. There are many different aspects of this idea that have not been fully explored. For instance, can we find a small set of complementary MD modes that is able to adapt to a variety of video sources and network conditions? If there are gains possible from adaptive mode selection, can these gains overcome the overhead necessary for adaptive processing? 
Is it even possible for the encoder to make mode selection choices in an optimized manner? We have previously suggested that the encoder can analyze the performance of each MD method, however the random nature of channel losses combined with spatial and temporal error propagation make this quite a difficult task. These are some of the questions that motivated this work. In the second chapter of this thesis, we provide a more detailed introduction to multiple description video coding and provide an overview of previous research in this field. The chapter begins with a review of MD coding techniques followed by a discussion of some of the issues -28- Introduction Chapter I that arise specifically when applying MD coding to video compression. The final section of Chapter 2 discusses some applications that are particularly well suited for the use of MD coding. Chapter 3 provides a more in-depth introduction to the concept of adaptive MD mode selection. Section 3.1 reviews the role adaptive mode selection has played throughout the history of video processing and describes some previous uses of adaptive mode selection. Section 3.2 discusses the process of optimal mode selection and provides a review of rate-distortion (R-D) theory. Finally, Section 3.3 describes how these techniques can be used for adaptive MD mode selection and also includes a discussion on R-D optimization for lossy packet networks. The use of R-D optimization over lossy channels requires the use of some form of channel modeling to estimate the effects potential losses will have on end-to-end distortion. Chapter 4 provides a review of previous attempts at this type of modeling and suggests one particular approach that can quite effectively model end-to-end distortion taking into account both the distortion due to quantization as well as the distortion due to channel losses. Chapter 5 provides an overview of the system designed to investigate the concept of adaptive MD mode selection. The system has been implemented based on the H.264 video coding standard. The first portion of Chapter 5 reviews the H.264 codec to provide the necessary background information for discussing this work. The remainder of Chapter 5 details the specific implementation of the system we have used in this thesis. The implementation described in Chapter 5 has been used to perform a number of different simulations in order to evaluate the performance and behavior of the adaptive MD mode selection system. The results of these experiments are provided in Chapter 6. We show how this approach adapts to both the local characteristics of the video and to network conditions and demonstrate the resulting gains in performance using our H.264-based adaptive MD video coder. We also analyze the sensitivity of this system to imperfect knowledge of channel conditions and explore the benefits of such a system when using both single and multiple paths. Chapter 7 summarizes the main conclusions of this thesis and describes possible future research directions. -29- -30- Chapter 2 Multiple Description Video Coding This chapter provides a more detailed introduction to multiple description video coding and provides a summary of previous research in this area. The first section discusses several techniques commonly used for multiple description coding and some background on the history of the topic. Predictive coding is used in most video coding systems to remove the temporal redundancy that exists in typical video sequences. 
This approach significantly increases the efficiency of the overall system but also introduces the possibility for error propagation. Section 2.2 discusses some of the challenges introduced by the use of predictive coding in a MD system and some of the approaches that have been used for addressing these issues. Finally, in Section 2.3, we discuss some applications that are particularly well suited for the use of MD video coding. 2.1 Multiple Description Coding Techniques The multiple description approach was originally introduced for audio coding through research done at AT&T Bell labs in the 1970s to increase the reliability of the telephone system. One early approach was suggested by Jayant [24, 25]. Here audio is partitioned along the temporal direction into even and odd samples in an attempt to improve the reliability of digital audio communications (see Figure 2.1). In this approach, if either stream is lost, the remaining stream can still be played at half the original sampling rate. Around the same time, the MD problem was introduced into the information theory community by Wyner, Witsenhausen, Wolf, and Ziv [52, 53]. This problem became very interesting from a theoretical point of view and much work has been done to analyze the problem in depth. The main focus in the information theory community has been on characterizing the multiple description region, defined as the set of all achievable operating points, under various assumptions about the statistical properties of the source. Extensive work has been done to map -31- Multiple Description Video Coding Chapter2 T r Stream 1 Original Audio Stream 2 Figure 2.1: Multiple description coding of audio using even-odd sample splitting. Each sub-sampled audio stream is encoded independently and transmitted over the network. The temporary loss of either stream can be concealed by up-sampling the correctly received stream by interpolating the missing values [24]. out achievable rate-distortion regions using multiple description codes for channel splitting [14]. The problem has many variations including generalizations to more than two channels. For some time, multiple description coding was viewed only as an interesting information theory problem. Only in recent years has the value of MD coding become apparent. The widespread use of packetized multimedia applications over best-effort networks has brought the MD problem to forefront. Using multiple description coding for packetized data can provide a powerful tool for providing error resilient packet streams. Many approaches have been suggested for multiple description coding including correlating transforms [18, 34, 48, 49], MD-Forward Error Correction (FEC) techniques [11, 32], as well as MD splitting in the spatial [43], temporal [1, 2], and transform domains [6, 8, 9, 19, 36]. Some of these methods are further discussed in the following sections. For a more in depth review of the MD problem, see the overview by Goyal [15]. 2.1.1 Multiple Description Quantization One of the early proposals for MD coding was multiple description quantization [41, 43]. Here two or more complementary quantizers are used to compress the original source. A single quantization gives a coarse reconstruction of the source. Any additional received quantizations -32- Multiple Description Video Coding Chapter 2 Quantizer 1 Quantizer 2 Combined 0 1 2 3 4 5 6 7 I I I I I I I L 0 1 2 3 4 5 6 7 1 1 1 1 1 1 1 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Figure 2.2: Multiple description scalar quantizer. 
Quantizers 1 and 2 independently describe the original source with 3 bits of accuracy. When combined together (by taking the average of the two reconstruction levels) they can provide 3.9 bits of accuracy. further refine this description. Given that quantizers are already an essential piece in any lossy compression system, making slight modifications to form MD quantizers can be an easy way to generate multiple descriptions of a source. One can design complementary quantizers that alone coarsely describe a single source, but when combined together provide a more refined description. As a simple example, consider Figure 2.2. Here the reconstruction levels from two uniform scalar quantizers independently divide the given space. Both of these are three bit quantizers and they can each provide coarse descriptions of the original source. However, when both quantizations are received, the two reconstructions can be combined to generate the 15 reconstruction levels shown below. The example shows how to use two complementary 3-bit quantizers to create a log 2 (15) = 3.9 bit combined quantizer. As with any MD approach, the example above makes a tradeoff between compression efficiency and error resilience. Using a single description coding approach with the same number of bits, the encoder could have described this source with 6 bits of accuracy. However, in general if this data had been lost there would be no way of reconstructing it. The MD approach sacrifices 2.1 bits of accuracy for an increase in error resilience. This is only one example of possible quantizers. Through proper choice of reconstruction levels, systems with more or less redundancy can be easily designed. This is one beneficial feature of MD quantization. The same concept is extendable to vector quantization and trellis coded quantization as well [12, 16, 42, 45]. -33- Multiple Description Video Coding Chapter 2 2.1.2 Spatial/Temporal MD Splitting A straightforward method of creating multiple description streams is to sub-sample a sequence along the spatial or temporal direction and encode each sub-sequence independently. The significant redundancy in video or audio data, for example, can be used quite effectively to reconstruct any missing descriptions. Figure 2.1 is an illustration of this approach applied to audio coding, but the same idea can be extended to video coding as well. The original video sequence could, for example, be partitioned temporally into even and odd frames. As shown in Figure 1.5, this approach generates two descriptions, where each has half the frame rate of the original video. In the event that both descriptions are received, the frames from each can be interleaved to reconstruct the full sequence. In the event one stream is lost, the other stream can still be straightforwardly decoded and displayed, resulting in video at half the original frame rate. One such approach has been suggested by Apostolopoulos [1]. Here the author develops a novel approach for repairing a damaged description by using a clean description through the use of sophisticated motion compensated temporal interpolation. The wealth of information present in correctly received previous and future frames can be used to accurately estimate missing frames. By filtering the motion vector fields from neighboring frames, an estimate of the motion vectors from the current frame can be obtained. Then, the data from the missing frame is estimated by interpolating along these motion vectors, while accounting for covered and uncovered regions. 
It has been shown that this approach can accurately reconstruct missing frames. However, this gain comes at a cost. In order to maintain two separate prediction loops, motion compensated prediction cannot be used with directly adjacent frames; even frames must be predicted from even frames and odd from odd. Since temporal prediction decreases in efficiency as the distance between two frames increases, these two streams are coded less efficiently than when they are coded as a single stream. -34- Chapter 2 Multiple Description Video Coding Chapter 2 MultilECdesripi-+ o odn Figure 2.3: Spatial splitting of an image. The original image is low-pass filtered using four shifted averaging filters. The outputs are then sub-sampled and independently JPEG encoded. After transmission, the loss of any one stream can be concealed quite accurately given the significant correlation with the remaining streams [46]. One approach for splitting data in the spatial direction was suggested by Wang and Chung for image coding [46]. Their algorithm creates four sub-images by filtering an image with an averaging filter and its shifted variants (see Figure 2.3). They found that this approach was extremely robust, but correspondingly very inefficient. The correlation between the four streams allows for very accurate reconstruction when one description is missing. This also greatly reduces coding efficiency, since the encoder cannot make use of this correlation to help reduce the bit rate. In the end, their encoder required nearly double the bit rate to achieve the same distortion as the single stream case in the absence of losses. 2.1.3 MD Splitting in the Transform Domain Given the inefficiencies of spatial or temporal domain splitting, many have suggested making use of the compression efficiency of decorrelating transforms, like the DCT, prior to partitioning the sequence (see Figure 2.4). By decorrelating the data first, a significant gain in compression efficiency is obtained. However, this gain comes at the cost of reconstruction quality since -35- . ..... ....... . - Multiple Description Video Coding Chapter 2 Multiple Description Video Coding Chapter 2 Figure 2.4: Transform domain multiple description splitting. Use of the decorrelating transform (e.g. DCT) prior to partitioning allows the MD encoder to take advantage of the significant spatial correlation present in a video sequence. The transformed coefficients are then partitioned, quantized, and independently entropy coded [6]. transformed coefficients are, by design, less correlated and thus more difficult to predict from one another. In image and video coding, the multiple description quantizers presented in Section 2.1.1 are essentially transform domain splitting techniques. Strictly speaking, they do not need to be used in the transform domain, and can work quite effectively in spatial/temporal domains, as is done in speech applications. In image and video coding, the transform domain is where quantization takes place, and thus, MD quantization is one approach for transform domain splitting. The only reason MD quantization appears separately in this chapter is that it was not historically developed specifically for use in the transform domain. The use of correlating transforms is another approach for transform domain splitting. In general, there exists extensive correlation between neighboring pixels of an image or video frame. 
In image or video coding the purpose of decorrelating transforms, like the DCT, is exactly that: to decorrelate the input variables and to reduce spatial correlation. This allows for much more efficient coding and significant bit rate reduction. However, by removing this correlation between transformed coefficients, it becomes very difficult to estimate missing coefficients in the event that one of the descriptions is lost. One method to help solve this problem is the use of correlating transforms [17, 49]. These transforms add back correlation between coefficients by introducing statistical redundancy. The variance of the resulting coefficients, conditioned on correctly receiving other descriptions, can be significantly reduced, which allows much more accurate estimation. Consider the following example. Let

\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}    (2.1)

where x_1 and x_2 are zero-mean independent Gaussian random variables with variances \sigma_1^2 and \sigma_2^2 respectively, and

E[y_1 y_2] = E[(A x_1 + B x_2)(C x_1 + D x_2)] = AC\,\sigma_1^2 + BD\,\sigma_2^2    (2.2)

Given that the correlation between x_1 and x_2 was 0 by definition, any appropriate choice for A, B, C, and D will increase the correlation between y_1 and y_2 relative to the correlation between x_1 and x_2. At this point, assuming y_2 has been lost, y_1 can be used to estimate x_1 and x_2. Depending on whether y_1 or y_2 was correctly received, and given that the random variables are jointly Gaussian, the optimal estimators are

\hat{x}_1 = \frac{A \sigma_1^2}{A^2 \sigma_1^2 + B^2 \sigma_2^2}\, y_1, \qquad \hat{x}_2 = \frac{B \sigma_2^2}{A^2 \sigma_1^2 + B^2 \sigma_2^2}\, y_1 \quad \text{given } y_1    (2.3)

\hat{x}_1 = \frac{C \sigma_1^2}{C^2 \sigma_1^2 + D^2 \sigma_2^2}\, y_2, \qquad \hat{x}_2 = \frac{D \sigma_2^2}{C^2 \sigma_1^2 + D^2 \sigma_2^2}\, y_2 \quad \text{given } y_2    (2.4)

The corresponding average mean squared error distortions are

\frac{(A^2 + B^2)\,\sigma_1^2 \sigma_2^2}{2(A^2 \sigma_1^2 + B^2 \sigma_2^2)} \ \text{given } y_1, \qquad \text{or} \qquad \frac{(C^2 + D^2)\,\sigma_1^2 \sigma_2^2}{2(C^2 \sigma_1^2 + D^2 \sigma_2^2)} \ \text{given } y_2    (2.5)

With appropriate choices for A, B, C, and D, these expected distortions can be made lower than the expected distortion using only x_1 and x_2, namely \sigma_2^2/2 or \sigma_1^2/2.

As always, this gain comes at a cost. The increased correlation between y_1 and y_2 will decrease the relative efficiency of entropy coding and will increase the bit rate of the stream. Also, and perhaps more important for image and video coding, this type of approach can be highly inefficient when most of the quantized coefficients are equal to zero. Most image/video coders use run-length encoding (encoding the number of consecutive zeros, not each individual zero value) to take advantage of this sparse nature of quantized coefficient data. The use of correlating transforms will generally increase the number of nonzero coefficients, which decreases the effectiveness of run-length coding and can be very costly.

In contrast to methods like the correlating transforms suggested above, which insert artificial redundancy into the transformed coefficients, a number of techniques have been developed that split the transformed coefficients directly. Using the block-based DCT and splitting coefficients in the DCT domain is one option. However, DCT coefficients are highly uncorrelated, and any attempt at reconstruction when one description is missing can leave a great deal of visual distortion. In [8] and [9], this idea is modified by using a lapped orthogonal transform. The overlapping nature of this particular transform introduces redundancy and allows for easier reconstruction in the event of lost descriptions. Bajic and Woods suggest using subband wavelet transforms rather than a DCT, allowing for more accurate reconstruction by using interpolation in the lowest frequency bands [6].
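As a numerical check on Equations (2.1) through (2.5), the sketch below applies a pairwise correlating transform to two independent Gaussian coefficients, discards one description, and compares the resulting side distortion against the closed-form expression. The transform coefficients (A = B = C = 1, D = -1) and the variances are arbitrary values chosen here for illustration; they are not taken from the thesis.

```python
import numpy as np

A, B, C, D = 1.0, 1.0, 1.0, -1.0   # y1 = x1 + x2, y2 = x1 - x2 (illustrative)
var1, var2 = 4.0, 1.0              # variances of x1 and x2 (illustrative)

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, np.sqrt(var1), 100_000)
x2 = rng.normal(0.0, np.sqrt(var2), 100_000)
y1 = A * x1 + B * x2               # description 1
y2 = C * x1 + D * x2               # description 2 (assumed lost below)

# MMSE estimates from y1 alone, Equation (2.3).
den = A**2 * var1 + B**2 * var2
x1_hat = (A * var1 / den) * y1
x2_hat = (B * var2 / den) * y1

side_mse = 0.5 * (np.mean((x1 - x1_hat) ** 2) + np.mean((x2 - x2_hat) ** 2))
theory = (A**2 + B**2) * var1 * var2 / (2.0 * den)   # Equation (2.5): 0.8 here

# Without the transform, receiving only x1 gives var2/2 = 0.5 and receiving
# only x2 gives var1/2 = 2.0; the transform equalizes both side distortions
# at about 0.8, well below the worse case and below the 1.25 average.
print(side_mse, theory)
```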
2.2 Predictive Multiple Description Coding In a typical video sequence there exists a significant amount of redundancy between one frame and the next. Thus, coding efficiency can be considerably improved by using some form of predictive coding (specifically most video coders use motion compensated temporal prediction). Predictive coding is based on the assumption that the encoder and the decoder are able to maintain the same state, meaning that the frames they use for prediction are identical. However, transmission losses can cause errors in frames at the decoder resulting in a mismatch in states between the encoder and the decoder. This state mismatch can lead to significant error propagation into subsequent frames, even if those frames are correctly received. This section discusses some of the issues predictive MD coding presents since predictive coding is an essential piece of most video coding systems. For an in-depth review of MD video coding see the overview by Wang, Reibman, and Lin [50]. In the strictest sense, each MD stream should be independently decodable and losses in one description should not affect any other descriptions. Given the use of predictive coding, accomplishing this requirement can be somewhat difficult. There are a number of approaches to predictive MD coding; some accomplish this strict independence constraint while others relax or ignore this constraint to some extent. In [50], the authors partition predictive MD coders into -38- Chapter 2 Multiple Description Video Coding three useful classes. We use these same classes here since they provide a convenient means of understanding this topic. Predictors from the first class, Class A, achieve complete independence through the use of less efficient predictors. For instance, the system proposed in [1] uses two independent prediction loops; even frames are predicted from even frames and odd frames are predicted from odd frames. This prevents losses in one description from propagating to other descriptions (e.g. the loss of an even frame will only propagate to future even frames). Another approach is to use a single prediction loop, but only predict from information known to be present in both streams [7]. Each of the approaches from Class A trade off some amount of prediction efficiency in order to maintain independence between each of the descriptions. The second class, Class B, relaxes the independence constraint in favor of using the most efficient predictors possible. In this case each prediction is generated in the same way as a single description coding scheme resulting in greater coding efficiency. However, with this approach, losses in one description can propagate to the remaining descriptions. Some systems using this approach also code the residual error to reduce the effect of mismatch, others do not. The final class of predictors, Class C, uses some combination of the first two. They trade off some of the efficiency of Class B for the increased resilience of Class A. There will be some amount of mismatch in this type of approach, but presumably less than when using only the most effective predictors (Class B). In addition, predictors from this class are often able to adapt between the two extremes, gaining more error resilience where it is most needed. Some examples of this type of system are [26] and [34]. Depending on the particular modes used, the adaptive MD mode selection proposed in this thesis can use any one of these three approaches. 
By using an end-to-end rate-distortion optimized framework, the approach proposed in this thesis can most effectively trade off efficiency for resilience to optimize the expected quality at the receiver. The particular implementation described in Chapter 5 is an example of a Class C predictor.

2.3 Applications of Multiple Description Video Coding

Multiple description coding can be useful for a wide range of video streaming applications. This section discusses a few examples where MD coding can make a significant impact on overall performance. MD coding can certainly be used to improve standard point-to-point video communications over a single path, see Figure 2.5 (a). This approach cannot handle a total outage of the single path, yet the susceptibility to packet loss may be reduced relative to single description coding. If packet losses along this path are approximately independent (Bernoulli), then any particular subset of packets sent along the path will also be lost independently of all other subsets. With this in mind, each description is lost or received independently of all other descriptions. However, packet losses are often bursty in nature. To remain effective, the MD coding approach relies on the assumption that it is unlikely that losses will occur on both descriptions. Bursty packet losses along a single transmission path can cause losses in both descriptions, which can significantly reduce the effectiveness of the MD approach. Interleaving (reordering the sequence of transmitted packets) is often used to reduce the effect of bursty packet losses. However, the delay constraints of real-time systems limit the extent to which this is possible.

While MD codes can be used to improve transmission over a single path, they are particularly well suited for use with multiple paths, see Figure 2.5 (b) and (c). In this type of approach, each description is sent along an independent path through the network to the receiver. Even if the channel experiences bursty losses along one path, path diversity makes it unlikely that both descriptions will be lost. There are a number of approaches for transmitting over multiple paths. For instance, standard point-to-point transmissions over the Internet can be modified to include multi-path routing, as in Figure 2.5 (b). The sender can explicitly route packets along separate paths by directing them to intermediate routers on their way to the receiver. Another approach is to use a streaming media content delivery network (CDN) to stream complementary descriptions from multiple senders, as shown in Figure 2.5 (c) [3, 5]. Even with this type of multiple path approach, it is often difficult to generate completely independent paths through the network. Eventually, the paths are likely to converge, resulting in two paths that are partially independent and partially shared. In [4], the authors provide a useful model for evaluating the performance of path diversity and multiple description streaming along partially independent, partially shared paths. They use this model to show the benefits of MD coding in situations ranging from fully independent paths to fully dependent paths. These models also enable one to select the best paths [3, 5].

Figure 2.5: Applications of multiple description coding. (a) Traditional point-to-point communications. (b) Point-to-point communications using multiple paths. (c) Multiple senders via Content Delivery Networks (CDN). (d) Wireless communication via multiple base stations. (e) Ad-hoc peer-to-peer wireless networks.

The use of MD coding also has significant potential in wireless applications. Individual links often fail due to interference from the environment or from other wireless devices. In addition, a single link may not be able to support the necessary bandwidth for video transmission. Thus, transmission using multiple paths in wireless applications is particularly attractive. For instance, packets could be routed through two different base stations on their way to the handheld device, as shown in Figure 2.5 (d). If one of the links begins to fail due to interference, multiple description coding can allow graceful degradation in quality, allowing time for the device to initiate communications with a third base station or wait for the interference to clear. The same approach can be used with ad-hoc peer-to-peer wireless devices, as shown in Figure 2.5 (e). Individual devices enter and exit the network sporadically due to the movement of each device, interference with other devices, or simply from being turned on and off. The use of MD coding allows the system to be more resilient to this type of dynamic network topology and to maintain reasonable video quality.

Chapter 3
Adaptive MD Mode Selection

Each approach to MD coding trades off some amount of compression efficiency for an increase in error resilience. How efficiently each method achieves this tradeoff depends on the quality of video desired, the current network conditions, and the characteristics of the video itself. Most prior research in MD coding involved the design and analysis of novel MD coding techniques, where a single MD method is applied to the entire sequence. This approach is taken so as to evaluate the performance of each MD method. However, it would be more effective to adaptively select the best MD method based on the situation at hand. Since the encoder in this type of adaptive MD mode selection system has access to the original source, it is possible to analyze the performance of each coding mode and select between MD modes in an R-D optimized manner. This chapter introduces the concept of adaptive MD mode selection in more detail and presents some of the tools that can be used to achieve it. The first section of this chapter discusses the essential role adaptive mode selection plays in video coding, and the second introduces R-D optimization techniques that can be used to accomplish optimized mode selection. The final section discusses how this thesis has applied these ideas to adaptive MD mode selection.

3.1 Adaptive Mode Selection Systems

Adaptive mode selection (AMS) has played a vital role throughout the history of video coding. Even the earliest video coding standards made use of hybrid inter/intra coding, which is fundamentally an AMS approach. This adaptation between inter-coded blocks, which are predicted from previously coded frames, and intra-coded blocks, which are coded independently of any other frames, has been shown to greatly improve video compression efficiency. The benefits of AMS should be fairly clear; adaptive processing allows the encoder to adjust to local regions of the video in order to increase its overall effectiveness.
However, there are two main tradeoffs when using adaptive mode selection. First, any implementation of AMS will require some form of additional overhead, or side information, since it is necessary for the encoder to convey to the decoder which particular mode has been used for each individual region or block. With a small number of modes, this overhead can be minor, perhaps just one or two bits per block, yet if the number of modes grows quite large, this overhead could increase significantly. Secondly, the use of multiple modes typically increases the complexity of the encoder. The encoder must somehow evaluate the performance of each available mode, which generally means that the encoder attempts each one and analyzes the results. The usefulness of any particular AMS approach is determined by comparing the gain in performance from adaptive processing with the costs of additional overhead and the increase in complexity. For some approaches, there may be little or no benefit to adaptation, so the additional overhead and complexity would be wasted. Yet in many situations the gain in performance can significantly outweigh these costs. It is interesting to point out that, in general the complexity of the decoder is not increased significantly by the use of AMS. The decoder must understand how to decode each possible mode, but only needs to use one particular mode per block. For this reason, AMS is often quite useful in a situation where the encoding is done only once and the decoding is done many times or in an application where there may be a smaller number of relatively expensive encoders and a vast number of inexpensive decoders. Adaptive mode selection can be used to achieve any number of different goals, but perhaps the two most common uses in video compression are to improve compression efficiency and to increase error resilience. One example of an AMS system that has shown significant gains in compression efficiency is the intra/inter hybrid video coding approach mentioned above. In a video sequence there often exists significant temporal correlation between neighboring frames. By using motion estimation/compensation for inter-frame prediction, the encoder can often generate a fairly accurate prediction of the next frame. Encoding the residual difference between the prediction and the original is often significantly more efficient than coding the original data. However, during scene changes or periods of significant motion, motion compensation can fail to provide an accurate prediction. In this type of situation inter-frame coding can actually be less efficient than using intra-frame coding to code the original data itself. By adaptively switching -44- Adaptive MD Mode Selection Chapter 3 between the two modes, the encoder can adapt to the situation at hand, and coding efficiency can be greatly improved. AMS is often used for increasing error resilience as well. The hybrid intra/inter coding approach happens to be one example that can be used for both purposes. In the presence of channel errors or losses, the use of inter-frame encoding can lead to the spatial and temporal propagation of errors. The use of intra-frame coding stops this error propagation and can significantly improve the error resilience of the system. However, the goal of increasing error resilience is at odds with the goal of increasing compression efficiency. At one extreme, exclusively using intra-coding would lead to the highest error resilience, yet the worst coding efficiency. 
Most hybrid intra/inter coding systems are designed to make some tradeoff between compression efficiency and error resilience. The latest video coding standard, H.264, includes many examples of AMS, including the availability of multiple intra-prediction modes, variable motion compensation block sizes, adaptive frame/field coding, and so on. The H.264 standard is discussed in further detail in Chapter 5. The adaptive format conversion (AFC) system discussed in [44] is an example of AMS for scalable video coding and was in some ways the inspiration for the current work. As discussed in the previous chapter, scalable coding and MD coding are closely related, scalable coding being a special case of MD coding. The approach taken in [44] focuses on improving compression efficiency rather than error resilience, but many of the concepts are the same as those discussed in this thesis. In this AFC approach, the base layer provides interlaced video and the enhancement layer provides progressive video. The encoder in this system adaptively selects from among a small set of deinterlacing modes in order to use the most effective mode for each block. It then transmits this mode selection information as enhancement layer data. In many situations, this small amount of enhancement information can significantly improve performance over non-adaptive deinterlacing, and this approach is often found to be more efficient than encoding the residual data itself.

3.2 Rate-Distortion Optimized Mode Selection

The main question that arises with any adaptive mode selection system is how exactly to choose one particular mode for each region. The main goal is often to minimize some distortion metric subject to a bit rate constraint. The modes resulting in the lowest distortion often use the greatest number of bits, so this rate constraint forces the encoder to make certain tradeoffs. It must select the best modes possible while still keeping the rate below a fixed level. It is the goal of optimal mode selection to determine which modes most effectively accomplish this tradeoff. The following section provides a brief summary of rate-distortion (R-D) optimization techniques that can be used for the purpose of optimal mode selection.

3.2.1 Independent Rate-Distortion Optimization

One common assumption in R-D optimization for video coding is that the distortion D_i and rate R_i can be determined independently for each block and that the decisions made for each block will not affect any other blocks. The inter- and/or intra-frame prediction techniques used in video codecs invalidate this assumption for the most part, but nonetheless this is a very common approach and it greatly simplifies this discussion. Section 3.2.2 will discuss the problems these prediction dependencies introduce and argue why the above assumption is necessary for a practical implementation of R-D optimization. The adaptive mode selection problem can be summarized as a budget-constrained allocation problem: for a given total rate budget R_{Total}, select modes for each individual block i that minimize the total distortion

\min \sum_i D_i    (3.1)

subject to the bitrate constraint

\sum_i R_i \le R_{Total}.    (3.2)

Here R_i is the total number of bits necessary to code block i and D_i is the resulting distortion. The distortion metric we have used for this problem is the mean square error (MSE) introduced in Chapter 1. The classic solution to this problem is the discrete version of Lagrangian optimization [31].
In this approach the Lagrangian cost function J_i(\lambda) is calculated as

J_i(\lambda) = D_i + \lambda R_i,    (3.3)

where \lambda is a non-negative real number used to define the acceptable tradeoff between rate and distortion. Setting \lambda to zero effectively ignores the rate and results in minimizing the distortion. Setting \lambda arbitrarily high effectively ignores the distortion and results in minimizing the bitrate. Lagrangian optimization theory states that if a particular set of modes minimizes

J(\lambda) = \sum_i (D_i + \lambda R_i),    (3.4)

then the same set of modes will also solve the above budget-constrained allocation problem for the particular case where the total rate budget R_{Total} equals the rate \sum_i R_i spent by that set of modes. If we assume that each block is independent, then minimizing

J(\lambda) = \sum_i J_i(\lambda)    (3.5)

can be rewritten as

\min \sum_i J_i(\lambda) = \sum_i \min J_i(\lambda).    (3.6)

Therefore, the total sum can be minimized by minimizing the Lagrangian cost for each individual block. The encoder in the system computes the Lagrangian cost for each mode, and the mode that minimizes this cost is selected. By doing so, the encoder guarantees that the resulting MSE is the smallest possible for that particular rate. The encoder can then adjust the value of \lambda to achieve various operating points.

Figure 3.1: Dynamic programming tree generation for a two-mode AMS system. At each stage, the encoder calculates the rate and distortion possible with each available mode. If two paths result in the same cumulative rate, the path with the higher distortion is pruned from the tree. If a path results in a cumulative rate over the allowed budget, that path is pruned from the tree. Once the entire tree is generated, the remaining path with the lowest distortion is selected.

Another approach to solving this problem is the use of dynamic programming. In this case, the encoder creates a trellis or tree of all possible outcomes. For example, consider the case of a two-mode AMS system, see Figure 3.1. The encoder has two possible choices for the first block or stage. The resulting rates and distortions for these two outcomes are computed and stored. After the second stage there are four possible outcomes, eight after the third stage, and so on. If two or more paths result in the same cumulative rate at any time, those with the higher distortion are pruned from the tree. If a path results in a cumulative rate higher than the available budget, it is pruned from the tree. After the entire tree is generated, the remaining path with the lowest distortion is selected, and the set of modes used to travel along that path is used.

Figure 3.2: Comparison between Lagrangian optimization and dynamic programming. Given the total rate constraint R_{Total}, operational point P_1 would be optimal. However, Lagrangian optimization will yield the suboptimal result P_0 since it can only achieve points on the convex hull of the rate-distortion characteristic. Dynamic programming considers all possible solutions and will select point P_1.

Compared to Lagrangian optimization, there is one main advantage of dynamic programming. Lagrangian optimization can only achieve points that lie on the convex hull of the rate-distortion curve shown in Figure 3.2. Consider the three operational points labeled P_0, P_1, and P_2, and the budget constraint R_{Total} in Figure 3.2. Given this budget constraint, the optimal operational point would be P_1. Lagrangian optimization can achieve points P_0 and P_2 since these lie on the convex hull of the rate-distortion curve. Point P_2 is above the allowed rate, so it is unacceptable. Even though point P_1 is under the allowed rate and has a lower distortion than P_0, Lagrangian optimization will not find this solution. Dynamic programming will achieve the optimal operational point since it will fill its tree with every possible solution, including point P_1. However, the complexity of the dynamic programming approach is significantly higher than using Lagrangian optimization. Even when just two modes are used, the tree can grow extremely quickly, and the memory requirements for this approach can be quite unreasonable. In addition, if there are many operational points on the convex hull, there is little or no benefit to dynamic programming.

3.2.2 Effects of Dependencies on Rate-Distortion Optimization

As mentioned in the previous section, it is common to assume that the decisions made in current blocks will have no effect on future blocks. However, in reality this is not the case. Motion compensated prediction introduces a clear dependency between frames that is not accounted for with this assumption. Similarly, most video coding standards make use of some form of intra-frame prediction (motion vector prediction, differentially coded quantization parameters, intra-pixel prediction, etc.) that will introduce dependencies within a frame as well. As mentioned above, Lagrangian optimization is a greedy approach in that it optimizes coding decisions for the current block alone. As demonstrated in Figure 3.2, it is possible to make slightly sub-optimal decisions for the current block that may leave more bits remaining for future blocks, resulting in a total solution with better performance. This same problem is increased with the introduction of prediction dependencies. For example, it may be possible to encode the current frame with more bits than would be optimal according to Lagrangian optimization. This lowers the distortion in the current frame, but leaves fewer bits for the next frame. However, since the next frame is predicted from the current frame and the current frame has a lower distortion, perhaps it is easier to code the next frame. The total result might use the same number of bits but have a lower distortion in both frames. As before, dynamic programming can be used to solve this problem in an optimal manner. Since every possible solution is considered, the effects of prediction dependencies are taken into account. However, given that prediction dependencies introduce memory into the system, fewer paths may be trimmed from the tree. Those paths that happen to merge to the same rate now also need to be in the same memory state before they can be trimmed. Nonetheless, it is possible for dynamic programming to find the optimal solution.

3.3 End-to-End R-D Optimized MD Mode Selection

The previous sections in this chapter have provided motivation for using adaptive mode selection and discussed some of the optimization tools that can be used to accomplish this task. This section discusses how these tools have been applied to the current problem of adaptive MD mode selection.
3.3.1 Lagrangian Optimization

Despite the benefits of dynamic programming mentioned in Section 3.2, this thesis uses Lagrangian optimization to perform adaptive MD mode selection. We have made this decision primarily for the three reasons listed here:

* Complexity: The sheer complexity of a dynamic programming approach makes it unreasonable for an actual implementation. Even a simple two-mode system used for a small frame size of 100 blocks results in 2^100 different branches by the end of a single frame. A few of these might be trimmed, but in general this approach requires too many computations and far too much memory.

* Delay: In order to use dynamic programming to fully account for inter-frame dependencies, it is necessary for the encoder to wait until it has received and encoded every single frame. This introduces a delay into the system that is unacceptable for many applications.

* Slight Reduction in Performance: Ignoring prediction dependencies, Lagrangian results would be fairly close to the results with dynamic programming since, in the current application, the population of operating points on the R-D curve is fairly dense (see Section 3.2.1). In terms of prediction dependencies, it has been shown that optimized dependent solutions can often be approximated using an independent approach with little loss in performance [29].

3.3.2 Rate-Distortion Optimization over Lossy Channels

The Lagrangian optimization techniques presented above can be used to minimize distortion subject to a bitrate constraint. However, this approach assumes the encoder has full knowledge of the end-to-end distortion experienced by the decoder. When transmitted over a lossy channel, the end-to-end distortion consists of two terms: 1) known distortion from quantization, and 2) unknown distortion from random packet loss. The unknown distortion from losses can only be determined in expectation due to the random nature of losses. Modifying the Lagrangian cost function to account for the total end-to-end distortion gives

J_i(\lambda) = D_i^{quant} + E[D_i^{loss}] + \lambda R_i.    (3.7)

Here R_i is the total number of bits necessary to code region i, D_i^{quant} is the distortion due to quantization, and D_i^{loss} is a random variable representing the distortion due to packet losses. Thus, the expected distortion experienced by the decoder can be minimized by coding each region with all available modes and choosing the mode that minimizes this Lagrangian cost. Calculating the expected end-to-end distortion is not a straightforward task. The quantization distortion D_i^{quant} and bitrate R_i are easily determined at the encoder. However, the channel distortion D_i^{loss} is difficult to calculate due to spatial and temporal error propagation. In the next chapter we discuss approaches for modeling expected end-to-end distortion and the extensions necessary to apply these concepts to the current problem of MD coding over multiple paths with Gilbert (bursty) losses.

Chapter 4
Modeling End-to-End Distortion over Lossy Packet Networks

As mentioned in the previous chapter, random packet losses force the encoder to model the network channel and estimate the expected end-to-end distortion, including both quantization distortion and distortion due to channel losses. With an accurate model of expected distortion the encoder can make optimized decisions to improve the quality of the reconstructed video at the decoder.
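Assuming such an estimate is available, a minimal sketch of the resulting per-macroblock selection rule of equations (3.3) and (3.7) is given below. The per-mode numbers are made up for illustration, and the mode labels anticipate the four MD modes examined in Chapter 5; this is not the thesis implementation.

```python
def select_mode(candidates, lam):
    """Pick the mode minimizing J = D_quant + E[D_loss] + lambda * R, eq. (3.7).

    candidates: list of (mode_name, d_quant, expected_d_loss, rate) tuples,
                one entry per coding mode tried for the current macroblock.
    lam:        Lagrange multiplier trading distortion for rate.
    """
    best_mode, best_cost = None, float("inf")
    for mode, d_quant, e_d_loss, rate in candidates:
        cost = d_quant + e_d_loss + lam * rate
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost

# Hypothetical per-macroblock measurements: quantization distortion, modeled
# expected loss distortion, and bits spent for each mode the encoder tried.
candidates = [
    ("SD", 40.0, 85.0, 300),
    ("TS", 46.0, 40.0, 330),
    ("SS", 55.0, 25.0, 380),
    ("RC", 42.0,  5.0, 560),
]
print(select_mode(candidates, lam=0.1))
```

Sweeping the multiplier lam traces out different operating points: small values favor the more resilient (and more expensive) modes, while large values favor the most compact mode.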
A number of approaches have been suggested in the past to estimate end-to-end distortion. This chapter discusses some of the previous work in this area and the extensions necessary to apply this work to the current problem of multiple description coding over multiple transmission paths with bursty packet losses.

4.1 Optimal Intra-Coding for Error Resilient Video Streams

The problem of optimal mode selection over lossy packet networks was originally considered for optimal intra/inter coding decisions in single description streams as a means of combating temporal error propagation. This problem has received considerable attention in the error resilience community and is also applicable to the current problem of optimal MD mode selection. Besides the obvious use of intra-coded frames (I-frames) as starting points for random access into video bitstreams, I-frames resynchronize the video prediction loop and stop error propagation. In the extreme case, a sequence consisting of all I-frames would prevent error propagation entirely. However, the use of intra-coding is inefficient given the extensive temporal correlation present in typical video sequences. Thus the intra/inter coding decision presents a tradeoff between compression efficiency and error resilience. The simplest approach for re-synchronizing the motion compensated prediction loop is the use of periodic replenishment [21]. This can be performed in any number of ways, from periodically intra-coding entire frames, to periodically coding rows of macroblocks, or even using a pseudo-random pattern of intra-coded blocks. However, this type of approach does not take into account the characteristics of the video source. Certain blocks (e.g. stationary background regions) may not need intra-coding if they can be well reconstructed even if packet losses occur. One early attempt to estimate which blocks are most susceptible is the concept of conditional replenishment presented in [21]. Here the encoder simulates the loss of each block and calculates the resulting mean squared error. If this distortion exceeds a certain threshold, the block is intra-coded. This approach is essentially a rough attempt to estimate the expected end-to-end distortion and provide content-aware intra-coding. Other approaches for this type of conditional replenishment have been proposed, including more complex sensitivity metrics and decisions that take into account packet loss rates on the network, e.g. [28]. Depending on the application and the availability of a feedback channel, it may be possible for the decoder to inform the encoder when certain loss events occur and perhaps allow the encoder to compensate for these known error events through some form of error tracking, as is done in [37]. Some early approaches to solving this problem in an R-D optimized framework appear in [23] and [10]. In [23] the authors solve the problem using Lagrangian optimization under certain simplifications; most notably they assume all motion vectors are zero. In [10] a weighted average is used in the Lagrangian cost function as an estimate of expected distortion:

J(\lambda) = (1 - p) D_q + p D_c + \lambda R.    (4.1)

Here D_q is the distortion due only to quantization, D_c is the distortion due to error concealment, and R is the rate. The variable p represents the probability of loss.
This weighted average is a reasonable estimate of expected end-to-end distortion; however, it assumes that previous frames have been properly decoded and effectively ignores error propagation. Accurately estimating the expected end-to-end distortion in a video frame is quite difficult due to spatial and temporal error propagation. Each of the above approaches provides a rough estimate of expected distortion and is able to provide significant improvements over content-unaware periodic intra-coding. However, none of these approaches truly provides an accurate measure of end-to-end distortion. The algorithm described in the next section has demonstrated significant improvement in performance and allows for accurate estimation of expected distortion on a pixel-by-pixel basis.

4.2 Recursive Optimal Per Pixel Estimate of Expected Distortion

In [56] the authors suggest a recursive optimal per-pixel estimate (ROPE) for optimal intra/inter mode selection. Here, the expected distortion for any pixel location is calculated recursively as follows. Suppose f_n^i represents the original pixel value at location i in frame n, and \tilde{f}_n^i represents the reconstruction of the same pixel at the decoder. The expected distortion d_n^i at that location can then be written as

d_n^i = E[(f_n^i - \tilde{f}_n^i)^2] = (f_n^i)^2 - 2 f_n^i E[\tilde{f}_n^i] + E[(\tilde{f}_n^i)^2].    (4.2)

At the encoder, the value f_n^i is known and the value \tilde{f}_n^i is a random variable. So, the expected distortion at each location can be determined by calculating the first and second moments of the random variable \tilde{f}_n^i. If we assume the encoder uses full-pixel motion estimation, each correctly received pixel value can be written as \tilde{f}_n^i = \hat{e}_n^i + \tilde{f}_{n-1}^j, where \tilde{f}_{n-1}^j represents the pixel value in the previous frame that has been used for motion compensated prediction and \hat{e}_n^i represents the quantized residual (in the case of intra pixels, the prediction is zero and the residual is just the quantized pixel value). The first moment of each received pixel can then be recursively calculated by the encoder as follows:

E[\tilde{f}_n^i | received] = \hat{e}_n^i + E[\tilde{f}_{n-1}^j].    (4.3)

If we assume the decoder uses frame-copy error concealment, each lost pixel is reconstructed by copying the pixel at the same location in the previous frame. Thus, the first moment of each lost pixel is

E[\tilde{f}_n^i | lost] = E[\tilde{f}_{n-1}^i].    (4.4)

The total expectation can then be calculated as

E[\tilde{f}_n^i] = P(received) E[\tilde{f}_n^i | received] + P(lost) E[\tilde{f}_n^i | lost].    (4.5)

The calculations necessary for computing the second moment of \tilde{f}_n^i can be derived in a similar recursive fashion.

4.3 Multiple Description ROPE Model

In [33] the ROPE model presented in the previous section has been extended to a two-stream multiple description system by recognizing the four possible loss scenarios for each frame: both descriptions are received, one or the other description is lost, or both descriptions are lost. For notational convenience, we will refer to these outcomes as 11, 10, 01, and 00, respectively. The conditional expectations of each of these four possible outcomes are recursively calculated and multiplied by the probability of each occurring to calculate the total expectation:

E[\tilde{f}_n^i] = P(11) E[\tilde{f}_n^i | 11] + P(10) E[\tilde{f}_n^i | 10] + P(01) E[\tilde{f}_n^i | 01] + P(00) E[\tilde{f}_n^i | 00].    (4.6)

Graphically, this can be depicted as shown in Figure 4.1 (a).
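A minimal sketch of the recursion in equations (4.2)-(4.5) is given below, under the same assumptions as in the text (integer-pixel motion and frame-copy concealment). The per-pixel bookkeeping and names are illustrative, not the thesis implementation.

```python
def rope_update(prev_m1, prev_m2, residual, ref_m1, ref_m2, p_loss):
    """One pixel of the ROPE recursion, equations (4.3)-(4.5).

    prev_m1, prev_m2 : first/second moments of the decoder reconstruction of
                       the co-located pixel in frame n-1 (used for concealment).
    residual         : quantized residual e for this pixel.
    ref_m1, ref_m2   : moments of the motion-compensated reference pixel used
                       for prediction (equal to prev_m1/prev_m2 if the motion
                       vector is zero).
    p_loss           : probability that the packet carrying this pixel is lost.
    """
    # Received: the decoder adds the residual to the reference pixel.  (4.3)
    m1_rx = residual + ref_m1
    m2_rx = residual**2 + 2.0 * residual * ref_m1 + ref_m2
    # Lost: frame-copy concealment from the co-located pixel.          (4.4)
    m1_lost, m2_lost = prev_m1, prev_m2
    # Total expectation over the loss event.                           (4.5)
    m1 = (1.0 - p_loss) * m1_rx + p_loss * m1_lost
    m2 = (1.0 - p_loss) * m2_rx + p_loss * m2_lost
    return m1, m2

def expected_distortion(orig, m1, m2):
    """Per-pixel expected MSE, equation (4.2)."""
    return orig**2 - 2.0 * orig * m1 + m2

# Illustrative single-pixel update:
m1, m2 = rope_update(prev_m1=120.0, prev_m2=14500.0, residual=3.0,
                     ref_m1=118.0, ref_m2=14000.0, p_loss=0.1)
print(expected_distortion(orig=123.0, m1=m1, m2=m2))
# In the two-description case the same conditional moments are formed for the
# four outcomes 11, 10, 01, 00 and combined with their probabilities, eq. (4.6).
```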
The first moments of the random variables \tilde{f}_{n-1}^i, as calculated in the previous frame, are used to calculate the four intermediate expected outcomes, which are then combined together using equation (4.6) and stored for future frames. Again, the second moment calculations can be computed in a similar manner.

4.4 Extended ROPE Model

These previous methods have assumed a Bernoulli independent packet loss model, where the probability that a packet is lost is independent of any other packet. However, the idea can be modified for a channel with bursty packet losses as well. Recent work has identified the importance of burst length in characterizing error resilience schemes, and has shown that examining performance as a function of burst length is an important feature for comparing the relative merits of different error resilient coding methods [2, 4, 27].

Figure 4.1: Conceptual computation of first moment values in the MD ROPE approach. (a) Bernoulli case: the moment values from the previous frame are used to compute the expected values in each of the four possible outcomes, which are then combined to find the moment values for the current frame. (b) Gilbert losses: due to the Gilbert model, the probability of transitioning from any one outcome at time n - 1 to any other outcome at time n changes depending on which outcome is currently being considered. Thus, the four expected outcomes cannot be combined into one single value as was done in the Bernoulli case. Each of these four values must be stored separately for future calculations.

Figure 4.2: Gilbert packet loss model (state 1: packet received; state 0: packet lost), where p_1 is the probability that a packet is lost given the previous packet was received and p_0 is the probability that a packet is lost given the previous packet was lost. The average packet loss rate is p_1 / (1 + p_1 - p_0) and the expected burst length is 1 / (1 - p_0). Assuming p_1 < p_0, there is a greater probability the current packet will be lost if the previous packet was lost. This causes bursty losses in the resulting stream.

For this thesis we have extended the MD ROPE approach to account for bursty packet loss. Here we use a two-state Gilbert loss model, but the same approach could be used for any multi-state loss model, including those with fixed burst lengths. We use the Gilbert model to simulate the nature of bursty losses, where packet losses are more likely if the previous packet has been lost. This can be represented by the Markov model shown in Figure 4.2, assuming p_1 < p_0. The expected value of any outcome in a multi-state packet loss model can be calculated by computing the expectation conditioned on transitioning from one outcome to another, multiplied by the probability of making that transition. For the two-state Gilbert model, this idea can be roughly depicted as shown in Figure 4.1 (b). For example, assume T_A^B represents the event of transitioning from outcome A at time n - 1 to outcome B at time n, and P(T_A^B) represents the probability of making this transition. Then the expected value conditioned on outcome 11 can be computed as shown in (4.7).
E[\tilde{f}_n^i | 11] = P(T_{11}^{11}) E[\tilde{f}_n^i | T_{11}^{11}] + P(T_{10}^{11}) E[\tilde{f}_n^i | T_{10}^{11}] + P(T_{01}^{11}) E[\tilde{f}_n^i | T_{01}^{11}] + P(T_{00}^{11}) E[\tilde{f}_n^i | T_{00}^{11}]    (4.7)

The remaining three outcomes can be computed in a similar manner. Due to the Gilbert model, the probability of transitioning from any outcome at time n - 1 to any other outcome at time n changes depending on which outcome is currently being considered. For instance, when computing the expected value conditioned on outcome 00, the result when both streams are lost, the probability that the previous outcome was 10, 01, or 00 is much higher than when computing the expected value conditioned on outcome 11. Since the transition probabilities vary from outcome to outcome, it is not possible to combine the four expected outcomes into one value as can be done in the Bernoulli case. The four values must be stored separately for future use, as shown in Figure 4.1 (b). Once again, the second moment values can be computed using a similar approach.

The above discussion assumed full-pixel motion vectors and frame-copy error concealment, but it is possible to extend this approach to sub-pixel motion vector accuracy and more complicated error concealment schemes. As discussed in [56], the main difficulty with this arises when computing the second moment of pixel values that depend on a linear combination of previous pixels. The second moment depends on the correlations between each of these previous pixels and is difficult to compute in a recursive manner. We have modified the above approach in order to apply it to the H.264 video coding standard with quarter-pixel motion vector accuracy and more sophisticated error concealment methods by using the techniques proposed in [55] for estimation of cross-correlation terms. Specifically, each correlation term E[XY] is estimated by

E[XY] = E[X] \frac{E[Y^2]}{E[Y]}.    (4.8)

Chapter 5
MD Mode Selection System

This chapter discusses the implementation of the adaptive MD mode selection system we have used to examine the concept of adaptive MD mode selection. These details are provided to allow for a better understanding of the results presented in Chapter 6. However, it should be noted that this is simply one approach that we have used to illustrate the concept of adaptive MD mode selection. We make no claim that this is the optimal implementation of such a system, and in fact that is highly unlikely. The system used for this work has been based on the H.264 video coding standard. This choice was made since, at the time of publication, H.264 was the most advanced video coding standard. H.264 makes use of state-of-the-art video coding techniques and has been shown to significantly increase coding efficiency relative to previous coding standards. In addition, by using H.264, the results presented in this thesis can more easily be compared against current and future work. The first section of this chapter presents an overview of H.264, providing a short history of the standard and the details necessary for discussing the implementation of the adaptive MD mode selection system. The second section provides a detailed explanation of the specific MD mode selection system developed for this thesis.

5.1 MPEG4-AVC / H.264 Video Coding Standard

The H.264 video coding standard has been developed as a joint effort between the ITU and ISO/IEC standards committees.
Originally referred to as H.26L, H.264 began as a long-term effort by the ITU in 1998 with the goal of doubling the compression efficiency of previous video coding standards. In 2001, the ITU and ISO joined together to form the Joint Video Team (JVT) and developed the standard formally referred to as MPEG4-Advanced Video Coding (AVC) or H.264. Previous joint efforts between the ISO and ITU have been very successful, including the JPEG standard used extensively for image compression and the MPEG2 video coding standard used in a vast number of applications including DVD and HDTV.

Figure 5.1: H.264 video encoder architecture.

Figure 5.2: H.264 video decoder architecture.

Figure 5.3: 4x4 intra-prediction modes. The encoder can use pixels A-M to predict the current 4x4 block of data, pixels a-p.

Figure 5.4: Two of the nine available 4x4 intra-prediction modes. In the vertical mode (Mode 0) the pixels above are copied downward to predict the current block. In the horizontal mode (Mode 1), the pixels to the left are copied instead. The remaining seven modes copy and/or average neighboring pixels in various orientations.

At its core, the H.264 codec remains very similar to previous coding standards (MPEG2, MPEG4, H.261, H.263, etc.). It is a block-transform hybrid video coding approach using motion estimation/compensation to reduce temporal redundancy and a DCT to reduce spatial redundancy. Furthermore, entropy coding is used to represent the remaining information with as few bits as possible. The following sections highlight some of the main aspects of the H.264 video codec, specifically focusing on the differences between H.264 and previous video coding standards. Figures 5.1 and 5.2 provide an overview of the structure of the H.264 video encoder and decoder.

Figure 5.5: 16x16 intra-prediction modes. The encoder can use the 32 pixels represented by the black dots above to predict the current 16x16 pixel macroblock.

5.1.1 Intra-Frame Prediction

One significant change in H.264 relative to previous standards is the use of extensive intra-frame prediction. The use of intra-prediction allows the encoder to predict intra-coded blocks in the current frame from previously coded pixels in the same frame. MPEG2 and MPEG4 both had simple intra-prediction (DC value prediction, etc.), but the intra-prediction in H.264 is quite extensive. There are 13 different intra-prediction modes available in H.264: nine 4x4 prediction modes and four 16x16 modes. The nine 4x4 modes use the 17 neighboring pixels (pixels A-M in Figure 5.3) in various manners to predict the current 4x4 block of data (pixels a-p). As an example, two of the nine 4x4 modes are shown in Figure 5.4.
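A minimal sketch of these two modes, using the pixel labels of Figures 5.3 and 5.4, is shown below; the sample values are illustrative only.

```python
import numpy as np

def intra4x4_vertical(above):
    """Mode 0 (vertical): copy the four pixels above (A-D) down each column."""
    return np.tile(np.asarray(above, dtype=np.int32), (4, 1))

def intra4x4_horizontal(left):
    """Mode 1 (horizontal): copy the four pixels to the left (I-L) across each row."""
    return np.tile(np.asarray(left, dtype=np.int32).reshape(4, 1), (1, 4))

above = [52, 55, 61, 66]   # pixels A-D from the row above the block
left  = [50, 53, 57, 60]   # pixels I-L from the column to the left
print(intra4x4_vertical(above))
print(intra4x4_horizontal(left))
```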
Since there is likely a strong correlation between the modes used in the current block and those used in neighboring blocks, each of these 4x4 modes is predicted from neighboring blocks to further improve coding efficiency. In a similar manner, the four 16x16 modes use the 32 neighboring pixels shown in Figure 5.5 to predict each 16x16 block of data. These four modes are mainly intended for the prediction of smooth regions of the video containing little or no detail. -64- MD Mode Selection System Chapter 5 Macroblock Divisions 16x16 16x8 8x16 8x8 8x8 Block Subdivisions 8x8 4x8 8x4 4x4 Figure 5.6: Macroblock partitions for block motion estimation. Each macroblock may be partitioned into each of the four patterns shown on the top row. Furthermore, those blocks of size 8x8 may be further partitioned into the four patterns shown on the bottom row. 5.1.2 Hierarchical Block Motion Estimation As with most prior video coding standards, H.264 uses motion estimation/compensation to remove a significant amount of temporal redundancy from the original sequence. As a hybrid video coder, the encoder can select between intra and inter coding on a block by block basis. This decision is represented by the switch in Figure 5.1. Those blocks which are intra coded are predicted using intra-frame prediction mentioned in the previous section. The remaining blocks are inter-coded using motion compensated prediction from previous frames. For motion estimation/compensation, the H.264 encoder can choose from the four different macroblock divisions shown in the top half of Figure 5.6. Any 8x8 sized blocks may be further subdivided into any of the four partitions shown in the bottom half of Figure 5.6. The encoder must assign a translational motion vector to each of these partitions and transmit these motion vectors to the decoder. The decoder then selects data from previously decoded frames and offsets it in accordance with the received motion vectors to generate its prediction of the current block. Partitioning a macroblock into smaller blocks (e.g. 4x4 blocks) obviously allows -65- Chapter5 MD Mode Selection System n-3 n-2 n-1 n Figure 5.7: The H.264 encoder can use multiple reference frames to generate motion compensated predictions of the current frame. much more flexibility than larger partitions and will likely lead to better prediction. However, the use of smaller partitions requires the encoder to transmit many more motion vectors, which may or may not outweigh any gains in performance. 5.1.3 Multiple Reference Frames Typically, it is most efficient for the encoder to predict the current frame from the directly previous frame. However, this is not always the case. Occasionally, it can be more efficient to predict from 2 or more frames back. In H.264, more than one reference frame can be used for motion compensated prediction. This idea is illustrated in Figure 5.7. Every block within a MB (16x 16, 8x16, 16x8, 8x8) can use their own frame. Blocks smaller than size 8x8 (4x8, 8x4, 4x4) all use the same reference frame. The use of multiple reference frames especially helps to solve the "uncovered-background" problem where a background region may have been visible two or more frames back but was temporarily hidden by a moving object in the previous frame. If the encoder only considers the previous frame it will not likely find a good match for this uncovered region since it was hidden at that point. However, if the encoder looks back more than one frame, this background region would again be visible. 
Figure 5.8: Six-tap interpolation filter used to generate half-pixel locations.

Using multiple reference frames generally leads to an increase in efficiency, but it increases the complexity of the motion vector search and requires a significant amount of memory for frame storage. The encoder in H.264 can use up to 16 previous frames for reference, although using all 16 is not a requirement. The encoder signals to the decoder how many frames will be used at the beginning of the sequence so the decoder knows how much memory it will need to make available. The use of multiple reference frames may also be used by the encoder for error resilience purposes. For instance, the temporal splitting MD approach presented in Figure 1.5 may be implemented simply by forcing the encoder to exclusively use two-back temporal prediction.

5.1.4 Quarter-Pixel Motion Vector Accuracy

Each motion vector in H.264 uses sub-pixel motion vector accuracy; specifically, H.264 uses quarter-pixel accuracy for each motion vector. This allows the encoder to more effectively compensate for non-integer pixel motion in the sequence. The six-tap interpolation filter shown in Figure 5.8 is used to generate half-pixel samples, and a two-point averaging filter is used to generate quarter-pixel locations.

Figure 5.9: (a) Coded frame prior to applying the deblocking filter. (b) Resulting frame after deblocking. Adaptive filtering is applied at block boundaries to reduce the appearance of blocking artifacts.

5.1.5 In-Loop Deblocking Filter

At lower bitrates, block-based compression can lead to blocking artifacts in coded frames. If the coefficients are too coarsely quantized, discontinuities can appear at the boundaries of blocks in regions that should have been smoothly varying, see Figure 5.9 (a). The H.264 codec uses an in-loop deblocking filter to help remove these blocking artifacts. The filtering is done using a one-dimensional filter along block edges, where the strength of the filtering is adapted to account for the quantization levels used in that region and for the local activity in the neighborhood of the boundary. The results after filtering Figure 5.9 (a) are shown in (b). This process is referred to as in-loop filtering since it is actually used within the motion compensated prediction loop, see Figure 5.1. By using filtered frames for motion compensation the encoder is often able to produce more efficient predictions, resulting in further improvements in coding efficiency. Typical results show bitrate reductions of 5-10% from using this filter at a fixed quality level [40].

Symbol   Codeword
0        1
1        010
2        011
3        00100
4        00101
5        00110
6        00111
7        0001000
8        0001001
9        0001010
10       0001011
11       0001100
12       0001101
13       0001110
14       0001111

Table 5.1: Exponential-Golomb codebook for encoding all syntax elements not encoded using CAVLC or CABAC encoding.

5.1.6 Entropy Coding

After motion compensation, DCT transformation, and quantization of residual transform coefficients, the compressed video information must be converted to specific codewords (strings of 1s and 0s) before being placed into the output bitstream. Entropy coding is used to represent this compressed video information (motion vectors, quantized residual data, etc.) with as few bits as possible.
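The exponential-Golomb construction behind Table 5.1, which the UVLC entropy coding mode described next relies on, can be sketched as follows. This is an illustration of the codebook itself, not the reference software implementation.

```python
def exp_golomb_encode(symbol):
    """Return the Exp-Golomb codeword for a non-negative symbol (Table 5.1)."""
    value = symbol + 1
    bits = bin(value)[2:]                   # binary form, always starts with '1'
    return "0" * (len(bits) - 1) + bits     # N leading zeros, then the N+1 bits

def exp_golomb_decode(bitstring):
    """Decode one codeword: count N leading zeros, read N+1 bits, subtract 1."""
    n = 0
    while bitstring[n] == "0":
        n += 1
    return int(bitstring[n:2 * n + 1], 2) - 1

# Reproduces the first entries of Table 5.1:
for s in range(7):
    print(s, exp_golomb_encode(s))
assert exp_golomb_decode("00101") == 4
```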
The H.264 coding standard provides two different entropy coding options, universal variable length coding (UVLC) or context-adaptive binary arithmetic coding (CABAC). The less complex UVLC approach uses exponential Golomb codes for all syntax elements except for transform coefficients. Each syntax element is assigned an index or symbol number and associated codeword, with more likely outcomes assigned to shorter codewords. The first few entries in the exp-Golomb codebook are shown in Table 5.1. Each codeword has N-zeros followed by a '1' followed by N-bits of data allowing for a very simple decoding structure. The residual transform coefficients are encoded using context-adaptive variable length coding (CAVLC). These adaptive variable length codes are similar to Huffman codes that adapt to -69- Chapter 5 MD Mode Selection System previously transmitted coefficients from earlier blocks so as to more closely match the statistical properties of this particular video frame and more efficiently code the resulting data. The second entropy coding mode in H.264 uses context-adaptive binary arithmetic coding (CABAC). This approach provides increased efficiency relative to the CAVLC approach, yet requires significantly higher complexity. Arithmetic coding in a sense allows for joint encoding of all the syntax elements from a frame allowing the encoder to assign non-integer length codewords to syntax elements, including the possibility of using less than one bit per syntax element. CABAC encoding is also used on a much wider range of syntax elements unlike CAVLC which is only used on residual transform coefficients. For further details on CABAC encoding see [20, 30, 39]. 5.1.7 H.264 Performance The original goal for H.264 was to improve coding efficiency by a factor of two over previous coding standards. Subsequently, a number of performance comparisons have been made to estimate how successful the H.264 standard has been in achieving this goal [13, 35, 39, 40, 51]. In [35], the authors have encoded a set of nine CIF (Common Intermediate Format - 288 lines x 352 columns) and QCIF (Quarter Common Intermediate Format - 144 lines x 176 columns) resolution sequences with H.264, MPEG4, H.263, and MPEG2 at a number of different bitrates. While maintaining the same quantitative quality level (PSNR), the results with H.264 show an average bitrate reduction of 39% relative to MPEG4, 49% relative to H.263, and 64% relative to MPEG2. Given that quantitative measures are not perfectly correlated with human perceptions of video quality, a number of subjective tests have been performed as well. Perceptual tests reported in [13] with a large test set ranging from QCIF to HD resolutions indicate that in roughly 65% of the cases H.264 is able to increase compression efficiency by a factor of two or more. The results of these performance evaluations and others like them have piqued the interest of a number of industries. H.264/AVC is currently being integrated into a number of applications including video phones, video conferencing, streaming video, HD DVD, satellite TV, and broadcast HDTV (in some countries). It is our belief that H.264 will be widespread in the near future which is why we have elected to use H.264 for this research. -70- MD Mode Selection System Chapter 5 5.2 MD System Implementation The previous section has introduced the portions of the H.264 standard that are most relevant to our discussion of adaptive MD mode selection. 
This section provides further details on the modifications we have made to implement MD mode selection. The system developed for this thesis uses H.264 reference software version 8.6 with the necessary modifications to support adaptive mode selection. The adaptive mode selection is performed on a macroblock-by-macroblock basis using the Lagrangian optimization techniques discussed in Chapter 3 along with the expected distortion modeling from Chapter 4. Note that this optimization is performed simultaneously for both traditional coding decisions (e.g. inter versus intra coding) as well as for selecting one of the possible MD modes.

Due to the in-loop deblocking filter in H.264, output pixels from the current macroblock will depend on neighboring macroblocks, including blocks that have yet to be coded. This dependence on pixels that have yet to be coded presents a causality problem for the encoder. The deblocking filter has been turned off in our experiments to remove this causality issue and simplify the problem. The system uses UVLC entropy coding with quarter-pixel motion vector accuracy and all available intra- and inter-prediction modes. However, we have used an option in H.264 referred to as constrained intra-prediction. When used, constrained intra-prediction prevents the encoder from using inter-coded blocks for intra-frame prediction. That is to say, intra-macroblocks are predicted only from other intra-macroblocks. By using this option, the process of intra-prediction is less efficient; however, this prevents intra-frame error propagation. Due to the use of motion compensated prediction, errors in previous frames can propagate into inter-coded blocks in the current frame. If these inter-coded blocks are then used for intra-frame prediction, these errors will then also propagate spatially throughout intra-coded blocks of the frame. It would be possible to modify the ROPE model to account for unconstrained error propagation by adding a second random term to the intra-pixel calculations. However, the use of intra-coding in error-resilient video streaming is mainly intended for restoring the prediction loop and ending any error propagation, and by allowing intra-frame error propagation we would essentially defeat this purpose. For this reason, we have elected to use constrained intra-prediction.

Figure 5.10: Examined MD coding methods. (a) Single Description Coding: each frame is predicted from the previous frame in a standard manner to maximize compression efficiency. (b) Temporal Splitting: even frames are predicted from even frames and odd from odd. (c) Spatial Splitting: even lines are predicted from even lines and odd from odd. (d) Repetition Coding: all coded data is repeated in both streams.

5.2.1 Examined MD Coding Modes

This thesis explores the concept of adaptive MD mode selection, in which the encoder switches between different coding modes within a sequence in an intelligent manner. To illustrate this idea, the system discussed here uses a combination of four simple MD modes: single description coding (SD), temporal splitting (TS), spatial splitting (SS), and repetition coding (RC), see Figure 5.10.
This section describes each of these methods and their relative advantages and disadvantages. Single description (SD) coding represents the typical coding approach where each frame is predicted from the previous frame in an attempt to remove as much temporal redundancy as -72- Chapter 5 MD Mode Selection System possible. Of all the methods presented here, SD coding has the highest coding efficiency and the lowest resilience to packet losses. On the other extreme, repetition coding (RC) is similar to the SD approach except the data is transmitted once in each description. This obviously leads to poor coding efficiency, but greatly improves the overall error resilience. As long as both descriptions of a frame are not lost simultaneously, there will be no effect on decoded video quality. The remaining two modes provide additional tradeoffs between error resilience and coding efficiency. The temporal splitting (TS) mode effectively partitions the sequence along the temporal dimension into even and odd frames. Even frames are predicted from even frames and odd frames from odd frames. Similarly, in spatial splitting (SS), the sequence is partitioned along the spatial direction into even and odd lines. Even lines are predicted from even lines and odd from odd. Table 5.2 presents an overview of the relative advantages and disadvantages of each mode. We chose to examine these particular modes for the following reasons. First, these methods tend to complement each other well with one method strong in situations where another method is weak, and vice versa. This attribute will be further illustrated in Chapter 6. Secondly, each MD mode makes a different tradeoff between compression efficiency and error resilience. This set of modes examines a wide range of the compression efficiency/error resilience spectrum, from most efficient single description coding to most resilient repetition coding. Finally, these approaches are all fairly simple both conceptually and from a complexity standpoint. Conceptually, it is possible to quickly understand where each one of these modes might be most or least effective, and in terms of complexity, the decoder in this system is not much more complicated than the standard video decoder. It is important to note that additional MD modes of interest may be straightforwardly incorporated into the adaptive MD encoding framework and the associated models for determining the optimized MD mode selection. In addition, it is also possible to account for improved MD decoder processing that may lead to reduced distortion from losses (e.g. improved methods of error recovery where a damaged description is repaired by using an undamaged description [1, 2]), and thereby effect the end-to-end distortion estimation performed as part of the adaptive MD encoding. Note that when coded in a non-adaptive fashion, each method (SD, TS, SS, RC) is still performed in an R-D optimized manner as mentioned above. All of the remaining coding decisions, including inter versus intra coding, are made to minimize the end-to-end distortion. For instance, the RC mode is not simply a straightforward replica of the SD mode. The system -73- MD Mode Selection System Chapter5 MD Mode Description Advantages Disadvantages SD Single Description Coding Highest coding efficiency of all methods. Least resilience to errors TS Temporal Splitting Good coding efficiency with better error resilience than SD coding. Works well in regions with little or no motion. 
recognizes the improved reliability of the RC mode and elects to use far less intra-coding, allowing a more intelligent allocation of the available bits.

5.2.2 Data Packetization

The H.264 reference software has an integrated RTP packetization structure. We have made the assumption that a single frame is placed entirely within one packet per stream. The experiments we have performed each use QCIF resolution video (see Section 6.1), which consists of relatively small coded frames, so this assumption is reasonable.

The packetization of data differs slightly for each mode (see Figure 5.11). In both the SD and TS approaches, all data for a frame is placed into a single packet. The even frames are then sent along one stream and the odd frames along the other. In the SS and RC approaches, on the other hand, data from a single frame is coded into packets placed in both streams. For SS, even lines are sent in one stream and odd lines in the other, while for RC all data is repeated in both streams. Therefore, for SD and TS each frame is coded into one large packet that is sent in alternating streams, while for SS and RC each frame is coded into two smaller packets and one small packet is sent in each stream. Since the adaptive approach (ADAPT) is a combination of these four methods, there is typically one slightly larger packet and one smaller packet, and these alternate between streams from frame to frame.

Figure 5.11: Packetization of data in MD modes. a.) SD and TS: data sent along one path, alternating between frames. b.) SS and RC: data spread across both streams. c.) ADAPT: combination of the two, resulting in one slightly larger packet and one slightly smaller.

If a frame is lost in either the TS or SD method, no data exists in the opposite stream at the same time instant, so the missing data is estimated by directly copying the associated pixels from the previous frame. Note that here we copy from the previous frame in either description, not the previous frame in the same description. In the SS method, if only one description is lost the decoder estimates the missing lines in the frame using linear interpolation, and if both are lost it estimates the missing frame by copying the previous frame. Similarly for RC, if only one description is lost the decoder can use the data in the opposite stream, while if both are lost it copies the previous frame.

5.2.3 Discussion of Modifications Not in Compliance with the H.264 Standard

Many of the changes discussed above are fully compliant with the H.264 standard; others are not.
It would be possible to implement a fully standard-compliant system (for instance, the temporal splitting mode was implemented using the standard-compliant reference frame selection available in H.264), but that was not our main concern at this point. There were essentially three main changes that are not in compliance with the H.264 standard: macroblock-level adaptive interlaced coding, prevention of intra-frame error propagation between MD modes, and the redefinition of skipped macroblocks for the TS mode. These are discussed below.

The first significant modification was the ability to support macroblock-level adaptive interlaced coding in order to accommodate the spatial splitting (SS) mode. The H.264 standard allows for adaptive frame/field coding, but only on a macroblock-pair basis. In H.264 macroblock-adaptive frame/field (MBAFF) coding, each vertical pair of macroblocks is coded in either frame or field mode. This macroblock-pair processing prevented the use of macroblock-level MD mode selection. In addition, it was not possible to use MBAFF coding with QCIF resolution video since QCIF video contains 9 macroblock rows, a number that cannot be evenly divided into pairs.

Secondly, a few modifications were made to prevent intra-frame propagation of errors between MD modes. Consider the case of an RC block surrounded by SD blocks. The intention of repetition coding is that this data can be properly decoded even if one of the streams is lost. However, the H.264 codec contains many intra-frame predictions, including motion vector prediction and intra-prediction. If the RC block is predicted in any manner from the surrounding SD blocks, the loss of one stream could alter this surrounding data and propagate these errors into the RC block. The same situation is true for SS coding as well: if one of the streams is lost, the field in the opposite stream should still be decodable. However, if this field is predicted from neighboring blocks, errors in these neighbors can propagate into the current field. For this reason we have only allowed macroblocks to be predicted from other macroblocks using the same MD coding mode. Each neighboring macroblock that uses a different MD mode is considered unavailable for prediction, to prevent this propagation of errors between modes. Note that this approach does not prevent propagation of errors from previous frames. For example, it is still possible for an inter-coded RC block to be corrupted by errors that occurred in previous frames. Additionally, for the SS approach we have prevented blocks in the bottom field from being predicted from blocks in the top field to ensure errors in the top field do not propagate to the bottom field.

The last modification was a slight redefinition of skipped macroblocks for the TS mode. This modification was only used when the TS mode was used by itself. When a macroblock is skipped, or not coded, the decoder typically copies a 16x16 block of data from the previous frame. However, the main concept behind the TS mode was to code even and odd frames separately. With this in mind, the skip mode in the TS approach has been redefined such that the decoder copies a block of data from two frames back instead, thereby maintaining this separation.

Chapter 6
Experimental Results and Analysis

This chapter presents a number of results that have been obtained using the modified H.264 codec described in Chapter 5.
It is again important to point out that this is one particular implementation of an adaptive MD mode selection system and that different realizations of such a system could yield vastly different results. The results presented here are intended to demonstrate that there can be significant gains from using adaptive MD mode selection. However, with a different set of MD modes, these gains could increase or diminish.

To estimate the actual distortion experienced at the decoder, we have simulated bursty packet losses with packet loss rates and expected burst lengths as specified in each section below. Unless otherwise stated, we have run each simulation with 300 different packet loss traces and have averaged the resulting squared-error distortion. The same packet loss traces were used throughout a single experiment to allow for meaningful comparisons across the different MD coding methods. Each path in the system is assumed to carry 30 packets per second, and the packet losses on each path are modeled as a Gilbert process. For wired networks, the probability of packet loss is generally independent of packet size, so the variation in packet sizes should not affect the results or the fairness of this comparison. When the two paths are balanced, or symmetric, the optimization automatically sends half the total bitrate across each path. For unbalanced paths, the adaptive system results in a slight redistribution of bandwidth, as discussed in Section 6.6.

In order for the ROPE model to estimate the expected end-to-end distortion, the encoder needs some knowledge of current network conditions. We assume a feedback channel exists so that the receiver can transmit information (e.g. notification of packet losses) back to the sender, allowing the sender to form an approximation of the current network conditions. Unless otherwise stated, each of these experiments assumes the encoder has perfect knowledge of both the average packet loss rate and the expected burst length. The effects of imperfect channel knowledge are further explored in Section 6.7.

In each of these experiments, the encoder is run in one of two different modes: constant bitrate encoding (CBR) or variable bitrate encoding (VBR). In the CBR mode, the quantizer and associated lambda value are adjusted on a macroblock basis in an attempt to keep the number of bits used in each frame approximately constant. Keeping the bitrate constant allows a number of useful comparisons between methods on a frame-by-frame basis, such as those presented in Figure 6.5. Unfortunately, the changes in quantizer level must be communicated along both streams in the adaptive approach, which leads to somewhat significant overhead. While this signaling information is included in the bitstream, the amount of signaling overhead is not incorporated in the R-D optimization decision process, leading to potentially sub-optimal decisions with the adaptive approach. We mention this because, if all of the overhead were accounted for in the R-D optimized rate control, the performance of the adaptive method would be slightly better than shown in the current results; the CBR results are therefore lower bounds on the achievable CBR performance. In the VBR mode, the quantizer level is held fixed to provide constant quality. In this case there is no quantizer overhead, and this approach yields results closer to the optimal performance.
Since the rates of each mode may vary when in VBR mode (where the quantizer is held fixed), it is not possible to make a fair comparison between different modes at a given bit rate. Therefore, in experiments where we try to make fair comparisons among different approaches at the same bit rate per frame we operate in CBR mode (e.g. Figure 6.5), and we use VBR mode to compute rate-distortion curves like those shown in Figure 6.12.

6.1 Test Sequences

The following two test sequences have been used to provide the experimental results presented in subsequent sections. We have run these experiments on other sequences as well, with similar results, but present only these two in the interest of space. Both are progressively scanned video sequences with QCIF resolution (144 lines x 176 columns) at a frame rate of 30 frames per second. The Foreman sequence consists of 400 frames and the Carphone sequence consists of 382 frames. The first frame from both sequences is shown in Table 6.1.

  Foreman:  progressive scan, 400 frames, 144 rows x 176 columns, 30 fps
  Carphone: progressive scan, 382 frames, 144 rows x 176 columns, 30 fps
  Table 6.1: Test Sequences

The QCIF resolution was selected since much of the prior work in the error resilience field has focused on QCIF video, which allows this thesis to be more easily compared with other research in the field. The small frame size also allows for quicker experimentation and analysis since encoding/decoding times are significantly reduced. However, the results presented in this thesis are directly applicable to higher resolution video as well. With QCIF video, we have made the assumption that one frame fits within one packet per stream. When considering higher resolution video sequences, it may be appropriate to split coded frames into multiple packets, which would also allow for more advanced intra- and inter-frame error concealment techniques.

Figure 6.1: Comparison between actual and expected end-to-end PSNR. (a) Foreman sequence. (b) Carphone sequence. This figure demonstrates the ability of the model to track the actual end-to-end distortion, since the actual values line up quite closely with the expected values. Also shown on this figure is the quantization-only distortion, which shows the distortion from compression without any packet loss.

6.2 Performance of Extended ROPE Algorithm

This section analyzes the performance of the extended ROPE algorithm. As discussed in Chapter 3, performing R-D optimization over lossy packet networks requires an accurate estimate of the end-to-end distortion experienced at the decoder. The extended ROPE model presented in Chapter 4 provides one approach for achieving this result. It is important for the model of expected distortion to accurately estimate the actual distortion since each of the encoder's decisions will be made based on this modeling.
If the model accurately estimates the distortion, then minimizing the expected distortion as calculated by the model will have the effect of minimizing the expected actual distortion experienced at the decoder. In this section we examine the capability of the extended ROPE algorithm to estimate the expected end-to-end distortion experienced at the decoder, accounting for the local characteristics of the video as well as the network conditions on multiple paths.

The results of the first experiment are shown in Figure 6.1. To generate this figure we have coded the Foreman and Carphone test sequences at approximately 0.4 bits per pixel (bpp) with the H.264 video codec using the SD approach mentioned in Chapter 5. The channel has been modeled as a two-path channel, where the paths are symmetric with Gilbert losses at an average packet loss rate of 5% and an expected burst length of 3 packets. The expected distortion as calculated at the encoder using the above model has been plotted relative to the actual distortion experienced by the decoder. This actual distortion was calculated by using 1,200 different packet loss traces and averaging the resulting squared-error distortion for each frame before computing the PSNR per frame. As shown for both of these sequences, the proposed model is able to track the end-to-end expected distortion quite accurately. Also shown in this figure for reference is the quantization-only distortion (with no packet loss).

The second example demonstrates the performance of the model under changing network conditions. In this experiment, the Foreman sequence was coded with the TS approach presented in Chapter 5 and the packet loss rate was varied as shown in Figure 6.2. The average packet loss rate jumped up to 10% for frames 50-150 on Path 0 and similarly for frames 250-350 on Path 1. The expected burst length was held constant at 3 packets during the intervals when losses occurred. Again, the actual distortion was calculated by using 1,200 different packet loss traces and averaging the resulting squared-error distortion. Figure 6.3 (a) demonstrates how the expected outcome, as calculated by the ROPE approach, follows the actual result quite closely.

Figure 6.2: Time varying packet loss rates used to examine the performance of the proposed system under time varying conditions. Here the packet loss rate jumps from 0% to 10% first on one stream and then on the other. The expected burst length is 3 packets.

With the TS approach, the loss of even frames affects only even frames and the loss of odd frames affects only odd frames. For instance, when the loss rate jumps up on Path 0 the quality of the odd frames drops but the quality of the even frames remains unaffected. When the loss rate on Path 1 jumps up, the quality of the even frames drops but the odd frames are unaffected. This characteristic causes the rapid fluctuation shown in Figure 6.3 (a). Figure 6.3 (b) shows the same result with the even and odd frames plotted separately to make the figure easier to read.

The final example presented in this section illustrates the effect of burst length on the resulting distortion and provides some motivation as to why it was important to compensate for bursty packet losses in the ROPE algorithm. Figure 6.4 (a) shows the Foreman sequence encoded with the SD method and (b) with the TS method, both at 0.4 bpp and a fixed 5% average packet loss rate.
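The Gilbert losses used throughout these simulations are fully determined by the two quantities quoted above, the average loss rate and the expected burst length. The following is a minimal sketch under a standard two-state parameterization; the function and variable names are ours and are not taken from the simulation code used for the thesis. A Bernoulli channel corresponds to the special case where the expected burst length equals 1/(1 - loss rate), roughly 1.05 packets at 5% loss.

    import random

    def gilbert_trace(num_packets, loss_rate, burst_len, seed=None):
        """Generate a loss trace (True = packet lost) from a two-state Gilbert model.

        The loss ("bad") state is left with probability p_bg, so the expected burst
        length is 1 / p_bg; the stationary loss probability is p_gb / (p_gb + p_bg),
        which is solved for p_gb given the target average loss rate."""
        rng = random.Random(seed)
        p_bg = 1.0 / burst_len                         # probability of leaving the loss state
        p_gb = loss_rate * p_bg / (1.0 - loss_rate)    # probability of entering the loss state
        lost = False
        trace = []
        for _ in range(num_packets):
            lost = rng.random() < ((1.0 - p_bg) if lost else p_gb)
            trace.append(lost)
        return trace

    # One path carrying one packet per frame for 400 frames, 5% average loss, burst length 3:
    trace = gilbert_trace(400, loss_rate=0.05, burst_len=3.0, seed=1)

Two independent traces of this kind, one per path, reproduce the balanced two-path channel assumed in most of these experiments.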
In this experiment the encoder made the assumption that the packet losses follow a Bernoulli model, where each packet is lost independently of any other packet. If this assumption happens to be accurate, then the model is again able to estimate the actual result relatively well. If, however, the actual packet losses are bursty in nature, then there can be significant deviations between the model and the actual result. The red lines in Figure 6.4 show the actual result at the decoder if the losses actually had an expected burst length of 6 packets. Given the errors between the estimate and the actual result, it is quite possible for the encoder to make less than optimal decisions that would ultimately affect the performance of the overall system. It is of particular interest to notice how bursty losses tend to have a negative effect on the SD approach while the opposite is true for the TS approach. If the encoder had been deciding between TS and SD coding, this error could severely affect the decision process. This result is one of the main reasons we felt it was important to account for bursty packet losses in the extended ROPE algorithm.

Figure 6.3: Comparison between actual and expected end-to-end PSNR with time varying packet loss rates. (a) Foreman sequence encoded at 0.4 bpp with the TS approach. (b) Same result with PSNR for even and odd frames plotted separately for easier understanding. The model again matches the actual result quite closely.

Figure 6.4: Comparison between the actual result with Bernoulli losses and the actual result with bursty losses. Foreman sequence coded at 0.4 bpp with balanced paths and 5% average packet loss rate. (a) SD coding. (b) TS coding.

6.3 MD Coding Adapted to Local Video Characteristics

Having shown the capability of the ROPE algorithm to accurately model the expected end-to-end distortion experienced at the decoder, we next examine the system's ability to adapt to the characteristics of the video source. The channel in this experiment was simulated with two balanced paths, each having a 5% average packet loss rate and an expected burst length of 3 packets. The video was coded in CBR mode at approximately 0.4 bits per pixel (bpp). Figure 6.5 demonstrates the resulting distortion in each frame, averaged over 300 packet loss traces, for the adaptive MD method and each of its non-adaptive MD counterparts.
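Throughout this chapter, per-frame quality is reported by averaging the squared error over the packet loss traces first and then converting to PSNR, rather than averaging per-trace PSNR values. A minimal sketch of that computation, with illustrative array names of our own choosing:

    import math

    def per_frame_psnr(sq_err, num_pixels, peak=255.0):
        """sq_err[t][f] = total squared error of frame f under loss trace t.
        Returns one PSNR value per frame, computed from the trace-averaged MSE."""
        num_traces = len(sq_err)
        psnr = []
        for f in range(len(sq_err[0])):
            mse = sum(trace[f] for trace in sq_err) / (num_traces * num_pixels)
            psnr.append(float("inf") if mse == 0 else 10.0 * math.log10(peak * peak / mse))
        return psnr

    # For QCIF frames (176 x 144 luma pixels); the squared-error values here are made up:
    curves = per_frame_psnr([[1.2e6, 9.8e5], [1.5e6, 1.1e6]], num_pixels=176 * 144)

Averaging the squared error before the logarithm gives heavily corrupted traces their full weight, whereas averaging PSNR values directly would understate them.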
The Foreman sequence contains a significant amount of motion from frames 250 to 350 and is fairly stationary from frame 350 to 399. Notice how the SS/RC methods work better during periods of significant motion, while the SD/TS methods work better as the video becomes stationary. The adaptive method intelligently switches between the two, maintaining at least the best performance of any non-adaptive approach. Since the adaptive approach adapts on a macroblock level, it is often able to do even better than the best non-adaptive case by selecting different MD modes within a frame as well. Similar results can be seen with the Carphone sequence. The best performing non-adaptive approach varies from frame to frame depending on the characteristics of the video. The adaptive approach generally provides the best performance of each of these.

Also shown in Figure 6.5 are the results from a typical video coding approach that we will refer to as standard video coding (STD). Here R-D optimization is only performed with respect to quantization distortion, not the end-to-end R-D optimization used in the other approaches. Instead of making inter/intra coding decisions in an end-to-end R-D optimized manner as performed by SD, it periodically intra-updates one line of macroblocks in every other frame to combat error propagation (this update rate was chosen as the optimal intra refresh rate [38], which is typically approximately 1/p, where p is the packet loss rate). The adaptive MD approach is able to outperform optimized SD coding by up to 2 dB for the Foreman sequence, depending on the amount of motion present at the time. Note that by making intelligent decisions through end-to-end R-D optimization, the SD method examined here is able to outperform the conventional STD method by as much as 4 or 5 dB for the Foreman sequence.

Figure 6.5: Average distortion in each frame for ADAPT versus each non-adaptive approach. Coded at 0.4 bpp with balanced paths, a 5% average packet loss rate, and an expected burst length of 3. (a) Foreman sequence. (b) Carphone sequence.

Figure 6.6: Distribution of selected MD modes used in the adaptive method for each frame of the Foreman sequence, illustrating how mode selection adapts to the video characteristics. 5% average packet loss rate, expected burst length 3.

The adaptive MD approach outperforms optimized SD coding by up to 1 dB with the Carphone sequence, and optimized SD coding outperforms the conventional STD approach by up to approximately 3 dB. Again, the use of a different set of MD modes could certainly increase or decrease these gains.

In Figure 6.6 we illustrate how the mode selection varies as a function of the characteristics of the video source.
Specifically, we show the percentage of macroblocks using each MD mode in each frame of the Foreman sequence. From this distribution of MD modes, one can roughly segment the Foreman sequence into three distinct regions: almost exclusively SD/TS in the last 50 frames, mostly SS/RC from frames 250-350, and a combination of the two during the first half. This matches the characteristics of the video, which contains some amount of motion at the beginning, a fast camera scan in the middle, and is nearly stationary at the end.

The next two pages provide some visual results from the same experiment. The results presented in this and subsequent sections focus on quantitative performance as measured by PSNR. However, as was mentioned in the first chapter of this thesis, PSNR values are not always directly correlated with human perceptions of video quality. While no formal perceptual testing was performed for this thesis, a number of informal qualitative assessments have been performed to verify that the conclusions we have derived from PSNR measurements can also be confirmed by visual analysis. With the addition of random channel losses it becomes difficult to assess the quality of any one method by a single realization of the channel. Any one particular packet loss trace may favor one method over another. The quantitative results presented in this thesis are averaged over many realizations of the channel to provide a more accurate and fair comparison. That being said, we provide the following sets of images to facilitate a discussion of some of the properties exhibited by each of these modes, and not as a means of fairly judging performance. These results are from the Foreman sequence, but similar results can be shown with the Carphone sequence as well.

Figure 6.7 shows frame 5 from the Foreman sequence encoded with each approach. A burst of losses occurred along one path, affecting frames 1 and 3. As can be seen in Figure 6.7, the SD and STD results are quite distorted due to propagation of errors into the 5th frame. These two approaches generally demonstrate about the same distortion immediately after a loss; however, the SD approach tends to recover faster due to more intelligent intra-coding. The SS method is also visually distorted, with many jagged edges appearing due to spatial interpolation of the missing fields. The RC mode was unaffected by this loss since it occurred on only one stream. The TS method is slightly corrupted, but not as severely as some of the others. However, one visual artifact that occurs with the TS mode yet does not appear in still images is the rapid fluctuation between high and low distortion frames (even and odd frames). This fluctuation causes very visible flicker that can be quite irritating despite the fact that half of the frames have high PSNR. By intelligent selection of MD modes the adaptive approach is able to more effectively protect itself against these losses and is barely distorted in this example.

Figure 6.8 shows frame 231 from the same sequence. In most cases the results are fairly similar to the first set of images, except notice the corruption that occurs with the RC mode. In this example, losses occurred earlier in the sequence that affected both descriptions simultaneously. Because this outcome was relatively unlikely, the RC mode used far less intra-coding.
However, in the event this type of simultaneous loss does occur, it can propagate through many frames in the RC approach, as shown in this example.

Figure 6.7: Frame 5 of the Foreman sequence. A burst of losses along one path affected frames 1 and 3. Most of the approaches are fairly distorted except for the ADAPT and RC approaches, which are relatively unaffected.

Figure 6.8: Frame 231 of the Foreman sequence. Two bursts of losses earlier in the sequence affected both streams simultaneously, causing severe distortions in the RC approach.

6.4 MD Coding Adapted to Network Conditions

In this section we examine how the system performs under various network conditions; specifically, we look at variations in average packet loss rate and expected burst length. In later sections we also explore the behavior of the system with unbalanced paths, where one path has a higher packet loss rate than the other, and with time varying network conditions.

6.4.1 Variations in Average Packet Loss Rate

The main purpose of these experiments was to examine the effect of variations in average packet loss rate on the resulting performance. The channel in this experiment was simulated with two balanced paths, each with an expected burst length of 3 packets. The video was coded in CBR mode at approximately 0.4 bits per pixel (bpp) and the average packet loss rate was varied from 0 to 10%. Figure 6.9 demonstrates the resulting distortion in the sequence for the adaptive MD method and each of its non-adaptive MD counterparts. These results were computed by first calculating the mean-squared error distortion, averaging across all the frames in the sequence and across the 300 packet loss traces, and then computing the PSNR.

By choosing the most efficient modes possible, the adaptive approach achieves a performance similar to the SD approach when no losses occur, and yet its performance does not fall off as quickly as the average packet loss rate is increased. Near the 10% loss rate, the adaptive method adjusts for the unreliable channel and has a performance closer to the RC mode. Note that the intra update rate for the STD method was adjusted in the experiment to be as close as possible to 1/p, where p is the packet loss rate, as an approximation of the optimal intra update frequency. Since this update rate could only be adjusted in an integer manner, the STD curves in Figure 6.9 tend to have some jagged fluctuations, and in some cases the curves are not even monotonically decreasing. As an example, an update rate of 1/p would imply that one should update one line of macroblocks every 2.22 frames at 5% loss and every 1.85 frames at 6% loss. These two cases have both been rounded to an update of one line of macroblocks every 2 frames, resulting in the slightly irregular curves.

Table 6.2 shows the distribution of MD modes in the adaptive approach at 0%, 5%, and 10% average packet loss rates. As the loss rate increases the system responds by switching from
lower redundancy methods (SD) to higher redundancy methods (RC) in an attempt to provide more protection against losses.

Figure 6.9: PSNR versus average packet loss rate. (a) Foreman sequence. (b) Carphone sequence. Video coded at approximately 0.4 bpp. The average packet loss rate for this experiment was varied from 0-10%, and the expected burst length was held constant at 3 packets.

  Foreman sequence:
    MD Mode   0% Loss   5% Loss   10% Loss
    SD        70.87%    50.73%    43.57%
    TS        18.13%    21.22%    18.97%
    SS        10.81%    10.69%    10.15%
    RC         0.19%    17.36%    27.31%
  Carphone sequence:
    MD Mode   0% Loss   5% Loss   10% Loss
    SD        67.49%    61.92%    57.51%
    TS        17.29%    22.53%    20.68%
    SS        15.13%     9.57%     9.19%
    RC         0.09%     5.98%    12.63%
  Table 6.2: Comparing the distribution of MD modes in the adaptive approach at 0%, 5%, and 10% average packet loss rates. (a) Foreman sequence. (b) Carphone sequence.

It is interesting to point out that even at 0% loss the system does not choose 100% SD coding. The adaptive approach recognizes that occasionally it can be more efficient to predict from two frames ago than from the prior frame, so it chooses TS coding. Occasionally it can be more efficient to code the even and odd lines of a macroblock separately, so it chooses SS coding. The fact that it selects any RC at a 0% loss rate is a little counterintuitive, but it occurs because coding a macroblock using RC changes the prediction dependencies between macroblocks. The H.264 codec contains many intra-frame predictions, including motion vector prediction and intra-prediction. In order for the RC mode to be correctly decoded even when one stream is lost, the adaptive system must not allow RC blocks to be predicted in any manner from non-RC blocks. (If RC blocks had been predicted from SD blocks, for example, the loss of one stream would affect the SD blocks and consequently alter the RC data as well.) Occasionally, prediction methods like motion vector prediction may not help and can actually reduce the coding efficiency for certain blocks. If this effect is extreme enough, it can actually be more efficient to use RC, where the prediction would not be used, even though the data is then unnecessarily repeated in both descriptions. This happens in approximately 1 out of every 500 to 1000 macroblocks in our experiments.

6.4.2 Variations in Expected Burst Length

Next, a similar experiment was run to examine the effects of expected burst length on the resulting PSNR. The channel in this experiment was simulated with two balanced paths, each with an average packet loss rate of 5%. The video was coded in CBR mode at approximately 0.4 bits per pixel (bpp) and the expected burst length was varied from a little over 1 packet (corresponding to Bernoulli losses) up to 10 packets. Figure 6.10 demonstrates the resulting distortion for the adaptive MD method and each of its non-adaptive MD counterparts. Burst length has a number of interesting effects on the performance of each of the methods.
The adaptive approach again demonstrates higher performance than any of the non-adaptive approaches across the whole range of burst lengths. Also as expected, the repetition coding approach is generally unaffected by burst length since it is only affected by simultaneous losses on both streams. The probability that packets from both streams are simultaneously lost does not change with burst length, since the channel has been simulated with independent paths and the average loss rate was held constant.

The SD and STD approaches both demonstrate a decrease in quality with higher burst lengths. As explored in [27], the loss of a burst of frames generally leads to an increase in distortion relative to the same number of isolated losses. The correlation between errors in two sequential frames can increase the total distortion and often results in a significant increase in distortion.

Interestingly, the TS and SS approaches both demonstrate better performance for longer expected burst lengths. Conceptually, this can be explained by the following example. Consider the case when a burst of L consecutive losses occurs along one path when using the TS approach. Each lost frame will be reconstructed from an unaffected frame in the opposite stream, so assume for the sake of this example that the distortion resulting from each of these reconstructions is the same for each frame and is equal to D_loss. This is a fairly reasonable assumption, given that each frame is reconstructed using a "clean" frame from the opposite stream, yet is in sharp contrast to the SD case, where each subsequent frame is reconstructed from a frame that has also been distorted by propagating losses. Once the burst is finished after L frames, any remaining error will propagate into future frames. Suppose the sum of the distortion caused by this propagation of errors is equal to D_prop. Then, the total distortion due to this burst would be

    L · D_loss + D_prop                                              (6.1)

Figure 6.10: PSNR versus expected burst length. (a) Foreman sequence. (b) Carphone sequence. Video coded at approximately 0.4 bpp. The expected burst length for this experiment was varied from slightly above 1 packet (corresponding to Bernoulli losses) to 10 packets while the average packet loss rate was held constant at 5%.

Figure 6.11: Effects of burst length on the TS mode. (a) A burst of L frames is lost along one path and reconstructed using unaffected frames from the opposite path. (b) Distortion resulting from a burst of L frames. (c) Distortion resulting from two shorter bursts. In general, with the TS or SS approaches, longer burst lengths tend to cause less distortion than a number of shorter bursts with an equal number of total losses.
Now, if there were instead two bursts of length L/2 (the same total number of losses), the total distortion would be approximately

    2 · (L/2 · D_loss + D_prop) = L · D_loss + 2 · D_prop            (6.2)

Therefore, a larger number of shorter bursts will tend to lead to more distortion in the TS approach, assuming the same total number of losses. A similar argument can be used to explain the increase in performance with the SS method, since each lost field is reconstructed from an unaffected "clean" field in the opposite stream. In [2] it was shown that a TS-based MD scheme with multiple paths can be quite effective against burst losses.

  Foreman sequence:
    MD Mode   Bernoulli Losses   Burst Length 5   Burst Length 10
    SD        53.17%             49.26%           48.23%
    TS        20.65%             22.01%           22.59%
    SS         9.14%             11.20%           11.79%
    RC        17.04%             17.52%           17.39%
  Carphone sequence:
    MD Mode   Bernoulli Losses   Burst Length 5   Burst Length 10
    SD        63.83%             60.52%           58.81%
    TS        22.19%             23.11%           23.90%
    SS         8.17%             10.55%           11.25%
    RC         5.81%              5.82%            6.05%
  Table 6.3: Comparing the distribution of MD modes in the adaptive approach at expected burst lengths ranging from Bernoulli losses up to 10 packets. (a) Foreman sequence. (b) Carphone sequence.

Table 6.3 shows the distribution of MD modes in the adaptive approach at three different burst lengths: 1.053 packets (Bernoulli), 5 packets, and 10 packets. In agreement with the results shown in Figure 6.10, as the burst length increases the system increases its use of the TS and SS methods to more effectively match the conditions on the network.

6.5 End-to-End R-D Performance

Having demonstrated the ability of adaptive MD mode selection to adapt to network conditions and to the characteristics of the video source, we next analyze the performance of this system at a number of different bitrates. Figure 6.12 shows the end-to-end R-D performance curves of the adaptive approach and each of its non-adaptive counterparts. This experiment was run in VBR mode with fixed quantization levels. To generate each point on these curves, the resulting distortion was averaged across all 300 packet loss simulations, as well as across all frames of the sequence. The same calculation was then conducted at various quantizer levels to generate each R-D curve.

Figure 6.12: End-to-end R-D performance of ADAPT and non-adaptive methods. 5% packet loss rate, expected burst length 3. (a) Foreman sequence. (b) Carphone sequence.

With the current system, ADAPT is able to outperform optimized SD coding by up to 1 dB for the Foreman sequence and about 0.5 dB for the Carphone sequence by switching between MD methods. The ADAPT method is able to outperform the STD coding approach by as much as 4.5 dB with the Foreman sequence and up to 3 dB with the Carphone sequence. ADAPT is able to outperform TS, which performs more or less the second best overall, by as much as 0.5 dB. One interesting result here is how well RC performs in these experiments.
Keep in mind that this is an R-D optimized RC approach, not simply the half-bitrate SD method repeated twice. The amount of intra coding used in RC is significantly reduced relative to SD coding, as the encoder recognizes the increased resilience of the RC method and chooses a more efficient allocation of bits.

Lagrangian optimization is often performed by selecting a fixed λ value appropriate for the desired bitrate and jointly choosing both the mode and the quantizer level that minimize the Lagrangian cost function. However, the results presented in Figure 6.12 have been obtained by using a single fixed quantization level to generate each point. In H.264 there are 51 available quantization levels, and therefore trying all possible quantization levels would increase the complexity of the system by a factor of 51. This proved to be an unacceptable increase in complexity, so we have made the decision to use a fixed quantization level throughout this experiment. To justify this decision we have used the first 25 frames from each sequence to regenerate the R-D curves from Figure 6.12 using a fully optimized system that considers all available quantization levels. The R-D curves for the ADAPT and SD approaches using both optimal quantization and fixed quantization are shown in Figure 6.13. The gains from using this fully optimized approach are only about 0.1 to 0.3 dB, which is not insignificant but is relatively small when considering the large increase in complexity.

6.6 Unbalanced Paths and Time Varying Conditions

We next analyze the performance of the adaptive method when used with a channel containing unbalanced paths. First, in Section 6.6.1 we analyze the situation where one path is more reliable than the other. Then in Section 6.6.2 we simulate a momentary rise in packet loss rates on one stream to simulate a temporary jump in congestion.

Figure 6.13: Optimal quantization level selection versus fixed quantization level. The optimal curves have been calculated by allowing the encoder to choose a quantization level in an R-D optimized manner. The fixed curves have been calculated using a fixed quantization level. Only the first 25 frames of each sequence have been used in this experiment. (a) Foreman sequence. (b) Carphone sequence.

  Foreman sequence:
    MD Mode   Even Frames (More Reliable Path)   Odd Frames (Less Reliable Path)
    SD        54.7%                              48.3%
    TS        26.5%                              16.5%
    SS         7.4%                              12.9%
    RC        11.4%                              22.4%
  Carphone sequence:
    MD Mode   Even Frames (More Reliable Path)   Odd Frames (Less Reliable Path)
    SD        64.6%                              59.8%
    TS        26.9%                              18.0%
    SS         5.9%                              12.5%
    RC         2.6%                               9.7%
  Table 6.4: Percentage of macroblocks using each MD mode in the adaptive approach when sending over unbalanced paths.
  Foreman sequence:
    Stream     Balanced Paths   Unbalanced Paths
    Stream 1   50.5%            55.4%
    Stream 2   49.5%            44.6%
  Carphone sequence:
    Stream     Balanced Paths   Unbalanced Paths
    Stream 1   50.1%            55.9%
    Stream 2   49.9%            44.1%
  Table 6.5: Percentage of total bitrate in each stream for both balanced and unbalanced paths.

6.6.1 Balanced versus Unbalanced Paths

This section explores the behavior of the adaptive system when one path is more reliable than the other. The channel used in these experiments consisted of one path with a 3% average packet loss rate and another with 7%, both with expected burst lengths of 3 packets. The video in this experiment was coded at approximately 0.4 bpp in CBR mode.

Table 6.4 shows the distribution of MD modes in even frames of the sequence versus odd frames. The even frames are those where the larger packet is sent along the more reliable path and the smaller packet is sent along the less reliable path (see Figure 5.11 regarding the packetization of MD data). The opposite is true for the odd frames. It is also interesting to compare the results from Table 6.4 with those from Table 6.2 at 5% balanced loss. The average of the even and odd frames from Table 6.4 matches closely with the values from the balanced case in Table 6.2.

As shown in Table 6.4, the system uses more SS and RC in the less reliable odd frames. These more redundant methods allow the system to provide additional protection for those frames which are more likely to be lost. By doing so, the adaptive system is effectively moving data from the less reliable path onto the more reliable path. Table 6.5 shows the bit rate sent along each path in the balanced versus unbalanced cases. In this situation, the system shifts between 5-6% of its total rate into the more reliable stream to compensate for conditions on the network. Since the non-adaptive methods are forced to send approximately half their total rate along each path, it is difficult to make a fair comparison across methods in this unbalanced situation. However, it is quite interesting that the end-to-end R-D optimization is able to adjust to this situation in such a manner. In the future, it might be interesting to use a similar approach to optimally distribute bandwidth across multiple paths.

The purpose of this research was to investigate the use of adaptive MD mode selection, and consequently this system was not necessarily designed to provide optimal bandwidth distribution. The results presented above are an interesting side effect and may lead to useful future research along these lines, but they are not an example of optimal bandwidth distribution. For instance, in the extreme case where 100% of packets arrive along one path and 0% of packets arrive along the other path, the optimal distribution would be to send all of the data along the reliable path and none along the unreliable path. However, the alternating nature of the SD and TS approaches in the current system (see Figures 5.10 and 5.11) prevents the encoder from sending all of the data along the more reliable path. An optimal bandwidth distribution system would need to allow for this type of flexibility. In addition, it may be appropriate to impose additional constraints on this type of system, such as a per-path rate constraint (a limit on the number of bits sent along each path) rather than the total rate constraint used in the current system.
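The per-stream shares reported in Table 6.5 are simple tallies of the packet sizes sent on each path. A minimal sketch of that bookkeeping, with function and variable names of our own choosing (the per-frame packet sizes would come from the encoder's output):

    def stream_rate_shares(packet_sizes):
        """packet_sizes: list of (stream1_bits, stream2_bits) tuples, one per frame.
        Returns the fraction of the total bitrate carried by each stream."""
        s1 = sum(p[0] for p in packet_sizes)
        s2 = sum(p[1] for p in packet_sizes)
        total = float(s1 + s2)
        return s1 / total, s2 / total

    # Illustrative: three frames where the larger packet alternates between streams
    print(stream_rate_shares([(900, 400), (450, 800), (950, 420)]))

With unbalanced paths, the larger share ends up on the more reliable stream simply because the mode decisions place more of the redundant SS/RC data there, as described above.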
6.6.2 Time Varying Network Conditions

The second set of experiments in this section was designed to analyze the behavior of the adaptive system under changing network conditions. Here we have used packet loss rates that vary over time as shown in Figure 6.2. The packet loss rate temporarily increases to 10%, first on one path for frames 50-150 and then on the other for frames 250-350. This type of situation is particularly well suited to the use of adaptive MD mode selection, and may actually be quite likely in practice as well. There may be long periods of time where all packets are delivered, followed by short periods of congestion on one path or the other. During the periods of little or no packet loss the adaptive system makes decisions to maximize coding efficiency. Once the packet loss rate jumps up, the system can adapt in a number of different ways. As shown in the previous section, frames sent along the less reliable path can be better protected using SS or RC coding. For the frames sent along the more reliable path the system can elect to use TS coding to prevent errors from propagating into the more reliable stream. Of course the system can also elect to use more intra-coding to prevent error propagation as well. The adaptive mode selection allows the encoder to analyze each of these different approaches and choose the most effective approach for each block.

Figure 6.14: Distribution of selected MD modes used in the adaptive method for each frame of the Foreman sequence, illustrating how mode selection adapts to time varying network conditions. The average packet loss rate varies as shown in Figure 6.2. The expected burst length is 3 packets.

Figure 6.14 shows the distribution of MD modes in the adaptive approach for the Foreman sequence. Specifically, we show the percentage of macroblocks using each MD mode in each frame of the video. Here the sequence has been coded at about 0.4 bpp, the packet loss rate varies as shown in Figure 6.2, and the expected burst length was 3 packets (although during the periods with 0% probability of loss the notion of burst length is meaningless). From this distribution it is easy to notice the changes that occur as the packet loss rate jumps up. The repetition mode in particular increases significantly during these periods of congestion. One can also notice much more fluctuation when the packet loss rate increases to 10%. Here the unbalanced nature of the two paths leads to different processing of the even and odd frames. Those frames sent along the less reliable path are handled differently from those sent along the more reliable path, resulting in the noticeable fluctuation.

Figure 6.15 shows the resulting PSNR from each frame of the Foreman sequence and compares the adaptive approach against each of its non-adaptive counterparts. Since the TS method possesses the property that the loss of even frames affects only even frames, and similarly for odd frames, the increased loss rate on one path only reduces the PSNR for every other frame. This results in severe fluctuation that would reduce the legibility of Figure 6.15. Therefore the even and odd frames have been plotted separately here.
Figure 6.15 (a) contains only the even frames and Figure 6.15 (b) contains only the odd frames. In this example, the odd frames are sent along the less reliable path during the first burst and the even frames are sent along the less reliable path during the second burst. Consequently the PSNR drops sharply for the TS approach during the first burst for the odd frames and during the second burst for the even frames. Due to error propagation, both the even and odd frames drop in quality during both bursts with the SD approach and, to a lesser extent, the SS approach. The RC method is unaffected by either burst of losses since both streams are never simultaneously lost. The adaptive approach is able to select modes that most effectively match the current network conditions and maintains good quality throughout both loss events.

Figure 6.15: Average distortion in each frame for ADAPT versus each non-adaptive approach. Foreman sequence coded at 0.4 bpp with time varying packet loss rates and an expected burst length of 3 packets. (a) Even frames. (b) Odd frames.

6.7 Sensitivity Analysis

The previous experiments have each assumed the encoder had perfect knowledge of the channel conditions. In practice, the problem of accurately characterizing network conditions is challenging, since conditions on the network can change rapidly, making it difficult to get an accurate measure at any one point in time. Therefore, it is unlikely that the encoder will have perfect knowledge of the channel. The purpose of this section is to analyze the effectiveness of the adaptive MD mode selection system given imperfect knowledge of channel conditions. We analyze the sensitivity of the system to packet loss rate in Section 6.7.1 and sensitivity to burst length in Section 6.7.2. As in the previous sections, the results presented here apply only to the specific realization of the MD mode selection system used in this thesis. Other such systems may be more or less sensitive to inaccurate knowledge of channel conditions, depending on the particular MD modes used and other aspects of the system design.

6.7.1 Sensitivity to Packet Loss Rate

The purpose of this experiment was to analyze the sensitivity of the system to errors in the assumed packet loss rate (PLR). The channel in this experiment was simulated with two balanced paths, each with an expected burst length of 3 packets. The video was coded with the adaptive approach in CBR mode at approximately 0.4 bits per pixel (bpp) and the assumed packet loss rate was varied from 0 to 10%. Figure 6.16 demonstrates the resulting PSNR when the actual and assumed PLR do not match. These results were computed by first calculating the mean-squared error distortion, averaging across all the frames in the sequence and across the 300 packet loss traces, and then computing the PSNR. Each line in Figure 6.16 represents a constant actual packet loss rate, as shown in the legend on the right, and each point on these lines represents a different assumed packet loss rate.
These results demonstrate the effect of incorrect assumptions about the packet loss rate. For example, in the case of the Foreman sequence, if the actual packet loss rate is 0% and the encoder is aware of this, the resulting PSNR is about 36 dB, as shown by the first point on the blue line. If however the encoder incorrectly assumes the packet loss rate is 10% while the actual loss rate remains at 0%, the encoder wastes bits providing unnecessary error resilience, and the PSNR drops to about 33 dB, as shown by the last point on the blue line. Similarly, if the actual packet loss rate is 10% and the encoder makes an accurate assumption, the PSNR is about 31 dB, as shown by the last point on the red line. However, if the encoder incorrectly assumes the packet loss rate is 0% while the actual loss rate remains at 10%, the encoder does not provide nearly enough error resilience and the PSNR drops to about 23 dB.

Two main conclusions can be drawn from these figures. First, each of these lines peaks at roughly the point where the assumed and actual rates match (e.g. the maximum PSNR on the 5% actual PLR line occurs at 5% assumed PLR). This indicates that the optimization is appropriately matching the coding to the channel conditions. Secondly, except for the 0% loss case, which peaks at 0%, each of these lines is relatively flat to the right side and drops off fairly rapidly to the left. At least with this particular implementation, it appears that if one fails to match the amount of error resilience to the actual channel conditions, it is more costly to underestimate the packet loss rate than it is to overestimate it. For example, if one uses too many intra-blocks, some bits are wasted on inefficient intra-coding and the decoder must use coarser quantization to compensate, but the end result is only a small decrease in overall quality. If on the other hand one uses too few intra-blocks, the losses which occur cause significant drops in quality and propagate through many frames. Again, other implementations of this type of system may certainly give different results. However, these results would seem to suggest a certain amount of conservatism in assumed packet loss rates. For instance, at least with this particular system, if there were some uncertainty in one's estimate of the actual packet loss rate, it would be better to err on the high side rather than the low side.

Figure 6.16: Sensitivity of the adaptive MD mode selection system to errors in the assumed packet loss rate. ADAPT approach coded at 0.4 bpp with balanced paths and an expected burst length of 3 packets. Each line represents a different actual packet loss rate. The assumed packet loss rate varies as shown along the x-axis. (a) Foreman sequence. (b) Carphone sequence.

Figure 6.17 shows the same data, only this time each line represents a different assumed packet loss rate and the actual packet loss rate varies as shown along the x-axis.
This figure also illustrates the decreased sensitivity to the actual loss rate at higher assumed packet loss rates: as the assumed packet loss rate increases, the slope of these lines decreases. As shown by the red line, if the encoder assumes the packet loss rate is 0%, even a slight increase in the actual packet loss rate can cause a significant drop in quality. If one knew a probability distribution over packet loss rates, for example if the packet loss rate were uniformly distributed between 0% and 1%, one should choose the line from Figure 6.17 that maximizes the integral of PSNR over this probability distribution. The 0% line may have the maximum value at the low end of this range, but its integral over the probability distribution may be lower given the steep drop in the red curve.

Figure 6.17: Sensitivity of the adaptive MD mode selection system to errors in assumed packet loss rate. ADAPT approach coded at 0.4 bpp with balanced paths and expected burst length of 3 packets. Each line represents a different assumed packet loss rate; the actual packet loss rate varies along the x-axis. (a) Foreman sequence. (b) Carphone sequence.

Figure 6.18 shows the sensitivity of the ADAPT approach relative to each of its non-adaptive counterparts. Here the actual packet loss rate is held fixed at 4% throughout the experiment and the assumed packet loss rate varies along the x-axis. Once the encoder has inaccurate knowledge of the actual channel conditions, the ROPE model inaccurately estimates the expected end-to-end distortion, leading to suboptimal decisions that are no longer appropriate for the actual conditions. These results suggest that the adaptive approach is the most sensitive to inaccurate channel knowledge. While the other methods mainly use the ROPE model for intra/inter decisions, the adaptive approach also uses this model to select between MD modes, so it seems reasonable that the adaptive approach would be the most sensitive to inaccurate knowledge. As also demonstrated in Figures 6.16 and 6.17, this particular system is not as sensitive to errors in the positive direction, which again suggests a conservative approach.

Figure 6.18: Sensitivity of the adaptive MD mode selection system to errors in assumed packet loss rate relative to each of its non-adaptive counterparts. Sequences coded at 0.4 bpp with balanced paths and expected burst length of 3 packets. The assumed packet loss rate varies along the x-axis and the actual packet loss rate is 4%. (a) Foreman sequence. (b) Carphone sequence.
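Returning to the point about a known distribution of packet loss rates, the choice of an assumed rate under uncertainty can be framed as picking the curve in Figure 6.17 with the largest expected PSNR under a prior over the actual rate. The short sketch below illustrates this selection rule; the PSNR table, the prior, and the function name are hypothetical stand-ins for measured curves like those in Figure 6.17, not values or code from this thesis.

```python
import numpy as np

def best_assumed_plr(psnr_table, assumed_plr, prior):
    """Pick the assumed packet loss rate maximizing expected PSNR.

    psnr_table[i, j] -- PSNR (dB) when the encoder assumes assumed_plr[i] and
                        the actual loss rate takes its j-th candidate value
                        (one curve per row, as in Figure 6.17).
    prior[j]         -- probability assigned to the j-th actual loss rate,
                        e.g. a uniform distribution between 0% and 1%.
    """
    prior = np.asarray(prior, dtype=float)
    prior = prior / prior.sum()              # normalize the prior
    expected = psnr_table @ prior            # expected PSNR for each assumed rate
    return assumed_plr[int(np.argmax(expected))]

# Example (hypothetical numbers): assumed rates 0%, 2%, 4%; actual rate equally
# likely to be 0% or 1%. The steep drop of the 0%-assumed curve at 1% actual loss
# makes the slightly higher assumed rate the better choice.
# best = best_assumed_plr(np.array([[36.0, 26.0],
#                                   [34.5, 33.0],
#                                   [33.5, 32.5]]),
#                         assumed_plr=[0.00, 0.02, 0.04],
#                         prior=[0.5, 0.5])   # -> 0.02
```

With curves shaped like those in Figure 6.17, this criterion tends to favor an assumed rate slightly above the most likely actual rate, which is consistent with the conservative recommendation above.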
6.7.2 Sensitivity to Burst Length

In a similar manner, we next analyzed the sensitivity of the system to errors in the assumed expected burst length. Figures 6.19 and 6.20 show the results from these experiments. Each line in Figure 6.19 shows a constant actual expected burst length while the assumed burst length varies along the x-axis. Figure 6.20 shows the same results, only the assumed expected burst length is held constant for each line while the actual burst length varies along the x-axis. The first thing to note is the scale of these figures: the PSNR does not change nearly as dramatically as it does with variations in average packet loss rate. Errors in assumed burst length can lead to a drop in PSNR on the order of 0.6 dB (for these sequences), but it is apparent that this particular system is not as sensitive to errors in assumed burst length as it is to errors in average packet loss rate. The quality is again maximized when the assumed burst length matches the actual one; however, this time it is difficult to make any generalization as to whether it is better to assume longer or shorter burst lengths given some amount of uncertainty. The results from the Foreman sequence would seem to suggest that it might pay off to assume slightly longer burst lengths, but the same result does not appear in the Carphone sequence.

Figure 6.19: Sensitivity of the adaptive MD mode selection system to errors in assumed burst length. ADAPT approach coded at 0.4 bpp with balanced paths and average packet loss rate of 5%. Each line represents a different actual burst length; the assumed burst length varies along the x-axis. (a) Foreman sequence. (b) Carphone sequence.

Figure 6.20: Sensitivity of the adaptive MD mode selection system to errors in assumed burst length. ADAPT approach coded at 0.4 bpp with balanced paths and average packet loss rate of 5%. Each line represents a different assumed burst length; the actual burst length varies along the x-axis. (a) Foreman sequence. (b) Carphone sequence.

Figure 6.21: Comparison between (a) a single path and (b) multiple paths.

6.8 Comparisons between Using Single and Multiple Paths

The final experiments conducted in this chapter provide some insight into the benefits of using multiple description coding with multiple paths (MP).
Each of the previous experiments has assumed the use of two independent paths, where burst losses on a single path affect only that path. The experiments in this section compare this multiple path approach with an approach using only a single path (SP); see Figure 6.21. If all losses are Bernoulli, where each packet loss event is independent and identically distributed, then the use of multiple paths is irrelevant, since all packet losses are independent regardless of whether they are sent along one or two paths. However, with the introduction of bursty losses, using multiple paths with MD coding can be quite beneficial. In this section we consider both MP and SP for three different approaches: 1) multiple description coding, represented by the ADAPT approach, 2) optimized single description coding, represented by the SD approach, and 3) standard single description coding, represented by the STD approach. The use of standard single description coding on a single path is the approach used most often in applications today and therefore provides a baseline of comparison for the results presented in this thesis.

In all previous experiments, care was taken to ensure that the same packet loss traces were used throughout an experiment. This ensures a fair comparison between different methods. However, when comparing SP and MP it is no longer possible to use the same packet loss traces. By running a large number of realizations of each channel (300 realizations), we believe the results presented below still provide a reasonable comparison, but we felt it was important to note this distinction.

The experiments in this section are similar to those presented in Sections 6.4 and 6.5. The first experiment examines the effect of expected burst length on the resulting PSNR and compares SP with MP. The multiple path channel in this experiment was simulated with two balanced paths, each with an average packet loss rate of 5%. The single path channel also had an average packet loss rate of 5%. The video was coded in CBR mode at approximately 0.4 bits per pixel (bpp), and the expected burst length was varied from a little over 1 packet (Bernoulli) up to 10 packets. Figure 6.22 shows the outcome of this experiment. These results were obtained by first calculating the mean-squared error distortion, averaging across all the frames in the sequence and across the 300 packet loss traces, and then computing the PSNR.

Figure 6.22 shows the resulting PSNR for the ADAPT, SD, and STD approaches using both a single path (SP) and multiple paths (MP). The MP results are identical to those presented in Section 6.4.2. As shown in Figure 6.22, burst length has a very significant effect on performance when using a single path, much more so than with multiple paths. With Bernoulli losses, the two cases are identical, but the performance falls off rapidly with SP as the burst length increases. One assumption made with MD coding is that it is relatively unlikely that both descriptions will be lost, yet bursts of losses along a single path cause losses in both descriptions. There can still be some gains from using MD coding with SP, but these gains are certainly not as significant as with MP.
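All of the SP and MP channels in these experiments are driven by bursty loss traces characterized by an average packet loss rate and an expected burst length. The following is a minimal sketch of one common way such traces can be generated, using a two-state (Gilbert-style) model parameterized by those two quantities; the parameterization, function names, and trace lengths are illustrative assumptions and may differ from the simulator actually used in this thesis.

```python
import numpy as np

def gilbert_trace(num_packets, loss_rate, burst_len, rng):
    """Generate a 0/1 loss trace (1 = lost) from a two-state Gilbert-style
    model with the given average loss rate and expected burst length."""
    p_recover = 1.0 / burst_len                          # P(bad -> good); E[burst] = 1 / p_recover
    p_fail = loss_rate * p_recover / (1.0 - loss_rate)   # P(good -> bad) giving the target average rate
    trace = np.zeros(num_packets, dtype=int)
    lost = False
    for i in range(num_packets):
        p_loss_next = (1.0 - p_recover) if lost else p_fail
        lost = rng.random() < p_loss_next
        trace[i] = int(lost)
    return trace

rng = np.random.default_rng(0)
# Single path: one bursty trace hits packets from both descriptions.
sp_trace = gilbert_trace(2000, loss_rate=0.05, burst_len=3, rng=rng)
# Multiple paths: two independent traces, so a burst on one path leaves
# the description carried by the other path intact.
mp_traces = [gilbert_trace(1000, loss_rate=0.05, burst_len=3, rng=rng) for _ in range(2)]
```

Setting burst_len to 1/(1 - loss_rate) reduces this model to Bernoulli losses, the "a little over 1 packet" endpoint used in the burst length experiments.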
The second experiment shows the effect of variations in packet loss rate on SP versus MP. The multiple path channel in this experiment was simulated with two balanced paths, each with an expected burst length of 3 packets. The single path channel was also simulated with an expected burst length of 3 packets. The video was coded in CBR mode at approximately 0.4 bits per pixel (bpp), and the average packet loss rate was varied from 0 to 10%. These results were again computed by first calculating the mean-squared error distortion, averaging across all the frames in the sequence and across the 300 packet loss traces, and then computing the PSNR.

Figure 6.22: PSNR versus expected burst length comparing the benefits of using multiple paths (MP) versus only a single path (SP). Sequences coded at approximately 0.4 bpp. The expected burst length was varied from Bernoulli to 10 packets, and the average packet loss rate was held constant at 5%. (a) Foreman sequence. (b) Carphone sequence.

Figure 6.23 shows the resulting PSNR for each of these cases. The MP results are identical to those presented in Section 6.4.1. This experiment again shows the benefits of using MD coding with MP. The performance of each approach drops much more sharply with SP, and the gains from MD coding are much greater when combined with MP. At 0% no losses occur, so the results for SP and MP are identical; however, the PSNR drops off much more rapidly for SP as the loss rate increases. As indicated by the results of the previous experiment, this drop in performance is mainly a result of the bursty losses. With Bernoulli losses, these results would be identical for SP and MP; at burst lengths of 10 packets, the differences would be even more apparent.

The final experiment demonstrates the benefits of MP at a number of different bitrates. Figure 6.24 shows the end-to-end R-D performance curves of both SP and MP approaches. This experiment was run in VBR mode with fixed quantization levels. To generate each point on these curves, the resulting distortion was averaged across all 300 packet loss simulations, as well as across all frames of the sequence. The same calculation was then repeated at various quantizer levels to generate each R-D curve. The MP results are identical to those presented in Section 6.5. By using MP with MD coding, the resulting PSNR is increased by approximately 1 dB relative to MD coding with SP. The gain with SD coding over multiple paths is about 0.5 dB. As seen in the previous experiments, the benefits of MD coding relative to SD coding are much more significant when used with MP. The STD approach with SP provides a baseline of comparison since this is the type of approach used in most applications today. MD coding with MP is able to outperform this standard approach by about 2 dB at lower bitrates and by as much as 4-5 dB at higher bitrates.
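As a rough illustration of how one such end-to-end R-D curve could be assembled, the sketch below sweeps quantizer levels, averages the decoder-side MSE over all loss traces and frames, and converts each average to PSNR. The encoder and channel-simulation functions are hypothetical placeholders rather than interfaces from the thesis software.

```python
import numpy as np

def rd_curve(encode_at_qp, simulate_losses, qp_values, peak=255.0):
    """Sweep quantizer levels to trace one end-to-end R-D curve (VBR mode).

    encode_at_qp(qp)           -> (bits_per_pixel, bitstream)            [hypothetical]
    simulate_losses(bitstream) -> per-frame MSE array of shape
                                  (num_traces, num_frames)                [hypothetical]
    """
    points = []
    for qp in qp_values:
        bpp, stream = encode_at_qp(qp)
        mse = simulate_losses(stream)                    # decode under each loss trace
        psnr = 10.0 * np.log10(peak ** 2 / np.mean(mse))
        points.append((bpp, psnr))                       # one point per quantizer level
    return points
```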
Figure 6.23: PSNR versus average packet loss rate comparing the benefits of using multiple paths (MP) versus only a single path (SP). Sequences coded at approximately 0.4 bpp. The average packet loss rate was varied from 0% to 10%, and the expected burst length was held constant at 3 packets. (a) Foreman sequence. (b) Carphone sequence.

Figure 6.24: End-to-end R-D performance of SP versus MP. 5% average packet loss rate, expected burst length of 3 packets. (a) Foreman sequence. (b) Carphone sequence.

Chapter 7

Conclusions

7.1 Summary

The transmission of video sequences over lossy packet networks involves a fundamental tradeoff between compression efficiency and error resilience. Raw video contains an immense amount of data that must be stored in a limited amount of space and/or transmitted in a finite amount of time, thus demanding the use of extremely efficient video compression algorithms. Fortunately, raw video sequences also contain a significant amount of redundancy that allows encoders to perform considerable compression without significantly distorting the resulting video. The redundancy present in the original sequence provides a significant amount of error resilience, but the bandwidth required for real-time transmission of uncompressed video is not reasonable. Ideally one would like to remove all the redundancy from the video bit stream and compress the data down to the smallest number of bits possible. However, by doing so, each bit increases in importance and any losses that occur can have a much more significant impact on the resulting video quality. To this end, a number of error resilient video compression algorithms have been developed. These approaches are essentially joint source-channel coders that trade off some amount of compression efficiency for an increase in error resilience. Multiple description coding is one such approach. A multiple description encoder codes a single sequence into two or more complementary streams and transmits these independently across the network. In the event one stream is lost, the remaining stream can still be decoded, resulting in only a slight reduction in video quality.

This thesis proposed end-to-end rate-distortion optimized adaptive MD mode selection for multiple description coding. This approach makes use of multiple MD coding modes within a single sequence, making optimized decisions using a model for predicting the expected end-to-end distortion. The extended ROPE model presented in Chapter 4 is used to predict the distortion experienced at the decoder, taking into account both bursty packet losses and the use of multiple transmission paths. This allows the encoder in this system to make optimized mode selections using Lagrangian optimization techniques to minimize the expected end-to-end distortion.
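At the block level, this Lagrangian selection can be pictured as evaluating each candidate MD mode's estimated rate and ROPE-predicted distortion and keeping the mode with the smallest cost J = D + λR. The sketch below is a schematic illustration of that rule under assumed interfaces; the mode objects and the expected_distortion and estimate_rate calls are hypothetical names, not the thesis implementation.

```python
def select_md_mode(block, candidate_modes, lagrange_multiplier, rope_model):
    """Pick the MD coding mode minimizing the Lagrangian cost J = D + lambda * R.

    rope_model.expected_distortion(block, mode) -- estimated end-to-end distortion,
        covering both quantization and packet-loss-induced error   [hypothetical API]
    mode.estimate_rate(block)                   -- bits needed to code the block
        in this mode                                                [hypothetical API]
    """
    best_mode, best_cost = None, float("inf")
    for mode in candidate_modes:
        distortion = rope_model.expected_distortion(block, mode)
        rate = mode.estimate_rate(block)
        cost = distortion + lagrange_multiplier * rate
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode
```

Because the distortion term reflects the assumed loss rate and burst length, the same rule naturally shifts toward more redundant modes as the assumed channel worsens, which matches the behavior described below.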
We began by examining the performance of the extended ROPE algorithm. It was originally unknown how well the ROPE algorithm would work with the H.264 codec, and in particular with the use of quarter-pixel motion vector accuracy. We have shown that the ROPE model is able to accurately track the expected distortion experienced at the decoder even when used with H.264, and that it accurately takes into account the characteristics of the video source, packet loss rates, bursty packet losses, and the use of multiple transmission paths. We have then shown how one such adaptive MD mode selection system based on H.264 is able to adapt to the local characteristics of the video and to network conditions on multiple paths, and have shown the potential for this adaptive approach, which selects among a small number of simple complementary MD modes, to significantly improve video quality. The results presented in this thesis demonstrate how this system accounts for the characteristics of the video source, e.g., using more redundant modes in regions particularly susceptible to losses, and how it adapts to conditions on the network, e.g., switching from more bitrate-efficient methods to more resilient methods as the loss rate increases.

Other experiments have explored the use of adaptive MD mode selection with unbalanced paths, where one path is more reliable than the other, and with time-varying network conditions. One particularly interesting result from these experiments was the realization that this particular system was able to intelligently redistribute a portion of the bits to the more reliable path in this type of unbalanced situation. Since the non-adaptive approaches could not redistribute bits in the same way, permitting the adaptive system to do so would not have allowed a fair comparison. It was nonetheless an interesting result, and in the future it may be possible to use a similar approach for optimally distributing packets across multiple unbalanced paths.

Most of the experiments in this thesis assumed the encoder had accurate knowledge of the current packet loss rate and expected burst length on the network. However, it is unlikely that the encoder will have such perfect knowledge of the time-varying conditions that exist on packet networks. To this end, we performed a number of experiments to evaluate the sensitivity of the system to inaccurate knowledge of network conditions. In general, the results indicated that this particular system is most sensitive to underestimating the packet loss rate, which suggests that it may be wise to adopt a conservative scheme and encode for slightly higher loss rates than expected. The results also showed that the system was not as sensitive to overestimation of the packet loss rate or to errors in the assumed burst length. Finally, we have evaluated the gains in performance when MD coding is combined with the use of multiple paths.
The most common approach in video streaming today is the use of single description coding sent over a single path, while the majority of this thesis has focused on the use of multiple description coding with multiple paths. As a baseline of comparison, we have performed experiments to evaluate the performance of SD and MD coding along both single and multiple paths at various packet loss rates, expected burst lengths, and bitrates. The results of these experiments demonstrated the significant benefits of combining MD coding with the use of multiple paths. While the results presented in this work are specific to this particular realization of such a system, they demonstrate that a small set of MD modes combined with intelligent, adaptive mode selection can significantly improve video quality. Overall, the results with adaptive MD mode selection are quite promising, and we believe MD mode selection can be a useful tool for the reliable delivery of video streams.

7.2 Future Research Directions

The results of this research indicate that adaptive MD mode selection can be an effective error resilience tool by allowing the encoder to adapt to current network conditions and to the characteristics of the video source. However, there are a number of areas that would need further exploration before a practical application of this concept could be implemented, the most obvious being the issue of complexity. This research has assumed the encoder would be able to maintain real-time processing in order for it to adapt to current network conditions. The H.264 codec alone is quite computationally demanding, not to mention the complexity added by the ROPE algorithm and R-D optimized adaptive mode selection. A real-time implementation of this approach would currently be a challenging task involving tradeoffs between performance and complexity, although it should become increasingly attainable as the technology evolves.

One could also test the system in an actual network environment by integrating the encoder with an algorithm for estimating current network characteristics. Estimating packet loss rates and expected burst lengths is difficult given rapidly changing network conditions; it is similar to trying to hit a moving target. The sensitivity analysis in Section 6.7 has provided some insight into how this particular system is affected by imperfect knowledge of channel conditions, but it would be interesting to see how this type of approach would perform in a full end-to-end system.
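As one concrete starting point, such an estimator could be driven by per-packet loss feedback (for example, from receiver reports). The sketch below computes an average loss rate and mean burst length from a binary feedback trace; it is a deliberately simple illustration, not an estimator proposed or evaluated in this thesis.

```python
import numpy as np

def estimate_channel(loss_feedback):
    """Estimate average packet loss rate and mean burst length from a
    binary feedback trace (1 = packet reported lost, 0 = received)."""
    trace = np.asarray(loss_feedback, dtype=int)
    loss_rate = trace.mean()
    # A burst starts wherever a loss is not immediately preceded by another loss.
    burst_starts = int(trace[0] == 1) + int(np.sum((trace[1:] == 1) & (trace[:-1] == 0)))
    mean_burst_len = trace.sum() / max(burst_starts, 1)
    return loss_rate, mean_burst_len

# Example: feedback for 12 packets containing two bursts (lengths 2 and 3).
rate, burst = estimate_channel([0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0])
# rate = 5/12 (about 0.42), burst = 5/2 = 2.5
```

Given the sensitivity results in Section 6.7, one might deliberately bias such an estimate slightly upward before handing it to the encoder.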
The MD modes used in this thesis provided a convenient means of introducing the concept of adaptive MD mode selection, but other interesting modes could also be considered, many of which could potentially increase the performance of the system dramatically. Because this system uses MD modes in an adaptive fashion, it is not necessary for each MD approach to work well under all possible situations. Even if a mode works poorly on its own, it can still be quite useful in an adaptive system as long as it performs well in certain special cases. One could reevaluate the performance of previously developed methods and/or develop new MD methods with this fact in mind.

The current results have focused on QCIF video resolution. It may also be interesting to study the use of this adaptive mode selection on larger video sequences of CIF or even HD resolution. The use of larger frames would force a single frame to be split across multiple packets, which also offers some interesting possibilities for more advanced error concealment techniques. Integrating more advanced techniques into the system, such as repairing one description from another (which has shown gains of up to 3 dB [1]), could offer significant improvements in video quality. Several of the components used in the current implementation could also be individually studied and improved. For instance, improved rate control algorithms, R-D optimization techniques, or models of expected distortion could be beneficial and could be easily integrated into the current architecture.

Besides improvements to the existing system, the results from this thesis also point in a number of other interesting directions. For instance, the use of adaptive MD mode selection on unbalanced paths has led to some interesting results regarding the distribution of bandwidth across asymmetrical paths. It would be interesting to use a ROPE-like algorithm to study the optimal distribution of bandwidth along unbalanced paths subject to bandwidth constraints and various packet loss rates. This thesis has mainly studied two-description MD coding along two paths, but it is not well known how to optimally distribute M descriptions along N paths (e.g., two descriptions with three available paths). By modeling the expected end-to-end distortion, it may be possible to gain a better understanding of this more general problem and develop useful insights. In addition to considering situations where each path has a different reliability, it may also be useful to explore other types of unbalanced circumstances, for instance where each path has a different bitrate constraint.

The use of look-ahead (analysis of future frames) in error resilient video coding is another topic that has not been thoroughly studied. This thesis has assumed that look-ahead was not an option due to the additional delay involved. However, the use of one or two frames of look-ahead might provide significant improvements in performance for applications that are able to support a small amount of additional delay.

Overall, we believe that adaptive MD mode selection using an R-D optimized algorithm that minimizes the expected end-to-end distortion, while accounting for the characteristics of the video source and current network conditions, is a promising approach for the reliable delivery of video streams over lossy packet networks.

Bibliography

[1] J. Apostolopoulos, "Error-resilient video compression through the use of multiple states," in Proceedings of the IEEE International Conference on Image Processing, vol. 3, pp. 352-355, September 2000.

[2] J. Apostolopoulos, "Reliable video communication over lossy packet networks using multiple state encoding and path diversity," in Proceedings of the SPIE: Visual Communications and Image Processing, vol. 4310, pp. 392-409, January 2001.

[3] J. Apostolopoulos, W. Tan, and S. Wee, "Performance of a multiple description streaming media content delivery network," in Proceedings of the IEEE International Conference on Image Processing, vol. 2, pp. 189-192, September 2002.

[4] J. Apostolopoulos, W. Tan, S. Wee, and G. Wornell, "Modeling path diversity for multiple description video communication," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 2161-2164, May 2002.
"On multiple description streaming with content delivery networks," in INFOCOM2002, vol. 3, pp. 1736-1745, June 2002. [6] I. Bajic and J. Woods, "Domain-based multiple description coding of images and video," IEEE Transactions on Image Processing, vol. 12, pp. 1211-1225, October 2003. [7] N. Boulgouris, K. Zachariadis, A. Leontaris, and M. Strintzis. "Drift-free multiple description coding of video," in Proceedings of the IEEE International Workshop on Multimedia Signal Processing,pp. 105-110, October 2001. [8] D. Chung and Y. Wang, "Multiple description image coding using signal decomposition and reconstruction based on lapped orthogonal transforms," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, pp. 895-908, September 1999. [9] D. Chung and Y. Wang, "Lapped orthogonal transforms designed for error-resilient image coding," IEEE Transactionson Circuitsand Systems for Video Technology, vol. 12, pp. 752-764, September 2002. -127- Bibliography [10] G. Cot6 and F. Kossentini, "Optimal intra coding of blocks for robust video communication over the Internet," Signal Processing:Image Commulication, vol. 15, pp. 25-34, September 1999. [11] T. Cover and J. Thomas, Elements ofInformation Theory. 1991, New York: Wiley. [12] S. Diggavi, N. Sloane, and V. Vaishampayan, "Asymmetric multiple description lattice vector quantizers," IEEE Transactions on Information Theory, vol. 48, pp. 174-191, January 2002. [13] C. Fenimore, V. Baroncini, T. Oelbaum, and T. K. Tan. "Subjective testing methodology in MPEG video verification," in Proceedingsof the SPIE: Applications of DigitalImage ProcessingXXVII, vol. 5558, pp. 503-511, August 2004. [14] A. El Gamal and T. Cover, "Achievable rates for multiple descriptions," IEEE Transactionson Information Theory, vol. 28, pp. 851-857, November 1982. [15] V. Goyal, "Multiple description coding: compression meets the network," IEEE Signal ProcessingMagazine, vol. 18, pp. 74-93, September 2001. [16] V. Goyal, J. Kelner, and J. Kovacevic, "Multiple description vector quantization with a coarse lattice," IEEE Transactionson Information Theory, pp. 781-788, March 2002. [17] V. Goyal and J. Kovacevic. "Optimal multiple description transform coding of Gaussian vectors," in Proceedings of the Data Compression Conference, pp. 388-397, March 1998. [18] V. Goyal and J. Kovacevic, "Generalized multiple description coding with correlating transforms," IEEE Transactions on Information Theory, vol. 47, pp. 2199-2224, September 2001. [19] V. Goyal, J. Kovacevic, R. Arean, and M. Vetterli. "Multiple description transform coding of images," in Proceedings of the IEEE International Conference on Image Processing,vol. 1, pp. 674-678, October 1998. [20] ITU-T Rec H.264, "Advanced video coding for generic audiovisual services," March 2003. [21] P. Haskell and D. Messerschmitt. "Resynchronization of motion compensated video affected by ATM cell loss," in Proceedings of the IEEE InternationalConference on Acoustics, Speech, and Signal Processing,vol. 3, pp. 545-548, March 1992. -128- Bibliography [22] B. Heng, J. Apostolopoulos, and J. Lim. "End-to-end rate-distortion optimized mode selection for multiple description video coding," in Proceedings of the IEEE InternationalConference on Acoustics, Speech, and Signal Processing,vol. 5, pp. 905908, March 2005. [23] R. Hinds, T. Pappas, and J. Lim. "Joint block-based video source/channel coding for packet-switched networks," in Proceedings of the SPIE: Visual Communications and Image Processing,vol. 3309, pp. 
[24] N. Jayant, "Subsampling of a DPCM speech channel to provide two 'self-contained' half-rate channels," Bell Sys. Tech. Journal, vol. 60, pp. 501-509, April 1981.

[25] N. Jayant and S. Christensen, "Effects of packet losses in waveform coded speech and improvements due to an odd-even sample-interpolation procedure," IEEE Transactions on Communications, vol. 29, pp. 101-109, February 1981.

[26] C.-S. Kim, R.-C. Kim, and S.-U. Lee, "Robust transmission of video sequence using double-vector motion compensation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 1011-1021, September 2001.

[27] Y. Liang, J. Apostolopoulos, and B. Girod, "Analysis of packet loss for compressed video: Does burst-length matter?," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, pp. 684-687, April 2003.

[28] J. Liao and J. Villasenor, "Adaptive intra update for video coding over noisy channels," in Proceedings of the IEEE International Conference on Image Processing, vol. 3, pp. 763-766, September 1996.

[29] L.-J. Lin and A. Ortega, "Bit-rate control using piecewise approximated rate-distortion characteristics," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, pp. 446-459, August 1998.

[30] D. Marpe, H. Schwarz, and T. Wiegand, "Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 620-636, July 2003.

[31] A. Ortega and K. Ramchandran, "Rate-distortion methods for image and video compression," IEEE Signal Processing Magazine, vol. 15, pp. 23-50, November 1998.

[32] R. Puri and K. Ramchandran, "Multiple description source coding using forward error correction codes," in Asilomar Conference on Signals, Systems, and Computers, vol. 1, pp. 342-346, October 1999.

[33] A. Reibman, "Optimizing multiple description video coders in a packet loss environment," in Proceedings of the Packet Video Workshop, April 2002.

[34] A. Reibman, H. Jafarkhani, Y. Wang, M. Orchard, and R. Puri, "Multiple-description video coding using motion-compensated temporal prediction," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, pp. 193-204, March 2002.

[35] R. Schafer, T. Wiegand, and H. Schwarz, "The emerging H.264/AVC standard," EBU Technical Review, January 2003.

[36] S. Servetto, K. Ramchandran, V. Vaishampayan, and K. Nahrstedt, "Multiple description wavelet based image coding," IEEE Transactions on Image Processing, vol. 9, pp. 813-826, May 2000.

[37] E. Steinbach, N. Farber, and B. Girod, "Standard compatible extension of H.263 for robust video transmission in mobile environments," IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, pp. 872-881, December 1997.

[38] K. Stuhlmüller, N. Farber, M. Link, and B. Girod, "Analysis of video transmission over lossy channels," IEEE Journal on Selected Areas in Communications, vol. 18, pp. 1012-1032, June 2000.

[39] G. Sullivan, A. Luthra, and P. Topiwala, "The H.264/AVC advanced video coding standard: overview and introduction to the fidelity range extensions," in Proceedings of the SPIE: Applications of Digital Image Processing XXVII, vol. 5558, pp. 454-474, August 2004.

[40] G. Sullivan and T. Wiegand, "Video compression - from concepts to the H.264/AVC standard," Proceedings of the IEEE, vol. 93, pp. 18-31, January 2005.

[41] V. Vaishampayan, "Design of multiple description scalar quantizers," IEEE Transactions on Information Theory, vol. 39, pp. 821-834, May 1993.
Vaishampayan, "Design of multiple description scalar quantizers," Transactionson Information Theory, vol. 39, pp. 821-834, May 1993. IEEE [42] V. Vaishampayan, A. Calderbank, and J. Batllo, "On reducing granular distortion in multiple description quantization," in Proceedings of the IEEE International Symposium on Information Theory, pp. 98, August 1998. [43] V. Vaishampayan and J. Domaszewicz, "Design of entropy-constrained multipledescription scalar quantizers," IEEE Transactions on Information Theory, vol. 40, pp. 245-250, January 1994. -130- Bibliography [44] W. Wan and J. Lim. "Adaptive format conversion for scalable video coding," in Proceedings of the SPIE: Applications of Digital Image ProcessingXXIV, vol. 4472, pp. 390-401, December 2001. [45] X. Wang and M. Orchard. "Multiple description coding using trellis coded quantization," in Proceedings of the IEEE International Conference on Image Processing,vol. 1, pp. 391-394, September 2000. [46] Y. Wang and D. Chung. "Robust image coding and transport in wireless networks using nonhierarchical decomposition," in International Workshop on Mobile Multimedia Communications, pp. 285-282, September 1996. [47] Y. Wang and S. Lin, "Error-resilient video coding using multiple description motion compensation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, pp. 438-452, June 2002. [48] Y. Wang, M. Orchard, and A. Reibman. "Multiple description image coding for noisy channels by pairing transform coefficients," in IEEE Workshop on Multimedia Signal Processing,pp. 419-424, June 1997. [49] Y. Wang, M. Orchard, V. Vaishampayan, and A. Reibman, "Multiple description coding using pairwise correlating transforms," IEEE Transactions on Image Processing,vol. 10, pp. 351-366, March 2001. [50] Y. Wang, A. Reibman, and S. Lin, "Multiple description coding for video communications," Proceedingsof the IEEE, vol. 93, pp. 57-70, January, 2005. [51] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264 / AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 560 - 576, July 2003. [52] H. Witsenhausen and A. Wyner, "Source coding for multiple descriptions II: a binary source," Bell Sys. Tech. Journal,vol. 60, pp. 2281-2292, December 1981. [53] J. Wolf, A. Wyner, and J. Ziv, "Source coding for multiple descriptions," Bell Sys. Tech. Journal,vol. 59, pp. 1417-1426, October 1980. [54] D. Wu, Y. Hou, W. Zhu, Y. Zhang, and J. Peha, "Streaming video over the Internet: approaches and directions," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 282-300, March 2001. -131- Bibliography [55] H. Yang and K. Rose. "Recursive end-to-end distortion estimation with model-based cross-correlation approximation," in Proceedings of the IEEE Conference on Image Processing,vol. 3, pp. 469-472, September 2003. [56] R. Zhang, S. Regunathan, and K. Rose, "Video coding with optimal inter/intra-mode switching for packet loss resilience," IEEE Journal on Selected Areas in Communications, vol. 18, pp. 966-976, June 2000. -132-