Smooth Streaming over Wireless Networks
Sreya Chakraborty
Interim Report EE-5359

Abstract: Smooth streaming is a serious challenge because bandwidth is a limited resource. In this paper, the implications of video traffic smoothing on the number of statistically multiplexed H.264 SVC, H.264/AVC, and MPEG-4 Part 2 streams, the bandwidth requirements for streaming, and the introduced delay are examined. SVC enables the transmission and decoding of partial bit streams to provide video services with lower temporal or spatial resolutions or reduced fidelity while retaining a reconstruction quality that is high relative to the rate of the partial bit streams. Two algorithms that compress multimedia streams to a considerable degree are also presented.

Introduction: Smooth streaming is a challenge in areas where bandwidth is low or limited. For streaming video and audio data, UDP is in most cases found more useful than TCP, because TCP introduces various delays; in particular, it waits for the receipt of acknowledgements, which delays frame arrival. The loss of some data is acceptable to a certain extent, but the resulting delay is not. Modern video transmission and storage systems are based on RTP/IP for real-time services. Most RTP/IP access networks are characterized by a wide range of connection qualities and receiving devices. The varying connection quality is due to the adaptive resource sharing mechanisms of these networks. Traditional digital video transmission and storage systems are based on H.222.0 and H.320 [7] for broadcasting services over satellite, cable, and terrestrial transmission channels, for DVD storage, and for conversational video conferencing services. The international video coding standards H.262, H.263, and MPEG-4 already include several tools by which the most important scalability modes can be supported.
However, in traditional video transmission systems these scalability features came with a significant loss in coding efficiency as well as a large increase in decoder complexity. Simulcast provides similar functionalities as a scalable bit stream. The scalable video coding extension of H.264/AVC, with its hierarchical B frames, also compresses single-layer video. H.264/AVC and H.264 SVC video encoding are expected to be widely adopted for wired and wireless network video transport due to their increased compression efficiency compared to MPEG-4 and their widespread inclusion in application standards. The compression efficiency of a video codec is generally characterized with a rate-distortion curve [2] that shows the bit rate of the compressed video stream as a function of the video quality (distortion), which is typically measured in terms of the Peak Signal to Noise Ratio (PSNR). For a given video quality, the lower the compressed bit rate, the more efficient the compression. The improvements in rate-distortion (RD) compression efficiency with H.264 SVC and H.264/AVC come at the expense of significantly increased variability of the encoded frame sizes (in bits).

The recently developed H.264/AVC video codec with the Scalable Video Coding (SVC) extension compresses non-scalable (single-layer) and scalable video significantly more efficiently than MPEG-4 Part 2. Since the traffic characteristics of encoded video have a significant impact on its network transport, the bit rate-distortion and bit rate variability-distortion performance of single-layer video traffic of the H.264/AVC codec and SVC extension is examined using long CIF resolution videos. The traffic characteristics of the hierarchical B frames (SVC) are compared with those of classical B frames. In addition, we examine the impact of frame size smoothing on the video traffic to mitigate the effect of bit rate variabilities.
Compared to MPEG-4 Part 2, the H.264/AVC codec and SVC extension achieve lower average bit rates at the expense of significantly increased traffic variabilities that remain at a high level even with smoothing. Through simulations we investigate the implications of this increase in rate variability on (i) frame losses when transmitting a single video, and (ii) the number of supported video streams in a bufferless statistical multiplexing scenario with restricted link capacity and information loss.

In general, video can be encoded (i) with fixed quantization scales, which results in nearly constant video quality at the expense of variable video traffic (bit rate), or (ii) with rate control, which adapts the quantization scales to keep the video bit rate nearly constant at the expense of variable video quality. In order to examine the fundamental traffic characteristics of the H.264/AVC video coding standard, which does not specify a normative rate control mechanism, we focus primarily on encodings with fixed quantization scales. An additional motivation for the focus on variable bit rate video encoded with fixed quantization scales is that variable bit rate streams allow for statistical multiplexing gains that have the potential to improve the efficiency of video transport over communication networks.

The development of video network transport mechanisms that meet the strict playout deadlines of the video frames and efficiently accommodate the variability of the video traffic is a challenging problem. A wide array of video transport mechanisms has been developed and evaluated, based primarily on the characteristics of MPEG-2 and MPEG-4 Part 2 encoded video. The widespread adoption of the new H.264/AVC video standard necessitates the careful study of the traffic characteristics of video coded with the new H.264/AVC codec and its extensions.
Therefore, it is necessary to examine the new video encoder’s statistical characteristics and compression performance from a communication network perspective. We study the Main profile of the H.264/AVC encoder using long Common Intermediate Format (CIF) 352 × 288 pixel resolution sequences. Our study of the newest H.264 SVC extension analyzes single-layer (non-scalable) video traffic characteristics of long CIF videos, i.e., although the H.264 SVC single-layer encoding supports temporal scalability, we group the individual temporal layers and consider the aggregate stream. H.264/AVC and H.264 SVC single-layer video traffic is significantly more variable than MPEG-4 Part 2 traffic under similar encoding conditions. At the same time, we confirm the significant average bit rate savings. The increased bit rate variability is observed over a wide range of average qualities of the encoded streams and for all tested video sequences. This makes the transport of H.264/AVC and H.264 SVC single-layer traffic more challenging than MPEG-4 Part 2 traffic. SVC’s temporal scalability is built on the hierarchical prediction concept for B frames.

Temporal Scalability with Hierarchical B Frames:

The introduction of hierarchical B frames has allowed the H.264 SVC encoder to achieve temporal scalability while at the same time improving RD efficiency compared to the classical B frame prediction method employed by the older MPEG standards (MPEG-1/2/4 Part 2) and by default in H.264/AVC. In Fig. 1, we illustrate both concepts for predicting B frames. Hierarchical B frames are an important new concept that was first introduced in H.264/AVC using generalized B frames and was later found to be the best method to build the Scalable Video Coding (SVC) extension on. Hence, the H.264 SVC encoded single-layer stream is decodable by existing H.264/AVC codecs.
The scalability modes do require new SVC capability, with the supported modes depending on the applications or, equivalently, on the H.264 SVC profiles. Fig. 1(a) depicts the classical B frame prediction structure, where each B frame is predicted only from the preceding I or P frame and from the subsequent I or P frame. Other B frames are not referenced since this is not allowed by video standards preceding H.264/AVC. This restriction is lifted in the generalized B frame paradigm that was first introduced in the H.264/AVC standard. Fig. 1(b) depicts the hierarchical B frame structure, which uses B frames for the prediction of B frames. The illustrated case is the dyadic hierarchy of B frames, meaning that the number of B frames $n$ between the key pictures (I or P frames) equals $n = 2^k - 1$ for a positive integer $k$. The hierarchy with 3 B frames ($k = 2$) and an I frame period of 16 is depicted in Fig. 1(b). In this example, the frame sequence is $I_0 B_2 B_1 B_2 P_0 B_2 B_1 B_2 P_0 B_2 B_1 B_2 P_0 B_2 B_1 B_2$, where the index represents the temporal layer number.

The coding efficiency of hierarchical B frames depends on the number of hierarchical B frames (temporal levels) and on the choice of quantization parameters for each B frame. Therefore, H.264 SVC introduces cascading quantizers, which assign a higher quantization parameter value (lower quality) to B frames belonging to higher temporal layers. This concept is based on the insight that the lowest temporal layer 0 requires higher quality than the next temporal layer, since all other predictions depend on it. The quality of each subsequent temporal layer can be gradually reduced since fewer layers depend on it. According to studies by the standardization committee, the quality fluctuation that this introduces within a GoP is not subjectively noticeable to human observers.

For a video sequence consisting of $M$ frames encoded with a given quantization scale, we let $X_m$ ($m = 1, \ldots, M$) denote the sizes [bits] of the encoded video frames.
The mean frame size $\bar{X}$ [bits] of the encoded video sequence is defined as

$$\bar{X} = \frac{1}{M} \sum_{m=1}^{M} X_m,$$

while the variance $\sigma_x^2$ of the frame sizes ($\sigma_x$ is the standard deviation [bits]) is defined as

$$\sigma_x^2 = \frac{1}{M} \sum_{m=1}^{M} (X_m - \bar{X})^2.$$

The coefficient of variation of the frame sizes [unit free] is defined as

$$CoV_x = \frac{\sigma_x}{\bar{X}}.$$

Fig. 1 B frame prediction structures [8]

GoP Structure Comparison

Selected RD graphs for the Silence of the Lambs sequence encoded with H.264/AVC, H.264 SVC, and MPEG-4 Part 2 are depicted in Fig. 2(a), (c), and (e). Each figure depicts the RD curves for all GoP structures for a particular encoder. We observe that the H.264/AVC encoder achieves the best RD performance for GoP structure G16-B3 with almost coinciding RD curves. For the MPEG-4 Part 2 encoder the RD efficiency decreases significantly with increasing number of B frames in the GoP structures. Contrary to these two encoders, the H.264 SVC encoder achieves best RD performance for the G16-B15 GoP structure and lowest for G16-B1. From RD comparison plots between all three encoders, not included due to space constraints, we find that for GoP structure G16-B1, H.264/AVC and H.264 SVC have comparable RD performance. However, H.264 SVC increasingly outperforms H.264/AVC for GoP structures G16-B3 to G16-B15. We observe that the better the RD performance of a particular GoP structure, the higher the corresponding traffic variability.

In the subsequent experiments, we employ four different GoP structures, namely IBPBPBPBPBPBPBPB (16 frames, with 1 B frame per I/P frame), which we denote by G16-B1, IBBBPBBBPBBBPBBB (16 frames, with 3 B frames per I/P frame) denoted by G16-B3, IBBBBBBBPBBBBBBB (16 frames, with 7 B frames per I/P frame) denoted by G16-B7, and IBBBBBBBBBBBBBBB (16 frames, with 15 B frames per I frame) denoted by G16-B15. In the context of SVC, these four GoP structures are respectively designated by their “GoP size”, which is the number of hierarchical B frames plus one key picture, either of type I or P.
Hence, G16-B1 has GoP size 2, G16-B3 has GoP size 4, G16-B7 has GoP size 8, and G16-B15 has GoP size 16. In the following, we employ our own GoP structure notation to emphasize the repetitive I-P-B frame type patterns in the encodings and to avoid confusion. These four GoP structures are natural structures for hierarchical B frames and allow us to compare the three encoders based on identical underlying GoP patterns.

We employ the H.264/AVC encoder in the Main profile with all compression tools enabled, as specified in Section III-B, i.e., using variable block sizes, three reference frames for the past and the future, referenced B frames, P and B frame weighted prediction, CABAC, and rate-distortion optimization (RDO). We designate these settings by “Full-RDO”. The H.264 SVC settings are similar. We use the MPEG-4 Part 2 encoder in the Advanced Simple profile (ASP) to encode the sequences, for comparison with the H.264/AVC encodings. The ASP adds B frames to the Simple profile. We employ half-pixel motion compensated prediction; RDO is not supported by the reference encoder implementation. The MPEG-4 Part 2 encoder uses one reference frame for the past and one for the future, and 16 × 16 blocks for motion estimation that can be split into 8 × 8 blocks.

Two algorithms are proposed that compress both audio and video; they have linear time complexity and reduce the amount of bandwidth used. Hence, even under highly fluctuating network traffic, smooth audio-video conferencing can be achieved. The trade-off faced is twofold: either the original data is recovered more faithfully after decompression at a relatively lower compression ratio, or a higher compression ratio is achieved at the cost of losing more data.

ALGO-1 [6] is depicted in Fig. 2. It takes two bytes from the multimedia byte stream, denoted Uncompressed Byte-1 and Uncompressed Byte-2 in the figure. The four most significant bits of each uncompressed byte are then placed in a compressed byte, denoted Compressed Byte in the figure. First, the four most significant bits (a7, a6, a5, a4) of Uncompressed Byte-1 are placed into the four most significant bit positions (c7, c6, c5, c4) of Compressed Byte. Then the four most significant bits (b7, b6, b5, b4) of Uncompressed Byte-2 are placed into the four least significant bit positions (c3, c2, c1, c0) of Compressed Byte. This concludes the compression phase of ALGO-1 [5]. The compressed byte is then sent over the network to the receiver, where it is decompressed into two bytes, depicted in the figure as Decompressed Byte-1 and Decompressed Byte-2. Bits (c7, c6, c5, c4) are placed into bit positions (d7, d6, d5, d4) of Decompressed Byte-1, respectively, while bits (c3, c2, c1, c0) are placed into bit positions (e7, e6, e5, e4) of Decompressed Byte-2, respectively.
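The ALGO-1 packing and unpacking steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from [6]: the function names `algo1_compress` and `algo1_decompress` are hypothetical, an odd-length input is padded with one zero byte, and the four least significant bits of each decompressed byte are simply set to zero.

```python
def algo1_compress(data: bytes) -> bytes:
    """Pack the high nibbles of each pair of input bytes into one byte."""
    if len(data) % 2:
        # Assumption (not from the source): pad odd-length input with a zero byte.
        data += b"\x00"
    out = bytearray()
    for i in range(0, len(data), 2):
        a, b = data[i], data[i + 1]
        # (c7..c4) <- (a7..a4), (c3..c0) <- (b7..b4)
        out.append((a & 0xF0) | (b >> 4))
    return bytes(out)


def algo1_decompress(packed: bytes) -> bytes:
    """Expand each packed byte into two bytes with zeroed low nibbles."""
    out = bytearray()
    for c in packed:
        out.append(c & 0xF0)         # (d7..d4) <- (c7..c4), low nibble zero
        out.append((c & 0x0F) << 4)  # (e7..e4) <- (c3..c0), low nibble zero
    return bytes(out)


if __name__ == "__main__":
    original = bytes([0xAB, 0xCD])
    packed = algo1_compress(original)
    restored = algo1_decompress(packed)
    print(packed.hex(), restored.hex())
```

For the byte pair 0xAB, 0xCD the packed byte is 0xAC and the restored pair is 0xA0, 0xC0: the stream shrinks to half its size, and each byte loses its four least significant bits.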
Now the dilemma arises as to what bit values should be placed in the four least significant bit positions of the decompressed bytes. In Fig. 2, the four least significant bits of both decompressed bytes are padded with zeros. The compression procedure for ALGO-1 is:

ALGO-1_COMPRESS(U)
    C[length[U]/2]
    j ← 0
    for i ← 0 to length[U]-2 step 2
        a ← U[i]
        b ← U[i+1]
        c ← 0
        (c7, c6, c5, c4) ← (a7, a6, a5, a4)
        (c3, c2, c1, c0) ← (b7, b6, b5, b4)
        C[j] ← c
        j ← j+1
    return C

The corresponding decompression procedure for ALGO-1 follows. In addition to restoring the high-order bits, it offsets each decompressed value by 10, presumably to approximate the discarded low-order bits:

ALGO-1_DECOMPRESS(C)
    D[length[C]*2]
    j ← 0
    for i ← 0 to length[C]-1
        c ← C[i]
        d ← 0
        e ← 0
        (d7, d6, d5, d4) ← (c7, c6, c5, c4)
        (e7, e6, e5, e4) ← (c3, c2, c1, c0)
        if d > 0 then d ← d+10 else d ← d-10
        if e > 0 then e ← e+10 else e ← e-10
        D[j] ← d
        D[j+1] ← e
        j ← j+2
    return D

Fig. 2 Schematic diagram of ALGO-1 (Compression and Decompression) [6]

ALGO-2 [6]: This algorithm compresses four consecutive bytes of a multimedia stream into three compressed bytes. The compression and decompression processes are depicted in Fig. 3 and Fig. 4, respectively. As shown in Fig. 3, the four most significant bits (a7, a6, a5, a4) of Uncompressed Byte-1 are placed in the four most significant bit positions (e7, e6, e5, e4) of Compressed Byte-1. Next, the four most significant bits (b7, b6, b5, b4) of Uncompressed Byte-2 are placed in (e3, e2, e1, e0) of Compressed Byte-1. The same step is repeated for Uncompressed Byte-3 and Uncompressed Byte-4: bits (c7, c6, c5, c4) and (d7, d6, d5, d4) are copied to Compressed Byte-3, with those of Uncompressed Byte-3 placed in (g7, g6, g5, g4) and those of Uncompressed Byte-4 placed in (g3, g2, g1, g0). Judging from the decompression mapping in items (v) to (viii), Compressed Byte-2 carries bits 3 and 2 of each of the four uncompressed bytes, i.e., (a3, a2), (b3, b2), (c3, c2), and (d3, d2) in positions (f7, f6), (f5, f4), (f3, f2), and (f1, f0). On receiving the three compressed bytes, the bits are mapped as illustrated in Fig. 4:
(i) (e7, e6, e5, e4) of Compressed Byte-1 to (h7, h6, h5, h4) of Decompressed Byte-1 respectively.
(ii) (e3, e2, e1, e0) of Compressed Byte-1 to (i7, i6, i5, i4) of Decompressed Byte-2 respectively.
(iii) (g7, g6, g5, g4) of Compressed Byte-3 to (j7, j6, j5, j4) of Decompressed Byte-3 respectively.
(iv) (g3, g2, g1, g0) of Compressed Byte-3 to (k7, k6, k5, k4) of Decompressed Byte-4 respectively.
(v) (f7, f6) of Compressed Byte-2 to (h3, h2) of Decompressed Byte-1 respectively.
(vi) (f5, f4) of Compressed Byte-2 to (i3, i2) of Decompressed Byte-2 respectively.
(vii) (f3, f2) of Compressed Byte-2 to (j3, j2) of Decompressed Byte-3 respectively.
(viii) (f1, f0) of Compressed Byte-2 to (k3, k2) of Decompressed Byte-4 respectively.

Fig. 3 Compression process of ALGO-2 [6]

Fig. 4 Decompression process of ALGO-2 [6]

References:
[1] H. Schwarz, D. Marpe, and T. Wiegand, “Overview of the scalable video coding extension of the H.264/AVC standard,” IEEE Trans. Circuits and Systems for Video Technology, vol. 17, no. 9, pp. 1103-1120, Sep. 2007.
[2] G. Van der Auwera and M. Reisslein, “Implications of smooth streaming on statistical multiplexing of H.264/AVC and SVC video streams,” IEEE Trans. Broadcasting, vol. 55, no. 3, pp. 541-558, Sep. 2009.
[3] M. Wien, H. Schwarz, and T. Oelbaum, “Performance analysis of SVC,” IEEE Trans. Circuits and Systems for Video Technology, vol. 17, no. 9, pp. 1194-1203, Sep. 2007.
[4] G. Van der Auwera, P. T. David, and M. Reisslein, “Traffic characteristics of H.264/AVC variable bit rate video,” IEEE Communications Magazine, vol. 46, no. 11, pp. 164-174, Nov. 2008.
[5] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, “Introduction to Algorithms,” First Edition, MIT Press and McGraw-Hill, Cambridge, MA, USA, 1990.
[6] T. R. Rahman and M. Rahman, “Compression algorithms for audio-video streaming,” IEEE Conference Intelligent Systems, Modeling and Simulation, pp. 187-192, 2010.
[7] ITU-T and ISO/IEC JTC 1, “Generic coding of moving pictures and associated audio information-Part 1: Systems,” ITU-T Recommendation H.222.0 and ISO/IEC 13818-1 (MPEG-2 Systems), Nov. 1994.
[8] G. Van der Auwera, P. T. David, and M. Reisslein, “Traffic and quality characterization of single-layer video streams encoded with the H.264/MPEG-4 advanced video coding standard and scalable video coding extension,” IEEE Trans. Broadcasting, vol. 54, no. 3, pp. 698-718, Aug. 2008.