D1.2: Understanding the RTP packetization (encapsulation) of SVC
V 1.2

Contents
1. Real-time Transport Protocol (RTP)
   RTP runs on top of UDP
   RTP Example
   RTP and QoS
   RTP Streams
   RTP Header
   Real-Time Control Protocol (RTCP)
   RTCP Packets
   Synchronization of Streams
   RTCP Bandwidth Scaling
2. RTP Payload Format for H.264 Video
   NAL unit type
3. RTP payload format for SVC
   NAL unit header: 3 octets extended
References

1. Real-time Transport Protocol (RTP)

RTP specifies a packet structure for packets carrying audio and video data. It is defined in RFC 3550 [1], which replaces RFC 1889.
An RTP packet provides:
o payload type identification
o packet sequence numbering
o timestamping

RTP runs in the end systems. RTP packets are encapsulated in UDP segments.

Interoperability: if two Internet phone applications both run RTP, they may be able to work together.

RTP runs on top of UDP

RTP libraries provide a transport-layer interface that extends UDP:
• port numbers, IP addresses
• error checking across the segment
• payload type identification
• packet sequence numbering
• time-stamping

RTP Example

This is an example from [2]. Consider sending 64 kbps PCM-encoded voice over RTP. The application collects the encoded data in chunks, e.g., one chunk every 20 msec, which is 160 bytes per chunk. The audio chunk together with the RTP header forms the RTP packet, which is encapsulated into a UDP segment. The RTP header indicates the type of audio encoding in each packet, so senders can change the encoding during a conference. The RTP header also contains sequence numbers and timestamps.

RTP and QoS

RTP does not provide any mechanism to ensure timely delivery of data or to provide other quality of service guarantees. RTP encapsulation is only seen at the end systems; it is not seen by intermediate routers. Routers providing the Internet's traditional best-effort service do not make any special effort to ensure that RTP packets arrive at the destination in a timely manner. In order to provide QoS to an application, the Internet must provide a mechanism, such as RSVP, for the application to reserve network resources.

RTP Streams

RTP allows each source (for example, a camera or a microphone) to be assigned its own independent RTP stream of packets. For example, for a videoconference between two participants, four RTP streams could be opened: two streams for transmitting the audio (one in each direction) and two streams for the video (again, one in each direction). However, some popular encoding techniques, including MPEG1 and MPEG2, bundle the audio and video into a single stream during the encoding process.
When the audio and video are bundled by the encoder, only one RTP stream is generated in each direction. For a many-to-many multicast session, all of the senders typically send their RTP streams into the same multicast tree with the same multicast address.

RTP Header

Payload Type (7 bits): indicates the type of encoding that is currently being used. If a sender changes the encoding in the middle of a conference, the sender informs the receiver through this payload type field.
• Payload type 31: H.261
• Payload type 33: MPEG2 video

Sequence Number (16 bits): the sequence number increments by one for each RTP packet sent. It may be used to detect packet loss and to restore packet sequence.

Real-Time Control Protocol (RTCP)

RTCP works in conjunction with RTP. Each participant in an RTP session periodically transmits RTCP control packets to all other participants. Each RTCP packet contains sender and/or receiver reports with statistics useful to the application. Statistics include the number of packets sent, the number of packets lost, interarrival jitter, etc. This feedback can be used to control performance and for diagnostic purposes. The sender may modify its transmissions based on the feedback.
- For an RTP session there is typically a single multicast address; all RTP and RTCP packets belonging to the session use the multicast address.
- RTP and RTCP packets are distinguished from each other through the use of distinct port numbers.
- To limit traffic, each participant reduces its RTCP traffic as the number of conference participants increases.

RTCP Packets

Receiver report packets:
• fraction of packets lost, last sequence number, average interarrival jitter.

Sender report packets:
• SSRC of the RTP stream, the current time, the number of packets sent, and the number of bytes sent.

Source description packets:
• e-mail address of the sender, the sender's name, the SSRC of the associated RTP stream.
These packets provide a mapping between the SSRC and the user/host name.

Synchronization of Streams

• RTCP can be used to synchronize different media streams within an RTP session.
• Consider a videoconferencing application for which each sender generates one RTP stream for video and one for audio.
• The timestamps in these RTP packets are tied to the video and audio sampling clocks, and are not tied to the wall-clock time (i.e., to real time).
• Each RTCP sender-report packet contains, for the most recently generated packet in the associated RTP stream, the timestamp of the RTP packet and the wall-clock time at which the packet was created. Thus the RTCP sender-report packets associate the sampling clock with the real-time clock.
• Receivers can use this association to synchronize the playout of audio and video.

RTCP Bandwidth Scaling

RTCP attempts to limit its traffic to 5% of the session bandwidth. For example, suppose there is one sender, sending video at a rate of 2 Mbps. Then RTCP attempts to limit its traffic to 100 kbps. The protocol gives 75% of this rate, or 75 kbps, to the receivers; it gives the remaining 25% of the rate, or 25 kbps, to the sender. The 75 kbps devoted to the receivers is shared equally among them. Thus, if there are R receivers, each receiver gets to send RTCP traffic at a rate of 75/R kbps and the sender gets to send RTCP traffic at a rate of 25 kbps. A participant (a sender or receiver) determines its RTCP packet transmission period by dynamically calculating the average RTCP packet size (across the entire session) and dividing that average size by its allocated rate.

2. RTP Payload Format for H.264 Video

All content in this section is taken from [3].

Internally, the NAL uses NAL units. A NAL unit consists of a one-byte header and the payload byte string.
The header indicates the type of the NAL unit, the (potential) presence of bit errors or syntax violations in the NAL unit payload, and information regarding the relative importance of the NAL unit for the decoding process. This RTP payload specification is designed to be unaware of the bit string in the NAL unit payload.

Some concepts should be noted:
• Access unit: includes the primary coded picture
• Time: SEI + RTP Time Sequence

SEI messages

The picture timing SEI message enables carriage of multiple timestamps for the same coded picture, and therefore the 3:2 pulldown process is perfectly controlled. The picture timing SEI message mechanism is necessary because only one timestamp per coded frame can be conveyed in the RTP timestamp.

The most important concept in this memo is the NAL unit type.

NAL unit type

The NAL unit type is carried in the last 5 bits of the NAL unit header.

Table 1. Summary of NAL unit types and their payload structures

Type    Packet      Type name                          Section
----------------------------------------------------------------
0       undefined                                      -
1-23    NAL unit    Single NAL unit packet per H.264   5.6
24      STAP-A      Single-time aggregation packet     5.7.1
25      STAP-B      Single-time aggregation packet     5.7.1
26      MTAP16      Multi-time aggregation packet      5.7.2
27      MTAP24      Multi-time aggregation packet      5.7.2
28      FU-A        Fragmentation unit                 5.8
29      FU-B        Fragmentation unit                 5.8
30-31   undefined                                      -

Packetization Modes

This memo specifies three packetization modes:
o Single NAL unit mode
o Non-interleaved mode
o Interleaved mode

The single NAL unit mode is targeted for conversational systems that comply with ITU-T Recommendation H.241. The non-interleaved mode is targeted for conversational systems that may not comply with ITU-T Recommendation H.241. In the non-interleaved mode, NAL units are transmitted in NAL unit decoding order. The interleaved mode is targeted for systems that do not require very low end-to-end latency. The interleaved mode allows transmission of NAL units out of NAL unit decoding order.
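
The one-byte NAL unit header and the type ranges of Table 1 can be illustrated with a minimal Python sketch. It assumes the bit layout given in RFC 3984 [3] (1-bit F, 2-bit NRI, 5-bit Type, most significant bit first); the names `NalHeader`, `parse_nal_header` and `packet_kind` are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    """Fields of the one-byte NAL unit header (layout assumed from RFC 3984)."""
    f: int              # forbidden_zero_bit: 1 signals bit errors or syntax violations
    nri: int            # nal_ref_idc: relative importance for the decoding process
    nal_unit_type: int  # the five low-order bits


def parse_nal_header(first_octet: int) -> NalHeader:
    """Split the first octet of a NAL unit or RTP payload into F, NRI and Type."""
    if not 0 <= first_octet <= 0xFF:
        raise ValueError("NAL unit header must be a single octet")
    return NalHeader(
        f=(first_octet >> 7) & 0x01,
        nri=(first_octet >> 5) & 0x03,
        nal_unit_type=first_octet & 0x1F,
    )


def packet_kind(nal_unit_type: int) -> str:
    """Map a type value to the packet name listed in Table 1."""
    if 1 <= nal_unit_type <= 23:
        return "single NAL unit packet"
    names = {
        24: "STAP-A",
        25: "STAP-B",
        26: "MTAP16",
        27: "MTAP24",
        28: "FU-A",
        29: "FU-B",
    }
    # 0, 30 and 31 are undefined in Table 1.
    return names.get(nal_unit_type, "undefined")


if __name__ == "__main__":
    header = parse_nal_header(0x7C)  # binary 0111 1100
    print(header, packet_kind(header.nal_unit_type))
```

A receiver would typically look at only this first octet to decide whether the payload is a single NAL unit, an aggregation packet or a fragment, which is consistent with the payload format being unaware of the NAL unit payload itself.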

Fragmentation Units (FUs) -> special, particular

This payload type allows fragmenting a NAL unit into several RTP packets. Doing so on the application layer instead of relying on lower layer fragmentation (e.g., by IP) has the following advantages:
o The payload format is capable of transporting, over an IPv4 network, NAL units bigger than 64 kbytes. Such NAL units may be present in pre-recorded video, particularly in High Definition formats (there is a limit on the number of slices per picture, which results in a limit on the number of NAL units per picture, which may result in big NAL units).
o The fragmentation mechanism allows fragmenting a single picture and applying generic forward error correction as described in section 2.5.

The fragments of a NAL unit MUST be reassembled in RTP sequence number order.

3. RTP payload format for SVC

All content in this section is taken from [4].

In the SVC case, the base layer is anticipated to conform to a non-scalable profile of H.264/AVC. The enhancement layers conform to the SVC specification.

NAL unit header: 3 octets extended

The first octet is like the AVC NAL unit header. In H.264/AVC, the NAL unit types 20 and 21 (among others) were reserved for future extensions. SVC uses these two NAL unit types to indicate the presence of the 2nd, extension octet, which carries:
• P: priority
• D: discardable
• E: flag for the presence of the 3rd octet

The 3rd octet adds more dependency information.

temporal_level (TL) indicates the temporal scalability layer. A layer consisting of NAL units that carry pictures with a smaller TL value has a lower frame rate.

dependency_id (DID) indicates the inter-layer coding dependency hierarchy. At any temporal location, a picture with a lower DID value may be used for inter-layer prediction when coding a picture with a higher DID value.

quality_level (QL) indicates the FGS layer hierarchy.
At any temporal location and with identical dependency_id value, an FGS picture with quality_level value equal to QL uses the FGS picture or base quality picture (the non-FGS picture when QL-1 = 0) with quality_level value equal to QL-1 for inter-layer prediction. When QL is larger than 0, the NAL unit contains an FGS slice or part of one. (Taken from section 2, Network abstraction layer, of [4].)

The other sections of the paper discuss some transmission problems, such as:
• 3 use cases with problems with firewall pinholes

Payload specific signaling

In its current form, the SVC RTP payload draft (Wenger and Wang, 2005) follows the other possible avenue, which is a payload specific solution. The draft suggests that each layer (or group of layers; the draft is not specific in this regard) be transported in its own RTP session, which is announced/negotiated as an independent SDP media description. From an SDP and higher protocol viewpoint, all these descriptions appear to be independent media streams. Their relationship is described in a single SDP attribute, which carries a binary description of the layering structure in BASE64 format. The content of this attribute is not accessible to non-SVC-aware mechanisms.

NAL Unit Aggregation and Fragmentation

The authors of [4] believe that the difficulty of defining more packet types can most easily be overcome by disallowing more than one layer in one RTP session. Alternatively, it could be possible to require that aggregation only be performed with NAL units belonging to a single layer. Finally, it could be argued that any device that wishes to re-arrange RTP packets must necessarily be media aware and can therefore be required to look into the media data itself. Hence, no support from a payload viewpoint is required. This is the reason why (Wenger and Wang, 2005) does not include new aggregation and fragmentation packets.
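
Because the draft adds no new fragmentation packets, a sender would presumably fall back on the fragmentation units of [3]. The Python sketch below illustrates that mechanism only, under the assumption of the FU-A layout from RFC 3984: an FU indicator octet (F and NRI copied from the original header, Type 28) followed by an FU header octet (start bit, end bit, a reserved bit, and the original 5-bit type). The function names are hypothetical, the input is treated as a plain one-byte-header NAL unit, and the extended SVC header octets are not handled.

```python
from typing import List

FU_A_TYPE = 28  # FU-A packet type from Table 1 of RFC 3984


def fragment_fu_a(nal_unit: bytes, max_payload: int) -> List[bytes]:
    """Split one NAL unit into FU-A payloads of at most max_payload bytes."""
    if len(nal_unit) < 2:
        raise ValueError("NAL unit needs a header octet and at least one payload byte")
    if max_payload < 3:
        raise ValueError("max_payload must leave room for two FU octets plus data")
    if len(nal_unit) <= max_payload:
        # Small enough to travel as a single NAL unit packet; no fragmentation.
        return [nal_unit]

    header = nal_unit[0]
    fu_indicator = (header & 0xE0) | FU_A_TYPE  # keep F and NRI, set Type to 28
    original_type = header & 0x1F
    body = nal_unit[1:]                          # the original header octet is not repeated
    chunk = max_payload - 2                      # two octets go to FU indicator and FU header

    pieces = [body[i:i + chunk] for i in range(0, len(body), chunk)]
    payloads = []
    for index, piece in enumerate(pieces):
        start_bit = 0x80 if index == 0 else 0x00
        end_bit = 0x40 if index == len(pieces) - 1 else 0x00
        fu_header = start_bit | end_bit | original_type
        payloads.append(bytes([fu_indicator, fu_header]) + piece)
    return payloads


def reassemble_fu_a(payloads: List[bytes]) -> bytes:
    """Rebuild the NAL unit from FU-A payloads already sorted by RTP sequence number."""
    if not payloads or not payloads[0][1] & 0x80 or not payloads[-1][1] & 0x40:
        raise ValueError("fragment list must begin with a start and finish with an end fragment")
    first = payloads[0]
    header = (first[0] & 0xE0) | (first[1] & 0x1F)  # restore F, NRI and the original type
    return bytes([header]) + b"".join(p[2:] for p in payloads)


if __name__ == "__main__":
    nal = bytes([0x65]) + bytes(range(10))
    fragments = fragment_fu_a(nal, max_payload=6)
    print(len(fragments), reassemble_fu_a(fragments) == nal)
```

The sketch leaves loss handling out: a real receiver would also check that the RTP sequence numbers of the fragments are consecutive before reassembling.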

References

[1] RFC 3550: RTP: A Transport Protocol for Real-Time Applications
[2] http://connekgroup.net/documents/ebook/computer_science/networking/top_down/PRESENTATIONS/CHAPTER6A.PPT
[3] RFC 3984: RTP Payload Format for H.264 Video
[4] Wenger, Stephan; Wang, Ye-Kui; Hannuksela, Miska M.: RTP payload format for H.264/SVC scalable video coding [draft]