ROHC draft contribution
Zhigang Liu, Nokia
September 12, 2000

This document contains a new version of the text on encoding. It is
based on section 4.5 and section 5.8 of <draft-ietf-rohc-rtp01.txt>.
Section 5.8 is now merged with section 4.5. The major changes are:

   - Changed the names "VLE" and "OVLE" to window-based LSB encoding,
     and rewrote the two sections related to LSB encoding.

   - Removed sections 4.5.2 and 4.5.3.

   - Added a section on scaled RTP TS encoding.

   - Added a section on offset IP-ID encoding.


Bormann (ed.)                                                   [Page 1]

INTERNET-DRAFT          Robust Header Compression          July 14, 2000


4.5.  Encoding methods

   This chapter describes in general terms the encoding methods that
   can be used for different header fields. How the methods are applied
   to each field (e.g. the values of associated parameters) will be
   specified in the packet format chapter.

4.5.1.  Least Significant Bits (LSB) encoding

   Least Significant Bits (LSB) encoding can be used for a header field
   whose value usually changes by only a small amount. In simplest
   terms, the compressor sends the k least significant bits of the
   field value instead of the original field value, where k is a
   positive integer. After receiving the k LSBs, the decompressor
   derives the original value using a previously received value as
   reference (v_ref).

   The scheme is guaranteed to be correct if the compressor and the
   decompressor agree on an interpretation interval in which 1) the
   original value resides and 2) the original value is the only value
   that has exactly the same k LSBs as those transmitted in the
   compressed packet. (It will be shown later that this condition can
   actually be relaxed.)

   The interval can be described as a function f(v_ref, k).
   Without loss of generality, we choose the function as

      f(v_ref, k) = [v_ref - p, v_ref + (2^k - 1) - p],

   where p is an integer.

         <-------------- interpretation interval -------------->
         |-------------+---------------------------------------|
     v_ref - p       v_ref                            v_ref + (2^k-1) - p

   Note that function f has a useful property: for any value of k, the
   k LSBs uniquely identify a value in the interval, because the
   interval contains exactly 2^k consecutive values. This is referred
   to as the uniqueness property of function f, and it satisfies
   condition 2) above.

   The parameter p is introduced so that the interpretation interval
   can be shifted with respect to v_ref. A good choice of p can yield
   more efficient encoding for fields with certain characteristics. For
   example, if a field value observed by the compressor is always
   increasing, p should be set to 0, which gives the interval
   [v_ref, v_ref + 2^k - 1].

   Because LSB encoding will be applied in ROHC to header fields whose
   values normally increase except in some uncommon situations (e.g.
   packet misordering, RTP TS for video, etc.), only two values will be
   assigned to p: p1 = 0 or p2 = 2^(k-2) - 1. The latter gives the
   interpretation interval [v_ref - (2^(k-2) - 1), v_ref + 3*2^(k-2)],
   which can handle negative changes but still leaves more space for
   positive changes. See the packet format section for more details.

   The compression and decompression procedure is as follows:

   1) The compressor (decompressor) always uses as v_ref_c (v_ref_d)
      the last value that has been compressed (decompressed);

   2) When compressing a value v, the compressor simply calculates the
      minimal (for efficiency) but sufficient (for correctness) value
      of k such that v falls into the interval (interval_c) defined by
      function f with v_ref = v_ref_c. Let us represent this procedure
      as a function k = g(v, v_ref).
      Note that more than k LSBs may be sent to fit the packet format
      (see the packet format section);

   3) When receiving m LSBs, the decompressor derives the
      interpretation interval (interval_d) defined by f with k = m and
      v_ref = v_ref_d. It then picks as the decompressed value the one
      in interval_d whose LSBs match the received ones.

   The scheme is complicated by two factors: packet loss between the
   compressor and decompressor, and transmission errors undetected by
   the lower layer.

   In the former case, the compressor and decompressor lose
   synchronization of v_ref, and thus of the interpretation interval.
   If v is still covered by intersection(interval_c, interval_d),
   decompression will be correct. Otherwise, incorrect decompression
   will happen. The next section addresses this issue.

   In the latter case, the corrupted LSBs yield an incorrectly
   decompressed value that is later used as v_ref, which in turn will
   probably lead to damage propagation. This can be solved by the idea
   of a secure reference, i.e. a reference value whose correctness can
   be verified by a CRC calculated over it. Consequently, procedure 1)
   above is modified as follows: the compressor always uses as v_ref_c
   the last value that has been compressed and sent with a CRC, and
   the decompressor always uses as v_ref_d the last correctly (as
   verified by the CRC) decompressed value.

4.5.2.  Window-based LSB encoding

   The idea is simple. Although the compressor cannot determine the
   exact value of v_ref_d that the decompressor will use for a
   particular value v, it knows the set of candidate values from which
   the decompressor may choose v_ref_d. Such a candidate set is a
   subset of the values that the compressor has sent before v. The
   compressor then calculates the value of k such that, no matter
   which v_ref_d the decompressor uses, the resulting interval_d
   covers v.
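
   As an illustration only, the following Python sketch shows the
   functions f and g of section 4.5.1 together with the choice of k
   over a set of candidate references. The names interval, p1, p2,
   min_k, window_k and decode are hypothetical and not part of this
   draft. The sketch assumes unbounded integers (wraparound of the
   field is ignored), at most 16 LSBs, and p = 0 for k < 2 when p2 is
   used, since 2^(k-2) - 1 is not an integer in that case.

```python
def interval(v_ref, k, p):
    """f(v_ref, k) = [v_ref - p, v_ref + (2^k - 1) - p]."""
    return v_ref - p, v_ref + (1 << k) - 1 - p


def p1(k):
    """p1 = 0: suits values that only increase."""
    return 0


def p2(k):
    """p2 = 2^(k-2) - 1; assumption: use 0 when k < 2."""
    return (1 << (k - 2)) - 1 if k >= 2 else 0


def min_k(v, v_ref, p_of_k, max_bits=16):
    """g(v, v_ref): smallest k whose interval around v_ref contains v."""
    for k in range(1, max_bits + 1):
        lo, hi = interval(v_ref, k, p_of_k(k))
        if lo <= v <= hi:
            return k
    raise ValueError("v is too far from v_ref for max_bits LSBs")


def window_k(v, candidates, p_of_k):
    """k such that interval_d covers v for every candidate v_ref_d."""
    return max(min_k(v, ref, p_of_k) for ref in candidates)


def decode(lsbs, k, v_ref, p_of_k):
    """The value in f(v_ref, k) whose k LSBs equal lsbs."""
    lo, _ = interval(v_ref, k, p_of_k(k))
    return lo + (lsbs - lo) % (1 << k)


# Hypothetical candidate references and a new value to compress.
candidates = [100, 101, 103]
v = 104
k = window_k(v, candidates, p1)
lsbs = v & ((1 << k) - 1)
# Every candidate reference decodes the same k LSBs back to v.
assert all(decode(lsbs, k, ref, p1) == v for ref in candidates)
```

   Because the interval for a given v_ref only grows as k grows,
   taking the maximum of g over all candidates yields a k whose
   interval covers v for each of them.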

   Because of the ordering introduced by the fact that the compressor
   (decompressor) always uses the last sent (received) value with a
   CRC as the reference, the compressor actually maintains a sliding
   window (VSW) rather than a set. VSW is initially empty. The
   compressor performs the following operations on VSW:

   1) After sending a value v (either compressed or uncompressed) with
      a CRC, the compressor adds v to the head of VSW;

   2) For each value v being compressed, the compressor chooses
      k = max(g(v, v_min), g(v, v_max)), where v_min and v_max are the
      minimum and maximum values in VSW, and g is the function defined
      in the previous section;

   3) The sliding window is advanced when the compressor has
      sufficient confidence that values older than a certain value
      (i.e. those at the tail of VSW) will no longer be used by the
      decompressor as v_ref_d. The confidence may be obtained by
      various means, e.g. an ACK from the decompressor when operating
      in R mode, in which case values older than the ACKed one can be
      removed from VSW. In U/O mode, since there is a CRC to verify
      correct decompression, a VSW of a certain depth will be used.
      The window depth is an optimization parameter determined during
      implementation. A special case is to set it to 1.

   Note that the decompressor follows the same procedure as described
   in the previous section, except that it MUST ACK each value
   received with a CRC (refer to the compression/decompression logic
   section for details).

4.5.3.  Scaled RTP Timestamp encoding

   The RTP Timestamp (RTP TS) does not increase by an arbitrary number
   from packet to packet. Instead, the increase is normally an
   integral multiple of some unit (TS_STRIDE). For example, in the
   case of audio, the sample rate is normally 8 kHz and one voice
   frame covers 20 ms. Furthermore, each voice frame is usually
   carried in one RTP packet. Therefore, the RTP TS increment is
   always n * 160 (= 8000 * 0.02), where n is an integer.
   Note that silence periods have no impact on this, as the sample
   clock at the source normally keeps running without changing either
   the frame rate or the frame boundaries.

   For video, there is a TS_STRIDE as well if we look at the frame
   level. The sample rate for most video codecs is 90 kHz. If the
   frame rate is fixed at, say, 30 frames/second, the RTP TS increment
   between frames will always be n * 3000 (= 90000 / 30). Note that
   one video frame is normally divided into several RTP packets to
   achieve robustness against packet loss. Consequently, several RTP
   packets can carry the same RTP TS. However, this does not affect
   the use of TS_STRIDE for the RTP TS in packets belonging to
   different video frames.

   Therefore, the RTP TS can be downscaled by a factor of TS_STRIDE
   before compression. This saves s bits for each compressed RTP TS,
   where s = floor(log2(TS_STRIDE)). The detailed algorithm is as
   follows:

   1) Initialization. The compressor sends to the decompressor the
      value of TS_STRIDE (e.g. via in-band signalling, see the packet
      format section) and the absolute value of one RTP TS (TS_0). The
      latter is used by the decompressor to initialize
      TS_OFFSET = TS_0 mod TS_STRIDE.

   2) After initialization, the compressor no longer compresses the
      original RTP TS values. Instead, it compresses the downscaled
      values: TS_scaled = RTP TS / TS_STRIDE (integer division). The
      compression method can be window-based LSB encoding, the
      timer-based encoding described in the next section, or any other
      method.

   3) When receiving the compressed information of TS_scaled, the
      decompressor first derives the original TS_scaled. The original
      RTP TS can then be calculated as
      RTP TS = TS_scaled * TS_STRIDE + TS_OFFSET.

   4) Note that the wraparound of the 32-bit RTP TS invalidates the
      current value of TS_OFFSET used in the equation above.
      For example, assume TS_STRIDE = 160 = 0xA0 and the current
      RTP TS = 0xFFFFFFF0, which means TS_OFFSET = 0x50 = 80. If the
      next RTP TS = 0x00000130 (i.e. the increment = 160 * 2 = 320),
      the new TS_OFFSET should be 0x00000130 mod 0xA0 = 0x90 = 144
      instead of 80 (0x50). Therefore, the decompressor must detect
      the wraparound of the RTP TS (which is trivial) and update
      TS_OFFSET.

   In general, this method can be applied to many frame-based codecs.
   However, the value of TS_STRIDE might (though this is unlikely)
   change during a session. If that happens, the RTP TS has to be
   compressed unscaled until the new TS_STRIDE and TS_OFFSET have been
   re-initialized.

4.5.4.  Timer-Based Compression of RTP Timestamp

   By the definition of RTP TS [RFC 1889], the RTP TS (when generated
   at the source) should closely follow a linear pattern as a function
   of the time-of-day clock, particularly when speech is carried in
   the RTP payload. The linear ratio is basically determined by the
   source sample rate (assumed fixed), but may be complicated by
   packetization (e.g. in the case of video) or by frame rearrangement
   (e.g. B-frames in some video codecs). As shown by the example in
   the previous section, with a fixed sample rate of 8 kHz, 20 ms in
   the time domain is equivalent to an increment of 1 in the scaled
   RTP TS domain, or 160 in the unscaled RTP TS domain.

   Consequently, the (scaled) RTP TS values in headers arriving at the
   decompressor also follow a linear pattern as a function of time,
   but less closely, due to the delay jitter between the source and
   the decompressor. In normal operation (no crashes or failures), the
   delay jitter is bounded in order to meet the requirements of
   conversational real-time traffic. Hence, by using a local clock to
   measure packet arrival time, the decompressor can obtain an
   approximation of the (scaled) RTP TS in the header to be
   decompressed.
   The approximation is then refined with the k LSBs of the (scaled)
   RTP TS carried in the header.

   The value of k required to ensure correct decompression is a
   function of the jitter between the source and the decompressor. The
   compressor can estimate the jitter (using a local clock) and
   determine k, or alternatively it can use a fixed k and filter out
   the packets with excessive jitter.

   The advantages of this scheme are:

   - The size of the compressed RTP TS is constant and small. In
     particular, it does NOT depend on the length of the silence
     interval. This is in contrast to other RTP TS compression
     techniques, which require a variable number of bits depending on
     the duration of the preceding silence interval.

   - No synchronization is required between the two clocks: the one
     local to the compressor and the one local to the decompressor.

   Note that although this scheme can work in theory with both scaled
   and unscaled RTP TS, in practice it is preferable to combine it
   with scaled RTP TS encoding because of the less demanding
   requirement on the clock resolution (e.g. 20 ms instead of 1/8 ms).
   Therefore, the algorithm described below assumes that the scheme
   works on top of scaled RTP TS. (The case of unscaled RTP TS is very
   similar, with only slight changes to the scale factors.)

   Compressor: the major task is to determine the value of k.

   1) The compressor maintains a sliding window

         TSW = {(T_j, t_j) | for each header j sent previously with a
                CRC},

      where T_j is the scaled RTP TS of header j and t_j is the
      arrival time of header j. TSW has the same purpose as VSW
      (section 4.5.2).

   2) When a new header n arrives with T_n as the scaled RTP TS, the
      compressor measures the arrival time t_n. It then calculates

         Max_Jitter_BC =
            max {|(T_n - T_j) - (t_n - t_j) / TIME_STRIDE|}
            over all headers j in TSW,

      where TIME_STRIDE is a period of real time (e.g. 20 ms) that is
      equivalent to one TS_STRIDE.
      Max_Jitter_BC is the maximum jitter before the compressor (in
      units of TS_STRIDE), using all the headers in the sliding window
      as reference.

   3) k is then calculated as

         k = ceiling(log2(2 * J + 1)),

      where J = Max_Jitter_BC + Max_Jitter_CD + 2. Max_Jitter_CD is
      the upper bound of the jitter expected on the communication
      channel between compressor and decompressor (CD-CC). It depends
      only on the characteristics of the CD-CC and is expected to be
      reasonably small in order to provide good quality for real-time
      services. The constant 2 in J accounts for the quantization
      error caused by the clocks at the compressor and decompressor,
      which can be +/- 1. Note that the calculation of k follows the
      same compression algorithm as described in section 4.5.1, with
      p = 2^(k-1) - 1.

   4) The pair (T_n, t_n) is added to TSW if header n is sent with a
      CRC. TSW is advanced in the same way as described in section
      4.5.2.

   Decompressor:

   1) The decompressor always uses as reference header the last
      correctly (as verified by CRC) decompressed header. It maintains
      the pair (T_ref, t_ref), where T_ref is the scaled RTP TS of the
      reference header and t_ref is the arrival time of the reference
      header.

   2) When receiving a compressed header n at time t_n, the
      decompressor calculates the approximation of the original scaled
      TS as

         T_approx = T_ref + (t_n - t_ref) / TIME_STRIDE.

   3) The approximation is then refined by the k LSBs carried in
      header n, following the same decompression algorithm as in
      section 4.5.1, with T_approx as v_ref and p = 2^(k-1) - 1.

   Note that the algorithm does not assume any special behavior of the
   input to the compressor (i.e. it tolerates reordering or, more
   generally, non-increasing RTP timestamp behavior observed prior to
   the compressor). In addition, the clock resolution is allowed to be
   worse than TIME_STRIDE, in which case the difference (i.e. actual
   resolution - TIME_STRIDE) can be treated as additional jitter in
   the calculation of k.
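
   The compressor and decompressor steps of this section can be
   sketched in Python as follows. This is a minimal illustration, not
   a normative implementation: the names compressor_k and
   decompress_ts are hypothetical, times are assumed to be integer
   milliseconds, and rounding T_approx to the nearest integer is an
   assumption that the draft does not specify.

```python
import math


def compressor_k(tsw, T_n, t_n, time_stride, max_jitter_cd):
    """Compressor steps 2) and 3): derive k for header n.

    tsw is a list of (T_j, t_j) pairs; all times use the same unit
    as time_stride (milliseconds in the example below).
    """
    max_jitter_bc = max(
        abs((T_n - T_j) - (t_n - t_j) / time_stride) for T_j, t_j in tsw
    )
    J = max_jitter_bc + max_jitter_cd + 2
    return math.ceil(math.log2(2 * J + 1))


def decompress_ts(lsbs, k, T_ref, t_ref, t_n, time_stride):
    """Decompressor steps 2) and 3): rebuild the scaled RTP TS."""
    # Assumption: round the clock-based estimate to the nearest integer.
    T_approx = T_ref + round((t_n - t_ref) / time_stride)
    p = (1 << (k - 1)) - 1
    lo = T_approx - p  # lower edge of the interpretation interval
    return lo + (lsbs - lo) % (1 << k)


# Hypothetical numbers: TIME_STRIDE = 20 ms, Max_Jitter_CD = 1.
tsw = [(100, 0), (101, 20)]  # (T_j, t_j) on the compressor clock
k = compressor_k(tsw, T_n=110, t_n=230, time_stride=20, max_jitter_cd=1)
lsbs = 110 & ((1 << k) - 1)

# The decompressor clock is unrelated to the compressor clock; only
# differences between arrival times are used.
T = decompress_ts(lsbs, k, T_ref=101, t_ref=1025, t_n=1240, time_stride=20)
assert T == 110
```

   In this example the header arrives 1.5 strides later than its
   timestamp suggests, so J = 1.5 + 1 + 2 = 4.5 and 4 LSBs are sent.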

4.5.5.  Offset IP-ID encoding

   This section assumes that the IPv4 stack at the source host assigns
   to IP-ID the value of a 2-byte counter that is incremented by one
   after each assignment. Therefore, the IP-ID field of a particular
   IPv4 packet flow will be sequential (incrementing by 1 from packet
   to packet) except when the sequence is disrupted by other flows.
   The observation is thus that the RTP SN increases by 1 for each
   packet and the IP-ID increases by at least the same amount.
   Therefore, it is more efficient to compress the offset, i.e.
   (IP-ID - RTP SN), instead of the IP-ID itself.

   The following text describes how to compress/decompress the
   sequence of offsets using window-based LSB encoding/decoding, with
   p = p2 (see section 4.5.1).

   Compressor: The compressor maintains a sliding window of IP-ID
   offsets,

      W = {offset_i = ID_i - SN_i | for each header i sent with a CRC},

   where ID_i and SN_i are the values of IP-ID and RTP SN in header i,
   respectively. When compressing packet n in the FO state, it
   calculates offset_n = ID_n - SN_n and then compresses offset_n
   using window-based LSB encoding, with W as the context.

   Decompressor: The reference header is the last correctly (as
   verified by CRC) decompressed header. When receiving a compressed
   packet m, the decompressor calculates offset_ref = ID_ref - SN_ref,
   where ID_ref and SN_ref are the values of IP-ID and RTP SN in the
   reference header, respectively. It then decompresses offset_m
   following window-based LSB decoding, using the LSBs in packet m and
   offset_ref as the decompression context. Finally, it regenerates
   the IP-ID of packet m as
   IP-ID = RTP SN in packet m + decompressed offset_m.

   Note that some IPv4 stacks use little-endian byte order for the
   IP-ID field instead of big-endian (the network byte order). In that
   case, the compressor can compress the IP-ID field after swapping
   its bytes.
   Consequently, the decompressor also swaps the bytes after
   decompression to regenerate the original IP-ID. This trick works
   only if the compressor and the decompressor agree on the byte order
   of the IP-ID field (see the header format section).
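
   The offset encoding of this section can be sketched in Python as
   follows. This is an illustration under stated assumptions, not a
   normative implementation: all function names are hypothetical,
   offsets are computed modulo 2^16 because IP-ID and RTP SN are
   16-bit fields, and p2 is taken as 0 for k < 2.

```python
MOD = 1 << 16  # IP-ID and RTP SN are 16-bit fields


def p2(k):
    """p2 = 2^(k-2) - 1; assumption: use 0 when k < 2."""
    return (1 << (k - 2)) - 1 if k >= 2 else 0


def min_k(v, v_ref):
    """Smallest k whose interval around v_ref contains v (mod 2^16)."""
    for k in range(1, 17):
        lo = v_ref - p2(k)
        if (v - lo) % MOD < (1 << k):
            return k


def decode(lsbs, k, v_ref):
    """The value in the interval around v_ref whose k LSBs equal lsbs."""
    lo = v_ref - p2(k)
    return (lo + (lsbs - lo) % (1 << k)) % MOD


def compress_ip_id(ip_id, sn, window):
    """Return (LSBs, k) for offset_n, with window W as the context."""
    offset = (ip_id - sn) % MOD
    k = max(min_k(offset, ref) for ref in window)
    return offset & ((1 << k) - 1), k


def decompress_ip_id(lsbs, k, sn, id_ref, sn_ref):
    """Regenerate the IP-ID of a packet from its RTP SN and the LSBs."""
    offset_ref = (id_ref - sn_ref) % MOD
    return (sn + decode(lsbs, k, offset_ref)) % MOD


def swap16(x):
    """Swap the two bytes of a 16-bit value (for little-endian IP-ID)."""
    return ((x & 0xFF) << 8) | (x >> 8)


# Hypothetical flow: earlier headers (ID, SN) = (5000, 10), (5001, 11)
# and (5003, 12) give the window of offsets below.
W = [4990, 4990, 4991]
lsbs, k = compress_ip_id(5006, 13, W)
assert decompress_ip_id(lsbs, k, 13, id_ref=5003, sn_ref=12) == 5006
```

   When the source uses little-endian byte order, swap16 would be
   applied to the IP-ID before compress_ip_id and again to the result
   of decompress_ip_id.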