Telecommunications Industry Association (TIA)
TR-30.1/99-10-036
Columbia, MD, Oct 13, 1999

COMMITTEE CONTRIBUTION
Technical Committee TR-30 Meetings

SOURCE: Hughes Network Systems
CONTACT: Jeff Heath
Hughes Network Systems
10450 Pacific Center Court
San Diego, CA 92121
Phone: (619) 452-4826
Fax: (619) 597-8979
E-mail: jheath@hns.com
TITLE: Upgrading V.42bis to LZJH
PROJECT: PN-xxxx
DISTRIBUTION: Members of TR-30 and TR-30.1 and meeting attendees

ABSTRACT
This paper describes the implementation changes necessary to upgrade a soft modem from V.42bis to a similar ITU Recommendation based upon the LZJH data compression algorithm. The paper describes the changes at a fairly low level and is intended to give the reader an appreciation of the relative ease of such an upgrade.

Copyright Statement
The contributor grants a free, irrevocable license to the Telecommunications Industry Association (TIA) to incorporate text contained in this contribution and any modifications thereof in the creation of a TIA standards publication; to copyright in TIA's name any TIA standards publication even though it may include portions of this contribution; and at TIA's sole discretion to permit others to reproduce in whole or in part the resulting TIA standards publication.

Intellectual Property Statement
The individual preparing this contribution knows of patents, the use of which may be essential to a standard resulting in whole or in part from this contribution.

1. Algorithm Differences
1. Dictionary structure
2. Code words vs. single characters
3. String Extension
4. Dictionary Entry Reuse
5. Dictionary Re-initialization
6. Transparent Mode

2. Dictionary Structure
A 4096 entry dictionary has 4096 unique code words (0 - 4095) available to represent strings of characters. Both algorithms reserve code words 0, 1, and 2 for control purposes.

2.1. V.42bis
The encoder and decoder dictionaries are identical.
Code words 3 through 258 are reserved for the 256 combinations of an 8 bit byte, and each has a DOWN index from which to create strings. Each of the remaining 3837 dictionary entries (code words 259 through 4095) has the following:
DOWN index - index to the dictionary entry having the next character in a string.
RIGHT index - index to the next dictionary entry having the same previous string as this entry.
UP index - index to the dictionary entry having the previous character in a string.
CHAR - the character represented by this dictionary entry.

2.2. LZJH
Unlike V.42bis, the encoder dictionary is different from the decoder dictionary.

2.2.1. Encoder Dictionary
The encoder dictionary is similar to the V.42bis encoder dictionary but has 3 parts as follows:
Root Dictionary - 256 DOWN indexes for the 256 possible combinations of an 8 bit byte.
Node Dictionary - Each of the 4093 entries (code words 3 through 4095) has the following:
    DOWN index - index to the dictionary entry having the next character(s) in a string.
    RIGHT index - index to the next dictionary entry having the same previous string as this entry.
    LOCATION index - index into the Character Dictionary where the first of the one or more characters (string extension characters) defined by this entry is located.
    COUNT - number of characters defined by this entry (the first of which is at LOCATION).
Character Dictionary - an array of variable length that contains all input characters received since the dictionaries were initialized. For a 4096 entry dictionary this should be about 25,000 bytes. The LOCATION index in the Node Dictionary points to character(s) within this dictionary.

2.2.2. Decoder Dictionary
Node Dictionary - Each of the 4093 entries (code words 3 through 4095) has the following:
    LOCATION index - index into the Character Dictionary where the last character of the string defined by this entry is located.
    DEPTH - the total number of characters in the string defined by this entry.
Character Dictionary - an array, the same size as the encoder Character Dictionary, that contains all input characters received since the dictionaries were initialized. The LOCATION index in the Node Dictionary points to characters within this dictionary.

3. Code Words vs. Single Characters
V.42bis always sends code words. With a 4096 entry dictionary the code word size, after reaching the maximum, remains at 12 bits until the dictionary is re-initialized. Thus when a single character is output it requires 12 bits even though there are only 8 significant bits.

LZJH sends code words, single characters, and string extension tokens. A bit is output prior to a code word or single character indicating which type follows. Thus code words require 13 bits (1 bit plus a 12 bit code word) and single characters require 9 bits (1 bit plus an 8 bit character). The disadvantage of sending an extra bit for code words is offset by the advantage of sending 9 bits for single characters. An additional advantage is that, since the decoder can distinguish between code words and single characters, 256 entries of the Node Dictionary need not be reserved for the 256 combinations of an 8 bit character (hence the Root Dictionary).

4. String Extension

4.1. V.42bis
Each time a repeating string of characters is encountered, the encoder (and decoder) adds the next input character to the end of the string, creating a dictionary entry. Thus a 10 character string has to repeat 9 times before the encoder has created an entry (code word) representing the entire string. The first time, the encoder chains the 2nd character to the 1st to create a 2 character string. Each dictionary entry defines a one character extension to the previous string.

4.2. LZJH
LZJH does the same as V.42bis the first time a repeating string of characters is encountered (i.e. it chains the 2nd character to the 1st to create a 2 character string).
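The chaining step that both algorithms share can be pictured with the DOWN, RIGHT, and UP indexes from Section 2. The following Python sketch is only an illustration, not text from either Recommendation: the class name Trie, the NULL marker of -1, and the choice to link a new entry at the front of its sibling chain are assumptions made for the example, and no dictionary-full handling is shown.

```python
NULL = -1          # assumed marker for "no entry"
FIRST_NODE = 259   # first V.42bis string entry in a 4096 entry dictionary


class Trie:
    """V.42bis-style dictionary held as parallel index arrays."""

    def __init__(self, size=4096):
        self.down = [NULL] * size    # first entry that extends this string
        self.right = [NULL] * size   # next entry with the same previous string
        self.up = [NULL] * size      # entry holding the previous character
        self.char = [NULL] * size    # character represented by the entry
        for byte in range(256):
            self.char[3 + byte] = byte   # code words 3..258 are the single bytes
        self.next_free = FIRST_NODE

    def find_child(self, parent, ch):
        """Follow DOWN once, then RIGHT, looking for character ch."""
        node = self.down[parent]
        while node != NULL and self.char[node] != ch:
            node = self.right[node]
        return node

    def add_child(self, parent, ch):
        """Chain ch onto the string that ends at parent."""
        node = self.next_free
        self.next_free += 1
        self.char[node] = ch
        self.up[node] = parent
        self.right[node] = self.down[parent]   # link at front of sibling chain
        self.down[parent] = node
        return node


trie = Trie()
a = 3 + ord("A")                    # code word for the single character 'A'
ab = trie.add_child(a, ord("B"))    # chains 'B' to 'A', creating entry 259
```

In this sketch a later lookup of 'B' under 'A' returns entry 259, while a lookup of a character that has not yet followed 'A' returns NULL, which is the cue to create a new entry.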
In fact, both algorithms always chain the next character to be processed onto a single character that is not yet part of a string. However, the second time a repeating string of 3 or more characters is encountered by the LZJH encoder, it immediately extends the string to its maximum length (up to the maximum string parameter). Using the LOCATION index in the dictionary entry for the 2nd character of the string, it compares the bytes following the first instance of the string to the next bytes to be processed following the second instance of the string (the first 2 characters of which have already been encoded). If the encoder finds that 1 or more characters in both instances of the string match, i.e. it is a 3 or more character string, it sends a string extension signal to the decoder to extend the string by the number of characters that match (subject to the maximum string parameter).

Node Dictionary entries define one or many characters. The LOCATION index into the Character Dictionary and the COUNT define the sequence of string extension characters defined by each entry in the Node Dictionary.

5. Dictionary Entry Reuse
V.42bis reuses dictionary entries when the dictionary becomes full. For a 4096 entry dictionary, 3837 entries of which are available for string extensions, the dictionary becomes full after about 3900 bytes have been processed. From that moment until the dictionary is RESET (which happens rarely in most V.42bis implementations), dictionary entries are reused. Starting at the first available entry (number 259), the encoder and decoder take the first entry they can find that does not have a DOWN index (i.e. it is the last character in a chained string) and delete it from the chain. The entry is then reused on the next input character to create a string extension.

LZJH does not reuse dictionary entries. When either the Node Dictionary or the Character Dictionary becomes full, the dictionaries are immediately re-initialized.
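The V.42bis reuse step described above (find the first entry from 259 upward that has no DOWN index and delete it from its chain) can be sketched as follows. This is a simplified illustration under stated assumptions: the function name detach_first_leaf, the NULL marker of -1, and the parallel-list layout are hypothetical, and safeguards a real implementation needs (for example, not reclaiming the entry that was just created) are left out.

```python
NULL = -1          # assumed marker for "no entry"
FIRST_NODE = 259   # first reusable entry in a 4096 entry dictionary


def detach_first_leaf(down, right, up, size=4096):
    """Unlink and return the first string entry that has no DOWN index."""
    for node in range(FIRST_NODE, size):
        if up[node] == NULL or down[node] != NULL:
            continue   # entry is unused, or longer strings still hang below it
        parent = up[node]
        if down[parent] == node:
            down[parent] = right[node]        # leaf was first in its chain
        else:
            sibling = down[parent]
            while right[sibling] != node:     # walk RIGHT to the predecessor
                sibling = right[sibling]
            right[sibling] = right[node]
        up[node] = right[node] = NULL
        return node
    return NULL


# Tiny demonstration dictionary: 259 = "AB", 260 = "ABC", 261 = "AD".
down, right, up = [NULL] * 4096, [NULL] * 4096, [NULL] * 4096
A = 3 + ord("A")
down[A] = 259
up[259], down[259], right[259] = A, 260, 261
up[260] = 259
up[261] = A

first = detach_first_leaf(down, right, up)    # 260: "ABC" is the first leaf
second = detach_first_leaf(down, right, up)   # 259: "AB" has become a leaf
```

Entry 259 is skipped on the first call because it still has a DOWN index; once 260 has been removed, 259 becomes the last character of its chain and is the next entry reclaimed.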
Since LZJH can reference many characters in a dictionary entry, it can process many more characters than V.42bis before the dictionary fills. Hence the 25K Character Dictionary for a 4096 entry dictionary. With 4 to 1 compression, LZJH will process over 20K characters before the 4092 available entries are all used. The Character Dictionary should be large enough that the Node Dictionary fills first, so that the compression ratio is not restricted.

6. Dictionary Initialization (Re-initialization)

6.1. V.42bis
The V.42bis dictionaries are initialized (with a 4096 entry dictionary) by the following steps:
put a null index into the DOWN index of the 256 reserved entries and the 3837 node entries of the dictionary. Alternatively, just set all 4096 DOWN indices to illegal values.
put a null index into the UP index of the 3837 node entries of the dictionary.
set the variable holding the next dictionary entry to be used to its initial value (i.e. 259).
set the code word size to its initial value (i.e. 9).

6.2. LZJH
The LZJH dictionaries are initialized (with a 4096 entry dictionary) by the following steps:
put a null index into the DOWN index of the 256 entries in the encoder's Root Dictionary.
set the variable holding the next dictionary entry to be used to its initial value (i.e. 3).
set the code word size to its initial value (i.e. 6).
set the next location index into the Character Dictionary to its initial value (i.e. 0).
NOTE that neither the encoder's nor the decoder's Node Dictionary indices need to be initialized.

7. Transparent Mode

7.1. V.42bis
V.42bis transparent mode has 3 command characters, one of which is an ESC character that is modified each time it appears in the input data. The ESC character precedes the other two command characters (RESET, ECM) and itself, which makes moderate expansion of the data possible while in transparent mode. The RESET is used to re-initialize the dictionaries and the ECM to go back to compression mode.
While in transparent mode the encoder dictionary is filled with data that does not compress, and the decoder must actually encode the transparent data to keep its dictionary current with that of the encoder.

7.2. LZJH
Although LZJH could have a transparent mode with command characters, including an ESC character, the proposed implementation is much simpler than that of V.42bis. After every N input characters that it processes and compresses (N is currently 200), LZJH checks whether compression was successful. If so, it stays in compression mode and sends the compressed output. If not, it places an ETM into the output (same as V.42bis) followed by the N transparent characters and re-initializes its dictionaries. The ETM tells the decoder to go into transparent mode, process exactly N characters transparently, then re-initialize the dictionary and return to compression mode. The encoder, with its dictionary re-initialized, processes N more input characters, sending the compressed output if they compress or an ETM followed by the N transparent characters if they don't. The decoder, in compression mode, sees either compressed data or an ETM indicating a switch to transparent mode for N characters.

8. Examples
The following examples show the difference between the V.42bis and LZJH encoders and how the dictionary is used. A 4096 entry dictionary that has already stepped up to the maximum 12 bit code word is assumed. The examples show how each encoder processes an 8 character string 'ABCDEFGH' which is repeated 8 times in the input data. For simplicity, intervening characters between each instance of the example string are ignored.

This exemplifies how the V.42bis encoder builds its dictionary one character at a time and how it must see 7 instances of an 8 character string before it has built a code word that represents the entire string, such that it can encode the string with one code word on the 8th instance.
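The V.42bis figures quoted for this example can be reproduced with a much simplified model. The Python sketch below is an assumption-laden illustration rather than a V.42bis implementation: it keeps the dictionary as a plain set of strings instead of DOWN/RIGHT/UP indexes, treats each instance of the string in isolation (matching the statement that intervening characters are ignored), and charges a fixed 12 bits per code word. The function name v42bis_like_counts is hypothetical.

```python
def v42bis_like_counts(pattern, repeats, code_bits=12):
    """Count code words per instance when each match grows by one character."""
    known = set(pattern)          # single characters are always available
    counts = []
    for _ in range(repeats):
        pos, emitted = 0, 0
        while pos < len(pattern):
            end = pos + 1
            # Greedily extend the match while the longer string is known.
            while end < len(pattern) and pattern[pos:end + 1] in known:
                end += 1
            emitted += 1          # one code word covers pattern[pos:end]
            if end < len(pattern):
                # New entry: the matched string plus the next character.
                known.add(pattern[pos:end + 1])
            pos = end
        counts.append(emitted)
    new_entries = len(known) - len(set(pattern))
    return counts, new_entries, sum(counts) * code_bits


print(v42bis_like_counts("ABCDEFGH", 8))
# → ([8, 4, 4, 3, 2, 2, 2, 1], 18, 312)
```

Under these assumptions the model needs 26 code words in total, builds 18 entries, and reaches a single code word only on the 8th instance.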
In contrast, LZJH builds a code word to represent the entire string the 2nd time it is seen and encodes the string with a single code word on the 3rd instance. Not only does LZJH build only 9 dictionary entries (compared to 18 for V.42bis), but it also uses only 169 bits to encode the 8 instances of the string (compared to 312 bits), which is about a 46% savings; put another way, V.42bis needs about 85% more bits. Note that the savings are greater as the length of the string increases. LZJH can encode a string of any size (up to the string length maximum of 255) the 2nd time it sees it and send a single code word on the 3rd instance.

The V.42bis decoder builds exactly the same dictionary as the encoder. Thus, on the 8th instance of the string, when the decoder receives code word 276, it traverses the tree backwards through the UP indexes until it gets to an entry that is a single character (i.e. less than 259). It then must copy the characters into the output in reverse order to re-create the string. In contrast, the LZJH decoder dictionary just maintains an index into the Character Dictionary to the last character of a string and its total length. When it receives code word a, it just uses the index and the total length to get to the beginning of a previous instance of the string within the Character Dictionary and copies it to the output.

Legend: Shown is the input string, each of the 8 instances in the case of V.42bis. Also shown are the output of the encoder and the number of bits. New nodes built during each instance of the string are gray shaded. V.42bis nodes have the character itself within the dictionary entry. LZJH encoder nodes have an index to the character(s) and the number referenced. For the 2nd instance, when the LZJH encoder found the string 'AB' and output code word 3, it compared the characters in the Character Dictionary following the 1st instance of 'AB' with the next characters to be processed, which followed the 2nd instance of 'AB'.
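That comparison can be sketched in a few lines of Python. The sketch is illustrative only: the function name extension_length, its parameters, and the '-' and '+' bytes standing in for the ignored intervening characters are all assumptions, and the Character Dictionary is modeled as a simple byte string.

```python
def extension_length(char_dict, after_first, pending, max_string=255, matched=2):
    """Count how many pending characters repeat what followed the first instance.

    char_dict   -- characters stored so far (assumed Character Dictionary)
    after_first -- index just past the first instance of the matched string
    pending     -- input characters still to be processed
    matched     -- characters already covered by the code word just sent
    """
    limit = min(len(pending), len(char_dict) - after_first, max_string - matched)
    n = 0
    while n < limit and char_dict[after_first + n] == pending[n]:
        n += 1
    return n


# First instance, one intervening byte, then the 'AB' of the second instance.
char_dict = b"ABCDEFGH" + b"-" + b"AB"
pending = b"CDEFGH" + b"+"          # rest of the second instance, then other data
print(extension_length(char_dict, 2, pending))
# → 6
```

The count stops at 6 here because the byte after the first 'H' differs from the byte after the second one; the max_string bound plays the role of the maximum string parameter.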
Since 6 characters in both instances matched, it sent a signal to the decoder to extend the string referenced by code word 3 by 6 additional characters and built a dictionary entry (a) for 6 characters. LZJH decoder nodes have an index to the last character and the total number of characters in the string.

V.42bis Encoder and Decoder
[Figure: dictionary tree after each of the 8 instances of ABCDEFGH, built from the reserved single character entries A through H, with new nodes 259 through 276 shaded as they are added.]

instance  input     output              bits
1         ABCDEFGH  A B C D E F G H     8 X 12 = 96
2         ABCDEFGH  259 261 263 265     4 X 12 = 48
3         ABCDEFGH  266 262 264 H       4 X 12 = 48
4         ABCDEFGH  269 268 H           3 X 12 = 36
5         ABCDEFGH  272 271             2 X 12 = 24
6         ABCDEFGH  274 265             2 X 12 = 24
7         ABCDEFGH  275 H               2 X 12 = 24
8         ABCDEFGH  276                 1 X 12 = 12
                                        total = 312

LZJH Encoder
[Figure: root dictionary, node dictionary, and character dictionary after each instance of ABCDEFGH, with new nodes shaded.]

instance  input     output                            bits
1         ABCDEFGH  A B C D E F G H                   8 X 9 = 72
2         ABCDEFGH  code word 3, string extension     13 + 6 = 19
3         ABCDEFGH  code word a                       1 X 13 = 13
4 - 8     ABCDEFGH  code word a for each instance     5 X 13 = 65
                                                      total = 169

LZJH Decoder
[Figure: root dictionary, node dictionary (index to last character and total length for each entry), and character dictionary for the same input and output.]
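The LZJH decoder lookup described in the text (an index to the last character plus the total length) amounts to a single slice copy. The following Python sketch is a minimal illustration under stated assumptions: the function name decode_string and the sample LOCATION and DEPTH values are hypothetical and are not read from the figure.

```python
def decode_string(char_dict, location, depth):
    """Copy a string out of the Character Dictionary.

    location -- index of the LAST character of the string
    depth    -- total number of characters in the string
    """
    start = location - depth + 1           # step back to the first character
    return bytes(char_dict[start:location + 1])


# Character Dictionary after the first instance has been received.
char_dict = bytearray(b"ABCDEFGH")

whole = decode_string(char_dict, location=7, depth=8)   # entire string
pair = decode_string(char_dict, location=1, depth=2)    # just 'AB'

# Decoded characters are appended so later entries can point at them.
char_dict += whole
```

No backward walk through UP indexes and no reversal of characters is needed, which is the contrast with the V.42bis decoder that the text draws.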