Telecommunications Industry Association (TIA)    TR-30.1/99-12-0
Clearwater, Florida
November 29, 1999

COMMITTEE CONTRIBUTION
Technical Committee TR-30 Meetings

SOURCE:        Hughes Network Systems
CONTACT:       Jeff Heath
               Hughes Network Systems
               10450 Pacific Center Court
               San Diego, CA 92121
               Phone: (619) 452-4826
               Fax: (619) 597-8979
               E-mail: jheath@hns.com
TITLE:         Comments on the ADI Data Compression Paper (TR-30.1-99-11-060)
PROJECT:       PN-xxxx
DISTRIBUTION:  Members of TR-30 and TR-30.1 and meeting attendees

ABSTRACT

This paper raises several important issues regarding the ADI Data Compression contribution to the TR30.1 Clearwater meeting. Among them:

- The execution speed of LZ77 encoders, which are notoriously slow.
- Intellectual Property Rights. The algorithm described potentially infringes on several patents. Most notably, the algorithm described is a slight variation of the algorithm over which Stac successfully sued Microsoft.
- The comparison testing presented in the ADI contribution.

Copyright Statement

The contributor grants a free, irrevocable license to the Telecommunications Industry Association (TIA) to incorporate text contained in this contribution and any modifications thereof in the creation of a TIA standards publication; to copyright in TIA's name any TIA standards publication even though it may include portions of this contribution; and at TIA's sole discretion to permit others to reproduce in whole or in part the resulting TIA standards publication.

Intellectual Property Statement

The individual preparing this contribution knows of patents, the use of which may be essential to a standard resulting in whole or in part from this contribution.

1. LZ77 Execution Speed

LZ77 algorithms are sliding-window based: a window of N bytes contains the last N bytes of input data. Each time a new byte is input, the oldest byte in the window is shifted out of the window.
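To make the cost of the window search concrete, the following is a minimal sketch of a sliding window with a brute-force (linear) search for the longest match. It is an illustration only, not the ADI, LZS, or MPPC encoder: the function names, the `max_len` and `min_len` parameters, and the token format (a literal byte or an `(offset, length)` pair) are assumptions made for this example.

```python
def longest_match(data, pos, window_size, max_len=255):
    """Linear search of the last `window_size` bytes for the longest
    string matching data[pos:]. Returns (offset, length, comparisons)."""
    start = max(0, pos - window_size)
    best_off, best_len, comparisons = 0, 0, 0
    for cand in range(start, pos):          # every window position is tried
        length = 0
        while pos + length < len(data) and length < max_len:
            comparisons += 1                # one byte comparison
            if data[cand + length] != data[pos + length]:
                break
            length += 1
        if length > best_len:
            best_off, best_len = pos - cand, length
    return best_off, best_len, comparisons


def encode(data, window_size=2048, min_len=2):
    """Emit literals or (offset, length) pairs; also count the search work."""
    tokens, pos, work = [], 0, 0
    while pos < len(data):
        off, length, comps = longest_match(data, pos, window_size)
        work += comps
        if length >= min_len:
            tokens.append((off, length))
            pos += length
        else:
            tokens.append(data[pos])
            pos += 1
    return tokens, work


def decode(tokens):
    """The decoder only copies bytes, which is why it is trivial and fast."""
    out = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            off, length = t
            for _ in range(length):         # byte-wise copy allows overlap
                out.append(out[-off])
        else:
            out.append(t)
    return bytes(out)


sample = b"abcabcabcabc hello hello hello"
tokens, work = encode(sample)
print(len(tokens), "tokens,", work, "byte comparisons")
```

In this sketch every input position examines every candidate start in the window, so the `work` counter gives a rough feel for how much the encoder does compared with the decoder, which performs no searching at all.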
The window is typically about 2048 bytes, although some algorithms, such as Microsoft Point-to-Point Compression (MPPC), use a window as large as 8192 bytes. As additional input characters are received, they are compared to the strings in the window. When the longest matching string is found, the encoder indicates to the decoder where the string starts in the window and how long it is.

The problem is finding the matching string within the window. The process is very time consuming, and the time potentially increases exponentially as the size of the window increases. Section 3 of the ADI contribution leaves the implementation of the encoder and string matching to the developer, and with good reason: there is no obvious way to do it fast enough for the ADI algorithm to be viable on MIPs-limited processors. Despite the assertion in the contribution that a linear search is an option, it is out of the question, as shown in Figure 1 below, which compares the relative encoder times of LZS, LZJH, and V.42bis. LZS is a Stac LZ77 algorithm very similar to the ADI algorithm; it was implemented by HNS using a search of the window with obvious shortcuts. The figure was taken verbatim from the LZH contribution to the Dana Point meeting in February 1999. The timing shown used a 2048-byte window; times for larger windows were exponentially higher.

The trick, then, is to use tables, hash schemes, etc. to get the execution requirements of the encoder below 100% of available CPU. The algorithm and its variants have been around since 1977, some 22 years, and the big obstacle is still the slow encoder. I suspect that one or more LZ77 variants were submitted to the committee 10+ years ago, when V.42bis was created, and were deemed too slow to be usable.

The Advanced Development Group (ADG) at HNS has been trying to replace LZJH as the algorithm of choice for Hughes for at least 18 months. They have been trying to speed up an LZ77 clone and, using a tree data structure proposed by C. Bell, have managed to get their LZ77 to where it is only 6 to 8 times slower than the LZJH encoder (the pertinent pages are attached and the complete text of the ADG paper is available). The catch is that tree structures tend to take more and more memory, while the cost of maintaining them offsets the CPU savings.

It is not likely that modem developers attempting to incorporate the ADI algorithm into existing hardware (i.e. on controllers) will be able to invent enough tricks to keep the ADI encoder from overwhelming the processor. Soft modem developers will potentially use considerably more CPU resources on the host (i.e. the PC) than are available to the modem, especially since most good ways to speed up the encoder developed over the past 22 years are protected by IPR.

There are many data compression algorithms that get considerably better compression than LZJH, LZ77, or ADI. But, like LZ77 algorithms, they are too slow to be practical for a real-time environment like a modem or satellite terminal with limited MIPs.

The ADI contribution correctly notes that LZ77 decoders are trivial and very fast. Note that the LZJH decoder has those same attributes, unlike the V.42bis decoder.

2. LZ77 Intellectual Property Rights

The ADI algorithm is basically Microsoft Point-to-Point Compression (MPPC) with different encoding for offset, length, and characters between 128 and 255. Stac Inc. sued Microsoft and won several years ago because MPPC is basically a Stac LZ77 variant (LZS) with different encoding for offset, length, and characters. The pertinent pages from RFC 2118 (MPPC) and ANSI X3.241 are attached and the complete text is available. If the ADI algorithm were to become the new ITU-T data compression standard, it would seem likely that Stac and Microsoft would claim and/or enforce IPR and royalties.

In addition to the Stac IPR there are tens of additional patents on various ways to speed up the LZ77 encoder.
Of the 100 or so patents on the IBM Patent Server that reference the LZ77 patent (# 4,464,650), the majority involve methods to speed up the very slow LZ77 encoder. Most notable are the P. Katz (PKZIP) patent on sorted hash tables, Greene and Fiala for tree data structures, and several other patents related to hashing by Jung, Chambers, and Stac.

ADI leaves it to the imagination of the modem developer to implement the LZ77 encoder. Logic would dictate that if they had a clever way of doing it, one that potentially makes the ADI encoder only 4 or 5 times slower than the LZJH and V.42bis encoders, they would have included the details in the document. Logic also dictates that, with all the IPR already held by those who have been somewhat successful in speeding up the LZ77 encoder, modem developers will have two choices: (1) implement an extremely slow encoder, or (2) pay royalties for a moderately slow encoder. Neither choice seems desirable when compared to the speed of the fast LZJH encoder.

Figure 1: Average CPU Time (bar chart comparing LZW, LZS, and LZH on Stream files A, B, and C and Frame files A, B, and C; vertical axis from 0 to 800)

3. ADI and LZJH Comparison Testing

The following are comparisons of the ADI algorithm and LZJH on some of the same files compared in the TR301-99-11-055 contribution by HNS. Note that the ADI comparisons were confusing in that the LZJH history buffer was restricted considerably for no apparent reason. This was most notable in the 512 and 1024 tests, where ADI apparently assumed the contradiction of a hardware platform that had very little memory (8K to 16K of RAM) available for compression but whose processor was fast enough to handle the very slow execution speed of the ADI encoder. This is a classic memory versus MIPs issue. In the ADI comparisons they chose to limit the memory of the LZJH history buffer but allowed for enough MIPs to execute the ADI algorithm.
Most modern processors have more memory than MIPs. For the older hardware platforms (i.e. controllers) some have little of either. With LZJH the developer has the option of decreasing the history buffer size, affecting compression only on the very compressible data, while still getting better compression than V.42bis. With the ADI or other LZ77 algorithm the developer simply can’t implement the algorithm due to lack of MIPs.

The comparison testing below was with an ADI 2048 window and an LZJH 2048 dictionary, with neither restricted in MIPs or memory.

3.1. eBay Web Browsing

Table 1: Web Browsing - EBAY

File                File Size  ADI compressed  LZJH compressed  Difference  Percent     MPPC
eBay HTMLs          1,561,446         252,502          228,320      24,182       11  257,042
eBay HTMLs Headers  1,626,700         312,918          283,072      29,846       11  314,614

Table 2: EBAY Computers - Hardware

File                                File Size  ADI compressed  LZJH compressed  Difference  Percent
Ebay Computers Hardware             1,403,970         159,929          137,890      22,039       16
Ebay Computers Hardware Frame mode  1,403,970         500,327          467,118      33,209        7

3.2. Mail File

Table 3: Mail File

File       File Size  ADI compressed  LZJH compressed  Difference  Percent
Mail File  1,444,029         638,297          521,137     131,716       25

3.3. C Source Files

Table 4: C Source Files

File                 File Size  ADI compressed  LZJH compressed  Difference  Percent
LZJH Source             45,930          14,778           13,109       1,669       13
LZJH Source Headers     47,850          16,577           15,238       1,339        9
Control Kernel          72,021          24,494           20,853       3,641       17
X.25 Source            542,091         156,432          120,814      35,618       29
X.25 Source Headers    564,711         179,468          146,340      33,128       23

3.4. Executable and Object Files

Table 5: Executable and Object Files

File     File Size  ADI compressed  LZJH compressed  Difference  Percent
PC .EXE    238,894          31,260           30,233       1,027        3
PC .OBJ     69,707           8,625            8,212         413        5

3.5. Word Documents

Table 6: Word Documents

File        File Size  ADI compressed  LZJH compressed  Difference  Percent
TR30.1         22,528           4,218            4,144          74        2
Review Doc     98,816          31,301           28,692       2,609        9

3.6. RTF and PDF File

Table 7: RTF and PDF Files

File  File Size  ADI compressed  LZJH compressed  Difference  Percent
RTF   9,526,404         269,719          208,629      61,090       29

3.7. Various PC Files

Table 8: Various Files in PC directories

File          File Size  ADI compressed  LZJH compressed  Difference  Percent
Deisl1.isu       20,575           6,170            6,145          25        0
Readme.wri       56,704          25,828           24,218       1,610        7
Rdlang32.dll    288,256          65,352           59,761       5,591        9

3.8. Amazon.com Web Browsing

Table 9: Web Browsing - Amazon

File            File Size  ADI compressed  LZJH compressed  Difference  Percent
Amazon            383,634          65,709           59,989       5,720       10
Amazon Headers    400,020          80,902           73,483       7,419       10

3.9. Various HTML Files

Table 10: Various HTML Files

File                         File Size  ADI compressed  LZJH compressed  Difference  Percent
Dick Brant Home Page            12,173           3,475            3,064         411       13
Cdnow.com Home page             34,538           9,392            8,404         988       12
Cdnow artist search             97,030           9,764            9,152         612        7
Barnes & Noble home page        41,986          11,361           10,404         957        9
B & N author search #1          51,100           9,336            8,414         922       11
B & N author search #2          51,953          10,520            9,006         253        3
Mass music Home page            20,901           5,693            5,440         253        5
Mass music Artist search #1     30,027           4,385            4,186         199        5
Mass music Artist search #2     25,408           3,830            3,382         448       13
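
The contribution does not state how the Difference and Percent columns are computed. The figures appear consistent with Difference being the ADI compressed size minus the LZJH compressed size, and Percent being that difference relative to the LZJH compressed size, rounded to the nearest whole number. The short sketch below works through that assumed relationship on three rows taken from Tables 1, 4, and 5; the function name is hypothetical and the formula is an inference from the data, not something the contribution defines.

```python
def difference_and_percent(adi_size, lzjh_size):
    """Assumed definitions: difference = ADI - LZJH (bytes),
    percent = difference relative to the LZJH size, rounded."""
    difference = adi_size - lzjh_size
    percent = round(100 * difference / lzjh_size)
    return difference, percent


# (ADI compressed, LZJH compressed) as given in the tables
rows = {
    "eBay HTMLs": (252_502, 228_320),
    "LZJH Source": (14_778, 13_109),
    "PC .EXE": (31_260, 30_233),
}

for name, (adi, lzjh) in rows.items():
    diff, pct = difference_and_percent(adi, lzjh)
    print(f"{name}: ADI output is {diff:,} bytes ({pct}%) larger than LZJH")
```

For these three rows the sketch yields (24,182, 11), (1,669, 13), and (1,027, 3), which are the Difference and Percent values listed for those files.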