History of Data Compression - Telecommunications Industry

Telecommunications Industry Association (TIA)
TR-30.1/99-12-0
Clearwater, Florida, November 29, 1999
COMMITTEE CONTRIBUTION
Technical Committee TR-30 Meetings
SOURCE: Hughes Network Systems
CONTACT: Jeff Heath
         Hughes Network Systems
         10450 Pacific Center Court
         San Diego, CA 92121
         Phone: (619) 452-4826
         Fax: (619) 597-8979
         E-mail: jheath@hns.com
TITLE: Comments on the ADI Data Compression Paper (TR-30.1-99-11-060)
PROJECT: PN-xxxx
DISTRIBUTION: Members of TR-30 and TR-30.1 and meeting attendees
ABSTRACT
This paper questions several important aspects of the ADI Data Compression contribution to the
TR30.1 Clearwater meeting, including:
• The execution speed of LZ77 encoders, which are notoriously slow.
• Intellectual Property Rights. The algorithm described potentially infringes on several patents.
  Most notably, it is a slight variation of the algorithm over which Stac successfully sued
  Microsoft.
• The comparison testing presented in the ADI contribution.
Copyright Statement
The contributor grants a free, irrevocable license to the Telecommunications Industry Association
(TIA) to incorporate text contained in this contribution and any modifications thereof in the creation
of a TIA standards publication; to copyright in TIA's name any TIA standards publication even
though it may include portions of this contribution; and at TIA's sole discretion to permit others to
reproduce in whole or in part the resulting TIA standards publication.
Intellectual Property Statement
The individual preparing this contribution knows of patents, the use of which may be essential to a
standard resulting in whole or in part from this contribution.
1. LZ77 Execution Speed
LZ77 algorithms are sliding-window based: a window of N bytes holds the last N bytes of input
data. Each time a new byte is input, the oldest byte in the window is shifted out. The window is
typically about 2048 bytes, although some algorithms, such as Microsoft Point-to-Point
Compression (MPPC), use a window as large as 8192 bytes. As additional input characters are
received, they are compared to the strings in the window. When the longest matching string is
found, the encoder indicates to the decoder where the string starts in the window and how long
it is.
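The encoder step described above can be illustrated with a minimal sketch. It is not the ADI, LZS,
or MPPC encoder: the token format, the function name lz77_encode_linear, and the min_match and
max_match limits are assumptions made for illustration only. It uses the plain linear search of
the window, which is the slow case this section is concerned with.

```python
def lz77_encode_linear(data: bytes, window_size: int = 2048,
                       min_match: int = 3, max_match: int = 255):
    """Greedy LZ77 encoder using a brute-force search of the window.

    Returns a list of tokens: ("lit", byte) or ("match", offset, length).
    The token format is hypothetical, not taken from any standard.
    """
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_off = 0, 0
        window_start = max(0, pos - window_size)
        # Linear search: try every starting position in the window.
        for cand in range(window_start, pos):
            length = 0
            while (length < max_match
                   and pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - cand
        if best_len >= min_match:
            # Tell the decoder where the string starts and how long it is.
            tokens.append(("match", best_off, best_len))
            pos += best_len
        else:
            tokens.append(("lit", data[pos]))
            pos += 1
    return tokens


print(lz77_encode_linear(b"abcabcabcabc"))
# [('lit', 97), ('lit', 98), ('lit', 99), ('match', 3, 9)]
```

Every input position triggers a scan of up to window_size candidates, each of which may compare
many bytes, so the work per input byte grows with the window.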
The problem is finding the matching string within the window. The process is very time
consuming, and the time potentially increases exponentially as the size of the window increases.
Section 3 of the ADI contribution leaves the implementation of the encoder and string matching to
the developer, and with good reason: there is no obvious way to do it fast enough for the ADI
algorithm to be viable on MIPS-limited processors. Despite the assertion in the contribution
that a linear search is an option, it is out of the question, as shown in Figure 1 below, which
compares the relative encoder times of LZS, LZJH, and V.42bis. LZS is a Stac LZ77 algorithm very
similar to the ADI algorithm and was implemented by HNS using a search of the window while
employing obvious shortcuts. The figure was taken verbatim from the LZH contribution to the Dana
Point meeting in Feb. 1999. The timing shown used a 2048-byte window; times for larger windows
were exponentially higher.
The trick, then, is to use tables, hash schemes, etc. to bring the execution requirements of the
encoder below 100% of available CPU. The algorithm and its variants have been around since
1977, some 22 years, and the big obstacle is still the slow encoder. I suspect that one or more
LZ77 variants were submitted to the committee 10+ years ago, when V.42bis was created, and
deemed too slow to be usable.
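As an illustration of the kind of table-based shortcut referred to above, the following is a
minimal sketch of a match finder that indexes window positions by their first three bytes. It is
a generic textbook-style scheme, not the method of any particular product or patent; the function
name lz77_encode_hashed, the token format, and the max_chain limit are assumptions for
illustration.

```python
def lz77_encode_hashed(data: bytes, window_size: int = 2048,
                       min_match: int = 3, max_match: int = 255,
                       max_chain: int = 16):
    """Greedy LZ77 encoder that looks up candidates by their first bytes.

    Returns tokens ("lit", byte) or ("match", offset, length); the format
    is hypothetical. Positions are never pruned from the table here, so
    memory grows with the input; a real encoder would have to bound it.
    """
    n = len(data)
    table = {}  # first min_match bytes -> list of positions, oldest first
    tokens = []
    pos = 0
    while pos < n:
        best_len, best_off = 0, 0
        key = data[pos:pos + min_match]
        if len(key) == min_match:
            # Check only the newest few positions that start with the same bytes.
            for cand in reversed(table.get(key, [])[-max_chain:]):
                if pos - cand > window_size:
                    break  # older candidates are even further away
                length = min_match  # the key already matched
                while (length < max_match
                       and pos + length < n
                       and data[cand + length] == data[pos + length]):
                    length += 1
                if length > best_len:
                    best_len, best_off = length, pos - cand
        if best_len >= min_match:
            tokens.append(("match", best_off, best_len))
            step = best_len
        else:
            tokens.append(("lit", data[pos]))
            step = 1
        # Index every position just consumed so later input can match it.
        for p in range(pos, pos + step):
            if p + min_match <= n:
                table.setdefault(data[p:p + min_match], []).append(p)
        pos += step
    return tokens
```

The max_chain limit caps the search work per input position, at the cost of sometimes missing
the longest match, and the table itself consumes memory and must be maintained on every input
byte.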
The Advanced Development Group at HNS has been trying to replace LZJH as the algorithm of
choice for Hughes for at least 18 months. They have been trying to speed up an LZ77 clone and,
using a tree data structure proposed by C. Bell, have managed to get their LZ77 encoder to where
it is only 6 to 8 times slower than the LZJH encoder (the pertinent pages are attached and the
complete text of the ADG paper is available). The catch is that tree structures tend to take more
and more memory, while their maintenance offsets the CPU savings.
It is not likely that modem developers attempting to incorporate the ADI algorithm into existing
hardware (i.e. on controllers) will be able to invent enough tricks to keep the ADI encoder from
overwhelming the processor. Soft modem developers will potentially use considerably more CPU
resources on the host (i.e. the PC) than are available to the modem. This is especially so since
most good ways to speed up the encoder developed over the past 22 years are protected by IPR.
There are many data compression algorithms that get considerably better compression than
LZJH, LZ77, or ADI. But, like LZ77 algorithms, they are too slow to be practical for a real-time
environment like a modem or satellite terminal with limited MIPS.
The ADI contribution correctly notes that LZ77 decoders are trivial and very fast. Note that the
LZJH decoder has those same attributes, unlike the V.42bis decoder.
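To show why an LZ77 decoder is so simple, here is a minimal sketch. It assumes a hypothetical
token stream of ("lit", byte) and ("match", offset, length) tuples, not the bit-level encoding
used by ADI, LZS, or MPPC. The decoder does no searching at all; it only copies bytes.

```python
def lz77_decode(tokens):
    """Rebuild the original bytes from ("lit", byte) / ("match", offset, length) tokens."""
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, offset, length = token
            start = len(out) - offset
            # Copy byte by byte so a match may overlap the bytes it produces.
            for i in range(length):
                out.append(out[start + i])
    return bytes(out)


print(lz77_decode([("lit", 97), ("lit", 98), ("lit", 99), ("match", 3, 9)]))
# b'abcabcabcabc'
```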
2. LZ77 Intellectual Property Rights
The ADI algorithm is basically Microsoft Point-to-Point Compression (MPPC) with different
encoding for offset, length, and characters between 128 and 255. Stac Inc. sued Microsoft and
won several years ago because MPPC is basically a Stac LZ77 variant (LZS) with different
encoding for offset, length, and characters. The pertinent pages from RFC 2118 (MPPC) and
ANSI X3.241 are attached and the complete text is available.
If the ADI algorithm were to become the new ITU-T data compression standard, it would seem
likely that Stac and Microsoft would claim and/or enforce IPR and royalties.
In addition to the Stac IPR, there are dozens of additional patents on various ways to speed up
the LZ77 encoder. Of the 100 or so patents on the IBM Patent Server that reference the LZ77
patent (#4,464,650), the majority involve methods to speed up the very slow LZ77 encoder. Most
notable are the P. Katz (PKZIP) patent on sorted hash tables, the Greene and Fiala patent on tree
data structures, and several other hashing-related patents by Jung, Chambers, and Stac.
ADI leaves it to the imagination of the modem developer to implement the LZ77 encoder. Logic
would dictate that if they had a clever way of doing it, one that potentially makes the ADI
encoder only 4 or 5 times slower than the LZJH and V.42bis encoders, they would have included
the details in the document. Logic also dictates that, with all the IPR already held by those who
have been somewhat successful in speeding up the LZ77 encoder, modem developers will have two
choices: (1) implement an extremely slow encoder, or (2) pay royalties for a moderately slow
encoder. Neither choice seems desirable when compared to the speed of the fast LZJH encoder.
[Bar chart: average CPU time for the LZW, LZS, and LZH encoders on Stream files A, B, and C and
Frame files A, B, and C; vertical axis from 0 to 800.]
Figure 1: Average CPU Time
3. ADI and LZJH Comparison Testing
The following are comparisons of the ADI algorithm and LZJH on some of the same files
compared in the TR301-99-11-055 contribution by HNS.
Note that the ADI comparisons were confusing in that the LZJH history buffer was restricted
considerably for no apparent reason. This was most notable in the 512 and 1024 tests, where ADI
apparently assumed the contradiction of a hardware platform with very little memory (8K to 16K
of RAM) available for compression but a processor fast enough to handle the very slow execution
speed of the ADI encoder.
This is a classic memory versus MIPS issue. In the ADI comparisons they chose to limit the
memory of the LZJH history buffer but allowed enough MIPS to execute the ADI algorithm.
Most modern processors have more memory than MIPS. Of the older hardware platforms (i.e.
controllers), some have little of either. With LZJH the developer has the option of decreasing
the history buffer size, affecting compression only on very compressible data, while still getting
better compression than V.42bis. With the ADI or another LZ77 algorithm the developer simply
can’t implement the algorithm due to lack of MIPS.
The comparison testing below used an ADI 2048 window and an LZJH 2048 dictionary, with
neither restricted in MIPS or memory.
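The tables in this section do not define their Difference and Percent columns. The figures are
consistent with Difference being the ADI compressed size minus the LZJH compressed size, and
Percent being that difference relative to the LZJH size, rounded to a whole number. The sketch
below states that reading explicitly; the function name compare is hypothetical and the
interpretation is an assumption, not something stated in the contribution.

```python
def compare(adi_size: int, lzjh_size: int):
    """Return (difference, percent) under the assumed column definitions."""
    difference = adi_size - lzjh_size                # extra bytes produced by ADI
    percent = round(100 * difference / lzjh_size)    # relative to the LZJH output
    return difference, percent


# First eBay row: ADI 252,502 bytes, LZJH 228,320 bytes.
print(compare(252_502, 228_320))
# (24182, 11)
```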
3.1. eBay Web Browsing
Table 1: Web Browsing - EBAY
File                 File Size   ADI compressed   LZJH compressed   Difference   Percent   MPPC
eBay HTMLs           1,561,446   252,502          228,320           24,182       11        257,042
eBay HTMLs Headers   1,626,700   312,918          283,072           29,846       11        314,614

Table 2: EBAY Computers - Hardware
File                                   File Size   ADI compressed   LZJH compressed   Difference   Percent
eBay Computers Hardware                1,403,970   159,929          137,890           22,039       16
eBay Computers Hardware (Frame mode)   1,403,970   500,327          467,118           33,209       7

3.2. Mail File
Table 3: Mail File
File        File Size   ADI compressed   LZJH compressed   Difference   Percent
Mail File   1,444,029   638,297          521,137           131,716      25

3.3. C Source Files
Table 4: C Source Files
File                  File Size   ADI compressed   LZJH compressed   Difference   Percent
LZJH Source           45,930      14,778           13,109            1,669        13
LZJH Source Headers   47,850      16,577           15,238            1,339        9
Control Kernel        72,021      24,494           20,853            3,641        17
X.25 Source           542,091     156,432          120,814           35,618       29
X.25 Source Headers   564,711     179,468          146,340           33,128       23

3.4. Executable and Object Files
Table 5: Executable and Object Files
File      File Size   ADI compressed   LZJH compressed   Difference   Percent
PC .EXE   238,894     31,260           30,233            1,027        3
PC .OBJ   69,707      8,625            8,212             413          5

3.5. Word Documents
Table 6: Word Documents
File         File Size   ADI compressed   LZJH compressed   Difference   Percent
TR30.1       22,528      4,218            4,144             74           2
Review Doc   98,816      31,301           28,692            2,609        9

3.6. RTF and PDF File
Table 7: RTF and PDF Files
File   File Size   ADI compressed   LZJH compressed   Difference   Percent
RTF    9,526,404   269,719          208,629           61,090       29

3.7. Various PC Files
Table 8: Various Files in PC directories
Files          File Size   ADI compressed   LZJH compressed   Difference   Percent
Deisl1.isu     20,575      6,170            6,145             25           0
Readme.wri     56,704      25,828           24,218            1,610        7
Rdlang32.dll   288,256     65,352           59,761            5,591        9

3.8. Amazon.com Web Browsing
Table 9: Web Browsing - Amazon
File             File Size   ADI compressed   LZJH compressed   Difference   Percent
Amazon           383,634     65,709           59,989            5,720        10
Amazon Headers   400,020     80,902           73,483            7,419        10

3.9. Various HTML Files
Table 10: Various HTML Files
File                          File Size   ADI compressed   LZJH compressed   Difference   Percent
Dick Brant Home Page          12,173      3,475            3,064             411          13
Cdnow.com Home page           34,538      9,392            8,404             988          12
Cdnow artist search           97,030      9,764            9,152             612          7
Barnes & Noble home page      41,986      11,361           10,404            957          9
B & N author search #1        51,100      9,336            8,414             922          11
B & N author search #2        51,953      10,520           9,006             253          3
Mass music Home page          20,901      5,693            5,440             253          5
Mass music Artist search #1   30,027      4,385            4,186             199          5
Mass music Artist search #2   25,408      3,830            3,382             448          13