Lempel-Ziv Compression

The LZ family

LZ77





LZR
LZSS
LZB
LZH – used by zip and unzip
LZ78





LZW – Unix compress
LZC – Unix compress
LZT
LZMW
LZJ
LZFG
Overview of LZ family

To demonstrate:

Take a simple alphabet containing only two
letters, a and b,
and create a sample stream of text.
LZ family overview

Rule: Separate this stream of characters
into pieces of text so that each piece is
the shortest string of characters
that we have not seen so far.
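
The parsing rule above can be sketched in a few lines of Python. This is only an illustration: the function name lz_parse and the sample stream are assumptions, since the stream itself is not reproduced in these notes.

```python
def lz_parse(stream):
    """Split stream into pieces, each the shortest string not seen so far."""
    seen = set()
    pieces = []
    current = ""
    for ch in stream:
        current += ch
        if current not in seen:
            # First time we meet this string: it becomes a new piece.
            seen.add(current)
            pieces.append(current)
            current = ""
    if current:
        # Input ended inside a string we had already seen; keep it as a last piece.
        pieces.append(current)
    return pieces


sample = "aaababbbaaabaaaaaaabaabb"  # assumed sample stream over {a, b}
print("|".join(lz_parse(sample)))
# a|aa|b|ab|bb|aaa|ba|aaaa|aab|aabb
```

Every piece is an earlier piece plus one new character, which is what makes the index-plus-letter coding possible.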
Sender : The Compressor

Before compression, the pieces of
text from the breaking-down
process are indexed from 1 to n:
LZ

Indices are used to number the pieces of data.



The empty string (start of text) has index 0.
The piece indexed by 1 is a. Thus a, which is the
empty string followed by a, must be numbered 0a.
String 2, aa, will be numbered 1a, because it
contains a, whose index is 1, and the new
character a.
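
A minimal sketch of this numbering step, assuming the same kind of two-letter stream. The name lz78_encode and the sample stream are hypothetical; each output pair is (index of the previously seen piece, new character).

```python
def lz78_encode(stream):
    """Return a list of (index, character) pairs for stream."""
    index_of = {"": 0}  # the empty string (start of text) has index 0
    pairs = []
    current = ""
    for ch in stream:
        if current + ch in index_of:
            current += ch  # still a string we have seen; keep extending
        else:
            pairs.append((index_of[current], ch))
            index_of[current + ch] = len(index_of)  # next free index: 1, 2, 3, ...
            current = ""
    if current:
        # Input ended inside a known piece: code it as (its prefix, its last character).
        pairs.append((index_of[current[:-1]], current[-1]))
    return pairs


sample = "aaababbbaaabaaaaaaabaabb"  # assumed sample stream
print(" ".join(f"{i}{c}" for i, c in lz78_encode(sample)))
# 0a 1a 0b 1b 3b 2a 3a 6a 2b 9b
```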
LZ

The process of renaming pieces of
text starts to pay off.


Small integers replace what were once
long strings of characters.
We can now throw away our old stream of
text and send the encoded information
to the receiver.
Bit Representation of Coded
Information

Now we want to calculate the number of bits
needed.



Each chunk is an integer and a letter.
The number of bits for the integer depends on
the size of table permitted in the dictionary.
Every character will occupy 8 bits
because it will be represented in US
ASCII format.
Compression good?


in a long string of text, the number
of bits needed to transmit the coded
information is small compared to
the actual length of the text.
Example: 12 bits (a 4-bit index plus one
8-bit character) to transmit the code 2b,
instead of the 24 bits (8 + 8 + 8) needed
for the actual text aab.
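
The arithmetic can be checked with a short sketch. It assumes a fixed-width index and a dictionary limited to 16 entries, which is the table size implied by the 12-bit figure; the function names are illustrative only.

```python
CHAR_BITS = 8  # one US ASCII character, as stated above


def index_bits(table_size):
    """Bits needed for a fixed-width index into a table of table_size entries."""
    return max(1, (table_size - 1).bit_length())


def chunk_bits(table_size):
    """Bits for one chunk: an index plus one character."""
    return index_bits(table_size) + CHAR_BITS


print(chunk_bits(16))        # 12 bits for a code such as 2b (assumed 16-entry table)
print(len("aab") * CHAR_BITS)  # 24 bits for the plain text aab
```

A larger permitted table makes every chunk wider, so the saving only appears once pieces are long enough to outweigh the index.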
Receiver: The Decompressor
(Implementation)


The receiver knows exactly where the boundaries are, so
there is no problem in reconstructing the stream of text.
It is preferable to decompress the file in one pass;
otherwise, we will encounter a problem with
temporary storage.
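
One-pass decompression can be sketched as follows. The decoder rebuilds the same indexed list of pieces while it reads the pairs, so nothing beyond the pairs needs to be transmitted. The name lz78_decode and the pair list are assumptions made for illustration.

```python
def lz78_decode(pairs):
    """Rebuild the text from (index, character) pairs in a single pass."""
    pieces = [""]  # index 0 is the empty string
    out = []
    for index, ch in pairs:
        piece = pieces[index] + ch  # earlier piece plus the new character
        pieces.append(piece)        # this piece gets the next index
        out.append(piece)
    return "".join(out)


# Assumed coded stream: 0a 1a 0b 1b 3b 2a 3a 6a 2b 9b
pairs = [(0, "a"), (1, "a"), (0, "b"), (1, "b"), (3, "b"),
         (2, "a"), (3, "a"), (6, "a"), (2, "b"), (9, "b")]
print(lz78_decode(pairs))
# aaababbbaaabaaaaaaabaabb
```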
Lempel-Ziv applet

See

http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic23/#JavaApplet
Lempel-Ziv-Welch (LZW)






previous methods worked only on
characters
LZW works by encoding strings
some strings are replaced by a single
codeword
for now assume codeword is fixed (12
bits)
for 8-bit characters, first 256 (or fewer)
entries in table are reserved for the
characters
rest of table (256-4095) represents strings
LZW compression






trick is that string-to-codeword mapping
is created dynamically by the encoder
also recreated dynamically by the decoder
need not pass the code table between the
two
is a lossless compression algorithm
degree of compression hard to predict
depends on data, but gets better as
codeword table contains more strings
LZW encoder
Initialize table with single character strings
STRING = first input character
WHILE not end of input stream
    CHARACTER = next input character
    IF STRING + CHARACTER is in the string table
        STRING = STRING + CHARACTER
    ELSE
        Output the code for STRING
        Add STRING + CHARACTER to the string table
        STRING = CHARACTER
END WHILE
Output the code for STRING
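
The pseudocode above translates almost line for line into Python. This is a sketch rather than a reference implementation: it works on bytes, returns codes as plain integers instead of packed 12-bit codewords, and lets the table grow without the 4096-entry limit. The name lzw_encode is illustrative.

```python
def lzw_encode(data):
    """Encode a bytes object into a list of integer LZW codes."""
    # Initialize table with single character strings (codes 0-255).
    table = {bytes([i]): i for i in range(256)}
    if not data:
        return []
    codes = []
    string = data[:1]  # STRING = first input character
    for i in range(1, len(data)):
        character = data[i:i + 1]
        if string + character in table:
            string = string + character
        else:
            codes.append(table[string])
            table[string + character] = len(table)  # next unused code
            string = character
    codes.append(table[string])  # output code for the final STRING
    return codes


print(lzw_encode(b"ABABABA"))
# [65, 66, 256, 258]
```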
Demonstrations

Another animated LZ algorithm …

http://www.data-compression.com/lempelziv.html
LZW encoder example

compress the string BABAABAAA
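
To follow the example step by step, the sketch below records, for every code output, which string was added to the table and under which code. The function name lzw_trace is hypothetical, and the input is assumed to be non-empty text with character codes below 256.

```python
def lzw_trace(text):
    """Return rows of (output code, string added to table, its new code)."""
    table = {chr(i): i for i in range(256)}
    rows = []
    string = text[0]
    for ch in text[1:]:
        if string + ch in table:
            string += ch
        else:
            new_code = len(table)
            rows.append((table[string], string + ch, new_code))
            table[string + ch] = new_code
            string = ch
    rows.append((table[string], None, None))  # final output adds nothing
    return rows


for output, added, code in lzw_trace("BABAABAAA"):
    print(f"<{output}>", "" if added is None else f"{added} = {code}")
# <66> BA = 256
# <65> AB = 257
# <256> BAA = 258
# <257> ABA = 259
# <65> AA = 260
# <260>
```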
LZW decoder
Lempel-Ziv compression


a lossless compression algorithm
All encodings have the same length


But may represent more than one
character
Uses a “dictionary” approach –
keeps track of characters and
character strings already
encountered
LZW decoder example

decompress the string
<66><65><256><257><65><260>
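
A minimal decoder sketch for this example, assuming integer codes and an unbounded table. The name lzw_decode is illustrative. The one subtle case is a code that is not yet in the decoder's table: it can only be the entry the encoder created in its previous step, which is the previous string plus its own first character. The last code of this example, 260, is exactly that case.

```python
def lzw_decode(codes):
    """Decode a list of integer LZW codes back into bytes."""
    table = {i: bytes([i]) for i in range(256)}
    out = []
    previous = b""
    for code in codes:
        if code in table:
            entry = table[code]
        elif previous and code == len(table):
            # Code not in the table yet: previous string + its first character.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        if previous:
            # Recreate the entry the encoder added one step earlier.
            table[len(table)] = previous + entry[:1]
        out.append(entry)
        previous = entry
    return b"".join(out)


print(lzw_decode([66, 65, 256, 257, 65, 260]))
# b'BABAABAAA'
```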
LZW Issues



compression better as the code
table grows
what happens when all 4096
locations in string table are used?
A number of options, but encoder
and decoder must agree to do the
same thing



do not add any more entries to table
(as is)
clear codeword table and start again
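
The two options can be sketched with a hypothetical on_full parameter shared by encoder and decoder. The names lzw_encode, lzw_decode, max_size and on_full are assumptions for illustration. In this sketch a reset happens implicitly whenever a new entry is needed and the table is full, so both sides can detect it without extra signalling; many real formats send an explicit clear code instead.

```python
def fresh_encoder_table():
    return {bytes([i]): i for i in range(256)}


def lzw_encode(data, max_size=4096, on_full="freeze"):
    table = fresh_encoder_table()
    codes = []
    string = b""
    for byte in data:
        character = bytes([byte])
        if string + character in table:
            string += character
            continue
        codes.append(table[string])
        if len(table) < max_size:
            table[string + character] = len(table)
        elif on_full == "reset":
            table = fresh_encoder_table()  # clear table and start again
        # on_full == "freeze": add nothing, keep using the table as is
        string = character
    if string:
        codes.append(table[string])
    return codes


def lzw_decode(codes, max_size=4096, on_full="freeze"):
    table = {i: bytes([i]) for i in range(256)}
    out = []
    previous = b""
    for code in codes:
        full = len(table) >= max_size
        if previous and full and on_full == "reset":
            # The encoder cleared its table before producing this code.
            table = {i: bytes([i]) for i in range(256)}
            entry = table[code]
        else:
            if code in table:
                entry = table[code]
            elif previous and not full and code == len(table):
                entry = previous + previous[:1]
            else:
                raise ValueError(f"invalid LZW code: {code}")
            if previous and not full:
                table[len(table)] = previous + entry[:1]
        out.append(entry)
        previous = entry
    return b"".join(out)


data = b"BABAABAAA" * 20
for policy in ("freeze", "reset"):
    codes = lzw_encode(data, max_size=260, on_full=policy)
    assert lzw_decode(codes, max_size=260, on_full=policy) == data
```

Decoding with a different policy or table size than the encoder used will generally produce wrong output, which is why both sides must agree in advance.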
LZW advantages/disadvantages

advantages





simple, fast and good compression
can do compression in one pass
dynamic codeword table built for each
file
decompression recreates the codeword
table so it does not need to be passed
disadvantages


not the optimum compression ratio
actual compression hard to predict
Entropy methods




all previous methods are lossless
and entropy based
lossless methods are essential for
computer data (zip, gzip, etc.)
combination of run-length
encoding and Huffman coding is a standard tool
lossless methods are often used as a
subroutine by lossy methods (JPEG, MPEG)