Algorithms in the Real World
Lempel-Ziv
Burrows-Wheeler
ACB
Page 1
Compression Outline
Introduction: Lossy vs. Lossless, Benchmarks, …
Information Theory: Entropy, etc.
Probability Coding: Huffman + Arithmetic Coding
Applications of Probability Coding: PPM + others
Lempel-Ziv Algorithms:
– LZ77, gzip,
– LZ78, compress (Not covered in class)
Other Lossless Algorithms: Burrows-Wheeler
Lossy algorithms for images: JPEG, MPEG, ...
Compressing graphs and meshes: BBK
Page 2
Lempel-Ziv Algorithms
LZ77 (Sliding Window)
Variants: LZSS (Lempel-Ziv-Storer-Szymanski)
Applications: gzip, Squeeze, LHA, PKZIP, ZOO
LZ78 (Dictionary Based)
Variants: LZW (Lempel-Ziv-Welch), LZC
Applications: compress, GIF, CCITT (modems),
ARC, PAK
Traditionally LZ77 compressed better but ran slower; the
gzip version of LZ77 is almost as fast as any LZ78.
Page 3
LZ77: Sliding Window Lempel-Ziv
a a c a a c a b c a b a b a c
            ↑ cursor
Dictionary (previously coded): the window to the left of the cursor.
Look-ahead buffer: the window to the right of the cursor.
Dictionary and buffer “windows” are fixed length
and slide with the cursor.
Repeat:
Output (o, l, c) where
o = position (offset) of the longest match that
starts in the dictionary (relative to the cursor)
l = length of longest match
c = next char in buffer beyond longest match
Advance window by l + 1
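As a concrete illustration, a minimal Python sketch of this loop, using the toy window sizes of the example that follows; the function name is ours, and a “no match” step is emitted with offset 0 rather than the slides’ “_”.

def lz77_encode(s, dict_size=6, buf_size=4):
    out, cursor = [], 0
    while cursor < len(s):
        best_o = best_l = 0
        # consider every match start inside the dictionary window
        for pos in range(max(0, cursor - dict_size), cursor):
            l = 0
            # a match may run past the cursor into the look-ahead buffer
            while (l < buf_size and cursor + l < len(s) - 1
                   and s[pos + l] == s[cursor + l]):
                l += 1
            if l > best_l:
                best_o, best_l = cursor - pos, l
        out.append((best_o, best_l, s[cursor + best_l]))  # (o, l, next char)
        cursor += best_l + 1                              # advance by l + 1
    return out

# lz77_encode("aacaacabcabaaac") ==
#   [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]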
Page 4
LZ77: Example
Input: a a c a a c a b c a b a a a c
Dictionary (size = 6), look-ahead buffer (size = 4).

(_,0,a)   no match; output the first character
(1,1,c)   longest match “a” at offset 1; next char “c”
(3,4,b)   longest match “aaca” at offset 3, extending into the buffer; next char “b”
(3,3,a)   longest match “cab” at offset 3; next char “a”
(1,2,c)   longest match “aa” at offset 1, overlapping the cursor; next char “c”
Page 5
LZ77 Decoding
Decoder keeps the same dictionary window as the encoder.
For each codeword (o, l, c) it copies the l-character match
out of the dictionary, appends it to the output, then appends c.
What if l > o? (Only part of the match is in the
dictionary.)
E.g. dict = abcd, codeword = (2,9,e)
• Simply copy from left to right:
  for (i = 0; i < length; i++)
    out[cursor + i] = out[cursor - offset + i]
• Out = abcdcdcdcdcdce
• The first character is in the dictionary, and each time
a character is read another has already been written, so
we never run out of characters to read.
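A minimal Python sketch of the decoder; the left-to-right copy is exactly the loop above, so the l > o case needs no special handling (names are ours):

def lz77_decode(codewords):
    out = []
    for offset, length, char in codewords:
        cursor = len(out)
        # copy left to right: when length > offset, characters written
        # earlier in this same copy are read again (the abcd example above)
        for i in range(length):
            out.append(out[cursor - offset + i])
        out.append(char)
    return "".join(out)

# lz77_decode([(0,0,'a'), (0,0,'b'), (0,0,'c'), (0,0,'d'), (2,9,'e')])
#   == "abcdcdcdcdcdce"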
Page 6
LZ77 Optimizations used by gzip
LZSS: Output one of the following two formats
(0, position, length) or (1,char)
Uses the second format if length < 3.
Input: a a c a a c a b c a b a a a c

(1,a)     literal “a” (no match)
(1,a)     literal “a” (match length 1 < 3)
(1,c)     literal “c” (no match)
(0,3,4)   match “aaca” at offset 3, length 4
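A Python sketch of the same search loop emitting LZSS-style tokens; the threshold of 3 follows the rule above, and the names are ours:

MIN_MATCH = 3   # matches shorter than 3 are coded as literals

def lzss_tokens(s, dict_size=6, buf_size=4):
    out, cursor = [], 0
    while cursor < len(s):
        best_o = best_l = 0
        for pos in range(max(0, cursor - dict_size), cursor):
            l = 0
            while (l < buf_size and cursor + l < len(s)
                   and s[pos + l] == s[cursor + l]):
                l += 1
            if l > best_l:
                best_o, best_l = cursor - pos, l
        if best_l >= MIN_MATCH:
            out.append((0, best_o, best_l))   # format (0, position, length)
            cursor += best_l
        else:
            out.append((1, s[cursor]))        # format (1, char)
            cursor += 1
    return out

# lzss_tokens("aacaacabcabaaac")[:4] == [(1,'a'), (1,'a'), (1,'c'), (0,3,4)]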
Page 7
Optimizations used by gzip (cont.)
1. Huffman code the positions, lengths, and chars
2. Non-greedy parsing: possibly take a shorter match so
that the next match is better
3. Use a hash table to store the dictionary.
– Hash keys are all strings of length 3 in the
dictionary window.
– Find the longest match within the correct
hash bucket.
– Puts a limit on the length of the search within
a bucket.
– Within each bucket store in order of position
to make deleting easier when window moves.
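A rough Python sketch of such a hash table, in the spirit of (but much simpler than) gzip’s: the bucket structure, the chain cap, and all names are illustrative assumptions.

from collections import deque

def build_hash_table(s, lo, hi):
    """Index every length-3 string starting at positions [lo, hi)."""
    table = {}
    for pos in range(lo, min(hi, len(s) - 2)):
        # deques keep each bucket in position order, so the oldest entry
        # can be dropped from the front when the window slides
        table.setdefault(s[pos:pos + 3], deque()).append(pos)
    return table

def longest_match(s, table, cursor, max_chain=16):
    """Scan one bucket only; max_chain limits the search within it."""
    best_pos, best_len = -1, 0
    for pos in list(table.get(s[cursor:cursor + 3], ()))[:max_chain]:
        l = 3   # the bucket key guarantees the first 3 characters match
        while cursor + l < len(s) and s[pos + l] == s[cursor + l]:
            l += 1
        if l > best_len:
            best_pos, best_len = pos, l
    return best_pos, best_len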
Page 8
The Hash Table
Dictionary window (positions 7–21):

  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
… a  a  c  a  a  c  a  b  c  a  b  a  a  a  c …

Hash buckets, keyed on length-3 strings, positions stored in order:

aac → 19, 10, 7
aca → 11, 8
caa → 9
cab → 15, 12
Page 9
Theory behind LZ77
A. D. Wyner and J. Ziv, “The Sliding-Window Lempel-Ziv Algorithm Is
Asymptotically Optimal,” Proceedings of the IEEE, Vol. 82, No. 6,
June 1994.
Will compress long enough strings to the source entropy as the
window size goes to infinity.
Special case of proof: Assume infinite dictionary window,
characters generated one-at-a-time independently, look for
matches of length one only.
If the i’th character in the alphabet occurs with probability p_i, the
expected distance to the nearest match is given by (like tossing a coin
until heads appears)

  E[o_i] = p_i + (1 − p_i)(1 + E[o_i])

Solving, we find that E[o_i] = 1/p_i, which is intuitive. If we can
encode o_i using ≈ log o_i bits, then the expected codeword length is

  Σ_i p_i log(1/p_i),

which is the entropy of the single-character distribution.
Page 10
Logarithmic Length Encoding
How can we encode oi using ≈ log oi bits?
Gamma code uses two parts: first ⌊log o_i⌋ is encoded in unary using
⌊log o_i⌋ bits. Then o_i is encoded in binary notation using ⌊log o_i⌋
bits. Total length of codeword: 2⌊log o_i⌋. Off by a factor of 2!
Improvement: code ⌊log o_i⌋ in binary using ⌊log⌊log o_i⌋⌋ bits
instead, and first code ⌊log⌊log o_i⌋⌋ in unary using ⌊log⌊log o_i⌋⌋
bits. Total length of codeword: 2⌊log⌊log o_i⌋⌋ + ⌊log o_i⌋.
Etc.: 2⌊log⌊log⌊log o_i⌋⌋⌋ + ⌊log⌊log o_i⌋⌋ + ⌊log o_i⌋ + …
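For concreteness, a small Python sketch of the two-part gamma code (counting the terminating bit of the unary part, the standard total is 2⌊log o⌋ + 1 bits):

def gamma_encode(o):
    assert o >= 1
    b = bin(o)[2:]                   # o in binary: floor(log2 o) + 1 bits
    return "0" * (len(b) - 1) + b    # floor(log2 o) in unary, then binary

def gamma_decode(bits):
    n = bits.index("1")              # count the leading zeros
    return int(bits[n:2 * n + 1], 2)

# gamma_encode(9) == "0001001";  gamma_decode("0001001") == 9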
Page 11
Theory behind LZ77
General case: for any n, there is a window size w (typically
exponential in n) such that on average a match of size n is
found, and the number of bits needed to encode the position
and length of the match approaches the source entropy for
substrings of length n:
  H_n = Σ_{X ∈ A^n} p(X) log(1/p(X))
This is optimal in the situation in which each character
might depend on the previous n − 1, even though n is unknown.
Uses logarithmic code for the position and match
length. (Note that typically match length is short
compared to position.)
Problem: a “long enough” window is really, really long.
Page 12
Comparison to Lempel-Ziv 78
Both LZ77 and LZ78 and their variants keep a
“dictionary” of recent strings that have been seen.
The differences are:
– How the dictionary is stored (LZ78 is a trie)
– How it is extended (LZ78 only extends an existing
entry by one character)
– How it is indexed (LZ78 indexes the nodes of the
trie)
– How elements are removed
Page 13
Lempel-Ziv Algorithms Summary
Adapts well to changes in the file (e.g. a tar file with
many file types within it).
Initial algorithms did not use probability coding and
performed poorly in terms of compression. More
modern versions (e.g. gzip) do use probability
coding as “second pass” and compress much better.
The algorithms are becoming outdated, but ideas are
used in many of the newer algorithms.
Page 14
Compression Outline
Introduction: Lossy vs. Lossless, Benchmarks, …
Information Theory: Entropy, etc.
Probability Coding: Huffman + Arithmetic Coding
Applications of Probability Coding: PPM + others
Lempel-Ziv Algorithms: LZ77, gzip, compress, …
Other Lossless Algorithms:
– Burrows-Wheeler
– ACB
Lossy algorithms for images: JPEG, MPEG, ...
Compressing graphs and meshes: BBK
Page 15
Burrows-Wheeler
Currently near the best “balanced” algorithm for text.
Breaks file into fixed-size blocks and encodes each
block separately.
For each block:
– First sort each character by its full context.
This is called the block sorting transform.
– Then use the move-to-front transform to
encode the sorted characters.
The ingenious observation is that the decoder only
needs the sorted characters and a pointer to the
first character of the original sequence.
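A small Python sketch of both transforms, materializing full rotations for clarity; real implementations sort far more cleverly, and the names are ours:

def bwt_encode(block):
    n = len(block)
    # context of position i = the other n-1 characters, wrapping around;
    # reversing the key makes the last context character most significant
    def key(i):
        return (block[i + 1:] + block[:i])[::-1]
    order = sorted(range(n), key=key)
    out = "".join(block[i] for i in order)
    return out, order.index(0)   # sorted chars + pointer to the first char

def move_to_front(s, alphabet):
    table = list(alphabet)
    codes = []
    for c in s:
        i = table.index(c)
        codes.append(i)
        table.insert(0, table.pop(i))   # move the used symbol to the front
    return codes

# bwt_encode("decode") == ("oeecdd", 4)   (matches the example that follows)
# move_to_front("oeecdd", "cdeo") == [3, 3, 0, 2, 3, 0]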
Page 16
Burrows-Wheeler: Example
Let’s encode: decode
Context “wraps” around. Last char is most significant.

All rotations of input:

  Context   Char
  ecode     d
  coded     e
  odede     c
  dedec     o
  edeco     d
  decod     e

Sort, using context as key:

  Context   Output
  dedec     o
  coded     e
  decod     e
  odede     c
  ecode     d   ← start
  edeco     d

In the output, characters with similar contexts are
near each other.
Page 17
Burrows Wheeler Decoding
Key Idea: Can construct entire sorted table from sorted
column alone! First: sorting the output gives last column of
context:
  Context   Output
  c         o
  d         e
  d         e
  e         c
  e         d
  o         d
Page 18
Burrows Wheeler Decoding
Now sort pairs in last column of context and output column
to form last two columns of context:
Before (from the previous step):

  Context   Output
  c         o
  d         e
  d         e
  e         c
  e         d
  o         d

After sorting the pairs:

  Context   Output
  ec        o
  ed        e
  od        e
  de        c
  de        d
  co        d
Page 19
Burrows Wheeler Decoding
Repeat until entire table is complete. Pointer to first
character provides unique decoding.
  Context   Output
  dedec     o
  coded     e
  decod     e
  odede     c
  ecode     d   ← start
  edeco     d
Message was d in first position, preceded in wrapped
fashion by ecode: decode.
Page 20
Burrows Wheeler Decoding
Optimization: Don’t really have to rebuild the whole context
table.
  Context   Output
  dedec     o
  coded1    e1
  decod2    e2
  odede1    c
  ecode2    d1  ← start
  edeco     d2

(Subscripts number repeated characters; the subscript on a
context marks its last character.)
What character comes after
the first character, d1?
Just have to find d1 in last
column of context and see what
follows it: e1.
Observation: instances of same
character of output appear in
same order in last column of
context. (Proof is an exercise.)
Page 21
Burrows-Wheeler: Decoding
The “rank” is the position of a character if it were
sorted using a stable sort.

  Context   Output   Rank
  c         o        6
  d         e        4
  d         e        5
  e         c        1
  e         d  ←     2
  o         d        3
Page 22
Burrows-Wheeler Decode
Function BW_Decode(In, Start, n)
  S = MoveToFrontDecode(In, n)
  R = Rank(S)
  j = Start
  for i = 1 to n do
    Out[i] = S[j]
    j = R[j]
Rank gives position of each char in sorted order.
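The same routine as runnable Python, with Rank computed via a stable sort; here S is passed in directly rather than coming from MoveToFrontDecode, and indices are 0-based:

def bw_decode(S, start):
    # rank[j]: where position j lands when S is sorted stably
    order = sorted(range(len(S)), key=lambda j: S[j])   # Python's sort is stable
    rank = [0] * len(S)
    for sorted_pos, j in enumerate(order):
        rank[j] = sorted_pos
    out, j = [], start
    for _ in range(len(S)):
        out.append(S[j])
        j = rank[j]
    return "".join(out)

# bw_decode("oeecdd", 4) == "decode"   (0-indexed Start; the slides use 1-indexing)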
Page 23
Decode Example
  S    Rank(S)
  o4   6
  e2   4
  e6   5
  c3   1
  d1   2   ← Start
  d5   3

  Out = d1 e2 c3 o4 d5 e6 → “decode”

(Subscripts number repeated characters.)
Page 24
Overview of Text Compression
PPM and Burrows-Wheeler both encode a single
character based on the immediately preceding
context.
LZ77 and LZ78 encode multiple characters based on
matches found in a block of preceding text
Can you mix these ideas, i.e., code multiple
characters based on immediately preceding
context?
– BZ does this, but its authors don’t give details on
how it works
– ACB also does this – close to BZ
Page 25
ACB (Associative Coder of Buyanovsky)
Keep dictionary sorted by context (the last character
is the most significant)
• Find longest match of current context in context
part of dictionary
• Find longest match of look-ahead buffer in contents
part of dictionary
• Code
• Distance between matches in the sorted order
• Length of contents match
• Next character that doesn’t match
• Shift look-ahead window, update context, contents
Has aspects of Burrows-Wheeler and LZ77
Page 26
ACB Example
Have seen “decode” so far. The dictionary holds every split of
the text seen so far into (context | contents), sorted by context
(last character most significant), so that a match for the text in
the look-ahead buffer can be found:

  Context   Contents
            decode
  dec       ode
  d         ecode
  decod     e
  de        code
  decode
  deco      de

Suppose the current position is decode|odcoe.
Best matches:
  context:  “decode”, at the row decode|
  contents: “od” (2 chars) for look-ahead “odcoe”, at the row dec|ode
Output: -4, 2, c (distance −4 in sorted order, length 2, next char c)
Decoder can find best match in context, go from there.
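A toy Python sketch of one coding step, consistent with this example; all names are ours, and the real coder is considerably more elaborate:

def acb_step(history, lookahead):
    # dictionary: every split of the history into (context | contents),
    # sorted by context with the last character most significant
    entries = sorted(range(len(history) + 1),
                     key=lambda i: history[:i][::-1])
    ctx_rank = entries.index(len(history))   # where the current context sorts

    def match_len(i):
        # length of the prefix shared by this entry's contents and the buffer
        contents, n = history[i:], 0
        while n < min(len(contents), len(lookahead)) and contents[n] == lookahead[n]:
            n += 1
        return n

    best = max(range(len(entries)), key=lambda r: match_len(entries[r]))
    length = match_len(entries[best])
    # code: (distance in sorted order, match length, first unmatched char)
    return best - ctx_rank, length, lookahead[length]

# acb_step("decode", "odcoe") == (-4, 2, 'c'), as in the example above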
Page 27