Compression Algorithms

CSCI 2720
Spring 2005
Eileen Kraemer
When we last met …
We looked at string encoding
and noted that if we use the same number of bits
per character of our alphabet, then
the number of bits required to encode a
character of the alphabet is
• ceil(log2(sizeof(alphabet)))
(e.g., a 26-letter alphabet needs ceil(log2(26)) = 5 bits per character)
And we don’t need to transmit or store the
mapping from encodings to characters.
What if …
the string we encode doesn’t use all the
letters in the alphabet?
Then we need only ceil(log2(sizeof(set_of_characters_used))) bits per character.
But then we also need to store / transmit the
mapping from encodings to characters …
… and the set of characters used is typically
close to the size of the alphabet.
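For instance, a minimal Python check (the example strings are mine, not from the lecture):

import math

alphabet_size = 26                           # e.g. lowercase English letters
print(math.ceil(math.log2(alphabet_size)))   # 5 bits per character

used = set("hello world") - {" "}            # only the characters actually used
print(math.ceil(math.log2(len(used))))       # 7 distinct chars -> 3 bits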
And we also looked at: …
Huffman Encoding
Assumes encoding on a per-character basis
Observation: assigning shorter codes to frequently
used characters (which requires assigning longer
codes to rarely used characters) can result in overall
shorter encodings of strings
Problem:
• when decoding, need to know how many bits to read off
for each character.
Solution:
• Choose an encoding that ensures that no character
encoding is the prefix of any other character encoding.
An encoding tree has this property.
A Huffman Encoding Tree
[Figure: Huffman tree for frequencies A=3, T=2, R=3, N=4, E=9 (total 21).
The root (21) has leaf E (9) on its 1-branch and an internal node (12) on its
0-branch; node 12 splits into node 5 (leaves A=3, T=2) and node 7 (leaves
R=3, N=4). Concatenating the 0/1 branch labels from root to leaf gives the
codes: A=000, T=001, R=010, N=011, E=1.]
Weighted path length
A 000
T 001
R 010
N 011
E 1
Weighted path = Len(code(A)) * f(A) + Len(code(T)) * f(T) +
Len(code(R)) * f(R) + Len(code(N)) * f(N) + Len(code(E)) * f(E)
= (3 * 3) + (3 * 2) + (3 * 3) + (3 * 4) + (1 * 9)
= 9 + 6 + 9 + 12 + 9 = 45
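A quick sanity check of this arithmetic, as a minimal Python sketch (the two dictionaries simply restate the tree above):

# weighted path length = sum over characters of len(code) * frequency
freqs = {"A": 3, "T": 2, "R": 3, "N": 4, "E": 9}
codes = {"A": "000", "T": "001", "R": "010", "N": "011", "E": "1"}
print(sum(len(codes[ch]) * f for ch, f in freqs.items()))  # 45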
Claim (proof in text): no other prefix-free per-character
encoding can result in a shorter weighted path length.
Taking a step back …
Why do we need compression?
the rate of creation of image and video data
image data from a digital camera:
today 1K by 1.5K pixels is common = 1.5 MBytes
need 2K by 3K to equal a 35mm slide = 6 MBytes
video, even at the low resolution of
512 by 512 pixels, 3 bytes per pixel, 30 frames/second …
Compression basics
video data rate:
512 x 512 pixels x 3 bytes x 30 frames/sec ≈ 23.6 MBytes/second
2 hours of video = 169 gigabytes
MPEG-1 compresses:
23.6 MBytes/second down to 187 KBytes per second
169 gigabytes down to 1.3 gigabytes (roughly 130:1)
compression is essential for both storage
and transmission of data
Compression basics
compression is very widely used:
JPEG, GIF for single images
MPEG-1, -2, and -4 for video sequences
zip for computer data
MP3 for sound
based on two fundamental principles:
spatial coherence: similarity with spatial neighbours
temporal coherence: similarity with temporal neighbours
Basics of compression
character = basic data unit in the input stream
(may represent a byte, a bit, etc.)
strings = sequences of characters
encoding = compression
decoding = decompression
codeword = data element used to represent an
input character or character string
codetable = list of codewords
Codeword
the encoder/compressor takes
characters/strings as input and uses the codetable
to decide which codewords to produce
the decoder/decompressor takes
codewords as input and uses the same codetable
to decide which characters/strings to
produce
Codetable
clearly both encoder and decoder must use
the same codetable
the encoded data is passed as a series of
codewords
the codetable must also be passed, either
explicitly or implicitly
that is, we either:
pass it across
agree on it beforehand (hard-wired)
recreate it from the codewords (clever!)
Basic definitions
compression ratio =
size of original data / size of compressed data
basically, the higher the compression ratio, the better
lossless compression:
output data is exactly the same as input data
essential for encoding computer-processed data
lossy compression:
output data is not the same as input data
acceptable for data that is only viewed or heard
Lossless versus lossy
the human visual system is less sensitive to high
frequency losses and to losses in colour
so lossy compression is acceptable for visual
data
the degree of loss is usually a parameter of
the compression algorithm
tradeoff: loss versus compression
higher compression => more loss
lower compression => less loss
Symmetric versus asymmetric
symmetric:
encoding time == decoding time
essential for real-time applications (e.g., video or
audio on demand)
asymmetric:
encoding time >> decoding time
ok for write-once, read-many situations
Entropy encoding
compression that does not take into
account what is being compressed
normally also lossless encoding
most common types of entropy encoding:
run length encoding
Huffman encoding
modified Huffman (fax…)
Lempel-Ziv
Source encoding
takes into account the type of data (e.g., visual)
normally is lossy but can also be lossless
most common types in use:
JPEG, GIF = single images
MPEG = sequence of images (video)
MP3 = sound sequence
often uses entropy encoding as a subroutine
Run length encoding
one of the simplest and earliest types of
compression
takes account of repeating data (called runs)
runs are represented by a count along with the
original data
e.g., AAAABB => 4A2B
do you run length encode a single character?
no; use a special prefix character to mark the
start of a run
Run length encoding
runs are represented as
<prefix char><repeat count><run char>
the prefix char itself is encoded as
<prefix char>1<prefix char>
we want a prefix char that is not too common
an early example of its use is the MacPaint file
format
run length encoding is lossless and has
fixed length codewords
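A minimal Python sketch of this scheme; the prefix byte, the run cap of 255, and the MIN_RUN threshold for when a run is worth encoding are assumptions, not from the slides:

PREFIX = 0x90   # assumed escape byte; want one that is rare in the data
MIN_RUN = 4     # assumed threshold: shorter runs are cheaper left literal

def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        b, run = data[i], 1
        while i + run < len(data) and data[i + run] == b and run < 255:
            run += 1
        if run >= MIN_RUN or b == PREFIX:
            out += bytes([PREFIX, run, b])   # <prefix><repeat count><run char>
        else:
            out += bytes([b]) * run          # short runs stay literal
        i += run
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == PREFIX:
            count, ch = data[i + 1], data[i + 2]
            out += bytes([ch]) * count
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

sample = b"AAAABB"
print(rle_encode(sample))                    # b'\x90\x04ABB'
assert rle_decode(rle_encode(sample)) == sample

Note that a lone prefix byte falls out of the same rule as <prefix>1<prefix>, exactly as on the slide.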
MacPaint File Format
Run length encoding
works best for images with solid
background
good example of such an image is a
cartoon
does not work as well for natural images
does not work well for English text
however, is almost always a part of a
larger compression system
Huffman encoding
assume we know the frequency of each
character in the input stream
then encode each character as a variable
length bit string, with the length inversely
proportional to the character frequency
variable length codewords are used; early
example is Morse code
Huffman produced an algorithm for
assigning codewords optimally
Huffman encoding
input = probabilities of occurrence of each
input character (frequencies of
occurrence)
output is a binary tree
each leaf node is an input character
each branch is a zero or one bit
codeword for a leaf is the concatenation of bits
for the path from the root to the leaf
codeword is a variable length bit string
gives a very good compression ratio (optimal among per-character encodings)
Huffman encoding
Basic algorithm
Mark all characters as free tree nodes
While there is more than one free node
Take two nodes with lowest freq. of occurrence
Create a new tree node with these nodes as children
and with freq. equal to the sum of their freqs.
Remove the two children from the free node list.
Add the new parent to the free node list
Last remaining free node is the root of the
binary tree used for encoding/decoding
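A minimal Python sketch of this greedy construction (a heap stands in for the free-node list; all names are mine, not the slides'):

import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman tree from {char: frequency}; return {char: code}."""
    ids = count()  # tie-breaker so the heap never compares tree nodes
    # mark all characters as free (leaf) nodes: (freq, id, node)
    heap = [(f, next(ids), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # take the two free nodes with the lowest frequency ...
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        # ... and give them a new parent node whose freq is the sum
        heapq.heappush(heap, (f1 + f2, next(ids), (a, b)))
    root = heap[0][2]  # last remaining free node

    codes = {}
    def walk(node, path):
        if isinstance(node, tuple):   # internal node: 0 = left, 1 = right
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:                         # leaf: bits on the root-to-leaf path
            codes[node] = path or "0"
    walk(root, "")
    return codes

print(huffman_codes({"A": 3, "T": 2, "R": 3, "N": 4, "E": 9}))
# {'E': '0', 'T': '100', 'A': '101', 'R': '110', 'N': '111'}
# ties are broken differently than in the slides' tree, but the
# weighted path length is the same optimal 45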
Huffman example
a series of colours in an 8 by 8 screen
colours are red, green, cyan, blue,
magenta, yellow, and black
sequence is
rkkkkkkk
kkkrrkkk
kkrrrrgg
kkbcccrr
gggmcbrr
bbbmybbr
gggggggr
grrrrgrr
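Counting the frequencies of this sequence mechanically (a small sketch; huffman_codes is the builder from the previous sketch):

from collections import Counter

pixels = ("rkkkkkkk" "kkkrrkkk" "kkrrrrgg" "kkbcccrr"
          "gggmcbrr" "bbbmybbr" "gggggggr" "grrrrgrr")
freqs = Counter(pixels)
print(freqs)   # r: 19, k: 17, g: 14, b: 7, c: 4, m: 2, y: 1
# feeding these counts to huffman_codes(freqs) yields short codes for
# the common r, k, g and long codes for the rare m and y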
Huffman example
[Figures: step-by-step construction of the Huffman tree for these colour
frequencies, omitted in this text version]
Fixed versus variable length
codewords
run length codewords are fixed length
Huffman codewords are variable length
length inversely proportional to frequency
all variable length compression schemes
have the prefix property
one codeword cannot be the prefix of another
binary tree structure guarantees that this is
the case (a leaf node is a leaf node!)
Huffman encoding
advantages
maximum compression ratio assuming correct
probabilities of occurrence
easy to implement and fast
disadvantages
need two passes for both encoder and decoder
one to create the frequency distribution
one to encode/decode the data
can avoid this by sending the tree along with the data
(costs space/time) or by using fixed, agreed-upon frequencies
Modified Huffman encoding
if we know frequency of occurrences, then
Huffman works very well
consider case of a fax; mostly long white
spaces with short bursts of black
do the following
run length encode each string of bits on a line
Huffman encode these run length codewords
use a predefined frequency distribution
combination run length, then Huffman
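A toy illustration of the two-stage pipeline; the sample line and the tiny "predefined" code table are hypothetical stand-ins for the real CCITT fax tables:

# a fax line is mostly long white runs with short black bursts
line = "0" * 40 + "111" + "0" * 21    # 0 = white, 1 = black

# stage 1: run length encode the bits
runs, i = [], 0
while i < len(line):
    j = i
    while j < len(line) and line[j] == line[i]:
        j += 1
    runs.append((line[i], j - i))
    i = j

# stage 2: Huffman encode the runs using a predefined frequency
# distribution (here a made-up three-entry table)
code_table = {("0", 40): "01", ("1", 3): "110", ("0", 21): "1110"}
bits = "".join(code_table[r] for r in runs)
print(runs)   # [('0', 40), ('1', 3), ('0', 21)]
print(bits)   # 011101110 -> 9 bits for a 64-bit line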
Lempel-Ziv-Welch (LZW)
previous methods worked only on characters
LZW works by encoding strings
some strings are replaced by a single codeword
for now assume the codeword size is fixed (12 bits)
for 8-bit characters, the first 256 (or fewer) entries
in the table are reserved for the characters
the rest of the table (entries 257-4096) represents strings
LZW compression
the trick is that the string-to-codeword mapping is
created dynamically by the encoder
and recreated dynamically by the decoder
so the code table need not be passed between the
two
it is a lossless compression algorithm
the degree of compression is hard to predict:
it depends on the data, but gets better as the
codeword table contains more strings
LZW encoder
Demonstrations
A nice animated version of Lempel-Ziv
LZW encoder example
compress the string BABAABAAA
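A minimal Python sketch of the encoder, traced on this string; raw byte values seed the table, so B = 66 and A = 65:

def lzw_encode(s: str):
    table = {chr(i): i for i in range(256)}  # entries 0-255: single chars
    next_code = 256                          # first free string codeword
    w, out = "", []
    for c in s:
        if w + c in table:
            w += c                    # keep extending the current string
        else:
            out.append(table[w])      # emit the longest known string ...
            table[w + c] = next_code  # ... and add the new string
            next_code += 1
            w = c
    if w:
        out.append(table[w])          # flush what is left
    return out

print(lzw_encode("BABAABAAA"))
# [66, 65, 256, 257, 65, 260]  i.e. B, A, BA, AB, A, AA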
LZW decoder
Lempel-Ziv compression
a lossless compression algorithm
All encodings have the same length
But may represent more than one character
Uses a “dictionary” approach – keeps
track of characters and character strings
already encountered
LZW decoder example
 decompress the string
<66><65><256><257><65><260>
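A matching decoder sketch (assuming well-formed input); the final <260> exercises the classic special case where a codeword names the table entry still being built:

def lzw_decode(codes):
    table = {i: chr(i) for i in range(256)}  # mirrors the encoder's table
    next_code = 256
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:                       # codeword not in the table yet: it must
            entry = prev + prev[0]  # be prev plus prev's first character
        out.append(entry)
        # recreate the entry the encoder added right after emitting prev
        table[next_code] = prev + entry[0]
        next_code += 1
        prev = entry
    return "".join(out)

print(lzw_decode([66, 65, 256, 257, 65, 260]))  # BABAABAAA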
LZW Issues
compression gets better as the code table
grows
what happens when all 4096 locations in the
string table are used?
a number of options, but encoder and
decoder must agree to do the same thing:
do not add any more entries to the table (use it as is)
clear the codeword table and start again
clear the codeword table and start again with a larger
table / longer codewords (GIF format)
LZW advantages/disadvantages
advantages
simple, fast and good compression
can do compression in one pass
dynamic codeword table built for each file
decompression recreates the codeword table
so it does not need to be passed
disadvantages
not the optimum compression ratio
actual compression hard to predict
Entropy methods
all of the previous methods are lossless and
entropy based
lossless methods are essential for
computer data (zip, gzip, etc.)
the combination of run length
encoding/Huffman is a standard tool
they are often used as subroutines by other,
lossy methods (JPEG, MPEG)