Text Compression  S

advertisement
Text Compression


Input: String S
Output: String S


Shorter
S can be reconstructed from S
CompSci 100
21.1
Text Compression: Examples
“abcde” in the different formats
ASCII:
01100001011000100110001101100100…
Fixed:
000001010011100
Encodings
ASCII: 8 bits/character
Unicode: 16 bits/character
1
0
Symbol ASCII
a
01100001
Fixed
length
000
Var.
length
000
b
01100010
001
11
c
01100011
010
01
d
01100100
011
001
e
01100101
100
10
0
0
a
Var:
000110100110
1
1
b
0
1
c
d
0
0
e
1
0
0
0
a
CompSci 100
0
1
1
c
0
e
1
b
d
21.2
Huffman coding: go go gophers
ASCII(7bits) 3 bits Huffman
g 103
o 111
p 112
h 104
e 101
r 114
s 115
sp. 32

1100111
1101111
1110000
1101000
1100101
1110010
1110011
1000000
000
001
010
011
100
101
110
111
00
01
1100
1101
1110
1111
100
101
13
7
6
g
o
3
3
s
*
1
2
Encoding uses tree:



2
2
p
h
e
r
1
1
1
1
6
0 left/1 right
How many bits? 37!!
Savings? Worth it?
CompSci 100
4
3
4
2
2
g
o
3
3
3
p
h
e
r
1
1
1
1
s
*
1
2
21.3
Huffman Coding




D.A Huffman in early 1950’s
Before compressing data, analyze the input stream
Represent data using variable length codes
Variable length codes though Prefix codes





Each letter is assigned a codeword
Codeword is for a given letter is produced by traversing
the Huffman tree
Property: No codeword produced is the prefix of another
Letters appearing frequently have short codewords, while
those that appear rarely have longer ones
Huffman coding is an optimal per-character coding
method
CompSci 100
21.4
Building a Huffman tree

Begin with a forest of single-node trees (leaves)




Repeat until there is only one node left: root of tree



Each node/tree/leaf is weighted with character count
Node stores two values: character and count
There are n nodes in forest, n is size of alphabet?
Remove two minimally weighted trees from forest
Create new tree with minimal trees as children,
o New tree root's weight: sum of children (character ignored)
Does this process terminate? How do we get minimal trees?

Remove minimal trees, hmmm……
CompSci 100
21.5
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11
6
5
5
4
4
3
3
3
3
2
2
2
2
2
1
1
I
E
N
S
M
A
B
O
T
G
D
L
R
U
P
F
CompSci 100
1
C
21.6
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
2
11
1
1
F
C
6
5
5
4
4
3
3
3
3
I
E
N
S
M
A
B
O
T
CompSci 100
2
2
2
2
2
2
1
G
D
L
R
U
P
21.7
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
2
11
3
1
1
1
2
C
F
P
U
6
5
5
4
4
I
E
N
S
M
CompSci 100
3
3
3
3
3
A
B
O
T
2
2
2
2
2
G
D
L
R
21.8
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
4
2
11
3
1
1
1
2
C
F
P
U
6
5
5
I
E
N
CompSci 100
4
4
4
S
M
3
2
2
L
R
2
3
3
3
3
A
B
O
T
2
2
G
D
21.9
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
4
2
11
3
1
1
1
2
C
F
P
U
6
5
5
I
E
N
CompSci 100
4
4
4
4
S
M
3
4
2
2
2
2
G
D
L
R
3
3
3
3
A
B
O
T
2
21.10
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
5
3
4
2
3
O
11
6
I
1
1
1
2
C
F
P
U
5
CompSci 100
5
5
E
N
4
4
4
4
S
M
4
2
2
2
2
G
D
L
R
3
3
3
A
B
T
3
21.11
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
5
3
6
2
3
O
11
6
4
3
T
1
1
1
2
C
F
P
U
6
I
CompSci 100
5
5
5
E
N
4
4
4
2
2
2
2
G
D
L
R
4
4
3
3
S
M
A
B
21.12
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
6
5
3
6
2
3
O
11
6
4
3
T
1
1
1
2
C
F
P
U
6
CompSci 100
6
I
5
5
5
E
N
4
4
4
2
2
2
2
G
D
L
R
4
4
S
M
3
3
A
B
21.13
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
6
8
5
3
6
2
3
O
11
8
4
4
S
M
4
3
T
1
1
1
2
C
F
P
U
6
CompSci 100
6
6
I
5
5
5
E
N
4
4
2
2
2
2
G
D
L
R
3
3
A
B
4
21.14
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
8
5
3
6
2
3
O
11
8
4
4
S
M
4
3
T
1
1
1
2
C
F
P
U
8
CompSci 100
6
6
6
I
5
6
8
5
5
E
N
4
2
2
2
2
G
D
L
R
3
3
A
B
21.15
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
10
5
8
5
6
E
3
2
3
O
11
10
4
4
S
M
4
3
T
1
1
1
2
C
F
P
U
8
CompSci 100
8
6
6
6
I
5
6
8
4
2
2
2
2
G
D
L
R
3
3
A
B
5
N
21.16
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
10
5
11
5
5
E
6
N
3
2
3
O
11
8
11
3
T
1
1
1
2
C
F
P
U
10
CompSci 100
8
8
6
6
8
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
6
I
21.17
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
12
10
11
8
6
8
6
I
5
5
5
E
N
3
2
3
O
12
6
11
3
T
1
1
1
2
C
F
P
U
11
CompSci 100
10
8
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
8
21.18
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
16
10
11
12
8
6
8
6
I
5
5
5
E
N
3
2
3
O
16
6
12
3
T
1
1
1
2
C
F
P
U
11
CompSci 100
11
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
10
21.19
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
21
16
10
11
12
8
6
8
6
I
5
5
5
E
N
3
2
3
O
21
6
16
3
T
1
1
1
2
C
F
P
U
12
CompSci 100
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
11
21.20
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
N
3
2
3
O
23
6
21
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
16
CompSci 100
21.21
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
N
3
2
3
O
37
6
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
23
CompSci 100
21.22
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
60
CompSci 100
21.23
Encoding
1.
Count occurrence of all occurring characters
O(
N
)
2.
Build priority queue
O(
A
)
3.
Build Huffman tree
O( A log A )
4.
Create Table of codes from tree
O( A log A )
Write Huffman tree and coded data to file
O(
5.
CompSci 100
N
)
21.24
Properties of Huffman coding

Want to minimize weighted path length L(T)of tree T
L(T ) 





i
iLeaf ( T )
i
wi is the weight or count of each codeword i
di is the leaf corresponding to codeword i
How do we calculate character (codeword)
frequencies?
Huffman coding creates pretty full bushy trees?


d w
When would it produce a “bad” tree?
How do we produce coded compressed data from
input efficiently?
CompSci 100
21.25
Writing code out to file

How do we go from characters to encodings?



Need way of writing bits out to file



Platform dependent?
Complicated to write bits and read in same ordering
See BitInputStream and BitOutputStream classes


Build Huffman tree
Root-to-leaf path generates encoding
Depend on each other, bit ordering preserved
How do we know bits come from compressed file?

Store a magic number
CompSci 100
21.26
Decoding a message
01100000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
3
T
1
1
1
2
C
F
P
U
CompSci 100
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
21.27
Decoding a message
1100000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
3
T
1
1
1
2
C
F
P
U
CompSci 100
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
21.28
Decoding a message
100000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
3
T
1
1
1
2
C
F
P
U
CompSci 100
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
21.29
Decoding a message
00000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
3
T
1
1
1
2
C
F
P
U
CompSci 100
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
21.30
Decoding a message
0000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
G
CompSci 100
21.31
Decoding a message
000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
G
CompSci 100
21.32
Decoding a message
00100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
G
CompSci 100
21.33
Decoding a message
0100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
G
CompSci 100
21.34
Decoding a message
100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
G
CompSci 100
21.35
Decoding a message
00001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GO
CompSci 100
21.36
Decoding a message
0001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GO
CompSci 100
21.37
Decoding a message
001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GO
CompSci 100
21.38
Decoding a message
01001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GO
CompSci 100
21.39
Decoding a message
1001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GO
CompSci 100
21.40
Decoding a message
001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GOO
CompSci 100
21.41
Decoding a message
01101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GOO
CompSci 100
21.42
Decoding a message
1101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GOO
CompSci 100
21.43
Decoding a message
101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GOO
CompSci 100
21.44
Decoding a message
01
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GOO
CompSci 100
21.45
Decoding a message
1
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GOOD
CompSci 100
21.46
Decoding a message
01100000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GOOD
CompSci 100
21.47
Decoding
1.
Read in tree data
O(
)
2.
Decode bit string with tree
O(
)
CompSci 100
21.48
Huffman coding: go go gophers
ASCII
g 103
o 111
p 112
h 104
e 101
r 114
s 115
sp. 32

1100111
1101111
1110000
1101000
1100101
1110010
1110011
1000000
000
001
010
011
100
101
110
111
??
??
g
o
p
h
e
r
s
*
3
3
1
1
1
1
1
2
2
2
3
p
h
e
r
s
*
1
1
1
1
1
2
choose two smallest weights




3 bits Huffman
combine nodes + weights
Repeat
Priority queue?

0 left/1 right
How many bits?
CompSci 100
2
2
Encoding uses tree:

6
4
p
h
e
r
1
1
1
1
g
o
3
3
21.49
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”

E.g. “ A SIMPLE”  “10101101001000101001110011100000”
CompSci 100
21.50
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”

E.g. “ A SIMPLE”  “10101101001000101001110011100000”
CompSci 100
21.51
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”

E.g. “ A SIMPLE”  “10101101001000101001110011100000”
CompSci 100
21.52
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”

E.g. “ A SIMPLE”  “10101101001000101001110011100000”
CompSci 100
21.53
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”

E.g. “ A SIMPLE”  “10101101001000101001110011100000”
CompSci 100
21.54
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”

E.g. “ A SIMPLE”  “10101101001000101001110011100000”
CompSci 100
21.55
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”

E.g. “ A SIMPLE”  “10101101001000101001110011100000”
CompSci 100
21.56
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”

E.g. “ A SIMPLE”  “10101101001000101001110011100000”
CompSci 100
21.57
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”

E.g. “ A SIMPLE”  “10101101001000101001110011100000”
CompSci 100
21.58
Other methods


Adaptive Huffman coding
Lempel-Ziv algorithms





Build the coding table on the fly while reading document
Coding table changes dynamically
Protocol between encoder and decoder so that everyone is
always using the right coding scheme
Works well in practice (compress, gzip, etc.)
More complicated methods


Burrows-Wheeler (bunzip2)
PPM statistical methods
CompSci 100
21.59
Download