Data Compression

advertisement
Data Compression



Compression is a high-profile application
 .zip, .mp3, .aac, .jpg, .gif, .gz, .mpg, …
 What property of MP3 was a significant factor in what
made P2P file sharing work?
Why do we care?
 Secondary storage capacity doubles every year
 Disk space fills up quickly on every computer system
 More data to compress than ever before
Data compression facilitated by priority queue
 All-time best assignment in a Compsci 100(e) course?
• Subject to debate, of course
CompSci 100e
9.1
More on Compression




What’s the difference between compression techniques?
 .mp3 files and .zip files?
 .gif and .jpg?
 Lossless and lossy
Is it possible to compress (lossless) every file? Why?
Lossy methods
 Good for pictures, video, and audio (JPEG, MPEG, etc.)
Lossless methods
 Run-length encoding, Huffman, LZW, …
11 3 5 3 2 6 2 6 5 3 5 3 5 3 10
CompSci 100e
9.2
Text Compression


Input: String S
Output: String S
 Shorter
 S can be reconstructed from S
CompSci 100e
9.3
Text Compression: Examples
“abcde” in the different formats
ASCII:
01100001011000100110001101100100…
Fixed:
000001010011100
1
Encodings
ASCII: 8 bits/character
Unicode: 16 bits/character
0
Symbol
ASCII
Fixed
length
Var.
length
a
01100001
000
000
b
01100010
001
11
c
01100011
010
01
d
01100100
011
001
e
01100101
100
10
0
0
a
Var:
000110100110
1
1
b
0
1
c
d
0
0
e
1
0
0
0
a
CompSci 100e
0
1
1
c
0
e
1
b
d
9.4
Huffman coding: go go gophers
ASCII
g 103
o 111
p 112
h 104
e 101
r 114
s 115
sp. 32

3 bits Huffman
1100111
1101111
1110000
1101000
1100101
1110010
1110011
1000000
000
001
010
011
100
101
110
111
00
01
1100
1101
1110
1111
101
101
13
7
6
g
o
3
3
Encoding uses tree:
 0 left/1 right
 How many bits? 37!!
 Savings? Worth it?
4
3
s
*
1
2
2
2
p
h
e
r
1
1
1
1
6
4
2
2
g
o
3
3
3
CompSci 100e
p
h
e
r
1
1
1
1
s
*
1
2
9.5
Huffman Coding





D.A Huffman in early 1950’s
Before compressing data, analyze the input stream
Represent data using variable length codes
Variable length codes though Prefix codes
 Each letter is assigned a codeword
 Codeword is for a given letter is produced by traversing
the Huffman tree
 Property: No codeword produced is the prefix of another
 Letters appearing frequently have short codewords,
while those that appear rarely have longer ones
Huffman coding is optimal per-character coding method
CompSci 100e
9.6
Building a Huffman tree

Begin with a forest of single-node trees (leaves)
 Each node/tree/leaf is weighted with character count
 Node stores two values: character and count
 There are n nodes in forest, n is size of alphabet?

Repeat until there is only one node left: root of tree
 Remove two minimally weighted trees from forest
 Create new tree with minimal trees as children,
• New tree root's weight: sum of children (character ignored)

Does this process terminate? How do we get minimal trees?
 Remove minimal trees, hummm……
CompSci 100e
9.7
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11
6
5
5
4
4
3
3
3
3
2
2
2
2
2
1
1
1
I
E
N
S
M
A
B
O
T
G
D
L
R
U
P
F
C
CompSci 100e
9.8
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
2
11
1
1
F
C
6
5
5
4
4
3
3
3
3
I
E
N
S
M
A
B
O
T
CompSci 100e
2
2
2
2
2
2
1
G
D
L
R
U
P
9.9
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
2
11
3
1
1
1
2
C
F
P
U
6
5
5
4
4
I
E
N
S
M
CompSci 100e
3
3
3
3
3
A
B
O
T
2
2
2
2
2
G
D
L
R
9.10
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
4
2
11
3
1
1
1
2
C
F
P
U
6
5
5
I
E
N
CompSci 100e
4
4
4
S
M
3
2
2
L
R
2
3
3
3
3
A
B
O
T
2
2
G
D
9.11
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
4
2
11
3
1
1
1
2
C
F
P
U
6
5
5
I
E
N
CompSci 100e
4
4
4
4
S
M
3
4
2
2
2
2
G
D
L
R
3
3
3
3
A
B
O
T
2
9.12
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
5
3
4
2
3
O
11
6
1
1
1
2
C
F
P
U
5
I
CompSci 100e
5
5
E
N
4
4
4
4
S
M
4
2
2
2
2
G
D
L
R
3
3
3
A
B
T
3
9.13
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
5
3
6
2
3
O
11
6
4
3
T
1
1
1
2
C
F
P
U
6
I
CompSci 100e
5
5
5
E
N
4
4
4
2
2
2
2
G
D
L
R
4
4
3
3
S
M
A
B
9.14
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
6
5
3
6
2
3
O
11
6
4
3
T
1
1
1
2
C
F
P
U
6
6
I
CompSci 100e
5
5
5
E
N
4
4
4
2
2
2
2
G
D
L
R
4
4
S
M
3
3
A
B
9.15
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
6
8
5
3
6
2
3
O
11
8
4
4
S
M
4
3
T
1
1
1
2
C
F
P
U
6
6
6
I
CompSci 100e
5
5
5
E
N
4
4
2
2
2
2
G
D
L
R
3
3
A
B
4
9.16
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
8
5
3
6
2
3
O
11
8
4
4
S
M
4
3
T
1
1
1
2
C
F
P
U
8
6
6
6
I
CompSci 100e
5
6
8
5
5
E
N
4
2
2
2
2
G
D
L
R
3
3
A
B
9.17
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
10
5
8
5
6
E
3
2
3
O
11
10
4
4
S
M
4
3
T
1
1
1
2
C
F
P
U
8
8
6
6
6
I
CompSci 100e
5
6
8
4
2
2
2
2
G
D
L
R
3
3
A
B
5
N
9.18
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
10
5
11
5
5
E
6
N
3
2
3
O
11
8
11
3
T
1
1
1
2
C
F
P
U
10
8
8
6
6
8
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
6
I
CompSci 100e
9.19
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
12
10
11
8
6
8
6
I
5
5
5
E
N
3
2
3
O
12
6
11
3
T
1
1
1
2
C
F
P
U
11
CompSci 100e
10
8
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
8
9.20
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
16
10
11
12
8
6
8
6
I
5
5
5
E
N
3
2
3
O
16
6
12
3
T
1
1
1
2
C
F
P
U
11
CompSci 100e
11
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
10
9.21
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
21
16
10
11
12
8
6
8
6
I
5
5
5
E
N
3
2
3
O
21
6
16
3
T
1
1
1
2
C
F
P
U
12
CompSci 100e
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
11
9.22
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
N
3
2
3
O
23
6
21
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
16
CompSci 100e
9.23
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
N
3
2
3
O
37
6
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
23
CompSci 100e
9.24
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
60
CompSci 100e
9.25
Encoding
1.
Count occurrence of all occurring character
O(
)
2.
Build priority queue
O(
)
3.
Build Huffman tree
O(
)
4.
Create Table of codes from tree
O(
)
Write Huffman tree and coded data to file
O(
)
5.
CompSci 100e
9.26
Properties of Huffman coding

Want to minimize weighted path length L(T)of tree T
L(T ) 

d w
i
iLeaf ( T )
i
wi is the weight or count of each codeword i
 di is the leaf corresponding to codeword i
How do we calculate character (codeword) frequencies?
Huffman coding creates pretty full bushy trees?
 When would it produce a “bad” tree?
How do we produce coded compressed data from input
efficiently?




CompSci 100e
9.27
Writing code out to file

How do we go from characters to encodings?
 Build Huffman tree
 Root-to-leaf path generates encoding

Need way of writing bits out to file
 Platform dependent?
 Complicated to write bits and read in same ordering

See BitInputStream and BitOutputStream classes
 Depend on each other, bit ordering preserved

How do we know bits come from compressed file?
 Store a magic number
CompSci 100e
9.28
Decoding a message
01100000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
3
T
1
1
1
2
C
F
P
U
CompSci 100e
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
9.29
Decoding a message
1100000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
3
T
1
1
1
2
C
F
P
U
CompSci 100e
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
9.30
Decoding a message
100000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
3
T
1
1
1
2
C
F
P
U
CompSci 100e
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
9.31
Decoding a message
00000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
3
T
1
1
1
2
C
F
P
U
CompSci 100e
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
9.32
Decoding a message
0000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
G
CompSci 100e
9.33
Decoding a message
000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
G
CompSci 100e
9.34
Decoding a message
00100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
G
CompSci 100e
9.35
Decoding a message
0100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
G
CompSci 100e
9.36
Decoding a message
100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
G
CompSci 100e
9.37
Decoding a message
00001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GO
CompSci 100e
9.38
Decoding a message
0001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GO
CompSci 100e
9.39
Decoding a message
001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GO
CompSci 100e
9.40
Decoding a message
01001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GO
CompSci 100e
9.41
Decoding a message
1001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GO
CompSci 100e
9.42
Decoding a message
001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GOO
CompSci 100e
9.43
Decoding a message
01101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GOO
CompSci 100e
9.44
Decoding a message
1101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GOO
CompSci 100e
9.45
Decoding a message
101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GOO
CompSci 100e
9.46
Decoding a message
01
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
4
4
S
M
3
T
1
1
1
2
C
F
P
U
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GOO
CompSci 100e
9.47
Decoding a message
1
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GOOD
CompSci 100e
9.48
Decoding a message
01100000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
GOOD
CompSci 100e
9.49
Decoding
1.
Read in tree data
O(
)
2.
Decode bit string with tree
O(
)
CompSci 100e
9.50
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CompSci 100e
9.51
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CompSci 100e
9.52
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CompSci 100e
9.53
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CompSci 100e
9.54
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CompSci 100e
9.55
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CompSci 100e
9.56
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CompSci 100e
9.57
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CompSci 100e
9.58
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CompSci 100e
9.59
Building a Huffman tree

Begin with a forest of single-node trees/tries (leaves)
 Each node/tree/leaf is weighted with character
count
 Node stores two values: character and count

Repeat until there is only one node left: root of tree
 Remove two minimally weighted trees from forest
 Create new tree/internal node with minimal trees
as children,
• Weight is sum of children’s weight (no char)
How does process terminate? Finding minimum?
 Remove minimal trees, hummm……

CompSci 100e
9.60
How do we create Huffman Tree/Trie?

Insert weighted values into priority queue
 What are initial weights? Why?


Remove minimal nodes, weight by sums, re-insert
 Total number of nodes?
PriorityQueue<TreeNode> pq = new PriorityQueue<TreeNode>();
for(int k=0; k < freq.length; k++){
pq.add(new TreeNode(k,freq[k],null,null));
}
while (pq.size() > 1){
TreeNode left = pq.remove();
TreeNode right = pq.remove();
pq.add(new TreeNode(0,left.weight+right.weight,
left,right));
}
TreeNode root = pq.remove();
CompSci 100e
9.61
Creating compressed file

Once we have new encodings, read every character
 Write encoding, not the character, to compressed
file
 Why does this save bits?
 What other information needed in compressed file?

How do we uncompress?
 How do we know foo.hf represents compressed
file?
 Is suffix sufficient? Alternatives?

Why is Huffman coding a two-pass method?
 Alternatives?
CompSci 100e
9.62
Uncompression with Huffman



We need the trie to uncompress
 000100100010011001101111
As we read a bit, what do we do?
 Go left on 0, go right on 1
 When do we stop? What to do?
How do we get the trie?
 How did we get it originally? Store 256 int/counts
• How do we read counts?
 How do we store a trie? 20 Questions relevance?
• Reading a trie? Leaf indicator? Node values?
CompSci 100e
9.63
Other Huffman Issues

What do we need to decode?
 How did we encode? How will we decode?
 What information needed for decoding?

Reading and writing bits: chunks and stopping
 Can you write 3 bits? Why not? Why?
 PSEUDO_EOF
 BitInputStream and BitOutputStream: API

What should happen when the file won’t compress?
 Silently compress bigger? Warn user? Alternatives?
CompSci 100e
9.64
Huffman Complexities

How do we measure? Size of input file, size of
alphabet
 Which is typically bigger?

Accumulating character counts: ______
 How can we do this in O(1) time, though not really
Building the heap/priority queue from counts ____
 Initializing heap guaranteed
Building Huffman tree ____
 Why?
Create table of encodings from tree ____
 Why?
Write tree and compressed file _____




CompSci 100e
9.65
Good Compsci 100(e) Assignment?






Array of character/chunk counts, or is this a map?
 Map character/chunk to count, why array?
Priority Queue for generating tree/trie
 Do we need a heap implementation? Why?
Tree traversals for code generation, uncompression
 One recursive, one not, why and which?
Deal with bits and chunks rather than ints and chars
 The good, the bad, the ugly
Create a working compression program
 How would we deploy it? Make it better?
Benchmark for analysis
 What’s a corpus?
CompSci 100e
9.66
Other methods



Adaptive Huffman coding
Lempel-Ziv algorithms
 Build the coding table on the fly while reading
document
 Coding table changes dynamically
 Protocol between encoder and decoder so that
everyone is always using the right coding scheme
 Works well in practice (compress, gzip, etc.)
More complicated methods
 Burrows-Wheeler (bunzip2)
 PPM statistical methods
CompSci 100e
9.67
Data Compression
Year Scheme
Bit/Cha 
r
1967 ASCII
7.00
1950 Huffman
4.70
1977 Lempel-Ziv (LZ77)
3.94

1984 Lempel-Ziv-Welch (LZW) – Unix 3.32
compress
Why is data
compression
important?
How well can
you compress
files losslessly?
1987 (LZH) used by zip and unzip
3.30

1987 Move-to-front
3.24

1987 gzip
2.71
1995 Burrows-Wheeler
2.29
1997 BOA (statistical data
compression)
1.99
CompSci 100e

Is there a limit?
How to compare?
How do you
measure how
much
information?
9.68
Download