
5 text compression

Managing Gigabytes
Chapter 2
Ian H. Witten
Text compression
Text and index representation
Compression is prediction
Models
– Finite-context vs finite-state
– Static vs adaptive vs semi-static
Huffman coding
Arithmetic coding
PPM
LZ compression: LZ77, LZ78, LZW
Comparison of methods
2005
768078 preceding characters, 49983 of them are t
Pr[next is t] = 49983/768078 = 6.5% [-log2 0.065 = 3.94 bits]
Pr[next is e] = 9.4%
Pr[next is x] = 0.11%
Pr[next is Z] = 0???
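The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name information_bits is illustrative and not from the slides, and the base-2 logarithm is assumed because the result is quoted in bits.

```python
import math

def information_bits(count, total):
    """Ideal code length in bits for a symbol seen `count` times out of `total`."""
    return -math.log2(count / total)

# Counts taken from the slide: 49983 occurrences of "t" in 768078 characters.
p_t = 49983 / 768078
print(f"Pr[next is t] = {p_t:.1%}")
print(f"code length   = {information_bits(49983, 768078):.2f} bits")

# The other probabilities on the slide, converted the same way.
for symbol, prob in [("e", 0.094), ("x", 0.0011)]:
    print(symbol, f"{-math.log2(prob):.2f} bits")
```

A count of zero, as in the Z case, has no finite code length under this estimate, which is the difficulty the "0???" on the slide points at.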
Fixed-length coding
Sam I am I am a Sam
I am a Sam I am
Fixed length representation
letter  code
S       000
a       001
m       010
•       011
I       100
\n      101
6 different symbols
3 bits/symbol
log2 6 = 2.585
35 symbols in message
35 × 3 = 105 bits = 14 bytes
Coded symbol by symbol:
000 001 010 011 100 011 001 010 011 100 011 001 010 011 001 011 000 001 010 101
100 011 001 010 011 001 011 000 001 010 011 100 011 001 010
Packed into bytes (* marks padding):
00000101 00111000 11001010 01110001 10010100 11001011 00000101
01011000 11001010 01100101 10000010 10011100 01100101 0*******
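As a rough check of the figures on this slide, the sketch below encodes the message with the 3-bit table and packs the result into bytes. The function names are illustrative, and padding the last byte with zero bits is an assumption (the slide only marks the padding with asterisks).

```python
import math

MESSAGE = "Sam I am I am a Sam\nI am a Sam I am"

# Code table from the slide; the space is drawn there as a bullet.
CODES = {"S": "000", "a": "001", "m": "010", " ": "011", "I": "100", "\n": "101"}

def encode_fixed(message, codes):
    """Concatenate the fixed-length code of every symbol."""
    return "".join(codes[ch] for ch in message)

def pack_bytes(bits):
    """Pad with zero bits to a whole number of bytes, then split into 8-bit groups."""
    padded = bits + "0" * (-len(bits) % 8)
    return [padded[i:i + 8] for i in range(0, len(padded), 8)]

bits = encode_fixed(MESSAGE, CODES)
packed = pack_bytes(bits)

print(math.ceil(math.log2(len(CODES))))        # bits needed per symbol
print(len(MESSAGE), len(bits), len(packed))    # 35 105 14
print(" ".join(packed[:7]))
```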
Huffman coding
Sam I am I am a Sam
I am a Sam I am
Variable length representation
letter  frequency  code  size  total
S       3          1110  4     12
a       9          01    2     18
m       7          10    2     14
•       11         00    2     22
I       4          110   3     12
\n      1          1111  4     4
total   35                     82 bits
82 bits = 11 bytes
11/14 = compression to 78%
Coded symbol by symbol:
1110 01 10 00 110 00 01 10 00 110 00 01 10 00 01 00 1110 01 10 1111
110 00 01 10 00 01 00 1110 01 10 00 110 00 01 10
Packed into bytes (* marks padding):
11100110 00110000 11000110 00011000 01001110 01101111 11000011
00001001 11001100 01100001 10******
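The 82-bit total can be reproduced by applying the slide's code table directly. The sketch below also decodes the bit string again, which works because no code is a prefix of another. Function names are illustrative, not from the slides.

```python
MESSAGE = "Sam I am I am a Sam\nI am a Sam I am"

# Variable-length code table from the slide (space shown there as a bullet).
HUFFMAN = {"S": "1110", "a": "01", "m": "10", " ": "00", "I": "110", "\n": "1111"}

def encode(message, codes):
    """Concatenate the code of every symbol."""
    return "".join(codes[ch] for ch in message)

def decode(bits, codes):
    """Read bits until they match a code; valid only for prefix-free codes."""
    inverse = {code: symbol for symbol, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

bits = encode(MESSAGE, HUFFMAN)
n_bytes = -(-len(bits) // 8)          # round up to whole bytes

print(len(bits), n_bytes)             # 82 11
print(f"{n_bytes / 14:.1%} of the fixed-length size")
```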
Huffman code tree
Sam I am I am a Sam
I am a Sam I am
letter  frequency  code
•       11         00
a       9          01
m       7          10
I       4          110
S       3          1110
\n      1          1111
[Figure: Huffman code tree. Joins: \n (1) + S (3) = 4; 4 + I (4) = 8; m (7) + 8 = 15; • (11) + a (9) = 20; 20 + 15 = 35 (root). Branches are labeled 0 and 1.]
Rules:
1. Join the two smallest nodes at each stage
2. Label left-going lines 0, right ones 1
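The two rules can be sketched with a priority queue. This is a minimal illustration, not the slides' own procedure: the function name, the use of heapq, and the tie-breaking counter are all assumptions. Because left and right are chosen by heap order, the bit patterns may differ from the slide's, but the code lengths and the 82-bit total come out the same for this message.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(message):
    """Build a Huffman code by repeatedly joining the two smallest nodes."""
    freq = Counter(message)
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(weight, next(tiebreak), symbol) for symbol, weight in freq.items()]
    heapq.heapify(heap)

    # Rule 1: join the two smallest nodes at each stage.
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))

    # Rule 2: label left-going lines 0, right ones 1.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: a single character
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

MESSAGE = "Sam I am I am a Sam\nI am a Sam I am"
codes = huffman_codes(MESSAGE)
lengths = {symbol: len(code) for symbol, code in codes.items()}
total_bits = sum(len(codes[ch]) for ch in MESSAGE)
print(lengths, total_bits)
```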
Ziv-Lempel coding: LZ77
Repeat occurrences of character sequences?
— replace them with a pointer back to the first occurrence
to•be•or•not•to•be,•that•is•…
Pointer is <position,length>
to•be•or•not•<1,5>,•that•is•…
If position and length are both < 256, the pointer fits in 2 bytes
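To make the pointer idea concrete, here is a small decoder sketch. The token representation (plain strings for literal text, tuples for <position,length> pointers) and the function name are assumptions made for illustration. Positions are counted from 1, as in the slide's example.

```python
def lz77_decode(tokens):
    """Expand literals and <position,length> pointers into the original text."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            position, length = token
            start = position - 1             # slide positions are 1-based
            # Copy one character at a time so a pointer may overlap
            # the text it is producing.
            for i in range(length):
                out.append(out[start + i])
        else:
            out.extend(token)                # literal characters
    return "".join(out)

# The slide's example: to•be•or•not•<1,5>,•that•is•
tokens = ["to•be•or•not•", (1, 5), ",•that•is•"]
print(lz77_decode(tokens))   # to•be•or•not•to•be,•that•is•
```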
LZ77 example
to be just or
to be not,
must or must I be not?
[Figure: the text laid out 16 bytes per row (byte 1 to byte 16), then the same text with repeats replaced by <position,length> pointers.]
Encoded:
to•be•just•or\n<1,6>not,\nm<8,6>•<26,5>I<17,7>?
LZ77 example
Sam I am I am a Sam
I am a Sam I am
[Figure: the text laid out 16 bytes per row (byte 1 to byte 16), then the same text with repeats replaced by <position,length> pointers.]
Encoded:
Sam•I•<2,8>a•<1,3>\n<10,10><4,5>
LZ77 example
aaaaaaaaaaaaaaaaarrrrrrrr
rrrrrggggggggggghhhhh
[Figure: the text laid out 16 bytes per row (byte 1 to byte 16), then the same text with repeats replaced by <position,length> pointers.]
Encoded:
a<1,16>r<18,12>g<31,10>h<42,4>
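The pointers in these examples can be read as <position,length> pairs, with a pointer allowed to run into the text it is still producing (a<1,16> yields seventeen a's). The sketch below is one simple greedy way to produce such a stream; it is an illustration under assumptions, not the method from the slides. It takes the longest earlier match, prefers the earliest position on ties, and only emits a pointer for matches of at least min_length characters (a hypothetical parameter).

```python
def lz77_encode(text, min_length=2):
    """Greedy sketch: emit literals and 1-based <position,length> pointers."""
    tokens, i = [], 0
    while i < len(text):
        best_len, best_pos = 0, 0
        for start in range(i):
            length = 0
            # The match may extend past position i (overlapping copy).
            while (i + length < len(text)
                   and text[start + length] == text[i + length]):
                length += 1
            if length > best_len:            # strict: earliest position wins ties
                best_len, best_pos = length, start + 1
        if best_len >= min_length:
            tokens.append((best_pos, best_len))
            i += best_len
        else:
            tokens.append(text[i])           # too short to be worth a pointer
            i += 1
    return tokens

runs = "a" * 17 + "r" * 13 + "g" * 11 + "h" * 5
print(lz77_encode(runs))
# ['a', (1, 16), 'r', (18, 12), 'g', (31, 10), 'h', (42, 4)]
```

The search here is quadratic in the text length, which is fine for slide-sized examples but not for real input.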
Making LZ77 practical
Each element is <go-back, length, new char>
Sliding window of W chars
(Alternative: 1-bit flag to indicate whether item is pointer or
character)
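The <go-back, length, new char> format with a sliding window can be sketched as follows. The function names and the default window of 16 characters are assumptions chosen for illustration; the slides leave W unspecified. The encoder only looks for matches that start within the last W characters and always ends each element with one literal character.

```python
def lz77_triples(text, window=16):
    """Encode text as <go-back, length, new char> triples."""
    triples, i = [], 0
    while i < len(text):
        best_back, best_len = 0, 0
        for start in range(max(0, i - window), i):
            length = 0
            # Stop one short of the end so a "new char" always remains.
            while (i + length < len(text) - 1
                   and text[start + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_back, best_len = i - start, length
        triples.append((best_back, best_len, text[i + best_len]))
        i += best_len + 1
    return triples

def lz77_expand(triples):
    """Rebuild the text by copying from `back` characters ago, then adding the new char."""
    out = []
    for back, length, char in triples:
        start = len(out) - back
        for k in range(length):
            out.append(out[start + k])   # one at a time, so overlaps work
        out.append(char)
    return "".join(out)

MESSAGE = "Sam I am I am a Sam\nI am a Sam I am"
triples = lz77_triples(MESSAGE)
print(triples)
print(lz77_expand(triples) == MESSAGE)
```

A relative go-back distance bounded by W keeps each field a fixed, small size, which is the practical advantage over absolute positions.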
LZ78
LZW