Managing Gigabytes, Chapter 2
Ian H. Witten
2005

Text compression
  Text and index representation
  Compression is prediction
  Models
    – Finite-context vs finite-state
    – Static vs adaptive vs semi-static
  Huffman coding
  Arithmetic coding
  PPM
  LZ compression: LZ77, LZ78, LZW
  Comparison of methods

768078 preceding characters, 49983 of them are t
  Pr[next is t] = 49983/768078 = 6.5%   [-log2 0.065 = 3.94 bits]
  Pr[next is e] = 9.4%
  Pr[next is x] = 0.11%
  Pr[next is Z] = 0???

Fixed-length coding
Message (35 symbols; • below stands for a space, \n for a newline):
  Sam I am I am a Sam
  I am a Sam I am

  letter   S     a     m     •     I     \n
  code     000   001   010   011   100   101

Fixed-length representation
  6 different symbols, 3 bits/symbol (log2 6 = 2.585)
  35 symbols in message: 35 × 3 = 105 bits = 14 bytes

  Codes, one per symbol:
    000 001 010 011 100 011 001 010 011 100 011 001 010 011 001 011 000 001 010 101
    100 011 001 010 011 001 011 000 001 010 011 100 011 001 010
  Packed into bytes (* = unused padding bit):
    00000101 00111000 11001010 01110001 10010100 11001011 00000101
    01011000 11001010 01100101 10000010 10011100 01100101 0*******

Huffman coding
Message: the same 35 symbols as above.

  letter   frequency   code   size   total
  S            3       1110    4      12
  a            9       01      2      18
  m            7       10      2      14
  •           11       00      2      22
  I            4       110     3      12
  \n           1       1111    4       4
  total       35                      82

Variable-length representation
  82 bits = 11 bytes
  11/14 bytes (82/105 bits): compression to about 78%

  Codes, one per symbol:
    1110 01 10 00 110 00 01 10 00 110 00 01 10 00 01 00 1110 01 10 1111
    110 00 01 10 00 01 00 1110 01 10 00 110 00 01 10
  Packed into bytes (* = unused padding bit):
    11100110 00110000 11000110 00011000 01001110 01101111
    11000011 00001001 11001100 01100001 10******

Huffman code tree
[Figure: Huffman tree for the message. Leaves: • 11, a 9, m 7, I 4, S 3, \n 1. Internal nodes: 4, 8, 15, 20; root 35. Branches labelled 0 and 1.]
Rules:
  1. Join the two smallest nodes at each stage
  2. Label left-going lines 0, right ones 1
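The two rules for building a Huffman tree can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `huffman_code_lengths`, the heap-based bookkeeping, and the tie-breaking order are all assumptions. It computes only the code length of each symbol, because the exact 0/1 labels depend on how the tree is drawn, while the lengths (and so the total bit count) are determined by the frequencies here.

```python
import heapq
import math
from collections import Counter


def huffman_code_lengths(message):
    """Return (frequencies, code lengths) for the symbols of message.

    Hypothetical helper: repeatedly joins the two smallest nodes, as in
    rule 1 of the slide. Each heap entry is (weight, tiebreak, depths),
    where depths maps every symbol under that node to its current depth.
    Note: a message with a single distinct symbol gets length 0 here.
    """
    freq = Counter(message)
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)  # unique counter so dicts are never compared
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)  # smallest node
        w2, _, d2 = heapq.heappop(heap)  # second smallest node
        # Joining two nodes pushes every symbol below them one level down.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return freq, heap[0][2]


# The 35-symbol message from the slides (spaces shown there as bullets).
message = "Sam I am I am a Sam\nI am a Sam I am"
freq, lengths = huffman_code_lengths(message)

huffman_bits = sum(freq[s] * lengths[s] for s in freq)
fixed_bits = len(message) * math.ceil(math.log2(len(freq)))

print(huffman_bits, (huffman_bits + 7) // 8)  # 82 bits, 11 bytes
print(fixed_bits, (fixed_bits + 7) // 8)      # 105 bits, 14 bytes
```

The lengths come out as 2 bits for space, a and m, 3 bits for I, and 4 bits for S and the newline, which matches the size column of the table above.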
Ziv-Lempel coding: LZ77
Repeat occurrences of character sequences? Replace them with a pointer back to the first occurrence.
  to•be•or•not•to•be,•that•is•…
Pointer is <position,length>
  to•be•or•not•<1,5>,•that•is•…
If both are < 256, the pointer fits in 2 bytes

LZ77 example
  to be just or
  to be not,
  must or must I be not?
[Figure: the text laid out byte by byte (byte 1 and byte 16 marked) above its encoding, in which repeated strings are replaced by pointers. Pointers in the figure, in order: <1,6>, <8,6>, <26,5>, <17,7>.]

LZ77 example
  Sam I am I am a Sam
  I am a Sam I am
[Figure: the text laid out byte by byte (byte 1 and byte 16 marked) above its encoding with pointers.]

LZ77 example
  aaaaaaaaaaaaaaaaarrrrrrrrrrrrrggggggggggghhhhh   (17 a, 13 r, 11 g, 5 h)
Encoded:
  a <1,16> r <18,12> g <31,10> h <42,4>
A pointer may run into the text it is itself producing: <1,16> starts at byte 1 and copies 16 characters although only one has been written so far.

Making LZ77 practical
  Each element is <go-back, length, new char>
  Sliding window of W chars
  (Alternative: 1-bit flag to indicate whether item is pointer or character)
[Figure: a sequence of items: Char, Pointer, Char]

LZ78

LZW
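The <position,length> pointer scheme from the LZ77 slides can be tried out with a small Python sketch. Everything here is an assumption made for illustration: the function names, the token representation (a one-character string for a literal, a `(position, length)` tuple for a pointer, with 1-based positions as on the slides), the greedy longest-match search, and the minimum match length of 3. It searches the whole preceding text rather than a sliding window of W characters, so it corresponds to the simple form of the scheme, not the practical <go-back, length, new char> variant.

```python
def lz77_encode(text, min_len=3):
    """Greedy encoder sketch: emit literals and (position, length) pointers.

    Positions are 1-based and absolute. A match may overlap the text
    being encoded, as in the run-of-a's example on the slides.
    """
    tokens = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_pos = 0, 0
        for j in range(i):  # every earlier start position
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:  # keep the earliest longest match
                best_len, best_pos = k, j
        if best_len >= min_len:
            tokens.append((best_pos + 1, best_len))
            i += best_len
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the text; copying one character at a time makes
    self-overlapping pointers such as (1, 16) work."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            pos, length = tok
            for k in range(length):
                out.append(out[pos - 1 + k])
        else:
            out.append(tok)
    return "".join(out)


runs = "a" * 17 + "r" * 13 + "g" * 11 + "h" * 5
print(lz77_encode(runs))
# ['a', (1, 16), 'r', (18, 12), 'g', (31, 10), 'h', (42, 4)]
```

On the run-length example this reproduces the encoding shown on the slide, and `lz77_decode(lz77_encode(s)) == s` holds for any string `s`. The quadratic search is only meant to keep the sketch short.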