Data Compression

Compression is a high-profile application
  .zip, .mp3, .aac, .jpg, .gif, .gz, .mpg, …
  What property of MP3 was a significant factor in making P2P file sharing work?
Why do we care?
  Secondary storage capacity doubles every year
  Disk space fills up quickly on every computer system
  More data to compress than ever before
Data compression is facilitated by a priority queue
All-time best assignment in a Compsci 100(e) course?
  • Subject to debate, of course

CompSci 100e 9.1

More on Compression

What’s the difference between compression techniques?
  .mp3 files and .zip files?
  .gif and .jpg?
  Lossless and lossy
Is it possible to compress (losslessly) every file? Why?
Lossy methods
  Good for pictures, video, and audio (JPEG, MPEG, etc.)
Lossless methods
  Run-length encoding, Huffman, LZW, …
  [Figure: numeric example accompanying the lossless methods: 11 3 5 3 2 6 2 6 5 3 5 3 5 3 10]

Text Compression

Input: String S
Output: String S′
  S′ is shorter than S
  S can be reconstructed from S′

Text Compression: Examples

“abcde” in the different formats
  ASCII: 01100001011000100110001101100100…
  Fixed: 000001010011100
  Var:   000110100110
Encodings
  ASCII: 8 bits/character
  Unicode: 16 bits/character

  Symbol  ASCII     Fixed length  Var. length
  a       01100001  000           000
  b       01100010  001           11
  c       01100011  010           01
  d       01100100  011           001
  e       01100101  100           10

[Figure: binary code trees (0 = left, 1 = right) for the fixed-length and variable-length codes]

Huffman coding: go go gophers

  Char  ASCII  ASCII bits  3 bits  Huffman
  g     103    1100111     000     00
  o     111    1101111     001     01
  p     112    1110000     010     1100
  h     104    1101000     011     1101
  e     101    1100101     100     1110
  r     114    1110010     101     1111
  s     115    1110011     110     100
  sp.   32     0100000     111     101

[Figure: Huffman tree for “go go gophers”. The root has weight 13; its children have weights 6 (g and o, 3 each) and 7.]
Encoding uses tree: 0 left/1 right
How many bits? 37!!
Savings? Worth it?
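The bit counts asked about on the “go go gophers” slide can be checked with a short sketch. The class name GophersBits is made up for illustration, and the codewords for s (100) and space (101) are an assumption read off the tree rather than copied from the table.

```java
import java.util.Map;

final class GophersBits {
    // Codewords read off the slide's tree (0 = left, 1 = right).
    // Assumption: 's' is 100 and space is 101, as the tree shape implies.
    static final Map<Character, String> CODES = Map.of(
        'g', "00", 'o', "01", 'p', "1100", 'h', "1101",
        'e', "1110", 'r', "1111", 's', "100", ' ', "101");

    // Concatenate the codeword of every character in the text.
    static String encode(String text) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            bits.append(CODES.get(c));
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        String text = "go go gophers";
        System.out.println("8-bit ASCII: " + 8 * text.length() + " bits");
        System.out.println("3-bit fixed: " + 3 * text.length() + " bits");
        System.out.println("Huffman:     " + encode(text).length() + " bits");
    }
}
```

With 13 characters, the 3-bit code needs 39 bits and the Huffman code 37, so on an input this small the saving over a fixed-length code is slight, and it shrinks further once the tree itself has to be stored.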
[Figure, continued: below the node of weight 7 sit a node of weight 3 (s with count 1, space with count 2) and a node of weight 4 (p, h, e, r with count 1 each, paired as p+h and e+r).]

Huffman Coding

D. A. Huffman, early 1950s
Before compressing data, analyze the input stream
Represent data using variable-length codes
Variable-length codes through prefix codes
  Each letter is assigned a codeword
  The codeword for a given letter is produced by traversing the Huffman tree
  Property: no codeword produced is the prefix of another
  Letters appearing frequently have short codewords, while those that appear rarely have longer ones
Huffman coding is an optimal per-character coding method

Building a Huffman tree

Begin with a forest of single-node trees (leaves)
  Each node/tree/leaf is weighted with character count
  Node stores two values: character and count
  There are n nodes in forest, n is size of alphabet?
Repeat until there is only one node left: root of tree
  Remove two minimally weighted trees from forest
  Create new tree with minimal trees as children,
    • New tree root's weight: sum of children (character ignored)
Does this process terminate? How do we get minimal trees?
  Remove minimal trees, hummm……

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

Character counts (the count of 11 is the space character):

  sp  I  E  N  S  M  A  B  O  T  G  D  L  R  U  P  F  C
  11  6  5  5  4  4  3  3  3  3  2  2  2  2  2  1  1  1

[Figure: animation frames, one per merge of the two lowest-weight trees. The first merges join F and C (weight 2) and then P and U (weight 3); later frames create internal nodes of weight 4, 4, 5, 6, 6, 8, 8, and 10.]
[Figure: remaining animation frames. The merges create internal nodes of weight 11, 12, 16, 21, 23, and 37, ending with a single root of weight 60, the length of the string.]

Encoding

1. Count occurrence of all occurring characters   O( )
2. Build priority queue   O( )
3. Build Huffman tree   O( )
4. Create table of codes from tree   O( )
5. Write Huffman tree and coded data to file   O( )
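The encoding steps listed above can be sketched in memory as follows. This is a minimal illustration under stated assumptions, not the course's implementation: the names HuffmanSketch and Node are hypothetical, the bits are kept as a String of '0' and '1' characters instead of being written with BitOutputStream, the input is assumed non-empty, and a decode method is included only to check the round trip.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

final class HuffmanSketch {
    // Hypothetical node type: leaves carry a character, internal nodes only a weight.
    static final class Node implements Comparable<Node> {
        final char ch; final int weight; final Node left, right;
        Node(char ch, int weight, Node left, Node right) {
            this.ch = ch; this.weight = weight; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        @Override public int compareTo(Node other) { return Integer.compare(weight, other.weight); }
    }

    // Steps 1-3: count characters, fill the priority queue, merge until one tree remains.
    // Assumes text is non-empty.
    static Node buildTree(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            pq.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (pq.size() > 1) {
            Node a = pq.remove(), b = pq.remove();
            pq.add(new Node('\0', a.weight + b.weight, a, b));
        }
        return pq.remove();
    }

    // Step 4: walk the tree, appending 0 for left and 1 for right.
    static void fillTable(Node n, String path, Map<Character, String> table) {
        if (n.isLeaf()) {
            table.put(n.ch, path.isEmpty() ? "0" : path); // a lone symbol still needs one bit
            return;
        }
        fillTable(n.left, path + "0", table);
        fillTable(n.right, path + "1", table);
    }

    // Step 5, in memory only: replace each character by its codeword.
    static String encode(String text, Map<Character, String> table) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) bits.append(table.get(c));
        return bits.toString();
    }

    // Inverse walk: 0 goes left, 1 goes right, emit a character at each leaf.
    static String decode(String bits, Node root) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char b : bits.toCharArray()) {
            if (!root.isLeaf()) cur = (b == '0') ? cur.left : cur.right;
            if (cur.isLeaf()) { out.append(cur.ch); cur = root; }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "go go gophers";
        Node root = buildTree(text);
        Map<Character, String> table = new HashMap<>();
        fillTable(root, "", table);
        String bits = encode(text, table);
        System.out.println(table);
        System.out.println(bits.length() + " bits, round trip ok: " + decode(bits, root).equals(text));
    }
}
```

Ties between equal weights may be broken differently from the slides, so individual codewords can differ, but the total encoded length is the same for every Huffman tree built from the same counts.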
Properties of Huffman coding

Want to minimize weighted path length L(T) of tree T
  $L(T) = \sum_{i \in \mathrm{Leaf}(T)} d_i w_i$
  w_i is the weight or count of each codeword i
  d_i is the depth of the leaf corresponding to codeword i
How do we calculate character (codeword) frequencies?
Huffman coding creates pretty full bushy trees?
  When would it produce a “bad” tree?
How do we produce coded compressed data from input efficiently?

Writing code out to file

How do we go from characters to encodings?
  Build Huffman tree
  Root-to-leaf path generates encoding
Need way of writing bits out to file
  Platform dependent?
  Complicated to write bits and read in same ordering
See BitInputStream and BitOutputStream classes
  Depend on each other, bit ordering preserved
How do we know bits come from compressed file?
  Store a magic number

Decoding a message

01100000100001001101

[Figure: the Huffman tree with root weight 60. Successive animation frames consume the bit string one bit at a time while walking down from the root; the first character decoded is G.]
[Animation frames continue: the next two characters decoded are both O, so the output grows from G to GO to GOO.]
[Final animation frames: the last character decoded is D, so 01100000100001001101 decodes to GOOD.]

Decoding

1. Read in tree data   O( )
2. Decode bit string with tree   O( )

Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
E.g.
“ A SIMPLE” encodes as “10101101001000101001110011100000”

Building a Huffman tree

Begin with a forest of single-node trees/tries (leaves)
  Each node/tree/leaf is weighted with character count
  Node stores two values: character and count
Repeat until there is only one node left: root of tree
  Remove two minimally weighted trees from forest
  Create new tree/internal node with minimal trees as children,
    • Weight is sum of children’s weight (no char)
How does process terminate? Finding minimum?
  Remove minimal trees, hummm……

How do we create Huffman Tree/Trie?

Insert weighted values into priority queue
  What are initial weights? Why?
Remove minimal nodes, weight by sums, re-insert
  Total number of nodes?

    // Requires: import java.util.PriorityQueue;
    // TreeNode must implement Comparable<TreeNode>, ordered by weight.
    PriorityQueue<TreeNode> pq = new PriorityQueue<TreeNode>();
    // One leaf per character value k, weighted by its count freq[k].
    for (int k = 0; k < freq.length; k++) {
        pq.add(new TreeNode(k, freq[k], null, null));
    }
    // Merge the two lightest trees until a single tree remains.
    while (pq.size() > 1) {
        TreeNode left = pq.remove();
        TreeNode right = pq.remove();
        pq.add(new TreeNode(0, left.weight + right.weight, left, right));
    }
    TreeNode root = pq.remove();

Creating compressed file

Once we have new encodings, read every character
  Write encoding, not the character, to compressed file
  Why does this save bits?
  What other information needed in compressed file?
How do we uncompress?
  How do we know foo.hf represents compressed file?
  Is suffix sufficient? Alternatives?
Why is Huffman coding a two-pass method?
  Alternatives?

Uncompression with Huffman

We need the trie to uncompress
  000100100010011001101111
As we read a bit, what do we do?
  Go left on 0, go right on 1
  When do we stop? What to do?
How do we get the trie?
  How did we get it originally? Store 256 int/counts
    • How do we read counts?
  How do we store a trie? 20 Questions relevance?
    • Reading a trie? Leaf indicator? Node values?

Other Huffman Issues

What do we need to decode?
  How did we encode? How will we decode?
  What information needed for decoding?
Reading and writing bits: chunks and stopping
  Can you write 3 bits? Why not? Why?
  PSEUDO_EOF
  BitInputStream and BitOutputStream: API
What should happen when the file won’t compress?
  Silently compress bigger? Warn user? Alternatives?

Huffman Complexities

How do we measure? Size of input file, size of alphabet
  Which is typically bigger?
Accumulating character counts: ______
  How can we do this in O(1) time, though not really
Building the heap/priority queue from counts ____
  Initializing heap guaranteed
Building Huffman tree ____
  Why?
Create table of encodings from tree ____
  Why?
Write tree and compressed file _____

Good Compsci 100(e) Assignment?

Array of character/chunk counts, or is this a map?
  Map character/chunk to count, why array?
Priority Queue for generating tree/trie
  Do we need a heap implementation? Why?
Tree traversals for code generation, uncompression
  One recursive, one not, why and which?
Deal with bits and chunks rather than ints and chars
  The good, the bad, the ugly
Create a working compression program
  How would we deploy it? Make it better?
Benchmark for analysis
  What’s a corpus?

Other methods

Adaptive Huffman coding
Lempel-Ziv algorithms
  Build the coding table on the fly while reading document
  Coding table changes dynamically
  Protocol between encoder and decoder so that everyone is always using the right coding scheme
  Works well in practice (compress, gzip, etc.)
More complicated methods
  Burrows-Wheeler (bzip2)
  PPM statistical methods

Data Compression

  Year  Scheme                                   Bits/char
  1967  ASCII                                    7.00
  1950  Huffman                                  4.70
  1977  Lempel-Ziv (LZ77)                        3.94
  1984  Lempel-Ziv-Welch (LZW), Unix compress    3.32
  1987  (LZH) used by zip and unzip              3.30
  1987  Move-to-front                            3.24
  1987  gzip                                     2.71
  1995  Burrows-Wheeler                          2.29
  1997  BOA (statistical data compression)       1.99

Why is data compression important?
How well can you compress files losslessly?
Is there a limit? How to compare?
How do you measure how much information?
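One common way to answer the last question is Shannon entropy, $H = -\sum_i p_i \log_2 p_i$, where p_i is the fraction of the input made up of character i. For codes that assign one codeword per character, such as Huffman codes, H is a lower bound on the average number of bits per character. The sketch below computes this zero-order entropy for a string; the class name EntropySketch is hypothetical, and the measure ignores any dependence between neighboring characters, which is why methods further down the table can do better than it.

```java
import java.util.HashMap;
import java.util.Map;

final class EntropySketch {
    // Zero-order Shannon entropy in bits per character: H = -sum p * log2(p).
    static double entropy(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);
        double h = 0.0;
        for (int count : freq.values()) {
            double p = (double) count / text.length();
            h -= p * (Math.log(p) / Math.log(2)); // log2(p) via natural logs
        }
        return h;
    }

    public static void main(String[] args) {
        String text = "A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS";
        System.out.printf("entropy: %.3f bits/char over %d chars%n",
                          entropy(text), text.length());
    }
}
```

Comparing this number with the bits per character a compressor actually achieves on the same file is one way to see how close a per-character code gets to its limit.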