Data Compression and Huffman Trees (HW 4)

Data Structures Fall 2008
Modified by Eugene Weinstein
Representing Text
• ASCII: a way of representing characters as bits
– Characters are ‘a’, ‘b’, ‘1’, ‘%’, ‘@’, ‘\n’, ‘\t’…
• Each character is represented by a unique
7-bit code. There are 128 possible codes.
• To encode a long text, we encode it
character by character.
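To make the fixed-width scheme concrete, here is a minimal sketch in Java. The class and method names (AsciiBits, encodeChar, encode) are illustrative assumptions, not part of the course code.

```java
// Sketch: encode text as fixed-width 7-bit ASCII codes.
// AsciiBits, encodeChar and encode are hypothetical names used for illustration.
class AsciiBits {
    // Returns the 7-bit binary string for one ASCII character.
    static String encodeChar(char c) {
        if (c > 127) {
            throw new IllegalArgumentException("not a 7-bit ASCII character: " + c);
        }
        String bits = Integer.toBinaryString(c);
        StringBuilder padded = new StringBuilder();
        // Left-pad with zeros so every code is exactly 7 bits long.
        for (int i = bits.length(); i < 7; i++) {
            padded.append('0');
        }
        return padded.append(bits).toString();
    }

    // Encodes a whole string character by character.
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            out.append(encodeChar(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeChar('e'));       // 1100101 (decimal 101)
        System.out.println(encodeChar('&'));       // 0100110 (decimal 38)
        System.out.println(encode("e&").length()); // 14: two characters, 7 bits each
    }
}
```

Every character costs 7 bits here, no matter how often it occurs.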
Inefficiency of ASCII
• Realization: In many natural files, we are
much more likely to see the letter ‘e’ than
the character ‘&’, yet they are both
encoded using 7 bits!
• Solution: Use variable length encoding!
The encoding for ‘e’ should be shorter
than the encoding for ‘&’.
Variable Length Coding
• Assume we know the distribution of characters
(‘e’ appears 1000 times, ‘&’ appears 1 time)
• Each character will be encoded using a number
of bits that shrinks as its frequency grows
(made precise later).
• Need a ‘prefix free’ encoding: if ‘e’ = 001
then we cannot assign ‘&’ to be 0011. Since the
encoding is variable length, the decoder needs to
know when each code stops.
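The prefix-free condition can be checked mechanically. The sketch below assumes codes are given as strings of ‘0’ and ‘1’; PrefixCheck and isPrefixFree are hypothetical names.

```java
// Sketch: test whether a set of codes is prefix free.
// PrefixCheck and isPrefixFree are hypothetical names used for illustration.
class PrefixCheck {
    // Returns true if no code is a prefix of a different code in the array.
    // Two identical codes also count as a violation.
    static boolean isPrefixFree(String[] codes) {
        for (int i = 0; i < codes.length; i++) {
            for (int j = 0; j < codes.length; j++) {
                if (i != j && codes[j].startsWith(codes[i])) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 001 is a prefix of 0011, so this assignment is not allowed.
        System.out.println(isPrefixFree(new String[] {"001", "0011"})); // false
        System.out.println(isPrefixFree(new String[] {"0", "10", "11"})); // true
    }
}
```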
Encoding Trees
• Think of encoding as an (unbalanced) tree.
• Data is in leaf nodes only (prefix free).
• ‘e’ = 0, ‘a’ = 10, ‘b’ = 11
• How to decode ‘01110’?
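One way to decode is to walk the tree from the root, one bit at a time, and emit a character whenever a leaf is reached. The sketch below assumes a 0 bit follows the left child and a 1 bit follows the right child; the Node class and the method names are illustrative, not the homework classes.

```java
// Sketch: decode a bit string by walking an encoding tree.
// Assumption: 0 = go to the 'zero' child, 1 = go to the 'one' child.
class DecodeDemo {
    static class Node {
        final char symbol;     // meaningful only in leaves
        final Node zero, one;  // both null in leaves

        Node(char symbol) { this.symbol = symbol; this.zero = null; this.one = null; }
        Node(Node zero, Node one) { this.symbol = '\0'; this.zero = zero; this.one = one; }

        boolean isLeaf() { return zero == null && one == null; }
    }

    // Tree for the codes 'e' = 0, 'a' = 10, 'b' = 11.
    static Node exampleTree() {
        return new Node(new Node('e'), new Node(new Node('a'), new Node('b')));
    }

    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char b : bits.toCharArray()) {
            cur = (b == '0') ? cur.zero : cur.one;
            if (cur.isLeaf()) {   // a complete code has been read
                out.append(cur.symbol);
                cur = root;       // start over for the next character
            }
        }
        if (cur != root) {
            throw new IllegalArgumentException("bit string ends in the middle of a code");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(exampleTree(), "01110")); // eba
    }
}
```

Because data sits only in leaves, the decoder never has to guess where a code ends: 0 | 11 | 10 gives e, b, a.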
Cost of a Tree
• For each character ci let fi be its frequency
in the file.
• Given an encoding tree T, let di be the
depth of ci in the tree (number of bits
needed to encode the character).
• The length of the file after encoding it with
the coding scheme defined by T will be
C(T)= Σdi fi
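As a small worked example of the cost formula, the sketch below evaluates C(T) for the tree ‘e’ = 0, ‘a’ = 10, ‘b’ = 11. The frequencies (5, 2, 1) are made up for illustration, and TreeCost is a hypothetical name.

```java
// Sketch: compute C(T) = sum of depth * frequency over all characters.
// The frequencies used in main are invented for illustration only.
class TreeCost {
    static long cost(int[] depths, int[] freqs) {
        if (depths.length != freqs.length) {
            throw new IllegalArgumentException("need one depth per frequency");
        }
        long total = 0;
        for (int i = 0; i < depths.length; i++) {
            total += (long) depths[i] * freqs[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] depths = {1, 2, 2}; // 'e' = 0, 'a' = 10, 'b' = 11
        int[] freqs  = {5, 2, 1}; // hypothetical counts for e, a, b
        System.out.println(cost(depths, freqs)); // 11 = 1*5 + 2*2 + 2*1
        System.out.println(7 * (5 + 2 + 1));     // 56 bits with fixed 7-bit codes
    }
}
```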
Creating an Optimal T
• Problem: Find tree T with C(T) minimal.
• Solution (Huffman 1952):
– Create a tree for each character. The weight of the
tree W(T) is the frequency of the character.
– Repeat n-1 times (n = number of chars)
• Select trees T’, T’’ with lowest weights. Merge them together
to form T.
• Set W(T)= W(T’) + W(T’’)
• Implement using a min-heap.
• What is the running time?
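The merge loop above can be sketched as follows. This sketch swaps in java.util.PriorityQueue as the min-heap instead of the BinaryHeap class given in class, and the names HuffmanSketch, Tree, build and codes are assumptions made for illustration.

```java
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Sketch of Huffman's algorithm using java.util.PriorityQueue as the min-heap
// (a stand-in for the course's BinaryHeap). All names here are hypothetical.
class HuffmanSketch {
    static class Tree implements Comparable<Tree> {
        final int weight;
        final char symbol;      // meaningful only in leaves
        final Tree left, right; // both null in leaves

        Tree(char symbol, int weight) {
            this.symbol = symbol; this.weight = weight; this.left = null; this.right = null;
        }
        Tree(Tree left, Tree right) {
            this.symbol = '\0'; this.weight = left.weight + right.weight; // W(T) = W(T') + W(T'')
            this.left = left; this.right = right;
        }
        @Override public int compareTo(Tree other) {
            return Integer.compare(weight, other.weight);
        }
    }

    // Builds the tree; returns null if there are no characters at all.
    static Tree build(Map<Character, Integer> freq) {
        PriorityQueue<Tree> heap = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            heap.add(new Tree(e.getKey(), e.getValue())); // one single-node tree per character
        }
        while (heap.size() > 1) {         // runs n-1 times
            Tree a = heap.poll();         // lowest weight
            Tree b = heap.poll();         // second lowest weight
            heap.add(new Tree(a, b));     // merged tree goes back into the heap
        }
        return heap.poll();
    }

    // Collects the code of every leaf: 0 for a left edge, 1 for a right edge.
    static Map<Character, String> codes(Tree root) {
        Map<Character, String> out = new TreeMap<>();
        if (root != null) collect(root, "", out);
        return out;
    }

    private static void collect(Tree t, String prefix, Map<Character, String> out) {
        if (t.left == null) { // leaf
            out.put(t.symbol, prefix.isEmpty() ? "0" : prefix); // single-character input
            return;
        }
        collect(t.left, prefix + "0", out);
        collect(t.right, prefix + "1", out);
    }

    public static void main(String[] args) {
        Map<Character, Integer> freq = new TreeMap<>();
        freq.put('e', 5); freq.put('a', 2); freq.put('b', 1); // made-up counts
        System.out.println(codes(build(freq)));
    }
}
```

With a binary heap, each of the n-1 iterations performs a constant number of O(log n) heap operations, so the loop takes O(n log n) time.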
Optimality Intuition
• Need to show that Huffman’s algorithm
indeed results in a tree T with optimal
C(T)= Σdi fi.
• The two lowest-weight letters should be at the
bottom of the tree as siblings (otherwise we
could improve the cost by swapping).
• Intuitively, when we combine trees we can
think of this as a new letter with the combined
frequency.
• Implement:
– public class HuffmanTree
• Has traversal/code printing method
– public class HuffmanNode (implements Comparable)
• Contains letter, integer frequency
• Has accessor (getter) methods
– public class BinaryHeap (given in class)
• Read a file ‘huff.txt’ which includes letters and
their frequencies:
– A 20 E 24 G 3 H 4 I 17 L 6 N 5 O 10 S 8 V 1 W 2
• Create a Huffman tree (algorithm: book, pp. 389-395)
• Print “legend”: the code of each character
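Reading the input could look like the sketch below. It assumes the file holds whitespace-separated letter/frequency pairs, as in the sample line above; FreqParser and its methods are hypothetical names, not part of the assignment's required classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

// Sketch: parse "letter frequency" pairs separated by whitespace.
// Assumption: the format of huff.txt matches the sample line on the slide.
class FreqParser {
    static Map<Character, Integer> parse(Scanner in) {
        Map<Character, Integer> freq = new LinkedHashMap<>(); // keeps file order
        while (in.hasNext()) {
            String letter = in.next();
            if (!in.hasNextInt()) {
                throw new IllegalArgumentException("missing frequency after " + letter);
            }
            freq.put(letter.charAt(0), in.nextInt());
        }
        return freq;
    }

    // Convenience overload for parsing text that is already in memory.
    static Map<Character, Integer> parse(String text) {
        return parse(new Scanner(text));
    }

    public static void main(String[] args) {
        String sample = "A 20 E 24 G 3 H 4 I 17 L 6 N 5 O 10 S 8 V 1 W 2";
        System.out.println(parse(sample));
    }
}
```

To read the real file, pass `new Scanner(new java.io.File("huff.txt"))` to parse and handle the FileNotFoundException it may throw.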
Tips and Implementation Notes
• HuffmanNode should be Comparable to work
with BinaryHeap
– How to implement compareTo method?
• Implement toString method in BinaryHeap
– Print heap after every rearrangement
• Understand binary heap operations:
– insert
– deleteMin
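The sketch below shows one possible compareTo (ordering nodes by frequency) together with a minimal array-based min-heap, to illustrate insert and deleteMin. MiniHeap is a hypothetical stand-in: the BinaryHeap handed out in class may differ in names and details.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a node ordered by frequency, so the heap's minimum is the rarest letter.
class HuffmanNode implements Comparable<HuffmanNode> {
    private final char letter;
    private final int frequency;

    HuffmanNode(char letter, int frequency) { this.letter = letter; this.frequency = frequency; }

    char getLetter() { return letter; }
    int getFrequency() { return frequency; }

    @Override public int compareTo(HuffmanNode other) {
        return Integer.compare(frequency, other.frequency);
    }
    @Override public String toString() { return letter + ":" + frequency; }
}

// Hypothetical minimal min-heap; not the BinaryHeap class given in class.
class MiniHeap<T extends Comparable<T>> {
    private final List<T> items = new ArrayList<>();

    int size() { return items.size(); }

    // insert: append at the end, then percolate up while smaller than the parent.
    void insert(T x) {
        items.add(x);
        int i = items.size() - 1;
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (items.get(i).compareTo(items.get(parent)) >= 0) break;
            swap(i, parent);
            i = parent;
        }
    }

    // deleteMin: remove the root, move the last item to the root, percolate down.
    T deleteMin() {
        if (items.isEmpty()) throw new IllegalStateException("heap is empty");
        T min = items.get(0);
        T last = items.remove(items.size() - 1);
        if (!items.isEmpty()) {
            items.set(0, last);
            int i = 0;
            while (true) {
                int left = 2 * i + 1, right = left + 1, smallest = i;
                if (left < items.size() && items.get(left).compareTo(items.get(smallest)) < 0) smallest = left;
                if (right < items.size() && items.get(right).compareTo(items.get(smallest)) < 0) smallest = right;
                if (smallest == i) break;
                swap(i, smallest);
                i = smallest;
            }
        }
        return min;
    }

    private void swap(int a, int b) {
        T tmp = items.get(a);
        items.set(a, items.get(b));
        items.set(b, tmp);
    }

    // Prints the underlying array in level order, handy after each rearrangement.
    @Override public String toString() { return items.toString(); }
}
```

Calling System.out.println(heap) after each insert or deleteMin shows how the array is rearranged at every step.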