CS100-Bits and Huffman Encoding.pptx

advertisement
Bits and Huffman Encoding
Please get a piece of paper and a pen
and put your name and netid on it.
Make sure you can turn in it after class
without losing your notes.
How is data stored in a computer?
• Several different ways are possible
• In an ASCII-encoded text file (hithero known
as a “text file”), each character consists of 8bit chunks
• What’s a bit?
Bits Mean What You Want Them To Mean
01001101 01001001 01001011 01000101
M
K
I
E
But why 8-bit chunks? Couldn’t we interpret the same string as 4 bit chunks?
0100 1101 0100 1001 0100 1011 0100 0101
4
13
4
9
4
11
4
5
With 8 bit chucks there are 28 (256) possible characters. With 4 bit chunks there are
24 (16) possible characters.
What if we want to save memory?
• Could just use fewer bits per chunk, but that
limits the number of characters we can
represent
• But my document contains a lot more ‘a’s than
‘%’s. What if the code for a was “01” and the
code for % was “10000101”
A Problem
Mike decides he wants to compress his vast collection of Harry Potter
fanfic (all stored a textfiles). He notices that the character ‘H’
occurs much more frequently in the textfiles than does the
character ‘%’, but they both use 8 bits. So he changes the encoding
so that ‘H’ is represented by ‘01’ (2 bits) and ‘%’ is represented by
‘1010101010101010’ (16 bits). Assuming all other characters stay
at 8 bits (a bit unlikely…but just pretend)…how much more frequent
do ‘H’s need to be than ‘%’s for there to be a space savings?
1.
2.
3.
4.
There must be more than 3Hs for every %s
There must be more than 3Hs for every 2%s
There must be more than 4Hs for every 3%s
As long as there more Hs than %s you get a savings
• We can get savings if we know what
characters occur frequently
• So imagine you look at some set of data with
character frequencies. How much can you
save?
Consider the string:
• AACCCAAABAADAE
Letter
Frequency
A
8
C
3
B
1
D
1
E
1
What’s the best
encoding?
Why will this encoding not work?
Letter
Frequency
Encoding
A
8
0
C
3
1
B
1
10
D
1
01
E
1
11
What is the encoding for
AAE?
What is the encoding for
ADC?
Consider the string:
• AACCCAAABAADAE
Letter
Frequency
Encoding
A
8
?
C
3
?
B
1
?
D
1
?
E
1
?
The rule: one
character’s encoding
cannot be the prefix
of another. So if
A=01, no other
encoding can begin
with 01
Work with your row to find the most efficient possible encoding for each character. I was
able to encode the entire string in 25 bits. See if you can do it that efficiently – but be
careful with prefixes. Write your encoding, plus your computation of its total length, on
your handin sheet.
My Encoding
• AACCCAAABAADAE
Letter
Frequency
Encoding
A
8
0
C
3
11
B
1
100
D
1
1010
E
1
1011
The rule: one
character’s encoding
cannot be the prefix
of another. So if
A=01, no other
encoding can begin
with 01
There a several possible encodings that can give you 25. I’m pretty sure there’s no more
efficient encoding, at least as far as compressing individual characters.
Given a set of frequencies, how
should you select the character
encoding for maximum efficiency?
Huffman Encoding
The basic idea
• We will make a “huffman tree”. By repeatedly
combining nodes with small frequencies into
“meta nodes” with larger frequency
• A letter’s position in the tree will tell us what
its encoded form is
Generate a huffman tree and
encodings for this circumstance
Letter2
Frequency
Encoding
A
12
?
C
9
?
B
8
?
D
8
?
E
5
?
F
5
?
Follow the tutorial
linked off the
resources section in
Sakai
Write the huffman tree and the resultant encoding on your handin
sheet.
The Header
• The Magic Number
• Counts
• Or a tree for extra credit
The Psuedo-EOF character
• A made up character to write to your file
Please turn your sheet in at the back
of the room (try for a neat pile)
Download