From bits to bytes to ints

advertisement
From bits to bytes to ints

At some level everything is stored as either a zero or a one
 A bit is a binary digit a byte is a binary term (8 bits)
 We should be grateful we can deal with Strings rather than
sequences of 0's and 1's.
 We should be grateful we can deal with an int rather than the
32 bits that make an int

Int values are stored as two's complement numbers with 32 bits,
for 64 bits use the type long, a char is 16 bits
 Standard in Java, different in C/C++
 Facilitates addition/subtraction for int values
 We don't need to worry about this, except to note:
• Integer.MAX_VALUE + 1 = Integer.MIN_VALUE
• Math.abs(Integer.MIN_VALUE) != Infinity
CPS 100
12.1
How are data stored?

To facilitate Huffman coding we need to read/write one bit
 Why do we need to read one bit?
 Why do we need to write one bit?
 When do we read 8 bits at a time? Read 32 bits at a time?

We can't actually write one bit-at-a-time. We can't really
write one char at a time either.
 Output and input are buffered,minimize memory
accesses and disk accesses
 Why do we care about this when we talk about data
structures and algorithms?
• Where does data come from?
CPS 100
12.2
How do we buffer char output?

Done for us as part of InputStream and Reader classes
 InputStreams are for reading bytes
 Readers are for reading char values
 Why do we have both and how do they interact?
Reader r = new InputStreamReader(System.in);
 Do we need to flush our buffers?

In the past Java IO has been notoriously slow
 Do we care about I? About O?
 This is changing, and the java.nio classes help
• Map a file to a region in memory in one operation
CPS 100
12.3
Buffer bit output

To buffer bit output we need to store bits in a buffer
 When the buffer is full, we write it.
 The buffer might overflow, e.g., in process of writing 10
bits to 32-bit capacity buffer that has 29 bits in it
 How do we access bits, add to buffer, etc.?

We need to use bit operations
 Mask bits -- access individual bits
 Shift bits – to the left or to the right
 Bitwise and/or/negate bits
CPS 100
12.4
Representing pixels

A pixel typically stores RGB and alpha/transparency values
 Each RGB is a value in the range 0 to 255
 The alpha value is also in range 0 to 255
Pixel red = new Pixel(255,0,0,0);
Pixel white = new Pixel(255,255,255,0);

Typically store these values as int values, a picture is simply an
array of int values
void process(int pixel){
int blue = pixel & 0xff;
int green = (pixel >> 8) & 0xff;
int red = (pixel>> 16) & 0xff;
}
CPS 100
12.5
Bit masks and shifts


void process(int pixel){
int blue = pixel & 0xff;
int green = (pixel >> 8) & 0xff;
int red
= (pixel >> 16) & 0xff;
}
Hexadecimal number: 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f
 Note that f is 15, in binary this is 1111, one less than
10000
 The hex number 0xff is an 8 bit number, all ones
The bitwise & operator creates an 8 bit value, 0—255 (why)
 1&1 == 1, otherwise we get 0, similar to logical and
 Similarly we have |, bitwise or
CPS 100
12.6
Bit operations revisited

How do we write out all of the bits of a number
/**
* writes the bit representation of a int
* to standard out
*/
void bits(int val)
{
CPS 100
12.7
Text Compression


Input: String S
Output: String S
 Shorter
 S can be reconstructed from S
CPS 100
12.8
Text Compression: Examples
“abcde” in the different formats
ASCII:
01100001011000100110001101100100…
Fixed:
000001010011100
1
Encodings
ASCII: 8 bits/character
Unicode: 16 bits/character
0
Symbol
ASCII
Fixed
length
Var.
length
a
01100001
000
000
b
01100010
001
11
c
01100011
010
01
d
01100100
011
001
e
01100101
100
10
0
0
a
Var:
000110100110
1
1
b
0
1
c
d
0
0
e
1
0
0
0
CPS 100
0
a
1
1
d
c
0
e
1
b
12.9
Huffman coding: go go gophers
ASCII
g 103
o 111
p 112
h 104
e 101
r 114
s 115
sp. 32

3 bits Huffman
1100111
1101111
1110000
1101000
1100101
1110010
1110011
1000000
000
001
010
011
100
101
110
111
00
01
1100
1101
1110
1111
101
101
13
7
6
g
o
3
3
Encoding uses tree:
 0 left/1 right
 How many bits? 37!!
 Savings? Worth it?
4
3
s
*
1
2
2
2
p
h
e
r
1
1
1
1
6
4
2
2
g
o
3
3
3
CPS 100
p
h
e
r
1
1
1
1
s
*
1
2
12.10
Huffman Coding





D.A Huffman in early 1950’s
Before compressing data, analyze the input stream
Represent data using variable length codes
Variable length codes though Prefix codes
 Each letter is assigned a codeword
 Codeword is for a given letter is produced by traversing
the Huffman tree
 Property: No codeword produced is the prefix of another
 Letters appearing frequently have short codewords,
while those that appear rarely have longer ones
Huffman coding is optimal per-character coding method
CPS 100
12.11
Building a Huffman tree

Begin with a forest of single-node trees (leaves)
 Each node/tree/leaf is weighted with character count
 Node stores two values: character and count
 There are n nodes in forest, n is size of alphabet?

Repeat until there is only one node left: root of tree
 Remove two minimally weighted trees from forest
 Create new tree with minimal trees as children,
• New tree root's weight: sum of children (character ignored)

Does this process terminate? How do we get minimal trees?
 Remove minimal trees, hummm……
CPS 100
12.12
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11
6
5
5
4
4
3
3
3
3
2
2
2
2
2
1
1
I
E
N
S
M
A
B
O
T
G
D
L
R
U
P
F
CPS 100
1
12.13
C
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
2
11
1
1
F
C
6
5
5
4
4
3
3
3
3
I
E
N
S
M
A
B
O
T
CPS 100
2
2
2
2
2
2
1
G
D
L
R
U
P
12.14
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
2
11
3
1
1
1
2
C
F
P
U
6
5
5
4
4
I
E
N
S
M
CPS 100
3
3
3
3
3
A
B
O
T
2
2
2
2
2
G
D
L
R
12.15
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
4
2
11
3
1
1
1
2
C
F
P
U
6
5
5
I
E
N
CPS 100
4
4
4
S
M
3
2
2
L
R
2
3
3
3
3
A
B
O
T
2
2
G
D
12.16
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
4
2
11
3
1
1
1
2
C
F
P
U
6
5
5
I
E
N
CPS 100
4
4
4
4
S
M
3
4
2
2
2
2
G
D
L
R
3
3
3
3
A
B
O
T
2
12.17
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
5
3
4
2
3
O
11
6
1
1
1
2
C
F
P
U
CPS 100
I
5
5
5
E
N
4
4
4
4
S
M
4
2
2
2
2
G
D
L
R
3
3
3
A
B
T
3
12.18
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
5
3
6
2
3
O
11
6
4
3
T
1
1
1
2
C
F
P
U
CPS 100
6
I
5
5
5
E
N
4
4
4
2
2
2
2
G
D
L
R
4
4
3
3
S
M
A
B
12.19
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
6
5
3
6
2
3
O
11
6
4
3
T
1
1
1
2
C
F
P
U
CPS 100
6
6
I
5
5
5
E
N
4
4
4
2
2
2
2
G
D
L
R
4
4
S
M
3
3
A
B
12.20
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
6
8
5
3
6
2
3
O
11
8
4
4
S
M
4
3
T
1
1
1
2
C
F
P
U
CPS 100
6
6
6
I
5
5
5
E
N
4
4
2
2
2
2
G
D
L
R
4
3
3
A
B
12.21
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
8
5
3
6
2
3
O
11
8
4
4
S
M
4
3
T
1
1
1
2
C
F
P
U
CPS 100
8
6
6
6
I
5
6
8
5
5
E
N
4
2
2
2
2
G
D
L
R
3
3
A
B
12.22
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
10
5
8
5
6
E
3
2
3
O
11
10
4
4
S
M
4
3
T
1
1
1
2
C
F
P
U
CPS 100
8
8
6
6
6
I
5
6
8
5
N
4
2
2
2
2
G
D
L
R
3
3
A
B
12.23
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
10
5
11
5
5
E
6
N
3
2
3
O
11
8
11
3
T
1
1
1
2
C
F
P
U
10
CPS 100
8
8
6
6
I
6
8
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.24
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
12
10
11
8
6
8
6
I
5
5
5
E
N
3
2
3
O
12
6
11
3
T
1
1
1
2
C
F
P
U
11
CPS 100
10
8
8
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.25
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
16
10
11
12
8
6
8
6
I
5
5
5
E
N
3
2
3
O
16
6
12
3
T
1
1
1
2
C
F
P
U
11
CPS 100
11
10
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.26
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
21
16
10
11
12
8
6
8
6
I
5
5
5
E
N
3
2
3
O
21
6
16
3
T
1
1
1
2
C
F
P
U
12
CPS 100
11
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.27
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
N
3
2
3
O
23
6
21
3
T
1
1
1
2
C
F
P
U
16
CPS 100
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.28
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
N
3
2
3
O
37
6
23
3
T
1
1
1
2
C
F
P
U
CPS 100
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.29
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
N
3
2
3
O
60
6
3
T
1
1
1
2
C
F
P
U
CPS 100
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.30
Huffman Complexities

How do we measure? Size of input file, size of alphabet
 Which is typically bigger?

Accumulating character counts: ______
 How can we do this in O(1) time, though not really
Building the heap/priority queue from counts ____
 Initializing heap guaranteed
Building Huffman tree ____
 Why?
Create table of encodings from tree ____
 Why?
Write tree and compressed file _____




CPS 100
12.31
Properties of Huffman coding

Want to minimize weighted path length L(T)of tree T
L(T ) 

d w
i
iLeaf ( T )
i
wi is the weight or count of each codeword i
 di is the leaf corresponding to codeword i
How do we calculate character (codeword) frequencies?
Huffman coding creates pretty full bushy trees?
 When would it produce a “bad” tree?
How do we produce coded compressed data from input
efficiently?




CPS 100
12.32
Writing code out to file

How do we go from characters to encodings?
 Build Huffman tree
 Root-to-leaf path generates encoding

Need way of writing bits out to file
 Platform dependent?
 Complicated to write bits and read in same ordering

See BitInputStream and BitOutputStream classes
 Depend on each other, bit ordering preserved

How do we know bits come from compressed file?
 Store a magic number
CPS 100
12.33
Decoding a message
01100000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.34
Decoding a message
1100000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.35
Decoding a message
100000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.36
Decoding a message
00000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.37
Decoding a message
0000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
G
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.38
Decoding a message
000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
G
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.39
Decoding a message
00100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
G
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.40
Decoding a message
0100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
G
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.41
Decoding a message
100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
G
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.42
Decoding a message
00001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
GO
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.43
Decoding a message
0001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
GO
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.44
Decoding a message
001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
GO
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.45
Decoding a message
01001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
GO
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.46
Decoding a message
1001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
GO
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.47
Decoding a message
001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
GOO
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.48
Decoding a message
01101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
GOO
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.49
Decoding a message
1101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
GOO
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.50
Decoding a message
101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
GOO
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.51
Decoding a message
01
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
4
4
S
M
3
T
1
1
1
2
C
F
P
U
GOO
4
4
2
2
2
2
G
D
L
R
3
3
A
B
12.52
Decoding a message
1
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
GOOD
3
3
A
B
12.53
Decoding a message
01100000100001001101
60
37
23
21
16
10
11
12
8
11
6
8
6
I
5
5
5
E
6
N
3
2
3
O
CPS 100
3
T
1
1
1
2
C
F
P
U
4
4
S
M
4
4
2
2
2
2
G
D
L
R
GOOD
3
3
A
B
12.54
Decoding
1.
Read in tree data
O(
)
2.
Decode bit string with tree
O(
)
CPS 100
12.55
Huffman coding: go go gophers
ASCII
g 103
o 111
p 112
h 104
e 101
r 114
s 115
sp. 32


3 bits Huffman
1100111
1101111
1110000
1101000
1100101
1110010
1110011
1000000
000
001
010
011
100
101
110
111
??
??
choose two smallest weights
 combine nodes + weights
 Repeat
 Priority queue?
Encoding uses tree:
 0 left/1 right
 How many bits?
CPS 100
g
o
p
h
e
r
s
*
3
3
1
1
1
1
1
2
2
2
3
p
h
e
r
s
*
1
1
1
1
1
2
6
4
2
2
p
h
e
r
1
1
1
1
g
o
3
3
12.56
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CPS 100
12.57
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CPS 100
12.58
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CPS 100
12.59
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CPS 100
12.60
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CPS 100
12.61
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CPS 100
12.62
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CPS 100
12.63
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CPS 100
12.64
Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” 
“10101101001000101001110011100000”
CPS 100
12.65
Other methods



Adaptive Huffman coding
Lempel-Ziv algorithms
 Build the coding table on the fly while reading
document
 Coding table changes dynamically
 Protocol between encoder and decoder so that everyone
is always using the right coding scheme
 Works well in practice (compress, gzip, etc.)
More complicated methods
 Burrows-Wheeler (bunzip2)
 PPM statistical methods
CPS 100
12.66
Download