Compression Codings

Redundancy in Binary Coding
A redundant binary representation (RBR) is a numeral system that uses more bits than are required to represent a single binary digit, so most numbers have several representations. RBR makes bitwise logical operations slower than a non-redundant representation, but its arithmetic operations are faster when large bit widths are used.
The value represented by an RBR digit is found using a translation table that gives the mathematical value of each possible pair of bits, since RBR uses two bits for every place. The weighting starts at 1 for the rightmost position and doubles with each subsequent position. Most integers have several possible representations, and negative values can be represented.
An integer value can be converted back from RBR using the formula ∑_{k=0}^{n−1} d_k·2^k, where n is the number of digits and d_k is the interpreted value of the k-th digit, with k starting at 0 at the rightmost position.
An example of a translation table is shown below. Using that table, the number 1 can be represented in many ways, such as "01·01·01·11", "01·01·10·11", "01·01·11·00" or "11·00·00·00". The additive inverse of the represented integer can be found by flipping all the bits.
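
A small sketch of this decoding rule follows. The pair-to-value table used here (00 → −1, 01 → 0, 10 → 0, 11 → +1) is inferred from the example representations of 1 above; other translation tables are possible.

# Translation table inferred from the examples above; other RBR tables exist.
PAIR_VALUE = {"00": -1, "01": 0, "10": 0, "11": 1}

def rbr_value(digits):
    """Evaluate an RBR string such as "01·01·11·00" as the sum of d_k * 2^k,
    where k = 0 is the rightmost pair."""
    total = 0
    for k, pair in enumerate(reversed(digits.split("·"))):
        total += PAIR_VALUE[pair] * (2 ** k)
    return total

def rbr_negate(digits):
    """Flip every bit; this yields the additive inverse."""
    return "".join({"0": "1", "1": "0"}.get(ch, ch) for ch in digits)

for example in ["01·01·01·11", "01·01·10·11", "01·01·11·00", "11·00·00·00"]:
    print(example, "=", rbr_value(example), "  negated:", rbr_value(rbr_negate(example)))
# Every example evaluates to 1, and every negated example evaluates to -1.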
Huffman codes to binary data
The symbols below were collected from a page of text. All the alphabetic characters were counted, and the symbols occurring with the highest probability were 'a', 'e', 'o' and 't'.
Huffman coding of the four most frequent symbols:

Symbol | Frequency in text | Probability of occurrence | log2(1/p)            | Huffman code length | Code created by Huffman algorithm
e      | 209               | 0.29                      | log2(712/209) = 1.8  | 2                   | 00
t      | 181               | 0.25                      | log2(712/181) = 2.0  | 2                   | 10
a      | 163               | 0.23                      | log2(712/163) = 2.1  | 2                   | 01
o      | 159               | 0.23                      | log2(712/159) = 2.2  | 2                   | 11

(712 is the total number of characters counted: 209 + 181 + 163 + 159.)
Huffman Probability Diagram (described): in Step 1 the two least probable symbols, a (0.23) and o (0.23), are combined into a node of probability 0.46, with branch labels 0 and 1. In Step 2, e (0.29) and t (0.25) are combined into a node of probability 0.54, again with branch labels 0 and 1. In Step 3 the nodes 0.54 and 0.46 are combined, with branch labels 0 and 1, giving a total probability of 1.0.
Each codeword is obtained by reading the branch labels from the root back down to the leaf, i.e. by arranging the binary digits in reverse order, giving the codewords {00, 10, 01, 11}. In general the codeword lengths vary with symbol probability; in this example all four happen to have length 2. The code is written out as a string of '1's and '0's, which is then packed into byte (char) values that can be written to a file. The binary tree is constructed from its leaves towards the root. Huffman coding is a greedy algorithm: at each step the two smallest probabilities are combined, as can be seen from the probability diagram above. In the first step the probabilities of 'a' and 'o' are combined, as these two are the least probable. In the second step the probabilities of 'e' and 't' are combined. In the third step the probabilities add up to 1. Let c_i be the number of occurrences of the i-th symbol in the data string and l_i the length (in bits) of its codeword; the total length of the code string is the sum of l_i·c_i over all symbols. If the codeword assigned to the most frequently occurring symbol is long, the compression is likely to fail; this can happen if the wrong statistics are used when building the code. With this sort of inefficiency the code string can contain more bits than the original data.
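
The greedy merging described above can be sketched in a few lines of Python, using the frequencies from the table. The exact codewords depend on tie-breaking, so they may differ from the {00, 10, 01, 11} shown while having the same lengths.

import heapq

def huffman_codes(freqs):
    """Build Huffman codes by repeatedly merging the two least frequent nodes."""
    # Each heap entry is (frequency, tie_breaker, {symbol: code_built_so_far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)       # least frequent node
        f2, _, codes2 = heapq.heappop(heap)       # second least frequent node
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

frequencies = {"e": 209, "t": 181, "a": 163, "o": 159}
codes = huffman_codes(frequencies)
print(codes)   # {'o': '00', 'a': '01', 't': '10', 'e': '11'} -- all length 2
# Total code-string length = sum over symbols of (code length * symbol count)
print(sum(len(codes[s]) * f for s, f in frequencies.items()))   # 1424 bits for 712 characters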
Adaptive Huffman Coding
The distribution of characters may not be consistent throughout a text. For example, there could be many 'a' characters at the beginning of a file and fewer later on. In this scenario the Huffman tree may be adjusted as the file is processed. As can be seen in the example above, in static Huffman coding the character 'o' sits low in the tree because of its low frequency of occurrence compared with the other characters. If the letter 'o' appeared with high frequency at the beginning of the text, it would initially be placed high in the tree and would then be pushed down as other characters became more frequent. The file could also be split into sections, with a different tree for each section. This is the idea behind Adaptive Huffman Coding, also known as Dynamic Huffman Coding. The difference between static and dynamic Huffman coding is that in dynamic coding the code responds to the changing symbol counts as they are observed, whereas in static Huffman coding the frequencies of occurrence are known in advance. In the static case the Huffman tree has to be sent at the beginning of the compressed file so that the decompressor can decode it. Adaptive Huffman coding can be used as a universal method and can be more effective than static Huffman coding because it constantly evolves. To decode adaptive Huffman coding, a Huffman tree is built in the same way as the one used for encoding. Starting from the root of the tree, the first symbol is decoded and given a frequency count of 1; then the next symbol is decoded, and this is repeated until all the symbols are decoded. So in the example above the symbol 'e' would be decoded from 00. When encoding a file, the compression ratio rises as the number of symbols encoded increases, up to a certain point; it then levels off and does not rise beyond it.
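
A minimal way to illustrate the idea, rather than the incremental FGK/Vitter algorithms used in practice, is to keep running symbol counts and rebuild a static Huffman table every so often, so the codes follow the changing statistics. The sketch below assumes the huffman_codes function from the earlier sketch is available.

from collections import Counter

def adaptive_encode(text, rebuild_every=100):
    """Toy adaptive scheme: rebuild the Huffman table from the running counts
    every `rebuild_every` symbols. Real adaptive Huffman coding (FGK/Vitter)
    updates the tree incrementally after every single symbol instead."""
    counts = Counter()
    codes = {}
    output = []
    for i, symbol in enumerate(text):
        if i and i % rebuild_every == 0:
            codes = huffman_codes(dict(counts))   # reuses the sketch above
        # A real coder would emit an escape code for symbols not yet in the
        # table; here they are simply marked with '?'.
        output.append(codes.get(symbol, "?"))
        counts[symbol] += 1
    return "".join(output)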
http://www.codepedia.com/1/Art_Huffman_p2
Huffman coding is very efficient: depending on the data, it can compress by up to approximately 90%. This sort of compression is used in many different applications; as discussed in later chapters, it is used in JPEG files.
Unequal letter costs
In the Huffman code above, suppose the code of the symbol 'a' is 01 and the code of the symbol 'e' is 00. It does not matter how many 1s or 0s appear in a codeword, because a 1 and a 0 have an equal cost to transmit.
An example of coding in which different symbols have different costs is Morse code. A 'dash' takes longer to send than a 'dot', so sending a dash is less efficient in terms of transmission cost than sending a dot. The way to learn Morse code is to listen to a slow recording of it: you hear a combination of dashes and dots, where a dash is a longer beep and a dot a shorter beep. After every letter there is a short pause, and after every word there is a longer pause. Mastering the code therefore takes a lot of listening practice. Morse code was for a long time an efficient way to send information by radio, and it can be sent using sound or light, which is sometimes done by ships at sea; distress signals have been transmitted in Morse code. Morse code was effective in its time, but there are now more efficient codings, with encryption where required, and Morse is slow compared with modern communication. The 'space' is a very frequently used symbol, so under Huffman coding it would receive one of the shortest codewords, perhaps one or two bits. A symbol such as '@' is used infrequently, so it could be given, say, 12 bits and this would still be efficient. In our sample, the optimum occurs when the number of bits used for the symbols 'a', 'e', 'o' and 't' is proportional to the negative logarithm of their probability of occurrence.
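
This proportionality can be checked directly with the probabilities from the table above; the values of −log2(p) come out close to the 2-bit codewords that the Huffman algorithm produced.

import math

probabilities = {"e": 0.29, "t": 0.25, "a": 0.23, "o": 0.23}
for symbol, p in probabilities.items():
    ideal = -math.log2(p)     # information content: the "ideal" codeword length in bits
    print(f"{symbol}: -log2({p}) = {ideal:.2f} bits")
# e: 1.79 bits, t: 2.00 bits, a: 2.12 bits, o: 2.12 bits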
Random-access memory and Read only memory
Random-access memory (RAM) and read-only memory (ROM) are forms of computer data storage: integrated circuits that store data which can be accessed in any order. RAM is typically a volatile type of memory, meaning the information is lost when the power is switched off. ROM cannot normally be modified, so it is used for storing data that will not require modification, for example look-up tables used in computers to evaluate logical and mathematical functions.
In some situations it is possible to get more out of RAM and ROM by compressing code, storing it in ROM and decompressing it when required. Huffman coding is one compression method used for this.
American Standard Code for Information Interchange
Programming languages can use ASCII (American Standard Code for Information Interchange) coding for symbols; it is the most commonly used code chart. The encoding in ASCII is based on the ordering of the English alphabet. The name preferred by the Internet Assigned Numbers Authority is US-ASCII.
ASCII was originally based on teleprinter encoding systems and defines 128 characters, some of which are nowadays obsolete. It allows digital devices to communicate with each other and to process, store and communicate information input by humans, such as written language.
ASCII consists of control codes and graphic codes. The control codes are characters that perform an action rather than appearing in the data itself, while graphic characters are those presented in a form that can be read by humans.
The code was designed so that the control codes and graphic codes are grouped together. From the chart it can be seen that the first 33 positions (codes 0 to 32) and the last position (code 127) are the control characters. The 'space' character was placed at code 32, just before the graphic characters, partly to make sorting easy [1]. The remaining positions (codes 33 to 126) are classified as graphic characters. Uppercase and lowercase letters were assigned separate codes, with the lowercase letters placed after the uppercase letters, so that an uppercase letter and its corresponding lowercase letter differ by only a single bit and sit in the same row of the chart; the special and numeric codes were placed before the letters.
Note that the 'space' character is considered both a graphic and a control character, whereas the 'delete' character is strictly a control character.
Seven-bit codes were initially agreed upon as they were sufficient; however, since perforated tape was capable of recording eight bits in one position, a 'zero' was added at the start of the code [2]. For example, the letter A read from the chart is 1000001 as a seven-bit code, but 01000001 as an eight-bit code.
From the ASCII chart, in terms of 8-bit codes, it can be seen that the first column of control characters starts with the digits 0000 and the second column of control characters starts with 0001. In the third column, the 'space', special characters and some mathematical operators (those usually found above the numbers on a computer keyboard) start with 0010. The numbers 0-9, followed by some additional mathematical symbols, are in the fourth column and start with 0011. The fifth to eighth columns contain the uppercase and lowercase letters, aligned to the same rows, with a few special characters in between; characters in these four columns start with 0100, 0101, 0110 and 0111 respectively. Each character, when converted to binary code, is of the same length, i.e. made up of 8 bits.
For example, from the code chart above, we can see that an upper case letter and the corresponding lower case letter differ only in the 3rd bit.
The letters ‘ENT’ from the word ENTROPY are represented as 01000101 01001110
01010100.
The letters ‘ent’ from the word entropy are represented as 01100101 01101110 01110100.
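
The single-bit difference between upper and lower case can be checked by printing the 8-bit patterns; the differing bit has value 32 (0x20), the third bit from the left in the 8-bit form.

for upper, lower in [("E", "e"), ("N", "n"), ("T", "t")]:
    u, l = ord(upper), ord(lower)
    print(upper, format(u, "08b"), "|", lower, format(l, "08b"), "| difference:", l - u)
# E 01000101 | e 01100101 | difference: 32
# N 01001110 | n 01101110 | difference: 32
# T 01010100 | t 01110100 | difference: 32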
Symbol | ASCII coding | Binary coding (according to the American Heritage Dictionary of the English Language chart)
a      | 97           | 01100001
e      | 101          | 01100101
o      | 111          | 01101111
t      | 116          | 01110100

Code | 3-bit binary coding
0    | 000
1    | 001
2    | 010
3    | 011
Take the example of the string "at ate to". Ignoring the spaces, it would be represented in ASCII numerical coding as 97 116 97 116 101 116 111.
Written in bits (without spaces) this is 01100001 01110100 01100001 01110100 01100101 01110100 01101111, which would be difficult for humans to read.
There are only 4 distinct symbols in this string, so 3 bits are more than enough to encode them. Using the code table above, "at ate to" (without the spaces) is encoded as 0 3 0 3 1 3 2, which as bits is 000 011 000 011 001 011 010.
Using 3 bits per symbol there are 21 bits instead of 56 bits. This example shows how replacing 8-bit ASCII codes with shorter fixed-length codes can increase efficiency when only a few symbols occur.
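
The comparison can be reproduced in a few lines: encode the seven letters of the string (with the spaces removed) once as 8-bit ASCII and once with the 3-bit table above.

text = "atateto"                                              # "at ate to" with the spaces removed
three_bit = {"a": "000", "e": "001", "o": "010", "t": "011"}  # the 3-bit table above

ascii_bits = "".join(format(ord(c), "08b") for c in text)
short_bits = "".join(three_bit[c] for c in text)

print(len(ascii_bits), "bits using 8-bit ASCII")   # 56 bits
print(len(short_bits), "bits using 3-bit codes")   # 21 bits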
Formatted output converts binary information into ASCII characters, which are then written to the output file. Formatted input/output is portable: formatted data files can be moved to different computers as long as they use the ASCII character set. Formatted files are easily read by humans and can be typed or edited using a text editor. Formatted input/output is more expensive than unformatted input/output because information needs to be converted between internal binary data and ASCII text.
The diagram below shows the cumulative distribution of code values generated by different coding methods. It shows how arithmetic coding is the optimal coding.
The guessing game
In an English text file there is redundancy: the characters occur with unequal frequencies, and some sequences are more likely than others. For example, the letter 'u' is far more likely to follow the letter 'q' than the letter 'e', and in a sentence a word may often be predicted from the words before it.
The guessing game illustrates how this redundancy of English could be exploited for compression. The game works as follows: an English speaker predicts the next character of a text, one character at a time. Consider an alphabet of the 26 upper-case letters plus the symbol '-', used as a word separator, giving 27 symbols. The speaker has to guess the characters of the text in order. After making a guess, the speaker is told whether the guess is right or wrong; if it is wrong, the process is repeated until it is right. He or she then tries to guess the next character, and so on.
The following results were achieved when a human was asked to play the guessing game. The number of guesses is listed below each character:

T  H  E  -  W   E  A  T  H  E  R  -  I  S  -  C   O  L  D  -
1  1  1  1  15  2  4  2  1  1  1  1  1  1  1  17  3  3  1  1
From this example you can see that sometimes a character is guessed straight away, while at other times the guesser finds it much harder. The maximum number of guesses for a single character equals the size of the alphabet, 27 in this example. The total number of symbols has not been reduced, but 1 is by far the most frequently occurring guess count, so the characters guessed in 1 attempt should be easier to compress than the characters that needed, say, 17 guesses. From this analysis we move to arithmetic coding, which is rather similar: the human guesser is replaced by a probabilistic model. The source provides the symbol, and the probabilistic model provides a list of possible next symbols with their probabilities; the probabilities must sum to 1. Ongoing research looks at ways to increase the speed of this coding without reducing compression efficiency.
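
The recoding performed by the guessing game can be sketched with a deliberately crude "model" that always guesses symbols in one fixed order of assumed likelihood (a real predictor would condition on the preceding characters); each character is then replaced by the number of guesses needed.

# Fixed guessing order: '-' first, then letters in rough order of English frequency.
GUESS_ORDER = "-ETAOINSHRDLUCMFWYPGVBKQJXZ"

def guess_counts(text):
    """Replace each character by the number of guesses the fixed model needs."""
    return [GUESS_ORDER.index(ch) + 1 for ch in text]

print(guess_counts("THE-WEATHER-IS-COLD-"))
# Small numbers dominate the output, and a run of small numbers is easy to compress.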
Arithmetic Coding
Arithmetic coding assigns a single codeword to an entire data set. Probabilities of occurrence are represented by sub-intervals of the half-open unit interval [0,1). Sub-intervals are distinguished from one another by the number of bits needed to identify them: the larger the sub-interval, the fewer bits are needed and the shorter the code. Arithmetic codes are nearly always more efficient than prefix codes.
Example
Refer to the figure below. Suppose a file contains two letters. The first letter is either a1 or b1, each with probability 1/2; the second letter is either a2 or b2, each with probability 1/2. The size of the final interval for the string b1 a2 is probability(b1) × probability(a2) = 1/2 × 1/2 = 1/4.
We start with the interval [0,1). For the first letter this is subdivided into [0,1/2) for a1 (the orange region) and [1/2,1) for b1. If the input b1 is chosen, the red region on the sub-interval [1/2,1) is selected. For the second letter this interval is subdivided into [1/2,3/4) for a2 and [3/4,1) for b2, and a2 is selected. The output is the shortest binary fraction that lies inside [1/2,3/4). In arithmetic coding, the longer the message, the more bits are needed in the output number. The output number can be decoded to recover the symbols: the encoding process reduces the range of possible numbers with every symbol in the file, and the decoding process is the inverse, expanding the range in proportion to the probability of each symbol as it is extracted and identifying at each stage the sub-range that the encoder entered.
Figure: the initial current interval [0,1) is split at 1/2, with a1 to the left and b1 to the right; the chosen interval [1/2,1) is then split at 3/4, with a2 to the left and b2 to the right.
The problem with this type of coding is that shrinking the current interval requires exact arithmetic, and no output is produced until the whole file has been read. The decoder performs essentially the same interval calculations as the encoder.
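
The interval narrowing can be followed exactly with rational arithmetic. The sketch below replays the two-letter example above (b1 then a2, each symbol having probability 1/2) and prints the final interval; a real coder would go on to emit the shortest binary fraction inside it.

from fractions import Fraction

def narrow(interval, cum_low, cum_high):
    """Shrink `interval` to the sub-interval belonging to one symbol.
    cum_low and cum_high are the symbol's cumulative probability bounds."""
    low, high = interval
    width = high - low
    return (low + width * cum_low, low + width * cum_high)

half = Fraction(1, 2)
interval = (Fraction(0), Fraction(1))              # initial current interval [0, 1)
interval = narrow(interval, half, Fraction(1))     # first letter b1  -> [1/2, 1)
interval = narrow(interval, Fraction(0), half)     # second letter a2 -> [1/2, 3/4)
print(interval)                                    # (Fraction(1, 2), Fraction(3, 4))
# The shortest binary fraction in [1/2, 3/4) is 0.1 (binary) = 1/2.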
Converting fractions to binary code
If the output is a fraction such as 7653888/16777216, it is awkward to transmit in this form, so the fraction is converted to binary for use by computers. The first step is to put the fraction into its simplest form:
7653888/16777216 = 14949/2^15 = 14949/32768
Next we find the largest power of 2 that is less than or equal to 14949. The first value of n is therefore 13, since 2^13 = 8192 ≤ 14949 (and 2^14 = 16384 > 14949). The remainder is 14949 - 8192 = 6757, so the next value of n is the largest power of 2 less than or equal to 6757. This process is repeated; the results are shown in the table below.
n  | 2^n  | Test         | Is n included? | Remainder
13 | 8192 | 14949 > 8192 | Yes            | 14949-8192=6757
12 | 4096 | 6757 > 4096  | Yes            | 6757-4096=2661
11 | 2048 | 2661 > 2048  | Yes            | 2661-2048=613
10 | 1024 | 613 < 1024   | No             |
9  | 512  | 613 > 512    | Yes            | 613-512=101
8  | 256  | 101 < 256    | No             |
7  | 128  | 101 < 128    | No             |
6  | 64   | 101 > 64     | Yes            | 101-64=37
5  | 32   | 37 > 32      | Yes            | 37-32=5
4  | 16   | 5 < 16       | No             |
3  | 8    | 5 < 8        | No             |
2  | 4    | 5 > 4        | Yes            | 5-4=1
1  | 2    | 1 < 2        | No             |
0  | 1    | 1 = 1        | Yes            | 1-1=0
To represent 14949 in binary from the table above, write a '1' wherever the answer to "Is n included?" is Yes and a '0' wherever it is No, reading n from 13 down to 0.
14949 = 2^13+2^12+2^11+2^9+2^6+2^5+2^2+1 = 11101001100101 in binary.
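
The greedy subtraction in the table can be reproduced with a short loop; Python's built-in format(14949, 'b') gives the same digits and can be used as a check.

def binary_by_powers(n):
    """Write n in binary by greedily subtracting decreasing powers of two,
    exactly as in the table above."""
    bits = []
    power = 1 << (n.bit_length() - 1)    # largest power of 2 <= n
    while power >= 1:
        if n >= power:
            bits.append("1")
            n -= power
        else:
            bits.append("0")
        power //= 2
    return "".join(bits)

print(binary_by_powers(14949))   # 11101001100101
print(format(14949, "b"))        # same digits, as a check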
Infinite binary fraction
The example below shows that the binary representation of the fraction 1/10 is infinite (the digits repeat).
Step1
Start by converting the fraction to decimal and multiplying by 2. 0.1*2=0.2. Therefore the
first binary digit on the right of the point is a 0.
0.1=.0... (base 2)
Step 2
Next the whole number is ignored from the previous answer. In this case it is 0. Then
0.2*2=0.4. Therefore the second binary digit on the right of the point is a 0.
0.1=.00... (base 2)
Step 3
Next the whole number is ignored from the previous answer. In this case it is 0. Then
0.4*2=0.8. Therefore the third binary digit on the right of the point is a 0.
0.1=.000... (base 2)
Step 4
Next the whole number is ignored from the previous answer. In this case it is 0. Then
0.8*2=1.6. Therefore the fourth binary digit on the right of the point is a 1.
0.1=.0001... (base 2)
Step 5
Next the whole number is ignored from the previous answer. In this case it is 1. Then
0.6*2=1.2. Therefore the fifth binary digit on the right of the point is a 1.
0.1=.00011... (base 2)
If you repeat this process again you will see that in the next step 0.2*2=0.4 which is the same
as step 2.
Therefore the final binary coding would be:
0.1= .0001100110011...(base 2)
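
The multiply-by-two steps can be automated; using exact fractions avoids floating-point rounding and makes the repeating 0011 pattern visible.

from fractions import Fraction

def binary_fraction_digits(value, n_digits):
    """Return the first n_digits binary digits of a fraction in [0, 1)."""
    digits = []
    for _ in range(n_digits):
        value *= 2
        digit = int(value)     # the whole-number part is the next binary digit
        digits.append(str(digit))
        value -= digit         # ignore the whole number and continue
    return "".join(digits)

print(binary_fraction_digits(Fraction(1, 10), 16))   # 0001100110011001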
Comparison of Huffman coding and Arithmetic coding
In both Huffman and arithmetic coding (in their static forms) the probabilities and the alphabet are known in advance, and both are lossless data coding methods; to learn more about lossless data compression refer to the compression section. Huffman coding requires a tree to be built as a pre-processing step and assigns one codeword to each symbol, whereas in arithmetic coding one codeword is assigned to the whole data. A disadvantage of Huffman coding is that the probabilities are required, which in practice may not always be possible, and Huffman codes have a number of practical drawbacks. Huffman coding must use a whole number of bits for each symbol, whereas arithmetic coding does not. There is a slight inaccuracy in arithmetic coding because it represents probabilities by intervals of finite precision. When the probability of occurrence of a symbol is close to 1, arithmetic coding can give more efficient compression than other forms of coding, and arithmetic coding can be seen as closer to optimal than Huffman coding. However, the drawbacks of arithmetic coding are its slowness and complexity; it can be speeded up at the cost of a small loss in compression efficiency, it is not straightforward to implement, and it does not produce prefix codes. Although arithmetic coding usually gives better results than the more popular Huffman coding, it is used comparatively rarely. Initially this was because of the performance requirements it placed on computers, but later legal considerations became more important: companies such as IBM and Mitsubishi have held patents on forms of this coding. It is only in recent years that arithmetic coding has been developed to the point of replacing Huffman coding.
This section of the project has looked at Huffman coding and arithmetic coding in detail. After this analysis, the conclusion is that, for a variety of practical reasons, Huffman coding is currently the more efficient of the two. However, with advancing technology and the pace of change and innovation, this may not remain the case in the future.
[1] http://en.wikipedia.org/wiki/Sorting_algorithm
[2] http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters