Compression Codings

Redundancy in Binary Coding
A redundant binary representation (RBR) is a numeral system that uses more bits than are required to represent a single binary digit, so most numbers have several representations. RBR makes bitwise logical operations slower than a non-redundant representation, but its arithmetic operations are faster when large bit widths are used.
The value represented by an RBR digit is found using a translation table that gives the mathematical value of each possible pair of bits, since RBR uses two bits for every place. The weighting starts at 1 for the rightmost position and doubles with each subsequent position. Most integers have several possible representations, and negative values can be represented.
An integer value can be converted back from RBR using the formula ∑_{k=0}^{n−1} d_k·2^k, where n is the number of digits and d_k is the interpreted value of the k-th digit, with k starting at 0 at the rightmost position.
An example of a translation table is shown below. Using that table, the number 1 can be represented in many ways, such as "01·01·01·11", "01·01·10·11", "01·01·11·00" or "11·00·00·00". The additive inverse of the represented integer can be found by flipping all the bits.
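
A small sketch of this decoding rule follows. The pair-to-value table used here (00 → −1, 01 → 0, 10 → 0, 11 → +1) is inferred from the example representations of 1 above; other translation tables are possible.

# Translation table inferred from the examples above; other RBR tables exist.
PAIR_VALUE = {"00": -1, "01": 0, "10": 0, "11": 1}

def rbr_value(digits):
    """Evaluate an RBR string such as "01·01·11·00" as the sum of d_k * 2^k,
    where k = 0 is the rightmost pair."""
    total = 0
    for k, pair in enumerate(reversed(digits.split("·"))):
        total += PAIR_VALUE[pair] * (2 ** k)
    return total

def rbr_negate(digits):
    """Flip every bit; this yields the additive inverse."""
    return "".join({"0": "1", "1": "0"}.get(ch, ch) for ch in digits)

for example in ["01·01·01·11", "01·01·10·11", "01·01·11·00", "11·00·00·00"]:
    print(example, "=", rbr_value(example), "  negated:", rbr_value(rbr_negate(example)))
# Every example evaluates to 1, and every negated example evaluates to -1.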
Huffman codes to binary data
The symbols below were collected from a page of text. All the alphabetic characters were counted, and the symbols occurring with the highest probability were 'a', 'e', 'o' and 't'.
Huffman coding of the four most frequent symbols:

Symbol | Frequency in text | Probability of occurrence | log2(1/p)            | Huffman code length | Code created by Huffman algorithm
e      | 209               | 0.29                      | log2(712/209) = 1.8  | 2                   | 00
t      | 181               | 0.25                      | log2(712/181) = 2.0  | 2                   | 10
a      | 163               | 0.23                      | log2(712/163) = 2.1  | 2                   | 01
o      | 159               | 0.23                      | log2(712/159) = 2.2  | 2                   | 11

(712 is the total number of characters counted: 209 + 181 + 163 + 159.)
Huffman Probability Diagram (described): in Step 1 the two least probable symbols, a (0.23) and o (0.23), are combined into a node of probability 0.46, with branch labels 0 and 1. In Step 2, e (0.29) and t (0.25) are combined into a node of probability 0.54, again with branch labels 0 and 1. In Step 3 the nodes 0.54 and 0.46 are combined, with branch labels 0 and 1, giving a total probability of 1.0.
Each codeword is obtained by reading the branch labels from the root back down to the leaf, i.e. by arranging the binary digits in reverse order, giving the codewords {00, 10, 01, 11}. In general the codeword lengths vary with symbol probability; in this example all four happen to have length 2. The code is written out as a string of '1's and '0's, which is then packed into byte (char) values that can be written to a file. The binary tree is constructed from its leaves towards the root. Huffman coding is a greedy algorithm: at each step the two smallest probabilities are combined, as can be seen from the probability diagram above. In the first step the probabilities of 'a' and 'o' are combined, as these two are the least probable. In the second step the probabilities of 'e' and 't' are combined. In the third step the probabilities add up to 1. Let c_i be the number of occurrences of the i-th symbol in the data string and l_i the length (in bits) of its codeword; the total length of the code string is the sum of l_i·c_i over all symbols. If the codeword assigned to the most frequently occurring symbol is long, the compression is likely to fail; this can happen if the wrong statistics are used when building the code. With this sort of inefficiency the code string can contain more bits than the original data.
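
The greedy merging described above can be sketched in a few lines of Python, using the frequencies from the table. The exact codewords depend on tie-breaking, so they may differ from the {00, 10, 01, 11} shown while having the same lengths.

import heapq

def huffman_codes(freqs):
    """Build Huffman codes by repeatedly merging the two least frequent nodes."""
    # Each heap entry is (frequency, tie_breaker, {symbol: code_built_so_far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)       # least frequent node
        f2, _, codes2 = heapq.heappop(heap)       # second least frequent node
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

frequencies = {"e": 209, "t": 181, "a": 163, "o": 159}
codes = huffman_codes(frequencies)
print(codes)   # {'o': '00', 'a': '01', 't': '10', 'e': '11'} -- all length 2
# Total code-string length = sum over symbols of (code length * symbol count)
print(sum(len(codes[s]) * f for s, f in frequencies.items()))   # 1424 bits for 712 characters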
Adaptive Huffman Coding
The distribution of characters may not be consistent throughout a text. For example, there could be many 'a' characters at the beginning of a file and fewer later on. In this scenario the Huffman tree may be adjusted as the file is processed. As can be seen in the example above, in static Huffman coding the character 'o' sits low in the tree because of its low frequency of occurrence compared with the other characters. If the letter 'o' appeared with high frequency at the beginning of the text, it would initially be placed high in the tree and would then be pushed down as other characters became more frequent. The file could also be split into sections, with a different tree for each section. This is the idea behind Adaptive Huffman Coding, also known as Dynamic Huffman Coding. The difference between static and dynamic Huffman coding is that in dynamic coding the code responds to the changing symbol counts as they are observed, whereas in static Huffman coding the frequencies of occurrence are known in advance. In the static case the Huffman tree has to be sent at the beginning of the compressed file so that the decompressor can decode it. Adaptive Huffman coding can be used as a universal method and can be more effective than static Huffman coding because it constantly evolves. To decode adaptive Huffman coding, a Huffman tree is built in the same way as the one used for encoding. Starting from the root of the tree, the first symbol is decoded and given a frequency count of 1; then the next symbol is decoded, and this is repeated until all the symbols are decoded. So in the example above the symbol 'e' would be decoded from 00. When encoding a file, the compression ratio rises as the number of symbols encoded increases, up to a certain point; it then levels off and does not rise beyond it.
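
A minimal way to illustrate the idea, rather than the incremental FGK/Vitter algorithms used in practice, is to keep running symbol counts and rebuild a static Huffman table every so often, so the codes follow the changing statistics. The sketch below assumes the huffman_codes function from the earlier sketch is available.

from collections import Counter

def adaptive_encode(text, rebuild_every=100):
    """Toy adaptive scheme: rebuild the Huffman table from the running counts
    every `rebuild_every` symbols. Real adaptive Huffman coding (FGK/Vitter)
    updates the tree incrementally after every single symbol instead."""
    counts = Counter()
    codes = {}
    output = []
    for i, symbol in enumerate(text):
        if i and i % rebuild_every == 0:
            codes = huffman_codes(dict(counts))   # reuses the sketch above
        # A real coder would emit an escape code for symbols not yet in the
        # table; here they are simply marked with '?'.
        output.append(codes.get(symbol, "?"))
        counts[symbol] += 1
    return "".join(output)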
http://www.codepedia.com/1/Art_Huffman_p2
Huffman coding is very efficient: depending on the data, it can compress by up to approximately 90%. This sort of compression is used in many different applications; as discussed in later chapters, it is used in JPEG files.
Unequal letter costs
In the Huffman code above, suppose the code of the symbol 'a' is 01 and the code of the symbol 'e' is 00. It does not matter how many 1s or 0s appear in a codeword, because a 1 and a 0 have an equal cost to transmit.
An example of coding in which different symbols have different costs is Morse code. A 'dash' takes longer to send than a 'dot', so sending a dash is less efficient in terms of transmission cost than sending a dot. The way to learn Morse code is to listen to a slow recording of it: you hear a combination of dashes and dots, where a dash is a longer beep and a dot a shorter beep. After every letter there is a short pause, and after every word there is a longer pause. Mastering the code therefore takes a lot of listening practice. Morse code was for a long time an efficient way to send information by radio, and it can be sent using sound or light, which is sometimes done by ships at sea; distress signals have been transmitted in Morse code. Morse code was effective in its time, but there are now more efficient codings, with encryption where required, and Morse is slow compared with modern communication. The 'space' is a very frequently used symbol, so under Huffman coding it would receive one of the shortest codewords, perhaps one or two bits. A symbol such as '@' is used infrequently, so it could be given, say, 12 bits and this would still be efficient. In our sample, the optimum occurs when the number of bits used for the symbols 'a', 'e', 'o' and 't' is proportional to the negative logarithm of their probability of occurrence.
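
This proportionality can be checked directly with the probabilities from the table above; the values of −log2(p) come out close to the 2-bit codewords that the Huffman algorithm produced.

import math

probabilities = {"e": 0.29, "t": 0.25, "a": 0.23, "o": 0.23}
for symbol, p in probabilities.items():
    ideal = -math.log2(p)     # information content: the "ideal" codeword length in bits
    print(f"{symbol}: -log2({p}) = {ideal:.2f} bits")
# e: 1.79 bits, t: 2.00 bits, a: 2.12 bits, o: 2.12 bits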
Random-access memory and Read only memory
Random-access memory (RAM) and read-only memory (ROM) are forms of computer data storage: integrated circuits that store data which can be accessed in any order. RAM is typically a volatile type of memory, meaning the information is lost when the power is switched off. ROM cannot normally be modified, so it is used for storing data that will not require modification, for example look-up tables used in computers to evaluate logical and mathematical functions.
In some situations it is possible to get more out of RAM and ROM by compressing code, storing it in ROM and decompressing it when required. Huffman coding is one compression method used for this.
American Standard Code for Information Interchange
Programming languages can use ASCII (American Standard Code for Information Interchange) coding for symbols; it is the most commonly used code chart. The encoding in ASCII is based on the ordering of the English alphabet. The name preferred by the Internet Assigned Numbers Authority is US-ASCII.
ASCII was originally based on teleprinter encoding systems and defines 128 characters, some of which are nowadays obsolete. It allows digital devices to communicate with each other and to process, store and communicate information input by humans, such as written language.
ASCII consists of control codes and graphic codes. The control codes are characters that perform an action rather than appearing in the data itself, while graphic characters are those presented in a form that can be read by humans.
The code was designed so that the control codes and graphic codes are grouped together. From the chart it can be seen that the first 33 positions (codes 0 to 32) and the last position (code 127) are the control characters. The 'space' character was placed at code 32, just before the graphic characters, partly to make sorting easy [1]. The remaining positions (codes 33 to 126) are classified as graphic characters. Uppercase and lowercase letters were assigned separate codes, with the lowercase letters placed after the uppercase letters, so that an uppercase letter and its corresponding lowercase letter differ by only a single bit and sit in the same row of the chart; the special and numeric codes were placed before the letters.
Note that the 'space' character is considered both a graphic and a control character, whereas the 'delete' character is strictly a control character.
Seven-bit codes were initially agreed upon as they were sufficient; however, since perforated tape was capable of recording eight bits in one position, a 'zero' was added at the start of the code [2]. For example, the letter A read from the chart is 1000001 as a seven-bit code, but 01000001 as an eight-bit code.
From the ASCII chart, in terms of 8-bit codes, it can be seen that the first column of control characters starts with the digits 0000 and the second column of control characters starts with 0001. In the third column, the 'space', special characters and some mathematical operators (those usually found above the numbers on a computer keyboard) start with 0010. The numbers 0-9, followed by some additional mathematical symbols, are in the fourth column and start with 0011. The fifth to eighth columns contain the uppercase and lowercase letters, aligned to the same rows, with a few special characters in between; characters in these four columns start with 0100, 0101, 0110 and 0111 respectively. Each character, when converted to binary code, is of the same length, i.e. made up of 8 bits.
For example, from the code chart above, we can see that an upper case letter and the corresponding lower case letter differ only in the 3rd bit.
The letters ‘ENT’ from the word ENTROPY are represented as 01000101 01001110
01010100.
The letters ‘ent’ from the word entropy are represented as 01100101 01101110 01110100.
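
The single-bit difference between upper and lower case can be checked by printing the 8-bit patterns; the differing bit has value 32 (0x20), the third bit from the left in the 8-bit form.

for upper, lower in [("E", "e"), ("N", "n"), ("T", "t")]:
    u, l = ord(upper), ord(lower)
    print(upper, format(u, "08b"), "|", lower, format(l, "08b"), "| difference:", l - u)
# E 01000101 | e 01100101 | difference: 32
# N 01001110 | n 01101110 | difference: 32
# T 01010100 | t 01110100 | difference: 32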
Symbol | ASCII coding | Binary coding (according to the American Heritage Dictionary of the English Language chart)
a      | 97           | 01100001
e      | 101          | 01100101
o      | 111          | 01101111
t      | 116          | 01110100

Code | 3-bit binary coding
0    | 000
1    | 001
2    | 010
3    | 011
Take the example of the string "at ate to". Ignoring the spaces, it would be represented in ASCII numerical coding as 97 116 97 116 101 116 111.
Written in bits (without spaces) this is 01100001 01110100 01100001 01110100 01100101 01110100 01101111, which would be difficult for humans to read.
There are only 4 distinct symbols in this string, so 3 bits are more than enough to encode them. Using the code table above, "at ate to" (without the spaces) is encoded as 0 3 0 3 1 3 2, which as bits is 000 011 000 011 001 011 010.
Using 3 bits per symbol there are 21 bits instead of 56 bits. This example shows how replacing 8-bit ASCII codes with shorter fixed-length codes can increase efficiency when only a few symbols occur.
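
The comparison can be reproduced in a few lines: encode the seven letters of the string (with the spaces removed) once as 8-bit ASCII and once with the 3-bit table above.

text = "atateto"                                              # "at ate to" with the spaces removed
three_bit = {"a": "000", "e": "001", "o": "010", "t": "011"}  # the 3-bit table above

ascii_bits = "".join(format(ord(c), "08b") for c in text)
short_bits = "".join(three_bit[c] for c in text)

print(len(ascii_bits), "bits using 8-bit ASCII")   # 56 bits
print(len(short_bits), "bits using 3-bit codes")   # 21 bits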
Formatted output converts binary information into ASCII characters, which are then written to the output file. Formatted input/output is portable: formatted data files can be moved to different computers as long as they use the ASCII character set. Formatted files are easily read by humans and can be typed or edited using a text editor. Formatted input/output is more expensive than unformatted input/output because information needs to be converted between internal binary data and ASCII text.
The diagram below shows the cumulative distribution of code values generated by different coding methods. It shows how arithmetic coding is the optimal coding.
The guessing game
In an English text file there is redundancy: the characters occur with unequal frequencies, and some sequences are more likely than others. For example, the letter 'u' is far more likely to follow the letter 'q' than the letter 'e', and in a sentence a word may often be predicted from the words before it.
The guessing game illustrates how this redundancy of English could be exploited for compression. The game works as follows: an English speaker predicts the next character of a text, one character at a time. Consider an alphabet of the 26 upper-case letters plus the symbol '-', used as a word separator, giving 27 symbols. The speaker has to guess the characters of the text in order. After making a guess, the speaker is told whether the guess is right or wrong; if it is wrong, the process is repeated until it is right. He or she then tries to guess the next character, and so on.
The following results were achieved when a human was asked to play the guessing game. The number of guesses is listed below each character:

T  H  E  -  W   E  A  T  H  E  R  -  I  S  -  C   O  L  D  -
1  1  1  1  15  2  4  2  1  1  1  1  1  1  1  17  3  3  1  1
From this example you can see that sometimes a character is guessed straight away, while at other times the guesser finds it much harder. The maximum number of guesses for a single character equals the size of the alphabet, 27 in this example. The total number of symbols has not been reduced, but 1 is by far the most frequently occurring guess count, so the characters guessed in 1 attempt should be easier to compress than the characters that needed, say, 17 guesses. From this analysis we move to arithmetic coding, which is rather similar: the human guesser is replaced by a probabilistic model. The source provides the symbol, and the probabilistic model provides a list of possible next symbols with their probabilities; the probabilities must sum to 1. Ongoing research looks at ways to increase the speed of this coding without reducing compression efficiency.
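
The recoding performed by the guessing game can be sketched with a deliberately crude "model" that always guesses symbols in one fixed order of assumed likelihood (a real predictor would condition on the preceding characters); each character is then replaced by the number of guesses needed.

# Fixed guessing order: '-' first, then letters in rough order of English frequency.
GUESS_ORDER = "-ETAOINSHRDLUCMFWYPGVBKQJXZ"

def guess_counts(text):
    """Replace each character by the number of guesses the fixed model needs."""
    return [GUESS_ORDER.index(ch) + 1 for ch in text]

print(guess_counts("THE-WEATHER-IS-COLD-"))
# Small numbers dominate the output, and a run of small numbers is easy to compress.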
Arithmetic Coding
Arithmetic coding assigns a single codeword to an entire data set. Probabilities of occurrence are represented by sub-intervals of the half-open unit interval [0,1). Sub-intervals are distinguished from one another by the number of bits needed to identify them: the larger the sub-interval, the fewer bits are needed and the shorter the code. Arithmetic codes are nearly always more efficient than prefix codes.
Example
Refer to the figure below. Suppose a file contains two letters. The first letter is either a1 or b1, each with probability 1/2; the second letter is either a2 or b2, each with probability 1/2. The size of the final interval for the string b1 a2 is probability(b1) × probability(a2) = 1/2 × 1/2 = 1/4.
We start with the interval [0,1). For the first letter this is subdivided into [0,1/2) for a1 (the orange region) and [1/2,1) for b1. If the input b1 is chosen, the red region on the sub-interval [1/2,1) is selected. For the second letter this interval is subdivided into [1/2,3/4) for a2 and [3/4,1) for b2, and a2 is selected. The output is the shortest binary fraction that lies inside [1/2,3/4). In arithmetic coding, the longer the message, the more bits are needed in the output number. The output number can be decoded to recover the symbols: the encoding process reduces the range of possible numbers with every symbol in the file, and the decoding process is the inverse, expanding the range in proportion to the probability of each symbol as it is extracted and identifying at each stage the sub-range that the encoder entered.
Figure: the initial current interval [0,1) is split at 1/2, with a1 to the left and b1 to the right; the chosen interval [1/2,1) is then split at 3/4, with a2 to the left and b2 to the right.
The problem with this type of coding is that shrinking the current interval requires exact arithmetic, and no output is produced until the whole file has been read. The decoder performs essentially the same interval calculations as the encoder.
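
The interval narrowing can be followed exactly with rational arithmetic. The sketch below replays the two-letter example above (b1 then a2, each symbol having probability 1/2) and prints the final interval; a real coder would go on to emit the shortest binary fraction inside it.

from fractions import Fraction

def narrow(interval, cum_low, cum_high):
    """Shrink `interval` to the sub-interval belonging to one symbol.
    cum_low and cum_high are the symbol's cumulative probability bounds."""
    low, high = interval
    width = high - low
    return (low + width * cum_low, low + width * cum_high)

half = Fraction(1, 2)
interval = (Fraction(0), Fraction(1))              # initial current interval [0, 1)
interval = narrow(interval, half, Fraction(1))     # first letter b1  -> [1/2, 1)
interval = narrow(interval, Fraction(0), half)     # second letter a2 -> [1/2, 3/4)
print(interval)                                    # (Fraction(1, 2), Fraction(3, 4))
# The shortest binary fraction in [1/2, 3/4) is 0.1 (binary) = 1/2.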
Converting fractions to binary code
If the output is a fraction such as 7653888/16777216, it is awkward to transmit in this form, so the fraction is converted to binary for use by computers. The first step is to put the fraction into its simplest form:
7653888/16777216 = 14949/2^15 = 14949/32768
Next we find the largest power of 2 that is less than or equal to 14949. The first value of n is therefore 13, since 2^13 = 8192 ≤ 14949 (and 2^14 = 16384 > 14949). The remainder is 14949 - 8192 = 6757, so the next value of n is the largest power of 2 less than or equal to 6757. This process is repeated; the results are shown in the table below.
n  | 2^n  | Test         | Is n included? | Remainder
13 | 8192 | 14949 > 8192 | Yes            | 14949-8192=6757
12 | 4096 | 6757 > 4096  | Yes            | 6757-4096=2661
11 | 2048 | 2661 > 2048  | Yes            | 2661-2048=613
10 | 1024 | 613 < 1024   | No             |
9  | 512  | 613 > 512    | Yes            | 613-512=101
8  | 256  | 101 < 256    | No             |
7  | 128  | 101 < 128    | No             |
6  | 64   | 101 > 64     | Yes            | 101-64=37
5  | 32   | 37 > 32      | Yes            | 37-32=5
4  | 16   | 5 < 16       | No             |
3  | 8    | 5 < 8        | No             |
2  | 4    | 5 > 4        | Yes            | 5-4=1
1  | 2    | 1 < 2        | No             |
0  | 1    | 1 = 1        | Yes            | 1-1=0
To represent 14949 in binary from the table above, write a '1' wherever the answer to "Is n included?" is Yes and a '0' wherever it is No, reading n from 13 down to 0.
14949 = 2^13+2^12+2^11+2^9+2^6+2^5+2^2+1 = 11101001100101 in binary.
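
The greedy subtraction in the table can be reproduced with a short loop; Python's built-in format(14949, 'b') gives the same digits and can be used as a check.

def binary_by_powers(n):
    """Write n in binary by greedily subtracting decreasing powers of two,
    exactly as in the table above."""
    bits = []
    power = 1 << (n.bit_length() - 1)    # largest power of 2 <= n
    while power >= 1:
        if n >= power:
            bits.append("1")
            n -= power
        else:
            bits.append("0")
        power //= 2
    return "".join(bits)

print(binary_by_powers(14949))   # 11101001100101
print(format(14949, "b"))        # same digits, as a check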
Infinite binary fraction
The example below shows that the binary representation of the fraction 1/10 is infinite (the digits repeat).
Step1
Start by converting the fraction to decimal and multiplying by 2. 0.1*2=0.2. Therefore the
first binary digit on the right of the point is a 0.
0.1=.0... (base 2)
Step 2
Next the whole number is ignored from the previous answer. In this case it is 0. Then
0.2*2=0.4. Therefore the second binary digit on the right of the point is a 0.
0.1=.00... (base 2)
Step 3
Next the whole number is ignored from the previous answer. In this case it is 0. Then
0.4*2=0.8. Therefore the third binary digit on the right of the point is a 0.
0.1=.000... (base 2)
Step 4
Next the whole number is ignored from the previous answer. In this case it is 0. Then
0.8*2=1.6. Therefore the fourth binary digit on the right of the point is a 1.
0.1=.0001... (base 2)
Step 5
Next the whole number is ignored from the previous answer. In this case it is 1. Then
0.6*2=1.2. Therefore the fifth binary digit on the right of the point is a 1.
0.1=.00011... (base 2)
If you repeat this process again you will see that in the next step 0.2*2=0.4 which is the same
as step 2.
Therefore the final binary coding would be:
0.1= .0001100110011...(base 2)
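
The multiply-by-two steps can be automated; using exact fractions avoids floating-point rounding and makes the repeating 0011 pattern visible.

from fractions import Fraction

def binary_fraction_digits(value, n_digits):
    """Return the first n_digits binary digits of a fraction in [0, 1)."""
    digits = []
    for _ in range(n_digits):
        value *= 2
        digit = int(value)     # the whole-number part is the next binary digit
        digits.append(str(digit))
        value -= digit         # ignore the whole number and continue
    return "".join(digits)

print(binary_fraction_digits(Fraction(1, 10), 16))   # 0001100110011001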
Comparison of Huffman coding and Arithmetic coding
In both Huffman and arithmetic coding (in their static forms) the probabilities and the alphabet are known in advance, and both are lossless data coding methods; to learn more about lossless data compression refer to the compression section. Huffman coding requires a tree to be built as a pre-processing step and assigns one codeword to each symbol, whereas in arithmetic coding one codeword is assigned to the whole data. A disadvantage of Huffman coding is that the probabilities are required, which in practice may not always be possible, and Huffman codes have a number of practical drawbacks. Huffman coding must use a whole number of bits for each symbol, whereas arithmetic coding does not. There is a slight inaccuracy in arithmetic coding because it represents probabilities by intervals of finite precision. When the probability of occurrence of a symbol is close to 1, arithmetic coding can give more efficient compression than other forms of coding, and arithmetic coding can be seen as closer to optimal than Huffman coding. However, the drawbacks of arithmetic coding are its slowness and complexity; it can be speeded up at the cost of a small loss in compression efficiency, it is not straightforward to implement, and it does not produce prefix codes. Although arithmetic coding usually gives better results than the more popular Huffman coding, it is used comparatively rarely. Initially this was because of the performance requirements it placed on computers, but later legal considerations became more important: companies such as IBM and Mitsubishi have held patents on forms of this coding. It is only in recent years that arithmetic coding has been developed to the point of replacing Huffman coding.
This section of the project has looked at Huffman coding and arithmetic coding in detail. After this analysis, the conclusion is that, for a variety of practical reasons, Huffman coding is currently the more efficient of the two. However, with advancing technology and the pace of change and innovation, this may not remain the case in the future.
[1] http://en.wikipedia.org/wiki/Sorting_algorithm
[2] http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters