Binary coding

Binary coding is coding in which each digit of a string takes one of two possible values, "1" or "0". A string of eight such digits can be arranged in 2^8 = 256 different ways, each arrangement corresponding to a different symbol. For example, in ASCII the eight digits 01100010 represent the lowercase letter b. In computing and telecommunication, binary code is used for a variety of methods of encoding data, such as a sequence of bytes.

Binary coding can be fixed width or variable width. In a fixed-width binary code, each letter, digit, or other character is represented by a binary string of the same length. In a variable-width binary code, different characters are represented by strings of different lengths.

One way in which binary coding is implemented in practice is in the way lasers are used to send binary strings. The decimal number 5 as a binary string is 101, where "1" represents the laser being on and "0" represents the laser being off; so the laser is switched on, then off, then on. Each number is transformed into binary, sent down an optical fibre with a laser, and converted back into the number at the receiving end. Modern computers can switch a laser beam on and off extremely fast, which makes the method efficient: a lot of numbers can be sent very quickly.

Suppose $l_i$ is the binary length of character $i$ and $p_i$ is its probability. Then the average length is
$$\langle l \rangle = \sum_{i=1}^{n} p_i l_i.$$
For example:

Letter   Code   Length $l_i$   Probability $p_i$
A        0      1              1/2
B        10     2              1/4
C        11     2              1/8
D        110    3              1/8

The average length is $\langle l \rangle = 1 \cdot \frac{1}{2} + 2 \cdot \frac{1}{4} + 2 \cdot \frac{1}{8} + 3 \cdot \frac{1}{8} = 1.625$.

Before coding, let the length of the string of characters be $L$; after coding, let the binary length be $L_b$. Character $i$ occurs $N_i = L p_i$ times, so
$$L_b = \sum_{i=1}^{n} N_i l_i = \sum_{i=1}^{n} L p_i l_i = L \sum_{i=1}^{n} p_i l_i = L \langle l \rangle \qquad \therefore L_b = L \langle l \rangle.$$

Let $S$ be the entropy per character before coding and $S_b$ the entropy per bit after coding. The total information $I$ is unchanged by coding, so
$$I = S \cdot L = S_b \cdot L_b = S_b \cdot L \cdot \langle l \rangle \qquad \therefore S = S_b \langle l \rangle.$$
In general the entropy of an alphabet of $n$ symbols satisfies $S \le \log_2 n$, so for the binary alphabet $S_b \le \log_2 2 = 1$, and therefore $S \le \langle l \rangle$. That is, the average length of the coding cannot be less than the entropy, in accordance with Shannon's theorem: $\langle l \rangle \ge S$. The above formulas are all derived from formulae mentioned in the introduction.

It is vital that the coding is uniquely decodable, i.e. invertible, and ideally it should also be instantaneous. As an example of a code that is not instantaneous, suppose we have:
$$a_1 \to 0, \qquad a_2 \to 01, \qquad a_3 \to 11.$$
The recipient receives the message 0111111..... When decoding this message, it could be read either as (i) 0.11.11.11..... or as (ii) 01.11.11..... If the total number of 1s is even, as in case (i), we would get $a_1 a_3 a_3 a_3 \ldots$, and if it is odd, as in case (ii), we would get $a_2 a_3 a_3 a_3 \ldots$. Depending on the method of decoding chosen, two different messages are received. Therefore, one cannot begin to decode until one has the complete sequence.

Prefix coding

A prefix code is one in which no codeword is the beginning of another codeword. Fixed-width codes, such as eight-bit binary codes, are the easiest and most obvious examples of prefix codes, but when speed and efficiency matter one looks to variable-width prefix codes. Formally, $w_i$ is not the beginning of $w_j$ for any $i, j$, i.e. $w_j \ne w_i s$ where $s$ is some other string ($w_i$ is a prefix of $w_j$ if $w_j = w_i s$). For example, if $w_1 = 0$, $w_2 = 01$, and $w_3 = 11$, then $w_1$ is a prefix of $w_2$, so this is not a prefix code. If instead $w_1 = 0$, $w_2 = 10$, and $w_3 = 11$, then no codeword is the beginning of another, and this is a prefix code.
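To make these ideas concrete, here is a minimal sketch in Python (our own illustration; the helper name is_prefix_free is our choice, not a standard routine). It converts the laser example's 5 into the binary string 101, recomputes the average length 1.625 for the A-D example, and tests the two codes above against the prefix condition.

```python
# Decimal 5 as a binary string, as in the laser example: on, off, on
print(format(5, "b"))  # -> "101"

# Average length <l> = sum of p_i * l_i for the A-D example above
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = {"A": "0", "B": "10", "C": "11", "D": "110"}
print(sum(p * len(code[s]) for s, p in probs.items()))  # -> 1.625

def is_prefix_free(codewords):
    """Return True if no codeword is the beginning of another codeword."""
    return not any(
        i != j and w.startswith(v)
        for i, v in enumerate(codewords)
        for j, w in enumerate(codewords)
    )

print(is_prefix_free(["0", "01", "11"]))  # -> False: "0" begins "01"
print(is_prefix_free(["0", "10", "11"]))  # -> True: a prefix code
```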
Prefix coding and instantaneous coding are in fact the same property: that a prefix code is instantaneous is obvious, while the converse is not so obvious.

Prefix (binary) codes can be constructed using a tree diagram as below.

[Tree diagram: a binary tree with the empty string ∅ at the root ("length 0"), branching to 0 and 1 ("length 1"), then 00, 01, 10, 11 ("length 2"), and 000, 001, 010, 011, 100, 101, 110, 111 ("length 3"). Bold lines mark the branches used by the first coding below.]

The table below summarizes a coding picked without any evaluation and an optimal coding which has been carefully picked. The first coding was picked by starting at the bottom of the tree diagram. 0 was chosen to represent $a_1$, which means the whole branch below 0 can no longer be used. Going along the branch 1, 10 was chosen to represent $a_2$, so that branch can also no longer be used. Following the same method, 110 was chosen to represent $a_3$ and 111 to represent $a_4$. This leaves nothing available to choose for $a_5$, and therefore this coding is inefficient. The branches used can be seen as the bold lines in the tree diagram above.

[Tree diagram as above, with dashed lines marking the branches used by the optimal coding.]

The optimal code is obtained by selecting more carefully from the tree diagram. Following the same procedure but starting from the row of "length 2", choose 00 to represent $a_1$, 01 to represent $a_2$, 10 to represent $a_3$, 110 to represent $a_4$, and 111 to represent $a_5$. Since a codeword has been obtained for all 5 letters, this coding works and is taken to be optimal. The branches used can be seen as the dashed lines in the tree diagram above.

Letter   Coding   Length   Optimal coding   Length
$a_1$    0        1        00               2
$a_2$    10       2        01               2
$a_3$    110      3        10               2
$a_4$    111      3        110              3
$a_5$    -        -        111              3

Country calling codes

Country calling codes are a practical prefix code: no country code is the beginning of another, so a call can be routed digit by digit. When making a telephone call to another country, an international dialling prefix, for example "00" or "011", is dialled before the country calling code. So if somebody wanted to call a mobile phone in the UK from abroad, they would dial 00 44 7XX XXX XXXX: 00 is the international prefix, 44 is the UK country code, and 7 is the second digit of any UK mobile number (the leading 0 is dropped), followed by the remaining 9 digits.

Huffman coding

The Huffman coding method is named after D. A. Huffman, who developed it in the 1950s. Huffman coding is a compression method that transforms characters into variable-length bit strings: the characters that occur most frequently get the shortest bit strings, and the characters that occur less frequently get longer bit strings. This code will be explained in detail later in the project; a sketch of the construction is also given at the end of this section. Huffman coding is used to obtain prefix codes, and indeed prefix codes are often referred to as "Huffman codes" even when they were not produced by Huffman's algorithm (http://en.wikipedia.org/wiki/Prefix_code). In summary:

• Each symbol has a variable length depending on the number of times the symbol occurs.
• Each codeword has a certain number of bits.
• It is a prefix code.
• It is a variable-length code.
• Codewords are created by expanding a Huffman tree.

Huffman coding is the most illustrative for our project, even though there are many other coding methods.

Arithmetic coding

Arithmetic coding is an entropy coding. It was invented by Elias, Rissanen and Pasco, and was subsequently made practical by Witten et al. The coding is not restricted to a discrete (whole) number of bits per symbol; however, it is slow. Each symbol is assigned an interval, and the coding can provide compression that is close to optimal; it is optimal in the limit of long messages.
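As promised in the Huffman coding section above, here is a minimal sketch of the construction in Python (our own illustration using only the standard library; the function name huffman_code and the use of heapq are our choices, not taken from any source). It repeatedly merges the two least probable subtrees, prepending a bit to every codeword in each, and is run on the A-D probabilities from the earlier average-length example.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a prefix code from {symbol: probability} by Huffman's method:
    repeatedly merge the two least probable subtrees."""
    tiebreak = count()  # unique counter so equal probabilities never compare dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)   # least probable subtree
        p1, _, right = heapq.heappop(heap)  # second least probable subtree
        merged = {s: "0" + w for s, w in left.items()}   # prepend 0 on one side
        merged.update({s: "1" + w for s, w in right.items()})  # and 1 on the other
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

# The A-D probabilities from the average-length example
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = huffman_code(probs)
print(code)  # -> {'A': '0', 'B': '10', 'C': '110', 'D': '111'}

# Average length equals the source entropy here (1.75 bits), so the code is optimal
print(sum(p * len(code[s]) for s, p in probs.items()))  # -> 1.75
```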
So far we have only been concerned with the coding itself, without worrying about whether some words or symbols occur more often than others. In the main body of the project we will pay more attention to the fact that some symbols come up more often; this will motivate the concept of entropy, which will be explained in detail.