3 The Concept of Information

We begin by defining information as a simple count of binary “letters” and, through successive generalizations, arrive at expressions for information that apply to physical systems.

Information in Binary Messages

The simplest measure of information is a count of the letters in a message written with a two-letter alphabet such as plus and minus (+, −), zero and one (0, 1), or dot and dash (·, −). These letters are called binary digits, and each digit carries one bit of information. For example, the message 11010 has five binary letters and therefore 5 bits of information.

1. Find the information in the binary message . [ans. 8 bits]

Computer programs use number bases like binary (base 2) and octal (base 8). The value of the binary number 1101 is

$1 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 = 8 + 4 + 0 + 1 = 13$ (in decimal).

Likewise, 107 in octal is

$1 \cdot 8^2 + 0 \cdot 8^1 + 7 \cdot 8^0 = 64 + 0 + 7 = 71$ (in decimal).

Other number systems work similarly.

2. Find the information in the following binary number (base 2): 1011. Evaluate this number in the decimal system. [ans. 4 bits, 11]

Computers store information in binary codes based on each relay being either on or off.

Information in Other Alphabets

When an alphabet has more letters, a message will be shorter because each letter carries more information. Suppose, for example, that you write with a four-letter alphabet A, T, C, G. You can “code” each letter in binary digits and then count the digits. See how many different combinations of two binary digits you can construct. The result is four:

Combination    Letter’s Name
++             A
+−             T
−+             C
−−             G

Since each letter contains two bits of information, the sequence TAACGT has 12 bits of information.

3. The genetic code of DNA is written in four molecules we label A, T, C, and G. How much information is in the genetic message AATCTTGGG? [ans. 18 bits]

4. We saw that a 4-letter alphabet had 2 bits of information per letter.
Find the information in the following number written in base 4: 132. Evaluate this number in the decimal system. [ans. 6 bits, 30]

5. Show that 3 bits of information can specify 8 letters or digits. (Make a list.)

6. How many bits are in the octal number 6071423015? [ans. 30 bits]

Exercise. One bit codes 2 letters, two bits code 4 letters, and three bits code 8 letters. Find the pattern that gives the number of letters coded by four bits (without having to write out all the combinations).

An Equation

We see that one bit codes 2 letters (say, dot and dash), two bits code 4 letters (say, A, T, C, G), and three bits code 8 letters. The following table gives the pattern:

Information per Symbol, I/G    Number of Letters in Alphabet, n
1 bit                          $2 = 2^1$
2 bits                         $4 = 2^2$
3 bits                         $8 = 2^3$
I/G                            $n = 2^{I/G}$

Here I is the total information in a message of G symbols written in an “alphabet” of n letters. We see that the relation between these is

$n = 2^{I/G}$

or, taking the natural logarithm of both sides and solving for I,

$I = k \, G \ln n$    (1)

with

$k = \frac{1}{\ln 2}$    (2)

7. Use Eq. (1) to find the information per symbol in a 4-letter alphabet.

8. Use Eq. (1) to find the information per symbol in a 20-letter alphabet. Notice that the result is not an integer. [ans. 4.32 bits]

Application to Biology

Strands of the genetic material DNA can be considered to be messages made of four molecular “letters” called nucleotides: A (adenine), T (thymine), C (cytosine), and G (guanine). These messages code the instructions for living cells to make enzymes, the active chemicals that moderate all chemical syntheses, digestion, muscular activity, etc. The enzymes (and other polypeptides) are themselves “messages” written in a 20-letter alphabet: the amino acids.

9. DNA molecules are composed of how many different kinds of nucleotides? Enzymes are composed of how many different kinds of amino acids?

Figure 1 A molecular representation of a strand of DNA is shown on the left.
The same strand is depicted more schematically on the right, where it is shown attached to a complementary strand by hydrogen bonds (dotted lines). Notice that the nucleotides are always paired A-T and C-G.

Figure 2 An enzyme is a polymer consisting of 20 amino acid units.

We address the question of how many nucleotides are required to specify an amino acid. The information in the DNA must be equal to or greater than the information in the amino acids it codes. We write

Information in DNA ≥ Information in amino acids

$k \, G_{\text{nucleotides}} \ln 4 \ge k \, G_{\text{amino acid}} \ln 20$

$G_{\text{nucleotides}} \ge 2.16 \, G_{\text{amino acid}}$

Since there cannot be fractional nucleotides, three nucleotides are the minimum number needed to code one amino acid. The genetic code is a listing of which three nucleotides correspond to particular amino acids. For instance, GGC codes glycine, GTT codes valine, and TCT codes serine. Each such triplet is called a codon.

10. Demonstrate that three nucleotides is the minimum number needed to code one amino acid.

11. The DNA of an artificial life form consists of only 2 kinds of nucleotides, and its enzymes consist of only 12 kinds of amino acids. Calculate the minimum number of nucleotides needed to code one amino acid. [ans. 4]

12. A simple bacterium, E. coli, is estimated to have about $6 \times 10^6$ DNA nucleotides in its chromosome. Assume that English consists of 28 equally probable characters, including space and period, and assume further that there are about 2100 English characters per textbook page. Estimate the number of text pages required to specify the same amount of information “written” on the bacterial DNA. [ans. ~1200 pages]

Summary

The central result of this chapter is a basic equation for the information content of a message, $I = k \, G \ln n$, where G is the number of characters in the message, n is the number of equally probable letters in the “alphabet,” and $k = 1/\ln 2$ was chosen to give information in bits (the choice of unit is arbitrary).
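As a quick numerical check of the relation I = k G ln n with k = 1/ln 2, the following is a minimal Python sketch. The function name information_bits and its argument names are illustrative choices, not part of the chapter; the sketch assumes, as the chapter does, that all n letters are equally probable.

```python
import math


def information_bits(num_symbols: int, alphabet_size: int) -> float:
    """Information I = k * G * ln(n) in bits, with k = 1 / ln(2).

    num_symbols   -- G, the number of symbols in the message
    alphabet_size -- n, the number of equally probable letters
    (Both names are hypothetical, chosen for this sketch.)
    """
    if num_symbols < 0 or alphabet_size < 1:
        raise ValueError("need num_symbols >= 0 and alphabet_size >= 1")
    k = 1.0 / math.log(2)  # Eq. (2): converts natural-log units to bits
    return k * num_symbols * math.log(alphabet_size)  # Eq. (1)


if __name__ == "__main__":
    # Six letters from the four-letter alphabet A, T, C, G: 12 bits
    print(round(information_bits(len("TAACGT"), 4), 2))
    # One symbol from a 20-letter alphabet: about 4.32 bits
    print(round(information_bits(1, 20), 2))
    # Ten octal digits: 30 bits
    print(round(information_bits(len("6071423015"), 8), 2))
```

The results are rounded because the floating-point logarithms may differ from the exact integer values in the last decimal places.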
In the next chapter, we will generalize this relation to the case where the letters are not equally likely.
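The codon-length argument and the page estimate from the biology section can also be checked numerically. The Python sketch below is one way to do it, under the chapter's assumption of equally probable letters; the function names min_symbols_per_target and pages_of_text and their parameters are hypothetical, introduced only for illustration. The minimum length is found by integer counting of combinations rather than by rounding up a logarithm ratio, which avoids floating-point edge cases.

```python
import math


def min_symbols_per_target(source_alphabet: int, target_alphabet: int) -> int:
    """Smallest G such that source_alphabet**G >= target_alphabet.

    Equivalent to rounding ln(target) / ln(source) up to a whole number.
    """
    if source_alphabet < 2 or target_alphabet < 1:
        raise ValueError("need source_alphabet >= 2 and target_alphabet >= 1")
    length = 0
    combinations = 1
    while combinations < target_alphabet:
        combinations *= source_alphabet  # one more symbol multiplies the count
        length += 1
    return length


def pages_of_text(nucleotides: float,
                  chars_per_page: int = 2100,
                  english_alphabet: int = 28) -> float:
    """Pages of text carrying the same information as a DNA strand."""
    dna_bits = nucleotides * math.log2(4)  # 2 bits per nucleotide
    bits_per_page = chars_per_page * math.log2(english_alphabet)
    return dna_bits / bits_per_page


if __name__ == "__main__":
    print(round(math.log(20) / math.log(4), 2))  # ratio of about 2.16
    print(min_symbols_per_target(4, 20))         # 3 nucleotides per amino acid
    print(min_symbols_per_target(2, 12))         # 4, as in exercise 11
    print(round(pages_of_text(6e6)))             # on the order of 1200 pages
```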