3 The Concept of Information

We begin by defining information as a simple count of binary “letters” and,
through successive generalizations, we will find expressions for information that
apply to physical systems.
Information in Binary Messages
The simplest measure of information is to count the letters in a message written
with a two-letter alphabet like plus and minus (+, −), zero and one (0, 1), or dot and dash
(·, –). These are called binary digits, and each digit carries one bit of
information. For example, the message 11010 has five binary letters and therefore 5 bits
of information.
1.
Find the information in the binary message        .[ans. 8 bits]
Computer programs use number bases like binary (base 2) and octal (base 8). The
value of the binary number 1101 is
$1 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 = 8 + 4 + 0 + 1 = 13$ (in
decimal). Likewise, 107 in octal is
$1 \times 8^2 + 0 \times 8^1 + 7 \times 8^0 = 64 + 0 + 7 = 71$ (in decimal).
Other number systems work similarly.
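The place-value rule can be checked with a few lines of Python. This is only an illustrative sketch; the function name digits_to_decimal is our own choice and is not part of the text.

```python
def digits_to_decimal(digits: str, base: int) -> int:
    """Evaluate a string of digits written in the given number base."""
    value = 0
    for ch in digits:
        d = int(ch)
        if d >= base:
            raise ValueError(f"digit {d} is not valid in base {base}")
        # Shift the running value one place to the left, then add the new digit.
        value = value * base + d
    return value


print(digits_to_decimal("1101", 2))  # 13  (8 + 4 + 0 + 1)
print(digits_to_decimal("107", 8))   # 71  (64 + 0 + 7)
```

The number of bits of information is a separate question from the value: it depends only on how many digits there are and on the size of the alphabet.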
2.
Find the information in the following binary number (base 2): 1011. Evaluate
this number in the decimal system. [ans. 4 bits, 11]
Computers store information in binary codes based on each relay being either on or off.
Information in Other Alphabets
When there are more letters in an alphabet the message will be shorter because
each letter contains more information. Suppose, for example, that you write messages with a
four-letter alphabet A, T, C, G. You can “code” each letter in binary digits and then count
the digits. See how many different combinations of two binary digits you can construct. The
result is four:
Combination    Letter’s Name
++             A
+−             T
−+             C
−−             G
Since each letter contains two bits of information, the sequence TAACGT has 12 bits of
information.
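As a rough illustration, the coding of a four-letter alphabet into pairs of binary digits can be carried out in Python. The particular assignment of codes to letters below is an assumption made for the example; any one-to-one assignment gives the same count.

```python
from itertools import product

# Assumed (arbitrary) assignment of two-digit binary codes to the letters.
letters = "ATCG"
codes = ["".join(pair) for pair in product("+-", repeat=2)]  # '++', '+-', '-+', '--'
code_for = dict(zip(letters, codes))


def encode(message: str) -> str:
    """Replace every letter by its two-digit binary code."""
    return "".join(code_for[ch] for ch in message)


encoded = encode("TAACGT")
print(encoded)       # +-++++-+--+-
print(len(encoded))  # 12 binary digits, i.e. 12 bits
```

Counting the binary digits of the coded message gives the information directly: six letters at two bits each.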
3.
The genetic code of DNA is written in four molecules we label A,T,C, and G.
How much information is in the genetic message AATCTTGGG? [ans. 18 bits]
4.
We saw that a 4-letter alphabet had 2 bits of information per letter. Find the
information in the following number written in number base 4: 132. Evaluate
this number in the decimal system. [ans. 6 bits, 30]
5.
Show that 3 bits of information can specify 8 letters or digits. (Make a list.)
6.
How many bits are in the octal number 6071423015? [ans. 30 bits]
exercise
One bit codes 2 letters, two bits code 4 letters, and three bits code 8 letters. Find
the pattern that will give you the number of letters coded by four bits (without
having to write all the combinations).
An Equation
We see that one bit codes 2 letters (say, dot and dash), two bits code 4 letters (say,
A,T,C,G), and three bits code 8 letters. The following table gives the pattern:
Information per Symbol, I/G    Number of Letters in Alphabet, n
1 bit                          $2 = 2^1$
2 bits                         $4 = 2^2$
3 bits                         $8 = 2^3$
I/G                            $n = 2^{I/G}$
Here I is the total information in a message with G symbols written in an “alphabet” of n
letters. We see that the relation between these is
$$n = 2^{I/G}$$
or, taking the natural logarithm of both sides,
$$I = k\,G \ln n \qquad (1)$$
with
$$k = \frac{1}{\ln 2} \qquad (2)$$
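Eq. (1) is easy to evaluate numerically. The sketch below is one possible way to do so in Python; the names K and information_bits are our own and are not taken from the text.

```python
import math

K = 1 / math.log(2)  # Eq. (2): the constant that gives information in bits


def information_bits(G: int, n: int) -> float:
    """Eq. (1): information in a message of G symbols from an n-letter alphabet."""
    return K * G * math.log(n)


# The message 11010: five symbols from a two-letter alphabet.
print(f"{information_bits(5, 2):.2f}")  # 5.00
# The sequence TAACGT: six symbols from a four-letter alphabet.
print(f"{information_bits(6, 4):.2f}")  # 12.00
```

Both results agree with the direct counts of binary digits found earlier, as they should, since Eq. (1) is just the relation between n and I/G solved for I.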
7.
Use Eq. (1) to find the information per symbol in a 4-letter alphabet.
8.
Use Eq. (1) to find the information per symbol in a 20-letter alphabet. Notice that
the result is not an integer. [ans. 4.32 bits]
Application to Biology
Strands of the genetic material DNA can be considered to be messages made of
four molecular “letters” called nucleotides: A (adenine), T (thymine), C (cytosine), and
G (guanine). These messages code the instructions for living cells to make enzymes, the
active chemicals that moderate all chemical syntheses, digestion, muscular activity, etc.
The enzymes (and other polypeptides) are themselves “messages” made of a 20-letter
alphabet: the amino acids.
9.
DNA molecules are composed of how many different kinds of nucleotides?
Enzymes are composed of how many different kinds of amino acids?
Figure 1
A molecular representation of a strand of DNA is shown on the left.
The same strand is depicted more schematically on the right, where it is shown
attached to a complementary strand by hydrogen bonds (dotted lines). Notice that
the nucleotides are always paired A-T and C-G.
Figure 2
An enzyme is a polymer consisting of 20 amino acid units.
We address the question of how many nucleotides are required to specify an amino acid.
The information in the DNA must be equal to or greater than the information in an amino
acid. We write
$$\text{Information in DNA} \ge \text{Information in amino acid}$$
$$k\,G_{\text{nucleotides}} \ln 4 \ge k\,G_{\text{amino acid}} \ln 20$$
$$\frac{G_{\text{nucleotides}}}{G_{\text{amino acid}}} \ge \frac{\ln 20}{\ln 4} \approx 2.16$$
Since there cannot be fractional nucleotides, three nucleotides are the minimum number
needed to code one amino acid. The genetic code is a listing of which three nucleotides
correspond to particular amino acids. For instance, GGC codes glycine, GTT codes
valine, and TCT codes serine. Each such triplet is called a codon.
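The same counting argument can be sketched in Python. The function min_code_length is a hypothetical helper written for this illustration; it finds the smallest whole number of source letters whose combinations are at least as numerous as the target alphabet.

```python
import math


def min_code_length(n_source: int, n_target: int) -> int:
    """Smallest whole number L such that n_source**L >= n_target."""
    if n_source < 2:
        raise ValueError("the source alphabet needs at least two letters")
    length = 1
    # Integer arithmetic avoids rounding problems when taking logarithms.
    while n_source ** length < n_target:
        length += 1
    return length


# Ratio of information per amino acid to information per nucleotide.
print(f"{math.log(20) / math.log(4):.2f}")  # 2.16
# Whole nucleotides needed per amino acid.
print(min_code_length(4, 20))               # 3
```

Two nucleotides give only 16 combinations, which is too few for 20 amino acids, while three give 64, which is more than enough.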
10.
Demonstrate that three nucleotides is the minimum number needed to code one
amino acid.
11.
The DNA of an artificial life form consists of only 2 kinds of nucleotides and the
enzymes consist of only 12 kinds of amino acids. Calculate the minimum number
of nucleotides needed to code one amino acid. [ans. 4]
12.
A simple bacterium, E. coli, is estimated to have about $6 \times 10^6$ DNA nucleotides
in its chromosome. Assume that English consists of 28 equally probable
characters, including space and period, and assume further that there are about 2100
English characters per textbook page. Estimate the number of text pages
required to specify the same amount of information “written” on bacterial DNA.
[ans. ~1200 pages]
Summary
The central result of this chapter was to establish a basic equation for the
information content of a message,
$$I = k\,G \ln n,$$
where G is the number of characters in the message, n is the number of equally probable
letters in the “alphabet,” and $k = 1/\ln 2$ was chosen to give information in bits (this is an
arbitrary unit).
In the next chapter, we will generalize this relation to the case where the letters
are not equally likely.