Information Theory

DA-IICT, Gandhinagar
Lecture Notes
Subject: Probability, Statistics and Information Theory (SC222)
Date: 07/04/2021
Reg. No.: 201901438
Name: Yash Mandaviya
Lecture No.: 21
1 Information Theory
Information theory is the study of the communication of information.
The fundamental work in the field was done by Harry Nyquist (an electrical engineer), Ralph Hartley (an electrical researcher) and Claude Shannon (a mathematician). Claude Shannon is known as the father of information theory because of the paper he published in 1948, "A Mathematical Theory of Communication".
Information theory is divided into two parts: (1) source coding and (2) channel coding. Source coding is concerned with compressing data, so that a message can be stored or transmitted in the least possible space. Channel coding does the opposite: it adds redundant bits to the message so that errors introduced during communication can be detected and corrected.
1.1 Information
While studying information theory, a natural question is what information actually means in the mathematical sense. The answer is:
"Information is the amount of uncertainty."
This is still not very clear, so here are some examples.
Example 1: Suppose that Ankit and Hiten play chess and carrom against each other. Ankit wins 63 percent of the chess games and Hiten wins 37 percent. Ankit wins 6 percent of the carrom games and Hiten wins 94 percent.

| Game | Win rate for Ankit (%) | Win rate for Hiten (%) |
|---|---|---|
| Chess | 63 | 37 |
| Carrom | 6 | 94 |

Looking at the chess results in the table, we cannot confidently predict whether Ankit or Hiten will win, but looking at the carrom results we are almost certain that Hiten will win. So we can say that the outcome of a chess game is more uncertain than the outcome of a carrom game.
Example 2: The results published by the CBSE board in 2020 show that 88.78 percent of students passed the exams, and the report published by IIT Delhi in 2020 shows that only 28.64 percent of students passed the IIT entrance exam.

| Exam | Pass percentage | Fail percentage |
|---|---|---|
| IIT-JEE (2020) | 28.64 | 71.36 |
| CBSE board (2020) | 88.78 | 11.22 |

We can conclude from the table above that the pass/fail outcome of IIT-JEE (2020) is less certain than that of the CBSE board (2020).
The two tables above tell us that the more uncertain an outcome is, the more information it carries. (The amount of information, in bits, is calculated in the Entropy part.)
1.2 Entropy
Entropy is a very important term, used in information theory to measure information. Since information is the amount of uncertainty, more uncertainty means more information to store, so entropy is the amount of uncertainty in the information.
Here is the formula to calculate the entropy of a discrete random variable X:
$$H(X) = -\sum_{i \in X} p(i) \log_2 p(i) \ \text{bits}$$
The unit of entropy is bits if we take base 2 in the formula; if we take base e instead, the unit is nats.
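The formula translates almost directly into code. The following is a minimal sketch, not part of the lecture: the function name `entropy` and its `base` parameter are illustrative, and it assumes the usual convention that a term with p = 0 contributes 0.

```python
import math

def entropy(probs, base=2.0):
    """Entropy of a discrete distribution given as a list of probabilities.

    Terms with p == 0 are skipped, following the convention 0 * log 0 = 0.
    """
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin, measured in both units.
print(entropy([0.5, 0.5]))           # base 2 -> bits
print(entropy([0.5, 0.5], math.e))   # base e -> nats
```

Changing only the base changes the unit; the two values differ by the constant factor ln 2.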
Example 1: Taking the data from Example 1 of the Information part, we have the following for the chess game:
Win rate for Ankit: 63 percent
Win rate for Hiten: 37 percent
The entropy of the chess game is:
$$H(\text{chess}) = -0.63 \log_2 0.63 - 0.37 \log_2 0.37 \approx 0.95 \ \text{bits}$$
For the carrom game we have:
Win rate for Ankit: 6 percent
Win rate for Hiten: 94 percent
The entropy of the carrom game is:
$$H(\text{carrom}) = -0.06 \log_2 0.06 - 0.94 \log_2 0.94 \approx 0.33 \ \text{bits}$$
Example 2: Suppose that we have two coins, one biased and one unbiased, with the following outcome percentages:

| Coin | Head percentage | Tail percentage |
|---|---|---|
| Biased coin (A) | 70 | 30 |
| Unbiased coin (B) | 50 | 50 |

$$H(A) = -0.7 \log_2 0.7 - 0.3 \log_2 0.3 \approx 0.88 \ \text{bits}$$
$$H(B) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1 \ \text{bit}$$
An observation: suppose that a random experiment has n different outcomes, each with the same probability 1/n. Then, once the experiment is done, we gain $\log_2 n$ bits of information.
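The value $\log_2 n$ follows from the entropy formula with p(i) = 1/n for every outcome. As a short worked step:

$$H(X) = -\sum_{i=1}^{n} \frac{1}{n} \log_2 \frac{1}{n} = n \cdot \frac{1}{n} \log_2 n = \log_2 n \ \text{bits}$$

For instance, a fair six-sided die (an illustrative case, not taken from the lecture) gives $\log_2 6 \approx 2.585$ bits.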
From the examples analyzed so far, we have the following:
- Before the random event has occurred, there is a fixed amount of uncertainty, and we have not gained any information.
- After the random event has occurred, there is a fixed amount of information gained and no uncertainty.
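Example 3: Suppose that you toss an unbiased coin repeatedly until you get 2 tails in a row. X denotes the number of tosses required to get 2 consecutive tails. Calculate the entropy of X using the following table.

| X | 2 | 3 | 4 | 5 | ... | i |
|---|---|---|---|---|---|---|
| p(X) | 1/4 | 1/8 | 1/16 | 1/32 | ... | 1/2^i |
| Event | TT | HTT | HHTT | HHHTT | ... | (H..H)TT |

Note that the table lists only sequences in which a run of heads is followed by TT. Other sequences, such as THTT, also stop at the same number of tosses, so the listed probabilities sum to 1/2 rather than 1, and the calculation below covers only the listed terms.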
$$H(X) = \sum_{i=2}^{\infty} \frac{1}{2^i} \log_2 \frac{1}{1/2^i} = 1.5 \ \text{bits}$$
Quick reference for the sums of some infinite series:
$$\sum_{i=0}^{\infty} r^i = \frac{1}{1-r}, \quad |r| < 1$$
$$\sum_{i=1}^{\infty} r^i = \frac{r}{1-r}, \quad |r| < 1$$
$$\sum_{i=1}^{\infty} i \, r^i = \frac{r}{(1-r)^2}, \quad |r| < 1$$
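These closed forms can be checked numerically. In Example 3 each term equals i/2^i, so the sum is the third series with r = 1/2 minus its i = 1 term, that is 2 − 1/2 = 1.5. The sketch below is only a sanity check under the assumption that 200 terms are enough for the partial sums to settle; the helper name `partial_sum` is illustrative.

```python
def partial_sum(term, start, n_terms=200):
    """Sum term(i) for i = start, ..., start + n_terms - 1."""
    return sum(term(i) for i in range(start, start + n_terms))

r = 0.5  # any |r| < 1 works; 1/2 is the value needed for Example 3

geometric_from_0 = partial_sum(lambda i: r**i, start=0)      # compare with 1 / (1 - r)
geometric_from_1 = partial_sum(lambda i: r**i, start=1)      # compare with r / (1 - r)
weighted_from_1 = partial_sum(lambda i: i * r**i, start=1)   # compare with r / (1 - r)**2

# Example 3: sum over i >= 2 of i / 2**i.
example_3 = partial_sum(lambda i: i * r**i, start=2)

print(geometric_from_0, geometric_from_1, weighted_from_1, example_3)
```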
Example 4: Suppose that a random event has two outcomes, 0 and 1. Let X ∈ {0, 1} be a random variable with the following probabilities:
$$P(X = 0) = 1 - p, \quad P(X = 1) = 1 - P(X = 0) = p, \quad 0 \le p \le 1$$
$$H(X) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$
To maximize H(X) over p, we set
$$\frac{dH(X)}{dp} = \log_2 \frac{1 - p}{p} = 0$$
which gives p = 1/2. So the entropy is maximum at p = 1/2, where H(X) = 1 bit.
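The location of the maximum can also be seen numerically. The following sketch evaluates the two-outcome entropy on a grid of probabilities; the function name `binary_entropy` and the grid spacing of 0.01 are assumptions made for illustration.

```python
import math

def binary_entropy(p):
    """H(X) in bits for P(X = 1) = p, with 0 * log 0 taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Evaluate on the grid p = 0.00, 0.01, ..., 1.00 and pick the maximizer.
grid = [k / 100 for k in range(101)]
best_p = max(grid, key=binary_entropy)

print(best_p, binary_entropy(best_p))
```

The function is symmetric about p = 1/2, which is why a biased coin has the same entropy whichever side is favoured.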
1.2.1 JOINT ENTROPY:
$$H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y)$$
1.2.2 CONDITIONAL ENTROPY:
$$H(Y \mid X) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y \mid x) = \sum_{x \in X} p(x) \, H(Y \mid X = x)$$
Example 1: The joint probabilities of the random variables X and Y are given in the following table:

| X \ Y | 1 | 2 | 3 | p(X) |
|---|---|---|---|---|
| 1 | 1/12 | 1/6 | 1/12 | 1/3 |
| 2 | 1/4 | 1/8 | 1/8 | 1/2 |
| 3 | 1/12 | 0 | 1/12 | 1/6 |
| p(Y) | 5/12 | 7/24 | 7/24 | total = 1 |

1. H(X) ≈ 1.459 bits
2. H(Y) ≈ 1.563 bits
3. H(Y | X) ≈ 1.417 bits
4. H(X | Y) ≈ 1.313 bits
5. H(X, Y) ≈ 2.876 bits
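As a cross-check, these five quantities can be computed directly from the joint table. The following is a minimal sketch, assuming rows index X and columns index Y as in the table; the variable and function names are illustrative, and the conditional entropies use the form H(Y | X) = Σ p(x) H(Y | X = x).

```python
import math
from fractions import Fraction as F

# Joint probabilities p(x, y): rows are X = 1, 2, 3; columns are Y = 1, 2, 3.
joint = [
    [F(1, 12), F(1, 6), F(1, 12)],
    [F(1, 4),  F(1, 8), F(1, 8)],
    [F(1, 12), F(0),    F(1, 12)],
]

def entropy(probs):
    """Entropy in bits; zero-probability terms are skipped (0 * log 0 = 0)."""
    return -sum(float(p) * math.log2(float(p)) for p in probs if p > 0)

p_x = [sum(row) for row in joint]         # row sums: marginal of X
p_y = [sum(col) for col in zip(*joint)]   # column sums: marginal of Y

h_x = entropy(p_x)
h_y = entropy(p_y)
h_xy = entropy(p for row in joint for p in row)

# Conditional entropies as weighted averages of the per-row / per-column entropies.
h_y_given_x = sum(float(px) * entropy(p / px for p in row)
                  for px, row in zip(p_x, joint))
h_x_given_y = sum(float(py) * entropy(p / py for p in col)
                  for py, col in zip(p_y, zip(*joint)))

for name, value in [("H(X)", h_x), ("H(Y)", h_y), ("H(Y|X)", h_y_given_x),
                    ("H(X|Y)", h_x_given_y), ("H(X,Y)", h_xy)]:
    print(f"{name} = {value:.3f} bits")
```

Exact fractions are used for the table so that the marginals come out as 1/3, 1/2, 1/6 and 5/12, 7/24, 7/24 without rounding.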
1.2.3 MUTUAL INFORMATION
The mutual information of two random variables A and B is denoted I(A; B) or I(B; A).
Looking at the example for conditional entropy, we can see that H(Y) − H(X) = H(Y | X) − H(X | Y). Rearranging this identity gives two equal expressions, which define the mutual information:
$$I(A; B) = H(B) - H(B \mid A) = H(A) - H(A \mid B)$$
For the joint table in Example 1 of the conditional entropy part, I(X; Y) ≈ 0.147 bits.
1.2.4 CHAIN RULE FOR ENTROPY
$$H(A, B) = H(A) + H(B \mid A) = H(B) + H(A \mid B)$$
These formulas show that the total information carried by A and B together equals the information carried by A plus the information carried by B given A, and vice versa.
From this, we can derive yet another formula for I(A; B):
$$I(A; B) = H(A) + H(B) - H(A, B)$$
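The chain rule and the two expressions for mutual information can be checked on the joint table from Example 1 of the conditional entropy part. This is a sketch only: the names are illustrative, and H(Y | X) is computed here from the double-sum definition with p(y | x) = p(x, y) / p(x).

```python
import math

# Joint table p(x, y): rows are X = 1, 2, 3; columns are Y = 1, 2, 3.
joint = [
    [1/12, 1/6, 1/12],
    [1/4,  1/8, 1/8],
    [1/12, 0.0, 1/12],
]

def entropy(probs):
    """Entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]
h_x, h_y = entropy(p_x), entropy(p_y)
h_xy = entropy(p for row in joint for p in row)

# H(Y | X) = -sum over (x, y) of p(x, y) * log2 p(y | x).
h_y_given_x = -sum(
    p * math.log2(p / p_x[i])
    for i, row in enumerate(joint)
    for p in row
    if p > 0
)

chain_rule_gap = h_xy - (h_x + h_y_given_x)   # expected to be ~0
mi_from_conditional = h_y - h_y_given_x       # I(X;Y) = H(Y) - H(Y | X)
mi_from_joint = h_x + h_y - h_xy              # I(X;Y) = H(X) + H(Y) - H(X, Y)

print(chain_rule_gap, mi_from_conditional, mi_from_joint)
```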
If we are dealing with more than two random variables, say A_1, A_2, A_3, ..., A_n, then
$$H(A_1, A_2, A_3, \ldots, A_n) = \sum_{j=1}^{n} H(A_j \mid A_1, A_2, A_3, \ldots, A_{j-1})$$
And the mutual information for more than two random variables is
$$I(A_1, A_2, A_3, \ldots, A_n; B) = \sum_{j=1}^{n} I(A_j; B \mid A_1, A_2, A_3, \ldots, A_{j-1})$$
1.3 SOURCE-CODING:
A source code C for a random variable X is a mapping from the values of X to {0, 1}*, the set of all finite bit strings. Here {0, 1}* = {ε, 0, 1, 00, 01, 10, 11, ...}.
The average length L(C) of the code C is given by
$$L(C) = \sum_{i} p_i \, l_i$$
where p_i is the probability of the i-th value and l_i is the length of its codeword.
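The average length is a probability-weighted sum of codeword lengths, which is easy to compute. Below is a minimal sketch applied to the code C1 of Example 1 in this section; the function name `average_length` and the dictionary layout are assumptions made for illustration.

```python
from fractions import Fraction as F

def average_length(code):
    """Average codeword length L(C) = sum_i p_i * l_i.

    `code` maps each symbol to a (probability, codeword) pair.
    """
    return sum(p * len(word) for p, word in code.values())

# Code C1 of Example 1: symbol -> (p(x), C(x)).
c1 = {1: (F(1, 3), "10"), 2: (F(1, 2), "011"), 3: (F(1, 6), "1101")}

print(average_length(c1), float(average_length(c1)))
```

Using exact fractions keeps the result as 17/6 before it is rounded for display.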
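Example 1: Suppose that there is a random variable X with the following attributes:

| X | p(x) | C(x) | l(x) |
|---|---|---|---|
| 1 | 1/3 | 10 | 2 |
| 2 | 1/2 | 011 | 3 |
| 3 | 1/6 | 1101 | 4 |

$$L(C_1) = \frac{1}{3} \cdot 2 + \frac{1}{2} \cdot 3 + \frac{1}{6} \cdot 4 \approx 2.8333 \ \text{bits}$$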
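Example 2: Suppose that A has the following properties:

| A | p(a) | C(a) | l(a) |
|---|---|---|---|
| 1 | 1/8 | 001 | 3 |
| 2 | 1/2 | 01 | 2 |
| 3 | 1/4 | 0 | 1 |
| 4 | 1/8 | 1111 | 4 |

$$L(C_2) = \frac{1}{8} \cdot 3 + \frac{1}{2} \cdot 2 + \frac{1}{4} \cdot 1 + \frac{1}{8} \cdot 4 = \frac{17}{8} = 2.125 \ \text{bits}$$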
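Example 3: Find the average length for Y, which has the following properties:

| Y | p(y) | C(y) | l(y) |
|---|---|---|---|
| 1 | 1/2 | 0 | 1 |
| 2 | 1/4 | 10 | 2 |
| 3 | 1/8 | 110 | 3 |
| 4 | 1/8 | 111 | 3 |

$$L(C_3) = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = 1.75 \ \text{bits}$$
$$H(Y) = \frac{1}{2} \log_2 2 + \frac{1}{4} \log_2 4 + \frac{1}{8} \log_2 8 + \frac{1}{8} \log_2 8 = 1.75 \ \text{bits}$$
Here l(y) = −log2 p(y) for every y, so L(C_3) = H(Y).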
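The equality between average length and entropy in Example 3 can be confirmed term by term. The sketch below uses the probabilities and codewords from the table; the variable names are illustrative.

```python
import math

# Example 3: probabilities and codewords for Y.
p = {1: 1/2, 2: 1/4, 3: 1/8, 4: 1/8}
codeword = {1: "0", 2: "10", 3: "110", 4: "111"}

# Each codeword length equals -log2 p(y) because every p(y) is a power of 1/2.
lengths_match = all(len(codeword[y]) == -math.log2(p[y]) for y in p)

average_length = sum(p[y] * len(codeword[y]) for y in p)
entropy = -sum(p[y] * math.log2(p[y]) for y in p)

print(lengths_match, average_length, entropy)
```

When some p(y) is not a power of 1/2, −log2 p(y) is not an integer, so the lengths cannot match it exactly.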
2 CONCLUSION:
The question now is: can we actually use the code given in Example 2?
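One way to explore this question is to list every way a received bit string can be split into codewords. The sketch below does this for the code of Example 2; the helper name `parsings` is hypothetical, and the test string 001 is chosen only for illustration. A string with more than one parsing cannot be decoded unambiguously.

```python
def parsings(bits, codewords):
    """Return every way to split `bits` into a sequence of codewords."""
    if not bits:
        return [[]]
    results = []
    for word in codewords:
        if bits.startswith(word):
            for rest in parsings(bits[len(word):], codewords):
                results.append([word] + rest)
    return results

# Codewords of Example 2: C(1) = 001, C(2) = 01, C(3) = 0, C(4) = 1111.
example_2 = ["001", "01", "0", "1111"]

print(parsings("001", example_2))
```

Here the string 001 can be read either as the single codeword 001 or as 0 followed by 01, so the receiver cannot tell whether the message was the value 1 or the values 3, 2.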