Introduction - Team

Project 8: The concept of Entropy and its applications to information
and coding
Table of Contents
1. Abstract
2. Introduction
2.1. Historical Remarks
2.2. Topic High-level Overview
2.2.1. Information theory
2.2.2. Coding
2.2.3. Entropy
1. Abstract
2. Introduction
Much of current technology is concerned with the transmission and storage of information.
Most of the devices that we use daily either transmit/receive information (telephones, radios,
TV sets), store it (MP3 players, CDs, DVDs, memory sticks) or do both (computers, mobile
phones). This technology has been made possible by great advances in physics and
engineering, but also by the formulation of a mathematical theory of communication by Claude
Shannon in 1948.
The concept of entropy is central to the mathematical description of information. Given a
source that generates a sequence of symbols (e.g. a text, a numerical code, or the sequence of
nucleotides in the human genome), its entropy is a quantitative measure of how much
information it produces per unit time, and of how much memory is needed to store that
information. In most cases it is possible to encode the sequence of symbols in such a way as
to reduce the amount of memory needed, a process known as ``data compression''. The
entropy of the sequence tells us how much it can be compressed without distortion.
In this project we will define the notion of entropy and discuss some of its properties. The
theorems relevant to its applications to coding and data compression will be presented,
motivated with examples and proved. Students will solve a few problems in which the entropy
of sequences of symbols is estimated and the sequences coded and compressed.
The purpose of this section is to introduce the project "The concept of Entropy and its
applications to information and coding". It is based mainly on the paper published by Claude
Shannon in the Bell System Technical Journal in 1948, and should give the reader the
necessary grounding in the subject and prepare them for more advanced discussions of the
matter.
2.1. Historical Remarks
To start with, the term entropy was coined in 1865 by the German physicist Rudolf Clausius, from the
Greek words en - "in" and trope - "a turning", in analogy with energy. The entropy in the context of
thermodynamics can provide a measure of the amount of energy in a physical system that cannot be
used to do work.
Shannon's work was inspired by some of the developments in electrical communications at that
time (e.g. pulse-code and pulse-position modulation). Abstracting away from the means we use for
communication, its fundamental problem is that of reproducing at one point, either exactly or
approximately, a message selected at another point:
Diagram 1
Thus, the natural question to ask was how to make this information transfer more efficient.
Among the efficiency considerations that motivated Shannon were:
~ What is the most effective way to encode a message, given some information on its
statistical structure? Can we make use of this structure to make savings in the length of
the resulting code?
~ What is the effect of noise in the channel on the reproduction of the original message at the
recipient's end of the communication? Is it possible to minimise this impact?
It should also be noted that the Mathematical Theory of Communication was the first
systematic framework of its kind.
2.2. Topic High-level Overview
The three key aspects of this topic are:
Information theory
Coding
Entropy
2.2.1. Information theory
Information is measured in different units: characters, when describing the length of an
email message; digits, for the length of a telephone number; and so on. The convention in
information theory is to measure information in bits. A bit is a contraction of the term binary
digit, which is either a zero or a one. For example, there are 8 possible configurations of 3 bits:
000, 001, 010, 011, 100, 101, 110 and 111.
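As a quick sanity check, the 2^n configurations of n bits can be enumerated programmatically; a minimal Python sketch:

```python
from itertools import product

# Enumerate all configurations of n bits: there are 2**n of them.
n = 3
configs = ["".join(bits) for bits in product("01", repeat=n)]
print(configs)       # ['000', '001', '010', '011', '100', '101', '110', '111']
print(len(configs))  # 8, i.e. 2**3
```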
Information can be described in a number of forms such as:
A message received and understood
Knowledge acquired through study or experience or instruction
Formal accusation of a crime
Data: a collection of facts from which conclusions may be drawn; statistical data
An important observation made by Shannon with respect to the message being transmitted is that
what the message means semantically is irrelevant to the engineering task of communication. What is
essential though is that the message selected is one from the set of possible messages that the
communication system should be able to transmit. If this set has a finite number of elements, say N,
then this number or some function of this number can be used to determine the information given
when a particular message from the set is selected. It was R. V. L. Hartley who first pointed out
that the most natural choice would be a logarithmic function.
In order to illustrate this quite abstract proposition, let’s consider the following example:
Given an alphabet K = {A, B}, a message is a string made up of symbols from K. Suppose that the
part of the message received so far is a run of A's:
AAAAAAAA...
By just observing the pattern of symbols in the partly received text, we naturally expect the next
symbol to be an A as well. If it happens to be so, then the information gained by receiving the next
symbol is very little, since we anticipated the outcome beforehand. Conversely, if the next symbol is a
B, then we gain a lot of information by receiving it, as it is very unexpected.
This example highlights the relation between our expectation, or probability in the mathematical
sense, and the amount of information, as well as demonstrates that this relation is ‘inversely
proportional’ (not used in the strict mathematical sense here).
Table 1
Trying now to formalise the above example, in order to draw more mathematically precise conclusions
from it, we introduce the following definitions.
Let a source of information output symbols S_i from some alphabet K of size k (1 ≤ i ≤ k), each with probability P_i.
We define the information gained upon reading S_i as I(S_i).
In our example above, K = {A, B} with S_1 = A and S_2 = B. We also know that the probability distribution of symbols is
P_1 = 1 and P_2 = 0, i.e. we expect the message to contain only the symbol A and no occurrences of the symbol B.
From our discussion above, we require that the information gained from reading another symbol A
from the message is zero, as we are sure that is the symbol we are going to read. Hence, P_1 = 1 =>
I(S_1) = 0. Using the same argument, the information gained from encountering the symbol B somewhere
in the message should necessarily tend to infinity. Thus, P_2 = 0 => I(S_2) = ∞. Consolidating these
conclusions in the form of a diagram, the function relating probability and information gain should look
qualitatively similar to the one below:
Diagram 2: qualitative form of I(S_i) = f(P_i)
Also we will assume for simplicity that symbols in the message occur independently, i.e. we will not
allow any conditional probabilities at this stage (this more advanced topic will be discussed in detail in
later sections). In this case, we can expect with reason that the information gained by reading two
symbols, say 𝑆𝑖 and 𝑆𝑗 , will be the same as the information gained from reading symbol 𝑆𝑖 plus that of
𝑆𝑗 . This is a very intuitively anticipated result, as each time we receive an extra symbol from the
message, we add to our already existent knowledge received previously. Hence, information is
incremental in nature and this statement can be formally written as
I(S_i S_j) = I(S_i) + I(S_j)    (*)
Now, in the light of the above conclusions and assumptions we have made, we can try to guess what
𝐼(𝑆𝑖 ) = 𝑓(𝑃𝑖 ) is. The simplest example that might spring to one’s mind when seeing the somewhat
‘inversely proportional’ link between the two would be the function of the form
I(S_i) = f(P_i) = 1/P_i
However, it is very easy to see that this function is not defined for P_i = 0 and never attains
I(S_i) = 0, and it would be unreasonable to exclude such basic cases from our theory. Also, it
fails the additivity property:
I(S_i S_j) = 1/(P_i P_j) ≠ 1/P_i + 1/P_j = I(S_i) + I(S_j)
Remark: We have interpreted the probability of S_i and S_j appearing one after another in the message
as P_i · P_j, based on our assumption that these are independent events. Otherwise, by the definition of
conditional probability, P_ij = P_i · P_(j|i), where P_(j|i) is the conditional probability of S_j given S_i.
(The property in question would still apply in this case as well.)
Next we try out Hartley's suggestion to use the logarithmic function in order to quantify information,
and see that logically it does make a lot of sense. Suppose I(S_i) = f(P_i) = log(P_i). Then, using the
properties of the logarithmic function,
I(S_i S_j) = log(P_i P_j) = log(P_i) + log(P_j) = I(S_i) + I(S_j)
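As a quick numerical check of this additivity (the probabilities below are chosen purely for illustration):

```python
import math

# Two independent symbols with illustrative probabilities.
p_i, p_j = 0.5, 0.25

# Information of the pair vs. the sum of the individual informations (base 2).
lhs = math.log2(p_i * p_j)               # log(P_i * P_j)
rhs = math.log2(p_i) + math.log2(p_j)    # log(P_i) + log(P_j)
print(lhs, rhs)  # -3.0 -3.0, equal as required
```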
which matches property (*). It is also qualitatively similar to the function we depicted in Diagram 2
earlier, apart from the fact that P_i and I(S_i) should be 'inversely' related, with P_i < P_j → I(S_i) > I(S_j),
while the logarithmic function is clearly increasing:
Diagram 3
In order to make our f(P_i) decreasing, we simply reflect it about the x-axis by
introducing a minus sign, and f(P_i) becomes:
I(S_i) = f(P_i) = −log(P_i)
There is some ambiguity here, as the logarithm can be taken to different bases, and a particular
base has to be chosen to fix the unit of information. With log_2 the unit is called a bit, with
log_10 a hartley, and so on.
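The effect of the choice of base can be sketched in a few lines of Python (the probability here is illustrative; the natural-logarithm unit, the nat, is included for completeness):

```python
import math

p = 0.5  # illustrative probability of a symbol

# Self-information of the same event, expressed in different units:
bits     = -math.log2(p)    # base 2  -> bits
nats     = -math.log(p)     # base e  -> nats
hartleys = -math.log10(p)   # base 10 -> hartleys
print(bits, nats, hartleys)  # 1.0, ~0.693, ~0.301
```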
Another example is tossing a coin. The two possible outcomes are heads (H) and tails (T).
In a fair tossing experiment, we know the probabilities are P(H) = 1/2 and P(T) = 1/2. So if we
toss the coin once and get heads, we receive the following information:
I(H) = −log_2(1/2) = 1
Hence we can see that one toss gives us 1 bit of information.
The information in the whole message can be calculated using the formula I(L) = −∑_(i=1)^k n_i log P_i,
where k is the number of characters in the alphabet, L is the length of the string of characters, and n_i is
the frequency of character i. Note that n_i depends on L: the longer L is, the larger n_i is. The information
density is the information per character, I(L)/L.
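The formula can be sketched as follows; the message and the symbol probabilities are hypothetical, chosen only to illustrate the computation:

```python
import math
from collections import Counter

# Hypothetical message over the alphabet {'a', 'b'} with assumed
# symbol probabilities; the values are illustrative only.
message = "aababaaabb"
probs = {"a": 0.6, "b": 0.4}

counts = Counter(message)  # n_i: frequency of each character in the message
L = len(message)

# I(L) = -sum_i n_i * log2(P_i): total information in the message, in bits
total_info = -sum(n * math.log2(probs[c]) for c, n in counts.items())

# Information density: information per character, I(L) / L
density = total_info / L
print(total_info, density)
```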
2.2.2. Coding
The importance of coding and its relation to information theory arises from the need to transmit
information over some medium. And it is very unlikely that the source of information would produce
the message in exactly the form that is suitable for the medium to transmit. That is it's not generally
the case that the source feeding the channel and the channel itself have the same alphabets. Hence,
the message is first encoded into a signal by a transmitter and then the inverse operation is done by
the receiver, in order to transform the received data into the form accepted by the recipient. Thus,
when coding, it is vital to ensure code invertibility (reversibility).
Coding theory also helps design efficient and reliable data transmission methods, so that a code
of the shortest possible length can be created, with data redundancies removed and errors
detected and corrected.
The three classes of coding are:
1. Source coding – also known as data compression
2. Channel coding – also known as forward error correction
3. Joint source and channel coding
A simple example of coding could be ASCII, which is a character-encoding scheme based on the
ordering of the English alphabet. ASCII codes represent text in computers, communications
equipment, and other devices that use text:
Diagram 4
As can be seen from the code chart above, each ASCII character is encoded as a 7-bit integer (thus,
there are 2^7 = 128 characters in the set). E.g. the character 'A' corresponds to 1000001 (often padded
to the byte 01000001) and 'a' to 1100001 (01100001).
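This correspondence is easy to verify in, say, Python, using the built-in ord function:

```python
# ASCII assigns each character a 7-bit code; ord() returns the code point.
for ch in "Aa":
    code = ord(ch)
    print(ch, code, format(code, "07b"))
# A 65 1000001
# a 97 1100001
```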
However, ASCII is just an illustrative example of a character-encoding scheme used in most
modern digital computers. This scheme does not incorporate any notion of probability distribution
or other statistical information, while we are going to take a slightly different perspective on coding;
it will be dealt with in more detail in section XXX.
At this stage, it is appropriate to introduce the term entropy again, but this time in the context of
information theory: it is a measure of the uncertainty associated with a random variable and
quantifies the information contained in a message.
2.2.3. Entropy
The entropy of a random variable is a measure of randomness and can be shown to be a good
measure of the information contained in a variable. It also tells us the information density of the
particular variable, i.e. the average information per character. Equivalently, entropy quantifies the
average information content one is missing, when one does not know the value of the random
variable. Such a variable could be a message, a story book, a file that needs to be sent
via email, etc. The more random the variable is, the higher the entropy.
Very random → HOT → high entropy
Not random → FROZEN → low entropy
The entropy is the limit of the information density I(L)/L as L → ∞. It is denoted S and calculated
using the formula
S = lim_(L→∞) I(L)/L = lim_(L→∞) ( −∑_(i=1)^k (n_i/L) log P_i ) = −∑_(i=1)^k P_i log P_i
since the relative frequency n_i/L tends to P_i as L → ∞.
In this context, we will be referring to the Shannon entropy which quantifies the information contained
in a message, usually in units such as bits.
S = −∑_(i=1)^k P_i log P_i
where P_i is the probability of character i and −log P_i is the information of character i; S is then
the average information per character.
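The Shannon entropy formula can be sketched as a short Python function; the distributions below are illustrative:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy S = -sum_i P_i * log2(P_i), in bits per symbol.

    Terms with P_i = 0 contribute nothing (the limit of p*log p is 0).
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin has maximal entropy for two outcomes: 1 bit per toss.
print(shannon_entropy([0.5, 0.5]))  # 1.0

# A biased source is more predictable, hence lower entropy.
print(shannon_entropy([0.9, 0.1]))  # ~0.469
```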