Project 8: The concept of Entropy and its applications to information and coding

Table of Contents
1. Abstract
2. Introduction
2.1. Historical Remarks
2.2. Topic High-level Overview
2.2.1. Information theory
2.2.2. Coding
2.2.3. Entropy

1. Abstract
TO BE DONE AT THE END – BRIEF DESCRIPTION OF WHAT HAS BEEN DONE BY US IN THE PROJECT.

2. Introduction
Much of current technology is concerned with the transmission and storage of information. Most of the devices that we use daily either transmit/receive information (telephones, radios, TV sets), store it (MP3 players, CDs, DVDs, memory sticks) or do both (computers, mobile phones). This technology has been made possible by great advances in physics and engineering, but also by the formulation of a mathematical theory of communication by Claude Shannon in 1948.

The concept of entropy is central to the mathematical description of information. Given a source that generates a sequence of symbols (e.g. a text, a numerical code, or the sequence of nucleotides in the human genome), its entropy is a quantitative measure of how much information it produces per unit time, and of how much memory is needed to store that information. In most cases it is possible to encode the sequence of symbols in such a way as to reduce the amount of memory needed, a process known as "data compression". The entropy of the sequence tells us how much it can be compressed without distortion.

In this project we will define the notion of entropy and discuss some of its properties. The theorems relevant to its applications to coding and data compression will be presented, motivated with examples and proved. Students will solve a few problems in which the entropy of sequences of symbols is estimated and the sequences coded and compressed.

The purpose of this section is to give an introduction to the project "The concept of Entropy and its applications to information and coding". It is mainly based on the paper published by Claude Shannon in the Bell System Technical Journal in 1948, and should give the reader the necessary grounding in the subject and prepare for more advanced discussions on the matter.

2.1. Historical Remarks
The term entropy was coined in 1865 by the German physicist Rudolf Clausius, from the Greek words en ("in") and trope ("a turning"), in analogy with energy. In the context of thermodynamics, entropy provides a measure of the amount of energy in a physical system that cannot be used to do work. Shannon's work was inspired by some of the developments in electrical communications of his time (e.g. pulse-code and pulse-position modulation).
Abstracting away from the means we use for communication, its fundamental problem is that of reproducing at one point, either exactly or approximately, a message selected at another point (see Diagram 1).

Diagram 1

Thus, the natural question to ask was how to make this information transfer more efficient. Among the efficiency considerations that motivated Shannon were:
~ What is the most effective way to encode a message given some information on its statistical structure? Can we make use of this structure to make some savings in the length of the resulting code?
~ What is the effect of noise in the channel on the reproduction of the original message at the recipient end of the communication? Is it possible to minimise such impact?
It is also worth noting that the Mathematical Theory of Communication was the first systematic framework of its kind.

2.2. Topic High-level Overview
The three key aspects of this topic are:
Information theory
Coding
Entropy

2.2.1. Information theory
Information is measured in different units: characters, when describing the length of an email message; digits, for the length of a telephone number; and so on. The convention in information theory is to measure information in bits. A bit is a contraction of the term "binary digit", which is either a zero or a one. For example, there are 8 possible configurations of 3 bits: 000, 001, 010, 011, 100, 101, 110 and 111.

Information can be described in a number of forms, such as:
- a message received and understood;
- knowledge acquired through study, experience or instruction;
- a formal accusation of a crime;
- data: a collection of facts from which conclusions may be drawn, e.g. statistical data.

An important observation made by Shannon with respect to the message being transmitted is that what the message means semantically is irrelevant to the engineering task of communication. What is essential is that the message selected is one from the set of possible messages that the communication system should be able to transmit. If this set has a finite number of elements, say N, then this number, or some function of it, can be used to determine the information given when a particular message from the set is selected. R.V.L. Hartley was the first to point out that the most natural choice would be a logarithmic function.

In order to illustrate this quite abstract proposition, let us consider the following example. Given an alphabet K = {A, B}, a message is a string made up of symbols from K. Suppose we have received part of the message:

A A A A A

By just observing the pattern of symbols in the partly received text, we naturally expect the next symbol to be an A as well. If it happens to be so, then the information gained by receiving the next symbol is very little, since we anticipated the outcome beforehand. Conversely, if the next symbol is a B, then we gain a lot of information by receiving it, as it is very unexpected. This example highlights the relation between our expectation, or probability in the mathematical sense, and the amount of information, and demonstrates that this relation is 'inversely proportional' (not used in the strict mathematical sense here).

Information | Probability
High        | Low
Low         | High
Table 1

Trying now to formalise the above example, in order to draw more mathematically precise conclusions from it, we introduce the following definitions. Let a source of information output symbols S_i from some alphabet K of size k (i = 1, ..., k), each with probability P_i. We define the information gained upon reading S_i as I(S_i).
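Purely as an illustration of this definition (not part of Shannon's development), the following Python sketch models such a source for a two-symbol alphabet; the alphabet, the probabilities and the message length are assumptions chosen for demonstration.

```python
import random

# Hypothetical two-symbol source: alphabet K = {A, B} with assumed probabilities.
alphabet = ["A", "B"]
probabilities = [0.9, 0.1]   # P_1 = 0.9, P_2 = 0.1 (assumed for illustration)

# Emit a message of 20 symbols, each drawn independently with probability P_i.
random.seed(0)               # fixed seed so the sketch is reproducible
message = "".join(random.choices(alphabet, weights=probabilities, k=20))
print(message)               # e.g. a string dominated by 'A' with an occasional 'B'
```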
Example: let K = {A, B} with S_1 = A and S_2 = B, and suppose the probability distribution of symbols is P_1 = 1 and P_2 = 0, i.e. we expect the message to contain only the symbol A and no occurrences of the symbol B, e.g.

AAAAAAAAAAAAAAAAAAAAAAA...

From our discussion above, we require that the information gained from reading another symbol A from the message is zero, as we are sure that is the symbol we are going to read. Hence,

P_1 = 1  =>  I(S_1) = 0.

By the same argument, the information gained from encountering the symbol B somewhere in the message should tend to infinity. Thus,

P_2 = 0  =>  I(S_2) = ∞.

Consolidating these conclusions, the function relating probability and information gain should look qualitatively like the curve in Diagram 2: I(S_i) = f(P_i) decreases from ∞ at P_i = 0 to 0 at P_i = 1.

Diagram 2

We will also assume for simplicity that symbols in the message occur independently, i.e. we will not allow any conditional probabilities at this stage (this more advanced topic will be discussed in detail in later sections). In this case, we can reasonably expect that the information gained by reading two symbols, say S_i and S_j, will be the same as the information gained from reading S_i plus that of S_j. This is an intuitively anticipated result: each time we receive an extra symbol from the message, we add to the knowledge we have already received. Hence, information is incremental in nature, and this statement can be written formally as

I(S_i S_j) = I(S_i) + I(S_j)     (*)

Now, in the light of the above conclusions and assumptions, we can try to guess what I(S_i) = f(P_i) is. The simplest candidate that might spring to mind, given the somewhat 'inversely proportional' link between the two, is a function of the form

I(S_i) = f(P_i) = 1/P_i

However, this candidate fails at the basic cases: it is not defined for P_i = 0, and at P_i = 1 it gives I(S_i) = 1 rather than the required 0, and it would be unreasonable to exclude such cases from our theory. Moreover, it does not satisfy (*):

I(S_i S_j) = 1/(P_i P_j) ≠ 1/P_i + 1/P_j = I(S_i) + I(S_j)

Remark: we have interpreted the probability of S_i and S_j appearing one after another in the message as P_i P_j, based on our assumption that these are independent events. Otherwise, by the multiplication rule for conditional probabilities, P_{ij} = P_i P_{j|i}, where P_{j|i} is the conditional probability of S_j given S_i. (The property in question would still apply in this case as well.)

Next we try out Hartley's suggestion to use the logarithmic function in order to quantify information, and we see that it makes a lot of sense. Suppose I(S_i) = f(P_i) = log(P_i). Then, using the properties of the logarithm,

I(S_i S_j) = log(P_i P_j) = log(P_i) + log(P_j) = I(S_i) + I(S_j),

which matches property (*). It is also qualitatively equivalent to the function depicted in Diagram 2 earlier, apart from the fact that P_i and I(S_i) should be 'inversely' related, with P_i < P_j implying I(S_i) > I(S_j), while the logarithmic function is clearly increasing:

Diagram 3

In order to make f(P_i) decreasing, we simply reflect it about the x-axis by introducing a minus sign, and f(P_i) becomes

I(S_i) = f(P_i) = −log(P_i)

The base of the logarithm determines the unit of information, and a particular base has to be chosen to say how many units of information are received: with log_2 the unit is called a bit, with log_10 it is called a hartley, and so on.
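As a quick numerical check (a sketch only; the probability values are chosen arbitrarily), the snippet below evaluates I(S_i) = −log_2(P_i) for a few probabilities and verifies the additivity property (*) for a pair of independent symbols.

```python
import math

def information(p):
    """Information content, in bits, of a symbol with probability p."""
    return -math.log2(p)

# Rarer symbols carry more information (probabilities chosen for illustration).
for p in (0.9, 0.5, 0.1, 0.01):
    print(f"P_i = {p:<5} ->  I(S_i) = {information(p):.3f} bits")

# Additivity for independent symbols: I(S_i S_j) = I(S_i) + I(S_j).
p_i, p_j = 0.5, 0.1
assert math.isclose(information(p_i * p_j), information(p_i) + information(p_j))
```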
Example: another example is tossing a coin. The two possible outcomes are heads (H) and tails (T). In a fair tossing experiment the probabilities are P(H) = 1/2 and P(T) = 1/2. So if we toss the coin once and get heads, we receive the following information:

I(H) = −log_2(1/2) = 1

Remark: here the logarithm is taken to base 2, hence one toss gives us 1 bit of information.

The information in a whole message can be calculated using the formula

I(L) = −∑_{i=1}^{k} n_i log P_i

where k is the number of characters in the alphabet, L is the length of the string of characters, and n_i is the number of occurrences of character i in the string. Note that n_i depends on L: the longer L is, the larger n_i is. The information density is the information per character, I(L)/L.

2.2.2. Coding
The importance of coding and its relation to information theory arises from the need to transmit information over some medium, and it is very unlikely that the source of information would produce the message in exactly the form that is suitable for the medium to transmit. That is, it is not generally the case that the source feeding the channel and the channel itself have the same alphabets. Hence, the message is first encoded into a signal by a transmitter, and the inverse operation is performed by the receiver in order to transform the received data into the form accepted by the recipient. Thus, when coding, it is vital to ensure code invertibility (reversibility). Coding theory also helps design efficient and reliable data transmission methods, so that a suitable code with the shortest possible length can be created, with data redundancies omitted and errors corrected.

The three classes of coding are:
1. Source coding – also known as data compression
2. Channel coding – also known as forward error correction
3. Joint source and channel coding

A simple example of coding is ASCII, a character-encoding scheme based on the ordering of the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that use text:

Diagram 4

As can be seen from the code chart above, each ASCII character is encoded as a 7-bit integer (thus, there are 2^7 = 128 characters in the set). E.g. the character 'A' corresponds to 1000001 and 'a' to 1100001. However, ASCII is just an illustrative example of a character-encoding scheme used in most modern digital computers. This scheme does not have any notion of probability distribution or other statistical information incorporated in it, while we are going to take a slightly different perspective on codes. Coding will be dealt with in more detail in section XXX.

At this stage, it is appropriate to introduce the term entropy again, but this time in the context of information theory: it is a measure of the uncertainty associated with a random variable and quantifies the information contained in a message.

2.2.3. Entropy
The entropy of a random variable is a measure of randomness and can be shown to be a good measure of the information contained in the variable. It also tells us the information density of that variable, i.e. the average information per character. Equivalently, entropy quantifies the average information content one is missing when one does not know the value of the random variable. Such variables could be messages, the text of a story book, a file that needs to be sent via email, etc. The more random the variable is, the higher the entropy:

Very random  (HOT)     high entropy
Not random   (FROZEN)  low entropy

The entropy is the limit of the information density I(L)/L as L → ∞; the sketch below compares this per-character quantity for a highly repetitive string and a more varied one.
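The following sketch is illustrative only: the two strings are invented, and the empirical frequencies n_i/L are used as stand-ins for the true probabilities P_i. It computes I(L) and the information density I(L)/L, showing a much lower density for the nearly constant ("frozen") string.

```python
import math
from collections import Counter

def information_density(message):
    """Return (I(L), I(L)/L), estimating P_i by the empirical frequencies n_i / L."""
    L = len(message)
    counts = Counter(message)                                    # n_i for each character i
    total = -sum(n * math.log2(n / L) for n in counts.values())  # I(L) in bits
    return total, total / L

# A "frozen", highly predictable message versus a more "random" one.
for msg in ("AAAAAAAAAAAAAAAB", "ABRACADABRA HOCUS POCUS"):
    total, density = information_density(msg)
    print(f"{msg!r}: I(L) = {total:.2f} bits, density = {density:.2f} bits/char")
```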
Entropy is denoted by S and calculated as

S = lim_{L→∞} I(L)/L = lim_{L→∞} ( −∑_{i=1}^{k} (n_i/L) log P_i ) = −∑_{i=1}^{k} P_i log P_i,

since n_i/L → P_i as L → ∞. In this context we will be referring to the Shannon entropy, which quantifies the information contained in a message, usually in units such as bits:

S = −∑_{i=1}^{k} P_i log P_i

where P_i is the probability of character i and −log P_i is the information of character i, so that S is the average information per character.
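To close this overview, here is a minimal sketch of the formula; the two-symbol distributions are assumed purely for illustration. The fair coin attains the maximum of 1 bit per symbol, while a deterministic source has zero entropy.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy S = -sum(P_i * log2(P_i)) in bits, skipping zero-probability symbols."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Assumed example distributions over a two-symbol alphabet {A, B}.
print(shannon_entropy([0.5, 0.5]))    # fair coin: 1.0 bit per symbol (maximum)
print(shannon_entropy([0.9, 0.1]))    # biased source: about 0.469 bits per symbol
print(shannon_entropy([1.0, 0.0]))    # deterministic source: 0.0 bits (no uncertainty)
```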