Communication Systems (EEEg4172)
Chapter II: Information Theory and Coding
Instructor: Fisiha Abayneh
Acknowledgement: Materials prepared by Welelaw Y. & Habib M. on topics related to this chapter are used.

Chapter II: Information Theory and Coding
2.1 Information Theory Basics
2.2 Source Coding
2.3 Channel Coding

2.1 Information Theory Basics

Outline:
Introduction
Measure of Information
Discrete Memoryless Channels
Mutual Information
Channel Capacity

Introduction

Information theory is a field of study that deals with the quantification, storage, and communication of information. It was originally proposed by Claude E. Shannon in his famous paper entitled "A Mathematical Theory of Communication".

In general, information theory provides:
a quantitative measure of the information contained in message signals, and
a way to determine the capacity of a communication system to transfer information from source to destination.

Measure of Information

Information Source:
An information source is an object that produces an event, the outcome of which is selected at random according to a specific probability distribution. A practical information source in a communication system is a device that produces messages, and it can be either analog or discrete. A discrete information source is a source that has only a finite set of symbols as possible outputs. The set of source symbols is called the source alphabet, and the elements of the set are called symbols or letters.

Information sources can be classified as sources with memory or memoryless sources. A source with memory is one for which a current symbol depends on the previous symbols. A memoryless source is one for which each symbol produced is independent of the previous symbols. A discrete memoryless source (DMS) can be characterized by the set of its symbols, the probability assignment to the symbols, and the rate at which the source generates symbols.

Information Content of a DMS:
The amount of information contained in an event is closely related to its uncertainty. Messages conveying knowledge of highly probable events carry relatively little information, and if an event is certain, it conveys zero information. Thus, a mathematical measure of information should be a function of the probability of the outcome and should satisfy the following axioms:
1. Information should be proportional to the uncertainty of an outcome.
2. Information contained in independent outcomes should add.

Information Content of a Symbol:
Consider a DMS, denoted by X, with alphabet {x1, x2, ..., xm}. The information content of a symbol xi, denoted by I(xi), is defined by:

    I(xi) = log_b [1 / P(xi)] = -log_b P(xi)

where P(xi) is the probability of occurrence of symbol xi.

The unit of I(xi) depends on the base b:
If b = 2, the unit is the bit (also known as the shannon).
If b = 10, the unit is the decimal digit (or hartley).
If b = e, the unit is the nat.

Note that I(xi) satisfies the following properties:
i.   I(xi) = 0 for P(xi) = 1
ii.  I(xi) ≥ 0
iii. I(xi) > I(xj) if P(xi) < P(xj)
iv.  I(xi, xj) = I(xi) + I(xj) if xi and xj are independent
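As a quick illustration of this definition, the following minimal Python sketch evaluates I(xi) for a few illustrative probability values and checks the additivity property for independent symbols (the probabilities are chosen arbitrarily for illustration):

```python
import math

def information_content(p, base=2):
    """Information content I(x) = log_b(1 / P(x)) of a symbol with probability p."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return math.log(1.0 / p, base)

# A certain event carries no information; less probable events carry more.
print(information_content(1.0))    # 0.0 bits
print(information_content(0.5))    # 1.0 bit
print(information_content(0.125))  # 3.0 bits

# Independent symbols: I(xi, xj) = I(xi) + I(xj)
p_i, p_j = 0.5, 0.25
assert math.isclose(information_content(p_i * p_j),
                    information_content(p_i) + information_content(p_j))
```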
Average Information or Entropy:
In a practical communication system, we usually transmit long sequences of symbols from an information source. Thus, we are more interested in the average information that a source produces than in the information content of a single symbol.

The mean value of I(xi) over the alphabet of source X with m different symbols is given by:

    H(X) = E[I(xi)] = Σ (i=1 to m) P(xi) I(xi) = -Σ (i=1 to m) P(xi) log2 P(xi)   bits/symbol

The quantity H(X) is called the entropy of the source X. It is a measure of the average information content per source symbol. The source entropy H(X) can be considered as the average amount of uncertainty within source X that is resolved by use of the alphabet.

The source entropy H(X) satisfies the following relation:

    0 ≤ H(X) ≤ log2 m

where m is the size of the alphabet of source X. The lower bound corresponds to no uncertainty, which occurs when one symbol has probability P(xi) = 1 while P(xj) = 0 for all j ≠ i, so that X emits xi at all times. The upper bound corresponds to maximum uncertainty, which occurs when P(xi) = 1/m for all i, i.e., when all symbols are equally likely to be generated by X.

Discrete Memoryless Channels

Channel Representation:
A communication channel is the path or medium through which symbols flow from a source to a receiver. A discrete memoryless channel (DMC) is a statistical model with an input X and an output Y, as shown in the figure below.

The channel is discrete when the alphabets of X and Y are both finite. It is memoryless when the current output depends only on the current input and not on any of the previous inputs.

In the DMC shown above, the input X consists of input symbols x1, x2, ..., xm and the output Y consists of output symbols y1, y2, ..., yn. Each possible input-to-output path is labeled with a conditional probability P(yj/xi), called the channel transition probability.

Channel Matrix:
A channel is completely specified by the complete set of transition probabilities. The channel matrix, denoted by [P(Y/X)], is given by:

    [P(Y/X)] = [ P(y1/x1)  P(y2/x1)  ...  P(yn/x1) ]
               [ P(y1/x2)  P(y2/x2)  ...  P(yn/x2) ]
               [   ...       ...     ...    ...    ]
               [ P(y1/xm)  P(y2/xm)  ...  P(yn/xm) ]

Since each input to the channel results in some output, each row of the channel matrix must sum to unity, i.e.:

    Σ (j=1 to n) P(yj/xi) = 1   for all i

Special Channels:

1. Lossless Channel
A channel described by a channel matrix with only one non-zero element in each column is called a lossless channel. An example of a lossless channel is shown in the figure below.

2. Deterministic Channel
A channel described by a channel matrix with only one non-zero element, equal to unity, in each row is called a deterministic channel. An example of a deterministic channel is shown in the figure below.

3. Noiseless Channel
A channel is called noiseless if it is both lossless and deterministic. A noiseless channel is shown in the figure below. Its channel matrix has only one element in each row and each column, and this element is unity.

4. Binary Symmetric Channel
The binary symmetric channel (BSC) with crossover probability p is defined by the channel matrix and channel diagram given below:

    [P(Y/X)] = [ 1-p    p  ]
               [  p    1-p ]
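The following minimal Python sketch builds the BSC channel matrix for a generic crossover probability p, checks that each row sums to unity, and computes the output probabilities P(yj) = Σ_i P(xi) P(yj/xi) for an equiprobable input (the value p = 0.1 is chosen only for illustration):

```python
def bsc_matrix(p):
    """Channel matrix [P(Y/X)] of a binary symmetric channel with crossover probability p."""
    return [[1 - p, p],
            [p, 1 - p]]

def output_probs(px, channel):
    """P(yj) = sum over i of P(xi) * P(yj/xi): output distribution of a DMC."""
    m, n = len(channel), len(channel[0])
    return [sum(px[i] * channel[i][j] for i in range(m)) for j in range(n)]

P = bsc_matrix(0.1)                                   # illustrative crossover probability
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)  # each row sums to unity
print(output_probs([0.5, 0.5], P))                    # [0.5, 0.5] for equiprobable input
```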
Mutual Information

Conditional and Joint Entropies:
Using the input probabilities P(xi), the output probabilities P(yj), the transition probabilities P(yj/xi), and the joint probabilities P(xi, yj), we can define the following entropy functions for a channel with m inputs and n outputs:

    H(X)   = -Σ (i=1 to m) P(xi) log2 P(xi)
    H(Y)   = -Σ (j=1 to n) P(yj) log2 P(yj)
    H(X/Y) = -Σ (j=1 to n) Σ (i=1 to m) P(xi, yj) log2 P(xi/yj)
    H(Y/X) = -Σ (j=1 to n) Σ (i=1 to m) P(xi, yj) log2 P(yj/xi)
    H(X,Y) = -Σ (j=1 to n) Σ (i=1 to m) P(xi, yj) log2 P(xi, yj)

These entropies can be interpreted as average uncertainties of the inputs and outputs: H(X/Y) is the average uncertainty remaining about the channel input after the output has been observed, H(Y/X) is the average uncertainty about the output given the input, and H(X,Y) is the average uncertainty of the channel as a whole.

Two useful relationships among the above entropies are:

    H(X,Y) = H(X/Y) + H(Y)
    H(X,Y) = H(Y/X) + H(X)

Mutual Information:
The mutual information I(X;Y) of a channel is defined by:

    I(X;Y) = H(X) - H(X/Y)   bits/symbol

Since H(X) represents the uncertainty about the channel input before the output is observed and H(X/Y) the uncertainty that remains afterwards, I(X;Y) measures the information about the input that is conveyed by observing the output. Mutual information satisfies:

    I(X;Y) = I(Y;X)
    I(X;Y) ≥ 0
    I(X;Y) = H(Y) - H(Y/X)
    I(X;Y) = H(X) + H(Y) - H(X,Y)

Channel Capacity

The channel capacity per symbol of a DMC is defined as:

    Cs = max over {P(xi)} of I(X;Y)   bits/symbol

where the maximization is over all possible input probability distributions {P(xi)}. If the channel transmits r symbols per second, the channel capacity per second is C = r·Cs bits/s.

Additive White Gaussian Noise (AWGN) Channel

In a continuous channel, an information source produces a continuous signal x(t). The set of possible signals is considered as an ensemble of waveforms generated by some ergodic random process.

Differential Entropy:
The average amount of information per sample value of x(t) is measured by:

    H(X) = -∫ f_X(x) log2 f_X(x) dx

where f_X(x) is the probability density function of X. The quantity H(X) defined above is known as the differential entropy of X.

The average mutual information in a continuous channel is defined, by analogy with the discrete case, as:

    I(X;Y) = H(X) - H(X/Y) = H(Y) - H(Y/X)

In an additive white Gaussian noise (AWGN) channel, the channel output Y is given by:

    Y = X + n

where X is the channel input and n is additive band-limited white Gaussian noise with zero mean and variance σ^2. The capacity Cs of an AWGN channel is given by:

    Cs = (1/2) log2 (1 + S/N)   bits/sample

where S/N is the signal-to-noise ratio at the channel output.

If the channel bandwidth B Hz is fixed, then the output y(t) is also a band-limited signal, completely characterized by its periodic sample values taken at the Nyquist rate of 2B samples/s. The capacity C (in b/s) of the AWGN channel is then given by:

    C = B log2 (1 + S/N)   b/s

This equation is known as the Shannon-Hartley law.

Examples on Information Theory and Coding

Example-1:
A DMS X has four symbols x1, x2, x3, x4 with probabilities P(x1) = 0.4, P(x2) = 0.3, P(x3) = 0.2 and P(x4) = 0.1.
a. Calculate H(X).
b. Find the amount of information contained in the messages x1x2x3x4 and x4x3x3x2.

Solution:
a.  H(X) = -Σ (i=1 to 4) P(xi) log2 P(xi)
         = -0.4 log2 0.4 - 0.3 log2 0.3 - 0.2 log2 0.2 - 0.1 log2 0.1
         = 1.85 bits/symbol

b.  P(x1x2x3x4) = (0.4)(0.3)(0.2)(0.1) = 0.0024
    I(x1x2x3x4) = -log2 0.0024 = 8.7 bits

    Similarly,
    P(x4x3x3x2) = (0.1)(0.2)^2(0.3) = 0.0012
    I(x4x3x3x2) = -log2 0.0012 = 9.7 bits

Example-2:
A high-resolution black-and-white TV picture consists of about 2×10^6 picture elements and 16 different brightness levels. Pictures are repeated at the rate of 32 per second. All picture elements are assumed to be independent, and all levels have equal likelihood of occurrence. Calculate the average rate of information conveyed by this TV picture source.

Solution:
    H(X) = -Σ (i=1 to 16) (1/16) log2 (1/16) = log2 16 = 4 bits/element
    r = 2×10^6 × 32 = 64×10^6 elements/s
    R = r·H(X) = 64×10^6 × 4 = 256×10^6 bits/s = 256 Mbps
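The numerical results of Examples 1 and 2 can be reproduced with a short Python sketch such as the one below:

```python
import math

def entropy(probs):
    """H(X) = -sum P(xi) * log2 P(xi), in bits/symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Example-1: the four-symbol DMS
probs = [0.4, 0.3, 0.2, 0.1]
print(round(entropy(probs), 2))          # 1.85 bits/symbol

# Information contained in the message x1 x2 x3 x4 (independent symbols)
p_msg = 0.4 * 0.3 * 0.2 * 0.1            # 0.0024
print(round(-math.log2(p_msg), 1))       # 8.7 bits

# Example-2: 2e6 picture elements, 16 equiprobable levels, 32 pictures per second
H = entropy([1 / 16] * 16)               # 4 bits/element
R = 2e6 * 32 * H                         # information rate in bits/s
print(R / 1e6, "Mbps")                   # 256.0 Mbps
```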
Example-3:
Consider the binary channel shown below.
a. Find the channel matrix of the channel.
b. Find P(y1) and P(y2) when P(x1) = P(x2) = 0.5.
c. Find the joint probabilities P(x1, y2) and P(x2, y1) when P(x1) = P(x2) = 0.5.

Solution:
a. The channel matrix is given by:

    [P(Y/X)] = [ P(y1/x1)  P(y2/x1) ] = [ 0.9  0.1 ]
               [ P(y1/x2)  P(y2/x2) ]   [ 0.2  0.8 ]

b.  [P(Y)] = [P(X)] [P(Y/X)]
           = [0.5  0.5] [ 0.9  0.1 ]
                        [ 0.2  0.8 ]
           = [0.55  0.45] = [P(y1)  P(y2)]

    Hence P(y1) = 0.55 and P(y2) = 0.45.

c.  [P(X,Y)] = [P(X)]_d [P(Y/X)]
             = [ 0.5   0  ] [ 0.9  0.1 ]
               [  0   0.5 ] [ 0.2  0.8 ]
             = [ 0.45  0.05 ] = [ P(x1,y1)  P(x1,y2) ]
               [ 0.10  0.40 ]   [ P(x2,y1)  P(x2,y2) ]

    where [P(X)]_d is the diagonal matrix with the input probabilities on its diagonal.
    Hence P(x1, y2) = 0.05 and P(x2, y1) = 0.1.

Example-4:
An information source can be modeled as a band-limited process with a bandwidth of 6 kHz. This process is sampled at a rate higher than the Nyquist rate to provide a guard band of 2 kHz. It is observed that the resulting samples take values in the set {-4, -3, -1, 2, 4, 7} with probabilities 0.2, 0.1, 0.15, 0.05, 0.3 and 0.2, respectively. What is the entropy of the discrete-time source in bits/sample? What is the entropy in bits/sec?

Solution:
The entropy of the source is:

    H(X) = -Σ (i=1 to 6) P(xi) log2 P(xi) = 2.4087 bits/sample

The sampling rate is:

    fs = 2 kHz + 2×(6 kHz) = 14 kHz

This means that 14000 samples are taken each second. The entropy of the source in bits per second is therefore:

    2.4087 (bits/sample) × 14000 (samples/sec) = 33721.8 bits/sec ≈ 33.72 kbps

2.2 Source Coding

Outline:
Code length and code efficiency
Source coding theorem
Classification of codes
Kraft inequality
Entropy coding

Source Coding

The conversion of the output of a DMS into a sequence of binary symbols (binary code words) is called source coding. The device that performs this conversion is called the source encoder. An objective of source coding is to minimize the average bit rate required to represent the source by reducing the redundancy of the information source.

Code Length and Code Efficiency

Let X be a DMS with finite entropy H(X) and an alphabet {x1, x2, ..., xm} with corresponding probabilities of occurrence P(xi), i = 1, 2, ..., m. An encoder assigns a binary code (a sequence of bits) of length ni bits to each symbol xi in the alphabet of the DMS. The binary code that represents a symbol xi is known as a code word, and the length of a code word (the code length) is the number of binary digits in it, i.e., ni.

The average code word length L, per source symbol, is given by:

    L = Σ (i=1 to m) P(xi) ni

The parameter L represents the average number of bits per source symbol used in the source coding process.

The code efficiency η is defined as:

    η = Lmin / L

where Lmin is the minimum possible value of L. When η approaches unity, the code is said to be efficient.

The code redundancy γ is defined as:

    γ = 1 - η

Source Coding Theorem

The source coding theorem states that for a DMS X with entropy H(X), the average code word length L per symbol is bounded as:

    L ≥ H(X)

Furthermore, L can be made as close to H(X) as desired by employing efficient coding schemes. Thus, with Lmin = H(X), the code efficiency can be written as:

    η = H(X) / L
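As an illustration of these definitions, the following Python sketch computes L, η and γ for a hypothetical four-symbol source and an assumed set of code word lengths (both chosen only for illustration; the lengths correspond to the prefix-free code words 0, 10, 110, 111):

```python
import math

def entropy(probs):
    """H(X) = -sum P(xi) * log2 P(xi), in bits/symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def avg_code_length(probs, lengths):
    """L = sum P(xi) * ni, the average code word length in bits/symbol."""
    return sum(p * n for p, n in zip(probs, lengths))

# Hypothetical source and code word lengths, for illustration only
probs   = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]                # e.g. code words 0, 10, 110, 111

L = avg_code_length(probs, lengths)   # 1.75 bits/symbol
H = entropy(probs)                    # 1.75 bits/symbol, so L = H(X) here
eta = H / L                           # code efficiency = 1.0
gamma = 1 - eta                       # code redundancy = 0.0
print(L, H, eta, gamma)
```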
Classifications of Codes

There are several types of codes. Consider the table below, which lists six different binary codes for the symbols of a four-symbol source, to describe the different types of codes.

    Symbol   Code 1   Code 2   Code 3   Code 4   Code 5   Code 6
    x1       00       00       0        0        0        1
    x2       01       01       1        10       01       01
    x3       00       10       00       110      011      001
    x4       11       11       11       111      0111     0001

1. Fixed-Length Codes:
A fixed-length code is one whose code word length is fixed. Code 1 and Code 2 of the above table are fixed-length codes with length 2.

2. Variable-Length Codes:
A variable-length code is one whose code word length is not fixed. All codes of the above table except Codes 1 and 2 are variable-length codes.

3. Distinct Codes:
A code is distinct if each code word is distinguishable from the other code words. All codes of the above table except Code 1 are distinct codes.

4. Prefix-Free Codes:
A code is said to be prefix-free when the code words are distinct and no code word is a prefix of another code word. Codes 2, 4 and 6 in the above table are prefix-free codes.

5. Uniquely Decodable Codes:
A distinct code is uniquely decodable if the original source sequence can be reconstructed perfectly from the encoded binary sequence. Note that Code 3 of the table above is not uniquely decodable: for example, the binary sequence 1001 may correspond to the source sequence x2x3x2 or to x2x1x1x2.

A sufficient condition to ensure that a code is uniquely decodable is that no code word is a prefix of another. Thus, the prefix-free Codes 2, 4 and 6 are uniquely decodable. Note, however, that the prefix-free condition is not a necessary condition for unique decodability. For example, Code 5 of the table above does not satisfy the prefix-free condition, yet it is uniquely decodable, since the bit 0 indicates the beginning of each code word: when the decoder finds a 0, it knows a new code word is starting, but it has to wait until the next 0 arrives to identify which code word was received.

6. Instantaneous Codes:
A uniquely decodable code is called an instantaneous code if the end of any code word is recognizable without examining subsequent code symbols. Instantaneous codes have the property, mentioned previously, that no code word is a prefix of another code word. For this reason, prefix-free codes are sometimes called instantaneous codes.

7. Optimal Codes:
A code is said to be optimal if it is instantaneous and has minimum average length L for a given source with a given probability assignment for the source symbols.

Kraft Inequality

Let X be a DMS with alphabet {xi}, i = 1, 2, ..., m. Assume that the length of the assigned binary code word corresponding to xi is ni. A necessary and sufficient condition for the existence of an instantaneous binary code is:

    Σ (i=1 to m) 2^(-ni) ≤ 1

which is known as the Kraft inequality (also referred to as the Kraft-McMillan inequality). Note that the Kraft inequality assures us of the existence of an instantaneously decodable code with code word lengths that satisfy the inequality.
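A minimal Python sketch of a Kraft-inequality check is given below; the code word lengths used in the two calls are arbitrary examples:

```python
def kraft_sum(lengths, r=2):
    """Sum of r**(-ni) over all code word lengths ni."""
    return sum(r ** (-n) for n in lengths)

def instantaneous_code_exists(lengths):
    """True if an instantaneous (prefix-free) binary code with the given
    code word lengths can exist, i.e. if the Kraft inequality holds."""
    return kraft_sum(lengths) <= 1

print(instantaneous_code_exists([1, 2, 3, 3]))  # True:  1/2 + 1/4 + 1/8 + 1/8 = 1
print(instantaneous_code_exists([1, 1, 2]))     # False: 1/2 + 1/2 + 1/4 > 1
```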
Entropy Coding

The design of a variable-length code such that its average code word length approaches the entropy of the DMS is often referred to as entropy coding. Two commonly used entropy coding schemes are presented in this section:
i.  Shannon-Fano coding
ii. Huffman coding

1. Shannon-Fano Coding:
An efficient code can be obtained by the following simple procedure, known as the Shannon-Fano algorithm:
1. List the source symbols in order of decreasing probability.
2. Partition the set into two sets that are as close to equiprobable as possible, and assign 0 to the symbols in the upper set and 1 to the symbols in the lower set.
3. Continue this process, each time partitioning the sets into subsets with probabilities as nearly equal as possible and assigning 0s to the upper and 1s to the lower subsets, until further partitioning is not possible.
4. Collect the 0s and 1s assigned in each step, in order from the first step to the last, to form the code word for each symbol.

Note that in Shannon-Fano encoding an ambiguity may arise in the choice of the approximately equiprobable sets. An example of Shannon-Fano encoding is shown in the table below.

2. Huffman Coding:
Huffman encoding employs an algorithm that results in an optimum code; that is, the code with the highest efficiency (minimum average code word length) for a given source. The Huffman encoding procedure is as follows:
1. List the source symbols in order of decreasing probability.
2. Combine the probabilities of the two symbols having the lowest probabilities, and reorder the resulting probabilities. This step is known as reduction.
3. Repeat the reduction step until only two ordered probabilities are left.
4. Start encoding with the last reduction, which contains exactly two ordered probabilities: assign 0 to the first and 1 to the second probability.
5. Go back one reduction step and assign 0 and 1 as a second digit to the two probabilities that were combined in the last reduction, keeping the digit assigned in the previous step for the probability that was not combined.
6. Repeat this process back to the first column. The resulting digit sequences are the code words for the symbols.

An example of Huffman encoding is shown in the table below.
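In addition to the tabular reduction procedure described above, the Huffman algorithm can be sketched in a few lines of Python using a priority queue; the five-symbol source probabilities below are hypothetical and chosen only for illustration:

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a binary Huffman code for a dict {symbol: probability}.
    Returns a dict {symbol: code word string}."""
    tiebreak = count()  # keeps heap entries comparable when probabilities are equal
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # the two lowest probabilities ...
        p2, _, c2 = heapq.heappop(heap)
        # ... are combined; one branch is prefixed with 0, the other with 1
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical five-symbol source, for illustration only
probs = {"x1": 0.4, "x2": 0.2, "x3": 0.2, "x4": 0.1, "x5": 0.1}
code = huffman_code(probs)
L = sum(probs[s] * len(w) for s, w in code.items())
print(code)
print("average code word length:", L)   # 2.2 bits/symbol for this source
```

The priority queue simply automates the repeated "combine the two lowest probabilities" reduction step, and prefixing 0 and 1 while merging reproduces the backward assignment of digits described in steps 4 to 6.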
2.3 Channel Coding

Outline:
Introduction
Channel coding theorem
Channel coding types
Linear block codes
Convolutional codes

Channel Coding

The channel encoder introduces systematic redundancy into the data stream by adding bits to the message signal, in order to facilitate the detection and correction of bit errors in the original binary message at the receiver. The channel decoder at the receiver exploits this redundancy to decide which message bits were actually transmitted. The combined objective of the channel encoder and decoder is to minimize the effect of channel noise.

The figure below shows a basic block diagram representation of channel coding: a binary message sequence enters the channel encoder, the coded sequence passes through the DMC (which is affected by noise), and the channel decoder produces the decoded binary sequence.

Channel Coding Theorem

The channel coding theorem for a DMC is stated in two parts. Let a DMS X have an entropy of H(X) bits/symbol and produce a symbol every Ts seconds, and let a DMC have a capacity of Cs bits/symbol and be used once every Tc seconds. Then:

1. If

    H(X)/Ts ≤ Cs/Tc

there exists a coding scheme for which the source output can be transmitted over the channel with an arbitrarily small probability of error.

2. Conversely, if

    H(X)/Ts > Cs/Tc

it is not possible to transmit information over the channel with an arbitrarily small probability of error.

Note that the channel coding theorem only asserts the existence of such codes; it does not tell us how to construct them.

Channel Coding Types

Channel coding schemes can be classified into two types:
1. Block codes
2. Convolutional codes

Commonly, block codes are used in the channel encoding process. A brief introduction to these codes is provided below.

Block Codes

A block code is a code in which all code words have the same length. In block codes, the binary message or data sequence of the source is divided into sequential blocks, each k bits long, and the channel encoder maps each k-bit block to an n-bit block. The channel encoder thus performs a mapping

    T: U → V

where U is the set of binary data words of length k and V is the set of binary code words of length n, with n > k.

The n-bit blocks produced by the channel encoder are also known as channel symbols. The resulting block code is called an (n, k) block code. The k-bit blocks form 2^k distinct message sequences, referred to as k-tuples. The n-bit blocks can form as many as 2^n distinct sequences, referred to as n-tuples. The set of all n-tuples defined over K = {0, 1} is denoted by K^n. Each of the 2^k data words is mapped to a unique code word. The ratio k/n is called the code rate.

Binary Field:
The set K = {0, 1} is known as the binary field. The binary field has two operations, addition and multiplication, such that the results of all operations are in K. The rules of addition and multiplication are as follows:

    Addition:       0 + 0 = 0,  1 + 1 = 0,  0 + 1 = 1 + 0 = 1
    Multiplication: 0 · 0 = 0,  1 · 1 = 1,  0 · 1 = 1 · 0 = 0

Linear Block Codes

Linearity:
Let a = (a1, a2, ..., an) and b = (b1, b2, ..., bn) be two code words in a code C. The sum of a and b, denoted by a + b, is defined by (a1 + b1, a2 + b2, ..., an + bn). A code C is called linear if the sum of any two code words is also a code word in C. A linear code C must contain the all-zero code word 0 = (0, 0, ..., 0), since a + a = 0.

Hamming Weight and Distance:
Let c be a code word of length n. The Hamming weight of c, denoted by w(c), is the number of 1s in c. Let a and b be code words of length n. The Hamming distance between a and b, denoted by d(a, b), is the number of positions in which a and b differ. Thus, the Hamming weight of a code word c is the Hamming distance between c and 0, that is:

    w(c) = d(c, 0)

Similarly, the Hamming distance can be written in terms of the Hamming weight as:

    d(a, b) = w(a + b)

There are many types of linear block codes, such as:
Hamming codes
Repetition codes
Reed-Solomon codes
Reed-Muller codes
Polynomial codes, etc.

Reading Assignment:
Minimum distance
Error detection and correction capabilities
Generator matrix
Parity check matrix
Syndrome decoding
Cyclic codes

Convolutional Codes

In this type of encoding, every code word of channel symbols is formed as a weighted sum of various input message symbols. The operation is similar to the convolution used in LTI systems to find the output when the input and the impulse response are known. In general, the resulting code sequence is a convolution of the input bit sequence with the generator sequences of the convolutional encoder. Convolutional encoders offer greater simplicity of implementation, but not more protection, than an equivalent block code. They are used in GSM mobile systems, voiceband modems, and satellite and military communications.

References:
Simon Haykin, Communication Systems, 4th edition (Chapter 9).
B.P. Lathi, Modern Digital and Analog Communication Systems, 3rd edition (Chapter 15).
Henk C.A. van Tilborg, Coding Theory, April 3, 1993.

Additional Reading:
https://en.wikipedia.org/wiki/Information_theory
https://en.wikipedia.org/wiki/Coding_theory
https://en.wikipedia.org/wiki/Noisy-channel_coding_theorem