MATH 1020: Mathematics For Non-science Chapter 4.1: Informatics Instructor: Prof. Ken Tsang Room E409-R9 Email: kentsang@uic.edu.hk 1 Informatics – the science of information – What’s information – Correcting errors in transmitted messages – Genetic code and information – Data compression – Cryptography 2 Data, messages and information “Message” is a generic term, something that conveys the information. Data are records of messages. Data/messages may or may not be comprehensible to us. Information is the knowledge that helps us to understand the world around us. Information is derived/obtained from data/messages. 3 Messages • Messages can be a smell: female insects can attract males with pheromones 4 Messages • Messages can be a sound: ring, noise, applause, music, speech 5 Messages • Messages can be visual: color, shape, design 6 Writings • More than 80,000 characters are used to code the Chinese language 7 Writings • Ancient Egyptians used hieroglyphs to code sounds and words 8 Writings • The Japanese language also uses the 96 Hiragana characters to code syllables 9 Writings • Modern numbers are coded with 10 digits created by Indians and transmitted to Europeans through the Arabs 10 Writings • George BOOLE (1815-1864) used only two characters to code logical values: 0 and 1 11 Writings • John von NEUMANN (1903-1957) developed the concept of programming, also using the binary system of 0 and 1 to code all possible information 12 Binary Codes: the basis of data CD, MP3, and DVD players, digital TV, cell phones, the Internet, the GPS system, etc. all represent data as strings of 0’s and 1’s rather than digits 0-9 and letters A-Z Whenever information needs to be digitally transmitted from one location to another, or stored in an electronic device, a binary code is used 13 Binary Codes A binary code is a system for encoding data made up of 0’s and 1’s. Examples – UPC (universal product code, dark = 1, light = 0) – Morse code (dash = 1, dot = 0) – Braille (raised bump = 1, flat surface = 0) – Yi-jing易经 (yin = 0, yang = 1) 14 The Challenges in an Information Age Mathematical Challenges in the Digital Revolution How to detect and correct errors in data transmission How to electronically send and store information economically/efficiently How to ensure security of transmitted data 15 Claude Shannon (1916-2001) “Father of Information Theory” Claude Shannon’s 1948 paper “A Mathematical Theory of Communication” is the paper that made the digital world we live in possible. 16 A Brief Introduction: Information Theory Information theory is a branch of science that deals with the analysis of a communications system We will study digital communications – using a file as the channel Source of Message → Encoder → Channel (with NOISE) → Decoder → Destination of Message Very precise definition of information as a message made up of symbols from some finite alphabet. It applies to any form of messages. A human communication system: ASCII. A dictionary defines code as “a system of symbols for communication.” Information is “communication between an encoder and decoder using agreed-upon symbols.” 18 Information content The information content of a message is a quantifiable amount and is always positive. The information content of a message is inversely related to the probability that the message will be received from the set of all possible messages. The message with the lowest probability of being received contains the highest information content.
19 Information content -1 The information content I of a message is defined as the negative of the base-2 logarithm of its probability p: I = log2(1/p) = -log2(p). If p = 1, then I = 0. A message that is certain to occur has no information content. The logarithmic measure is justified by the desire for information to be additive. We want the algebra of our measures to follow the rules of probability. When independent packets of information arrive, we would like to say that the total information received is the sum of the individual pieces. 20 I(Xj) = log2(1/P(Xj)) = -log2 Pj Suppose we have an event X, where Xj represents a particular outcome of the event. Consider flipping a fair coin; there are two equi-probable outcomes: – say X0 = heads, P0 = 1/2, X1 = tails, P1 = 1/2 The amount of information for any single result is 1 bit, the unit of information. In other words, the number of bits required to communicate the result of the event is 1 bit. When outcomes are equally likely, there is a lot of information in the result. The higher the likelihood of a particular outcome, the less information that outcome conveys. However, if the coin is biased such that it lands with heads up 99% of the time, there is not much information conveyed when we flip the coin and it lands on heads. Information content -2 Suppose we speak a funny language with only 4 letters in its alphabet, and we play a game of questions and answers to determine which letter I have chosen. You can ask me questions and I will only answer `yes' or `no.' What is the minimum number of such questions that you must ask in order to guarantee finding the answer? “Is it A?”, “Is it B?" ...or is there some more intelligent way to solve this problem? 23 Information content -3 The answer to a Yes/No (binary) question having equally probable answers conveys one bit worth of information. In the above example with 4 (equally probable) letters, the minimum number of questions needed to discover the answer is equal to 2, even though there are 4 possibilities. In English, with 26 letters, the minimum number of questions needed to discover the answer is I = log2(26) ≈ 4.7. 24 Information Complexity of Coded Messages Let’s think about written numbers: – k digits → 10^k possible messages How about written English? – k letters → 26^k possible messages – k words → D^k possible messages, where D is the English dictionary size ∴ Length ~ content ~ log(complexity) 25 Information Entropy (熵) H(p(X)) = Σx p(x) log2(1/p(x)) = E[log2(1/p(x))] • The expected content (length in bits) of a binary message with symbols {x} – other common descriptions: “code complexity”, “uncertainty”, “expected surprise”, the minimum number of bits required to store data, etc. 26 Entropy is a measure of the average uncertainty of information – A highly predictable sequence contains little actual information Example: 11011011011011011011011011 (what’s next?) – A completely unpredictable sequence of n bits contains n bits of information Example: 01000001110110011010010000 (what’s next?)
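The two quantities just defined can be checked numerically. The following is a minimal Python sketch (not part of the original slides; the function names are my own) that computes the information content I = log2(1/p) of a single outcome and the entropy of a whole set of outcomes; the numbers reproduce the coin and four-letter examples discussed on these slides.

from math import log2

def information_content(p):
    # I = log2(1/p): bits of information carried by an outcome of probability p
    return log2(1 / p)

def entropy(probabilities):
    # H = sum of p * log2(1/p): the expected (average) information content
    return sum(p * log2(1 / p) for p in probabilities if p > 0)

# A fair coin: each outcome carries 1 bit, and the average is 1 bit.
print(information_content(0.5))    # 1.0
print(entropy([0.5, 0.5]))         # 1.0

# A heavily biased coin: "heads" is hardly news at all.
print(information_content(0.99))   # about 0.0145 bits
print(entropy([0.98, 0.02]))       # about 0.1414 bits (the biased-coin example below)

# Four equally likely letters: log2(4) = 2 yes/no questions are needed.
print(information_content(1 / 4))  # 2.0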
X – a random variable with a discrete set of possible outcomes – (X0, X1, X2, … Xn-1), where n is the total number of possibilities. Entropy = Σ (j = 0 to n-1) Pj log2(1/Pj) = -Σ (j = 0 to n-1) Pj log2 Pj Entropy is greatest when the probabilities of the outcomes are equal. Let’s consider our fair coin experiment again: the entropy H = ½ log2 2 + ½ log2 2 = 1. Since each outcome has information content of 1, the average over the 2 outcomes is (1+1)/2 = 1. Consider a biased coin, P(H) = 0.98, P(T) = 0.02: H = 0.98 · log2(1/0.98) + 0.02 · log2(1/0.02) = 0.98 · 0.0291 + 0.02 · 5.6439 = 0.0285 + 0.1129 = 0.1414 Entropy of a coin flip Entropy H(X) of a coin flip, measured in bits, graphed versus the fairness of the coin Pr(X=1). The entropy reaches its maximum value when the coin has equal probability ½ of landing with the “heads” or “tails” side up (i.e. the coin is fair). The case of a fair coin is the most uncertain situation for predicting the outcome of the next toss. Hence, the result of each toss of the coin delivers a full 1 bit of information. 29 SETI Search for extraterrestrial intelligence The SETI project uses information theory to search for intelligent life on other planets. Currently, radio signals from deep space are monitored for signs of transmissions from civilizations on other worlds. If those signals are generated by intelligent beings, they should not be purely random; they should contain discernible patterns with significant information content. 30 SETI Search for extraterrestrial intelligence 31 DNA: Genetic Code • Nature uses 4 molecules to code genetic heredity 32 Information Theory & genetics A typical communication system: Information Source (the Parents) → Transmitter → Signal / Received Signal → Receiver → Destination (the Child); the Message is DNA, and the Noise Source is mutation and evolution. 33 Informatics – What’s information – Correcting errors in transmitted messages – Genetic code and information – Data compression – Cryptography 34 Transmission Problems What are some problems that can occur when data is transmitted from one place to another? The two main problems are – transmission errors: the message sent is not the same as the message received – security: someone other than the intended recipient receives the message 35 Transmission Error Example Suppose you were looking at a newspaper ad for a job, and you see the sentence “must have bive years experience” We detect the error since we know that “bive” is not a word Can we correct the error? Why is “five” a more likely correction than “three”? Why is “five” a more likely correction than “nine”? 36 Another Example Suppose NASA is directing one of the Mars rovers by telling it which crater to investigate There are 16 possible signals that NASA could send, and each signal represents a different command NASA uses a 4-digit binary code to represent this information 0000 0001 0010 0100 0101 0110 1000 1001 1010 1100 1101 1110 0011 0111 1011 1111 37 Lost in Transmission The problem with this method is that if there is a single-digit error, there is no way that the rover could detect or correct the error If the message sent was “0100” but the rover receives “1100”, the rover will never know a mistake has occurred This kind of error – called “noise” – occurs all the time Errors are also introduced into data stored over a long period of time on CD or magnetic tape as the hardware deteriorates. 38 Coding theory Coding theory, the study of codes, including error-detecting and error-correcting codes, has been studied extensively for the past forty years.
Messages, in the form of bit strings, are encoded by translating them into longer bit strings, called codewords. As long as not too many errors are introduced in transmission, we can recover the codeword that was sent from the bit string that is received. 39 BASIC IDEA The details of the techniques used in practice to protect information against noise are sometimes rather complicated, but the basic principles are easily understood. The key idea is that in order to protect a message against noise, we should encode the message by adding some redundant information to it. In such a case, even if the message is corrupted by noise, there will be enough redundancy in the encoded message to recover, or decode, the message completely. 40 Adding Redundancy to our Messages To decrease the effects of noise, we add redundancy to our messages. First method: repeat each digit five times. The computer is then programmed to take any five-digit block received and decode the result by majority rule (a short sketch of this scheme appears after this section). 41 Majority Rule So, if we sent 00000 and the computer receives any of the following, it will still be decoded as 0: 00000, 10000, 01000, 00010, 00001, 11000, 10100, 10010, 10001, etc. Notice that for the computer to decode incorrectly, at least three errors must be made. 42 Independent Errors Using the five-fold repetition, and assuming the errors happen independently, it is less likely that three or more errors will occur than two or fewer; majority rule therefore picks the most probable original message. This is called maximum likelihood decoding. 43 More on Redundancy Another way to try to avoid errors is to send the same message twice This would allow the rover to detect the error, but not correct it (since it has no way of knowing if the error occurs in the first copy of the message or the second) 44 Why don’t we use this? Repetition codes have the advantage of simplicity, both for encoding and decoding But they are too inefficient! In a five-fold repetition code, 80% of all transmitted information is redundant. Can we do better? Yes! 45 Parity-Check Sums Sums of digits whose parities determine the check digits. Even Parity – Even integers are said to have even parity. Odd Parity – Odd integers are said to have odd parity. Decoding The process of translating received data into code words. Example: Say the parity-check sums detect an error. The encoded message is compared to each of the possible correct messages. This process of decoding works by measuring the distance between two strings of equal length, i.e. the number of positions in which the strings differ. The one that differs in the fewest positions is chosen to replace the message in error. In other words, the computer is programmed to automatically correct the error, or choose the “closest” permissible answer. 46 Error Correction Over the past 40 years, mathematicians and engineers have developed sophisticated schemes to build redundancy into binary strings to correct errors in transmission! One example can be illustrated with Venn diagrams! 47 Computing the Check Digits The original message is four digits long We will call these digits I, II, III, and IV We will add three new digits, V, VI, and VII Draw three intersecting circles as shown here Digits V, VI, and VII should be chosen so that each circle contains an even number of ones (Venn diagram with the three circles; the regions are labelled V, I, VI, III, IV, II, VII.) 48 A Hamming (7,4) code An (n,k) Hamming code means that a message k digits long is encoded into a codeword n digits long.
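Before working through the (7,4) code in detail, here is a minimal Python sketch (not part of the original slides; the names are my own) of the five-fold repetition scheme and the majority-rule decoding described above.

def encode_repetition(bits, n=5):
    # Repeat every bit n times, e.g. "01" -> "0000011111".
    return "".join(b * n for b in bits)

def decode_repetition(received, n=5):
    # Majority rule: each block of n received bits is decoded as the bit
    # that occurs most often in that block.
    decoded = ""
    for i in range(0, len(received), n):
        block = received[i:i + n]
        decoded += "1" if block.count("1") > n // 2 else "0"
    return decoded

print(encode_repetition("0"))      # 00000
print(decode_repetition("01001"))  # 0, two transmission errors are still corrected

As the slides note, the price of this robustness is that 80% of what is transmitted is redundant.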
The 16 possible messages: 0000 1010 0011 0001 1100 1110 0010 1001 1101 0100 0110 1011 1000 0101 0111 1111 49 Binary Linear Codes The error-correcting scheme we just saw is a special case of a Hamming code. These codes were first proposed in 1948 by Richard Hamming (1915-1998), a mathematician working at Bell Laboratories. Hamming was frustrated with losing a week’s worth of work due to an error that a computer could detect, but not correct. 50 Appending Digits to the Message The message we want to send is “0100” Digit V should be 1 so that the first circle has two ones Digit VI should be 0 so that the second circle has zero ones (zero is even!) Digit VII should be 1 so that the last circle has two ones Our message is now 0100101 (Slides 51-55: the Venn diagram with the message digits 0, 1, 0, 0 and the check digits 1, 0, 1 placed in the circles.) Encoding those messages Message → codeword: 0000 → 0000000, 0001 → 0001011, 0010 → 0010111, 0100 → 0100101, 1000 → 1000110, 0011 → 0011100, 0101 → 0101110, 0110 → 0110010, 1001 → 1001101, 1010 → 1010001, 1100 → 1100011, 0111 → 0111001, 1011 → 1011010, 1101 → 1101000, 1110 → 1110100, 1111 → 1111111. 56 57 Detecting and Correcting Errors Now watch what happens when there is a single-digit error We transmit the message 0100101 and the rover receives 0101101 The rover can tell that the second and third circles have odd numbers of ones, but the first circle is correct So the error must be in the digit that is in the second and third circles, but not the first: that’s digit IV Since we know digit IV is wrong, there is only one way to fix it: change it from 1 to 0 (Slides 58-59: the Venn diagram of the received word with the corrected digit.) Try It! Encode the message 1110 using this method You have received the message 0011101. Find and correct the error in this message. 60 Extending This Idea This method only allows us to encode 4-bit messages (16 possible messages), which isn’t even enough to represent the alphabet! However, if we use more digits, we won’t be able to use the circle method to detect and correct errors We’ll have to come up with a different method that allows for more digits 61 Parity Check Sums The circle method is a specific example of a “parity check sum” The “parity” of a number is 1 if the number is odd and 0 if the number is even For example, digit V is 0 if I + II + III is even, and 1 if I + II + III is odd 62 Conventional Notation Instead of using Roman numerals, we’ll use a1 to represent the first digit of the message, a2 to represent the second digit, and so on We’ll use c1 to represent the first check digit, c2 to represent the second, etc.
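As a minimal Python sketch (not from the slides; the function names and the syndrome table are my own), the circle method just demonstrated can be written with three parity sums, the same sums the next slides write out as c1, c2 and c3. Running it reproduces the codeword 0100101 and repairs the received word 0101101.

def parity(bits):
    return sum(bits) % 2

def encode_7_4(message):
    # Check digits: V covers circles I+II+III, VI covers I+III+IV, VII covers II+III+IV.
    a1, a2, a3, a4 = (int(b) for b in message)
    c1 = parity([a1, a2, a3])
    c2 = parity([a1, a3, a4])
    c3 = parity([a2, a3, a4])
    return message + "{}{}{}".format(c1, c2, c3)

def correct_7_4(received):
    # Recompute the three checks; the pattern of circles that fail points
    # at the single digit that must be flipped (if any).
    bits = [int(b) for b in received]
    a1, a2, a3, a4, c1, c2, c3 = bits
    failed = ((c1 + a1 + a2 + a3) % 2,   # first circle
              (c2 + a1 + a3 + a4) % 2,   # second circle
              (c3 + a2 + a3 + a4) % 2)   # third circle
    position = {(1, 1, 0): 0, (1, 0, 1): 1, (1, 1, 1): 2, (0, 1, 1): 3,
                (1, 0, 0): 4, (0, 1, 0): 5, (0, 0, 1): 6}.get(failed)
    if position is not None:
        bits[position] ^= 1              # flip the offending digit
    return "".join(str(b) for b in bits)

print(encode_7_4("0100"))      # 0100101, as on the slides
print(correct_7_4("0101101"))  # 0100101, digit IV is flipped back to 0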
63 Old Rules in the New Notation Using this notation, our rules for our check digits become – c1 = 0 if a1 + a2 + a3 is even; c1 = 1 if a1 + a2 + a3 is odd – c2 = 0 if a1 + a3 + a4 is even; c2 = 1 if a1 + a3 + a4 is odd – c3 = 0 if a2 + a3 + a4 is even; c3 = 1 if a2 + a3 + a4 is odd (Venn diagram with the regions labelled c1, a1, a2, a3, a4, c2, c3.) 64 An Alternative System If we want to have a system that has enough code words for the entire alphabet, we need to have 5 message digits: a1, a2, a3, a4, a5 We will also need more check digits to help us decode our message: c1, c2, c3, c4 65 Rules for the New System We can’t use the circles to determine the check digits for our new system, so we use the parity notation from before c1 is the parity of a1 + a2 + a3 + a4 c2 is the parity of a2 + a3 + a4 + a5 c3 is the parity of a1 + a2 + a4 + a5 c4 is the parity of a1 + a2 + a3 + a5 66 Making the Code Using 5 digits in our message gives us 32 possible messages; we’ll use the first 26 to represent letters of the alphabet On the next slide you’ll see the code itself, each letter together with the 9-digit code representing it 67 The Code (letter – code): A 000000000, B 000010111, C 000101110, D 000111001, E 001001101, F 001011010, G 001100011, H 001110100, I 010001111, J 010011000, K 010100001, L 010110110, M 011000010, N 011010101, O 011101100, P 011111011, Q 100001011, R 100011100, S 100100101, T 100110010, U 101000110, V 101010001, W 101101000, X 101111111, Y 110000100, Z 110010011. 68 Using the Code Now that we have our code, using it is simple When we receive a message, we simply look it up on the table But what happens when the message we receive isn’t on the list? Then we know an error has occurred, but how do we fix it? We can’t use the circle method anymore 69 Beyond Circles Using this new system, how do we decode messages? Simply compare the (incorrect) message with the list of possible correct messages and pick the “closest” one What should “closest” mean? The distance between the two messages is the number of digits in which they differ 70 The Distance Between Messages What is the distance between 1100101 and 1010101? – The messages differ in the 2nd and 3rd digits, so the distance is 2 What is the distance between 1110010 and 0001100? – The messages differ in all but the 7th digit, so the distance is 6 71 Hamming Distance Def: The Hamming distance between two vectors of a vector space is the number of components in which they differ, denoted d(u,v). 72 Hamming Distance Ex. 1: The Hamming distance between v=[1011010] and u=[0111100] is d(u, v) = 4 Notice: d(u,v) = d(v,u) 73 74 Hamming weight of a Vector Def: The Hamming weight of a vector is the number of nonzero components of the vector, denoted wt(u). 75 Hamming weight of a code Def: The Hamming weight of a linear code is the minimum weight of any nonzero vector in the code. 76 Hamming Weight The Hamming weights of v=[1011010] u=[0111100] w=[0100101] are: wt(v) = 4 wt(u) = 4 wt(w) = 3 77 Nearest-Neighbor Decoding The nearest-neighbor decoding method decodes a received message as the code word that agrees with the message in the most positions 78 Trying it Out Suppose that, using our alphabet code, we receive the message 010100011 We can check and see that this message is not on our list How far away is it from the messages on our list?
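A minimal Python sketch (not part of the slides; the helper names are my own) of the Hamming distance and the nearest-neighbour rule just described. The table on the next slide lists the distance from 010100011 to every codeword; the sketch reproduces a few of those values and picks out the nearest codeword.

def hamming_distance(u, v):
    # Number of positions in which two equal-length strings differ.
    return sum(a != b for a, b in zip(u, v))

def nearest_codeword(received, codewords):
    # Nearest-neighbour decoding: choose the codeword closest to what was received.
    return min(codewords, key=lambda w: hamming_distance(received, w))

print(hamming_distance("1100101", "1010101"))  # 2, as on the slides
print(hamming_distance("1110010", "0001100"))  # 6

# A few entries of the alphabet code: the received word 010100011 is closest
# to 010100001 (the letter K), at distance 1.
codes = ["000000000", "001100011", "010100001", "010001111", "011111011"]
print(nearest_codeword("010100011", codes))    # 010100001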
79 Distances From 010100011 Codeword: distance – 000000000: 4, 000010111: 4, 000101110: 4, 000111001: 4, 001001101: 6, 001011010: 6, 001100011: 2, 001110100: 6, 010001111: 3, 010011000: 5, 010100001: 1, 010110110: 3, 011000010: 3, 011010101: 5, 011101100: 5, 011111011: 3, 100001011: 4, 100011100: 8, 100100101: 4, 100110010: 4, 101000110: 6, 101010001: 6, 101101000: 6, 101111111: 6, 110000100: 5, 110010011: 3. 80 Fixing the Error Since 010100001 was closest to the message that we received, we know that this is the most likely actual transmission We can look this corrected message up in our table and see that the transmitted message was (probably) “K” This might still be incorrect, but other errors can be corrected using context clues or check digits 81 Distances From 1010110 The distances between the received word “1010110” and all 16 possible code words: 0000000: 4, 0001011: 5, 0010111: 2, 0100101: 5, 1000110: 1, 1100011: 4, 1010001: 3, 1001101: 4, 0110010: 3, 0101110: 4, 0011100: 3, 1110100: 2, 1101000: 5, 1011010: 2, 0111001: 6, 1111111: 3. The nearest code word is 1000110, at distance 1. 82 Informatics – What’s information – Correcting errors in transmitted messages – Genetic code and information – Data compression – Cryptography 83 84 The genetic code The genome基因組 is the instruction manual for life, an information system that specifies the biological body. In its simplest form, it consists of a linear sequence of four extremely small molecules, called nucleotides. These nucleotides make up the “steps” of the spiral-staircase structure of the DNA and are the letters of the genetic code. 85 DNA from an Information Theory Perspective The “alphabet” for DNA is {A,C,G,T}. Each DNA strand is a sequence of symbols from this alphabet. These sequences are replicated and translated in processes reminiscent of Shannon’s communication model. There is redundancy in the genetic code that enhances its error tolerance. 86 What Information Theory Contributes to Genetic Biology A useful model for how genetic information is stored and transmitted in the cell A theoretical justification for the observed redundancy of the genetic code The ideas that Information Theory (Claude Shannon) introduced have profound and far-reaching implications for Origin of Life Research and Evolutionary Theory. 87 A DNA double helix The main role of DNA (deoxyribonucleic acid 脱氧核糖核酸) molecules is the long-term storage of the genetic instructions used in the development and functioning of all known living organisms. 88 The Genetic Code The four bases (nucleotides) found in DNA are adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T). These four bases can be regarded as the symbols used in the genetic code. The DNA code has four letters: A, C, G and T. Three letters come together to form a unit called a “triplet” or “codon”. The codon GGG is an instruction to make glycine, an amino acid commonly found in proteins. 89 Processes of Life in the Cell Information stored in the DNA in the nucleus is copied over to RNA (ribonucleic acid) strands, which act as messengers to govern the chemical processes in the cell. Duplication and Division In the course of cell division, the DNA strands in the nucleus (chromosomes) are duplicated by splitting the double-helix strand up and replacing the open bonds with the corresponding nucleotides The process must be sufficiently accurate, but also capable of occasional minor mistakes to allow for evolution.
Transmitting • Prior to cell fission, the DNA molecule is unzipped 92 Escherichia coli genome • >gb|U00096|U00096 Escherichia coli 大腸桿菌 • K-12 MG1655 complete genome基因組 • AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTG TGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG AACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATT GACTTAGGTCACTAAATACTTTAACCAATATAGGCAT AGCGCACAGACAGATAAAAATTACAGAGTACACAAC ATCCATGAAACGCATTAGCACCACCATTACCACCACC ATCACCATTACCACAGGTAACGGTGCGGGCTGACGCG TACAGGAAACACAGAAAAAAGCCCGCACCTGACAGT GCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAAC AACCATGCGAGTGTTGAA 93 Hierarchies of symbols – English: letter (26) → word (1-28 letters) → sentence → book; computer: bit (2) → byte (8 bits) → line → program; genetics: nucleotide核苷酸 (4) → codon (3 nucleotides) → gene → genome 94 The Central Dogma of Molecular Biology DNA → RNA (ribonucleic acid 核糖核酸) by transcription, RNA → protein by translation, DNA → DNA by replication, and RNA → DNA by reverse transcription. A long string of DNA like ACGGGTCTTTAAGATG… – that DNA pattern might hold the instructions to build a big mouth or blue eyes. 95 Information Theory & genetics A typical communication system: Information Source (the Parents) → Transmitter → Signal / Received Signal → Receiver → Destination (the Child); the Message is DNA, and the Noise Source is mutation and evolution. 96 Redundancy in DNA code • Two other codons, GGC and GGA (besides GGG), also code for Glycine. An amino acid can be represented by several different codons, not just one. This is “redundancy”. • DNA has a clever mechanism for reducing copying errors, namely that each amino acid has several codons that code for it, not just one. It means that some copying errors (GGC instead of GGG) will not cause a problem. • This is why DNA has survived over 3 billion years intact. 97 The very existence of information changes the materialistic worldview. Materialistic philosophy has no explanation for the existence of information. Norbert Wiener, the MIT mathematician and father of cybernetics, said, “Information is information, neither matter nor energy. No materialism that denies this can survive the present day.” Information is an entity carried by, but beyond, matter and energy. The observable universe consists of not only matter and energy but also information. Is life a vehicle to transport some kind of information through space and time? 98 Informatics – What’s information – Correcting errors in transmitted messages – Genetic code and information – Data compression – Cryptography 99 Data compression Purpose: Increase the efficiency in storing/transmitting data. How: achieved by removing data redundancy while preserving information content. Data with low entropy permit a larger compression ratio than data with high entropy. Types: Lossless compression and lossy compression. Entropy effectively limits the strongest lossless compression possible. 100 Data compression Data compression is important to storage systems because it allows more bytes to be packed into a given storage medium than when the data is uncompressed. Some storage devices (notably tape) compress data automatically as it is written, resulting in less tape consumption and significantly faster backup operations. Compression also reduces file transfer time, saving time and communications bandwidth.
102 Compression There are two main categories – Lossless – Lossy Compression ratio: the ratio of the uncompressed data size to the compressed data size. 103 Compression factor A good metric for compression is the compression factor (or compression ratio), given by: compression factor = uncompressed size / compressed size. If we have a 100KB file that we compress to 40KB, we have a compression factor of 100/40 = 2.5. 104 Inefficiency of ASCII Realization: In many natural (English) files, we are much more likely to see the letter ‘e’ than the character ‘&’, yet they are both encoded using 7 bits! Solution: Use variable-length encoding! The encoding for ‘e’ should be shorter than the encoding for ‘&’. 105 Relative Frequency of Letters in English Text 106 ASCII (cont.) Here are the ASCII bit strings for the capital letters in our alphabet: A 0100 0001, B 0100 0010, C 0100 0011, D 0100 0100, E 0100 0101, F 0100 0110, G 0100 0111, H 0100 1000, I 0100 1001, J 0100 1010, K 0100 1011, L 0100 1100, M 0100 1101, N 0100 1110, O 0100 1111, P 0101 0000, Q 0101 0001, R 0101 0010, S 0101 0011, T 0101 0100, U 0101 0101, V 0101 0110, W 0101 0111, X 0101 1000, Y 0101 1001, Z 0101 1010. 107 Example: Morse code Morse code is a method of transmitting textual information as a series of on-off tones, lights, or clicks that can be directly understood by a skilled listener or observer without special equipment. Each character is a sequence of dots and dashes, with the shorter sequences assigned to the more frequently used letters in English – the letter 'E' is represented by a single dot, and the letter 'T' by a single dash. Invented in the early 1840s, it was extensively used in the 1890s for early radio communication, before it was possible to transmit voice. 108 A U.S. Navy seaman sends Morse code signals in 2005. Vibroplex semiautomatic key. The paddle, when pressed to the right by the thumb, generates a series of dits. When pressed to the left by the knuckle of the index finger, the paddle generates a dah. 109 International Morse Code 110 Variable Length Coding Assume we know the distribution of characters (‘e’ appears 1000 times, ‘&’ appears 1 time) Each character will be encoded using a number of bits that is inversely related to its frequency (made precise later). Need a ‘prefix-free’ encoding: if ‘e’ = 001 then we cannot assign ‘&’ to be 0011. Since the encoding is variable length, we need to know when to stop. 111 Encoding Trees Think of the encoding as an (unbalanced) tree. Data is in leaf nodes only (prefix free). (Tree: the 0-branch of the root leads to the leaf ‘e’; the 1-branch leads to a node whose 0- and 1-branches lead to the leaves ‘a’ and ‘b’.) ‘e’ = 0, ‘a’ = 10, ‘b’ = 11 How to decode ‘01110’? 112 Cost of a Tree For each character ci let fi be its frequency in the file. Given an encoding tree T, let di be the depth of ci in the tree (the number of bits needed to encode the character). The length of the file after encoding it with the coding scheme defined by T will be C(T) = Σ di fi 113 Example Huffman encoding A = 0, B = 100, C = 1010, D = 1011, R = 11 ABRACADABRA = 01001101010010110100110 This is eleven letters in 23 bits A fixed-width encoding would require 3 bits for 5 different letters, or 33 bits for 11 letters Notice that the encoded bit string can be decoded!
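As a small Python sketch (not part of the slides), the code table above can be used to encode and decode; the prefix-free property is what makes the decoding loop unambiguous, and running it reproduces the 23-bit string for ABRACADABRA.

code = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}

def encode(text):
    return "".join(code[ch] for ch in text)

def decode(bits):
    # Prefix-free decoding: collect bits until they match exactly one codeword,
    # emit that letter, then start again.
    reverse = {v: k for k, v in code.items()}
    result, buffer = "", ""
    for bit in bits:
        buffer += bit
        if buffer in reverse:
            result += reverse[buffer]
            buffer = ""
    return result

bits = encode("ABRACADABRA")
print(bits, len(bits))   # 01001101010010110100110 23
print(decode(bits))      # ABRACADABRA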
114 Why it works In this example, A was the most common letter In ABRACADABRA: – 5 As – 2 Rs – 2 Bs – 1 C – 1 D The code for A is 1 bit long, the code for R is 2 bits long, the code for B is 3 bits long, the code for C is 4 bits long, and the code for D is 4 bits long 115 Creating a Huffman encoding For each encoding unit (letter, in this example), associate a frequency (the number of times it occurs) – Use a percentage or a probability Create a binary tree whose children are the encoding units with the smallest frequencies – The frequency of the root is the sum of the frequencies of the leaves Repeat this procedure until all the encoding units are in the binary tree 116 Example, step I Assume that the relative frequencies are: – A: 40 – B: 20 – C: 10 – D: 10 – R: 20 (I chose simpler numbers than the real frequencies) The smallest numbers are 10 and 10 (C and D), so connect those 117 Example, step II C and D have already been used, and the new node above them (call it C+D) has value 20 The smallest values are B, C+D, and R, all of which have value 20 – Connect any two of these; it doesn’t matter which two 118 Example, step III The smallest value is R, while A and B+C+D both have value 40 Connect R to either of the others (Tree diagram labelling the root and the leaves.) 119 Example, step IV Connect the final two nodes 120 Example, step V Assign 0 to left branches, 1 to right branches Each encoding is a path from the root A = 0 B = 100 C = 1010 D = 1011 R = 11 Each path terminates at a leaf Do you see why encoded strings are decodable? 121 Unique prefix property A = 0 B = 100 C = 1010 D = 1011 R = 11 No bit string is a prefix of any other bit string For example, if we added E = 01, then A (0) would be a prefix of E Similarly, if we added F = 10, then it would be a prefix of three other encodings (B = 100, C = 1010, and D = 1011) The unique prefix property holds because, in a binary tree, a leaf is not on a path to any other node 122 Practical considerations It is not practical to create a Huffman encoding for a single short string, such as ABRACADABRA – To decode it, you would need the code table – If you include the code table in the entire message, the whole thing is bigger than just the ASCII message Huffman encoding is practical if: – The encoded string is large relative to the code table, OR – We agree on the code table beforehand For example, it’s easy to find a table of letter frequencies for English (or any other alphabet-based language) 123 Data compression Huffman encoding is a simple example of data compression: representing data in fewer bits than it would otherwise need A more sophisticated method is GIF (Graphics Interchange Format) compression, for .gif files Another is JPEG (Joint Photographic Experts Group), for .jpg files – Unlike the others, JPEG is lossy—it loses information – Generally OK for photographs (if you don’t compress them too much) because decompression adds “fake” data very similar to the original 124 JPEG Compression Photographic images incorporate a great deal of information. However, much of that information can be lost without objectionable deterioration in image quality. With this in mind, JPEG allows user-selectable image quality, but even at the “best” quality levels, JPEG makes an image file smaller owing to its multiple-step compression algorithm. It’s important to remember that JPEG is lossy, even at the highest quality setting. It should be used only when the loss can be tolerated. 125
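Before turning to run-length encoding, here is a minimal Python sketch (not part of the slides) of the greedy construction in steps I-V above, using the frequencies A: 40, B: 20, C: 10, D: 10, R: 20. The use of heapq and the tie-breaking counter are my own implementation choices; because ties between equal frequencies can be broken either way, the codes it produces differ from the ones on the slides, but the average code length is the same.

import heapq

def huffman_code(frequencies):
    # Each heap entry is (total frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(freq, i, {symbol: ""}) for i, (symbol, freq) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # the two least frequent nodes...
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}          # ...are joined under a new node:
        merged.update({s: "1" + c for s, c in right.items()})   # 0 = left branch, 1 = right branch
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_code({"A": 40, "B": 20, "C": 10, "D": 10, "R": 20})
print(codes)  # {'B': '00', 'R': '01', 'C': '100', 'D': '101', 'A': '11'}
              # A different but equally short prefix code: 2.2 bits per letter on average,
              # the same as A=0, R=11, B=100, C=1010, D=1011 from the slides.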
2. Run Length Encoding (RLE) RLE: When data contain strings of repeated symbols (such as bits or characters), the strings can be replaced by a special marker, followed by the repeated symbol, followed by the number of occurrences. In general, the number of occurrences (the run length) is shown by a two-digit number. If the special marker itself occurs in the data, it is duplicated (as in character stuffing). RLE can be used in audio (silence is a run of 0s) and in video (a run of picture elements having the same brightness and color). 126 An Example of Run-Length Encoding 127 2. Run Length Encoding (RLE) Example – # is chosen as the special marker. – A two-digit number is chosen for the repetition count. – Consider the following string of decimal digits: 15000000000045678111111111111118 Using the RLE algorithm, the above digit string would be encoded as: 15#01045678#1148 – The compression ratio would be (1 – (16/32)) × 100% = 50% (a short sketch of this encoding appears at the end of this section). 128 129 One of the essential aspects of communication systems is that the codes, the encoders and the decoders have layers. Information is encoded from the top layer down, and it is decoded from the bottom layer up. 130
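A minimal Python sketch (not part of the slides; the function name and the min_run threshold are my own choices) of the run-length encoding example above. It replaces any run of four or more identical symbols by the marker, the symbol and a two-digit count, and reproduces the encoding 15#01045678#1148; escaping a marker that occurs literally in the data is not handled in this sketch.

def rle_encode(data, marker="#", min_run=4):
    # Replace runs of a repeated symbol by marker + symbol + two-digit count;
    # shorter runs are copied unchanged.
    out, i = "", 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        run = j - i
        if run >= min_run:
            out += "{}{}{:02d}".format(marker, data[i], run)
        else:
            out += data[i] * run
        i = j
    return out

text = "15000000000045678111111111111118"
encoded = rle_encode(text)
print(encoded)                       # 15#01045678#1148
print(1 - len(encoded) / len(text))  # 0.5, i.e. a 50% saving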