Informatics I101
February 25, 2003
John C. Paolillo, Instructor

Electronic Text
• ASCII — American Standard Code for Information Interchange
• EBCDIC (IBM mainframes, not standard)
• Extended ASCII (8-bit, not standard)
  – DOS Extended ASCII
  – Windows Extended ASCII
  – Macintosh Extended ASCII
• Unicode (standard-in-progress)

ASCII
The code 01000001 means the alphabet letter "A", which is displayed as the screen representation "A".

The ASCII Code
(rows give the low hex digit of a code, columns the high hex digit; e.g. "A" = hex 41 = binary 01000001)

        0    1    2      3   4   5   6   7
   0   NUL  DLE  space   0   @   P   `   p
   1   SOH  DC1  !       1   A   Q   a   q
   2   STX  DC2  "       2   B   R   b   r
   3   ETX  DC3  #       3   C   S   c   s
   4   EOT  DC4  $       4   D   T   d   t
   5   ENQ  NAK  %       5   E   U   e   u
   6   ACK  SYN  &       6   F   V   f   v
   7   BEL  ETB  '       7   G   W   g   w
   8   BS   CAN  (       8   H   X   h   x
   9   HT   EM   )       9   I   Y   i   y
   A   LF   SUB  *       :   J   Z   j   z
   B   VT   ESC  +       ;   K   [   k   {
   C   FF   FS   ,       <   L   \   l   |
   D   CR   GS   -       =   M   ]   m   }
   E   SO   RS   .       >   N   ^   n   ~
   F   SI   US   /       ?   O   _   o   DEL

An Example Text

   T    h    i    s    _    i    s    _    a    n    _    e    x    a    m    p    l    e
  84  104  105  115   32  105  115   32   97  110   32  101  120   97  109  112  108  101

Note that each ASCII character corresponds to a number, including spaces, carriage returns, etc. Everything must be represented somehow; otherwise the computer couldn't do anything with it.

Representation in Memory
(each memory address holds one character code; highest address shown first — read bottom-up, the contents spell out part of the example string)

  address     contents   character
  01101010       32          _
  01101001      101          e
  01101000      108          l
  01100111      112          p
  01100110      109          m
  01100101       97          a
  01100100      120          x
  01100011      101          e
  01100010       32          _
  01100001      110          n
  01100000       97          a

Features of ASCII
• 7-bit fixed-length code
  – all codes have the same number of bits
• Sorting: A precedes B, B precedes C, etc.
• Caps + 32 = lower case (A + space = a, since space is code 32)
• Word divisions, etc. must be parsed
ASCII is very widespread and almost universally supported.

Variable-Length Codes
• Some symbols (e.g. letters) have shorter codes than others
  – E.g. Morse code: e = dot, j = dot-dash-dash-dash
  – Use frequency of symbols to assign code lengths
• Why?
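The ASCII facts above can be checked directly. A small sketch in Python (not part of the lecture): `ord()` maps a character to its code number, `chr()` maps back, and the "Caps + 32" trick works because 32 is itself the code for the space character.

```python
# Each character of the example text corresponds to one code number.
codes = [ord(c) for c in "This is an example"]
print(codes)
# [84, 104, 105, 115, 32, 105, 115, 32, 97, 110, 32, 101, 120, 97, 109, 112, 108, 101]

# Sorting follows code order: 'A' (65) precedes 'B' (66), and so on.
assert ord("A") < ord("B") < ord("C")

# Caps + 32 = lower case; space is code 32, so A + space = a.
assert chr(ord("A") + ord(" ")) == "a"
```

Note that the space characters (code 32) appear in the list like any other symbol: word divisions must be parsed out by the program reading the text.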
  – Space efficiency: compression tools such as gzip and zip use variable-length codes (based on words)

Requirements
• Starting and ending points of symbols must be clear
• (Simplistic) example: four symbols must be encoded as
    0   10   110   1110
  – All symbols end with a zero
  – Any zero ends a symbol
  – Any one continues a symbol
  – Average number of bits per symbol = (1 + 2 + 3 + 4) / 4 = 2.5 (assuming all four symbols are equally frequent)

Example
• 12 symbols
  – digits 0–9
  – decimal point and space (end of number)
(binary code tree diagram omitted; the resulting codes are tabulated below)

  symbol:  0    1     2      3       4        5        6    7     8      9       _        .
  code:    00   010   0110   01110   011110   011111   10   110   1110   11110   111110   111111

Efficient Coding
Huffman coding (gzip):
1. count the number of times each symbol occurs
2. start with the two least frequent symbols
   a) combine them using a tree
   b) put 0 on one branch, 1 on the other
   c) combine counts and treat as a single symbol
3. continue combining in the same way until every symbol is assigned a place in the tree
4. read the codes from the top of the tree down to each symbol

Information Theory
• Mathematical theory of communication
  – How many bits are in an efficient variable-length encoding?
  – How much information is in a chunk of data?
  – How can the capacity of an information medium be measured?
• Probabilistic model of information
  – "Noisy channel" model
  – less frequent ≈ more surprising ≈ more informative
• Measures information using the notion of entropy

Noisy Channel
(diagram omitted: a source transmits 1s and 0s to a destination; each bit may arrive correctly or flipped)
We measure the probability of each possible path (correct reception and errors).

Entropy
• Entropy of a symbol is calculated from its probability of occurrence
  – Number of bits required: h_s = –log2 p_s
  – Average entropy: H(p) = –Σ p_i log2 p_i
• Related to variance
• Measured in bits (log2)

Base 2 Logarithms
2^(log2 x) = x; e.g. log2 2 = 1, log2 4 = 2, log2 8 = 3, etc.
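The requirement that "starting and ending points of symbols must be clear" is what makes the 12-symbol code above decodable: no code word is a prefix of another, so a decoder can scan bits left to right and emit a symbol the moment a complete code word appears. A sketch in Python (my own illustration, not from the lecture):

```python
# The 12-symbol prefix-free code from the slides: digits 0-9,
# space (_ = end of number), and decimal point.
CODES = {
    "0": "00", "1": "010", "2": "0110", "3": "01110", "4": "011110",
    "5": "011111", "6": "10", "7": "110", "8": "1110", "9": "11110",
    "_": "111110", ".": "111111",
}
DECODE = {code: sym for sym, code in CODES.items()}

def decode(bits):
    """Scan the bit string left to right; because the code is
    prefix-free, the first match is always the right one."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in DECODE:      # a complete code word: emit it and reset
            out.append(DECODE[buf])
            buf = ""
    return "".join(out)

encoded = "".join(CODES[s] for s in "3.14_")
print(encoded)           # the concatenated code words, with no separators
print(decode(encoded))   # -> "3.14_"
```

No separator bits are needed between symbols; the prefix-free property alone marks where each symbol ends.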
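The four Huffman steps and the entropy formula can be tied together in one short Python sketch (my own illustration; the example text and variable names are assumptions, and real gzip combines Huffman coding with other techniques):

```python
import heapq
from collections import Counter
from math import log2

def huffman_codes(text):
    """Follow the four steps above: count symbols, then repeatedly
    merge the two least frequent subtrees, putting 0 on one branch
    and 1 on the other, until a single tree remains."""
    freq = Counter(text)                                  # step 1
    # heap entries: (combined count, tiebreaker, {symbol: code-so-far})
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:                                  # steps 2-3
        n0, _, left = heapq.heappop(heap)    # least frequent subtree
        n1, _, right = heapq.heappop(heap)   # next least frequent
        merged = {s: "0" + c for s, c in left.items()}    # 0 branch
        merged.update({s: "1" + c for s, c in right.items()})  # 1 branch
        heapq.heappush(heap, (n0 + n1, tiebreak, merged))
        tiebreak += 1
    return freq, heap[0][2]                               # step 4

freq, codes = huffman_codes("this is an example")
total = sum(freq.values())
avg = sum(freq[s] * len(codes[s]) for s in freq) / total
entropy = -sum((n / total) * log2(n / total) for n in freq.values())
print(f"average code length = {avg:.3f} bits/symbol")
print(f"entropy H(p)        = {entropy:.3f} bits/symbol")
```

The printed entropy H(p) = –Σ p_i log2 p_i is a lower bound on the average code length, and a Huffman code always comes within one bit of it per symbol.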
Often we round log2 up to the nearest whole number (= the minimum number of bits).

Unicode
• Administered by the Unicode Consortium
• Assigns a unique code to every written symbol (21 bits: 2,097,152 codes)
  – UTF-32: four-byte fixed-length code
  – UTF-16: two- or four-byte variable-length code
  – UTF-8: one- to four-byte variable-length code
• ASCII block (one byte) + basic multilingual plane (2–3 bytes) + supplementary (4 bytes)
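The per-character byte counts of the three encoding forms can be checked in Python; the sample characters below are my own choices (an ASCII letter, an accented Latin letter, the euro sign from the basic multilingual plane, and a musical symbol from the supplementary planes).

```python
# Byte counts per character in each Unicode encoding form.
# The -le codec variants are used so no byte-order mark is added.
for ch in ["A", "é", "€", "𝄞"]:
    print(ch,
          len(ch.encode("utf-8")),      # 1 to 4 bytes
          len(ch.encode("utf-16-le")),  # 2 or 4 bytes
          len(ch.encode("utf-32-le")))  # always 4 bytes
# "A" -> 1 / 2 / 4 bytes    "é" -> 2 / 2 / 4 bytes
# "€" -> 3 / 2 / 4 bytes    "𝄞" (supplementary) -> 4 / 4 / 4 bytes
```

This matches the slide: UTF-8 spends one byte on the ASCII block, two to three bytes elsewhere in the basic multilingual plane, and four bytes on supplementary characters, while UTF-32 pays four bytes for everything.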