Course Notes 1.2 Fall 01/02 94.201 Page 1 of 122 Notes on Information Encoding The values of the state variables in the components of a computer system are fixed-length strings of bits. Each bit is a binary-valued digit. Since the state variable values are binary-valued, it is necessary to use a language based on binary values to describe all information relevant to a program. Using binary values to represent application information is referred to as a binary encoding of the information. Both the data and the algorithm (instructions) associated with a program must be encoded. This section will deal with data encoding, and the discussion of instruction encoding will be delayed until after exploring processor architecture and simple programming concepts. This discussion of information encoding is based on Figure 1. The figure shows two sets of information: application information, and fixed-length, binary valued strings. The application information has meaning to people and applications. Unfortunately, a computer can only represent information using binary values. Programmers must use a representation mapping to encode application information as binary values, and then write application programs to work with the information in its binary form. application information representation mapping fixed-length binary-valued strings Figure 1: Representing Application Information as Binary Strings To construct an application, it is necessary to understand how information is encoded. For some types of information, the computer hardware may provide some built-in support based on encodings that have become defacto standards (i.e. encodings that have been proven over time to be appropriate for data types frequently used in applications). Most computers have built-in support for integers (both signed and unsigned), and provide instructions that manipulate binary strings under integer encoding assumptions. Built-in support for data types simplifies the task of building applications, however, there are many data types for which the hardware does not have built-in support. In these cases, the programmer must select (or invent) an appropriate encoding and then use general bit manipulation instructions to work with the encoded values. Copyright Trevor W. Pearce, September 12, 2000 For use in the 94.201 course only – not for distribution outside of the Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada Course Notes 1.2 Fall 01/02 94.201 Page 2 of 122 This section will discuss the standard encodings used for: unsigned integers (binary number system encoding), signed integers (2’s complement encoding), and characters (ASCII encoding). In addition, the signed magnitude encoding of signed integers will be discussed as one possible (but rarely used) alternate to 2’s complement encoding. Binary Values as Strings of 0’s and 1’s Each bit in a computer system state variable may have the value 0 or 1. If a state variable is viewed as a string of N bits, then there are 2N different values for the string (there are 2 possible values for each digit, and N digits). If you are uncomfortable with accepting this mathematical conclusion, convince yourself by considering several cases for increasing values of N. You will eventually conclude that adding a bit to the string doubles the number of possible values (half of the values have the new bit = 0, and half of the values have the new bit = 1). If the number of possible values for N – 1 bits is 2N-1, then doubling the number by increasing to N bits gives: 2 x 2N-1 = 2N. By itself, a string of 0’s and 1’s has no meaning – it is just a pattern of symbols. Inside a computer system, a string of 0’s and 1’s can have many possible interpretations. Fixed-Length Limitation: There is a significant limitation in using fixed-length binary strings to represent information (as shown in Figure 1). The number of bits used in the string limits the number of possible unique patterns of 0’s and 1’s, and therefore, limits the number application values that can be represented by the strings. The fixed-length limitation often constrains encodings to a finite subset, or range, of the application information. Number Systems Applications often deal with numbers, so it is useful to consider number systems. The decimal number system is used extensively in everyday life, but it is just one example of a number system. In general, a number system consists of a base and an interpretation rule. The base determines the set of symbols (the digits) that can appear in numbers, and the value of each symbol. The interpretation rule specifies how to interpret a string of digits as a numeric value. The rule usually specifies how to convert the string to a base 10 (decimal) number, since these are the numbers people use most frequently to represent numeric concepts. Counting Number Example: The number system we use to count things is also referred to as the base 10 system. In this system, there are 10 digits (0,1,2,3,4,5,6,7,8,9) and each symbol has an associated value. For example, the digit 0 represents the value zero. Some might find this overly trivial, but do not confuse the syntactic symbol “0” with the Copyright Trevor W. Pearce, September 12, 2000 For use in the 94.201 course only – not for distribution outside of the Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada Course Notes 1.2 Fall 01/02 94.201 Page 3 of 122 counting concept of zero – the concept is the meaning (the semantics), while the symbol is a language construct (the syntax) used to represent the concept in a written communication. Different number systems might use a different symbol to represent the concept zero. A string of base 10 digits can be used to represent a counting number, and the interpretation of the string is based on the positions of the digits in the string. The right-most (least significant) digit is in position 0, and in a string of I digits, the left-most (most significant) digit is in position I–1. The string represents a summation of terms in which the coefficient of each term corresponds to the value of the digit in the string, and the coefficient is multiplied by a power of 10. The exponent of each term is determined by the position of the digit in the string. The 4-digit string D3D2D1D0, represents the counting number obtained from the sum: D3 x 103 + D2 x 102 + D1 x 101 + D0 x 100 For example, the string of base 10 digits: 2458 represents the number obtained from the summation: 2 x 103 + 4 x 102 + 5 x 101 + 8 x 100 = two thousand, four hundred and fifty eight The example may seem trivial, since it involves base 10 representation – perhaps the examples for base 2 and base 16 will be more instructive. General Base-N Number System The base 10 number system discussed above can be generalized to an arbitrary base N. All that is required is a mapping from the base symbols to counting values, and a generalization of the interpretation rule. In the generalized interpretation rule, the ith term of the summation is: (counting value of ith digit) x (base 10 value of the base) i where i starts at 0 for the right-most digit Notation: Where there is possible confusion about the base that pertains to a string of digits, the string will be followed by a subscript indicating the appropriate base. For example: 245810 indicates base 10 representation. Binary Number System The binary number system is an instance of the general base N system in which N = 2. In this number system there are only two symbols: 0 and 1. The symbols have their usual decimal interpretations (i.e. zero and one, respectively). Example: 100112 applying the base 10 interpretation rule gives: 1 x 24 + 0 x 23 + 0 x 22 + 1 x 21 + 1 x 20 = 16 + 0 + 0 + 2 + 1 = 19 Therefore, 100112 = 1910 Copyright Trevor W. Pearce, September 12, 2000 For use in the 94.201 course only – not for distribution outside of the Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada Course Notes 1.2 Fall 01/02 94.201 Page 4 of 122 Hexadecimal Number System The hexadecimal number system is an instance of the general base N system where N = 16. In this number system the symbols are 0 through 9 (with the usual counting value interpretations), and A through F with the following counting value interpretations: A = 10 B = 11 C = 12 D = 13 E = 14 F = 15 Example: A0D316 applying the interpretation rule gives: A x 163 + 0 x 162 + D x 161 + 3 x 160 substituting counting values gives 10 x 163 + 0 x 162 + 13 x 161 + 3 x 160 = 10 x 4096 + 0 + 13 x 16 + 3 = 41,17110 Using Hexadecimal as a Shorthand for Binary As mentioned previously, computer system state variables are binary valued. Machinelevel programming requires references to state variable values, and this leads to the need to refer to binary values in discussions. The length of binary numbers makes them awkward to say and write, and error prone to use. It might be tempting to use decimal numbers as a shorthand representation for binary numbers. In the above example, the 5digit binary value 100112 corresponds to the 2-digit value 1910, and it would be much easier for us to talk about the number “nineteen base ten” than the number “ten thousand and eleven base two”). Unfortunately, the conversion between binary number system representation and decimal number system representation is not intuitively obvious, and requires calculations. As a result, decimal numbers are awkward to use for this purpose. Fortunately, the conversion between binary number representation and hexadecimal number representation is intuitively simple and does not require calculation. Therefore, it will be convenient to use hexadecimal numbers as a shorthand representation for binary numbers. Consider the following 12-digit binary value: b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 b0 This is interpreted as: b11 x 211 + b10 x 210 + b9 x 29 + b8 x 28 + b7 x 27 + b6 x 26 + b5 x 25 + b4 x 24 + b3 x 23 + b2 x 22 + b1 x 21 + b0 x 20 Which can be rewritten as: (b11 x 23 + b10 x 22 + b9 x 21 + b8 x 20) x 28 + (b7 x 23 + b6 x 22 + b5 x 21 + b4 x 20) x 24 + (b3 x 23 + b2 x 22 + b1 x 21 + b0 x 20) x 20 4 But 2 = 16, and therefore the expression can be rewritten as: (b11 x 23 + b10 x 22 + b9 x 21 + b8 x 20) x 162 + (b7 x 23 + b6 x 22 + b5 x 21 + b4 x 20) x 161 + (b3 x 23 + b2 x 22 + b1 x 21 + b0 x 20) x 160 Copyright Trevor W. Pearce, September 12, 2000 For use in the 94.201 course only – not for distribution outside of the Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada Course Notes 1.2 Fall 01/02 94.201 Page 5 of 122 And this is the first three terms of the interpretation rule for base 16 values. Therefore, if there is an easy way to convert between 4-bit binary values and hexadecimal digits, then there is an easy way to convert between strings of binary digits and strings of hexadecimal digits. It turns out that the conversion is simple: Bin 0000 0001 0010 0011 Hex 0 1 2 3 Bin 0100 0101 0110 0111 Hex 4 5 6 7 Bin 1000 1001 1010 1011 Hex 8 9 A B Bin 1100 1101 1110 1111 Hex C D E F The conversion from binary to hex can be accomplished by starting at the right end of the binary string and replacing successive groups of 4 binary digits with their corresponding hex digits. For example: 1001 1111 0011 01102 = 9F3616 9 F 3 6 The conversion from hex to binary can be accomplished by replacing each hex digit with its corresponding 4-bit binary value. For example: 80C516 = 1000 0000 1100 01012 Hexadecimal values are used frequently in programs, and the 16 subscript is awkward in simple text editors. To solve this problem, hexadecimal values are often denoted by extending the digits by the character 'H' (or 'h'). For example: 80C516 might be written as 80C5H (or 80C5h) Binary Encoding of Unsigned Integers The set of unsigned integers are the counting numbers: {0, 1, 2, 3, 4, 5, … } The unsigned integers can also be visualized on a number line: 0 1 2 3 4 5 Copyright Trevor W. Pearce, September 12, 2000 For use in the 94.201 course only – not for distribution outside of the Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada Course Notes 1.2 Fall 01/02 94.201 Page 6 of 122 Computer applications often involve unsigned integer values, and an encoding of the values is needed. (Recall from Figure 1 that a binary encoding is an interpretation mapping between application information and fixed-length binary strings.) Most computers have built-in support for unsigned integers under the binary number system encoding, i.e. binary strings are interpreted as unsigned integers using the base 2 number system interpretation rule described earlier. The fixed-length limitation (recall previous discussion) constrains the range of unsigned integers that can be represented. If N-bit binary strings are used, then there are 2N unique binary values that can be used in the encoding. It might be tempting to conclude that 2N is the largest unsigned integer value that could be encoded, however this would not account for the encoding of the value 0. Since one of the binary values (i.e. the binary string consisting of all 0’s) is used to represent 0, then there are 2N – 1 values left to encode values greater than 0. Therefore, under the binary number system interpretation, N-bit strings can encode unsigned integer values in the range: 0 . . 2N–1. In terms of the number line visualization, the fixed-length limitation truncates the number line to a finite range: 0 1 2 3 ... 2N – 1 Representing only a finite range of unsigned integer values creates the potential for problems when performing arithmetic operations on the binary encoded values. For example, suppose that adding two values together results in an answer that is outside of the finite range? ( E.G. add 1 to 2N–1.) This problem is referred to as overflow and is a major concern when performing computer arithmetic. The conversion between binary values and unsigned integer values is reasonably straightforward. The conversion follows the binary number system interpretation rule described previously. There are several possible algorithms for converting unsigned values to binary representations. The algorithm shown below generates the bits one at a time, beginning with the least significant bit and ending with the most significant bit. In the algorithm assume that div is an integer division operation (i.e. it ignores any remainder, for example: 5 div 2 = 2), and mod is a modulus operation that results in only the remainder of a division (for example 5 mod 2 = 1). An algorithm for converting an unsigned integer to an N-bit representation (follows previous convention of labeling the least significant bit as b0): value = value to be converted for ( i = 0; i < N; i++ ) { bi = value mod 2 ; value = value div 2; } Copyright Trevor W. Pearce, September 12, 2000 For use in the 94.201 course only – not for distribution outside of the Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada Course Notes 1.2 Fall 01/02 94.201 Page 7 of 122 In the algorithm, each iteration through the loop generates the value of one bit and reduces the value remaining to be converted. Try a few values to convince yourself that the algorithm works! Binary Encoding of Signed Integers The set of signed integers is: { … , -3, -2, -1, 0, 1, 2, 3, … } The signed integers can also be visualized on a number line: –3 –2 –1 0 1 2 3 Notes: The unsigned integers are a subset of the signed integers. Numbers to the left of zero are negative, while numbers to the right are positive. 0 is neither positive nor negative. Now consider the problem of encoding signed integers as binary values. The binary number system interpretation will not (by itself) be sufficient, since it does not account for the possibility of negative numbers. Two encoding schemes will be presented; however, it is the 2’s complement method that is most frequently used in typical computer systems (and by most high-level language compilers). Method 1: Signed Magnitude Encoding A simple solution to encoding signed integer values in a fixed length sequence of bits would be to use one bit to encode the sign, and the remaining bits to encode the unsigned magnitude using the binary number system interpretation (as discussed previously). To illustrate the signed magnitude approach, consider 8-bit values, and assume that the most significant bit (b7) is used to encode the sign. Let b7 = 1 denote negative values, and b7 = 0 denote positive values. Using this scheme, the following values would result: Signed Integer Value (decimal) 1 0 -1 127 -20 8-Bit Signed Magnitude Encoding 0 0000001 0 0000000 1 0000001 0 1111111 1 0010100 The primary advantage of the scheme is that the encoded binary values can be interpreted easily as signed values. There are, unfortunately, several disadvantages, including: Copyright Trevor W. Pearce, September 12, 2000 For use in the 94.201 course only – not for distribution outside of the Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada Course Notes 1.2 Fall 01/02 94.201 Page 8 of 122 Two representations for zero. Under the scheme, there are two encodings for zero: 0 0000000 positive zero (b7 = 0) 1 0000000 negative zero (b7 = 1) The negative encoding for zero is not ideal, since zero is not usually thought of as being negative (zero is not really positive either – positive numbers are greater than zero). Furthermore, having two encodings for the same signed integer value means that the use of one of the encoded values is redundant (i.e. it could be used to encode some other value). Method 2: 2’s Complement Encoding The method used in most computers to represent signed integers is the 2’s complement encoding. The method relies on the ability to perform the 2’s complement operation on binary values. The operation first requires forming the complement of each bit, and then adding 1 to the result. The complement of a bit is obtained by inverting its value (i.e. the complement of 0 is 1, and the complement of 1 is 0). Binary addition is carried out in a bitwise fashion, with the following rules: 0 0 1 1 +0 +1 +0 +1 0 1 1 10 0 carry 1 For the last case, the carry is propagated to the next most significant bit. For example, consider the 2’s complement operation on the 8-bit value 01101011: 01101011 10010100 + 1 10010101 (each bit has been complemented) (add 1) 2’s complement of 01101011 The 2's complement operation by itself does not define the mapping of binary strings to signed integers. Since the unsigned integers are a subset of the signed integers, it would be convenient to use the same binary values to represents the same integer values under both interpretations. Therefore, zero and any positive integers should be represented by the same binary values under both unsigned and signed integer interpretations. What remains is the definition of the mapping of binary values to negative integers. Under the 2’s complement interpretation, the binary value representing the negation of a value can be obtained by applying the 2’s complement operation to the original value. For example, the 8-bit representation of –1 can be obtained by applying the 2’s complement operation to the 8-bit representation of +1: +1 = 00000001 + 11111110 1 11111111 (each bit has been complemented) (add 1) 2’s complement representation of –1 Copyright Trevor W. Pearce, September 12, 2000 For use in the 94.201 course only – not for distribution outside of the Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada Course Notes 1.2 Fall 01/02 94.201 Page 9 of 122 In the 2’s complement encoding, it turns out that the representations for all non-negative numbers (i.e. zero and all positive numbers) have the most significant bit = 0, while the representations of all negative numbers have the most significant bit = 1. It might be tempting to think that this is the same as the sign bit used in the signed magnitude encoding described earlier, however, this is not the case! In 2’s complement encoding, all bits including the most significant bit are relevant to the magnitude of the value being represented. The fixed-length limitation imposes a range on the signed integer values that can be represented using 2's complement encoding. For N-bit binary strings, half of the 2N binary values have the most significant bit = 1, and therefore represent negative integers: 1 x 2N = 2N–1 binary values represent negative integers 2 Therefore, the smallest negative integer that can be represented (i.e. the negative integer with the largest magnitude) is –2N–1, and there are 2N–1 binary values left to represent non-negative integers. Since one binary value represents zero, then the largest positive integer that can be represented is 2N–1 – 1. In terms of the number line visualization, the fixed-length limitation truncates the number line to a finite range: –2N–1 –2 –1 0 1 2 2N–1–1 For non-negative integers, the conversion between binary values and signed integers follows the binary number system interpretation (as for unsigned integers). The conversion is not as obvious for negative integers. To form the binary value that represents a negative integer, first form the unsigned representation of the magnitude (as done for unsigned integers), and then perform the 2’s complement operation to obtain the representation of the negation of the value. This process was illustrated above to obtain the representation of –1. To obtain the negative integer represented by a binary value, first perform the 2’s complement operation to obtain the representation of the magnitude, and then calculate the magnitude (as done for non-negative values) – but, remember that the result is a negative value! For example, consider the integer value represented by the 8-bit binary value 11111111. Since the most significant bit = 1, the integer value is negative. To find the magnitude, the 2's complement of the value must be taken: 11111111 00000000 (each bit has been complemented) + 1 (add 1) 00000001 2’s complement of 11111111 But the 2's complement result is easily identified as the magnitude = 1, and therefore, the original binary value represents the signed integer –1. Copyright Trevor W. Pearce, September 12, 2000 For use in the 94.201 course only – not for distribution outside of the Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada Course Notes 1.2 Fall 01/02 94.201 Page 10 of 122 It is interesting to compare the two examples given above. The first example obtained the encoding for –1 by taking the 2's complement of +1 (i.e. by negating the representation for 1). The second example negated the resulting representation of –1, and arrived back at the original representation for +1 (as introduced in the first example). The two examples show the consistency of the negation operation: negation( negation( +1 ) ) = +1 Consider several more interesting cases: Recall that one of the problems with the signed magnitude encoding was that it resulted in two representations of the value zero. Consider the 2's complement of the 8-bit value 00000000: 0000000 11111111 (each bit has been complemented) + 1 (add 1) 1 00000000 The addition of 1 to the complemented value has propagated a carry through all the bits to create a 9-bit value! The use of fixed-length binary strings creates problems here, since the 9-bit result cannot be represented by an 8-bit value. As a rule when performing arithmetic on fixed-length binary strings, say of length N, the result is truncated to keep only the N least significant bits of the answer. (The carry out of the most significant bit is sometimes relevant and will be considered further in the discussion of overflow.) Truncating the 9-bit answer obtained above to keep the least significant 8 bits gives: 00000000, which is the representation of 0. Therefore, the negation of 0 is 0 (which is mathematically correct), and there is only one representation for the value zero. Now consider the largest positive value that can be represented by an 8-bit value under 2's complement encoding: From the previous discussion of the signed integer range, the largest integer is 28–1 – 1 = 12710 = 011111112. Negating this value gives: 01111111 10000000 + 1 10000001 = –12710 Subtracting 1 from this value gives: 10000001 – 1 10000000 = –12810 (i.e. the value –28–1) Now what should happen if we negate this value? Should we get the representation of +128? (We should be surprised if we do, since +127 is the largest value that can be represented!) Try it and see what happens! Copyright Trevor W. Pearce, September 12, 2000 For use in the 94.201 course only – not for distribution outside of the Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada Course Notes 1.2 Fall 01/02 94.201 Page 11 of 122 Binary Encoding of Characters Applications often deal with information in the form of textual characters (e.g. output to the display or a printer). Since the state variables of the computer system only hold binary values, the characters must also have binary encodings. The 7-bit ASCII encoding scheme is widely used in practice. (ASCII = American Standard Coding for Information Interchange) The encoding is shown in the attached ASCII table. In the encoding, each character is represented by a 7-bit binary value. This encoding is often extended to form an 8-bit value by extending the encoding with b7 = 0. The table lists hexadecimal values instead of binary values (recall: hexadecimal notation is used as shorthand for binary notation). Several points are worth noting: The encoding for the character '0' (character zero) is 30H (not 00H). A blank space has a representation in the encoding (20H). Upper and lower case letters have different encodings (e.g. 'A' is encoded as 41H, while 'a' is encoded as 61H). The table shows the encodings for displayable characters, but there are several nondisplayable encodings worth noting: carriage return 0DH line feed 0AH These codes are often used to control the cursor position on the display, and the print head on a printer. Copyright Trevor W. Pearce, September 12, 2000 For use in the 94.201 course only – not for distribution outside of the Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada Course Notes 1.2 Fall 01/02 94.201 Page 12 of 122 ASCII Table (for displayable 7-bit ASCII character set) hex char hex char hex char 20 21 22 23 blank space ! " # 40 41 42 43 @ A B C 60 61 62 63 ` a b c 24 25 26 27 $ % & ' 44 45 46 47 D E F G 64 65 66 67 d e f g 28 29 2A 2B ( ) * + 48 49 4A 4B H I J K 68 69 6A 6B h i j k 2C 2D 2E 2F , . / 4C 4D 4E 4F L M N 0 6C 6D 6E 6F l m n o 30 31 32 33 0 1 2 3 50 51 52 53 P Q R S 70 71 72 73 p q r s 34 35 36 37 4 5 6 7 54 55 56 57 T U V W 74 75 76 77 t u v w 38 39 3A 3B 8 9 : ; 58 59 5A 5B X Y Z [ 78 79 7A 7B x y z { 3C 3D 3E 3F < = > ? 5C 5D 5E 5F \ ] ^ _ 7C 7D 7E 7F | } ~ non-displayable Copyright Trevor W. Pearce, September 12, 2000 For use in the 94.201 course only – not for distribution outside of the Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada