I speak Spanish to God, Italian to women, French to men, and German to my horse.
Charles V, King of France 1337–1380

2  Instructions: Language of the Computer

2.1  Introduction
2.2  Operations of the Computer Hardware
2.3  Operands of the Computer Hardware
2.4  Signed and Unsigned Numbers
2.5  Representing Instructions in the Computer
2.6  Logical Operations
2.7  Instructions for Making Decisions
2.8  Supporting Procedures in Computer Hardware
2.9  Communicating with People
2.10 ARM Addressing for 32-Bit Immediates and More Complex Addressing Modes
2.11 Parallelism and Instructions: Synchronization
2.12 Translating and Starting a Program
2.13 A C Sort Example to Put It All Together
2.14 Arrays versus Pointers
2.15 Advanced Material: Compiling C and Interpreting Java
2.16 Real Stuff: MIPS Instructions
2.17 Real Stuff: x86 Instructions
2.18 Fallacies and Pitfalls
2.19 Concluding Remarks
2.20 Historical Perspective and Further Reading
2.21 Exercises

[Chapter-opening figure: The Five Classic Components of a Computer]

2.1  Introduction

instruction set: The vocabulary of commands understood by a given architecture.

To command a computer's hardware, you must speak its language. The words of a computer's language are called instructions, and its vocabulary is called an instruction set. In this chapter, you will see the instruction set of a real computer, both in the form written by people and in the form read by the computer. We introduce instructions in a top-down fashion. Starting from a notation that looks like a restricted programming language, we refine it step-by-step until you see the real language of a real computer. Chapter 3 continues our downward descent, unveiling the hardware for arithmetic and the representation of floating-point numbers.

You might think that the languages of computers would be as diverse as those of people, but in reality computer languages are quite similar, more like regional dialects than like independent languages. Hence, once you learn one, it is easy to pick up others. This similarity occurs because all computers are constructed from hardware technologies based on similar underlying principles and because there are a few basic operations that all computers must provide. Moreover, computer designers have a common goal: to find a language that makes it easy to build the hardware and the compiler while maximizing performance and minimizing cost and power. This goal is time honored; the following quote was written before you could buy a computer, and it is as true today as it was in 1947:

It is easy to see by formal-logical methods that there exist certain [instruction sets] that are in abstract adequate to control and cause the execution of any sequence of operations. . . . The really decisive considerations from the present point of view, in selecting an [instruction set], are more of a practical nature: simplicity of the equipment demanded by the [instruction set], and the clarity of its application to the actually important problems together with the speed of its handling of those problems.
Burks, Goldstine, and von Neumann, 1947

The "simplicity of the equipment" is as valuable a consideration for today's computers as it was for those of the 1950s.
The goal of this chapter is to teach an instruction set that follows this advice, showing both how it is represented in hardware and the relationship between high-level programming languages and this more primitive one. Our examples are in the C programming language; Section 2.15 on the CD shows how these would change for an object-oriented language like Java.

By learning how to represent instructions, you will also discover the secret of computing: the stored-program concept. Moreover, you will exercise your "foreign language" skills by writing programs in the language of the computer and running them on the simulator that comes with this book. You will also see the impact of programming languages and compiler optimization on performance. We conclude with a look at the historical evolution of instruction sets and an overview of other computer dialects.

stored-program concept: The idea that instructions and data of many types can be stored in memory as numbers, leading to the stored-program computer.

The chosen instruction set is ARM, the most popular 32-bit instruction set in the world: 4 billion were shipped in 2008. Later, we will take a quick look at two other popular instruction sets. MIPS is quite similar to ARM, making up in elegance what it lacks in popularity. The other example, the Intel x86, is inside almost all of the 300 million PCs made in 2008.

We reveal the ARM instruction set a piece at a time, giving the rationale along with the computer structures. This top-down, step-by-step tutorial weaves the components with their explanations, making the computer's language more palatable. Figure 2.1 gives a sneak preview of the instruction set covered in this chapter.

2.2  Operations of the Computer Hardware

There must certainly be instructions for performing the fundamental arithmetic operations.
Burks, Goldstine, and von Neumann, 1947

Every computer must be able to perform arithmetic. The ARM assembly language notation

ADD a, b, c

instructs a computer to add the two variables b and c and to put their sum in a. This notation is rigid in that each ARM arithmetic instruction performs only one operation and must always have exactly three variables. For example, suppose we want to place the sum of four variables b, c, d, and e into variable a. (In this section we are being deliberately vague about what a "variable" is; in the next section we'll explain in detail.) The following sequence of instructions adds the four variables:

ADD a, b, c    ; The sum of b and c is placed in a.
ADD a, a, d    ; The sum of b, c, and d is now in a.
ADD a, a, e    ; The sum of b, c, d, and e is now in a.

Thus, it takes three instructions to sum the four variables.

The words to the right of the semicolon (;) on each line above are comments for the human reader; the computer ignores them. Note that unlike other programming languages, each line of this language can contain at most one instruction. Another difference from C is that comments always terminate at the end of a line.

ARM operands

Name: 16 registers
Example: r0, r1, r2, . . . , r11, r12, sp, lr, pc
Comments: Fast locations for data. In ARM, data must be in registers to perform arithmetic.

Name: 2^30 memory words
Example: Memory[0], Memory[4], . . . , Memory[4294967292]
Comments: Accessed only by data transfer instructions.
ARM uses byte addresses, so sequential word addresses differ by 4. Memory holds data structures, arrays, and spilled registers.

ARM assembly language

Arithmetic
  add                              ADD r1,r2,r3        r1 = r2 + r3                            3 register operands
  subtract                         SUB r1,r2,r3        r1 = r2 – r3                            3 register operands

Data transfer
  load register                    LDR r1, [r2,#20]    r1 = Memory[r2 + 20]                    Word from memory to register
  store register                   STR r1, [r2,#20]    Memory[r2 + 20] = r1                    Word from register to memory
  load register halfword           LDRH r1, [r2,#20]   r1 = Memory[r2 + 20]                    Halfword memory to register
  load register halfword signed    LDRHS r1, [r2,#20]  r1 = Memory[r2 + 20]                    Halfword memory to register
  store register halfword          STRH r1, [r2,#20]   Memory[r2 + 20] = r1                    Halfword register to memory
  load register byte               LDRB r1, [r2,#20]   r1 = Memory[r2 + 20]                    Byte from memory to register
  load register byte signed        LDRBS r1, [r2,#20]  r1 = Memory[r2 + 20]                    Byte from memory to register
  store register byte              STRB r1, [r2,#20]   Memory[r2 + 20] = r1                    Byte from register to memory
  swap                             SWP r1, [r2,#20]    r1 = Memory[r2 + 20], Memory[r2 + 20] = r1   Atomic swap register and memory

Logical
  mov                              MOV r1, r2          r1 = r2                                 Copy value into register
  and                              AND r1, r2, r3      r1 = r2 & r3                            Three reg. operands; bit-by-bit AND
  or                               ORR r1, r2, r3      r1 = r2 | r3                            Three reg. operands; bit-by-bit OR
  not                              MVN r1, r2          r1 = ~r2                                Two reg. operands; bit-by-bit NOT
  logical shift left (optional operation)   LSL r1, r2, #10   r1 = r2 << 10                    Shift left by constant
  logical shift right (optional operation)  LSR r1, r2, #10   r1 = r2 >> 10                    Shift right by constant

Conditional branch
  compare                          CMP r1, r2          cond. flag = r1 – r2                    Compare for conditional branch
  branch on EQ, NE, LT, LE, GT, GE, LO, LS, HI, HS, VS, VC, MI, PL
                                   BEQ 25              if (r1 == r2) go to PC + 8 + 100        Conditional test; PC-relative branch

Unconditional branch
  branch (always)                  B 2500              go to PC + 8 + 10000                    Branch
  branch and link                  BL 2500             r14 = PC + 4; go to PC + 8 + 10000      For procedure call

FIGURE 2.1 ARM assembly language revealed in this chapter. This information is also found in Column 1 of the ARM Reference Data Card at the front of this book.

The natural number of operands for an operation like addition is three: the two numbers being added together and a place to put the sum. Requiring every instruction to have exactly three operands, no more and no less, conforms to the philosophy of keeping the hardware simple: hardware for a variable number of operands is more complicated than hardware for a fixed number. This situation illustrates the first of four underlying principles of hardware design:

Design Principle 1: Simplicity favors regularity.

We can now show, in the two examples that follow, the relationship of programs written in higher-level programming languages to programs in this more primitive notation.

Compiling Two C Assignment Statements into ARM

EXAMPLE

This segment of a C program contains the five variables a, b, c, d, and e. Since Java evolved from C, this example and the next few work for either high-level programming language:

a = b + c;
d = a – e;

The translation from C to ARM assembly language instructions is performed by the compiler. Show the ARM code produced by a compiler.

An ARM instruction operates on two source operands and places the result in one destination operand.
Hence, the two simple statements above compile directly into these two ARM assembly language instructions:

ANSWER

ADD a, b, c
SUB d, a, e

Compiling a Complex C Assignment into ARM

EXAMPLE

A somewhat complex statement contains the five variables f, g, h, i, and j:

f = (g + h) – (i + j);

What might a C compiler produce?

ANSWER

The compiler must break this statement into several assembly instructions, since only one operation is performed per ARM instruction. The first ARM instruction calculates the sum of g and h. We must place the result somewhere, so the compiler creates a temporary variable, called t0:

ADD t0,g,h    ; temporary variable t0 contains g + h

Although the next operation is subtract, we need to calculate the sum of i and j before we can subtract. Thus, the second instruction places the sum of i and j in another temporary variable created by the compiler, called t1:

ADD t1,i,j    ; temporary variable t1 contains i + j

Finally, the subtract instruction subtracts the second sum from the first and places the difference in the variable f, completing the compiled code:

SUB f,t0,t1    ; f gets t0 – t1, which is (g + h) – (i + j)

Check Yourself: For a given function, which programming language likely takes the most lines of code? Put the three representations below in order.

1. Java
2. C
3. ARM assembly language

Elaboration: To increase portability, Java was originally envisioned as relying on a software interpreter. The instruction set of this interpreter is called Java bytecodes (see Section 2.15 on the CD), which is quite different from the ARM instruction set. To get performance close to the equivalent C program, Java systems today typically compile Java bytecodes into the native instruction sets like ARM. Because this compilation is normally done much later than for C programs, such Java compilers are often called Just In Time (JIT) compilers. Section 2.12 shows how JITs are used later than C compilers in the start-up process, and Section 2.13 shows the performance consequences of compiling versus interpreting Java programs.

2.3  Operands of the Computer Hardware

word: The natural unit of access in a computer, usually a group of 32 bits; corresponds to the size of a register in the ARM architecture.

Unlike programs in high-level languages, the operands of arithmetic instructions are restricted; they must be from a limited number of special locations built directly in hardware called registers. Registers are primitives used in hardware design that are also visible to the programmer when the computer is completed, so you can think of registers as the bricks of computer construction. The size of a register in the ARM architecture is 32 bits; groups of 32 bits occur so frequently that they are given the name word in the ARM architecture.

One major difference between the variables of a programming language and registers is the limited number of registers, typically 16 to 32 on current computers. (See Section 2.20 on the CD for the history of the number of registers.) Thus, continuing in our top-down, stepwise evolution of the symbolic representation of the ARM language, in this section we have added the restriction that the three operands of ARM arithmetic instructions must each be chosen from one of the 16 32-bit registers.
The reason for the limit of 16 registers may be found in the second of our four underlying design principles of hardware technology:

Design Principle 2: Smaller is faster.

A very large number of registers may increase the clock cycle time simply because it takes electronic signals longer when they must travel farther. Guidelines such as "smaller is faster" are not absolutes; 15 registers may not be faster than 16. Yet, the truth behind such observations causes computer designers to take them seriously. In this case, the designer must balance the craving of programs for more registers with the designer's desire to keep the clock cycle fast. Another reason for not using more than 16 is the number of bits it would take in the instruction format, as Section 2.5 demonstrates. Energy is a major concern today, so a third reason for using fewer registers is to conserve energy.

Chapter 4 shows the central role that registers play in hardware construction; as we shall see in this chapter, effective use of registers is critical to program performance.

We use the convention r0, r1, . . . , r15 to refer to registers 0, 1, . . . , 15.

Compiling a C Assignment Using Registers

EXAMPLE

It is the compiler's job to associate program variables with registers. Take, for instance, the assignment statement from our earlier example:

f = (g + h) – (i + j);

The variables f, g, h, i, and j are assigned to the registers r0, r1, r2, r3, and r4, respectively. What is the compiled ARM code?

ANSWER

The compiled program is very similar to the prior example, except we replace the variables with the register names mentioned above plus two temporary registers, r5 and r6, which correspond to the temporary variables above:

ADD r5,r1,r2    ; register r5 contains g + h
ADD r6,r3,r4    ; register r6 contains i + j
SUB r0,r5,r6    ; r0 gets r5 – r6, which is (g + h) – (i + j)

Memory Operands

Programming languages have simple variables that contain single data elements, as in these examples, but they also have more complex data structures—arrays and structures. These complex data structures can contain many more data elements than there are registers in a computer. How can a computer represent and access such large structures?

FIGURE 2.2 Memory addresses and contents of memory at those locations.

  Address    Data
     3        100
     2         10
     1        101
     0          1

If these elements were words, these addresses would be incorrect, since ARM actually uses byte addressing, with each word representing four bytes. Figure 2.3 shows the memory addressing for sequential word addresses.

data transfer instruction: A command that moves data between memory and registers.

address: A value used to delineate the location of a specific data element within a memory array.

Recall the five components of a computer introduced in Chapter 1 and repeated on page 75. The processor can keep only a small amount of data in registers, but computer memory contains billions of data elements. Hence, data structures (arrays and structures) are kept in memory. As explained above, arithmetic operations occur only on registers in ARM instructions; thus, ARM must include instructions that transfer data between memory and registers. Such instructions are called data transfer instructions. To access a word in memory, the instruction must supply the memory address. Memory is just a large, single-dimensional array, with the address acting as the index to that array, starting at 0.
For example, in Figure 2.2, the address of the third data element is 2, and the value of Memory[2] is 10.

The data transfer instruction that copies data from memory to a register is traditionally called load. The format of the load instruction is the name of the operation followed by the register to be loaded, then a constant and register used to access memory. The sum of the constant portion of the instruction and the contents of the second register forms the memory address. The actual ARM name for this instruction is LDR, standing for load word into register.

Compiling an Assignment When an Operand Is in Memory

EXAMPLE

Let's assume that A is an array of 100 words and that the compiler has associated the variables g and h with the registers r1 and r2 and uses r5 as a temporary register as before. Let's also assume that the starting address, or base address, of the array is in r3. Compile this C assignment statement:

g = h + A[8];

ANSWER

Although there is a single operation in this assignment statement, one of the operands is in memory, so we must first transfer A[8] to a register. The address of this array element is the sum of the base of the array A, found in register r3, plus the number to select element 8. The data should be placed in a temporary register for use in the next instruction. Based on Figure 2.2, the first compiled instruction is

LDR r5,[r3,#8]    ; Temporary reg r5 gets A[8]

(On the next page we'll make a slight adjustment to this instruction, but we'll use this simplified version for now.) The following instruction can operate on the value in r5 (which equals A[8]) since it is in a register. The instruction must add h (contained in r2) to A[8] (r5) and put the sum in the register corresponding to g (associated with r1):

ADD r1,r2,r5    ; g = h + A[8]

The constant in a data transfer instruction (8) is called the offset, and the register added to form the address (r3) is called the base register.

Hardware/Software Interface

In addition to associating variables with registers, the compiler allocates data structures like arrays and structures to locations in memory. The compiler can then place the proper starting address into the data transfer instructions.

Since 8-bit bytes are useful in many programs, most architectures address individual bytes. Therefore, the address of a word matches the address of one of the 4 bytes within the word, and addresses of sequential words differ by 4. For example, Figure 2.3 shows the actual ARM addresses for the words in Figure 2.2; the byte address of the third word is 8.

FIGURE 2.3 Actual ARM memory addresses and contents of memory for those words.

  Byte Address    Data
      12           100
       8            10
       4           101
       0             1

The changed addresses are highlighted to contrast with Figure 2.2. Since ARM addresses each byte, word addresses are multiples of 4: there are 4 bytes in a word.

alignment restriction: A requirement that data be aligned in memory on natural boundaries.

In ARM, words must start at addresses that are multiples of 4. This requirement is called an alignment restriction, and many architectures have it. (Chapter 4 suggests why alignment leads to faster data transfers.) Computers divide into those that use the address of the leftmost or "big end" byte as the word address versus those that use the rightmost or "little end" byte. ARM is in the little-endian camp.
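As a quick check of these ideas in C, the following sketch is purely illustrative (the array A and its values simply mirror Figure 2.3). It treats the array's base address as a byte address and prints the byte offset of each word, showing that element i lies 4 × i bytes past the base, exactly the arithmetic a load or store must perform.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int32_t A[4] = {1, 101, 10, 100};   /* each element is one 4-byte word, as in Figure 2.3 */
    char *base = (char *) A;            /* byte address of A[0], playing the role of the base register */

    for (int i = 0; i < 4; i++) {
        /* byte offset of A[i] from the base is 4 * i: 0, 4, 8, 12 */
        printf("A[%d] holds %d at byte offset %ld\n",
               i, A[i], (long) ((char *) &A[i] - base));
    }
    return 0;
}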
Byte addressing also affects the array index. To get the proper byte address in the code above, the offset to be added to the base register r3 must be 4 × 8, or 32, so that the load address will select A[8] and not A[8/4]. (See the related pitfall on page 171 of Section 2.18.)

The instruction complementary to load is traditionally called store; it copies data from a register to memory. The format of a store is similar to that of a load: the name of the operation, followed by the register to be stored, then the offset to select the array element, and finally the base register. Once again, the ARM address is specified in part by a constant and in part by the contents of a register. The actual ARM name is STR, standing for store word from a register.

Compiling Using Load and Store

EXAMPLE

Assume variable h is associated with register r2 and the base address of the array A is in r3. What is the ARM assembly code for the C assignment statement below?

A[12] = h + A[8];

ANSWER

Although there is a single operation in the C statement, now two of the operands are in memory, so we need even more ARM instructions. The first two instructions are the same as the prior example, except this time we use the proper offset for byte addressing in the load word instruction to select A[8], and the ADD instruction places the sum in r5:

LDR r5,[r3,#32]    ; Temporary reg r5 gets A[8]
ADD r5,r2,r5       ; Temporary reg r5 gets h + A[8]

The final instruction stores the sum into A[12], using 48 (4 × 12) as the offset and register r3 as the base register.

STR r5,[r3,#48]    ; Stores h + A[8] back into A[12]

Load word and store word are the instructions that copy words between memory and registers in the ARM architecture. Other brands of computers use other instructions along with load and store to transfer data. An architecture with such alternatives is the Intel x86, described in Section 2.17.

Hardware/Software Interface

Many programs have more variables than computers have registers. Consequently, the compiler tries to keep the most frequently used variables in registers and places the rest in memory, using loads and stores to move variables between registers and memory. The process of putting less commonly used variables (or those needed later) into memory is called spilling registers.

The hardware principle relating size and speed suggests that memory must be slower than registers, since there are fewer registers. This is indeed the case; data accesses are faster if data is in registers instead of memory. Moreover, data is more useful when in a register. An ARM arithmetic instruction can read two registers, operate on them, and write the result. An ARM data transfer instruction only reads one operand or writes one operand, without operating on it. Thus, registers take less time to access and have higher throughput than memory, making data in registers both faster to access and simpler to use. Accessing registers also uses less energy than accessing memory. To achieve highest performance and conserve energy, compilers must use registers efficiently.

Constant or Immediate Operands

Many times a program will use a constant in an operation—for example, incrementing an index to point to the next element of an array. In fact, more than half of the ARM arithmetic instructions have a constant as an operand when running the SPEC2006 benchmarks. Using only the instructions we have seen so far, we would have to load a constant from memory to use one.
(The constants would have been placed in memory when the program was loaded.) For example, to add the constant 4 to register r3, we could use the code

LDR r5, [r1,#AddrConstant4]    ; r5 = constant 4
ADD r3,r3,r5                   ; r3 = r3 + r5 (r5 == 4)

assuming that r1 + AddrConstant4 is the memory address of the constant 4.

An alternative that avoids the load instruction is to offer the option of having one operand of the arithmetic instructions be a constant, called an immediate operand. To add 4 to register r3, we just write

ADD r3,r3,#4    ; r3 = r3 + 4

The sharp or hash symbol (#) means the following number is a constant. Immediate operands illustrate the third hardware design principle, first mentioned in the Fallacies and Pitfalls of Chapter 1:

Design Principle 3: Make the common case fast.

Constant operands occur frequently, and by including constants inside arithmetic instructions, operations are much faster and use less energy than if constants were loaded from memory.

Check Yourself: Given the importance of registers, what is the rate of increase in the number of registers in a chip over time?

1. Very fast: They increase as fast as Moore's law, which predicts doubling the number of transistors on a chip every 18 months.
2. Very slow: Since programs are usually distributed in the language of the computer, there is inertia in instruction set architecture, and so the number of registers increases only as fast as new instruction sets become viable.

Elaboration: The ARM offset plus base register addressing is an excellent match to structures as well as arrays, since the register can point to the beginning of the structure and the offset can select the desired element. We'll see such an example in Section 2.13.

The register in the data transfer instructions was originally invented to hold an index of an array with the offset used for the starting address of an array. Thus, the base register is also called the index register. Today's memories are much larger and the software model of data allocation is more sophisticated, so the base address of the array is normally passed in a register since it won't fit in the offset, as we shall see.

2.4  Signed and Unsigned Numbers

First, let's quickly review how a computer represents numbers. Humans are taught to think in base 10, but numbers may be represented in any base. For example, 123 base 10 = 1111011 base 2. Numbers are kept in computer hardware as a series of high and low electronic signals, and so they are considered base 2 numbers. (Just as base 10 numbers are called decimal numbers, base 2 numbers are called binary numbers.)

binary digit: Also called binary bit. One of the two numbers in base 2, 0 or 1, that are the components of information.

A single digit of a binary number is thus the "atom" of computing, since all information is composed of binary digits or bits. This fundamental building block can be one of two values, which can be thought of as several alternatives: high or low, on or off, true or false, or 1 or 0.

Generalizing the point, in any number base, the value of the ith digit d is

d × Base^i

where i starts at 0 and increases from right to left. This leads to an obvious way to number the bits in the word: simply use the power of the base for that bit. We subscript decimal numbers with ten and binary numbers with two.
For example, 1011two represents

(1 × 2^3) + (0 × 2^2) + (1 × 2^1) + (1 × 2^0)ten
= (1 × 8) + (0 × 4) + (1 × 2) + (1 × 1)ten
= 8 + 0 + 2 + 1ten
= 11ten

We number the bits 0, 1, 2, 3, . . . from right to left in a word. The drawing below shows the numbering of bits within an ARM word and the placement of the number 1011two:

Bit:       31 30 29 . . .  4  3  2  1  0
Contents:   0  0  0 . . .  0  1  0  1  1
(32 bits wide)

least significant bit: The rightmost bit in an ARM word.

most significant bit: The leftmost bit in an ARM word.

Since words are drawn vertically as well as horizontally, leftmost and rightmost may be unclear. Hence, the phrase least significant bit is used to refer to the rightmost bit (bit 0 above) and most significant bit to the leftmost bit (bit 31).

The ARM word is 32 bits long, so we can represent 2^32 different 32-bit patterns. It is natural to let these combinations represent the numbers from 0 to 2^32 − 1 (4,294,967,295ten):

0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
0000 0000 0000 0000 0000 0000 0000 0001two = 1ten
0000 0000 0000 0000 0000 0000 0000 0010two = 2ten
. . .
1111 1111 1111 1111 1111 1111 1111 1101two = 4,294,967,293ten
1111 1111 1111 1111 1111 1111 1111 1110two = 4,294,967,294ten
1111 1111 1111 1111 1111 1111 1111 1111two = 4,294,967,295ten

That is, 32-bit binary numbers can be represented in terms of the bit value times a power of 2 (here xi means the ith bit of x):

(x31 × 2^31) + (x30 × 2^30) + (x29 × 2^29) + . . . + (x1 × 2^1) + (x0 × 2^0)

Keep in mind that the binary bit patterns above are simply representatives of numbers. Numbers really have an infinite number of digits, with almost all being 0 except for a few of the rightmost digits. We just don't normally show leading 0s.

Hardware can be designed to add, subtract, multiply, and divide these binary bit patterns. If the number that is the proper result of such operations cannot be represented by these rightmost hardware bits, overflow is said to have occurred. It's up to the programming language, the operating system, and the program to determine what to do if overflow occurs.

Computer programs calculate both positive and negative numbers, so we need a representation that distinguishes the positive from the negative. The most obvious solution is to add a separate sign, which conveniently can be represented in a single bit; the name for this representation is sign and magnitude.

Alas, sign and magnitude representation has several shortcomings. First, it's not obvious where to put the sign bit. To the right? To the left? Early computers tried both. Second, adders for sign and magnitude may need an extra step to set the sign because we can't know in advance what the proper sign will be. Finally, a separate sign bit means that sign and magnitude has both a positive and a negative zero, which can lead to problems for inattentive programmers. As a result of these shortcomings, sign and magnitude representation was soon abandoned.

In the search for a more attractive alternative, the question arose as to what would be the result for unsigned numbers if we tried to subtract a large number from a small one. The answer is that it would try to borrow from a string of leading 0s, so the result would have a string of leading 1s.
Given that there was no obvious better alternative, the final solution was to pick the representation that made the hardware simple: leading 0s mean positive, and leading 1s mean negative. This convention for representing signed binary numbers is called two's complement representation:

0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
0000 0000 0000 0000 0000 0000 0000 0001two = 1ten
0000 0000 0000 0000 0000 0000 0000 0010two = 2ten
. . .
0111 1111 1111 1111 1111 1111 1111 1101two = 2,147,483,645ten
0111 1111 1111 1111 1111 1111 1111 1110two = 2,147,483,646ten
0111 1111 1111 1111 1111 1111 1111 1111two = 2,147,483,647ten
1000 0000 0000 0000 0000 0000 0000 0000two = –2,147,483,648ten
1000 0000 0000 0000 0000 0000 0000 0001two = –2,147,483,647ten
1000 0000 0000 0000 0000 0000 0000 0010two = –2,147,483,646ten
. . .
1111 1111 1111 1111 1111 1111 1111 1101two = –3ten
1111 1111 1111 1111 1111 1111 1111 1110two = –2ten
1111 1111 1111 1111 1111 1111 1111 1111two = –1ten

The positive half of the numbers, from 0 to 2,147,483,647ten (2^31 − 1), use the same representation as before. The following bit pattern (1000 . . . 0000two) represents the most negative number −2,147,483,648ten (−2^31). It is followed by a declining set of negative numbers: −2,147,483,647ten (1000 . . . 0001two) down to −1ten (1111 . . . 1111two).

Two's complement does have one negative number, −2,147,483,648ten, that has no corresponding positive number. Such imbalance was also a worry to the inattentive programmer, but sign and magnitude had problems for both the programmer and the hardware designer. Consequently, every computer today uses two's complement binary representations for signed numbers.

Two's complement representation has the advantage that all negative numbers have a 1 in the most significant bit. Consequently, hardware needs to test only this bit to see if a number is positive or negative (with the number 0 considered positive). This bit is often called the sign bit. By recognizing the role of the sign bit, we can represent positive and negative 32-bit numbers in terms of the bit value times a power of 2:

(x31 × −2^31) + (x30 × 2^30) + (x29 × 2^29) + . . . + (x1 × 2^1) + (x0 × 2^0)

The sign bit is multiplied by −2^31, and the rest of the bits are then multiplied by positive versions of their respective base values.

Binary to Decimal Conversion

EXAMPLE

What is the decimal value of this 32-bit two's complement number?

1111 1111 1111 1111 1111 1111 1111 1100two

ANSWER

Substituting the number's bit values into the formula above:

(1 × −2^31) + (1 × 2^30) + (1 × 2^29) + . . . + (1 × 2^2) + (0 × 2^1) + (0 × 2^0)
= −2^31 + 2^30 + 2^29 + . . . + 2^2 + 0 + 0
= −2,147,483,648ten + 2,147,483,644ten
= −4ten

We'll see a shortcut to simplify conversion from negative to positive soon.

Just as an operation on unsigned numbers can overflow the capacity of hardware to represent the result, so can an operation on two's complement numbers. Overflow occurs when the leftmost retained bit of the binary bit pattern is not the same as the infinite number of digits to the left (the sign bit is incorrect): a 0 on the left of the bit pattern when the number is negative or a 1 when the number is positive.

Hardware/Software Interface

Unlike the numbers discussed above, memory addresses naturally start at 0 and continue to the largest address.
Put another way, negative addresses make no sense. Thus, programs want to deal sometimes with numbers that can be positive or negative and sometimes with numbers that can be only positive. Some programming languages reflect this distinction. C, for example, names the former integers (declared as int in the program) and the latter unsigned integers (unsigned int). Some C style guides even recommend declaring the former as signed int to keep the distinction clear.

Let's examine two useful shortcuts when working with two's complement numbers. The first shortcut is a quick way to negate a two's complement binary number. Simply invert every 0 to 1 and every 1 to 0, then add one to the result. This shortcut is based on the observation that the sum of a number and its inverted representation must be 111 . . . 111two, which represents −1. Since x + x̄ = −1, therefore x + x̄ + 1 = 0, or x̄ + 1 = −x.

Negation Shortcut

EXAMPLE

Negate 2ten, and then check the result by negating −2ten.

ANSWER

2ten = 0000 0000 0000 0000 0000 0000 0000 0010two

Negating this number by inverting the bits and adding one,

  1111 1111 1111 1111 1111 1111 1111 1101two
+                                         1two
= 1111 1111 1111 1111 1111 1111 1111 1110two = –2ten

Going the other direction, 1111 1111 1111 1111 1111 1111 1111 1110two is first inverted and then incremented:

  0000 0000 0000 0000 0000 0000 0000 0001two
+                                         1two
= 0000 0000 0000 0000 0000 0000 0000 0010two = 2ten

Our next shortcut tells us how to convert a binary number represented in n bits to a number represented with more than n bits. For example, the immediate field in the load, store, branch, ADD, and set on less than instructions contains a two's complement 16-bit number, representing −32,768ten (−2^15) to 32,767ten (2^15 − 1). To add the immediate field to a 32-bit register, the computer must convert that 16-bit number to its 32-bit equivalent. The shortcut is to take the most significant bit from the smaller quantity—the sign bit—and replicate it to fill the new bits of the larger quantity. The old bits are simply copied into the right portion of the new word. This shortcut is commonly called sign extension.

Sign Extension Shortcut

EXAMPLE

Convert 16-bit binary versions of 2ten and −2ten to 32-bit binary numbers.

ANSWER

The 16-bit binary version of the number 2 is

0000 0000 0000 0010two = 2ten

It is converted to a 32-bit number by making 16 copies of the value in the most significant bit (0) and placing that in the left-hand half of the word. The right half gets the old value:

0000 0000 0000 0000 0000 0000 0000 0010two = 2ten

Let's negate the 16-bit version of 2 using the earlier shortcut. Thus,

0000 0000 0000 0010two

becomes

  1111 1111 1111 1101two
+                     1two
= 1111 1111 1111 1110two

Creating a 32-bit version of the negative number means copying the sign bit 16 times and placing it on the left:

1111 1111 1111 1111 1111 1111 1111 1110two = –2ten

This trick works because positive two's complement numbers really have an infinite number of 0s on the left and negative two's complement numbers have an infinite number of 1s. The binary bit pattern representing a number hides leading bits to fit the width of the hardware; sign extension simply restores some of them.
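Both shortcuts can be verified with a few lines of C. The sketch below is purely illustrative (it is not part of the ARM tool flow): it negates 2 with the invert-and-add-one rule, then widens a 16-bit −2 to 32 bits, which the C conversion performs by replicating the sign bit, and prints the resulting bit patterns so they can be compared with the worked examples above.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t x = 2;                 /* 0000 ... 0010two                                    */
    uint32_t neg = ~x + 1;          /* invert every bit, then add one                      */
    printf("negation of 2: 0x%08x (%d)\n",
           (unsigned) neg, (int32_t) neg);          /* prints 0xfffffffe (-2)              */

    int16_t half = -2;              /* 16-bit two's complement pattern 1111 ... 1110two    */
    int32_t wide = half;            /* widening copies bit 15 into bits 16..31 (sign extension) */
    printf("sign-extended -2: 0x%08x (%d)\n",
           (unsigned) wide, wide);                  /* prints 0xfffffffe (-2)              */
    return 0;
}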
Summary

The main point of this section is that we need to represent both positive and negative integers within a computer word, and although there are pros and cons to any option, the overwhelming choice since 1965 has been two's complement.

Check Yourself: What is the decimal value of this 64-bit two's complement number?

1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1000two

1) –4ten
2) –8ten
3) –16ten
4) 18,446,744,073,709,551,609ten

Elaboration: Two's complement gets its name from the rule that the unsigned sum of an n-bit number and its negative is 2^n; hence, the complement or negation of a two's complement number x is 2^n – x.

A third alternative representation to two's complement and sign and magnitude is called one's complement. The negative of a one's complement is found by inverting each bit, from 0 to 1 and from 1 to 0, which helps explain its name since the complement of x is 2^n – x – 1. It was also an attempt to be a better solution than sign and magnitude, and several early scientific computers did use the notation. This representation is similar to two's complement except that it also has two 0s: 00 . . . 00two is positive 0 and 11 . . . 11two is negative 0. The most negative number, 10 . . . 000two, represents –2,147,483,647ten, and so the positives and negatives are balanced. One's complement adders did need an extra step to subtract a number, and hence two's complement dominates today.

one's complement: A notation that represents the most negative value by 10 . . . 000two and the most positive value by 01 . . . 11two, leaving an equal number of negatives and positives but ending up with two zeros, one positive (00 . . . 00two) and one negative (11 . . . 11two). The term is also used to mean the inversion of every bit in a pattern: 0 to 1 and 1 to 0.

A final notation, which we will look at when we discuss floating point in Chapter 3, is to represent the most negative value by 00 . . . 000two and the most positive value by 11 . . . 11two, with 0 typically having the value 10 . . . 00two. This is called a biased notation, since it biases the number such that the number plus the bias has a nonnegative representation.

biased notation: A notation that represents the most negative value by 00 . . . 000two and the most positive value by 11 . . . 11two, with 0 typically having the value 10 . . . 00two, thereby biasing the number such that the number plus the bias has a nonnegative representation.

Elaboration: For signed decimal numbers, we used "–" to represent negative because there are no limits to the size of a decimal number. Given a fixed word size, binary and hexadecimal (see Figure 2.4) bit strings can encode the sign; hence we do not normally use "+" or "–" with binary or hexadecimal notation.

2.5  Representing Instructions in the Computer

We are now ready to explain the difference between the way humans instruct computers and the way computers see instructions. Instructions are kept in the computer as a series of high and low electronic signals and may be represented as numbers. In fact, each piece of an instruction can be considered as an individual number, and placing these numbers side by side forms the instruction.

Translating ARM Assembly Instructions into Machine Instructions

Let's do the next step in the refinement of the ARM language as an example.
EXAMPLE

We'll show the real ARM language version of the instruction represented symbolically as

ADD r5,r1,r2

first as a combination of decimal numbers and then of binary numbers.

ANSWER

The decimal representation is

14   0   0   4   0   1   5   2

Each of these segments of an instruction is called a field. The fourth field (containing 4 in this case) tells the ARM computer that this instruction performs addition. The sixth field gives the number of the register that is the first source operand of the addition operation (1 = r1), and the last field gives the other source operand for the addition (2 = r2). The seventh field contains the number of the register that is to receive the sum (5 = r5). Thus, this instruction adds (field 4 = 4) register r1 (field 6 = 1) to register r2 (field 8 = 2), and places the sum in register r5 (field 7 = 5); we'll reveal the purpose of the remaining four fields later.

This instruction can also be represented as fields of binary numbers as opposed to decimal:

1110     00      0      0100    0      0001    0101    000000000010
4 bits   2 bits  1 bit  4 bits  1 bit  4 bits  4 bits  12 bits

instruction format: A form of representation of an instruction composed of fields of binary numbers.

machine language: Binary representation used for communication within a computer system.

hexadecimal: Numbers in base 16.

This layout of the instruction is called the instruction format. As you can see from counting the number of bits, this ARM instruction takes exactly 32 bits—the same size as a data word. In keeping with our design principle that simplicity favors regularity, all ARM instructions are 32 bits long.

To distinguish it from assembly language, we call the numeric version of instructions machine language and a sequence of such instructions machine code.

It would appear that you would now be reading and writing long, tedious strings of binary numbers. We avoid that tedium by using a higher base than binary that converts easily into binary. Since almost all computer data sizes are multiples of 4, hexadecimal (base 16) numbers are popular. Since base 16 is a power of 2, we can trivially convert by replacing each group of four binary digits by a single hexadecimal digit, and vice versa. Figure 2.4 converts between hexadecimal and binary.

0hex = 0000two    4hex = 0100two    8hex = 1000two    chex = 1100two
1hex = 0001two    5hex = 0101two    9hex = 1001two    dhex = 1101two
2hex = 0010two    6hex = 0110two    ahex = 1010two    ehex = 1110two
3hex = 0011two    7hex = 0111two    bhex = 1011two    fhex = 1111two

FIGURE 2.4 The hexadecimal-binary conversion table. Just replace one hexadecimal digit by the corresponding four binary digits, and vice versa. If the length of the binary number is not a multiple of 4, go from right to left.

Because we frequently deal with different number bases, to avoid confusion we will subscript decimal numbers with ten, binary numbers with two, and hexadecimal numbers with hex. (If there is no subscript, the default is base 10.) By the way, C and Java use the notation 0xnnnn for hexadecimal numbers.
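A small, purely illustrative C fragment can make the 0xnnnn notation and the one-hex-digit-per-four-bits rule concrete; the helper function name print_binary is made up for this sketch.

#include <stdio.h>
#include <stdint.h>

/* Print a 32-bit value in binary, in groups of four bits (one group per hex digit). */
static void print_binary(uint32_t x) {
    for (int i = 31; i >= 0; i--) {
        putchar(((x >> i) & 1) ? '1' : '0');
        if (i % 4 == 0 && i != 0) putchar(' ');
    }
    putchar('\n');
}

int main(void) {
    uint32_t a = 0xeca86420;            /* a hexadecimal constant written in C's 0xnnnn notation */
    printf("%08x in binary is ", a);
    print_binary(a);                    /* 1110 1100 1010 1000 0110 0100 0010 0000 */
    return 0;
}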
Binary to Hexadecimal and Back

EXAMPLE

Convert the following hexadecimal and binary numbers into the other base:

eca8 6420hex
0001 0011 0101 0111 1001 1011 1101 1111two

ANSWER

Using Figure 2.4, the answer is just a table lookup one way:

eca8 6420hex = 1110 1100 1010 1000 0110 0100 0010 0000two

And then the other direction:

0001 0011 0101 0111 1001 1011 1101 1111two = 1357 9bdfhex

ARM Fields

ARM fields are given names to make them easier to discuss:

Cond     F       I      Opcode  S      Rn      Rd      Operand2
4 bits   2 bits  1 bit  4 bits  1 bit  4 bits  4 bits  12 bits

opcode: The field that denotes the operation and format of an instruction.

Here is the meaning of each name of the fields in ARM instructions:

■ Opcode: Basic operation of the instruction, traditionally called the opcode.
■ Rd: The register destination operand. It gets the result of the operation.
■ Rn: The first register source operand.
■ Operand2: The second source operand.
■ I: Immediate. If I is 0, the second source operand is a register. If I is 1, the second source operand is a 12-bit immediate. (Section 2.10 goes into details on ARM immediates, but for now we'll just assume it's a plain constant.)
■ S: Set Condition Code. Described in Section 2.7, this field is related to conditional branch instructions.
■ Cond: Condition. Described in Section 2.7, this field is related to conditional branch instructions.
■ F: Instruction Format. This field allows ARM to use different instruction formats when needed.

Let's look at the add word instruction from page 83 that has a constant operand:

ADD r3,r3,#4    ; r3 = r3 + 4

As you might expect, the constant 4 is placed in the Operand2 field and the I field is set to 1. We make a small change to the decimal version from before:

14   0   1   4   0   3   3   4

The Opcode field is still 4, so this instruction performs addition. The Rn and Rd fields give the number of the register that is the first source operand (3) and the register to receive the sum (3). Thus, this instruction adds 4 to r3 and places the sum in r3.

Now let's try the load word instruction from page 82:

LDR r5,[r3, #32]    ; Temporary reg r5 gets A[8]

Loads and stores use a different instruction format from above, with just 6 fields:

Cond     F       Opcode  Rn      Rd      Offset12
4 bits   2 bits  6 bits  4 bits  4 bits  12 bits

To tell ARM that the format is different, the F field now has 1, meaning that this is a data transfer instruction format. The opcode field has 24, showing that this instruction does load word. The rest of the fields are straightforward: the Rn field has 3 for the base register, the Offset12 field has 32 as the offset to add to the base register, and the Rd field has 5 for the destination register, which receives the result of the load:

14       1       24      3       5       32
4 bits   2 bits  6 bits  4 bits  4 bits  12 bits

Let's call the first option the data processing (DP) instruction format and the second the data transfer (DT) instruction format. Although multiple formats complicate the hardware, we can reduce the complexity by keeping the formats similar. For example, the first two fields and the last three fields of the two formats are the same size and four of them have the same names; the length of the Opcode field in the DT format is equal to the sum of the lengths of three fields of the DP format. Figure 2.5 shows the numbers used in each field for the ARM instructions covered here.
Instruction          Format   Cond   F   I     Op      S     Rn    Rd    Operand2
ADD                  DP       14     0   0     4ten    0     reg   reg   reg
SUB (subtract)       DP       14     0   0     2ten    0     reg   reg   reg
ADD (immediate)      DP       14     0   1     4ten    0     reg   reg   constant
LDR (load word)      DT       14     1   n.a.  24ten   n.a.  reg   reg   address
STR (store word)     DT       14     1   n.a.  25ten   n.a.  reg   reg   address

FIGURE 2.5 ARM instruction encoding. In the table above, "reg" means a register number between 0 and 15, "constant" means a 12-bit constant, and "address" means a 12-bit address. "n.a." (not applicable) means this field does not appear in this format, and Op stands for opcode.

Translating ARM Assembly Language into Machine Language

EXAMPLE

We can now take an example all the way from what the programmer writes to what the computer executes. If r3 has the base of the array A and r2 corresponds to h, the assignment statement

A[30] = h + A[30];

is compiled into

LDR r5,[r3,#120]    ; Temporary reg r5 gets A[30]
ADD r5,r2,r5        ; Temporary reg r5 gets h + A[30]
STR r5,[r3,#120]    ; Stores h + A[30] back into A[30]

What is the ARM machine language code for these three instructions?

ANSWER

For convenience, let's first represent the machine language instructions using decimal numbers. From Figure 2.5, we can determine the three machine language instructions:

       Cond   F   I     Opcode   S     Rn   Rd   Operand2/Offset12
LDR    14     1   n.a.  24       n.a.  3    5    120
ADD    14     0   0     4        0     2    5    5
STR    14     1   n.a.  25       n.a.  3    5    120

The LDR instruction is identified by 24 (see Figure 2.5) in the third field (Opcode). The base register 3 is specified in the fourth field (Rn), and the destination register 5 is specified in the fifth field (Rd). The offset to select A[30] (120 = 30 × 4) is found in the final field (Offset12).

The ADD instruction that follows is specified with 4 in the fourth field (Opcode). The three register operands (2, 5, and 5) are found in the sixth, seventh, and eighth fields.

The STR instruction is identified with 25 in the third field. The rest of this final instruction is identical to the LDR instruction.

Since 120ten = 0000 1111 0000two, the binary equivalent to the decimal form is:

LDR    1110   01   011000   0011   0101   0000 1111 0000
ADD    1110   00   0   0100   0   0010   0101   0000 0000 0101
STR    1110   01   011001   0011   0101   0000 1111 0000

Note the similarity of the binary representations of the first and last instructions. The only difference is in the last bit of the opcode.

Figure 2.6 summarizes the portions of ARM machine language described in this section. As we shall see in Chapter 4, the similarity of the binary representations of related instructions simplifies hardware design. These similarities are another example of regularity in the ARM architecture.

ARM machine language

Name        Format   Field values                              Comments
ADD         DP       14  0  0   4  0  2  1  3                  ADD r1,r2,r3
SUB         DP       14  0  0   2  0  2  1  3                  SUB r1,r2,r3
LDR         DT       14  1  24  2  1  100                      LDR r1,[r2,#100]
STR         DT       14  1  25  2  1  100                      STR r1,[r2,#100]
Field size           4 bits 2 bits 1 bit 4 bits 1 bit 4 bits 4 bits 12 bits   All ARM instructions are 32 bits long
DP format   DP       Cond  F  I  Opcode  S  Rn  Rd  Operand2   Arithmetic instruction format
DT format   DT       Cond  F  Opcode  Rn  Rd  Offset12         Data transfer format

FIGURE 2.6 ARM architecture revealed through Section 2.5. The two ARM instruction formats so far are DP and DT. The last 16 bits have the same sized fields: both contain an Rn field, giving one of the sources; and an Rd field, specifying the destination register.
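To see how the fields combine into a single 32-bit word, the following C sketch packs the DP fields of ADD r5,r1,r2 using the widths and values given above (Cond = 14, F = 0, I = 0, Opcode = 4, S = 0, Rn = 1, Rd = 5, Operand2 = 2). The bit positions follow directly from the field widths in the text; the function name encode_dp is made up for this illustration.

#include <stdio.h>
#include <stdint.h>

/* Pack the ARM Data Processing (DP) fields into one 32-bit word.
   Positions follow the field widths in the text:
   Cond[31:28], F[27:26], I[25], Opcode[24:21], S[20], Rn[19:16], Rd[15:12], Operand2[11:0]. */
static uint32_t encode_dp(uint32_t cond, uint32_t f, uint32_t i, uint32_t opcode,
                          uint32_t s, uint32_t rn, uint32_t rd, uint32_t operand2) {
    return (cond << 28) | (f << 26) | (i << 25) | (opcode << 21) |
           (s << 20) | (rn << 16) | (rd << 12) | (operand2 & 0xFFF);
}

int main(void) {
    /* ADD r5,r1,r2: the same field values as the decimal row 14 0 0 4 0 1 5 2 */
    uint32_t add_r5_r1_r2 = encode_dp(14, 0, 0, 4, 0, 1, 5, 2);
    printf("ADD r5,r1,r2 encodes as 0x%08x\n", (unsigned) add_r5_r1_r2);  /* 0xe0815002 */
    return 0;
}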
The BIG Picture

Today's computers are built on two key principles:

1. Instructions are represented as numbers.
2. Programs are stored in memory to be read or written, just like numbers.

These principles lead to the stored-program concept; its invention let the computing genie out of its bottle. Figure 2.7 shows the power of the concept; specifically, memory can contain the source code for an editor program, the corresponding compiled machine code, the text that the compiled program is using, and even the compiler that generated the machine code.

One consequence of instructions as numbers is that programs are often shipped as files of binary numbers. The commercial implication is that computers can inherit ready-made software provided they are compatible with an existing instruction set. Such "binary compatibility" often leads industry to align around a small number of instruction set architectures.

Check Yourself: What ARM instruction does this represent? Choose from one of the four options below.

Cond   F   I   Opcode   S   Rn   Rd   Operand2
14     0   0   4        0   0    1    2

1. ADD r0, r1, r2
2. ADD r1, r0, r2
3. ADD r2, r1, r0
4. SUB r2, r0, r1

[Figure 2.7 shows a processor connected to a memory that holds an accounting program (machine code), an editor program (machine code), a C compiler (machine code), payroll data, book text, and the source code in C for the editor program.]

FIGURE 2.7 The stored-program concept. Stored programs allow a computer that performs accounting to become, in the blink of an eye, a computer that helps an author write a book. The switch happens simply by loading memory with programs and data and then telling the computer to begin executing at a given location in memory. Treating instructions in the same way as data greatly simplifies both the memory hardware and the software of computer systems. Specifically, the memory technology needed for data can also be used for programs, and programs like compilers, for instance, can translate code written in a notation far more convenient for humans into code that the computer can understand.

"Contrariwise," continued Tweedledee, "if it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic."
Lewis Carroll, Alice's Adventures in Wonderland, 1865

2.6  Logical Operations

Although the first computers operated on full words, it soon became clear that it was useful to operate on fields of bits within a word or even on individual bits. Examining characters within a word, each of which is stored as 8 bits, is one example of such an operation (see Section 2.9). It follows that operations were added to programming languages and instruction set architectures to simplify, among other things, the packing and unpacking of bits into words. These instructions are called logical operations. Figure 2.8 shows logical operations in C, Java, and ARM.

Logical operations    C operators   Java operators   ARM instructions
Bit-by-bit AND        &             &                AND
Bit-by-bit OR         |             |                ORR
Bit-by-bit NOT        ~             ~                MVN
Shift left            <<            <<               LSL
Shift right           >>            >>>              LSR

FIGURE 2.8 C and Java logical operators and their corresponding ARM instructions. ARM implements NOT with the MVN (move not) instruction.

The first class of such operations, and a useful operation for isolating fields, is AND. (We capitalize the word to avoid confusion between the operation and the English conjunction.) AND is a bit-by-bit operation that leaves a 1 in the result only if both bits of the operands are 1.
AND: A logical bit-by-bit operation with two operands that calculates a 1 only if there is a 1 in both operands.

For example, if register r2 contains

0000 0000 0000 0000 0000 1101 1100 0000two

and register r1 contains

0000 0000 0000 0000 0011 1100 0000 0000two

then, after executing the ARM instruction

AND r5,r1,r2    ; reg r5 = reg r1 & reg r2

the value of register r5 would be

0000 0000 0000 0000 0000 1100 0000 0000two

As you can see, AND can apply a bit pattern to a set of bits to force 0s where there is a 0 in the bit pattern. Such a bit pattern in conjunction with AND is traditionally called a mask, since the mask "conceals" some bits.

OR: A logical bit-by-bit operation with two operands that calculates a 1 if there is a 1 in either operand.

To place a value into one of these seas of 0s, there is the dual to AND, called OR. It is a bit-by-bit operation that places a 1 in the result if either operand bit is a 1. To elaborate, if the registers r1 and r2 are unchanged from the preceding example, the result of the ARM instruction

ORR r5,r1,r2    ; reg r5 = reg r1 | reg r2

is this value in register r5:

0000 0000 0000 0000 0011 1101 1100 0000two

NOT: A logical bit-by-bit operation with one operand that inverts the bits; that is, it replaces every 1 with a 0, and every 0 with a 1.

The third logical operation is a contrarian. NOT takes one operand and places a 1 in the result if one operand bit is a 0, and vice versa. If the register r1 is unchanged from the preceding example, the result of the ARM instruction Move Not (MVN)

MVN r5,r1    ; reg r5 = ~reg r1

is this value in register r5:

1111 1111 1111 1111 1100 0011 1111 1111two

Although not a logical instruction, a useful instruction related to MVN that we've not listed so far simply copies one register to another without changing it. The Move instruction (MOV) does just what you'd think:

MOV r6,r5    ; reg r6 = reg r5

Register r6 now has the value of the contents of r5.

Another class of such operations is called shifts. They move all the bits in a word to the left or right, filling the emptied bits with 0s. For example, if register r0 contained

0000 0000 0000 0000 0000 0000 0000 1001two = 9ten

and the instruction to shift left by 4 was executed, the new value would be:

0000 0000 0000 0000 0000 0000 1001 0000two = 144ten

Shift left logical provides a bonus benefit. Shifting left by i bits gives the same result as multiplying by 2^i, just as shifting a decimal number by i digits is equivalent to multiplying by 10^i. For example, the above LSL shifts by 4, which gives the same result as multiplying by 2^4, or 16. The first bit pattern above represents 9, and 9 × 16 = 144, the value of the second bit pattern. The dual of a shift left is a shift right. The actual names of the two ARM shift operations are logical shift left (LSL) and logical shift right (LSR).

Although ARM has shift operations, they are not separate instructions. How can this be? Unlike any other microprocessor instruction set, ARM offers the ability to shift the second operand as part of any data processing instruction! For example, this variation of the add instruction adds register r1 to register r2 shifted left by 2 bits and puts the result in register r5:

ADD r5,r1,r2, LSL #2    ; r5 = r1 + (r2 << 2)

In case you were wondering, ARM hardware is designed so that the adds with shifts are no slower than regular adds.
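All of these logical operations have direct C counterparts, so the register values used above are easy to check. The following sketch is purely illustrative; the constants are simply the bit patterns of r1 and r2 from the examples written as hexadecimal.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t r2 = 0x00000DC0;   /* 0000 0000 0000 0000 0000 1101 1100 0000two */
    uint32_t r1 = 0x00003C00;   /* 0000 0000 0000 0000 0011 1100 0000 0000two */

    printf("AND r5,r1,r2 -> 0x%08x\n", r1 & r2);   /* 0x00000c00: the mask keeps only bits set in both */
    printf("ORR r5,r1,r2 -> 0x%08x\n", r1 | r2);   /* 0x00003dc0: a 1 wherever either operand has a 1  */
    printf("MVN r5,r1    -> 0x%08x\n", ~r1);       /* 0xffffc3ff: every bit of r1 inverted             */

    uint32_t r0 = 9;
    printf("9 << 4 = %u\n", r0 << 4);              /* 144, the same as 9 * 16, i.e., 9 * 2^4           */
    return 0;
}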
03-Ch02-P374750.indd 102 7/3/09 12:18:24 PM 2.6 Logical Operations 103 If you just wanted to shift register r5 right by 4 bits and place the result in r6, you could do that with a move instruction: MOV r6,r5, LSR #4 ; r6 = r5 >> 4 Although usually programmers want to shift by a constant, ARM allows shifting by the value found in a register. The following instruction shifts register r5 right by the amount in register r3 and places the result in r6. MOV r6,r5, LSR r3 ; r6 = r5 >> r3 Figure 2.8.5 shows that these shift operations are encoded in the 12 bit Operand2 field of the Data Processing instruction format. If the shift field is 0, the operation is logical shift left, and if it is 1 then the operation is logical shift right. Here are the machine language versions of the three instructions above with shifts operations: Shift_imm Cond F I opcode S Rn Rs 14 0 14 0 14 0 0 Shift Rm Shift Rm Rd 4 0 2 5 0 13 0 0 6 0 13 0 0 6 0 2 4 3 0 0 0 5 1 0 5 1 1 5 Note that if these new fields are 0, then there operand is not shifted, so the encodings shown in the examples in prior sections work properly. Figure 2.8 above shows the relationship between the C and Java operators and the ARM instructions. 11 8 7 shift_imm Rs 0 6 5 4 3 0 Shift 0 Rm Shift 1 Rm FIGURE 2.8.5 Encoding of shift operations inside Operand2 field of Data Processing instruction format. Elaboration: The full ARM instruction set also includes exclusive or (EOR), which sets the bit to 1 when two corresponding bits differ, and to 0 when they are the same. It also has Bit Clear (BIC), which sets to 0 any bit that is a 1 in the second operand. C allows bit fields or fields to be defined within words, both allowing objects to be packed within a word and to match an externally enforced interface such as an I/O device. All fields must fit within a single word. Fields are unsigned integers that can be as short as 1 bit. C compilers insert and extract fields using logical operations in ARM: AND,ORR, LSL, and LSR. 03-Ch02-P374750.indd 103 7/3/09 12:18:24 PM 104 Chapter 2 Instructions: Language of the Computer Elaboration: ARM has more shift operations than LSL and LSR. Arithmetic shift right (ASR) replicates the sign bit during a shift; we’ll see what they were hoping to do with ASR (but failed) in the fallacy in Section 3.8 of the next chapter. Instead of discarding the bits on a right shift, Rotate Right (ROR) brings them back into the vacated upper bits. As the name suggests, the bit pattern can be thought of as a ring that rotates in a register but are never lost. Check Which operations can isolate a field in a word? Yourself 1. AND 2. A shift left followed by a shift right The utility of an automatic computer lies in the possibility of using a given sequence of instructions repeatedly, the number of times it is iterated being dependent upon the results of the computation. ...This choice can be made to depend upon the sign of a number (zero being reckoned as plus for machine purposes). Consequently, we introduce an [instruction] (the conditional transfer [instruction]) which will, depending on the sign of a given number, cause the proper one of two routines to be executed. 2.7 Instructions for Making Decisions What distinguishes a computer from a simple calculator is its ability to make decisions. Based on the input data and the values created during computation, different instructions execute. Decision making is commonly represented in programming languages using the if statement, sometimes combined with go to statements and labels. 
ARM assembly language includes many decision-making instructions, similar to an if statement with a go to. Let’s start with an instruction that compares two values followed by an instruction that branches to L1 if the registers are equal CMP register1, register2 BEQ L1 This pair of instructions means go to the statement labelled L1 if the value in register1 equals the value in register2. The mnemonic CMP stands for compare and BEQ stands for branch if equal. Another example is the instruction pair CMP register1, register2 BEQ L1 Burks, Goldstine, and von Neumann, 1947 03-Ch02-P374750.indd 104 7/3/09 12:18:24 PM 2.7 i=j 105 Instructions for Making Decisions i≠ j i = = j? Else: f=g+h f=g–h Exit: FIGURE 2.9 Illustration of the options in the if statement above. The left box corresponds to the then part of the if statement, and the right box corresponds to the else part. conditional branch This pair goes to the statement labelled L1 if the value in register1 does not equal the value in register2. The mnemonic BNE stands for branch if not equal. BEQ and BNE are traditionally called conditional branches. An instruction that requires the comparison of two values and that allows for a subsequent transfer of control to a new address in the program based on the outcome of the comparison. Compiling if-then-else into Conditional Branches In the following code segment, f, g, h, i, and j are variables. If the five variables f through j correspond to the five registers r0 through r4, what is the compiled ARM code for this C if statement? EXAMPLE if (i == j) f = g + h; else f = g – h; Figure 2.9 is a flowchart of what the ARM code should do. The first expression compares for equality, so it would seem that we would want the branch if registers are equal instruction (BEQ). In general, the code will be more efficient if we test for the opposite condition to branch over the code that performs the subsequent then part of the if (the label Else is defined below) and so we use the branch if registers are not equal instruction (BNE): CMP r3,r4 BNE, Else ANSWER ; go to Else if i ≠ j The next assignment statement performs a single operation, and if all the operands are allocated to registers, it is just one instruction: ADD r0,r1,r2 03-Ch02-P374750.indd 105 ; f = g + h (skipped if i ≠ j) 7/3/09 12:18:24 PM 106 Chapter 2 Instructions: Language of the Computer We now need to go to the end of the if statement. This example introduces another kind of branch, often called an unconditional branch. This instruction says that the processor always follows the branch (the label Exit is defined below). B Exit ; go to Exit The assignment statement in the else portion of the if statement can again be compiled into a single instruction. We just need to append the label Else to this instruction. We also show the label Exit that is after this instruction, showing the end of the if-then-else compiled code: Else:SUB r0,r1,r2 Exit: ; f = g – h (skipped if i = j) Notice that the assembler relieves the compiler and the assembly language programmer from the tedium of calculating addresses for branches, just as it does for calculating data addresses for loads and stores (see Section 2.12). Hardware/ Software Interface Compilers frequently create branches and labels where they do not appear in the programming language. Avoiding the burden of writing explicit labels and branches is one benefit of writing in high-level programming languages and is a reason coding is faster at that level. 
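To make the correspondence explicit, here is the same if statement written in C with the branch and labels the compiler created (Else and Exit) spelled out using goto. This is only an illustrative sketch; the initial values of the variables are invented so that the fragment runs.

#include <stdio.h>

int main(void)
{
    int f, g = 10, h = 4, i = 1, j = 2;   /* f,g,h,i,j map to r0,r1,r2,r3,r4 */

    if (i != j) goto Else;   /* CMP r3,r4 ; BNE Else (test the opposite condition) */
    f = g + h;               /* ADD r0,r1,r2 */
    goto Exit;               /* B   Exit     */
Else:
    f = g - h;               /* SUB r0,r1,r2 */
Exit:
    printf("f = %d\n", f);
    return 0;
}

With i different from j, control transfers to the Else label and f gets g - h, exactly as in the compiled sequence above.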
Loops Decisions are important both for choosing between two alternatives—found in if statements—and for iterating a computation—found in loops. The same assembly instructions are the building blocks for both cases. Compiling a while Loop in C EXAMPLE Here is a traditional loop in C: while (save[i] == k) i += 1; Assume that i and k correspond to registers r3 and r5 and the base of the array save is in r6. What is the ARM assembly code corresponding to this C segment? 03-Ch02-P374750.indd 106 7/3/09 12:18:24 PM 2.7 The first step is to load save[i] into a temporary register. Before we can load save[i] into a temporary register, we need to have its address. Before we can add i to the base of array save to form the address, we must multiply the index i by 4 due to the byte addressing problem. Fortunately, we can use the logical shift left operation, since shifting left by 2 bits multiplies by 22 or 4 (see page 101 in the prior section). We need to add the label Loop to it so that we can branch back to that instruction at the end of the loop: Loop: ADD r12,r6, r3, LSL # 2 107 Instructions for Making Decisions ANSWER ; r12 = address of save[i] Now we can use that address to load save[i] into a temporary register: LDR r0,[r12,#0] ; Temp reg r0 = save[i] The next instruction pair performs the loop test, exiting if save[i] ≠ k: CMP BNE r0,r5 Exit ; go to Exit if save[i] ≠ k The next instruction adds 1 to i: ADD r3,r3,#1 ; i = i + 1 The end of the loop branches back to the while test at the top of the loop. We just add the Exit label after it, and we’re done: B Loop ; go to Loop Exit: (See the exercises for an optimization of this sequence.) Such sequences of instructions that end in a branch are so fundamental to compiling that they are given their own buzzword: a basic block is a sequence of instructions without branches, except possibly at the end, and without branch targets or branch labels, except possibly at the beginning. One of the first early phases of compilation is breaking the program into basic blocks. Hardware/ Software Interface basic block A sequence The test for equality or inequality is probably the most popular test, but sometimes it is useful to see if a variable is less than another variable. For example, a for loop may want to test to see if the index variable is less than 0. Thus, ARM has a large set of conditional branches. For example, LT, LE, GT, and GE branch if the 03-Ch02-P374750.indd 107 of instructions without branches (except possibly at the end) and without branch targets or branch labels (except possibly at the beginning). 7/3/09 12:18:24 PM 108 Chapter 2 Instructions: Language of the Computer result of the compare is less than, less than or equal, greater than, or greater than or equal, respectively. Conditional branches are actually based on condition flags or condition codes that can be set by the CMP instruction. Branches then test the condition flags. Condition flags form a special purpose register who values can be tested by a conditional branch instruction at any time after the compare instruction, not just by the following instruction. Moreover, the condition flags can be set by many other instructions than compare; in such a case, the result of the operation is compared to 0. The S bit in Figure 2.5 is used to set the condition codes as part of data processing instructions. The programmer specifies setting the condition flags by appending an S to the instruction name. 
Thus, SUBS sets the condition flags with the result of the subtraction, while SUB leaves the condition codes unchanged. Hardware/ Software Interface Comparison instructions must deal with the dichotomy between signed and unsigned numbers. Sometimes a bit pattern with a 1 in the most significant bit represents a negative number and, of course, is less than any positive number, which must have a 0 in the most significant bit. With unsigned integers, on the other hand, a 1 in the most significant bit represents a number that is larger than any that begins with a 0. (We’ll soon take advantage of this dual meaning of the most significant bit to reduce the cost of the array bounds checking.) ARM offers more versions conditional branch to handle these alternatives: Less than unsigned is call LO for lower; less than or equal unsigned is called LS for lower or same; Great than unsigned is called HI for higher; Greater or equal unsigned is called HS for higher or same. Signed versus Unsigned Comparison EXAMPLE Suppose register r0 has the binary number 1111 1111 1111 1111 1111 1111 1111 1111two and that register r1 has the binary number 0000 0000 0000 0000 0000 0000 0000 0001two 03-Ch02-P374750.indd 108 7/3/09 12:18:24 PM 2.7 109 Instructions for Making Decisions and the following instruction is executed. CMP r0, r1 Which conditional branch is taken? BLO BLT L1 ; unsigned branch L2 ; signed branch The value in register r0 represents −1ten if it is an integer and 4,294,967,295ten if it is an unsigned integer. The value in register r1 represents 1ten in either case. The branch on lower unsigned instruction (BLO) is not taken to L1, since 4,294,967,295ten > 1ten. However, the branch on less than instruction (BLT) is taken to L2, since −1ten < 1ten. ANSWER Treating signed numbers as if they were unsigned gives us a low cost way of checking if 0 ≤ x < y, which matches the index out-of-bounds check for arrays. The key is that negative integers in two’s complement notation look like large numbers in unsigned notation; that is, the most significant bit is a sign bit in the former notation but a large part of the number in the latter. Thus, an unsigned comparison of x < y also checks if x is negative as well as if x is less than y. Bounds Check Shortcut Use this shortcut to reduce an index-out-of-bounds check: branch to IndexOutOfBounds if r1 ≥ r2 or if r1 is negative. The checking code just uses BHS to do both checks: EXAMPLE ANSWER CMP r1,r2 BHS IndexOutOfBounds ;if r1>=r2 or r1<0, go to Error Case/Switch Statement Most programming languages have a case or switch statement that allows the programmer to select one of many alternatives depending on a single value. The simplest way to implement switch is via a sequence of conditional tests, turning the switch statement into a chain of if-then-else statements. 03-Ch02-P374750.indd 109 7/3/09 12:18:24 PM 110 Chapter 2 jump address table Also called jump table. A table of addresses of alternative instruction sequences. program counter (PC) The register containing the address of the instruction in the program being executed. Hardware/ Software Interface Instructions: Language of the Computer Sometimes the alternatives may be more efficiently encoded as a table of addresses of alternative instruction sequences, called a jump address table or jump table, and the program needs only to index into the table and then jump to the appropriate sequence. The jump table is then just an array of words containing addresses that correspond to labels in the code. 
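In C terms, a jump address table can be pictured as an array of function pointers: each entry holds the address of the code for one alternative. The sketch below is illustrative (the names case0 through case3, jump_table, and do_switch are invented for the example); it also uses the unsigned-comparison bounds check shortcut shown earlier to validate the index with a single test.

#include <stdio.h>

static void case0(void) { puts("case 0"); }
static void case1(void) { puts("case 1"); }
static void case2(void) { puts("case 2"); }
static void case3(void) { puts("case 3"); }

/* The jump table: an array of addresses corresponding to labels in the code. */
static void (*jump_table[4])(void) = { case0, case1, case2, case3 };

static void do_switch(int k)
{
    /* One unsigned comparison rejects both k < 0 and k >= 4 (the BHS shortcut). */
    if ((unsigned)k < 4)
        jump_table[k]();      /* index into the table, then jump */
}

int main(void)
{
    do_switch(2);             /* prints "case 2"                 */
    do_switch(-1);            /* out of bounds: no call is made  */
    return 0;
}

A compiler typically chooses between a chain of tests and a table like this depending on how many alternatives the switch has and how dense their values are.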
The program needs to jump using the address in the appropriate entry from the jump table. ARM has a surprisingly easy way to handle such situations. Implicit in the stored-program idea is the need to have a register to hold the address of the current instruction being executed. For historical reasons, this register is almost always called the program counter, abbreviated pc in the ARM architecture, although a more sensible name would have been instruction address register. Register 15 is actually the program counter in ARM, so a LDR instruction with the destination register 15 means an unconditional branch to the address specified in memory. In fact, any instruction with register 15 as the destination register is an unconditional branch to the address at that value. In Section 2.8, we’ll see how this trick is useful when returning from a procedure. Although there are many statements for decisions and loops in programming languages like C and Java, the bedrock statement that implements them at the instruction set level is the conditional branch. Encoding Branch Instructions in ARM A problem occurs when an instruction needs longer fields than those shown above in Section 2.6. For example, the longest field in the instruction formats above is just 12 bits, suggesting that branch field might be just 12 bits. That might limit programs to just 212 or 4096 bytes! Hence, we have a conflict between the desire to keep all instructions the same length and the desire to have a single instruction format. This leads us to the final hardware design principle: Design Principle 4: Good design demands good compromises. The compromise chosen by the ARM designers is to keep all instructions the same length, thereby requiring different kinds of instruction formats for different kinds of instructions. We saw the DP and DT formats above, which had many similarities. Branch has a third instruction type: 03-Ch02-P374750.indd 110 Cond 12 address 4 bits 4 bits 24 bits 7/3/09 12:18:24 PM 2.7 Value Meaning Instructions for Making Decisions Value Meaning 0 EQ (EQual) 8 HI (unsigned HIgher) 1 NE (Not Equal) 9 LS (unsigned Lower or Same) 2 HS (unsigned Higher or Same) 10 GE (signed Greater than or Equal) 3 LO (unsigned LOwer) 11 LT (signed Less Than) 4 MI (MInus, <0) 12 GT (signed Greater Than) 5 PL - (PLus, >=0) 13 LE (signed Less Than or Equal) 6 VS (oVerflow Set, overflow) 14 AL (Always) 7 VC (oVerflow Clear, no overflow) 15 NV (reserved) FIGURE 2.9.5 111 Encodings of Options for Cond field. The Cond field encodes the many versions of the conditional branch mentioned above. Figure 2.9.5 shows those encodings. You might think that the 24-bit address would extend the program limit to 224 or 16 MB, which would be fine for many programs but constrain some large ones. An alternative would be to specify a register that would always be added to the branch address, so that a branch instruction would calculate the following: Program counter = Register + Branch address This sum allows the program to be as large as 232, solving the branch address size problem. Then the question is, which register? The answer comes from seeing how conditional branches are used. Conditional branches are found in loops and in if statements, so they tend to branch to a nearby instruction. For example, about half of all conditional branches in SPEC benchmarks go to locations less than 16 instructions away. 
Since the program counter (PC) contains the address of the current instruction, we can branch within ± 224 words of the current instruction if we use the PC as the register to be added to the address. All loops and if statements are much smaller than 224 words, so the PC is the ideal choice. This form of branch addressing is called PC-relative addressing. As we shall see in Chapter 4, it is convenient for the hardware to increment the PC early. Hence, the ARM address is actually relative to the address of the instruction two after the branch (PC + 8) as opposed to the current instruction (PC). Since all ARM instructions are 4 bytes long, ARM stretches the distance of the branch by having PC-relative addressing refer to the number of words to the next instruction instead of the number of bytes. Thus, the 24-bit field can branch four times as far by interpreting the field as a relative word address rather than as a relative byte address. PC-relative addressing An addressing regime in which the address is the sum of the program counter (PC) and a constant in the instruction. Conditional Execution Another unusual feature of ARM is that most instructions can be conditionally executed, not just branches. That is the purpose of the 4-bit Cond field found 03-Ch02-P374750.indd 111 7/3/09 12:18:24 PM 112 Chapter 2 Instructions: Language of the Computer in most ARM instruction formats. The assembly language programmer simply appends the desired condition to the instruction name, telling the computer to perform the operation only if the condition is true based on the last time the condition flags were set. That is, ADDEQ performs the addition only if the condition flags suggest the operands were equal for the operation that set the flags. For example, the ARM code to perform the if statement in Figure 2.9 is reproduced below: CMP r3,r4 BNE Else ADD r0,r1,r2 B Exit Else: SUB r0,r1,r2 Exit: ; ; ; ; go to f = g go to f = g Else if i ≠ j + h (skipped if i ≠ j) Exit – h (skipped if i = j) The desired result can be achieved without any branches using conditional execution: CMP r3,r4 ADDEQ r0,r1,r2 ; f = g + h (skipped if i ≠ j) SUBNE r0,r1,r2 ; f = g – h (skipped if i = j) Figure 2.9.5 shows the encodings for conditional execution of all instructions, not just branches. Note that 14 means always execute. Thus, Figures 2.5 and 2.6 in Section 2.5 show the value 14 in the Cond field of the machine language instructions to indicate the version of instructions that are always executed. Conditional execution provides a technique to execute instructions depending on a test without using conditional branch instructions. Chapter 4 shows that branches can lower the performance of pipelined computers, so removing branches can help performance even more than the reduction in instructions suggests. Check Yourself I. C has many statements for decisions and loops, while ARM has few. Which of the following do or do not explain this imbalance? Why? 1. More decision statements make code easier to read and understand. 2. Fewer decision statements simplify the task of the underlying layer that is responsible for execution. 3. More decision statements mean fewer lines of code, which generally reduces coding time. 4. More decision statements mean fewer lines of code, which generally results in the execution of fewer operations. 03-Ch02-P374750.indd 112 7/3/09 12:18:24 PM 2.8 Supporting Procedures in Computer Hardware 113 II. 
Why does C provide two sets of operators for AND (& and &&) and two sets of operators for OR (| and ||), while ARM doesn’t? 1. Logical operations AND and OR implement & and |, while conditional branches implement && and ||. 2. The previous statement has it backwards: && and || correspond to logical operations, while & and | map to conditional branches. 3. They are redundant and mean the same thing: && and || are simply inherited from the programming language B, the predecessor of C. 2.8 Supporting Procedures in Computer Hardware A procedure or function is one tool programmers use to structure programs, both to make them easier to understand and to allow code to be reused. Procedures allow the programmer to concentrate on just one portion of the task at a time; parameters act as an interface between the procedure and the rest of the program and data, since they can pass values and return results. We describe the equivalent to procedures in Java in Section 2.15 on the CD, but Java needs everything from a computer that C needs. You can think of a procedure like a spy who leaves with a secret plan, acquires resources, performs the task, covers his or her tracks, and then returns to the point of origin with the desired result. Nothing else should be perturbed once the mission is complete. Moreover, a spy operates on only a “need to know” basis, so the spy can’t make assumptions about his employer. Similarly, in the execution of a procedure, the program must follow these six steps: procedure A stored subroutine that performs a specific task based on the parameters with which it is provided. 1. Put parameters in a place where the procedure can access them. 2. Transfer control to the procedure. 3. Acquire the storage resources needed for the procedure. 4. Perform the desired task. 5. Put the result value in a place where the calling program can access it. 6. Return control to the point of origin, since a procedure can be called from several points in a program. 03-Ch02-P374750.indd 113 7/3/09 12:18:24 PM 114 Chapter 2 Instructions: Language of the Computer As mentioned above, registers are the fastest place to hold data in a computer, so we want to use them as much as possible. ARM software follows the following convention for procedure calling in allocating its 16 registers: ■ r0−r3: four argument registers in which to pass parameters ■ lr: one link register containing the return address register to return to the point of origin Branch-and-link instruction An instruction that jumps to an address and simultaneously saves the address of the following instruction in a register (lr or register 14 in ARM). return address A link to the calling site that allows a procedure to return to the proper address; in ARM it is stored in register lr (register 14). caller The program that instigates a procedure and provides the necessary parameter values. callee A procedure that executes a series of stored instructions based on parameters provided by the caller and then returns control to the caller. stack A data structure for spilling registers organized as a last-in-first-out queue. 03-Ch02-P374750.indd 114 In addition to allocating these registers, ARM assembly language includes an instruction just for the procedures: it jumps to an address and simultaneously saves the address of the following instruction in register lr. 
The Branch-and-Link instruction (BL) is simply written BL ProcedureAddress The link portion of the name means that an address or link is formed that points to the calling site to allow the procedure to return to the proper address. This “link,” stored in register lr (register 14), is called the return address. The return address is needed because the same procedure could be called from several parts of the program. To return, ARM just uses the move instruction to copy the link register into the PC, which causes an unconditional branch to the address specified in a register: MOV pc, lr This instruction branches to the address stored in register lr—which is just what we want. Thus, the calling program, or caller, puts the parameter values in r0−r3 and uses BL X to jump to procedure X (sometimes named the callee). The callee then performs the calculations, places the results (if any) into r0 and r1, and returns control to the caller using MOV pc, lr. The BL instruction actually saves PC + 4 in register lr to link to the following instruction to set up the procedure return. Using More Registers Suppose a compiler needs more registers for a procedure than the four argument and two return value registers. Since we must cover our tracks after our mission is complete, any registers needed by the caller must be restored to the values that they contained before the procedure was invoked. This situation is an example in which we need to spill registers to memory, as mentioned in the Hardware/ Software Interface section. The ideal data structure for spilling registers is a stack—a last-in-first-out queue. A stack needs a pointer to the most recently allocated address in the stack to show where the next procedure should place the registers to be spilled or where 7/3/09 12:18:24 PM Supporting Procedures in Computer Hardware 115 old register values are found. The stack pointer is adjusted by one word for each register that is saved or restored. ARM software reserves register 13 for the stack pointer, giving it the obvious name sp. Stacks are so popular that they have their own buzzwords for transferring data to and from the stack: placing data onto the stack is called a push, and removing data from the stack is called a pop. By historical precedent, stacks “grow” from higher addresses to lower addresses. This convention means that you push values onto the stack by subtracting from the stack pointer. Adding to the stack pointer shrinks the stack, thereby popping values off the stack. stack pointer A value denoting the most recently allocated address in a stack that shows where registers should be spilled or where old register values can be found. In ARM, it is register 2.8 sp (register 13). push Add element to stack. pop Remove element from stack. Compiling a C Procedure That Doesn’t Call Another Procedure Let’s turn the example on page 79 from Section 2.2 into a C procedure: EXAMPLE int leaf_example (int g, int h, int i, int j) { int f; f = (g + h) – (i + j); return f; } What is the compiled ARM assembly code? The parameter variables g, h, i, and j correspond to the argument registers r0, r1, r2, and r3, and f corresponds to r4. The compiled program starts with the label of the procedure: ANSWER leaf_example: The next step is to save the registers used by the procedure. The C assignment statement in the procedure body is identical to the example on page 79, which uses two temporary registers. Thus, we need to save three registers: r4, r5, and r6. 
We “push” the old values onto the stack by creating space for three words (12 bytes) on the stack and then store them: SUB STR STR STR 03-Ch02-P374750.indd 115 sp, r6, r5, r4, sp, #12 [sp,#8] [sp,#4] [sp,#0] ; ; ; ; adjust stack to make room save register r6 for use save register r5 for use save register r4 for use for 3 items afterwards afterwards afterwards 7/3/09 12:18:24 PM 116 Chapter 2 Instructions: Language of the Computer High address sp sp Contents of register r6 Contents of register r5 sp Contents of register r4 Low address a. b. c. FIGURE 2.10 The values of the stack pointer and the stack (a) before, (b) during, and (c) after the procedure call. The stack pointer always points to the “top” of the stack, or the last word in the stack in this drawing. Figure 2.10 shows the stack before, during, and after the procedure call. The next three statements correspond to the body of the procedure, which follows the example on page 79: ADD r5,r0,r1 ; register r5 contains g + h ADD r6,r2,r3 ; register r6 contains i + j SUB r4,r5,r6 ; f gets r5 – r6, which is (g + h) – (i + j) To return the value of f, we copy it into a return value register r0: MOV r0,r4 ; returns f (r0 = r4) Before returning, we restore the three old values of the registers we saved by “popping” them from the stack: LDR r4, [sp,#0] LDR r5, [sp,#4] LDR r6, [sp,#8] ADD sp,sp,#12 ; ; ; ; restore register r4 for caller restore register r5 for caller restore register r6 for caller adjust stack to delete 3 items The procedure ends with a jump register using the return address: MOV pc, lr ; jump back to calling routine In the previous example, we used temporary registers and assumed their old values must be saved and restored. To avoid saving and restoring a register whose value is never used, which might happen with a temporary register, ARM software separates 12 of the registers into two groups: 03-Ch02-P374750.indd 116 7/3/09 12:18:24 PM 2.8 Supporting Procedures in Computer Hardware 117 ■ r0−r3, r12: argument or scratch registers that are not preserved by the callee (called procedure) on a procedure call ■ r4−r11: eight variable registers that must be preserved on a procedure call (if used, the callee saves and restores them) This simple convention reduces register spilling. In the example above, if we could rewrite the code to use r12 and reuse one of the r0 to r3, we can drop two stores and two loads from the code. We still must save and restore r4, since the callee must assume that the caller needs its value. Nested Procedures Procedures that do not call others are called leaf procedures. Life would be simple if all procedures were leaf procedures, but they aren’t. Just as a spy might employ other spies as part of a mission, who in turn might use even more spies, so do procedures invoke other procedures. Moreover, recursive procedures even invoke “clones” of themselves. Just as we need to be careful when using registers in procedures, more care must also be taken when invoking nonleaf procedures. For example, suppose that the main program calls procedure A with an argument of 3, by placing the value 3 into register r0 and then using BL A. Then suppose that procedure A calls procedure B via BL B with an argument of 7, also placed in r0. Since A hasn’t finished its task yet, there is a conflict over the use of register r0. Similarly, there is a conflict over the return address in register lr, since it now has the return address for B. 
Unless we take steps to prevent the problem, this conflict will eliminate procedure A’s ability to return to its caller. One solution is to push all the other registers that must be preserved onto the stack, just as we did with the extra registers. The caller pushes any argument registers (r0−r3) that are needed after the call. The callee pushes the return address register lr and any variable registers (r4−r11) used by the callee. The stack pointer sp is adjusted to account for the number of registers placed on the stack. Upon the return, the registers are restored from memory and the stack pointer is readjusted. Compiling a Recursive C Procedure, Showing Nested Procedure Linking Let’s tackle a recursive procedure that calculates factorial: int fact (int n) { if (n < 1) return (1); EXAMPLE else return (n * fact(n – 1)); } What is the ARM assembly code? 03-Ch02-P374750.indd 117 7/3/09 12:18:24 PM 118 Chapter 2 ANSWER Instructions: Language of the Computer The parameter variable n corresponds to the argument register r0. The compiled program starts with the label of the procedure and then saves two registers on the stack, the return address and r0: fact: SUB STR STR sp, sp, #8 lr, [sp,#8] r0, [sp,#0] ; adjust stack for 2 items ; save the return address ; save the argument n The first time fact is called, STR saves an address in the program that called fact. The next two instructions test whether n is less than 1, going to L1 if n ≥ 1. CMP BGE r0,#1 L1 ; compare n to 1 ; if n >= 1, go to L1 If n is less than 1, fact returns 1 by putting 1 into a value register: it moves 1 to r0. It then pops the two saved values off the stack and jumps to the return address: MOV ADD MOV r0,#1 sp,sp,#8 pc,lr ; return 1 ; pop 2 items off stack ; return to the caller Before popping two items off the stack, we could have loaded r0 and lr. Since r0 and lr don’t change when n is less than 1, we skip those instructions. If n is not less than 1, the argument n is decremented and then fact is called again with the decremented value: L1: SUB r0,r0,#1 BL fact ; n >= 1: argument gets (n – 1) ; call fact with (n – 1) The next instruction is where fact returns. First we save the returned value and restore the old return address and old argument, along with the stack pointer: MOV r12,r0 LDR r0, [sp,#0] LDR lr, [sp,#0] ADD sp, sp, #8 items ; ; ; ; save the return value return from BL: restore argument n restore the return address adjust stack pointer to pop 2 Next, the return value register r0 gets the product of old argument r0 and the current value of the value register. We assume a multiply instruction is available, even though it is not covered until Chapter 3: MUL r0,r0,r12 ; return n * fact (n – 1) Finally, fact jumps again to the return address: MOV 03-Ch02-P374750.indd 118 pc,lr ; return to the caller 7/3/09 12:18:24 PM 2.8 Supporting Procedures in Computer Hardware A C variable is generally a location in storage, and its interpretation depends both on its type and storage class. Examples include integers and characters (see Section 2.9). C has two storage classes: automatic and static. Automatic variables are local to a procedure and are discarded when the procedure exits. Static variables exist across exits from and entries to procedures. C variables declared outside all procedures are considered static, as are any variables declared using the keyword static. The rest are automatic. 119 Hardware/ Software Interface Figure 2.11 summarizes what is preserved across a procedure call. 
Note that several schemes preserve the stack, guaranteeing that the caller will get the same data back on a load from the stack as it stored onto the stack. The stack above sp is preserved simply by making sure the callee does not write above sp; sp is itself preserved by the callee adding exactly the same amount that was subtracted from it; and the other registers are preserved by saving them on the stack (if they are used) and restoring them from there. Preserved Not preserved Variable registers: r4–r11 Argument registers: r0–r3 Stack pointer register: sp Intra-procedure-call scatch register: r12 Link register: lr Stack below the stack pointer Stack above the stack pointer FIGURE 2.11 What is and what is not preserved across a procedure call. Allocating Space for New Data on the Stack The final complexity is that the stack is also used to store variables that are local to the procedure but do not fit in registers, such as local arrays or structures. The segment of the stack containing a procedure’s saved registers and local variables is called a procedure frame or activation record. Figure 2.12 shows the state of the stack before, during, and after the procedure call. Allocating Space for New Data on the Heap In addition to automatic variables that are local to procedures, C programmers need space in memory for static variables and for dynamic data structures. Figure 2.13 shows the ARM convention for allocation of memory. The stack starts in the high end of memory and grows down. The first part of the low end of memory is reserved, followed by the home of the ARM machine code, traditionally called the text segment. Above the code is the static data segment, which is the place for constants and other static variables. Although arrays tend to be a fixed length 03-Ch02-P374750.indd 119 procedure frame Also called activation record. The segment of the stack containing a procedure’s saved registers and local variables. text segment The segment of a UNIX object file that contains the machine language code for routines in the source file. 7/3/09 12:18:25 PM 120 Chapter 2 Instructions: Language of the Computer High address sp sp Saved argument registers (if any) Saved return address Saved saved registers (if any) Local arrays and structures (if any) sp Low address a. b. c. FIGURE 2.12 Illustration of the stack allocation (a) before, (b) during, and (c) after the procedure call. The stack pointer (sp) points to the top of the stack. The stack is adjusted to make room for all the saved registers and any memory-resident local variables. If there are no local variables on the stack within a procedure, the compiler will save time by not setting and restoring the stack pointer. sp 7fff fffchex Stack Dynamic data 1000 8000hex 1000 0000hex pc 0040 0000hex 0 Static data Text Reserved FIGURE 2.13 Typical ARM memory allocation for program and data. These addresses are only a software convention, and not part of the ARM architecture. The stack pointer is initialized to 7fff fffchex and grows down toward the data segment. At the other end, the program code (“text”) starts at 0040 0000hex. The static data starts at 1000 0000hex. Dynamic data, allocated by malloc in C and by new in Java, is next. It grows up toward the stack in an area called the heap. and thus are a good match to the static data segment, data structures like linked lists tend to grow and shrink during their lifetimes. The segment for such data structures is traditionally called the heap, and it is placed next in memory. 
Note that this allocation allows the stack and heap to grow toward each other, thereby allowing the efficient use of memory as the two segments wax and wane. C allocates and frees space on the heap with explicit functions. malloc() allocates space on the heap and returns a pointer to it, and free() releases space 03-Ch02-P374750.indd 120 7/3/09 12:18:25 PM 2.8 Supporting Procedures in Computer Hardware 121 on the heap to which the pointer points. Memory allocation is controlled by programs in C, and it is the source of many common and difficult bugs. Forgetting to free space leads to a “memory leak,” which eventually uses up so much memory that the operating system may crash. Freeing space too early leads to “dangling pointers,” which can cause pointers to point to things that the program never intended. Java uses automatic memory allocation and garbage collection just to avoid such bugs. Figure 2.14 summarizes the register conventions for the ARM assembly language. Name Register number Usage Preserved on call? a1-a2 a3-a4 0–1 Argument / return result / scratch register 2–3 Argument / scratch register no no v1-v8 4–11 Variables for local routine yes ip 12 Intra-procedure-call scratch register no sp 13 Stack pointer yes lr 14 Link Register (Return address) yes pc 15 Program Counter n.a. FIGURE 2.14 ARM register conventions. Elaboration: What if there are more than four parameters? The ARM convention is to place the extra parameters on the stack. The procedure then expects the first four parameters to be in registers r0 through r3 and the rest in memory, addressable via the stack pointer. Elaboration: Some recursive procedures can be implemented iteratively without using recursion. Iteration can significantly improve performance by removing the overhead associated with procedure calls. For example, consider a procedure used to accumulate a sum: int sum (int n, int acc) { if (n > 0) return sum(n – 1, acc + n); else return acc; } Consider the procedure call sum(3,0). This will result in recursive calls to sum (2,3), sum(1,5), and sum(0,6), and then the result 6 will be returned four times. This recursive call of sum is referred to as a tail call, and this example use of tail recursion can be implemented very efficiently (assume r0 = n and r1 = acc): sum: CMP r0, #0 BLE sum_exit 03-Ch02-P374750.indd 121 ; test if n <= 0 ; go to sum_exit if n <= 0 7/3/09 12:18:25 PM 122 Chapter 2 Instructions: Language of the Computer ADD r1, r1, r0 SUB r0, r0, #1 B sum sum_exit: MOV r0, r1 MOV pc, lr ; add n to acc ; subtract 1 from n ; go to sum ; return value acc ; return to caller Check Which of the following statements about C and Java are generally true? Yourself 1. C programmers manage data explicitly, while it’s automatic in Java. 2. C leads to more pointer bugs and memory leak bugs than does Java. !(@ | = > (wow open tab at bar is great) Fourth line of the keyboard poem “Hatless Atlas,” 1991 (some give names to ASCII characters: “!” is “wow,” “(” is open, “|” is bar, and so on). ASCII value Character 32 33 2.9 Communicating with People Computers were invented to crunch numbers, but as soon as they became commercially viable they were used to process text. Most computers today offer 8-bit bytes to represent characters, with the American Standard Code for Information Interchange (ASCII) being the representation that nearly everyone follows. Figure 2.15 summarizes ASCII. 
ASCII value Character ASCII value Character ASCII value Character ASCII value Character ASCII value Character space 48 0 64 @ 80 P 096 ` 112 p ! 49 1 65 A 81 Q 097 a 113 q 34 " 50 2 66 B 82 R 098 b 114 r 35 ; 51 3 67 C 83 S 099 c 115 s 36 $ 52 4 68 D 84 T 100 d 116 t 37 % 53 5 69 E 85 U 101 e 117 u 38 & 54 6 70 F 86 V 102 f 118 v 39 ' 55 7 71 G 87 W 103 g 119 w 40 ( 56 8 72 H 88 X 104 h 120 x 41 ) 57 9 73 I 89 Y 105 i 121 y 42 * 58 : 74 J 90 Z 106 j 122 z 43 + 59 ; 75 K 91 [ 107 k 123 { 44 , 60 < 76 L 92 \ 108 l 124 | 45 - 61 = 77 M 93 ] 109 m 125 } 46 . 62 > 78 N 94 ^ 110 n 126 ~ 47 / 63 ? 79 O 95 _ 111 o 127 DEL FIGURE 2.15 ASCII representation of characters. Note that upper- and lowercase letters differ by exactly 32; this observation can lead to shortcuts in checking or changing upper- and lowercase. Values not shown include formatting characters. For example, 8 represents a backspace, 9 represents a tab character, and 13 a carriage return. Another useful value is 0 for null, the value the programming language C uses to mark the end of a string. This information is also found in Column 3 of the ARM Reference Data Card at the front of this book. 03-Ch02-P374750.indd 122 7/3/09 12:18:25 PM 2.9 123 Communicating with People Base 2 is not natural to human beings; we have 10 fingers and so find base 10 natural. Why didn’t computers use decimal? In fact, the first commercial computer did offer decimal arithmetic. The problem was that the computer still used on and off signals, so a decimal digit was simply represented by several binary digits. Decimal proved so inefficient that subsequent computers reverted to all binary, converting to base 10 only for the relatively infrequent input/output events. Hardware/ Software Interface ASCII versus Binary Numbers We could represent numbers as strings of ASCII digits instead of as integers. How much does storage increase if the number 1 billion is represented in ASCII versus a 32-bit integer? One billion is 1,000,000,000, so it would take 10 ASCII digits, each 8 bits long. Thus the storage expansion would be (10 × 8)/32 or 2.5. In addition to the expansion in storage, the hardware to ADD, subtract, multiply, and divide such decimal numbers is difficult. Such difficulties explain why computing professionals are raised to believe that binary is natural and that the occasional decimal computer is bizarre. EXAMPLE ANSWER A series of instructions can extract a byte from a word, so load word and store word are sufficient for transferring bytes as well as words. Because of the popularity of text in some programs, however, ARM provides instructions to move bytes. Load register byte (LDRB) loads a byte from memory, placing it in the rightmost 8 bits of a register. Store register byte (STRB) takes a byte from the rightmost 8 bits of a register and writes it to memory. Thus, we copy a byte with the sequence LDRB r0,[sp,#0] STRB r0,[r10,#0] ; Read byte from source ; Write byte to destination Signed versus unsigned applies to loads as well as to arithmetic. The function of a signed load is to copy the sign repeatedly to fill the rest of the register—called sign extension—but its purpose is to place a correct representation of the number within that register. Unsigned loads simply fill with 0s to the left of the data, since the number represented by the bit pattern is unsigned. 
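The difference between a signed and an unsigned byte load can be seen in C by pulling the same byte through a signed and an unsigned character type. This is only a sketch of the idea; the variable names are invented.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t byte_in_memory = 0xF0;                     /* -16 as a signed byte */

    int32_t sign_extended = (int8_t)byte_in_memory;    /* signed load: copy the sign bit leftward */
    int32_t zero_filled   = (uint8_t)byte_in_memory;   /* unsigned load: fill with 0s             */

    printf("sign-extended: %d (0x%08X)\n", sign_extended, (unsigned)sign_extended);
    printf("zero-filled:   %d (0x%08X)\n", zero_filled, (unsigned)zero_filled);
    return 0;
}

The first line prints -16 (0xFFFFFFF0) and the second prints 240 (0x000000F0), even though both started from the same 8 bits.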
03-Ch02-P374750.indd 123 Hardware/ Software Interface 7/3/09 12:18:25 PM 124 Chapter 2 Instructions: Language of the Computer When loading a 32-bit word into a 32-bit register, the point is moot; signed and unsigned loads are identical. ARM does offer two flavors of byte loads: load register signed byte (LDRSB) treats the byte as a signed number and thus sign-extends to fill the 24 leftmost bits of the register, while load register byte (LDRB) works with unsigned integers. Since C programs almost always use bytes to represent characters rather than consider bytes as very short signed integers, LDRB is used practically exclusively for byte loads. Characters are normally combined into strings, which have a variable number of characters. There are three choices for representing a string: (1) the first position of the string is reserved to give the length of a string, (2) an accompanying variable has the length of the string (as in a structure), or (3) the last position of a string is indicated by a character used to mark the end of a string. C uses the third choice, terminating a string with a byte whose value is 0 (named null in ASCII). Thus, the string “Cal” is represented in C by the following 4 bytes, shown as decimal numbers: 67, 97, 108, 0. (As we shall see, Java uses the first option.) Compiling a String Copy Procedure, Showing How to Use C Strings EXAMPLE The procedure strcpy copies string y to string x using the null byte termination convention of C: void strcpy (char x[], char y[]) { int i; i = 0; while ((x[i] = y[i]) != ‘\0’) /* copy & test byte */ i += 1; } What is the ARM assembly code? ANSWER Below is the basic ARM assembly code segment. Assume that base addresses for arrays x and y are found in r0 and r1, while i is in r4. strcpy adjusts the stack pointer and then saves the saved register r4 on the stack: strcpy: SUB STR 03-Ch02-P374750.indd 124 sp, sp, #4 r4, [sp,#0] ; adjust stack for 1 more item ; save r4 7/3/09 12:18:25 PM 2.9 Communicating with People 125 To initialize i to 0, the next instruction sets r4 to 0 by adding 0 to 0 and placing that sum in r4: MOV r4, #0 ; i = 0 + 0 This is the beginning of the loop. The address of y[i] is first formed by adding i to y[]: L1: ADD r2,r4,r1 ; address of y[i] in r2 Note that we don’t have to multiply i by 4 since y is an array of bytes and not of words, as in prior examples. To load the character in y[i], we use load byte unsigned, which puts the character into r3: LDRBS r3, [r2,#0] ; r3 = y[i] and set condition flags A similar address calculation puts the address of x[i] in r12, and then the character in r3 is stored at that address. ADD STRB r12,r4,r0 ; address of x[i] in r12 r3, [r12,#0] ; x[i] = y[i] Next, we exit the loop if the character was 0. That is, we exit if it is the last character of the string: BEQ L2 ; if y[i] == 0, go to L2 If not, we increment i and loop back: ADD B r4, r4, #1 ; i = i + 1 L1 ; go to L1 If we don’t loop back, it was the last character of the string; we restore r4 and the stack pointer, and then return. L2: LDR ADD MOV r4, [sp,#0] sp, sp, #4 pc, lr ; y[i] == 0: end of string. Restore old r4 ; pop 1 word off stack ; return String copies usually use pointers instead of arrays in C to avoid the operations on i in the code above. Section 2.10 shows more sophisticated addressing options for loads and stores that would reduce the number of instructions in this loop. Also, see Section 2.14 for an explanation of arrays versus pointers. 
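The pointer version alluded to in the last paragraph looks like the following in C. It is a sketch (the name strcpy_ptr simply avoids colliding with the library's strcpy), and Section 2.14 examines the array-versus-pointer trade-off in detail.

#include <stdio.h>

void strcpy_ptr(char x[], const char y[])
{
    char *p = x;
    const char *q = y;
    while ((*p++ = *q++) != '\0')   /* copy a byte and test it; stops after copying the null */
        ;
}

int main(void)
{
    char buf[8];
    strcpy_ptr(buf, "Cal");
    printf("%s\n", buf);            /* "Cal" is stored as the bytes 67, 97, 108, 0 */
    return 0;
}

Because the pointers advance directly through the two strings, there is no index i to increment and no address computation to redo on every iteration.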
Since the procedure strcpy above is a leaf procedure, the compiler could allocate i to a temporary register and avoid saving and restoring it. Hence, instead of thinking of the registers r0 to r3 as being just for arguments, we can think of them as registers that the callee should use whenever convenient. When a compiler finds a leaf procedure, it exhausts such registers before using registers it must save. 03-Ch02-P374750.indd 125 7/3/09 12:18:25 PM 126 Chapter 2 Instructions: Language of the Computer Characters and Strings in Java Unicode is a universal encoding of the alphabets of most human languages. Figure 2.16 is a list of Unicode alphabets; there are almost as many alphabets in Unicode as there are useful symbols in ASCII. To be more inclusive, Java uses Unicode for characters. By default, it uses 16 bits to represent a character. The ARM instruction set has explicit instructions to load and store such 16-bit quantities, called halfwords. Load register half (LDRH) loads a halfword from memory, placing it in the rightmost 16 bits of a register. Like load register byte, load half (LDRH) treats the halfword as a unsigned number and thus zero-extends to fill the 16 leftmost bits of the register, while load register signed halfword (LDRSH) works with signed integers. Thus, LDRH is the more popular of the two. Store half (STRH) takes a halfword from the rightmost 16 bits of a register and writes it to memory. We copy a halfword with the sequence LDRH r0,[sp,#0] STRH r0,[r12,#0] ; Read halfword (16 bits) from source ; Write halfword (16 bits) to destination Strings are a standard Java class with special built-in support and predefined methods for concatenation, comparison, and conversion. Unlike C, Java includes a word that gives the length of the string, similar to Java arrays. Latin Malayalam Tagbanwa General Punctuation Greek Sinhala Khmer Spacing Modifier Letters Cyrillic Thai Mongolian Currency Symbols Armenian Lao Limbu Combining Diacritical Marks Hebrew Tibetan Tai Le Combining Marks for Symbols Arabic Myanmar Kangxi Radicals Superscripts and Subscripts Syriac Georgian Hiragana Number Forms Thaana Hangul Jamo Katakana Mathematical Operators Devanagari Ethiopic Bopomofo Mathematical Alphanumeric Symbols Bengali Cherokee Kanbun Braille Patterns Gurmukhi Unified Canadian Aboriginal Syllabic Shavian Optical Character Recognition Gujarati Ogham Osmanya Byzantine Musical Symbols Oriya Runic Cypriot Syllabary Musical Symbols Tamil Tagalog Tai Xuan Jing Symbols Arrows Telugu Hanunoo Yijing Hexagram Symbols Box Drawing Kannada Buhid Aegean Numbers Geometric Shapes FIGURE 2.16 Example alphabets in Unicode. Unicode version 4.0 has more than 160 “blocks,” which is their name for a collection of symbols. Each block is a multiple of 16. For example, Greek starts at 0370hex, and Cyrillic at 0400hex. The first three columns show 48 blocks that correspond to human languages in roughly Unicode numerical order. The last column has 16 blocks that are multilingual and are not in order. A 16-bit encoding, called UTF-16, is the default. A variable-length encoding, called UTF-8, keeps the ASCII subset as eight bits and uses 16−32 bits for the other characters. UTF-32 uses 32 bits per character. To learn more, see www.unicode.org. 03-Ch02-P374750.indd 126 7/3/09 12:18:25 PM 2.10 127 ARM Addressing for 32-Bit Immediates Elaboration: ARM software tries to keep the stack aligned to word addresses, allowing the program to always use LDR and STR (which must be aligned) to access the stack. 
This convention means that a char variable allocated on the stack occupies 4 bytes, even though it needs less. However, a C string variable or an array of bytes will pack 4 bytes per word, and a Java string variable or array of shorts packs 2 halfwords per word. I. Which of the following statements about characters and strings in C and Java are true? Check Yourself 1. A string in C takes about half the memory as the same string in Java. 2. Strings are just an informal name for single-dimension arrays of characters in C and Java. 3. Strings in C and Java use null (0) to mark the end of a string. 4. Operations on strings, like length, are faster in C than in Java. II. Which type of variable that can contain 1,000,000,000ten takes the most memory space? 1. int in C 2. string in C 3. string in Java 2.10 ARM Addressing for 32-Bit Immediates and More Complex Addressing Modes Although keeping all ARM instructions 32 bits long simplifies the hardware, there are times where it would be convenient to have a 32-bit constant or 32-bit address. This section starts with the how ARM builds ins support for certain 32-bit patterns from just 12 bits, and then shows the optimizations for addresses used in data transfers. 32-Bit Immediate Operands Although constants are frequently short and fit into a narrow field, sometimes they are bigger. The ARM architects believed that some 32-bit constants were more popular than others, and so included a trick to allow ARM to specify some that they thought would be important. The 12-bit Operand2 field in the DP format actually is subdivided into 2 fields: a 8-bit constant field on the right and a 4-bit rotate right field. This latter field rotates the 8-bit constant to the right by twice the value in the rotate field. Using unsigned numbers, this trick can represent any number such that X * 22i 03-Ch02-P374750.indd 127 7/3/09 12:18:25 PM 128 Chapter 2 Instructions: Language of the Computer where X is between 0 and 255 and i is between 0 and 15. It can also represent a few other patterns when the 8-bit rotated constant straddles both the most significant and least significant bits: T * 230 + W, U * 228 + V, and W * 226 + T where T is between 0 and 3, U is 0 to 15, V is 0 to 15, and W is 0 and 63. Since ARM has the MVN instruction, it is also quick to produce the one’s complement of any of the patterns above. Loading a 32-Bit Constant EXAMPLE ANSWER Hardware/ Software Interface What is the ARM machine code to load this 32-bit constant into register r0? 0000 0000 1101 1001 0000 0000 0000 0000 First, we would find the pattern for the 8 non-zerobits (1101 1001) which is 217 in decimal. To move that pattern from the 8 least significant bits to bits 23 to 16, we need to rotate the 8-bit pattern to the right by 16 bits. Since ARM doubles the amount in the rotation field, we need to enter 8 in that field. The machine language MOV instruction (opcode=13) to put this 32-bit value in r4 is: Cond F I Opcode S Rn Rd rotate-imm mm_8 14 0 1 13 0 0 4 8 217 4 bits 2 bits 1 bit 4 bits 1 bit 4 bits 4 bits 4 bits 8 bits The assembler should map large constants into the right instruction format so that the programmer need not worry about it. If it doesn’t match one of these patterns, the assembler should place the 32-bit constant in memory and then load it into the register. 
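A sketch of how an assembler might search for the rotate/constant pair follows. The helper names ror32 and encode_imm are invented for the example; the rule they implement is the one just described, an 8-bit constant rotated right by twice the 4-bit rotate field.

#include <stdint.h>
#include <stdio.h>

static uint32_t ror32(uint32_t x, unsigned n)        /* rotate right by n bits */
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Returns 1 and fills *rot and *imm8 if value fits the rotated-immediate pattern,
   otherwise returns 0 (the assembler would then place the constant in memory
   and load it, as noted above). */
static int encode_imm(uint32_t value, unsigned *rot, unsigned *imm8)
{
    for (unsigned r = 0; r < 16; r++) {
        /* Undo a rotate-right of 2*r bits by rotating right the remaining 32 - 2*r bits. */
        uint32_t candidate = ror32(value, (32 - 2 * r) & 31);
        if (candidate <= 0xFF) {
            *rot = r;
            *imm8 = candidate;
            return 1;
        }
    }
    return 0;
}

int main(void)
{
    unsigned rot, imm8;
    if (encode_imm(0x00D90000, &rot, &imm8))          /* the constant from the example */
        printf("rotate = %u, constant = %u\n", rot, imm8);
    return 0;
}

For the constant in the worked example it reports rotate = 8 and constant = 217, matching the answer above.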
This useful feature can be generalized so that the symbolic representation of the ARM machine language is no longer limited by the hardware, but by whatever the creator of an assembler chooses to include (see Section 2.12). Elaboration: ARMv6T2 added instructions that allow creation of any 32-bit constant in just two instructions. MOVT writes a 16-bit immediate into the top half of the destination register without affecting the bottom half. MOVW does the opposite: Writes the bottom half without affecting the top half. The ARM assembler picks the most efficient instruction sequence depending on the constant. 03-Ch02-P374750.indd 128 7/3/09 12:18:25 PM 2.10 ARM Addressing for 32-Bit Immediates 129 More Complex Addressing in Data Transfer Instructions Multiple forms of addressing are generically called addressing modes. Consistent with the ARM philosophy of trying to reduce the number of instructions to execute a program—as exemplified by the shifting of arithmetic operands, conditional instruction execution, and rotating immediates—are including many addressing modes for data processing instructions. The addressing mode in Section 2.6 is called immediate offset, where a constant address is added to a base register. There are seven more: addressing mode One of several addressing regimes delimited by their varied use of operands and/or addresses. 1. Register Offset. Instead of adding a constant to the base register, another register is added to the base register. This mode can help with an index into an array, where the array index is in one register and the base of the array is in another. Example: LDR r2, [r0,r1]. 2. Scaled Register Offset. Just as the second operand can be optionally shifted left or right in data processing instructions, this addressing mode allows the register to be shifted before it is added to the base register. This mode can be useful to turn an array index into a byte address by shifting it left by 2 bits. We’ll see this mode used in Section 2.14. Example: LDR r2, [r0,r1, LSL #2] 3. Immediate Pre-Indexed. This and the following modes update the base register with the new address as part of the addressing mode. That is, on a load instruction, the content of the destination register changes based on the value fetched from memory and the base register changes to the address that was used to access memory. This mode can be useful when going sequentially through an array. There are two versions—one where the address is added to the base and one where the address is subtracted from the base—to allow the programmer to go through the array forwards or backwards. Note that in this mode the addition or subtraction occurs before the address is sent to memory. Example: LDR r2, [r0, #4]! 4. Immediate Post-Indexed. Just to cover all eventualities, ARM also has a mode just like immediate pre-indexed except the address in the base register is used to access memory first and then the constant is added or subtracted later. Depending on how you set up your program to access memory, either pre-indexed or post-indexed is desired, but not likely both. We’ll see this mode used in Section 2.14. Example: LDR r2, [r0], #4 5. Register Pre-Indexed. The same Immediate Pre-Indexed, except you add or subtract a register instead of a constant. Example: LDR r2, [r0,r1]! 6. Scaled Register Pre-Indexed. The same Register Pre-Indexed, except you shift the register before adding or subtracting it. Example: LDR r2, [r0,r1, LSL #2]! 7. Register Post-Indexed. 
The same as Immediate Post-Indexed, except you add or subtract a register instead of a constant. Example: LDR r2, [r0], r1

Some consider the operands in the data processing instructions to be addressing modes as well. Hence, we can add the following modes to the list:

1. Immediate addressing, where the operand is a constant within the instruction itself. Example: ADD r2, r0, #5
2. Register addressing, where the operand is a register. Example: ADD r2, r0, r1
3. Scaled register addressing, where the register operand is shifted first. Example: ADD r2, r0, r1, LSL #2
4. PC-relative addressing, where the branch address is the sum of the PC and a constant in the instruction. Example: BEQ 1000

Since one of the 16 ARM registers is the program counter, the data transfer addressing modes can also offer PC-relative addressing, just as we saw for branches. One reason for using PC-relative addressing with loads would be to fetch 32-bit constants that could be placed in memory along with the program.

Figure 2.17 shows how operands are identified for each addressing mode. Note that a single operation can use more than one addressing mode. Add, for example, uses both immediate and register addressing. Figure 2.18 shows the ARM instruction formats covered so far. Figure 2.1 on page 78 shows the ARM assembly language revealed in this chapter. The remaining portion of ARM instructions deals mainly with arithmetic and real numbers, which are covered in the next chapter.

FIGURE 2.17 Illustration of the twelve ARM addressing modes (operands shaded in color in the original figure): 1. Immediate: ADD r2, r0, #5; 2. Register: ADD r2, r0, r1; 3. Scaled register: ADD r2, r0, r1, LSL #2; 4. PC-relative: BEQ 1000; 5. Immediate offset: LDR r2, [r0, #8]; 6. Register offset: LDR r2, [r0, r1]; 7. Scaled register offset: LDR r2, [r0, r1, LSL #2]; 8. Immediate offset pre-indexed: LDR r2, [r0, #4]!; 9. Immediate offset post-indexed: LDR r2, [r0], #4; 10. Register offset pre-indexed: LDR r2, [r0, r1]!; 11. Scaled register offset pre-indexed: LDR r2, [r0, r1, LSL #2]!; 12. Register offset post-indexed: LDR r2, [r0], r1. For mode 1 the operand is in the instruction itself; for modes 2 and 3 it is in a register; for the data transfer modes it is in memory, and versions of load and store access bytes, halfwords, or words.
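As a small illustration (ours, not from the text) of how a compiler can exploit these modes, the C loop below sums an array; the comments show one plausible ARM mapping, using the scaled register offset mode for the indexed form and the immediate post-indexed mode for the pointer form. Section 2.14 develops this comparison in detail.

/* Sum an array of ints; the comments sketch how ARM addressing modes
   can absorb the index scaling or the pointer update.                */
int sum_array(const int *a, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += a[i];  /* indexed form: LDR r3, [r0, r2, LSL #2]  (scaled register offset) */
                        /* pointer form: LDR r3, [r0], #4          (immediate post-indexed) */
    return total;
}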
FIGURE 2.18 ARM instruction formats. The DP (data processing) format, used for arithmetic instructions, has Cond, F, I, Opcode, S, Rn, Rd, and Operand2 fields; the DT (data transfer) format has Cond, F, Opcode, Rn, Rd, and Offset12 fields; and the BR format, used for the B and BL instructions, has Cond, F, Opcode, and signed_immed_24 fields. All ARM instructions are 32 bits long.

Hardware/Software Interface  Although ARM has 32-bit addresses, many microprocessors also have 64-bit address extensions in addition to 32-bit addresses. These extensions were in response to the needs of software for larger programs. The process of instruction set extension allows architectures to expand in such a way that software can move compatibly upward to the next generation of the architecture.

2.11 Parallelism and Instructions: Synchronization

Parallel execution is easier when tasks are independent, but often they need to cooperate. Cooperation usually means some tasks are writing new values that others must read. To know when a task is finished writing so that it is safe for another to read, the tasks need to synchronize. If they don't synchronize, there is a danger of a data race, where the results of the program can change depending on how events happen to occur.

data race  Two memory accesses form a data race if they are from different threads to the same location, at least one is a write, and they occur one after another.

For example, recall the analogy of the eight reporters writing a story on page 43 of Chapter 1. Suppose one reporter needs to read all the prior sections before writing a conclusion. Hence, he or she must know when the other reporters have finished their sections, so as not to worry about them being changed afterwards. That is, they had better synchronize the writing and reading of each section so that the conclusion will be consistent with what is printed in the prior sections.

In computing, synchronization mechanisms are typically built with user-level software routines that rely on hardware-supplied synchronization instructions. In this section, we focus on the implementation of lock and unlock synchronization operations. Lock and unlock can be used straightforwardly to create regions where only a single processor can operate, called mutual exclusion, as well as to implement more complex synchronization mechanisms.

The critical ability we require to implement synchronization in a multiprocessor is a set of hardware primitives with the ability to atomically read and modify a memory location. That is, nothing else can interpose itself between the read and the write of the memory location. Without such a capability, the cost of building basic synchronization primitives will be too high and will increase as the processor count increases.

There are a number of alternative formulations of the basic hardware primitives, all of which provide the ability to atomically read and modify a location, together with some way to tell if the read and write were performed atomically. In general, architects do not expect users to employ the basic hardware primitives, but instead expect that the primitives will be used by system programmers to build a synchronization library, a process that is often complex and tricky.
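To preview what such a library routine looks like, here is a minimal spin lock written in portable C11 (our sketch, using atomic_exchange as a stand-in for a hardware swap primitive and glossing over fairness and memory-ordering fine points); the same idea is developed next with ARM's own instruction.

#include <stdatomic.h>

/* 0 means the lock is free, 1 means it is held. */
void lock(atomic_int *l)
{
    /* Atomically write 1 and read the old value; if the old value was 1,
       someone else already holds the lock, so keep retrying.            */
    while (atomic_exchange(l, 1) == 1)
        ;   /* spin */
}

void unlock(atomic_int *l)
{
    atomic_store(l, 0);   /* hand the lock back by storing 0 */
}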
Let’s start with one such hardware primitive and show how it can be used to build a basic synchronization primitive. One typical operation for building synchronization operations is the atomic exchange or atomic swap, which interchanges a value in a register for a value in memory. ARM has such an instruction: SWP. To see how to use this to build a basic synchronization primitive, assume that we want to build a simple lock where the value 0 is used to indicate that the lock is free and 1 is used to indicate that the lock is unavailable. A processor tries to set the lock by doing an exchange of 1, which is in a register, with the memory address corresponding to the lock. The value returned from the exchange instruction is 1 if some other processor had already claimed access and 0 otherwise. In the latter case, the value is also changed to 1, preventing any competing exchange in another processor from also retrieving a 0. For example, consider two processors that each try to do the exchange simultaneously: this race is broken, since exactly one of the processors will perform the exchange first, returning 0, and the second processor will return 1 when it does the exchange. The key to using the exchange primitive to implement synchronization is that the operation is atomic: the exchange is indivisible, and two simultaneous exchanges will be ordered by the hardware. It is impossible for two processors trying to set the synchronization variable in this manner to both think they have simultaneously set the variable. Elaboration: Although it was presented for multiprocessor synchronization, atomic exchange is also useful for the operating system in dealing with multiple processes in a single processor. To make sure nothing interferes in a single processor, the store conditional also fails if the processor does a context switch between the two instructions (see Chapter 5). Check When do you use primitives like SWP? Yourself 1. When cooperating threads of a parallel program need to synchronize to get proper behaviour for reading and writing shared data 2. When cooperating processes on a uniprocessor need to synchronize for reading and writing shared data 03-Ch02-P374750.indd 134 7/3/09 12:18:34 PM 2.12 2.12 Translating and Starting a Program 135 Translating and Starting a Program This section describes the four steps in transforming a C program in a file on disk into a program running on a computer. Figure 2.19 shows the translation hierarchy. Some systems combine these steps to reduce translation time, but these are the logical four phases that programs go through. This section follows this translation hierarchy. Compiler The compiler transforms the C program into an assembly language program, a symbolic form of what the machine understands. High-level language programs take C program Compiler Assembly language program Assembler Object: Machine language module Object: Library routine (machine language) Linker Executable: Machine language program Loader Memory FIGURE 2.19 A translation hierarchy for C. A high-level language program is first compiled into an assembly language program and then assembled into an object module in machine language. The linker combines multiple modules with library routines to resolve all references. The loader then places the machine code into the proper memory locations for execution by the processor. To speed up the translation process, some steps are skipped or combined. 
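As a concrete, if tiny, illustration of why the later steps exist, consider a program split across two C files (an arrangement invented here purely for illustration). Each file can be compiled and assembled on its own; the reference from main.c to square is left unresolved in main's object module and is patched later by the linker.

/* main.c -- one separately compiled module                         */
extern int square(int);        /* external reference: its address is
                                  unknown until the linker resolves it */
int main(void)
{
    return square(5);
}

/* square.c -- another module, compiled and assembled independently */
int square(int x)
{
    return x * x;
}

Changing square.c requires recompiling only that one file; the linker then stitches the two object modules back together, which is exactly the division of labor the following subsections describe.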
Some compilers produce object modules directly, and some systems use linking loaders that perform the last two steps. To identify the type of file, UNIX follows a suffix convention for files: C source files are named x.c, assembly files are x.s, object files are named x.o, statically linked library routines are x.a, dynamically linked library routes are x.so, and executable files by default are called a.out. MS-DOS uses the suffixes .C, .ASM, .OBJ, .LIB, .DLL, and .EXE to the same effect. 03-Ch02-P374750.indd 135 7/3/09 12:18:34 PM 136 assembly language A symbolic language that can be translated into binary machine language. Chapter 2 Instructions: Language of the Computer many fewer lines of code than assembly language, so programmer productivity is much higher. In 1975, many operating systems and assemblers were written in assembly language because memories were small and compilers were inefficient. The 500,000fold increase in memory capacity per single DRAM chip has reduced program size concerns, and optimizing compilers today can produce assembly language programs nearly as good as an assembly language expert, and sometimes even better for large programs. Assembler pseudoinstruction A common variation of assembly language instructions often treated as if it were an instruction in its own right. symbol table A table that matches names of labels to the addresses of the memory words that instructions occupy. 03-Ch02-P374750.indd 136 Since assembly language is an interface to higher-level software, the assembler can also treat common variations of machine language instructions as if they were instructions in their own right. The hardware need not implement these instructions; however, their appearance in assembly language simplifies translation and programming. Such instructions are called pseudoinstructions. For example, the ARM assembler accepts this instruction: LDR r0, #constant and the assembler determines which instructions to use to create the constant in the most efficient way possible. The default is loading a 32-bit constant from memory. Assemblers will also accept numbers in a variety of bases. In addition to binary and decimal, they usually accept a base that is more succinct than binary yet converts easily to a bit pattern. ARM assemblers use hexadecimal. Such features are convenient, but the primary task of an assembler is assembly into machine code. The assembler turns the assembly language program into an object file, which is a combination of machine language instructions, data, and information needed to place instructions properly in memory. To produce the binary version of each instruction in the assembly language program, the assembler must determine the addresses corresponding to all labels. Assemblers keep track of labels used in branches and data transfer instructions in a symbol table. As you might expect, the table contains pairs of symbols and addresses. The object file for UNIX systems typically contains six distinct pieces: ■ The object file header describes the size and position of the other pieces of the object file. ■ The text segment contains the machine language code. ■ The static data segment contains data allocated for the life of the program. (UNIX allows programs to use both static data, which is allocated throughout the program, and dynamic data, which can grow or shrink as needed by the program. See Figure 2.13.) 
■ The relocation information identifies instructions and data words that depend on absolute addresses when the program is loaded into memory. 7/3/09 12:18:34 PM 2.12 Translating and Starting a Program ■ The symbol table contains the remaining labels that are not defined, such as external references. ■ The debugging information contains a concise description of how the modules were compiled so that a debugger can associate machine instructions with C source files and make data structures readable. 137 The next subsection shows how to attach such routines that have already been assembled, such as library routines. Linker What we have presented so far suggests that a single change to one line of one procedure requires compiling and assembling the whole program. Complete retranslation is a terrible waste of computing resources. This repetition is particularly wasteful for standard library routines, because programmers would be compiling and assembling routines that by definition almost never change. An alternative is to compile and assemble each procedure independently, so that a change to one line would require compiling and assembling only one procedure. This alternative requires a new systems program, called a link editor or linker, which takes all the independently assembled machine language programs and “stitches” them together. There are three steps for the linker: 1. Place code and data modules symbolically in memory. 2. Determine the addresses of data and instruction labels. linker Also called link editor. A systems program that combines independently assembled machine language programs and resolves all undefined labels into an executable file. 3. Patch both the internal and external references. The linker uses the relocation information and symbol table in each object module to resolve all undefined labels. Such references occur in branch instructions, jump instructions, and data addresses, so the job of this program is much like that of an editor: it finds the old addresses and replaces them with the new addresses. Editing is the origin of the name “link editor,” or linker for short. The reason a linker is useful is that it is much faster to patch code than it is to recompile and reassemble. If all external references are resolved, the linker next determines the memory locations each module will occupy. Recall that Figure 2.13 on page 120 shows the ARM convention for allocation of program and data to memory. Since the files were assembled in isolation, the assembler could not know where a module’s instructions and data would be placed relative to other modules. When the linker places a module in memory, all absolute references, that is, memory addresses that are not relative to a register, must be relocated to reflect its true location. The linker produces an executable file that can be run on a computer. Typically, this file has the same format as an object file, except that it contains no unresolved references. It is possible to have partially linked files, such as library routines, that still have unresolved addresses and hence result in object files. 03-Ch02-P374750.indd 137 executable file A functional program in the format of an object file that contains no unresolved references. It can contain symbol tables and debugging information. A “stripped executable” does not contain that information. Relocation information may be included for the loader. 7/3/09 12:18:34 PM 138 Chapter 2 Instructions: Language of the Computer Linking Object Files EXAMPLE Link the two object files below. 
Show updated addresses of the first few instructions of the completed executable file. We show the instructions in assembly language just to make the example understandable; in reality, the instructions would be numbers. Note that in the object files we have highlighted the addresses and symbols that must be updated in the link process: the instructions that refer to the addresses of procedures A and B and the instructions that refer to the addresses of data words X and Y. Object file header Name Text size Procedure A 100hex Data size 20hex Address Instruction 0 LDR r0, 0(r3) 4 … BL 0 … Data segment 0 … (X) … Relocation information Address Instruction type 0 LDR X 4 B Label BL Address X – B – Name Text size Procedure B 200hex Text segment Symbol table Dependency Object file header Text segment Data size 30hex Address Instruction 0 STR r1, 0(r3) 4 BL 0 … … Data segment Relocation information Symbol table 03-Ch02-P374750.indd 138 0 … (Y) … Address Instruction type Dependency 0 STR Y 4 A Label BL Address Y – A – 7/3/09 12:18:34 PM 2.12 139 Translating and Starting a Program Procedure A needs to find the address for the variable labelled X to put in the load instruction and to find the address of procedure B to place in the BL instruction. Procedure B needs the address of the variable labelled Y for the store instruction and the address of procedure A for its BL instruction. From Figure 2.13 on page 120, we know that the text segment starts at address 40 0000hex and the data segment at 1000 0000hex. The text of procedure A is placed at the first address and its data at the second. The object file header for procedure A says that its text is 100hex bytes and its data is 20hex bytes, so the starting address for procedure B text is 40 0100hex, and its data starts at 1000 0020hex. ANSWER Executable file header Text size Text segment Data segment 300hex Data size 50hex Address Instruction 0040 0000hex LDR r0, 8000hex(r3) 0040 0004hex BL 00 00EChex … … 0040 0100hex 0040 0104hex STR r1, 8020hex(r3) … … BL FF FDDDhex Address 1000 0000hex … (X) … 1000 0020hex … (Y) … Now the linker updates the address fields of the instructions. It uses the instruction type field to know the format of the address to be edited. We have two types here: 1. The BLs use PC-relative addressing. The BL at address 40 0004hex gets 40 0100hex (the address of procedure B) - (40 0004hex + 8) = 00 00EChex in its address field. The BL at 40 0104hex needs 40 0000hex (the address of procedure A) – (40 0104hex + 8) = -112hex. The two’s complement representation is FF FDDDhex. 2. The load and store addresses are harder because they are relative to a base register. This example uses the global pointer as the base register. Assume that register r4 is initialized to 1000 8000hex. To get the address 1000 0000hex (the address of word X), we place 8000hex in the address field of LDR at address 40 0000hex. Similarly, we place 8020hex in the address field of STR at address 40 0100hex to get the address 1000 0020hex (the address of word Y). 03-Ch02-P374750.indd 139 7/3/09 12:18:34 PM 140 Chapter 2 Instructions: Language of the Computer Elaboration: Recall that ARM instructions are word aligned, so BL drops the right two bits to increase the instruction’s address range. Thus, it uses 24 bits to create a 26-bit byte address. Hence, the actual address in the lower 24 bits of the BL instruction in this example is 00 002Bhex, rather than 00 00EChex. 
Loader loader A systems program that places an object program in main memory so that it is ready to execute. Now that the executable file is on disk, the operating system reads it to memory and starts it. The loader follows these steps in UNIX systems: 1. Reads the executable file header to determine size of the text and data segments. 2. Creates an address space large enough for the text and data. 3. Copies the instructions and data from the executable file into memory. 4. Copies the parameters (if any) to the main program onto the stack. 5. Initializes the machine registers and sets the stack pointer to the first free location. 6. Jumps to a start-up routine that copies the parameters into the argument registers and calls the main routine of the program. When the main routine returns, the start-up routine terminates the program with an exit system call. Dynamically Linked Libraries The first part of this section describes the traditional approach to linking libraries before the program is run. Although this static approach is the fastest way to call library routines, it has a few disadvantages: dynamically linked libraries (DLLs) Library routines that are linked to a program during execution. 03-Ch02-P374750.indd 140 ■ The library routines become part of the executable code. If a new version of the library is released that fixes bugs or supports new hardware devices, the statically linked program keeps using the old version. ■ It loads all routines in the library that are called anywhere in the executable, even if those calls are not executed. The library can be large relative to the program; for example, the standard C library is 2.5 MB. These disadvantages lead to dynamically linked libraries (DLLs), where the library routines are not linked and loaded until the program is run. Both the program and library routines keep extra information on the location of nonlocal procedures and their names. In the initial version of DLLs, the loader ran a dynamic linker, using the extra information in the file to find the appropriate libraries and to update all external references. The downside of the initial version of DLLs was that it still linked all routines of the library that might be called, versus only those that are called during 7/3/09 12:18:34 PM 2.12 Translating and Starting a Program 141 the running of the program. This observation led to the lazy procedure linkage version of DLLs, where each routine is linked only after it is called. Like many innovations in our field, this trick relies on a level of indirection. Figure 2.20 shows the technique. It starts with the nonlocal routines calling a set of dummy routines at the end of the program, with one entry per nonlocal routine. These dummy entries each contain an indirect jump. The first time the library routine is called, the program calls the dummy entry and follows the indirect jump. It points to code that puts a number in a register to identify the desired library routine and then jumps to the dynamic linker/loader. The linker/loader finds the desired routine, remaps it, and changes the address in the indirect jump location to point to that routine. It then jumps to it. When the Text Text BL ... BL ... lw mov pc,Lr ... lw mov pc,Lr ... Data Data Text ... li B ... ID Text Dynamic linker/loader Remap DLL routine B ... Data/Text DLL routine ... mov pc,Lr a. First call to DLL routine Text DLL routine ... mov pc,Lr b. Subsequent calls to DLL routine FIGURE 2.20 Dynamically linked library via lazy procedure linkage. 
(a) Steps for the first time a call is made to the DLL routine. (b) The steps to find the routine, remap it, and link it are skipped on subsequent calls. As we will see in Chapter 5, the operating system may avoid copying the desired routine by remapping it using virtual memory management. 03-Ch02-P374750.indd 141 7/3/09 12:18:34 PM 142 Chapter 2 Instructions: Language of the Computer routine completes, it returns to the original calling site. Thereafter, the call to the library routine jumps indirectly to the routine without the extra hops. In summary, DLLs require extra space for the information needed for dynamic linking, but do not require that whole libraries be copied or linked. They pay a good deal of overhead the first time a routine is called, but only a single indirect jump thereafter. Note that the return from the library pays no extra overhead. Microsoft’s Windows relies extensively on dynamically linked libraries, and it is also the default when executing programs on UNIX systems today. Starting a Java Program Java bytecode Instruction from an instruction set designed to interpret Java programs. The discussion above captures the traditional model of executing a program, where the emphasis is on fast execution time for a program targeted to a specific instruction set architecture, or even a specific implementation of that architecture. Indeed, it is possible to execute Java programs just like C. Java was invented with a different set of goals, however. One was to run safely on any computer, even if it might slow execution time. Figure 2.21 shows the typical translation and execution steps for Java. Rather than compile to the assembly language of a target computer, Java is compiled first to instructions that are easy to interpret: the Java bytecode instruction set (see Section 2.15 on the CD). This instruction set is designed to be close to the Java language so that this compilation step is trivial. Virtually no optimizations are performed. Like the C compiler, the Java compiler checks the types of data and produces the proper operation for each type. Java programs are distributed in the binary version of these bytecodes. Java program Compiler Class files (Java bytecodes) Just In Time compiler Java library routines (machine language) Java Virtual Machine Compiled Java methods (machine language) FIGURE 2.21 A translation hierarchy for Java. A Java program is first compiled into a binary version of Java bytecodes, with all addresses defined by the compiler. The Java program is now ready to run on the interpreter, called the Java Virtual Machine (JVM). The JVM links to desired methods in the Java library while the program is running. To achieve greater performance, the JVM can invoke the JIT compiler, which selectively compiles methods into the native machine language of the machine on which it is running. 03-Ch02-P374750.indd 142 7/3/09 12:18:34 PM A C Sort Example to Put It All Together 143 A software interpreter, called a Java Virtual Machine (JVM), can execute Java bytecodes. An interpreter is a program that simulates an instruction set architecture. For example, the ARM simulator used with this book is an interpreter. There is no need for a separate assembly step since either the translation is so simple that the compiler fills in the addresses or JVM finds them at runtime. The upside of interpretation is portability. The availability of software Java virtual machines meant that most people could write and run Java programs shortly after Java was announced. 
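To see what "a program that simulates an instruction set" means in practice, here is a toy interpreter in C for a made-up three-instruction stack bytecode. The opcodes and layout are invented purely for this sketch and are far simpler than real Java bytecodes, but the fetch, decode, and dispatch loop is the essence of what a software JVM does.

#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };   /* invented opcodes */

/* Fetch each "instruction," decode it, and carry out its effect. */
void interpret(const int *code)
{
    int stack[64], sp = 0, pc = 0;
    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH:  stack[sp++] = code[pc++];         break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_PRINT: printf("%d\n", stack[--sp]);      break;
        case OP_HALT:  return;
        }
    }
}

int main(void)
{
    int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
    interpret(program);   /* prints 5 */
    return 0;
}

A just-in-time compiler goes one step further: instead of re-dispatching on every opcode, it translates frequently executed sequences into native instructions once and then runs the translation directly, as described next.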
Today, Java virtual machines are found in hundreds of millions of devices, in everything from cell phones to Internet browsers.

The downside of interpretation is lower performance. The incredible advances in performance of the 1980s and 1990s made interpretation viable for many important applications, but the factor of 10 slowdown when compared to traditionally compiled C programs made Java unattractive for some applications.

To preserve portability and improve execution speed, the next phase of Java development was compilers that translated while the program was running. Such Just In Time compilers (JIT) typically profile the running program to find where the "hot" methods are and then compile them into the native instruction set on which the virtual machine is running. The compiled portion is saved for the next time the program is run, so that it can run faster each time it is run. This balance of interpretation and compilation evolves over time, so that frequently run Java programs suffer little of the overhead of interpretation.

As computers get faster so that compilers can do more, and as researchers invent better ways to compile Java on the fly, the performance gap between Java and C or C++ is closing. Section 2.15 on the CD goes into much greater depth on the implementation of Java, Java bytecodes, the JVM, and JIT compilers.

Java Virtual Machine (JVM)  The program that interprets Java bytecodes.

Just In Time compiler (JIT)  The name commonly given to a compiler that operates at runtime, translating the interpreted code segments into the native code of the computer.

Check Yourself  Which of the advantages of an interpreter over a translator do you think was most important for the designers of Java?

1. Ease of writing an interpreter
2. Better error messages
3. Smaller object code
4. Machine independence

2.13 A C Sort Example to Put It All Together

One danger of showing assembly language code in snippets is that you will have no idea what a full assembly language program looks like. In this section, we derive the ARM code from two procedures written in C: one to swap array elements and one to sort them.

void swap(int v[], int k)
{
  int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}

FIGURE 2.22 A C procedure that swaps two locations in memory. This subsection uses this procedure in a sorting example.

The Procedure swap

Let's start with the code for the procedure swap in Figure 2.22. This procedure simply swaps two locations in memory. When translating from C to assembly language by hand, we follow these general steps:

1. Allocate registers to program variables.
2. Produce code for the body of the procedure.
3. Preserve registers across the procedure invocation.

This section describes the swap procedure in these three pieces, concluding by putting all the pieces together.

Register Allocation for swap

As mentioned on pages 113–114, the ARM convention on parameter passing is to use registers r0, r1, r2, and r3. Since swap has just two parameters, v and k, they will be found in registers r0 and r1. The only other variable is temp, which we associate with register r2 since swap is a leaf procedure (see pages 116–117). This register allocation corresponds to the variable declarations in the first part of the swap procedure in Figure 2.22. We'll also need two registers to hold temporary calculations.
We'll use r3 and r12, since the caller does not expect them to be preserved across a procedure call. To make the assembly language easier to read, we'll use the assembler directive that lets us associate names with registers:

v       RN 0    ; 1st argument, address of v
k       RN 1    ; 2nd argument, index k
temp    RN 2    ; local variable
temp2   RN 3    ; temporary for v[k+1]
vkAddr  RN 12   ; to hold address of v[k]

Code for the Body of the Procedure swap

The remaining lines of C code in swap are

temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;

Recall that the memory address for ARM refers to the byte address, and so words are really 4 bytes apart. Hence we need to multiply the index k by 4 before adding it to the address. Forgetting that sequential word addresses differ by 4 instead of by 1 is a common mistake in assembly language programming. Hence the first step is to get the address of v[k] by multiplying k by 4 via a shift left by 2:

ADD vkAddr, v, k, LSL #2   ; reg vkAddr = v + (k * 4)
                           ; reg vkAddr has the address of v[k]

Now we load v[k] using vkAddr, and then v[k+1] by adding 4 to vkAddr:

LDR temp,  [vkAddr, #0]    ; temp = v[k]
LDR temp2, [vkAddr, #4]    ; temp2 = v[k + 1]
                           ; refers to next element of v

Next we store temp and temp2 to the swapped addresses:

STR temp2, [vkAddr, #0]    ; v[k] = temp2
STR temp,  [vkAddr, #4]    ; v[k+1] = temp

Now we have allocated registers and written the code to perform the operations of the procedure. What is missing is the code for preserving the saved registers used within swap. Since we are not using saved registers in this leaf procedure, there is nothing to preserve.

The Full swap Procedure

We are now ready for the whole routine, which includes the procedure label and the return jump. To make it easier to follow, we identify in Figure 2.23 each block of code with its purpose in the procedure.

Procedure body:
swap: ADD vkAddr, v, k, LSL #2   ; reg vkAddr = v + (k * 4)
                                 ; reg vkAddr has the address of v[k]
      LDR temp,  [vkAddr, #0]    ; temp = v[k]
      LDR temp2, [vkAddr, #4]    ; temp2 = v[k + 1]
                                 ; refers to next element of v
      STR temp2, [vkAddr, #0]    ; v[k] = temp2
      STR temp,  [vkAddr, #4]    ; v[k+1] = temp
Procedure return:
      MOV pc, lr                 ; return to calling routine

FIGURE 2.23 ARM assembly code of the procedure swap in Figure 2.22.

The Procedure sort

To ensure that you appreciate the rigor of programming in assembly language, we'll try a second, longer example. In this case, we'll build a routine that calls the swap procedure. This program sorts an array of integers, using bubble or exchange sort, which is one of the simplest if not the fastest sorts. Figure 2.24 shows the C version of the program. Once again, we present this procedure in several steps, concluding with the full procedure.

void sort (int v[], int n)
{
  int i, j;
  for (i = 0; i < n; i += 1) {
    for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {
      swap(v,j);
    }
  }
}

FIGURE 2.24 A C procedure that performs a sort on the array v.

Register Allocation for sort

The two parameters of the procedure sort, v and n, are in the parameter registers r0 and r1, and we assign register r2 to local variable i and register r3 to j. We'll also need five more registers to hold values, which we'll assign to r12 and r4 to r7.
To enhance code legibility, we’ll tell the assembler to rename the registers: v n i j vjAddr vj vj1 vcopy ncopy RN RN RN RN RN RN RN RN Rn 0 1 2 3 12 4 5 6 7 ; ; ; ; ; ; ; ; ; 1st argument address of v 2nd argument index n local variable i local variable j to hold address of v[j] to hold a copy of v[j] to hold a copy of v[j+1] to hold a copy of v to hold a copy of n Code for the Body of the Procedure sort The procedure body consists of two nested for loops and a call to swap that includes parameters. Let’s unwrap the code from the outside to the middle. The first translation step is the first for loop: for (i = 0; i < n; i += 1) { Recall that the C for statement has three parts: initialization, loop test, and iteration increment. It takes just one instruction to initialize i to 0, the first part of the for statement: MOV 03-Ch02-P374750.indd 146 i, #0 ; i = 0 7/3/09 12:18:35 PM 2.13 A C Sort Example to Put It All Together 147 (Remember that move is a pseudoinstruction provided by the assembler for the convenience of the assembly language programmer; see page 136.) It also takes just one instruction to increment i, the last part of the for statement: ADD i, i, #1 ; i += 1 The loop should be exited if i < n is not true or, said another way, should be exited if i ≥ n. This test takes two instructions: for1tst:CMP i, n BGE exit1 ; if i ≥ n ; go to exit1 if i ≥ n The bottom of the loop just jumps back to the loop test: B for1tst ; branch to test of outer loop exit1: The skeleton code of the first for loop is then MOV i, #0 for1tst:CMP i, n BGE exit1 ... (body of first ... ADD i, i, #1 B for1tst exit1: ; i = 0 ; if i ≥ n ; go to exit1 if i ≥ n for loop) ; i += 1 ; branch to test of outer loop Voila! (The exercises explore writing faster code for similar loops.) The second for loop looks like this in C: for (j = i – 1; j >= 0 && v[j] > v[j + 1]; j –= 1) { The initialization portion of this loop is again one instruction: SUB j, i, #1 ; j = i – 1 The decrement of j at the end of the loop is also one instruction: SUB j, j, #1 ; j –= 1 The loop test has two parts. We exit the loop if either condition fails, so the first test must exit the loop if it fails (j < 0): for2tst: CMP j, #0 BLT exit2 ; if j < 0 ; go to exit2 if j < 0 This branch will skip over the second condition test. If it doesn’t skip, j ≥ 0. 03-Ch02-P374750.indd 147 7/3/09 12:18:35 PM 148 Chapter 2 Instructions: Language of the Computer The second test exits if v[j] > v[j + 1] is not true, or exits if v[j] ≤ v[j + 1]. First we create the address by multiplying j by 4 (since we need a byte address) and add it to the base address of v: ADD vjAddr, v, j, LSL #2 ; reg vjAddr = v + (j * 4) Now we load v[j]: LDR vj, [vjAddr,s#0] ; reg vj = v[j] Since we know that the second element is just the following word, we add 4 to the address in register vjAddr to get v[j + 1]: LDR vj1, [vjAddr,#4] ; reg vj1 = v[j + 1] The test of v[j] ≤ v[j + 1] is next, so the two instructions of the exit test are CMP BLE vj, vj1 exit2 ; if vj ≤ vj1 ; go to exit2 if vj ≤ vj1 The bottom of the loop jumps back to the inner loop test: B for2tst ; branch to test of inner loop Combining the pieces, the skeleton of the second for loop looks like this: SUB for2tst:CMP BLT ADD LDR LDR CMP BLE j, i,#1 ; j = i – 1 j, #0 ; if j < 0 exit2 ; go to exit2 if j < 0 vjAddr, v, j, LSL #2 ; reg vjAddr = v + (j * 4) vj, [vjAddr,#0] ; reg vj = v[j] vj1, [vjAddr,#4] ; reg vj1 = v[j + 1] vj, vj1 ; if vj ≤ vj1 exit2 ; go to exit2 if vj ≤ vj1 ... (body of second for loop) ... 
SUB j, j, #1 ; j –= 1 B for2tst ; branch to test of inner loop exit2: The Procedure Call in sort The next step is the body of the second for loop: swap(v,j); Calling swap is easy enough: BL 03-Ch02-P374750.indd 148 swap 7/3/09 12:18:35 PM 2.13 A C Sort Example to Put It All Together 149 Passing Parameters in sort The problem comes when we want to pass parameters because the sort procedure needs the values in registers v and n, yet the swap procedure needs to have its parameters placed in those same registers. One solution is to copy the parameters for sort into other registers earlier in the procedure, making registers v and n available for the call of swap. We first copy v and n into vcopy and ncopy during the procedure: MOV MOV vcopy, v ncopy, n ; copy parameter v into vcopy (save r0) ; copy parameter n into ncopy (save r1) Then we pass the parameters to swap with these two instructions: MOV MOV r0, vcopy ; first swap parameter is v r1, j ; second swap parameter is j Preserving Registers in sort The only remaining code is the saving and restoring of registers. Clearly, we must save the return address in register lr, since sort is a procedure and is called itself. The sort procedure also uses the saved registers i, j, vcopy, and ncopy, so they must be saved. The prologue of the sort procedure is then SUB STR STR STR STR STR sp,sp,#20 lr, [sp, #16] ncopy, [sp, #12] vcopy, [sp, #8] j, [sp, #4] i, [sp, #0] ; ; ; ; ; ; make room on stack for 5 registers save lr on stack save ncopy on stack save vcopy on stack save j on stack save i on stack The tail of the procedure simply reverses all these instructions, then adds a MOV pc,lr to return. The Full Procedure sort Now we put all the pieces together in Figure 2.25. Once again, to make the code easier to follow, we identify each block of code with its purpose in the procedure. In this example, nine lines of the sort procedure in C became 32 lines in the ARM assembly language. Elaboration: One optimization that works with this example is procedure inlining. Instead of passing arguments in parameters and invoking the code with a BL instruction, the compiler would copy the code from the body of the swap procedure where the call to swap appears in the code. Inlining would avoid four instructions in this example. The downside of the inlining optimization is that the compiled code would be bigger if the inlined procedure is called from several locations. Such a code expansion might turn into lower performance if it increased the cache miss rate; see Chapter 5. 
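Expressed back in C, the transformation the Elaboration describes looks like the following sketch (the name sort_inlined is ours): the body of swap from Figure 2.22 is copied in place of the call, so the two parameter moves, the BL, and the return disappear from the inner loop.

void sort_inlined(int v[], int n)
{
  int i, j;
  for (i = 0; i < n; i += 1) {
    for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {
      int temp = v[j];      /* was: swap(v, j); */
      v[j]     = v[j + 1];
      v[j + 1] = temp;
    }
  }
}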
03-Ch02-P374750.indd 149 7/3/09 12:18:35 PM 150 Chapter 2 Instructions: Language of the Computer Saving registers sort: SUB STR STR STR STR STR sp,sp,#20 lr, [sp, #16] ncopy, [sp, #12] vcopy, [sp, #8] j, [sp, #4] i, [sp, #0] ; ; ; ; ; ; make save save save save save room on stack for 5 registers lr on stack ncopy on stack vcopy on stack j on stack i on stack Procedure body Move parameters Outer loop vcopy, v ncopy, n ; copy parameter v into vcopy (save r0) ; copy parameter n into ncopy (save r1) ; i = 0 MOV i, #0 for1tst: CMP i, n BGE exit1 SUB for2tst: CMP BLT ADD LDR LDR Inner loop Pass parameters and call Inner loop Outer loop MOV MOV exit2: ; if i ≥ n ; go to exit1 if j, i, #1 ; j = i – 1 j, #0 ; if j < 0 exit2 ; go to exit2 if j < 0 vjAddr, v, j, LSL #2 vj, [vjAddr,#0] vj1, [vjAddr,#4] i ≥ n ; reg vjAddr = v + (j * 4) ; reg vj = v[j] ; reg vj1 = v[j + 1] CMP vj, vj1 ; if vj ≤ vj1 BLE exit2 ; go to exit2 if vj ≤ vj1 MOV MOV BL r0, vcopy r1, j swap ; first swap parameter is v ; second swap parameter is j ; swap code shown in Figure 2.23 SUB B j, j, #1 for2tst ; j –= 1 ; branch to test of inner loop ADD B i, i, #1 for1tst ; i += 1 ; branch to test of outer loop Restoring registers exit1: LDR LDR LDR LDR LDR ADD i, [sp, #0] j, [sp, #4] vcopy, [sp, #8] ncopy, [sp, #12] lr, [sp, #16] sp,sp,#20 ; ; ; ; ; ; restore restore restore restore restore restore i from stack j from stack vcopy from stack ncopy from stack lr from stack stack pointer Procedure return MOV FIGURE 2.25 pc, lr ; return to calling routine ARM assembly version of procedure sort in Figure 2.24. Elaboration: ARM includes instructions to save and restore registers at procedure call boundaries. Store Multiple (STM) and Load Multiple (LDM) can store and load up to 16 registers, which are specified by a bit mask in the instruction. Thus, the 5 stores and 03-Ch02-P374750.indd 150 7/3/09 12:18:35 PM 2.13 A C Sort Example to Put It All Together 151 the SUB could be replaced by the instruction STM sp!, {i, j, ncopy, vcopy, lr} since the stack pointer would be updated by the pre-index addressing mode. Figure 2.26 shows the impact of compiler optimization on sort program performance, compile time, clock cycles, instruction count, and CPI. Note that unoptimized code has the best CPI, and O1 optimization has the lowest instruction count, but O3 is the fastest, reminding us that time is the only accurate measure of program performance. Figure 2.27 compares the impact of programming languages, compilation versus interpretation, and algorithms on performance of sorts. The fourth column shows that the unoptimized C program is 8.3 times faster than the interpreted Java code for Bubble Sort. Using the JIT compiler makes Java 2.1 times faster than the unoptimized C and within a factor of 1.13 of the highest optimized C code. Section 2.15 on the CD gives more details on interpretation versus compi( lation of Java and the Java and ARM code for Bubble Sort.) The ratios aren’t as close for Quicksort in Column 5, presumably because it is harder to amortize the cost of runtime compilation over the shorter execution time. The last column demonstrates the impact of a better algorithm, offering three orders of magnitude a performance increases by when sorting 100,000 items. Even comparing interpreted Java in Column 5 to the C compiler at highest optimization in Column 4, Quicksort beats Bubble Sort by a factor of 50 (0.05 × 2468, or 123 times faster than the unoptimized C code versus 2.41 times faster). 
Understanding Program Performance Elaboration: The ARM compilers always save room on the stack for the arguments in case they need to be stored, so in reality they always decrement sp by 16 to make room for all four argument registers (16 bytes). One reason is that C provides a vararg option that allows a pointer to pick, say, the third argument to a procedure. When the compiler encounters the rare vararg, it copies the four argument registers onto the stack into the four reserved locations. gcc optimization Relative performance Clock cycles (millions) Instruction count (millions) CPI None 1.00 158,615 114,938 1.38 O1 (medium) 2.37 66,990 37,470 1.79 O2 (full) 2.38 66,521 39,993 1.66 O3 (procedure integration) 2.41 65,747 44,993 1.46 FIGURE 2.26 Comparing performance, instruction count, and CPI using compiler optimization for Bubble Sort. The programs sorted 100,000 words with the array initialized to random values. These programs were run on a Pentium 4 with a clock rate of 3.06 GHz and a 533 MHz system bus with 2 GB of PC2100 DDR SDRAM. It used Linux version 2.4.20. 03-Ch02-P374750.indd 151 7/3/09 12:18:35 PM 152 Chapter 2 Instructions: Language of the Computer Language Execution method Optimization Bubble Sort relative performance Quicksort relative performance Speedup Quicksort vs. Bubble Sort C Compiler None 1.00 1.00 2468 Compiler O1 2.37 1.50 1562 Compiler O2 2.38 1.50 1555 Compiler O3 2.41 1.91 1955 Interpreter – 0.12 0.05 1050 JIT compiler – 2.13 0.29 338 Java FIGURE 2.27 Performance of two sort algorithms in C and Java using interpretation and optimizing compilers relative to unoptimized C version. The last column shows the advantage in performance of Quicksort over Bubble Sort for each language and execution option. These programs were run on the same system as Figure 2.26. The JVM is Sun version 1.3.1, and the JIT is Sun Hotspot version 1.3.1. 2.14 Arrays versus Pointers A challenge for any new C programmer is understanding pointers. Comparing assembly code that uses arrays and array indices to the assembly code that uses pointers offers insights about pointers. This section shows C and ARM assembly versions of two procedures to clear a sequence of words in memory: one using array indices and one using pointers. Figure 2.28 shows the two C procedures. The purpose of this section is to show how pointers map into ARM instructions, and not to endorse a dated programming style. We’ll see the impact of modern compiler optimization on these two procedures at the end of the section. Array Version of Clear Let’s start with the array version, clear1, focusing on the body of the loop and ignoring the procedure linkage code. We assume that the two parameters array and size are found in the registers r0 and r1, and that i is allocated to register r2. We start by renaming registers. array n i zero RN RN RN RN 0 1 2 3 ; ; ; ; 1st argument address of array 2nd argument size (of array) local variable i temporary to hold constant 0 The initialization of i, the first part of the for loop, is straightforward: MOV i,0 ; i = 0 We also need to put 0 into the temporary register so that we can write it to memory: MOV 03-Ch02-P374750.indd 152 zero,0 ; zero = 0 7/3/09 12:18:35 PM 2.14 Arrays versus Pointers 153 clear1(int array[], int size) { int i; for (i = 0; i < size; i += 1) array[i] = 0; } clear2(int *array, int size) { int *p; for (p = &array[0]; p < array[size]; p = p + 1) *p = 0; } FIGURE 2.28 Two C procedures for setting an array to all zeros. Clear1 uses indices, while clear2 uses pointers. 
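For readers who want a quick refresher before the explanation that follows, here is a tiny standalone example (ours) of the & and * operators and of pointer arithmetic; on a machine with 4-byte integers such as ARM, adding 1 to an int pointer advances the address by 4.

#include <stdio.h>

int main(void)
{
    int a[3] = { 10, 20, 30 };
    int *p = &a[0];        /* p holds the address of a[0]            */
    printf("%d\n", *p);    /* prints 10, the object p points to      */
    p = p + 1;             /* advance p by one int, i.e., by 4 bytes */
    printf("%d\n", *p);    /* prints 20                              */
    return 0;
}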
The second procedure needs some explanation for those unfamiliar with C. The address of a variable is indicated by &, and the object pointed to by a pointer is indicated by *. The declarations declare that array and p are pointers to integers. The first part of the for loop in clear2 assigns the address of the first element of array to the pointer p. The second part of the for loop tests to see if the pointer is pointing beyond the last element of array. Incrementing a pointer by one, in the last part of the for loop, means moving the pointer to the next sequential object of its declared size. Since p is a pointer to integers, the compiler will generate ARM instructions to increment p by four, the number of bytes in a ARM integer. The assignment in the loop places 0 in the object pointed to by p. To set array[i] to 0 we must first get its address. We can use scaled register offset addressing mode to multiply i by 4 to get the byte address and then add it to the index to get the address of array[i]: loop1:STR zero, [array,i, LSL #2] ; array[i] = 0 This instruction is the end of the body of the loop, so the next step is to increment i: ADD i,i,#1 ; i = i + 1 The loop test checks if i is less than size: CMP BLT i,size loop1 ; i < size ; if (i < size) go to loop1 We have now seen all the pieces of the procedure. Here is the ARM code for clearing an array using indices: MOV MOV loop1:STR ADD CMP BLT 03-Ch02-P374750.indd 153 i,0 zero,0 zero, [array,i, i,i,#1 i,size loop1 ; i = 0 ; zero = 0 LSL #2] ; array[i] = 0 ; i = i + 1 ; i < size ; if (i < size) go to loop1 7/3/09 12:18:35 PM 154 Chapter 2 Instructions: Language of the Computer (This code works as long as size is greater than 0; ANSI C requires a test of size before the loop, but we’ll skip that legality here.) Pointer Version of Clear The second procedure that uses pointers allocates the two parameters array and size to the registers r0 and r1 and allocates p to register r2, renaming the registers to do this: array n p zero arraySize RN RN RN RN RN 0 1 2 3 12 ; ; ; ; ; 1st argument address of array 2nd argument size (of array) local variable i temporary to hold constant 0 address of array[size] The code for the second procedure starts with assigning the pointer p to the address of the first element of the array and setting zero: MOV MOV p,array zero,#0 ; p = address of array[0] ; zero = 0 The next code is the body of the for loop, which simply stores 0 into memory pointed to by p. We’ll use the immediate post-indexed addressing mode to increment p. Incrementing a pointer by 1 means moving the pointer to the next sequential object in C. Since p is a pointer to integers, each of which uses 4 bytes, the compiler increments p by 4. loop2: STR zero,[p],#4 ; Memory[p] = 0; p = p + 4 The loop test is next. The first step is calculating the address of the last element of array. 
Start with multiplying size by 4 to get its byte address and then we add the product to the starting address of the array to get the address of the first word after the array: ADD arraySize,array,size,LSL #2 ; arraySize = address ; of array[size] The loop test is simply to see if p is less than the last element of array: CMP BLT p,arraySize loop2 ; p < &array[size] ; if (p<&array[size]) go to loop2 With all the pieces completed, we can show a pointer version of the code to zero an array: MOV MOV 03-Ch02-P374750.indd 154 p,array zero,#0 ; p = address of array[0] ; zero = 0 7/3/09 12:18:35 PM 2.14 Arrays versus Pointers 155 loop2: STR zero,[p],#4 ; Memory[p] = 0; p = p + 4 ADD arraySize,array,size,LSL #2 ; arraySize = address ; of array[size] CMP p,arraySize ; p < &array[size] BLT loop2 ; if (p<&array[size]) go to loop2 As in the first example, this code assumes size is greater than 0. Note that this program calculates the address of the end of the array in every iteration of the loop, even though it does not change. A faster version of the code moves this calculation outside the loop: MOV MOV ADD p,array ; p = address of array[0] zero,#0 ; zero = 0 arraySize,array,size,LSL #2 ; arraySize = ; address of array[size] loop2:STR zero,[p],#4 ; Memory[p] = 0; p = p + 4 CMP p,arraySize ; p < &array[size] BLT loop2 ; if (p<&array[size]) go to loop2 Comparing the Two Versions of Clear Comparing the two code sequences side by side illustrates the difference between array indices and pointers (the changes introduced by the pointer version are highlighted): MOV i,#0 ;i=0 MOV p,array ; p = & array[0] MOV zero,#0 ;zero = 0 MOV zero,#0 ;zero = 0 ADD arraySize,array,size,LSL #2 STR zero,[p],#4 ; Memory[p] = 0,p = p + 4 p,arraySize; p<&array[size] loop1: STR zero, [array,i, LSL #2] ; array[i] = 0 ; arraySize = &array[size] ADD i,i,#1 ;i=i+1 loop2: CMP i,size ; i < size CMP BLT loop1 ; if (i < size) go to loop1 BLT loop2 ; if () go to loop2 The version on the left must have the “multiply” and add inside the loop because i is incremented and each address must be recalculated from the new index. How- ever, the scaled register offset addressing mode hides that extra work. The memory pointer version on the right increments the pointer p directly via the post-indexed immediate addressing mode. The pointer version moves the end of array calculation outside the loop, thereby reducing the instructions executed per iteration from 4 to 3. This manual optimization corresponds to the compiler optimization of strength reduction (shift instead of multiply) and induction variable eliminaSection 2.15 on the tion (eliminating array address calculations within loops). CD describes these two and many other optimizations. 03-Ch02-P374750.indd 155 7/3/09 12:18:35 PM 156 Chapter 2 Instructions: Language of the Computer Elaboration: As mentioned earlier, a C compiler would add a test to be sure that size is greater than 0. One way would be to add a jump just before the first instruction of the loop to the CMP instruction. Understanding Program Performance People used to be taught to use pointers in C to get greater efficiency than that available with arrays: “Use pointers, even if you can’t understand the code.” Modern optimizing compilers can produce code for the array version that is just as good. Most programmers today prefer that the compiler do the heavy lifting. 2.15 object oriented language A programming language that is oriented around objects rather than actions, or data versus logic. 
Advanced Material: Compiling C and Interpreting Java

This section gives a brief overview of how the C compiler works and how Java is executed. Because the compiler will significantly affect the performance of a computer, understanding compiler technology today is critical to understanding performance. Keep in mind that the subject of compiler construction is usually taught in a one- or two-semester course, so our introduction will necessarily only touch on the basics.

The second part of this section is for readers interested in seeing how an object-oriented language like Java executes on an ARM architecture. It shows the Java bytecodes used for interpretation and the ARM code for the Java version of some of the C segments in prior sections, including Bubble Sort. It covers both the Java Virtual Machine and JIT compilers.

The rest of this section is on the CD.

2.16 Real Stuff: MIPS Instructions

While not as popular as ARM, MIPS is an elegant instruction set that came out the same year as ARM and was inspired by similar philosophies. Figure 2.29 lists the similarities. The principal differences are that MIPS has more registers and ARM has more addressing modes. There is a similar core of arithmetic-logical and data transfer instructions in MIPS and ARM, as Figure 2.30 shows.
The remaining six ARM addressing modes would require another instruction to calculate the address in MIPS.

Addressing Modes

Figure 2.31 shows the data addressing modes supported by ARM. MIPS has just three simple data addressing modes, while ARM has nine, including fairly complex calculations. MIPS would require extra instructions to perform those address calculations.

Compare and Conditional Branch

Instead of using condition codes that are set as a side effect of an arithmetic or logical instruction, as in ARM, MIPS uses the contents of registers to evaluate conditional branches. Comparisons are done with the Set on Less Than (slt) instruction, which sets the contents of a register to 1 if the first operand is less than the second operand, and to 0 otherwise. The MIPS Branch Equal (beq) and Branch Not Equal (bne) instructions compare the values of two registers before deciding to branch. The comparison to zero comes for free, since one of the 32 MIPS registers, called $zero, always has the value 0. Every ARM instruction has the option of executing conditionally, depending on the condition codes; the MIPS instructions have no such field.

Figure 2.32 shows the instruction formats for ARM and MIPS. The principal differences are the 4-bit conditional execution field in every ARM instruction and the smaller register field, because ARM has half the number of registers. The MIPS 16-bit immediate field is either zero extended or sign extended to 32 bits, depending on the instruction, and it doesn't have the unusual bit-twiddling features of ARM's 12-bit immediate field.

Figure 2.33 shows a few of the ARM arithmetic-logical instructions not found in MIPS. Since ARM does not have a dedicated register for 0, it has separate opcodes to perform some operations that MIPS can do with $zero. MIPS instructions are generally a subset of the ARM instruction set, although MIPS does provide a Divide instruction (div). Figure 2.33.5 shows the MIPS instruction set used in Chapters 4 to 6.

FIGURE 2.32 Instruction formats for ARM and MIPS (register-register, data transfer, branch, and jump/call formats, built from opcode, register, and constant fields). The differences result from whether the architecture has 16 or 32 registers and whether it has the 4-bit conditional execution field.

Name | Definition | ARM v.6 | MIPS
Load immediate | Rd = Imm | MOV | addi, $0,
Not | Rd = ~(Rs1) | MVN | nor, $0,
Move | Rd = Rs1 | MOV | or, $0,
Rotate right | Rd = Rs >> i, Rd0…i–1 = Rs31–i…31 | ROR |
And not | Rd = Rs1 & ~(Rs2) | BIC |
Reverse subtract | Rd = Rs2 – Rs1 | RSB, RSC |
Support for multiword integer add | CarryOut, Rd = Rd + Rs1 + OldCarryOut | ADCS | —
Support for multiword integer sub | CarryOut, Rd = Rd – Rs1 + OldCarryOut | SBCS | —
FIGURE 2.33 ARM arithmetic/logical instructions not found in MIPS.

MIPS operands
Name: 32 registers. Example: $s0–$s7, $t0–$t9, $zero, $a0–$a3, $v0–$v1, $gp, $fp, $sp, $ra, $at.
Name: 2^30 memory words. Example: Memory[0], Memory[4], . . . , Memory[4294967292].
The 32 registers are fast locations for data.
In MIPS, data must be in registers to perform arithmetic, register $zero always equals 0, and register $at is reserved by the assembler to handle large constants. Accessed only by data transfer instructions. MIPS uses byte addresses, so sequential word addresses differ by 4. Memory holds data structures, arrays, and spilled registers. MIPS assembly language Category Arithmetic Data transfer Logical Instruction Example add subtract add immediate load word store word load half load half unsigned store half load byte load byte unsigned store byte load linked word store condition. word load upper immed. and or nor and immediate or immediate shift left logical shift right logical branch on equal add $s1,$s2,$s3 sub $s1,$s2,$s3 addi $s1,$s2,20 lw $s1,20($s2) sw $s1,20($s2) lh $s1,20($s2) lhu $s1,20($s2) sh $s1,20($s2) lb $s1,20($s2) lbu $s1,20($s2) sb $s1,20($s2) ll $s1,20($s2) sc $s1,20($s2) lui $s1,20 branch on not equal set on less than Conditional branch set on less than unsigned set less than immediate set less than immediate unsigned jump Unconditional jump register jump jump and link Meaning Comments Three register operands $s1 = $s2 + $s3 Three register operands $s1 = $s2 – $s3 Used to add constants $s1 = $s2 + 20 Word from memory to register $s1 = Memory[$s2 + 20] Word from register to memory Memory[$s2 + 20] = $s1 Halfword memory to register $s1 = Memory[$s2 + 20] Halfword memory to register $s1 = Memory[$s2 + 20] Halfword register to memory Memory[$s2 + 20] = $s1 Byte from memory to register $s1 = Memory[$s2 + 20] Byte from memory to register $s1 = Memory[$s2 + 20] Byte from register to memory Memory[$s2 + 20] = $s1 Load word as 1st half of atomic swap $s1 = Memory[$s2 + 20] Memory[$s2+20]=$s1;$s1=0 or 1 Store word as 2nd half of atomic swap Loads constant in upper 16 bits $s1 = 20 * 216 Three reg. operands; bit-by-bit AND and $s1,$s2,$s3 $s1 = $s2 & $s3 Three reg. operands; bit-by-bit OR or $s1,$s2,$s3 $s1 = $s2 | $s3 Three reg. operands; bit-by-bit NOR nor $s1,$s2,$s3 $s1 = ~ ($s2 | $s3) Bit-by-bit AND reg with constant andi $s1,$s2,20 $s1 = $s2 & 20 Bit-by-bit OR reg with constant ori $s1,$s2,20 $s1 = $s2 | 20 Shift left by constant sll $s1,$s2,10 $s1 = $s2 << 10 Shift right by constant srl $s1,$s2,10 $s1 = $s2 >> 10 Equal test; PC-relative branch beq $s1,$s2,25 if ($s1 == $s2) go to PC + 4 + 100 Not equal test; PC-relative bne $s1,$s2,25 if ($s1!= $s2) go to PC + 4 + 100 slt $s1,$s2,$s3 if ($s2 < $s3) $s1 = 1; Compare less than; for beq, bne else $s1 = 0 Compare less than unsigned sltu $s1,$s2,$s3 if ($s2 < $s3) $s1 = 1; else $s1 = 0 Compare less than constant slti $s1,$s2,20 if ($s2 < 20) $s1 = 1; else $s1 = 0 Compare less than constant sltiu $s1,$s2,20 if ($s2 < 20) $s1 = 1; unsigned else $s1 = 0 go to 10000 Jump to target address j 2500 jr jal $ra 2500 go to $ra $ra = PC + 4; go to 10000 For switch, procedure return For procedure call MIPS assembly language. 03-Ch02-P374750.indd 160 7/3/09 12:18:36 PM 2.17 2.17 Real Stuff: x86 Instructions Real Stuff: x86 Instructions Designers of instruction sets sometimes provide more powerful operations than those found in ARM and MIPS. The goal is generally to reduce the number of instructions executed by a program. The danger is that this reduction can occur at the cost of simplicity, increasing the time a program takes to execute because the instructions are slower. This slowness may be the result of a slower clock cycle time or of requiring more clock cycles than a simpler sequence. 
The path toward operation complexity is thus fraught with peril. To avoid these problems, designers have moved toward simpler instructions. Section 2.18 demonstrates the pitfalls of complexity. 161 Beauty is altogether in the eye of the beholder. Margaret Wolfe Hungerford, Molly Bawn, 1877 Evolution of the Intel x86 ARM and MIPS were the vision of single small groups in 1985; the pieces of these architectures fit nicely together, and the whole architecture can be described succinctly. Such is not the case for the x86; it is the product of several independent groups who evolved the architecture over 30 years, adding new features to the original instruction set as someone might add clothing to a packed bag. Here are important x86 milestones. ■ 1978: The Intel 8086 architecture was announced as an assembly language– compatible extension of the then successful Intel 8080, an 8-bit microprocessor. The 8086 is a 16-bit architecture, with all internal registers 16 bits wide. Unlike ARM and MIPS, the registers have dedicated uses, and hence the 8086 is not considered a general-purpose register architecture. ■ 1980: The Intel 8087 floating-point coprocessor is announced. This architecture extends the 8086 with about 60 floating-point instructions. Instead of Section 2.20 and Section 3.7). using registers, it relies on a stack (see ■ 1982: The 80286 extended the 8086 architecture by increasing the address space to 24 bits, by creating an elaborate memory-mapping and protection model (see Chapter 5), and by adding a few instructions to round out the instruction set and to manipulate the protection model. ■ 1985: The 80386 extended the 80286 architecture to 32 bits. In addition to a 32-bit architecture with 32-bit registers and a 32-bit address space, the 80386 added new addressing modes and additional operations. The added instructions make the 80386 nearly a general-purpose register machine. The 80386 also added paging support in addition to segmented addressing (see Chapter 5). Like the 80286, the 80386 has a mode to execute 8086 programs without change. 03-Ch02-P374750.indd 161 general-purpose register (GPR) A register that can be used for addresses or for data with virtually any instruction. 7/3/09 12:18:36 PM 162 03-Ch02-P374750.indd 162 Chapter 2 Instructions: Language of the Computer ■ 1989–95: The subsequent 80486 in 1989, Pentium in 1992, and Pentium Pro in 1995 were aimed at higher performance, with only four instructions added to the user-visible instruction set: three to help with multiprocessing (Chapter 7) and a conditional move instruction. ■ 1997: After the Pentium and Pentium Pro were shipping, Intel announced that it would expand the Pentium and the Pentium Pro architectures with MMX (Multi Media Extensions). This new set of 57 instructions uses the floating-point stack to accelerate multimedia and communication applications. MMX instructions typically operate on multiple short data elements at a time, in the tradition of single instruction, multiple data (SIMD) architectures (see Chapter 7). Pentium II did not introduce any new instructions. ■ 1999: Intel added another 70 instructions, labelled SSE (Streaming SIMD Extensions) as part of Pentium III. The primary changes were to add eight separate registers, double their width to 128 bits, and add a single precision floating-point data type. Hence, four 32-bit floating-point operations can be performed in parallel. 
To improve memory performance, SSE includes cache prefetch instructions plus streaming store instructions that bypass the caches and write directly to memory. ■ 2001: Intel added yet another 144 instructions, this time labelled SSE2. The new data type is double precision arithmetic, which allows pairs of 64-bit floating-point operations in parallel. Almost all of these 144 instructions are versions of existing MMX and SSE instructions that operate on 64 bits of data in parallel. Not only does this change enable more multimedia operations, it gives the compiler a different target for floating-point operations than the unique stack architecture. Compilers can choose to use the eight SSE registers as floating-point registers like those found in other computers. This change boosted the floating-point performance of the Pentium 4, the first microprocessor to include SSE2 instructions. ■ 2003: A company other than Intel enhanced the x86 architecture this time. AMD announced a set of architectural extensions to increase the address space from 32 to 64 bits. Similar to the transition from a 16- to 32-bit address space in 1985 with the 80386, AMD64 widens all registers to 64 bits. It also increases the number of registers to 16 and increases the number of 128bit SSE registers to 16. The primary ISA change comes from adding a new mode called long mode that redefines the execution of all x86 instructions with 64-bit addresses and data. To address the larger number of registers, it adds a new prefix to instructions. Depending how you count, long mode also adds four to ten new instructions and drops 27 old ones. PC-relative data addressing is another extension. AMD64 still has a mode that is identical to x86 (legacy mode) plus a mode that restricts user programs to x86 but allows operating systems to use AMD64 (compatibility mode). These modes allow a more graceful transition to 64-bit addressing than the HP/Intel IA-64 architecture. 7/3/09 12:18:36 PM 2.17 Real Stuff: x86 Instructions ■ 2004: Intel capitulates and embraces AMD64, relabeling it Extended Memory 64 Technology (EM64T). The major difference is that Intel added a 128-bit atomic compare and swap instruction, which probably should have been included in AMD64. At the same time, Intel announced another generation of media extensions. SSE3 adds 13 instructions to support complex arithmetic, graphics operations on arrays of structures, video encoding, floating-point conversion, and thread synchronization (see Section 2.11). AMD will offer SSE3 in subsequent chips and it will almost certainly add the missing atomic swap instruction to AMD64 to maintain binary compatibility with Intel. ■ 2006: Intel announces 54 new instructions as part of the SSE4 instruction set extensions. These extensions perform tweaks like sum of absolute differences, dot products for arrays of structures, sign or zero extension of narrow data to wider sizes, population count, and so on. They also added support for virtual machines (see Chapter 5). ■ 2007: AMD announces 170 instructions as part of SSE5, including 46 instructions of the base instruction set that adds three operand instructions like ARM and MIPS. ■ 2008: Intel announces the Advanced Vector Extension that expands the SSE register width from 128 to 256 bits, thereby redefining about 250 instructions and adding 128 new instructions. 
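Several of the milestones above add SIMD instructions that operate on multiple short data elements at once; SSE, for instance, works on four 32-bit floating-point values per instruction. The following C sketch is only an illustration of that style of computation, not code from this chapter: it uses the SSE intrinsics declared in <xmmintrin.h>, and the function name and the assumption that n is a multiple of 4 are ours.

    #include <xmmintrin.h>   /* SSE intrinsics: an __m128 holds four 32-bit floats */

    /* Add two arrays of floats four elements at a time.
       Assumes n is a multiple of 4; a full routine would handle the leftover elements. */
    void add_floats_sse(float *dst, const float *a, const float *b, int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load four floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vsum = _mm_add_ps(va, vb);  /* four single-precision adds at once */
            _mm_storeu_ps(&dst[i], vsum);      /* store four results */
        }
    }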
163 This history illustrates the impact of the “golden handcuffs” of compatibility on the x86, as the existing software base at each step was too important to jeopardize with significant architectural changes. If you looked over the life of the x86, on average the architecture has been extended by one instruction per month! Whatever the artistic failures of the x86, keep in mind that there are more instances of this architectural family on desktop computers than of any other architecture, increasing by more than 250 million per year. Nevertheless, this checkered ancestry has led to an architecture that is difficult to explain and impossible to love. Brace yourself for what you are about to see! Do not try to read this section with the care you would need to write x86 programs; the goal instead is to give you familiarity with the strengths and weaknesses of the world’s most popular desktop architecture. Rather than show the entire 16-bit and 32-bit instruction set, in this section we concentrate on the 32-bit subset that originated with the 80386, as this portion of the architecture is what is used today. We start our explanation with the registers and addressing modes, move on to the integer operations, and conclude with an examination of instruction encoding. x86 Registers and Data Addressing Modes The registers of the 80386 show the evolution of the instruction set (Figure 2.34). The 80386 extended all 16-bit registers (except the segment registers) to 32 bits, prefixing an E to their name to indicate the 32-bit version. We’ll refer to them generically as 03-Ch02-P374750.indd 163 7/3/09 12:18:36 PM 164 Chapter 2 Instructions: Language of the Computer Name Use 31 0 EAX GPR 0 ECX GPR 1 EDX GPR 2 EBX GPR 3 ESP GPR 4 EBP GPR 5 ESI GPR 6 EDI GPR 7 EIP EFLAGS CS Code segment pointer SS Stack segment pointer (top of stack) DS Data segment pointer 0 ES Data segment pointer 1 FS Data segment pointer 2 GS Data segment pointer 3 Instruction pointer (PC) Condition codes FIGURE 2.34 The 80386 register set. Starting with the 80386, the top eight registers were extended to 32 bits and could also be used as general-purpose registers. GPRs (general-purpose registers). The 80386 contains only eight GPRs. This means MIPS programs can use four times as many and ARM twice as many. Figure 2.35 shows the arithmetic, logical, and data transfer instructions are two-operand instructions. There are two important differences here. The x86 arithmetic and logical instructions must have one operand act as both a source and a destination; ARM and MIPS allow separate registers for source and destination. This restriction puts more pressure on the limited registers, since one source register must be modified. The second important difference is that one of the operands can be in memory. Thus, virtually any instruction may have one operand in memory, unlike ARM and MIPS. 03-Ch02-P374750.indd 164 7/3/09 12:18:37 PM 2.17 Real Stuff: x86 Instructions Source/destination operand type Second source operand Register Register Register Immediate Register Memory Memory Register Memory Immediate 165 FIGURE 2.35 Instruction types for the arithmetic, logical, and data transfer instructions. The x86 allows the combinations shown. The only restriction is the absence of a memory-memory mode. Immediates may be 8, 16, or 32 bits in length; a register is any one of the 14 major registers in Figure 2.34 (not EIP or EFLAGS). Mode Description Register restrictions Address is in a register. 
Not ESP or EBP Address is contents of base register plus displacement. Not ESP Base plus scaled index The address is Base + (2Scale x Index) where Scale has the value 0, 1, 2, or 3. Base: any GPR Index: not ESP Base plus scaled index with 8- or 32-bit displacement The address is Base + (2Scale x Index) + displacement where Scale has the value 0, 1, 2, or 3. Base: any GPR Index: not ESP Register indirect Based mode with 8- or 32-bit displacement FIGURE 2.36 x86 32-bit addressing modes with register restrictions. The Base plus Scaled Index addressing mode, found in ARM or but not MIPS, is included to avoid the multiplies by 4 (scale factor of 2) to turn an index in a register into a byte address (see Figures 2.23 and 2.25). A scale factor of 1 is used for 16-bit data, and a scale factor of 3 for 64-bit data. A scale factor of 0 means the address is not scaled. (Intel gives two different names to what is called Based addressing mode—Based and Indexed—but they are essentially identical and we combine them here.) Data memory-addressing modes, described in detail below, offer two sizes of addresses within the instruction. These so-called displacements can be 8 bits or 32 bits. Although a memory operand can use any addressing mode, there are restrictions on which registers can be used in a mode. Figure 2.36 shows the x86 addressing modes and which GPRs cannot be used with each mode. x86 Integer Operations The 8086 provides support for both 8-bit (byte) and 16-bit (word) data types. The 80386 adds 32-bit addresses and data (double words) in the x86. (AMD64 adds 64-bit addresses and data, called quad words; we’ll stick to the 80386 in this section.) The data type distinctions apply to register operations as well as memory accesses. 03-Ch02-P374750.indd 165 7/3/09 12:18:37 PM 166 Chapter 2 Instructions: Language of the Computer Almost every operation works on both 8-bit data and on one longer data size. That size is determined by the mode and is either 16 bits or 32 bits. Clearly, some programs want to operate on data of all three sizes, so the 80386 architects provided a convenient way to specify each version without expanding code size significantly. They decided that either 16-bit or 32-bit data dominates most programs, and so it made sense to be able to set a default large size. This default data size is set by a bit in the code segment register. To override the default data size, an 8-bit prefix is attached to the instruction to tell the machine to use the other large size for this instruction. The prefix solution was borrowed from the 8086, which allows multiple prefixes to modify instruction behaviour. The three original prefixes override the default segment register, lock the bus to support synchronization (see Section 2.11), or repeat the following instruction until the register ECX counts down to 0. This last prefix was intended to be paired with a byte move instruction to move a variable number of bytes. The 80386 also added a prefix to override the default address size. The x86 integer operations can be divided into four major classes: 1. Data movement instructions, including move, push, and pop 2. Arithmetic and logic instructions, including test, integer, and decimal arithmetic operations 3. Control flow, including conditional branches, unconditional jumps, calls, and returns 4. 
String instructions, including string move and string compare The first two categories are unremarkable, except that the arithmetic and logic instruction operations allow the destination to be either a register or a memory location. Figure 2.37 shows some typical x86 instructions and their functions. Instruction Function je name if equal(condition code) {EIP=name}; EIP–128 <= name < EIP+128 jmp name EIP=name call name SP=SP–4; M[SP]=EIP+5; EIP=name; movw EBX,[EDI+45] EBX=M[EDI+45] push ESI SP=SP–4; M[SP]=ESI pop EDI EDI=M[SP]; SP=SP+4 add EAX,;6765 EAX= EAX+6765 Set condition code (flags) with EDX and 42 test EDX,;42 movsl M[EDI]=M[ESI]; EDI=EDI+4; ESI=ESI+4 FIGURE 2.37 Some typical x86 instructions and their functions. A list of frequent operations appears in Figure 2.38. The CALL saves the EIP of the next instruction on the stack. (EIP is the Intel PC.) 03-Ch02-P374750.indd 166 7/3/09 12:18:37 PM 2.17 Real Stuff: x86 Instructions 167 Conditional branches on the x86 are based on condition codes or flags, like ARM. Unlike ARM, where condition codes are set or not depending on the S bit, condition codes are set as an implicit side effect of most x86 instructions. Branches then test the condition codes. PC-relative branch addresses must be specified in the number of bytes, since unlike ARM and MIPS, x86 instructions are not all 4 bytes in length. String instructions are part of the 8080 ancestry of the x86 and are not commonly executed in most programs. They are often slower than equivalent software routines (see the fallacy on page 170). Figure 2.38 lists some of the integer x86 instructions. Many of the instructions are available in both byte and word formats. Instruction Meaning Control Conditional and unconditional branches jnz, jz Jump if condition to EIP + 8-bit offset; JNE (for JNZ), JE (for JZ) are alternative names jmp Unconditional jump—8-bit or 16-bit offset call Subroutine call—16-bit offset; return address pushed onto stack ret Pops return address from stack and jumps to it loop Loop branch—decrement ECX; jump to EIP + 8-bit displacement if ECX ≠ 0 Data transfer Move data between registers or between register and memory move Move between two registers or between register and memory push, pop Push source operand on stack; pop operand from stack top to a register les Load ES and one of the GPRs from memory Arithmetic, logical Arithmetic and logical operations using the data registers and memory add, sub cmp Add source to destination; subtract source from destination; register-memory format Compare source and destination; register-memory format shl, shr, rcr Shift left; shift logical right; rotate right with carry condition code as fill cbw Convert byte in eight rightmost bits of EAX to 16-bit word in right of EAX test Logical AND of source and destination sets condition codes inc, dec Increment destination, decrement destination or, xor Logical OR; exclusive OR; register-memory format String Move between string operands; length given by a repeat prefix movs Copies from string source to destination by incrementing ESI and EDI; may be repeated lods Loads a byte, word, or doubleword of a string into the EAX register FIGURE 2.38 Some typical operations on the x86. Many operations use register-memory format, where either the source or the destination may be memory and the other may be a register or immediate operand. 
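The memory operands used by these register-memory instructions are formed with the addressing modes of Figure 2.36. As a minimal sketch (our illustration, not part of the text), the most general mode, base plus scaled index with displacement, computes its effective address as follows:

    #include <stdint.h>

    /* Effective address for the x86 "base plus scaled index with 8- or 32-bit
       displacement" mode of Figure 2.36:
           address = Base + (2^Scale * Index) + Displacement, with Scale in {0, 1, 2, 3}.
       The simpler modes drop the index term, the displacement, or both. */
    uint32_t x86_effective_address(uint32_t base, uint32_t index,
                                   unsigned scale, int32_t disp)
    {
        return base + (index << scale) + (uint32_t)disp;  /* 32-bit wraparound, like the hardware */
    }

A scale of 2 turns a word index into a byte offset without a separate shift, which is the job ARM's scaled register offset mode performs and MIPS must do with an extra instruction.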
03-Ch02-P374750.indd 167 7/3/09 12:18:37 PM 168 Chapter 2 Instructions: Language of the Computer x86 Instruction Encoding Saving the worst for last, the encoding of instructions in the 80386 is complex, with many different instruction formats. Instructions for the 80386 may vary from 1 byte, when there are no operands, up to 15 bytes. Figure 2.39 shows the instruction format for several of the example instructions in Figure 2.37. The opcode byte usually contains a bit saying whether the operand is 8 bits or 32 bits. For some instructions, the opcode may include the addressing a. JE EIP + displacement 4 4 8 CondiJE Displacement tion b. CALL 8 32 CALL c. MOV 6 Offset EBX, [EDI + 45] 1 1 8 r/m d w Postbyte MOV 8 Displacement d. PUSH ESI 5 3 PUSH Reg e. ADD EAX, #6765 4 3 1 ADD 32 Reg w f. TEST EDX, #42 7 1 TEST w Immediate 8 32 Postbyte Immediate FIGURE 2.39 Typical x86 instruction formats. Figure 2.40 shows the encoding of the postbyte. Many instructions contain the 1-bit field w, which says whether the operation is a byte or a double word. The d field in MOV is used in instructions that may move to or from memory and shows the direction of the move. The ADD instruction requires 32 bits for the immediate field, because in 32-bit mode, the immediates are either 8 bits or 32 bits. The immediate field in the TEST is 32 bits long because there is no 8-bit immediate for test in 32-bit mode. Overall, instructions may vary from 1 to 17 bytes in length. The long length comes from extra 1-byte prefixes, having both a 4-byte immediate and a 4-byte displacement address, using an opcode of 2 bytes, and using the scaled index mode specifier, which adds another byte. 03-Ch02-P374750.indd 168 7/3/09 12:18:37 PM 2.17 169 Real Stuff: x86 Instructions mode and the register; this is true in many instructions that have the form “register = register op immediate.” Other instructions use a “postbyte” or extra opcode byte, labeled “mod, reg, r/m,” which contains the addressing mode information. This postbyte is used for many of the instructions that address memory. The base plus scaled index mode uses a second postbyte, labeled “sc, index, base.” Figure 2.40 shows the encoding of the two postbyte address specifiers for both 16-bit and 32-bit mode. Unfortunately, to understand fully which registers and which addressing modes are available, you need to see the encoding of all addressing modes and sometimes even the encoding of the instructions. reg w=0 w=1 16b r/m 32b mod = 0 16b mod = 1 mod = 2 32b 16b 32b 16b 32b mod = 3 0 AL AX EAX 0 addr=BX+SI =EAX same same same same 1 CL CX ECX 1 addr=BX+DI =ECX addr as addr as addr as addr as same as 2 DL DX EDX 2 addr=BP+SI =EDX mod=0 mod=0 mod=0 mod=0 reg field 3 BL BX EBX 3 addr=BP+SI =EBX + disp8 + disp8 + disp16 + disp32 4 AH SP ESP 4 addr=SI =(sib) SI+disp8 (sib)+disp8 SI+disp8 (sib)+disp32 “ 5 CH BP EBP 5 addr=DI =disp32 DI+disp8 EBP+disp8 DI+disp16 EBP+disp32 “ 6 DH SI ESI 6 addr=disp16 =ESI BP+disp8 ESI+disp8 BP+disp16 ESI+disp32 “ 7 BH DI EDI 7 addr=BX =EDI BX+disp8 EDI+disp8 BX+disp16 EDI+disp32 “ FIGURE 2.40 The encoding of the first address specifier of the x86: mod, reg, r/m. The first four columns show the encoding of the 3-bit reg field, which depends on the w bit from the opcode and whether the machine is in 16-bit mode (8086) or 32-bit mode (80386). The remaining columns explain the mod and r/m fields. The meaning of the 3-bit r/m field depends on the value in the 2-bit mod field and the address size. 
Basically, the registers used in the address calculation are listed in the sixth and seventh columns, under mod = 0, with mod = 1 adding an 8-bit displacement and mod = 2 adding a 16-bit or 32-bit displacement, depending on the address mode. The exceptions are 1) r/m = 6 when mod = 1 or mod = 2 in 16-bit mode selects BP plus the displacement; 2) r/m = 5 when mod = 1 or mod = 2 in 32-bit mode selects EBP plus displacement; and 3) r/m = 4 in 32-bit mode when mod does not equal 3, where (sib) means use the scaled index mode shown in Figure 2.36. When mod = 3, the r/m field indicates a register, using the same encoding as the reg field combined with the w bit. x86 Conclusion Intel had a 16-bit microprocessor two years before its competitors’ more elegant architectures, such as the Motorola 68000, and this head start led to the selection of the 8086 as the CPU for the IBM PC. Intel engineers generally acknowledge that the x86 is more difficult to build than computers like ARM and MIPS, but the large market means AMD and Intel can afford more resources to help overcome the added complexity. What the x86 lacks in style, it makes up for in quantity, making it beautiful from the right perspective. Its saving grace is that the most frequently used x86 architectural components are not too difficult to implement, as AMD and Intel have demonstrated by rapidly improving performance of integer programs since 1978. To get that performance, compilers must avoid the portions of the architecture that are hard to implement fast. 03-Ch02-P374750.indd 169 7/3/09 12:18:37 PM 170 Chapter 2 2.18 Instructions: Language of the Computer Fallacies and Pitfalls Fallacy: More powerful instructions mean higher performance. Part of the power of the Intel x86 is the prefixes that can modify the execution of the following instruction. One prefix can repeat the following instruction until a counter counts down to 0. Thus, to move data in memory, it would seem that the natural instruction sequence is to use move with the repeat prefix to perform 32-bit memory-to-memory moves. An alternative method, which uses the standard instructions found in all computers, is to load the data into the registers and then store the registers back to memory. This second version of this program, with the code replicated to reduce loop overhead, copies at about 1.5 times faster. A third version, which uses the larger floating-point registers instead of the integer registers of the x86, copies at about 2.0 times faster than the complex move instruction. Fallacy: Write in assembly language to obtain the highest performance. At one time compilers for programming languages produced naïve instruction sequences; the increasing sophistication of compilers means the gap between compiled code and code produced by hand is closing fast. In fact, to compete with current compilers, the assembly language programmer needs to understand the concepts in Chapters 4 and 5 thoroughly (processor pipelining and memory hierarchy). This battle between compilers and assembly language coders is one situation in which humans are losing ground. For example, C offers the programmer a chance to give a hint to the compiler about which variables to keep in registers versus spilled to memory. When compilers were poor at register allocation, such hints were vital to performance. In fact, some old C textbooks spent a fair amount of time giving examples that effectively use register hints. 
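A hint of that kind looks like the following minimal sketch (ours, not an example from the text): the register storage class asks, but cannot force, the compiler to keep i and sum in registers.

    /* "register" is only a request; the compiler may ignore it, and the address
       of a register variable may not be taken. */
    int sum_array(const int a[], int n)
    {
        register int i;
        register int sum = 0;
        for (i = 0; i < n; i = i + 1)
            sum = sum + a[i];
        return sum;
    }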
Today's C compilers generally ignore such hints, because the compiler does a better job at allocation than the programmer does.

Even if writing by hand resulted in faster code, the dangers of writing in assembly language are the longer time spent coding and debugging, the loss in portability, and the difficulty of maintaining such code. One of the few widely accepted axioms of software engineering is that coding takes longer if you write more lines, and it clearly takes many more lines to write a program in assembly language than in C or Java. Moreover, once it is coded, the next danger is that it will become a popular program. Such programs always live longer than expected, meaning that someone will have to update the code over several years and make it work with new releases of operating systems and new models of machines. Writing in a higher-level language instead of assembly language not only allows future compilers to tailor the code to future machines, it also makes the software easier to maintain and allows the program to run on more brands of computers.

Fallacy: The importance of commercial binary compatibility means successful instruction sets don't change.

FIGURE 2.41 Growth of x86 instruction set over time (number of instructions, plotted by year from 1978 to 2008). While there is clear technical value to some of these extensions, this rapid change also increases the difficulty for other companies to try to build compatible processors.

While backwards binary compatibility is sacrosanct, Figure 2.41 shows that the x86 architecture has grown dramatically. The average is more than one instruction per month over its 30-year lifetime!

Pitfall: Forgetting that sequential word addresses in machines with byte addressing do not differ by one.

Many an assembly language programmer has toiled over errors made by assuming that the address of the next word can be found by incrementing the address in a register by one instead of by the word size in bytes. Forewarned is forearmed!

Pitfall: Using a pointer to an automatic variable outside its defining procedure.

A common mistake in dealing with pointers is to pass a result from a procedure that includes a pointer to an array that is local to that procedure. Following the stack discipline in Figure 2.12, the memory that contains the local array will be reused as soon as the procedure returns. Pointers to automatic variables can lead to chaos.

2.19 Concluding Remarks

Less is more. Robert Browning, Andrea del Sarto, 1855

The two principles of the stored-program computer are the use of instructions that are indistinguishable from numbers and the use of alterable memory for programs. These principles allow a single machine to aid environmental scientists, financial advisers, and novelists in their specialties. The selection of a set of instructions that the machine can understand demands a delicate balance among the number of instructions needed to execute a program, the number of clock cycles needed by an instruction, and the speed of the clock. As illustrated in this chapter, four design principles guide the authors of instruction sets in making that delicate balance:

1. Simplicity favors regularity.
Regularity motivates many features of the ARM instruction set: keeping all instructions a single size, always requiring three register operands in arithmetic instructions, and keeping the register fields in the same place in each instruction format. 2. Smaller is faster. The desire for speed is the reason that ARM has 16 registers rather than many more. 3. Make the common case fast. Examples of making the common ARM case fast include PC-relative addressing for conditional branches and immediate addressing for larger constant operands. 4. Good design demands good compromises. One ARM example was the compromise between providing for larger addresses and constants in instructions and keeping all instructions the same length. Above this machine level is assembly language, a language that humans can read. The assembler translates it into the binary numbers that machines can understand, and it even “extends” the instruction set by creating symbolic instructions that aren’t in the hardware. For instance, constants or addresses that are too big are broken into properly sized pieces, common variations of instructions are given their own name, and so on. Figure 2.42 lists the ARM instructions we have covered so far, both real and pseudoinstructions. Each category of ARM instructions is associated with constructs that appear in programming languages: ■ The arithmetic instructions correspond to the operations found in assignment statements. ■ Data transfer instructions are most likely to occur when dealing with data structures like arrays or structures. ■ The conditional branches are used in if statements and in loops. ■ The unconditional jumps are used in procedure calls and returns and for case/switch statements. These instructions are not born equal; the popularity of the few dominates the many. For example, Figure 2.43 shows the popularity of each class of instructions for SPEC2006. The varying popularity of instructions plays an important role in the chapters about, datapath, control, and pipelining. After we explain computer arithmetic in Chapter 3, we reveal the rest of the ARM instruction set architecture. 03-Ch02-P374750.indd 172 7/3/09 12:18:38 PM 173 2.19 Concluding Remarks ARM instructions Name Format add ADD DP subtract SUB DP load register LDR DT store register STR DT load register halfword LDRH DT LDRHS DT store register halfword STRH DT load register byte LDRB DT LDRBS DT store register byte STRB DT swap SWP DT mov MOV DP and AND ORR DP DP logical shift left (optional operation) MVN LSL DP logical shift right (optional operation) LSR DP compare CMP DP branch on X: EQ, NE, LT, LE, GT, GE LO, LS, HI, HS, VS, VC, MI, PL Bx BR branch (always) B BR branch and link BL BR load register halfword signed load register byte signed or not DP FIGURE 2.42 The ARM instruction set covered so far. Appendixes B1, B2 and B3 describe the full ARM architecture. Figure 2.1 shows more details of the ARM architecture revealed in this chapter. Frequency Instruction class ARM examples HLL correspondence Integer Ft. pt. 
Arithmetic ADD, SUB, MOV Operations in assignment statements 16% 48% Data transfer LDR, STR, LDRB, LDRSB, LDRH, LDRSH, STRB, STRH References to data structures, such as arrays 35% 36% Logical AND, ORR, MNV, LSL, LSR 0perations in assignment statements 12% 4% Conditional branch B_, CMP If statements and loops 34% 8% Jump B, BL Procedure calls, returns, and case/switch statements 2% 0% FIGURE 2.43 ARM instruction classes, examples, correspondence to high-level program language constructs, and percentage of ARM instructions executed by category for the average SPEC2006 benchmarks. Figure 3.24 in Chapter 3 shows average percentage of the individual ARM instructions executed. Extrapolated from measurements of MIPS programs. 03-Ch02-P374750.indd 173 7/3/09 12:18:38 PM 174 Chapter 2 \ 2.20 Instructions: Language of the Computer Historical Perspective and Further Reading This section surveys the history of instruction set architectures (ISAs) over time, and we give a short history of programming languages and compilers. ISAs include accumulator architectures, general-purpose register architectures, stack architectures, and a brief history of ARM, MIPS, and the x86. We also review the controversial subjects of high-level-language computer architectures and reduced instruction set computer architectures. The history of programming languages includes Fortran, Lisp, Algol, C, Cobol, Pascal, Simula, Smalltalk, C++, and Java, and the history of compilers includes the key milestones and the pioneers who achieved them. The rest of this section is on the CD. 2.21 Exercises Contributed by John Oliver of Cal Poly, San Luis Obispo, with contributions from Nicole Kaiyan (University of Adelaide) and Milos Prvulovic (Georgia Tech) A link for an ARM simulator is provided on the CD and is helpful for these exercises. Although the simulator accepts pseudoinstructions, try not to use pseudoinstructions for any exercises that ask you to produce ARM code. Your goal should be to learn the real ARM instruction set, and if you are asked to count instructions, your count should reflect the actual instructions that will be executed and not the pseudoinstructions. There are some cases where pseudoinstructions must be used (for example, the la instruction when an actual value is not known at assembly time). In many cases, they are quite convenient and result in more readable code (for example, the li and move instructions). If you choose to use pseudoinstructions for these reasons, please add a sentence or two to your solution stating which pseudoinstructions you have used and why. Exercise 2.1 The following problems deal with translating from C to ARM. Assume that the variables g, h, i, and j are given and could be considered 32-bit integers as declared in a C program. a. f = g + h + i + j; b. f = g + (h + 5); 2.1.1 [5] <2.2> For the C statements above, what is the corresponding ARM assembly code? Use a minimal number of ARM assembly instructions. 03-Ch02-P374750.indd 174 7/3/09 12:18:38 PM 2.21 Exercises 175 2.1.2 [5] <2.2> For the C statements above, how many ARM assembly instructions are needed to perform the C statement? 2.1.3 [5] <2.2> If the variables f, g, h, i, and j have values 1, 2, 3, 4, and 5, respectively, what is the end value of f? The following problems deal with translating from ARM to C. Assume that the variables g, h, i, and j are given and could be considered 32-bit integers as declared in a C program. a. ADD f, g, h b. 
ADD ADD f, f, #1 f, g, h 2.1.4 [5] <2.2> For the ARM statements above, what is a corresponding C statement? 2.1.5 [5] <2.2> If the variables f, g, h, and i have values 1, 2, 3, and 4, respectively, what is the end value of f? Exercise 2.2 The following problems deal with translating from C to ARM. Assume that the variables g, h, i, and j are given and could be considered 32-bit integers as declared in a C program. a. f = f + f + i; b. f = g + (j + 2); 2.2.1 [5] <2.2> For the C statements above, what is the corresponding ARM assembly code? Use a minimal number of ARM assembly instructions. 2.2.2 [5] <2.2> For the C statements above, how many ARM assembly instructions are needed to perform the C statement? 2.2.3 [5] <2.2> If the variables f, g, h, and i have values 1, 2, 3, and 4, respectively, what is the end value of f? The following problems deal with translating from ARM to C. For the following exercise, assume that the variables g, h, i, and j are given and could be considered 32-bit integers as declared in a C program. a. ADD f, f, h b. RSB ADD f, f, #0 f, f, 1 03-Ch02-P374750.indd 175 7/3/09 12:18:38 PM 176 Chapter 2 Instructions: Language of the Computer 2.2.4 [5] <2.2> For the ARM statements above, what is a corresponding C statement? 2.2.5 [5] <2.2> If the variables f, g, h, and i have values 1, 2, 3, and 4, respectively, what is the end value of f? Exercise 2.3 The following problems deal with translating from C to ARM. Assume that the variables g, h, i, and j are given and could be considered 32-bit integers as declared in a C program. a. f = f + g + h + i + j + 2; b. f = g – (f + 5); 2.3.1 [5] <2.2> For the C statements above, what is the corresponding ARM assembly code? Use a minimal number of ARM assembly instructions. 2.3.2 [5] <2.2> For the C statements above, how many ARM assembly instructions are needed to perform the C statement? 2.3.3 [5] <2.2> If the variables f, g, h, i, and j have values 1, 2, 3, 4, and 5, respectively, what is the end value of f? The following problems deal with translating from ARM to C. Assume that the variables g, h, i, and j are given and could be considered 32-bit integers as declared in a C program. a. ADD f, –g, h b. ADD SUB h, f, #1 f, g, h 2.3.4 [5] <2.2> For the ARM statements above, what is a corresponding C statement? 2.3.5 [5] <2.2> If the variables f, g, h, and i have values 1, 2, 3, and 4, respectively, what is the end value of f? Exercise 2.4 The following problems deal with translating from C to ARM. Assume that the variables f, g, h, i, and j are assigned to registers r0, r1, r2, r3, and r4, respectively. 03-Ch02-P374750.indd 176 7/3/09 12:18:38 PM 2.21 Exercises 177 Assume that the base address of the arrays A and B are in registers r6 and r7, respectively. a. f = g + h + B[4]; b. f = g – A[B[4]]; 2.4.1 [10] <2.2, 2.3> For the C statements above, what is the corresponding ARM assembly code? 2.4.2 [5] <2.2, 2.3> For the C statements above, how many ARM assembly instructions are needed to perform the C statement? 2.4.3 [5] <2.2, 2.3> For the C statements above, how many different registers are needed to carry out the C statement? The following problems deal with translating from ARM to C. Assume that the variables f, g, h, i, and j are assigned to registers r0, r1, r2, r3, and r4, respectively. Assume that the base address of the arrays A and B are in registers r6 and r7, respectively. a. ADD add add add r0, r0, r0, r0, r0, r0, r0, r0, r1 r2 r3 r4 b. 
LDR r0, 4[r6, #0x4] 2.4.4 [10] <2.2, 2.3> For the ARM assembly instructions above, what is the corresponding C statement? 2.4.5 [5] <2.2, 2.3> For the ARM assembly instructions above, rewrite the assembly code to minimize the number of ARM instructions (if possible) needed to carry out the same function. 2.4.6 [5] <2.2, 2.3> How many registers are needed to carry out the ARM assembly as written above? If you could rewrite the code above, what is the minimal number of registers needed? Exercise 2.5 In the following problems, we will be investigating memory operations in the context of a ARM processor. The table below shows the values of an array stored in memory. 03-Ch02-P374750.indd 177 7/3/09 12:18:38 PM 178 Chapter 2 Instructions: Language of the Computer a. Address 12 8 4 0 Data 1 6 4 2 b. Address 16 12 8 4 0 Data 1 2 3 4 5 2.5.1 [10] <2.2, 2.3> For the memory locations in the table above, write C code to sort the data from lowest-to-highest, placing the lowest value in the smallest memory location shown in the figure. Assume that the data shown represents the C variable called Array, which is an array of type int. Assume that this particular machine is a byte-addressable machine and a word consists of 4 bytes. 2.5.2 [10] <2.2, 2.3> For the memory locations in the table above, write ARM code to sort the data from lowest-to-highest, placing the lowest value in the smallest memory location. Use a minimum number of ARM instructions. Assume the base address of Array is stored in register r6. 2.5.3 [5] <2.2, 2.3> To sort the array above, how many instructions are required for the ARM code? If you are not allowed to use the immediate field in lw and sw instructions, how many ARM instructions do you need? The following problems explore the translation of hexadecimal numbers to other number formats. a. 0x12345678 b. 0xbeadf00d 2.5.4 [5] <2.3> Translate the hexadecimal numbers above into decimal. 2.5.5 [5] <2.3> Show how the data in the table would be arranged in memory of a little-endian and a big-endian machine. Assume the data is stored starting at address 0. Exercise 2.6 The following problems deal with translating from C to ARM. Assume that the variables f, g, h, i, and j are assigned to registers r0, r1, r2, r3, and r4, respectively. Assume that the base address of the arrays A and B are in registers r6 and r7, respectively. 03-Ch02-P374750.indd 178 7/3/09 12:18:38 PM 2.21 a. f = –g + h + B[1]; b. f = A[B[g]+1]; Exercises 179 2.6.1 [10] <2.2, 2.3> For the C statements above, what is the corresponding ARM assembly code? 2.6.2 [5] <2.2, 2.3> For the C statements above, how many ARM assembly instructions are needed to perform the C statement? 2.6.3 [5] <2.2, 2.3> For the C statements above, how many registers are needed to carry out the C statement using ARM assembly code? The following problems deal with translating from ARM to C. Assume that the variables f, g, h, i, and j are assigned to registers r0, r1, r2, r3, and r4, respectively. Assume that the base address of the arrays A and B are in registers r6 and r7, respectively. a. ADD r0, r0, r1 ADD r0, r3, r2 ADD r0, r0, r3 b. ADD ADD LDR r6, r6, #–20 ;(SUB r6, r6, #20) r6, r6, r1 r0, [r6, #8] 2.6.4 [5] <2.2, 2.3> For the ARM assembly instructions above, what is the corresponding C statement? 2.6.5 [5] <2.2, 2.3> For the ARM assembly above, assume that the registers r0, r1, r2, r3, contain the values 10, 20, 30, and 40, respectively. 
Also, assume that register r6 contains the value 256, and that memory contains the following values: Address Value 256 100 260 200 264 300 Find the value of r0 at the end of the assembly code. 2.6.6 [10] <2.3, 2.5> For each ARM instruction, show the value of the opcode, Rd, Rn, operand2 and I fields. Remember to distinguish between DP-type and DT-type instructions. 03-Ch02-P374750.indd 179 7/3/09 12:18:38 PM 180 Chapter 2 Instructions: Language of the Computer Exercise 2.7 The following problems explore number conversions from signed and unsigned binary number to decimal numbers. a. 1010 1101 0001 0000 0000 0000 0000 0010two b. 1111 1111 1111 1111 1011 0011 0101 0011two 2.7.1 [5] <2.4> For the patterns above, what base 10 number does it represent, assuming that it is a two’s complement integer? 2.7.2 [5] <2.4> For the patterns above, what base 10 number does it represent, assuming that it is an unsigned integer? 2.7.3 [5] <2.4> For the patterns above, what hexadecimal number does it represent? The following problems explore number conversions from decimal to signed and unsigned binary numbers. a. 2147483647ten b. 1000ten 2.7.4 [5] <2.4> For the base ten numbers above, convert to 2’s complement binary. 2.7.5 [5] <2.4> For the base ten numbers above, convert to 2’s complement hexadecimal. 2.7.6 [5] <2.4> For the base ten numbers above, convert the negated values from the table to 2’s complement hexadecimal. Exercise 2.8 The following problems deal with sign extension and overflow. Registers r0 and r1 hold the values as shown in the table below. You will be asked to perform a ARM operation on these registers and show the result. a. r0 = 70000000sixteen, r1 = 0x0FFFFFFFsixteen b. r0 = 0x40000000sixteen, r1 = 0x40000000sixteen 2.8.1 [5] <2.4> For the contents of registers r0 and r1 as specified above, what is the value of r4 for the following assembly code: ADD r4, r0, r1 Is the result in r4 the desired result, or has there been overflow? 03-Ch02-P374750.indd 180 7/3/09 12:18:38 PM 2.21 Exercises 181 2.8.2 [5] <2.4> For the contents of registers r0 and r1 as specified above, what is the value of r0 for the following assembly code: SUB r4, r0, r1 Is the result in r4 the desired result, or has there been overflow? 2.8.3 [5] <2.4> For the contents of registers r0 and r1 as specified above, what is the value of r4 for the following assembly code: ADD r4, r0, r1 ADD r4, r0, r0 Is the result in r4 the desired result, or has there been overflow? In the following problems, you will perform various ARM operations on a pair of registers, r0 and r1. Given the values of r0 and r1 in each of the questions below, state if there will be overflow. a. ADD r0, r0, r1 b. SUB r0, r0, r1 sub r0, r0, r1 2.8.4 [5] <2.4> Assume that register r0 = 0x70000000 and r1 = 0x10000000. For the table above, will there be overflow? 2.8.5 [5] <2.4> Assume that register r0 = 0x40000000 and r1 = 0x20000000. For the table above, will there be overflow? 2.8.6 [5] <2.4> Assume that register r0 = 0x8FFFFFFF and r1 = 0xD0000000. For the table above, will there be overflow? Exercise 2.9 The table below contains various values for register r1. You will be asked to evaluate if there would be overflow for a given operation. a. 2147483647ten b. 0xD0000000sixteen 2.9.1 [5] <2.4> Assume that register r0 = 0x70000000 and r1 has the value as given in the table. If the instruction: ADD r0, r0, r1 is executed, will there be overflow? 
03-Ch02-P374750.indd 181 7/3/09 12:18:38 PM 182 Chapter 2 Instructions: Language of the Computer 2.9.2 [5] <2.4> Assume that register r0 = 0x80000000 and r1 has the value as given in the table. If the instruction: SUB r0, r0, r1 is executed, will there be overflow? 2.9.3 [5] <2.4> Assume that register r0 = 0x7FFFFFFF and r1 has the value as given in the table. If the instruction: SUB r0, r0, r1 is executed, will there be overflow? The table below contains various values for register r1. You will be asked to evaluate if there would be overflow for a given operation. a. 1010 1101 0001 0000 0000 0000 0000 0010two b. 1111 1111 1111 1111 1011 0011 0101 0011two 2.9.4 [5] <2.4> Assume that register r0 = 0x70000000 and r1 has the value as given in the table. If the instruction: ADD r0, r0, r1 is executed, will there be overflow? 2.9.5 [5] <2.4> Assume that register r0 = 0x70000000 and r1 has the value as given in the table. If the instruction: ADD r0, r0, r1 is executed, what is the result in hex? 2.9.6 [5] <2.4> Assume that register r0 = 0x70000000 and r1 has the value as given in the table. If the instruction: ADD r0, r0, r1 is executed, what is the result in base ten? Exercise 2.10 In the following problems, the data table contains bits that represent the opcode of an instruction. You will be asked to translate the entries into assembly code and determine what format of ARM instruction the bits represent. a. 1010 1110 0000 1011 0000 0000 0000 0100two b. 1000 1101 0000 1000 0000 0000 0100 0000two 2.10.1 [5] <2.5> For the binary entries above, what instruction do they represent? 2.10.2 [5] <2.5> What type (DP-type, DT-type) instruction do the binary entries above represent? 2.10.3 [5] <2.4, 2.5> If the binary entries above were data bits, what number would they represent in hexadecimal? 03-Ch02-P374750.indd 182 7/3/09 12:18:38 PM 2.21 Exercises 183 In the following problems, the data table contains ARM instructions. You will be asked to translate the entries into the bits of the opcode and determine what is the ARM instruction format. a. ADD r0, r0, r5 b. LDR r1, [r3, #4] 2.10.4 [5] <2.4, 2.5> For the instructions above, show the hexadecimal representation of these instructions. 2.10.5 [5] <2.5> What type (DP-type, DT-type) instruction do the instructions above represent? 2.10.6 [5] <2.5> What is the hexadecimal representation of the opcode, Rd, and Rn fields in this instruction? For DP-type instruction, what is the hexadecimal representation of the Rd, I and operand2 fields? Exercise 2.11 In the following problems, the data table contains bits that represent the opcode of an instruction. You will be asked to translate the entries into assembly code and determine what format of ARM instruction the bits represent. a. 0xE0842005 b. 0xE0423001 2.11.1 [5] <2.4, 2.5> What binary number does the above hexadecimal number represent? 2.11.2 [5] <2.4, 2.5> What decimal number does the above hexadecimal number represent? 2.11.3 [5] <2.5> What instruction does the above hexadecimal number represent? In the following problems, the data table contains the values of various fields of ARM instructions. You will be asked to determine what the instruction is, and find the ARM format for the instruction. a. Cond=14, F=1, opcode=25, Rn=6, Rd=5, operand2/offset=0 b. Cond=14, F=0, opcode=2, Rn=6, Rd=5, operand2/offset=0 03-Ch02-P374750.indd 183 7/3/09 12:18:38 PM 184 Chapter 2 Instructions: Language of the Computer 2.11.4 [5] <2.5> What type (DP-type, DT-type) instruction do the instructions above represent? 
2.11.5 [5] <2.5> What is the ARM assembly instruction described above? 2.11.6 [5] <2.4, 2.5> What is the binary representation of the instructions above? Exercise 2.12 In the following problems, the data table contains various modifications that could be made to the ARM instruction set architecture. You will investigate the impact of these changes on the instruction format of the ARM architecture. a. 8 registers & 10-bit immediate constant b. 10 bit offset 2.12.1 [5] <2.5> If the instruction set of the ARM processor is modified, the instruction format must also be changed. For each of the suggested changes above, show the size of the bit fields of an DP-type format instruction. What is the total number of bits needed for each instruction? 2.12.2 [5] <2.5> If the instruction set of the ARM processor is modified, the instruction format must also be changed. For each of the suggested changes above, show the size of the bit fields of an DT-type format instruction. What is the total number of bits needed for each instruction? 2.12.3 [5] <2.5, 2.10> Why could the suggested change in the table above decrease the size of a ARM assembly program? Why could the suggested change in the table above increase the size of a ARM assembly program? In the following problems, the data table contains hexadecimal values. You will be asked to determine what ARM instruction the value represents, and find the ARM instruction format. a. 0xE2801000 b. 0xE1801020 2.12.4 [5] <2.5> For the entries above, what is the value of the number in decimal? 2.12.5 [5] <2.5> For the hexadecimal entries above, what instruction do they represent? 2.12.6 [5] <2.4, 2.5> What type (DP-type, DT-type) instruction do the binary entries above represent? What is the value of the opcode field and the Rd field? 03-Ch02-P374750.indd 184 7/3/09 12:18:38 PM 2.21 Exercises 185 Exercise 2.13 In the following problems, the data table contains the values for registers r3 and r4. You will be asked to perform several ARM logical operations on these registers. a. r3 = 0x55555555, r4 = 0x12345678 b. r3 = 0xBEADFEED, r4 = 0xDEADFADE 2.13.1 [5] <2.6> For the lines above, what is the value of r5 for the following sequence of instructions: OR r5, r4, r3, LSL #4 2.13.2 [5] <2.6> For the values in the table above, what is the value of r2 for the following sequence of instructions: MVN r3, #1 AND r5, r3, r4, LSL #4 2.13.3 [5] <2.6> For the lines above, what is the value of r5 for the following sequence of instructions: MOV r5, 0xFFEF AND r5, r5, r3, LSR #3 In the following exercise, the data table contains various ARM logical operations. You will be asked to find the result of these operations given values for registers r0 and r1. a. ORR r2, r1, r0, LSL #1 b. AND r2, r1, r0, LSR #1 2.13.4 [5] <2.6> Assume that r0 = 0x0000A5A5 and r1 = 00005A5A. What is the value of r2 after the two instructions in the table? 2.13.5 [5] <2.6> Assume that r0 = 0xA5A50000 and r1 = A5A50000. What is the value of r2 after the two instructions in the table? 2.13.6 [5] <2.6> Assume that r0 = 0xA5A5FFFF and r1 = A5A5FFFF. What is the value of r2 after the two instructions in the table? 03-Ch02-P374750.indd 185 7/3/09 12:18:38 PM 186 Chapter 2 Instructions: Language of the Computer Exercise 2.14 The following figure shows the placement of a bit field in register r0. 
31 i j 0 Field 31 – i bits i – j bits j bits In the following problems, you will be asked to write ARM instructions to extract the bits “Field” from register r0 and place them into register r1 at the location indicated in the following table. a. i–j 31 000…000 b. 31 Field 14 + i – j bits 000…000 14 Field 0 000…000 2.14.1 [20] <2.6> Find the shortest sequence of ARM instructions that extracts a field from r0 for the constant values i = 22 and j = 5 and places the field into r1 in the format shown in the data table. 2.14.2 [5] <2.6> Find the shortest sequence of ARM instructions that extracts a field from r0 for the constant values i = 4 and j = 0 and places the field into r1 in the format shown in the data table. 2.14.3 [5] <2.6> Find the shortest sequence of ARM instructions that extracts a field from r0 for the constant values i = 31 and j = 28 and places the field into r1 in the format shown in the data table. In the following problems, you will be asked to write ARM instructions to extract the bits “Field” from register r0 shown in the figure and place them into register $t1 at the location indicated in the following table. The bits shown as “XXX” are to remain unchanged. a. i–j 31 XXX…XXX b. 31 14 + i – j bits XXX…XXX 03-Ch02-P374750.indd 186 Field 14 Field 0 XXX…XXX 7/3/09 12:18:38 PM 2.21 Exercises 187 2.14.4 [20] <2.6> Find the shortest sequence of ARM instructions that extracts a field from r0 for the constant values i = 17 and j = 11 and places the field into r1 in the format shown in the data table. 2.14.5 [5] <2.6> Find the shortest sequence of ARM instructions that extracts a field from r0 for the constant values i = 5 and j = 0 and places the field into r1 in the format shown in the data table. 2.14.6 [5] <2.6> Find the shortest sequence of ARM instructions that extracts a field from r0 for the constant values i = 31 and j = 29 and places the field into r1 in the format shown in the data table. Exercise 2.15 For these problems, the table holds some logical operations that are not included in the ARM instruction set. How can these instructions be implemented? a. ANDN r1, r2, r3 // bit-wise AND of r2, !r3 b. XNOR r1, r2, r3 // bit-wise exclusive-NOR 2.15.1 [5] <2.6> The logical instructions above are not included in the ARM instruction set, but are described above. If the value of r2 = 0x00FFA5A5 and the value of r3 = 0xFFFF003C, what is the result in r1? 2.15.2 [10] <2.6> The logical instructions above are not included in the ARM instruction set, but can be synthesized using one or more ARM assembly instructions. Provide a minimal set of ARM instructions that may be used in place of the instructions in the table above. 2.15.3 [5] <2.6> For your sequence of instructions in 2.15.2, show the bit-level representation of each instruction. Various C-level logical statements are shown in the table below. In this exercise, you will be asked to evaluate the statements and implement these C statements using ARM assembly instructions. a. A = B & C[0]; b. A = A ? B : C[0] 2.15.4 [5] <2.6> The table above shows different C statements that use logical operators. If the memory location at C[0] contains the integer value 0x00001234, and the initial integer value of A and B are 0x00000000 and 0x00002222, what is the result value of A? 03-Ch02-P374750.indd 187 7/3/09 12:18:39 PM 188 Chapter 2 Instructions: Language of the Computer 2.15.5 [5] <2.6> For the C statements in the table above, write a minimal sequence of ARM assembly instructions that does the identical operation. 
2.15.6 [5] <2.6> For your sequence of instructions in 2.15.5, show the bit-level representation of each instruction.

Exercise 2.16
For these problems, the table holds various binary values for register r0. Given the value of r0, you will be asked to evaluate the outcome of different branches.

a. 1010 1101 0001 0000 0000 0000 0000 0010two
b. 1111 1111 1111 1111 1111 1111 1111 1111two

2.16.1 [5] <2.7> Suppose that register r0 contains a value from above and r1 has the value 0011 1111 1111 1000 0000 0000 0000 0000two. What is the value of r2 after the following instructions?

          MOV r2, #0
          CMP r0, r1
          BGE ELSE
          B   DONE
    ELSE: MOV r2, #2
    DONE:

2.16.2 [5] <2.7> Suppose that register r0 contains a value from above and r1 has the value 0011 1111 1111 1000 0000 0000 0000 0000two. What is the value of r2 after the following instructions?

          MOV r2, #0
          CMP r0, r1
          BLO ELSE
          B   DONE
    ELSE: MOV r2, #2
    DONE:

2.16.3 [5] <2.7> Rewrite the above code using ARM's conditional instructions.

For these problems, the table holds various binary values for register r0. Given the value of r0, you will be asked to evaluate the outcome of different branches.

a. 0x00001000
b. 0x20001400

2.16.4 [5] <2.7> Suppose that register r0 contains a value from above. What is the value of r2 after the following instructions?

          MOV r2, #0
          CMP r0, r0
          BLT ELSE
          B   DONE
    ELSE: ADD r2, r2, #2
    DONE:

2.16.5 [5] <2.6, 2.7> Suppose that register r0 contains a value from above. What is the value of r2 after the following instructions?

          MOV r2, #0
          CMP r0, r0
          BHI ELSE
          B   DONE
    ELSE: ADD r2, r2, #2
    DONE:

Exercise 2.17
For these problems, several instructions that are not included in the ARM instruction set are shown.

a. ABS r2, r3        ; r2 = |r3|
b. SGT r1, r2, r3    ; R[r1] = (R[r2] > R[r3]) ? 1 : 0

2.17.1 [5] <2.7> The table above contains some instructions not included in the ARM instruction set and the description of each instruction. Why are these instructions not included in the ARM instruction set?

2.17.2 [5] <2.7> The table above contains some instructions not included in the ARM instruction set and the description of each instruction. If these instructions were to be implemented in the ARM instruction set, what is the most appropriate instruction format?

2.17.3 [5] <2.7> For each instruction in the table above, find the shortest sequence of ARM instructions that performs the same operation.

For these problems, the table holds ARM assembly code fragments. You will be asked to evaluate each of the code fragments, familiarizing you with the different ARM branch instructions.

a.  LOOP: CMP r1, #0
          BLT ELSE
          B   DONE
    ELSE: ADD r3, r3, #2
          SUB r1, r1, #1
          B   LOOP
    DONE:

b.  LOOP:  MOV r2, #0xA
    LOOP2: ADD r4, r4, #2
           SUB r2, r2, #1
           CMP r1, #0
           BNE LOOP2
           SUB r1, r1, #1
           BNE LOOP
    DONE:

2.17.4 [5] <2.7> For the loops written in ARM assembly above, assume that the register r1 is initialized to the value 10. What is the value in register r2, assuming that r2 is initially zero?

2.17.5 [5] <2.7> For each of the loops above, write the equivalent C code routine. Assume that the registers r3, r4, r1, and r2 are integers A, B, i, and temp, respectively.

2.17.6 [5] <2.7> For the loops written in ARM assembly above, assume that the register r1 is initialized to the value N. How many ARM instructions are executed?

Exercise 2.18
For these problems, the table holds some C code.
You will be asked to evaluate these C code statements in ARM assembly code. a. for(i=0; i<10; i++) a += b; b. while (a < 10){ D[a] = b + a; a += 1; } 2.18.1 [5] <2.7> For the table above, draw a control-flow graph of the C code. 2.18.2 [5] <2.7> For the table above, translate the C code to ARM assembly code. Use a minimum number of instructions. Assume that the value a, b, i, j are in registers r0, r1, r3, r3, respectively. Also, assume that register r1 holds the base address of the array D. 2.18.3 [5] <2.7> How many ARM instructions does it take to implement the C code? If the variables a and b are initialized to 10 and 1 and all elements of D are initially 0, what is the total number of ARM instructions that is executed to complete the loop? 03-Ch02-P374750.indd 190 7/3/09 12:18:39 PM 2.21 Exercises 191 For these problems, the table holds ARM assembly code fragments. You will be asked to evaluate each of the code fragments, familiarizing you with the different ARM branch instructions. a. MOV LOOP: LDR ADD ADD SUB CMP BNE r1, #100 r3, [r2, #0] r4, r4, r3 r2, r2, #4 r1, r1, #1 r1, #0 LOOP b. ADD LOOP: LDR ADD LDR ADD ADD CMP BNE r1, r2, #400 r3, [r2, #0] r4, r4, r3 r3, [r2, #0] r4, r4, r3 r2, r2, #8 r1, r2 LOOP 2.18.4 [5] <2.7> What is the total number of ARM instructions executed? 2.18.5 [5] <2.7> Translate the loops above into C. Assume that the C-level integer i is held in register r1, r4 holds the C-level integer called result, and r2 holds the base address of the integer MemArray. 2.18.6 [5] <2.7> Rewrite the loop in ARM assembly to reduce the number of ARM instructions executed. Exercise 2.19 For the following problems, the table holds C code functions. Assume that the first function listed in the table is called first. You will be asked to translate these C code routines into ARM Asembly. a. int compare(int a, int b) { if (sub(a, b) >= 0) return 1; else return 0; } int sub (int a, int b) { return a–b; } b. int fib_iter(int a, int b, int n){ if(n == 0) return b; else return fib_iter(a+b, a, n–1); } 03-Ch02-P374750.indd 191 7/3/09 12:18:39 PM 192 Chapter 2 Instructions: Language of the Computer 2.19.1 [15] <2.8> Implement the C code in the table in ARM assembly. What is the total number of ARM instructions needed to execute the function? 2.19.2 [5] <2.8> Functions can often be implemented by compilers “in-line”. An in-line function is when the body of the function is copied into the program space, allowing the overhead of the function call to be eliminated. Implement an “in-line” version of the C code in the table in ARM assembly. What is the reduction in the total number of ARM assembly instructions needed to complete the function? Assume that the C variable n is initialized to 5. 2.19.3 [5] <2.8> For each function call, show the contents of the stack after the function call is made. Assume the stack pointer is originally at addresss 0x7ffffffc, and follow the register conventions as specified in Figure 2.11. The following three problems in this exercise refer to a function f that calls another function func. The code for C function func is already compiled in another module using the ARM calling convention from Figure 2.14. The function declaration for func is “int func(int a, int b);”. The code for function f is as follows: a. int f(int a, int b, int c){ return func(func(a,b),c); } b. int f(int a, int b, int c){ return func(a,b)+func(b,c); } 2.19.4 [10] <2.8> Translate function f into ARM assembler, also using the ARM calling convention from Figure 2.14. 
If you need to use registers r4 through r11, use the lower-numbered registers first.

2.19.5 [5] <2.8> Can we use the tail-call optimization in this function? If no, explain why not. If yes, what is the difference in the number of executed instructions in f with and without the optimization?

2.19.6 [5] <2.8> Right before your function f from Problem 2.19.4 returns, what do we know about the contents of the registers and sp? Keep in mind that we know what the entire function f looks like, but for function func we only know its declaration.

Exercise 2.20
This exercise deals with recursive procedure calls. For the following problems, the table has an assembly code fragment that computes the factorial of a number. However, the entries in the table have errors, and you will be asked to fix these errors.

a.  FACT: SUB R1,R1,#0
          SUB SP,SP,#8
          STR LR,[SP,#4]
          STR R0,[SP,#8]
          CMP R0,#1
          BGE L1
          MOV R1,#1
          ADD SP,SP,#8
          MOV PC,LR
    L1:   SUB R0,R0,#1
          BL  FACT
          LDR R0,[SP,#4]
          LDR LR,[SP,0]
          ADD SP,SP,#4
          MUL R1,R0,R1
          MOV PC,LR

b.  FACT: SUB SP,SP,#8
          STR LR,[SP,#4]
          STR R0,[SP,#8]
          CMP R0,#1
          BGE L1
          MOV R1,#1
          ADD SP,SP,#8
          MOV PC,LR
    L1:   SUB R0,R0,#1
          BL  FACT
          LDR R0,[SP,#4]
          LDR LR,[SP,0]
          ADD SP,SP,#4
          MUL R1,R0,R1
          MOV PC,LR

2.20.1 [5] <2.8> The ARM assembly program above computes the factorial of a given input. The integer input is passed through register r0, and the result is returned in register r1. In the assembly code, there are a few errors. Correct the ARM errors.

2.20.2 [10] <2.8> For the recursive factorial ARM program above, assume that the input is 4. Rewrite the factorial program to operate in a nonrecursive manner. What is the total number of instructions used to execute your solution from 2.20.2 versus the recursive version of the factorial program?

2.20.3 [5] <2.8> Show the contents of the stack after each function call, assuming that the input is 4.

For the following problems, the table has an assembly code fragment that computes a Fibonacci number. However, the entries in the table have errors, and you will be asked to fix these errors.

a.  FIB:  SUB sp,sp,#12
          STR lr,[sp,#0]
          STR r2,[sp,#4]
          STR r1,[sp,#8]
          CMP r1,#1
          BGE L1
          MOV r0,r1
          B   EXIT
    L1:   SUB r1,r1,#1
          BL  FIB
          MOV r2,r0
          SUB r1,r1,#1
          BL  FIB
          ADD r0,r0,r2
    EXIT: LDR lr,[sp,#0]
          LDR r1,[sp,#8]
          LDR r2,[sp,#4]
          ADD sp,sp,#12
          MOV pc,lr

b.  FIB:  SUB sp,sp,#12
          STR lr,[sp,#0]
          STR r2,[sp,#4]
          STR r1,[sp,#8]
          CMP r1,#1
          BGE L1
          MOV r0,r1
          B   EXIT
    L1:   SUB r1,r1,#1
          BL  FIB
          MOV r2,r0
          SUB r1,r1,#1
          BL  FIB
          ADD r0,r0,r2
    EXIT: LDR lr,[sp,#0]
          LDR r1,[sp,#8]
          LDR r2,[sp,#4]
          ADD sp,sp,#12
          MOV pc,lr

2.20.4 [5] <2.8> The ARM assembly program above computes the Fibonacci number of a given input. The integer input is passed through register r1, and the result is returned in register r0. In the assembly code, there are a few errors. Correct the ARM errors.

2.20.5 [10] <2.8> For the recursive Fibonacci ARM program above, assume that the input is 4. Rewrite the Fibonacci program to operate in a nonrecursive manner. What is the total number of instructions used to execute your solution from 2.20.5 versus the recursive version of the Fibonacci program?

2.20.6 [5] <2.8> Show the contents of the stack after each function call, assuming that the input is 4.
03-Ch02-P374750.indd 194 7/3/09 12:18:39 PM 2.21 Exercises 195 Exercise 2.21 Assume that the stack and the static data segments are empty and that the stack and global pointers start at address 0x7fff fffc and 0x1000 8000, respectively. Assume the calling conventions as specified in Figure 2.11 and that function inputs are passed using registers r0 and returned in register r1. a. main() { leaf_function(1); } int leaf_function (int f) { int result; result = f + 1; if (f > 5) return result; leaf_function(result); } b. int my_global = 100; main() { int x = 10; int y = 20; int z; z = my_function(x, my_global) } int my_function(int x, int y) { return x – y; } 2.21.1 [5] <2.8> Show the contents of the stack and the static data segments after each function call. 2.21.2 [5] <2.8> Write ARM code for the code in the table above. 2.21.3 [5] <2.8> If the leaf function could use temporary registers, write the ARM code for the code in the table above. The following three problems in this exercise refer to this function, written in ARM assembler following the calling conventions from Figure 2.14: a. f: 03-Ch02-P374750.indd 195 SUB ADD SUB MOV r4,r0,r3 r4,r2,r4 r4,r4,r1 pc,lr LSL #1 7/3/09 12:18:39 PM 196 Chapter 2 b. Instructions: Language of the Computer f: ADD sp,sp,#8 STR lr,[sp,#4] STR r3,[sp,#0] MOV r3,r2 BL g ADD r0,r0,r3 LDR lr,[sp,#4] LDR r3,[sp,#0] SUB sp,sp,#8 MOV pc,lr 2.21.4 [10] <2.8> This code contains a mistake that violates the ARM calling convention. What is this mistake and how should it be fixed? 2.21.5 [10] <2.8> What is the C equivalent of this code? Assume that the function’s arguments are named a, b, c, etc. in the C version of the function. 2.21.6 [10] <2.8> At the point where this function is called register r0, r1, r2, and r3 have values 1, 100, 1000, and 30, respectively. What is the value returned by this function? If another function g is called from f, assume that the value returned from g is always 500. Exercise 2.22 This exercise explores ASCII and Unicode conversion. The following table shows strings of characters. a. A byte b. computer 2.22.1 [5] <2.9> Translate the strings into decimal ASCII byte values. 2.22.2 [5] <2.9> Translate the strings into 16-bit Unicode (using hex notation and the Basic Latin character set). The following table shows hexadecimal ASCII character values. a. 61 64 64 b. 73 68 69 66 74 2.22.3 [5] <2.5, 2.9> Translate the hexadecimal ASCII values to text. 03-Ch02-P374750.indd 196 7/3/09 12:18:39 PM 2.21 Exercises 197 Exercise 2.23 In this exercise, you will be asked to write a ARM assembly program that converts strings into the number format as specified in the table. a. positive integer decimal strings b. 2’s complement hexadecimal integers 2.23.1 [10] <2.9> Write a program in ARM assembly language to convert an ASCII number string with the conditions listed in the table above, to an integer. Your program should expect register r0 to hold the address of a null-terminated string containing some combination of the digits 0 through 9. Your program should compute the integer value equivalent to this string of digits, then place the number in register r1. If a non-digit character appears anywhere in the string, your program should stop with the value –1 in register r1. For example, if register r0 points to a sequence of three bytes 50ten, 52ten, 0ten (the null-terminated string “24”), then when the program stops, register r1 should contain the value 24ten. 
Exercise 2.24 Assume that the register r1 contains the address 0x1000 0000 and the register r2 contains the address 0x1000 0010. a. LDRB r0, [r1,#0] STRH r0, [r2,#0] b. LDRB r0, [r1,#0] STRB r0, [r2,#0] 2.24.1 [5] <2.9> Assume that the data (in hexadecimal) at address 0x1000 0000 is: 1000 0000 12 34 56 78 What value is stored at the address pointed to by register r2? Assume that the memory location pointed to r2 is initialized to 0xFFFF FFFF. 2.24.2 [5] <2.9> Assume that the data (in hexadecimal) at address 0x1000 0000 is: 1000 0000 80 80 80 80 What value is stored at the address pointed to by register r2? Assume that the memory location pointed to r2 is initialized to 0x0000 0000. 2.24.3 [5] <2.9> Assume that the data (in hexadecimal) at address 0x1000 0000 is: 1000 0000 03-Ch02-P374750.indd 197 11 00 00 FF 7/3/09 12:18:39 PM 198 Chapter 2 Instructions: Language of the Computer What value is stored at the address pointed to by register r2? Assume that the memory location pointed to r2 is initialized to 0x5555 5555. Exercise 2.25 In this exercise, you will explore 32-bit constants in ARM. For the following problems, you will be using the binary data in the table below. a. 1010 1101 0001 0000 0000 0000 0000 0010two b. 1111 1111 1111 1111 1111 1111 1111 1111two 2.25.1 [10] <2.10> Write the ARM code that creates the 32-bit constants listed above and stores that value to register r1 2.25.2 [5] <2.6, 2.10> If the current value of the PC is 0x00000000, write the instruction(s), to get to the PC address shown in the table above. 2.25.3 [10] <2.10> If the immediate field of an ARM instruction was only 8 bits wide, would it be possible to create 32-bit constants ? If so, write the ARM code to do that. 2.25.4 [5] <2.10> How would you create the one’s complement of these numbers? 2.25.5 [5] <2.10> Write the ARM machine code for the instructions used in 2.25.1. Exercise 2.26 For this exercise, you will explore the addressing modes in ARM. Consider the four addressing modes given in the table below. a. Register offset b. Scaled register offset c Pre-indexed d Post-indexed 2.26.1 [5] <2.10> Give an example ARM instruction for each of the above ARM addressing modes. 2.26.2 [5] <2.10> For the instructions in 2.26.1, what is the instruction format type used for the given instruction? 03-Ch02-P374750.indd 198 7/3/09 12:18:39 PM 2.21 Exercises 199 2.26.3 [5] <2.10> List benefits and drawbacks of each ARM addressing mode. Write ARM code that shows these benefits and drawbacks. For the following problems, you will use the instructions and information given below to determine the effect of the different addressing modes. LDR LDR LDR STR r0,[r1,#4] r2,[r0,r3, LSL#2] r3,[r1,#4]! r0,[r2],#4 r0=0x0000000 r1=0x00009000 r2=0x00009004 r3=0x00000002 mem[0x00000000]= 0x01010101 mem[0x00000004]= 0x04040404 mem[0x00000008]= 0x05050505 mem[0x00009000]= 0x02020202 mem[0x00009004]= 0x00000000 mem[0x00009008]= 0x06060606 2.26.4 [10] <2.10> What will be the values in the registers r0, r1, r2 and r3 when the above instructions are executed? 2.26.5 [10] <2.10> What will be the values in the memory locations given above? Exercise 2.27 For this exercise, you will explore the addressing modes in ARM. Consider the four addressing modes given in the table below. a. Immediate b. Scaled register c Scaled register Pre-indexed d PC-relative 2.27.1 [5] <2.10> Give an example ARM instruction for each of the above ARM addressing modes. 
2.27.2 [5] <2.10> For the instructions in 2.27.1, what is the instruction format type used for the given instruction? 2.27.3 [5] <2.10> List benefits and drawbacks of each ARM addressing mode. Write ARM code that shows these benefits and drawbacks. 03-Ch02-P374750.indd 199 7/3/09 12:18:39 PM 200 Chapter 2 Instructions: Language of the Computer Exercise 2.28 The following table contains ARM assembly code for a lock. try: MOV SWP CMP BEQ LDR ADD STR SWP R3,#1 R2,R3,[R1,#0] R2,#1 try R4,[R2,#0] R3,R4,#1 R3,[R2,#0] R2,R3,[R1,#0] 2.28.1 [5] <2.11> For each test and fail of the “swp”, how many instructions need to be executed? 2.28.2 [5] <2.11> For the swp-based lock-unlock code above, explain why this code may fail. 2.28.3 [15] <2.11> Re-write the code above so that the code may operate correct. Be sure to avoid any race conditions. Each entry in the following table has code and also shows the contents of various registers. The notation, “(r1)” shows the contents of a memory location pointed to by register r1. The assembly code in each table is executed in the cycle shown on parallel processors with a shared memory space. a. Processor 1 Processor 1 Processor 2 MEM Processor 2 Cycle r2 r3 (r1) r2 r3 0 1 2 99 30 40 Processor 1 MEM SWP R2,R3, [R1,#0] 1 SWP R2,R3,[R1,#0] 2 b. Processor 1 Processor 2 try: try: MOV R3,#1 MOV R3,#1 SWP R2,R3,[R1,#0] SWP R2,R3,[R1,#0] 03-Ch02-P374750.indd 200 r2 r3 (r1) r2 r3 0 2 3 1 10 20 1 2 3 CMP R2,#1 BEQ try Processor 2 Cycle 4 CMP R2,#1 5 BEQ try 6 7/3/09 12:18:39 PM 2.21 Exercises 201 2.28.4 [5] <2.11> Fill out the table with the value of the registers for each given cycle. Exercise 2.29 The first three problems in this exercise refer to a critical section of the form lock(lk); operation unlock(lk); where the “operation” updates the shared variable shvar using the local (nonshared) variable x as follows: Operation a. shvar=shvar+x; b. shvar=min(shvar,x); 2.29.1 [10] <2.11> Write the ARM assembler code for this critical section, assuming that the address of the lk variable is in r1, the address of the shvar variable is in r4, and the value of variable x is in r5. Your critical section should not contain any function calls, i.e., you should include the ARM instructions for lock(), unlock(), max(), and min() operations. Use swp instructions to implement the lock() operation, and the unlock() operation is simply an ordinary store instruction. 2.29.2 [10] <2.11> Repeat problem 2.29.1, but this time use swp to perform an atomic update of the shvar variable directly, without using lock() and unlock(). Note that in this problem there is no variable lk. 2.29.3 [10] <2.11> Compare the best-case performance of your code from 2.29.1 and 2.29.2, assuming that each instruction takes one cycle to execute. Note: best-case means that swp always succeeds, the lock is always free when we want to lock(), and if there is a branch we take the path that completes the operation with fewer executed instructions. 2.29.4 [10] <2.11> Using your code from 2.29.2 as an example, explain what happens when two processors begin to execute this critical section at the same time, assuming that each processor executes exactly one instruction per cycle. 2.29.5 [10] <2.11> Explain why in your code from 2.29.2 register r4 contains the address of variable shvar and not the value of that variable, and why register r5 contains the value of variable x and not its address. 
03-Ch02-P374750.indd 201 7/3/09 12:18:39 PM 202 Chapter 2 Instructions: Language of the Computer 2.29.6 [10] <2.11> If we want to atomically perform the same operation on two shared variables (e.g., shvar1 and shvar2) in the same critical section, we can do this easily using the approach from 2.29.1 (simply put both updates between the lock operation and the corresponding unlock operation). Explain why we cannot do this using the approach from 2.29.2., i.e., why we cannot use swp to access both shared variables in a way that guarantees that both updates are executed together as a single atomic operation. Exercise 2.30 Assembler pseudoinstructions are not a part of the ARM instruction set, but often appear in ARM programs. The table below contains some ARM pseudoinstructions that, when assembled, are translated to other ARM assembly instructions. a. LDR r0,#constant 2.30.1 [5] <2.12> For each pseudo instruction in the table above, give atleast two different sequence of actual ARM instructions to accomplish the same thing. 2.30.2 [5] <2.12> If the constant value is FFF0, which of these would you choose? Exercise 2.31 The table below contains the link-level details of two different procedures. In this exercise, you will be taking the place of the linker. Procedure A a. Procedure B Address Instruction 0 LDR r0, [r3, #0] 0 STR r1, [r3, #0] 4 BL 0 4 BL 0 … … … … Data Segment 0 (X) 0 (Y) … … Data Segment … … Relocation Info Address Instruction Type Dependency Address Instruction Type Dependency 0 LDR X Relocation Info 0 STR Y 4 BL B 4 BL A Text Segment Symbol Table 03-Ch02-P374750.indd 202 Address Symbol — X — B Text Segment Symbol Table Address Instruction Address Symbol — Y — A 7/3/09 12:18:39 PM 2.21 b. Procedure A Text Segment Address Instruction 0 LDR r0, [r3,#0] 4 Procedure B Text Segment Address Instruction 0 STR r0, [r3,#0] ORR r1, r0, #0 4 B 0 8 BL 0 … … … … 0x180 MOV pc, lr … … 0 (Y) … … Address Instruction Type Dependency 0 STR Y 4 B FOO Data Segment 0 (X) … … Relocation Info Address Instruction Type Dependency 0 LDR X 4 ORR X 8 BL B Address Symbol — X — B Symbol Table 203 Exercises Data Segment Relocation Info Symbol Table Address Symbol — Y 0x180 FOO 2.31.1 [5] <2.12> Link the object files above to form the executable file header. Assume that Procedure A has a text size of 0x140, data size of 0x40 and Procedure B has a text size of 0x300 and data size of 0x50. Also assume the memory allocation strategy as shown in Figure 2.13. 2.31.2 [5] <2.12> What limitations, if any, are there on the size of an executable? Exercise 2.32 The first three problems in this exercise assume that function swap, instead of the code in Figure 2.22, is defined in C as follows: a. void swap(int v[], int k, int j){ int temp; temp=v[k]; v[k]=v[j]; v[j]=temp; } b. void swap(int *p){ int temp; temp=*p; *p=*(p+1); *(p+1)=*p; } 03-Ch02-P374750.indd 203 7/3/09 12:18:39 PM 204 Chapter 2 Instructions: Language of the Computer 2.32.1 [10] <2.13> Translate this function into ARM assembler code. 2.32.2 [5] <2.13> What needs to change in the sort function? 2.32.3 [5] <2.13> If we were sorting 8-bit bytes, not 32-bit words, how would your ARM code for swap in 2.32.1 change? Exercise 2.33 The problems in this exercise refer to the following function, given as array code: a. int find(int a[], int n, int x){ int i; for(i=0;i!=n;i++) if(a[i]==x) return i; return –1; } b. 
int count(int a[], int n, int x){ int res=0; int i; for(i=0;i!=n;i++) if(a[i]==x) res=res+1; return res; } 2.33.1 [10] <2.14> Translate this function into ARM assembly. 2.33.2 [10] <2.14> Convert this function into pointer-based code (in C). 2.33.3 [10] <2.14> Translate your pointer-based C code from 2.33.2 into ARM assembly. 2.33.4 [5] <2.14> Compare the worst-case number of executed instructions per nonlast loop iteration in your array-based code from 2.33.1 and your pointer-based code from 2.33.3. Note: the worst-case occurs when branch conditions are such that the longest path through the code is taken, i.e., if there is an if statement, the result of the condition check is such that the path with more instructions is taken. However, if the result of the condition check would cause the loop to exit, then we assume that the path that keeps us in the loop is taken. 2.33.5 [5] <2.14> Compare the number of registers needed for your array-based code from 2.33.1 and for your pointer-based code from 2.33.3. 03-Ch02-P374750.indd 204 7/3/09 12:18:39 PM 2.21 Exercises 205 Exercise 2.34 The table below contains ARM assembly code. In the following problems, you will translate ARM assembly code to MIPS. a. LOOP: b. MOV ADD SUBS BNE r0, ;10 r0, r1 r0, 1 LOOP ;init loop counter to 10 ;add r1 to r0 ;decrement counter ;if Z=0 repeat loop ROR r1, r2, #4 ;r1 = r23:0 concatenated with r231:4 2.34.1 [5] <2.16> For the table above, translate this ARM assembly code to MIPS assembly code. Assume that ARM registers r0, r1, and r2 hold the same values as MIPS registers $s0, $s1, and $s2, respectively. Use MIPS temporary registers ($t0, etc.) where necessary. 2.34.2 [5] <2.16> For the MIPS assembly instructions in 2.34.1, indicate the instruction types. The table below contains MIPS assembly code. In the following problems, you will translate MIPS assembly code to ARM. a. slt $t0, $s0, $s1 blt $t0, $0, FARAWAY b. add $s0, $s1, $s2 2.34.3 [5] <2.16> For the table above, find the ARM assembly code that corresponds to the sequence of MIPS assembly code. 2.34.4 [5] <2.16> Show the bit fields that represent the ARM assembly code. Exercise 2.35 The ARM processor has a few different addressing modes that are not supported in MIPS. The following problems explore how these addressing modes can be realized on MIPS. a. LDR r0, [r1] ; r0 = memory[r1] b. LDMIA r0, {r1, r2, r4} ; r1 = memory[r0], r2 = memory[r0+4] ; r4 = memory[r0+8] 2.35.1 [5] <2.16> Identify the type of addressing mode of the ARM assembly instructions in the table above. 03-Ch02-P374750.indd 205 7/3/09 12:18:39 PM 206 Chapter 2 Instructions: Language of the Computer 2.35.2 [5] <2.16> For the ARM assembly instructions above, write a sequence of ARM assembly instructions to accomplish the same data transfer. In the following problems, you will compare code written using the ARM and ARM instruction sets. The following table shows code written in the ARM instruction set. a. ADDLP: b. LDR LDR EOR LDR ADD ADD SUBS BNE r0, =Table1 r1, #100 r2, r2, r2 r4, [r0] r2, r2, r4 r0, r0, #4 r1, r1, #1 ADDLP ;load base address of table ;initialize loop counter ;clear r2 ;get first addition operand ;add to r2 ;increment to next table element ;decrement loop counter ;if loop counter != 0, go to ADDLP ROR r1, r2, #4 ;r1 = r23:0 concatenated with r231:4 2.35.3 [10] <2.16> For the ARM assembly code above, write an equivalent MIPS assembly code routine. 2.35.4 [5] <2.16> What is the total number of ARM assembly instructions required to execute the code? 
What is the total number of MIPS assembly instructions required to execute the code? 2.35.5 [5] <2.16> Assuming that the average CPI of the MIPS assembly routine is the same as the average CPI of the ARM assembly routine, and the MIPS processor has an operation frequency that is 1.5 times of the ARM processor, how much faster is the ARM processor than the MIPS processor? Exercise 2.36 The ARM processor has an interesting way of supporting immediate constants. This exercise investigates those differences. The following table contains ARM instructions. a. ADD, r3, r2, r1, LSL #3 ;r3 = r2 + (r1 << 3) b. ADD, r3, r2, r1, ROR #3 ;r3 = r2 + (r1, rotated_right 3 bits) 2.36.1 [5] <2.16> Write the equivalent MIPS code for the ARM assembly code above. 2.36.2 [5] <2.16> If the register R1 had the constant value of 8, re-write your MIPS code to minimize the number of MIPS assembly instructions needed. 2.36.3 [5] <2.16> If the register R1 had the constant value of 0x06000000, rewrite your MIPS code to minimize the number of MIPS assembly instructions needed. 03-Ch02-P374750.indd 206 7/3/09 12:18:39 PM 2.21 Exercises 207 The following table contains MIPS instructions. a. addi r3, r2, 0x1 b. addi r3, r2, 0x8000 2.36.4 [5] <2.16> For the MIPS assembly code above, write the equivalent ARM assembly code. Exercise 2.37 This exercise explores the differences between the ARM and x86 instruction sets. The following table contains x86 assembly code. mov edx, [esi+4*ebx] a. b. START: mov mov mov and or ax, cx, bx, ax, ax, 00101100b 00000011b 11110000b bx cx 2.37.1 [10] <2.17> Write pseudo code for the given routine. 2.37.2 [10] <2.17> What is the equivalent ARM for the given routine? The following table contains x86 assembly instructions. a. mov edx, [esi+4*ebx] b. add eax, 0x12345678 2.37.3 [5] <2.17> For each assembly instruction, show the size of each of the bit fields that represent the instruction. Treat the label MY_FUNCTION as a 32-bit constant. 2.37.4 [10] <2.17> Write equivalent ARM assembly statements. Exercise 2.38 The x86 instruction set includes the REP prefix that causes the instruction to be repeated a given number of times or until a condition is satisfied. The first three problems in this exercise refer to the following x86 instruction: 03-Ch02-P374750.indd 207 7/3/09 12:18:40 PM 208 Chapter 2 Instructions: Language of the Computer Instruction Interpretation a. REP MOVSB Repeat until ECX is zero: Mem8[EDI]=Mem8[ESI], EDI=EDI+1, ESI=ESI+1, ECX=ECX–1 b. REP MOVSD Repeat until ECX is zero: Mem32[EDI]=Mem32[ESI], EDI=EDI+4, ESI=ESI+4, ECX=ECX–1 2.38.1 [5] <2.17> What would be a typical use for this instruction? 2.38.2 [5] <2.17> Write ARM code that performs the same operation, assuming that r0 corresponds to ECX, r1 to EDI, r2 to ESI, and r3 to EAX. 2.38.3 [5] <2.17> If the x86 instruction takes one cycle to read memory, one cycle to write memory, and one cycle for each register update, and if ARM takes one cycle per instruction, what is the speed-up of using this x86 instruction instead of the equivalent ARM code when ECX is very large? Assume that the clock cycle time for x86 and ARM is the same. The remaining three problems in this exercise refer to the following function, given in both C and x86 assembly. For each x86 instruction, we also show its length in the x86 variable-length instruction format and the interpretation (what the instruction does). 
Note that the x86 architecture has very few registers compared to ARM, and as a result the x86 calling convention is to push all arguments onto the stack. The return value of an x86 function is passed back to the caller in the EAX register. C code x86 code a. int f(int a, int b){ return a+b; } f: push %ebp mov %esp,%ebp mov 0xc(%ebp),%eax add 0x8(%ebp),%eax pop %ebp ret ; ; ; ; ; ; 1B, 2B, 3B, 3B, 1B, 1B, push %ebp to stack move %esp to %ebp load 2nd arg to %eax add 1st arg to %eax restore %ebp return b. void f(int *a, int *b){ *a=*a+*b; *b=*a; } f: push %ebp mov %esp,%ebp mov 8(%ebp),%eax mov 12(%ebp),%ecx mov (%eax),%edx add (%ecx),%edx mov %edx,(%eax) mov %edx,(%ecx) pop %ebp ret ; ; ; ; ; ; ; ; ; ; 1B, 2B, 3B, 3B, 2B, 2B, 2B, 2B, 1B, 1B, push %ebp to stack move %esp to %ebp load 1st arg into %eax load 2nd arg into %ecx load *a into %edx add *b to %edx store %edx to *a store %edx to *b restore %ebp return 2.38.4 [5] <2.17> Translate this function into ARM assembly. Compare the size (how many bytes of instruction memory are needed) for this x86 code and for your ARM code. 03-Ch02-P374750.indd 208 7/3/09 12:18:40 PM 2.21 Exercises 209 2.38.5 [5] <2.17> If the processor can execute two instructions per cycle, it must at least be able to read two consecutive instructions in each cycle. Explain how it would be done in ARM and how it would be done in x86. 2.38.6 [5] <2.17> If each ARM instruction takes one cycle, and if each x86 instruction takes one cycle plus a cycle for each memory read or write it has to perform, what is the speed-up of using x86 instead of ARM? Assume that the clock cycle time is the same in both x86 and ARM, and that the execution takes the shortest possible path through the function (i.e., every loop is exited immediately and every if statement takes the direction that leads toward the return from the function). Note that x86 ret instruction reads the return address from the stack. Exercise 2.39 The CPI of the different instruction types is given in the following table. Arithmetic Load/Store Branch a. 2 10 3 b. 1 10 4 2.39.1 [5] <2.18> Assume the following instruction breakdown given for executing a given program: Instructions (in millions) Arithmetic 500 Load/Store 300 Branch 100 What is the execution time for the processor if the operation frequency is 5 GHz? 2.39.2 [5] <2.18> Suppose that new, more powerful arithmetic instructions are added to the instruction set. On average, through the use of these more powerful arithmetic instructions, we can reduce the number of arithmetic instructions needed to execute a program by 25%, and the cost of increasing the clock cycle time by only 10%. Is this a good design choice? Why? 2.39.3 [5] <2.18> Suppose that we find a way to double the performance of arithmetic instructions? What is the overall speed-up of our machine? What if we find a way to improve the performance of arithmetic instructions by 10 times!? The following table shows the proportions of instruction execution for the different instruction types. 03-Ch02-P374750.indd 209 7/3/09 12:18:40 PM 210 Chapter 2 Instructions: Language of the Computer Arithmetic Load/Store Branch a. 60% 20% 20% b. 80% 15% 5% 2.39.4 [5] <2.18> Given the instruction mix above and the assumption that an arithmetic instruction requires 2 cycles, a load/store instruction takes 6 cycles, and a branch instruction takes 3 cycles, find the average CPI. 
2.39.5 [5] <2.18> For a 25% improvement in performance, how many cycles, on average, may an arithmetic instruction take if load/store and branch instructions are not improved at all? 2.39.6 [5] <2.18> For a 50% improvement in performance, how many cycles, on average, may an arithmetic instruction take if load/store and branch instructions are not improved at all? Exercise 2.40 The first three problems in this exercise refer to the following function, given in ARM assembly. Unfortunately, the programmer of this function has fallen prey to the pitfall of assuming that ARM is a word addressed machine, but in fact ARM is byte-addressed. a. b. 03-Ch02-P374750.indd 210 ; int f(int a[],int f: MOV R4,#0 MOV R5,#0 L: ADD R6,R5,R0 LDR R6,[R6,#0] CMP R6,R2 BNE S ADD R4,R4,#1 S: ADD R5,R5,#1 CMP R5,R1 BNE L MOV R0,R4 MOV PC,LR ; void f: MOV MOV ADD L: LDR LDR ADD STR ADD ADD CMP BNE MOV n,int x); ; ret=0 ; i=0 ; &(a[i]) ; read a[i] ; if( a[i]==x) ; ret++ ; i++ ; repeat if i!=n ; return ret f(int *a,int *b,int n); R4,R0 ; p=a R5,R1 ; q=b R6,R2,R0 ; &(a[n]) R7,[R4,#0] ; read *p R8,[R5,#0] ; read *q R7,R7,R8 ; *p + *q R7,[R4,#0] ; *p = *p + *q R4,R4,#1 ; p=p+1 R5,R5,#1 ; q=q+1 R4,R6,#1 ; repeat if p!= &(a[n]) L PC,LR ; return 7/3/09 12:18:40 PM 2.21 Exercises 211 Note that in ARM assembly the “;” character denotes that the remainder of the line is a comment. 2.40.1 [5] <2.18> The ARM architecture requires word-sized accesses (LDR and STR) to be word-aligned, i.e. the lowermost 2 bits of the address must both be zero. Explain how this alignment requirement affects the execution of this function. 2.40.2 [5] <2.18> If “a” was a pointer to the beginning of an array of one-byte elements, and if we replaced LDR and STR with LDRB (load byte) and STRB (store byte), respectively, would this function be correct? 2.40.3 [5] <2.18> Change this code to make it correct for 32-bit integers. The remaining three problems in this exercise refer to a program that allocates memory for an array, fills the array with some numbers, calls the sort function from Figure 2.25, and then prints out the array. The main function of the program is as follows (given as both C and ARM code): main code in C main(){ int *v; int n=5; v=my_alloc(5); my_init(v,n); sort(v,n); . . . ARM version of the main code main: MOV R9,#5 MOV R0,R9 BL my_alloc MOV R10,R0 MOV R0,R10 MOV R1,R9 BL my_init MOV R0,R10 MOV R1,R9 BL sort The my_alloc function is defined as follows (given as both C and ARM code). Note that the programmer of this function has fallen prey to the pitfall of using a pointer to an automatic variable arr outside the function in which it is defined. my_alloc in C int *my_alloc(int n){ int arr[n]; return arr; } 03-Ch02-P374750.indd 211 ARM code for my_alloc my_alloc: SUB SP,SP,4 STR R11,[SP,#0] MOV R11,SP LSL R4,R0,R2 SUB SP,SP,R4 MOV R0,SP MOV SP,R11 LDR R11,[SP,#0] ADD SP,SP,#4 MOV PC,LR ; ; ; ; ; ; ; ; ; push r11 to stack save sp in r11 We need 4*n bytes Make room for arr Return address of arr Restore sp from r11 pop r11 from stack 7/3/09 12:18:40 PM 212 Chapter 2 Instructions: Language of the Computer The my_init function is defined as follows (ARM code): a. b. 
my_init: MOV R4,#0 MOV R5,R0 L: MOV R6,#0 STR R6,[R5,#0] ADD R5,R5,#4 ADD R4,R4,#1 CMP R4,R1 BNE L MOV PC,LR my_init: MOV R4,#0 MOV R5,R0 L: SUB R6,R1,R4 STR R6,[R5,#0] ADD R5,R5,#4 ADD R4,R4,#1 CMP R4,R1 BNE L MOV PC,LR ; i=0 ; v[i]=0 ; i=i+1 ; untill i==n ; i=0 ; a[i]=n-i ; i=i+1 ; until i==n 2.40.4 [5] <2.18> What are the contents (values of all five elements) of array v right before the “BL sort” instruction in the main code is executed? 2.40.5 [15] <2.18, 2.13> What are the contents of array v right before the sort function enters its outer loop for the first time? Assume that registers sp, r9, and r10 have values of 0x1000, 20, and 40, respectively, at the beginning of the main code. 2.40.6 [10] <2.18, 2.13> What are the contents of the 5-element array pointed by v right after “BL sort” returns to the main code? Answers to Check Yourself 03-Ch02-P374750.indd 212 §2.2, page 80: ARM, C, Java §2.3, page 86: 2) Very slow §2.4, page 92: 3) –8ten §2.5, page 99: 4) sub r2, r0, r1 §2.6, page 104: Both. AND with a mask pattern of 1s will leaves 0s everywhere but the desired field. Shifting left by the right amount removes the bits from the left of the field. Shifting right by the appropriate amount puts the field into the rightmost bits of the word, with 0s in the rest of the word. Note that AND leaves the field where it was originally, and the shift pair moves the field into the rightmost part of the word. 7/3/09 12:18:40 PM 2.21 Exercises 213 §2.7, page 112: I. All are true. II. 1). §2.8, page 122: Both are true. §2.9, page 127: I. 2) II. 3) §2.11, page 134: Both are true. §2.12, page 143: 4) Machine independence. 03-Ch02-P374750.indd 213 7/3/09 12:18:40 PM 3 Arithmetic for Computers Numerical precision is the very soul of science. Sir D’arcy Wentworth Thompson On Growth and Form, 1917 3.1 Introduction 216 3.2 Addition and Subtraction 216 3.3 Multiplication 220 3.4 Division 3.5 Floating Point 232 3.6 Parallelism and Computer Arithmetic: 226 Associativity 258 04-Ch03-P374750.indd 214 3.7 Real Stuff: Floating Point in the x86 259 3.8 Fallacies and Pitfalls 262 7/3/09 9:00:45 AM 3.9 Concluding Remarks 265 3.10 Historical Perspective and Further Reading 268 3.11 Exercises 269 The Five Classic Components of a Computer 04-Ch03-P374750.indd 215 7/3/09 9:00:45 AM 216 Chapter 3 Arithmetic for Computers 3.1 Introduction Computer words are composed of bits; thus, words can be represented as binary numbers. Chapter 2 shows that integers can be represented either in decimal or binary form, but what about the other numbers that commonly occur? For example: ■ What about fractions and other real numbers? ■ What happens if an operation creates a number bigger than can be represented? ■ And underlying these questions is a mystery: How does hardware really multiply or divide numbers? The goal of this chapter is to unravel these mysteries including representation of real numbers, arithmetic algorithms, hardware that follows these algorithms, and the implications of all this for instruction sets. These insights may explain quirks that you have already encountered with computers. Subtraction: Addition’s Tricky Pal No. 10, Top Ten Courses for Athletes at a Football Factory, David Letterman et al., Book of Top Ten Lists, 1990 3.2 Addition and Subtraction Addition is just what you would expect in computers. Digits are added bit by bit from right to left, with carries passed to the next digit to the left, just as you would do by hand. 
Subtraction uses addition: the appropriate operand is simply negated before being added. Binary Addition and Subtraction EXAMPLE Let’s try adding 6ten to 7ten in binary and then subtracting 6ten from 7ten in binary. 0000 0000 0000 0000 0000 0000 0000 0111two = 7ten + 0000 0000 0000 0000 0000 0000 0000 0110two = 6ten = 0000 0000 0000 0000 0000 0000 0000 1101two = 13ten The 4 bits to the right have all the action; Figure 3.1 shows the sums and carries. The carries are shown in parentheses, with the arrows showing how they are passed. 04-Ch03-P374750.indd 216 7/3/09 9:00:46 AM 3.2 217 Addition and Subtraction Subtracting 6ten from 7ten can be done directly: ANSWER – 0000 0000 0000 0000 0000 0000 0000 0111two = 7ten 0000 0000 0000 0000 0000 0000 0000 0110two = 6ten = 0000 0000 0000 0000 0000 0000 0000 0001two = 1ten or via addition using the two’s complement representation of −6: + 0000 0000 0000 0000 0000 0000 0000 0111two = 7ten 1111 1111 1111 1111 1111 1111 1111 1010two = –6ten = 0000 0000 0000 0000 0000 0000 0000 0001two = 1ten (0) (0) (1) (1) (0) 0 0 0 1 1 0 0 0 1 1 . . . (0) 0 (0) 0 (0) 1 (1) 1 (1) 0 ... ... (Carries) 1 0 (0) 1 FIGURE 3.1 Binary addition, showing carries from right to left. The rightmost bit adds 1 to 0, resulting in the sum of this bit being 1 and the carry out from this bit being 0. Hence, the operation for the second digit to the right is 0 + 1 + 1. This generates a 0 for this sum bit and a carry out of 1. The third digit is the sum of 1 + 1 + 1, resulting in a carry out of 1 and a sum bit of 1. The fourth bit is 1 + 0 + 0, yielding a 1 sum and no carry. Recall that overflow occurs when the result from an operation cannot be represented with the available hardware, in this case a 32-bit word. When can overflow occur in addition? When adding operands with different signs, overflow cannot occur. The reason is the sum must be no larger than one of the operands. For example, −10 + 4 = −6. Since the operands fit in 32 bits and the sum is no larger than an operand, the sum must fit in 32 bits as well. Therefore, no overflow can occur when adding positive and negative operands. There are similar restrictions to the occurrence of overflow during subtract, but it’s just the opposite principle: when the signs of the operands are the same, overflow cannot occur. To see this, remember that x − y = x + (−y) because we subtract by negating the second operand and then add. Therefore, when we subtract operands of the same sign we end up by adding operands of different signs. From the prior paragraph, we know that overflow cannot occur in this case either. Knowing when overflow cannot occur in addition and subtraction is all well and good, but how do we detect it when it does occur? Clearly, adding or subtracting two 32-bit numbers can yield a result that needs 33 bits to be fully expressed. The lack of a 33rd bit means that when overflow occurs, the sign bit is set with the value of the result instead of the proper sign of the result. Since we need just one extra bit, only the sign bit can be wrong. Hence, overflow occurs when adding two 04-Ch03-P374750.indd 217 7/3/09 9:00:46 AM 218 Chapter 3 Arithmetic for Computers positive numbers and the sum is negative, or vice versa. This means a carry out occurred into the sign bit. Overflow occurs in subtraction when we subtract a negative number from a positive number and get a negative result, or when we subtract a positive number from a negative number and get a positive result. This means a borrow occurred from the sign bit. 
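These sign-bit rules translate directly into a few lines of C. The sketch below is only an illustration of the conditions just described and is not from the text; the helper names add_overflows and sub_overflows are invented here, and the arithmetic is done on unsigned copies of the operands so that it wraps around exactly as a 32-bit adder would (signed overflow itself is undefined behavior in C).

    #include <stdint.h>
    #include <stdio.h>

    /* Detect two's complement overflow for 32-bit add and subtract by
       checking only the signs of the operands and the result. */
    static int add_overflows(int32_t x, int32_t y) {
        int32_t sum = (int32_t)((uint32_t)x + (uint32_t)y);  /* wraps like the hardware */
        /* Overflow only if both operands have the same sign and the
           sum's sign is different from theirs. */
        return ((x >= 0) == (y >= 0)) && ((sum >= 0) != (x >= 0));
    }

    static int sub_overflows(int32_t x, int32_t y) {
        int32_t diff = (int32_t)((uint32_t)x - (uint32_t)y);
        /* Overflow only if the operands have different signs and the
           difference's sign is different from the minuend x. */
        return ((x >= 0) != (y >= 0)) && ((diff >= 0) != (x >= 0));
    }

    int main(void) {
        printf("%d\n", add_overflows(0x70000000, 0x70000000)); /* 1: positive + positive gave a negative sum */
        printf("%d\n", sub_overflows(0x7FFFFFFF, -1));          /* 1: positive - negative gave a negative difference */
        return 0;
    }

Notice that the checks look only at the sign bits of the operands and the result, which is exactly the information the hardware uses.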
Figure 3.2 shows the combination of operations, operands, and results that indicate an overflow. We have just seen how to detect overflow for two’s complement numbers in a computer. What about overflow with unsigned integers? Unsigned integers are commonly used for memory addresses where overflows are ignored. The computer designer must therefore provide a way to ignore overflow in some cases and to recognize it in others. The ARM solution is to have two conditional branches that test for overflow: BVS (branch if overflow set) and BVC (branch if overflow clear. The arithmetic instruction just needs to append S to set the condition flags before the branch. Because C ignores overflows, the ARM C compilers would not set the condition flag and branch on overflow. The ARM Fortran compilers, however, would add the test if the operands were signed integers. Arithmetic Logic Unit (ALU) Hardware that performs addition, subtraction, and usually logical operations such as AND and OR. Result indicating overflow Operation Operand A Operand B A+B A+B A–B A–B ≥0 ≥0 <0 <0 <0 ≥0 ≥0 ≥0 FIGURE 3.2 ≥0 <0 <0 <0 Overflow conditions for addition and subtraction. Appendix C describes the hardware that performs addition and subtraction, which is called an Arithmetic Logic Unit or ALU. Arithmetic for Multimedia Since every desktop microprocessor by definition has its own graphical displays, as transistor budgets increased it was inevitable that support would be added for graphics operations. Many graphics systems originally used 8 bits to represent each of the three primary colors plus 8 bits for a location of a pixel. The addition of speakers and microphones for teleconferencing and video games suggested support of sound as well. Audio samples need more than 8 bits of precision, but 16 bits are sufficient. Every microprocessor has special support so that bytes and halfwords take up less space when stored in memory (see Section 2.9), but due to the infrequency of arithmetic operations on these data sizes in typical integer programs, there is little support beyond data transfers. Architects recognized that many graphics and audio applications would perform the same operation on vectors of this data. By partitioning the carry chains within a 64-bit adder, a processor could 04-Ch03-P374750.indd 218 7/3/09 9:00:46 AM 3.2 219 Addition and Subtraction perform simultaneous operations on short vectors of eight 8-bit operands, four 16-bit operands, or two 32-bit operands. The cost of such partitioned adders was small. These extensions have been called vector or SIMD, for single instruction, multiple data (see Section 2.17 and Chapter 7). One feature not generally found in general-purpose microprocessors is saturating operations. Saturation means that when a calculation overflows, the result is set to the largest positive number or most negative number, rather than a modulo calculation as in two’s complement arithmetic. Saturation is likely what you want for media operations. For example, the volume knob on a radio set would be frustrating if, as you turned, it would get continuously louder for a while and then immediately very soft. A knob with saturation would stop at the highest volume no matter how far you turned it. Figure 3.3 shows arithmetic and logical operations found in many multimedia extensions to modern instruction sets. 
Instruction category Unsigned add/subtract Operands Eight 8-bit or Four 16-bit Saturating add/subtract Eight 8-bit or Four 16-bit Max/min/minimum Eight 8-bit or Four 16-bit Average Eight 8-bit or Four 16-bit Shift right/left Eight 8-bit or Four 16-bit FIGURE 3.3 Summary of multimedia support for desktop computers. Summary A major point of this section is that, independent of the representation, the finite word size of computers means that arithmetic operations can create results that are too large to fit in this fixed word size. It’s easy to detect overflow in unsigned numbers, although these are almost always ignored because programs don’t want to detect overflow for address arithmetic, the most common use of natural numbers. Two’s complement presents a greater challenge, yet some software systems require detection of overflow, so today all computers have a way to detect it. The rising popularity of multimedia applications led to arithmetic instructions that support narrower operations that can easily operate in parallel. Some programming languages allow two’s complement integer arithmetic on variables declared byte and half. What ARM instructions would be used? Check Yourself 1. Load with LDRB, LDRH; arithmetic with ADD, SUB, MUL; then store using STRB, STRH. 2. Load with LDRBS, LDRHS; arithmetic with ADD, SUB, MUL; then store using STRB, STRH. 3. LDRBS, LDRHS; arithmetic with ADD, SUB, MUL; using AND to mask result to 8 or 16 bits after each operation; then store using STRB, STRH. 04-Ch03-P374750.indd 219 7/3/09 9:00:46 AM 220 Chapter 3 Arithmetic for Computers Elaboration: The speed of addition is increased by determining the carry in to the high-order bits sooner. There are a variety of schemes to anticipate the carry so that the worst-case scenario is a function of the log 2 of the number of bits in the adder. These anticipatory signals are faster because they go through fewer gates in sequence, but it takes many more gates to anticipate the proper carry. The most popular is carry lookahead, which Section C.6 in Appendix C on the CD describes. Multiplication is vexation, Division is as bad; The rule of three doth puzzle me, And practice drives me mad. Anonymous, Elizabethan manuscript, 1570 3.3 Multiplication Now that we have completed the explanation of addition and subtraction, we are ready to build the more vexing operation of multiplication. First, let’s review the multiplication of decimal numbers in longhand to remind ourselves of the steps of multiplication and the names of the operands. For reasons that will become clear shortly, we limit this decimal example to using only the digits 0 and 1. Multiplying 1000ten by 1001ten: Multiplicand Multiplier x Product 1000ten 1001ten 1000 0000 0000 1000 1001000ten The first operand is called the multiplicand and the second the multiplier. The final result is called the product. As you may recall, the algorithm learned in grammar school is to take the digits of the multiplier one at a time from right to left, multiplying the multiplicand by the single digit of the multiplier, and shifting the intermediate product one digit to the left of the earlier intermediate products. The first observation is that the number of digits in the product is considerably larger than the number in either the multiplicand or the multiplier. In fact, if we ignore the sign bits, the length of the multiplication of an n-bit multiplicand and an m-bit multiplier is a product that is n + m bits long. 
That is, n + m bits are required to represent all possible products. Hence, like add, multiply must cope with overflow because we frequently want a 32-bit product as the result of multiplying two 32-bit numbers. In this example, we restricted the decimal digits to 0 and 1. With only two choices, each step of the multiplication is simple: 1. Just place a copy of the multiplicand (1 × multiplicand) in the proper place if the multiplier digit is a 1, or 2. Place 0 (0 × multiplicand) in the proper place if the digit is 0. 04-Ch03-P374750.indd 220 7/3/09 9:00:46 AM 3.3 Multiplication 221 Although the decimal example above happens to use only 0 and 1, multiplication of binary numbers must always use 0 and 1, and thus always offers only these two choices. Now that we have reviewed the basics of multiplication, the traditional next step is to provide the highly optimized multiply hardware. We break with tradition in the belief that you will gain a better understanding by seeing the evolution of the multiply hardware and algorithm through multiple generations. For now, let’s assume that we are multiplying only positive numbers. Sequential Version of the Multiplication Algorithm and Hardware This design mimics the algorithm we learned in grammar school; Figure 3.4 shows the hardware. We have drawn the hardware so that data flows from top to bottom to resemble more closely the paper-and-pencil method. Let’s assume that the multiplier is in the 32-bit Multiplier register and that the 64-bit Product register is initialized to 0. From the paper-and-pencil example above, it’s clear that we will need to move the multiplicand left one digit each step, as it may be added to the intermediate products. Over 32 steps, a 32-bit multiplicand would move 32 bits to the left. Hence, we need a 64-bit Multiplicand register, initialized with the 32-bit multiplicand in the right half and zero in the left half. This register is then shifted left 1 bit each step to align the multiplicand with the sum being accumulated in the 64-bit Product register. Multiplicand Shift left 64 bits Multiplier Shift right 64-bit ALU 32 bits Product Write Control test 64 bits FIGURE 3.4 First version of the multiplication hardware. The Multiplicand register, ALU, and Product register are all 64 bits wide, with only the Multiplier register containing 32 bits. ( Appendix C describes ALUs.) The 32-bit multiplicand starts in the right half of the Multiplicand register and is shifted left 1 bit on each step. The multiplier is shifted in the opposite direction at each step. The algorithm starts with the product initialized to 0. Control decides when to shift the Multiplicand and Multiplier registers and when to write new values into the Product register. 04-Ch03-P374750.indd 221 7/3/09 9:00:46 AM 222 Chapter 3 Arithmetic for Computers Figure 3.5 shows the three basic steps needed for each bit. The least significant bit of the multiplier (Multiplier0) determines whether the multiplicand is added to the Product register. The left shift in step 2 has the effect of moving the intermediate operands to the left, just as when multiplying with paper and pencil. The shift right in step 3 gives us the next bit of the multiplier to examine in the following iteration. These three steps are repeated 32 times to obtain the product. If each step took a clock cycle, this algorithm would require almost 100 clock cycles to multiply two 32-bit numbers. The relative importance of arithmetic operations like multiply Start Multiplier0 = 1 1. 
Test Multiplier0 Multiplier0 = 0 1a. Add multiplicand to product and place the result in Product register 2. Shift the Multiplicand register left 1 bit 3. Shift the Multiplier register right 1 bit No: < 32 repetitions 32nd repetition? Yes: 32 repetitions Done FIGURE 3.5 The first multiplication algorithm, using the hardware shown in Figure 3.4. If the least significant bit of the multiplier is 1, add the multiplicand to the product. If not, go to the next step. Shift the multiplicand left and the multiplier right in the next two steps. These three steps are repeated 32 times. 04-Ch03-P374750.indd 222 7/3/09 9:00:46 AM 3.3 223 Multiplication varies with the program, but addition and subtraction may be anywhere from 5 to 100 times more popular than multiply. Accordingly, in many applications, multiply can take multiple clock cycles without significantly affecting performance. Yet Amdahl’s law (see Section 1.8) reminds us that even a moderate frequency for a slow operation can limit performance. This algorithm and hardware are easily refined to take 1 clock cycle per step. The speed-up comes from performing the operations in parallel: the multiplier and multiplicand are shifted while the multiplicand is added to the product if the multiplier bit is a 1. The hardware just has to ensure that it tests the right bit of the multiplier and gets the preshifted version of the multiplicand. The hardware is usually further optimized to halve the width of the adder and registers by noticing where there are unused portions of registers and adders. Figure 3.6 shows the revised hardware. Replacing arithmetic by shifts can also occur when multiplying by constants. Some compilers replace multiplies by short constants with a series of shifts and adds. Because one bit to the left represents a number twice as large in base 2, shifting the bits left has the same effect as multiplying by a power of 2. As mentioned in Chapter 2, almost every compiler will perform the strength reduction optimization of substituting a left shift for a multiply by a power of 2. Hardware/ Software Interface Multiplicand 32 bits 32-bit ALU Product Shift right Write Control test 64 bits FIGURE 3.6 Refined version of the multiplication hardware. Compare with the first version in Figure 3.4. The Multiplicand register, ALU, and Multiplier register are all 32 bits wide, with only the Product register left at 64 bits. Now the product is shifted right. The separate Multiplier register also disappeared. The multiplier is placed instead in the right half of the Product register. These changes are highlighted in color. (The Product register should really be 65 bits to hold the carry out of the adder, but it’s shown here as 64 bits to highlight the evolution from Figure 3.4.) 04-Ch03-P374750.indd 223 7/3/09 9:00:47 AM 224 Chapter 3 Arithmetic for Computers A Multiply Algorithm EXAMPLE ANSWER Using 4-bit numbers to save space, multiply 2ten × 3ten, or 0010two × 0011two. Figure 3.7 shows the value of each register for each of the steps labeled according to Figure 3.5, with the final value of 0000 0110two or 6ten. Color is used to indicate the register values that change on that step, and the bit circled is the one examined to determine the operation of the next step. Signed Multiplication So far, we have dealt with positive numbers. The easiest way to understand how to deal with signed numbers is to first convert the multiplier and multiplicand to positive numbers and then remember the original signs. 
The algorithms should then be run for 31 iterations, leaving the signs out of the calculation. As we learned in grammar school, we need to negate the product only if the original signs disagree. It turns out that the last algorithm will work for signed numbers, provided that we remember that we are dealing with numbers that have infinite digits, and we are only representing them with 32 bits. Hence, the shifting steps would need to extend the sign of the product for signed numbers. When the algorithm completes, the lower word would have the 32-bit product.

Iteration 0, initial values: Multiplier 0011, Multiplicand 0000 0010, Product 0000 0000
Iteration 1, step 1a (1 ⇒ Prod = Prod + Mcand): Multiplier 0011, Multiplicand 0000 0010, Product 0000 0010
Iteration 1, step 2 (shift left Multiplicand): Multiplier 0011, Multiplicand 0000 0100, Product 0000 0010
Iteration 1, step 3 (shift right Multiplier): Multiplier 0001, Multiplicand 0000 0100, Product 0000 0010
Iteration 2, step 1a (1 ⇒ Prod = Prod + Mcand): Multiplier 0001, Multiplicand 0000 0100, Product 0000 0110
Iteration 2, step 2 (shift left Multiplicand): Multiplier 0001, Multiplicand 0000 1000, Product 0000 0110
Iteration 2, step 3 (shift right Multiplier): Multiplier 0000, Multiplicand 0000 1000, Product 0000 0110
Iteration 3, step 1 (0 ⇒ no operation): Multiplier 0000, Multiplicand 0000 1000, Product 0000 0110
Iteration 3, step 2 (shift left Multiplicand): Multiplier 0000, Multiplicand 0001 0000, Product 0000 0110
Iteration 3, step 3 (shift right Multiplier): Multiplier 0000, Multiplicand 0001 0000, Product 0000 0110
Iteration 4, step 1 (0 ⇒ no operation): Multiplier 0000, Multiplicand 0001 0000, Product 0000 0110
Iteration 4, step 2 (shift left Multiplicand): Multiplier 0000, Multiplicand 0010 0000, Product 0000 0110
Iteration 4, step 3 (shift right Multiplier): Multiplier 0000, Multiplicand 0010 0000, Product 0000 0110
FIGURE 3.7 Multiply example using algorithm in Figure 3.5. The bit examined to determine the next step is circled in color.

Faster Multiplication

Moore's law has provided so much more in resources that hardware designers can now build much faster multiplication hardware. Whether the multiplicand is to be added or not is known at the beginning of the multiplication by looking at each of the 32 multiplier bits. Faster multiplications are possible by essentially providing one 32-bit adder for each bit of the multiplier: one input is the multiplicand ANDed with a multiplier bit, and the other is the output of a prior adder. A straightforward approach would be to connect the outputs of adders on the right to the inputs of adders on the left, making a stack of adders 32 high. An alternative way to organize these 32 additions is in a parallel tree, as Figure 3.8 shows. Instead of waiting for 32 add times, we wait just log2(32), or five, 32-bit add times. Figure 3.8 shows how this is a faster way to connect them. In fact, multiply can go even faster than five add times because of the use of carry save adders (see Section C.6 in Appendix C) and because it is easy to pipeline such a design to be able to support many multiplies simultaneously (see Chapter 4).

Multiply in ARM

ARM provides a multiply instruction (MUL) that puts the lower 32 bits of the product into the destination register. Since it doesn't offer the upper 32 bits, there is no difference between signed and unsigned multiplication.

FIGURE 3.8 Fast multiplication hardware. Rather than use a single 32-bit adder 31 times, this hardware "unrolls the loop" to use 31 adders and then organizes them to minimize delay. (The figure shows a tree of 32-bit adders whose inputs are the multiplicand ANDed with each multiplier bit, Mplier0 • Mcand through Mplier31 • Mcand, producing Product0 through Product63.)

Summary

Multiplication hardware is simply shifts and add, derived from the paper-and-pencil method learned in grammar school.
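A software analogue of this summary, written as an illustrative sketch rather than code from the text (the function names mul_shift_add and mul_signed are invented here): one multiplier bit is examined per step, a shifted copy of the multiplicand is added when that bit is 1, and signs are handled by the "remember the signs" rule described above.

#include <stdint.h>

/* Unsigned shift-and-add multiply: 32 steps, one multiplier bit per step. */
static uint64_t mul_shift_add(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = 0;
    uint64_t mcand   = multiplicand;      /* will be shifted left each step  */
    for (int step = 0; step < 32; step++) {
        if (multiplier & 1)               /* test Multiplier0                */
            product += mcand;             /* step 1a: add multiplicand       */
        mcand <<= 1;                      /* step 2: shift multiplicand left */
        multiplier >>= 1;                 /* step 3: shift multiplier right  */
    }
    return product;
}

/* Signed multiply by the "convert to positive, remember the signs" rule. */
static int64_t mul_signed(int32_t a, int32_t b) {
    int negative = (a < 0) != (b < 0);
    uint64_t p = mul_shift_add((uint32_t)(a < 0 ? -(int64_t)a : a),
                               (uint32_t)(b < 0 ? -(int64_t)b : b));
    return negative ? -(int64_t)p : (int64_t)p;
}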
Compilers even use shift instructions for multiplications by powers of 2. Elaboration: The ARM multiply instruction ignores overflow, so it is up to the software to check to see if the product is too big to fit in 32 bits. Divide et impera. Latin for “Divide and rule,” ancient political maxim cited by Machiavelli, 1532 dividend A number being divided. divisor A number that the dividend is divided by. 3.4 Division The reciprocal operation of multiply is divide, an operation that is even less frequent and even more quirky. It even offers the opportunity to perform a mathematically invalid operation: dividing by 0. Let’s start with an example of long division using decimal numbers to recall the names of the operands and the grammar school division algorithm. For reasons similar to those in the previous section, we limit the decimal digits to just 0 or 1. The example is dividing 1,001,010ten by 1000ten: 1001ten Quotient Divisor 1000ten 1001010ten −1000 10 101 1010 −1000 10ten Dividend Remainder quotient The primary result of a division; a number that when multiplied by the divisor and added to the remainder produces the dividend. Divide’s two operands, called the dividend and divisor, and the result, called the quotient, are accompanied by a second result, called the remainder. Here is another way to express the relationship between the components: remainder The secondary result of a division; a number that when added to the product of the quotient and the divisor produces the dividend. where the remainder is smaller than the divisor. Infrequently, programs use the divide instruction just to get the remainder, ignoring the quotient. The basic grammar school division algorithm tries to see how big a number can be subtracted, creating a digit of the quotient on each attempt. Our carefully selected decimal example uses only the numbers 0 and 1, so it’s easy to figure out 04-Ch03-P374750.indd 226 Dividend = Quotient × Divisor + Remainder 7/3/09 9:00:47 AM 3.4 Division 227 how many times the divisor goes into the portion of the dividend: it’s either 0 times or 1 time. Binary numbers contain only 0 or 1, so binary division is restricted to these two choices, thereby simplifying binary division. Let’s assume that both the dividend and the divisor are positive and hence the quotient and the remainder are nonnegative. The division operands and both results are 32-bit values, and we will ignore the sign for now. A Division Algorithm and Hardware Figure 3.9 shows hardware to mimic our grammar school algorithm. We start with the 32-bit Quotient register set to 0. Each iteration of the algorithm needs to move the divisor to the right one digit, so we start with the divisor placed in the left half of the 64-bit Divisor register and shift it right 1 bit each step to align it with the dividend. The Remainder register is initialized with the dividend. Divisor Shift right 64 bits Quotient Shift left 64-bit ALU 32 bits Remainder Write Control test 64 bits FIGURE 3.9 First version of the division hardware. The Divisor register, ALU, and Remainder register are all 64 bits wide, with only the Quotient register being 32 bits. The 32-bit divisor starts in the left half of the Divisor register and is shifted right 1 bit each iteration. The remainder is initialized with the dividend. Control decides when to shift the Divisor and Quotient registers and when to write the new value into the Remainder register. Figure 3.10 shows three steps of the first division algorithm. 
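Before stepping through the hardware, note that C's integer / and % operators satisfy this relationship by definition, so it can be checked directly; this small example (with made-up operand values) is added here for illustration and is not from the text:

#include <stdio.h>

int main(void) {
    int dividend = 1001010, divisor = 1000;
    int quotient  = dividend / divisor;     /* 1001 */
    int remainder = dividend % divisor;     /*   10 */

    /* Dividend = Quotient x Divisor + Remainder must always hold. */
    printf("%d = %d * %d + %d (%s)\n", dividend, quotient, divisor, remainder,
           quotient * divisor + remainder == dividend ? "ok" : "mismatch");
    return 0;
}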
Unlike a human, the computer isn’t smart enough to know in advance whether the divisor is smaller than the dividend. It must first subtract the divisor in step 1; remember that this is how we performed the comparison in the set on less than instruction. If the result is positive, the divisor was smaller or equal to the dividend, so we generate a 1 in the quotient (step 2a). If the result is negative, the next step is to restore the original value by adding the divisor back to the remainder and generate a 0 in the quotient (step 2b). The divisor is shifted right and then we iterate again. The remainder and quotient will be found in their namesake registers after the iterations are complete. 04-Ch03-P374750.indd 227 7/3/09 9:00:47 AM 228 Chapter 3 Arithmetic for Computers Start 1. Subtract the Divisor register from the Remainder register and place the result in the Remainder register Remainder ≥ 0 Remainder < 0 Test Remainder 2a. Shift the Quotient register to the left, setting the new rightmost bit to 1 2b. Restore the original value by adding the Divisor register to the Remainder register and placing the sum in the Remainder register. Also shift the Quotient register to the left, setting the new least significant bit to 0 3. Shift the Divisor register right 1 bit No: < 33 repetitions 33rd repetition? Yes: 33 repetitions Done FIGURE 3.10 A division algorithm, using the hardware in Figure 3.9. If the remainder is positive, the divisor did go into the dividend, so step 2a generates a 1 in the quotient. A negative remainder after step 1 means that the divisor did not go into the dividend, so step 2b generates a 0 in the quotient and adds the divisor to the remainder, thereby reversing the subtraction of step 1. The final shift, in step 3, aligns the divisor properly, relative to the dividend for the next iteration. These steps are repeated 33 times. 04-Ch03-P374750.indd 228 7/3/09 9:00:47 AM 3.4 229 Division A Divide Algorithm Using a 4-bit version of the algorithm to save pages, let’s try dividing 7ten by 2ten, or 0000 0111two by 0010two. Figure 3.11 shows the value of each register for each of the steps, with the quotient being 3ten and the remainder 1ten. Notice that the test in step 2 of whether the remainder is positive or negative simply tests whether the sign bit of the Remainder register is a 0 or 1. The surprising requirement of this algorithm is that it takes n + 1 steps to get the proper quotient and remainder. EXAMPLE ANSWER This algorithm and hardware can be refined to be faster and cheaper. The speed-up comes from shifting the operands and the quotient simultaneously with the subtraction. This refinement halves the width of the adder and registers by noticing where there are unused portions of registers and adders. Figure 3.12 shows the revised hardware. Signed Division So far, we have ignored signed numbers in division. The simplest solution is to remember the signs of the divisor and dividend and then negate the quotient if the signs disagree. 
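C99 integer division follows exactly this rule: the quotient is truncated toward zero, so it is negated when the signs disagree, and the remainder takes the sign of the dividend, as the Elaboration below works out. A quick check of all four combinations, using illustrative values added here rather than code from the text:

#include <stdio.h>

int main(void) {
    int dividends[] = { 7, -7,  7, -7 };
    int divisors[]  = { 2,  2, -2, -2 };

    for (int i = 0; i < 4; i++) {
        int q = dividends[i] / divisors[i];   /* +3, -3, -3, +3 */
        int r = dividends[i] % divisors[i];   /* +1, -1, +1, -1 */
        printf("%+d / %+d: quotient %+d, remainder %+d\n",
               dividends[i], divisors[i], q, r);
    }
    return 0;
}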
Iteration 0 1 2 3 4 5 Step Initial values 1: Rem = Rem – Div 2b: Rem < 0 ⇒ +Div, sll Q, Q0 = 0 3: Shift Div right 1: Rem = Rem – Div 2b: Rem < 0 ⇒ +Div, sll Q, Q0 = 0 3: Shift Div right 1: Rem = Rem – Div 2b: Rem < 0 ⇒ +Div, sll Q, Q0 = 0 3: Shift Div right 1: Rem = Rem – Div 2a: Rem ≥ 0 ⇒ sll Q, Q0 = 1 3: Shift Div right 1: Rem = Rem – Div 2a: Rem ≥ 0 ⇒ sll Q, Q0 = 1 3: Shift Div right Quotient Divisor Remainder 0000 0000 0000 0010 0000 0010 0000 0010 0000 0000 0111 1110 0111 0000 0111 0000 0000 0000 0000 0000 0000 0000 0000 0001 0001 0001 0011 0011 0001 0000 0001 0000 0001 0000 0000 1000 0000 1000 0000 1000 0000 0100 0000 0100 0000 0100 0000 0010 0000 0010 0000 0010 0000 0001 0000 0111 1111 0111 0000 0111 0000 0111 1111 1111 0000 0111 0000 0111 0000 0011 0000 0011 0000 0011 0000 0001 0000 0001 0000 0001 FIGURE 3.11 Division example using the algorithm in Figure 3.10. The bit examined to determine the next step is circled in color. 04-Ch03-P374750.indd 229 7/3/09 9:00:47 AM 230 Chapter 3 Arithmetic for Computers Divisor 32 bits 32-bit ALU Remainder Shift right Shift left Write Control test 64 bits FIGURE 3.12 An improved version of the division hardware. The Divisor register, ALU, and Quotient register are all 32 bits wide, with only the Remainder register left at 64 bits. Compared to Figure 3.9, the ALU and Divisor registers are halved and the remainder is shifted left. This version also combines the Quotient register with the right half of the Remainder register. (As in Figure 3.6, the Remainder register should really be 65 bits to make sure the carry out of the adder is not lost.) Elaboration: The one complication of signed division is that we must also set the sign of the remainder. Remember that the following equation must always hold: Dividend = Quotient × Divisor + Remainder To understand how to set the sign of the remainder, let’s look at the example of dividing all the combinations of ±7ten by ±2ten. The first case is easy: +7 ÷ +2: Quotient = +3, Remainder = +1 Checking the results: 7 = 3 × 2 + (+1) = 6 + 1 If we change the sign of the dividend, the quotient must change as well: –7 ÷ +2: Quotient = –3 Rewriting our basic formula to calculate the remainder: Remainder = (Dividend – Quotient × Divisor) = –7 – (–3 × +2) = –7–(–6) = –1 So, –7 ÷ +2: Quotient = –3, Remainder = –1 Checking the results again: –7 = –3 × 2 + (–1) = – 6 – 1 The reason the answer isn’t a quotient of –4 and a remainder of +1, which would also fit this formula, is that the absolute value of the quotient would then change depending on the sign of the dividend and the divisor! Clearly, if –(x ÷ y) ≠ (–x) ÷ y 04-Ch03-P374750.indd 230 7/3/09 9:00:47 AM 3.4 Division 231 programming would be an even greater challenge. This anomalous behavior is avoided by following the rule that the dividend and remainder must have the same signs, no matter what the signs of the divisor and quotient. We calculate the other combinations by following the same rule: +7 ÷ –2: Quotient = –3, Remainder = +1 –7 ÷ –2: Quotient = +3, Remainder = –1 Thus the correctly signed division algorithm negates the quotient if the signs of the operands are opposite and makes the sign of the nonzero remainder match the dividend. Faster Division We used many adders to speed up multiply, but we cannot do the same trick for divide. The reason is that we need to know the sign of the difference before we can perform the next step of the algorithm, whereas with multiply we could calculate the 32 partial products immediately. 
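Before leaving the restoring algorithm, here is an illustrative C sketch (the name divide_restoring is invented for this sketch; it is not code from the text) that mirrors Figures 3.9 and 3.10 in software: the dividend starts in a 64-bit remainder register, the divisor starts in the upper half and shifts right each step, and one quotient bit is produced per iteration. Real hardware subtracts first and tests the sign, which is why the register is really 65 bits; to stay correct in portable C the sketch compares before subtracting, which selects the same quotient bit.

#include <stdint.h>

/* Restoring division in the spirit of Figure 3.10 (illustrative sketch). */
static void divide_restoring(uint32_t dividend, uint32_t divisor,
                             uint32_t *quotient, uint32_t *remainder) {
    uint64_t rem = dividend;                 /* Remainder register = dividend */
    uint64_t div = (uint64_t)divisor << 32;  /* divisor starts in left half   */
    uint32_t quo = 0;

    for (int step = 0; step < 33; step++) {  /* n + 1 = 33 repetitions        */
        if (rem >= div) {                    /* subtraction would be >= 0     */
            rem -= div;                      /* steps 1 and 2a                */
            quo = (quo << 1) | 1;            /* quotient bit = 1              */
        } else {
            quo = quo << 1;                  /* step 2b: quotient bit = 0     */
        }
        div >>= 1;                           /* step 3: shift divisor right   */
    }
    *quotient  = quo;                        /* e.g., 7 / 2 gives quotient 3  */
    *remainder = (uint32_t)rem;              /* and remainder 1               */
}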
There are techniques to produce more than one bit of the quotient per step. The SRT division technique tries to guess several quotient bits per step, using a table lookup based on the upper bits of the dividend and remainder. It relies on subsequent steps to correct wrong guesses. A typical value today is 4 bits. The key is guessing the value to subtract. With binary division, there is only a single choice. These algorithms use 6 bits from the remainder and 4 bits from the divisor to index a table that determines the guess for each step. The accuracy of this fast method depends on having proper values in the lookup table. The fallacy on page 263 in Section 3.8 shows what can happen if the table is incorrect.

Divide in ARM (or the Lack Thereof)

While there are many versions of ARM, the classic ARM instruction set had no divide instruction. That tradition continues with the ARMv7A, although the ARMv7R and ARMv7M include signed integer divide (SDIV) and unsigned integer divide (UDIV) instructions.

Summary

The common hardware support for multiply and divide allows ARM to provide a single pair of 32-bit registers that are used both for multiply and divide.

Elaboration: An even faster algorithm does not immediately add the divisor back if the remainder is negative. It simply adds the divisor to the shifted remainder in the following step, since (r + d) × 2 – d = r × 2 + d × 2 – d = r × 2 + d. This nonrestoring division algorithm, which takes 1 clock cycle per step, is explored further in the exercises; the algorithm here is called restoring division. A third algorithm, which does not save the result of the subtraction if it is negative, is called a nonperforming division algorithm. It averages one-third fewer arithmetic operations.

3.5 Floating Point

Speed gets you nowhere if you're headed the wrong way.
American proverb

Going beyond signed and unsigned integers, programming languages support numbers with fractions, which are called reals in mathematics. Here are some examples of reals:

3.14159265 . . .ten (pi)
2.71828 . . .ten (e)
0.000000001ten or 1.0ten × 10−9 (seconds in a nanosecond)
3,155,760,000ten or 3.15576ten × 109 (seconds in a typical century)

scientific notation A notation that renders numbers with a single digit to the left of the decimal point.
normalized A number in floating-point notation that has no leading 0s.
floating point Computer arithmetic that represents numbers in which the binary point is not fixed.

Notice that in the last case, the number didn't represent a small fraction, but it was bigger than we could represent with a 32-bit signed integer. The alternative notation for the last two numbers is called scientific notation, which has a single digit to the left of the decimal point. A number in scientific notation that has no leading 0s is called a normalized number, which is the usual way to write it. For example, 1.0ten × 10−9 is in normalized scientific notation, but 0.1ten × 10−8 and 10.0ten × 10−10 are not. Just as we can show decimal numbers in scientific notation, we can also show binary numbers in scientific notation: 1.0two × 2−1. To keep a binary number in normalized form, we need a base that we can increase or decrease by exactly the number of bits the number must be shifted to have one nonzero digit to the left of the decimal point. Only a base of 2 fulfills our need. Since the base is not 10, we also need a new name for decimal point; binary point will do fine.
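C99 can display this normalized binary form directly: the %a conversion prints a value as a hexadecimal significand times a power of 2. A small illustration added here (the exact number of digits printed may vary by library):

#include <stdio.h>

int main(void) {
    /* %a prints normalized binary scientific notation: 1.xxx x 2^y,
       with the significand written in hexadecimal digits. */
    printf("%a\n", 0.5);       /* 0x1p-1   : 1.0two x 2^-1      */
    printf("%a\n", -0.4375);   /* -0x1.cp-2 : -1.110two x 2^-2  */
    printf("%a\n", 6.0);       /* 0x1.8p+2 : 1.10two x 2^+2     */
    return 0;
}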
Computer arithmetic that supports such numbers is called floating point because it represents numbers in which the binary point is not fixed, as it is for integers. The programming language C uses the name float for such numbers. Just as in scientific notation, numbers are represented as a single nonzero digit to the left of the binary point. In binary, the form is 1.xxxxxxxxxtwo × 2yyyy (Although the computer represents the exponent in base 2 as well as the rest of the number, to simplify the notation we show the exponent in decimal.) A standard scientific notation for reals in normalized form offers three advantages. It simplifies exchange of data that includes floating-point numbers; it simplifies the floating-point arithmetic algorithms to know that numbers will always 04-Ch03-P374750.indd 232 7/3/09 9:00:48 AM 3.5 233 Floating Point be in this form; and it increases the accuracy of the numbers that can be stored in a word, since the unnecessary leading 0s are replaced by real digits to the right of the binary point. Floating-Point Representation A designer of a floating-point representation must find a compromise between the size of the fraction and the size of the exponent, because a fixed word size means you must take a bit from one to add a bit to the other. This trade-off is between precision and range: increasing the size of the fraction enhances the precision of the fraction, while increasing the size of the exponent increases the range of numbers that can be represented. As our design guideline from Chapter 2 reminds us, good design demands good compromise. Floating-point numbers are usually a multiple of the size of a word. The representation of a ARM floating-point number is shown below, where s is the sign of the floating-point number (1 meaning negative), exponent is the value of the 8-bit exponent field (including the sign of the exponent), and fraction is the 23-bit number. This representation is called sign and magnitude, since the sign is a separate bit from the rest of the number. 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 s exponent fraction 1 bit 8 bits 23 bits 8 fraction The value, generally between 0 and 1, placed in the fraction field. exponent In the numerical representation system of floating-point arithmetic, the value that is placed in the exponent field. 7 6 5 4 3 2 1 0 In general, floating-point numbers are of the form (−1)S × F × 2E F involves the value in the fraction field and E involves the value in the exponent field; the exact relationship to these fields will be spelled out soon. (We will shortly see that ARM does something slightly more sophisticated.) These chosen sizes of exponent and fraction give ARM computer arithmetic an extraordinary range. Fractions almost as small as 2.0ten × 10−38 and numbers almost as large as 2.0ten × 1038 can be represented in a computer. Alas, extraordinary differs from infinite, so it is still possible for numbers to be too large. Thus, overflow interrupts can occur in floating-point arithmetic as well as in integer arithmetic. Notice that overflow here means that the exponent is too large to be represented in the exponent field. Floating point offers a new kind of exceptional event as well. Just as programmers will want to know when they have calculated a number that is too large to be represented, they will want to know if the nonzero fraction they are calculating has become so small that it cannot be represented; either event could result in a program giving incorrect answers. 
To distinguish it from overflow, we call this event underflow. This situation occurs when the negative exponent is too large to fit in the exponent field. 04-Ch03-P374750.indd 233 overflow (floatingpoint) A situation in which a positive exponent becomes too large to fit in the exponent field. underflow (floatingpoint) A situation in which a negative exponent becomes too large to fit in the exponent field. 7/3/09 9:00:48 AM 234 Chapter 3 Hardware/ Software Interface exception Also called interrupt. An unscheduled event that disrupts program execution double precision A floating-point value represented in two 32-bit words. single precision A floating-point value represented in a single 32-bit word. Arithmetic for Computers One of the optional modes of the IEEE floating point standard is to cause an exception when underflow or overflow occurs. The programmer or the programming environment must then decide what to do. An exception, also called an interrupt on many computers, is essentially an unscheduled procedure call. The address of the instruction that overflowed is saved in a register, and the computer jumps to a predefined address to invoke the appropriate routine for that exception. The interrupted address is saved so that in some situations the program can continue after corrective code is executed. (Section 4.9 covers exceptions in more detail; Chapters 5 and 6 describe other situations where exceptions and interrupts occur.) One way to reduce chances of underflow or overflow is to offer another format that has a larger exponent. In C this number is called double, and operations on doubles are called double precision floating-point arithmetic; single precision floating point is the name of the earlier format. The representation of a double precision floating-point number takes two ARM words, as shown below, where s is still the sign of the number, exponent is the value of the 11-bit exponent field, and fraction is the 52-bit number in the fraction field. 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 s exponent fraction 1 bit 11 bits 20 bits 8 7 6 5 4 3 2 1 0 fraction (continued) 32 bits ARM double precision allows numbers almost as small as 2.0ten × 10−308 and almost as large as 2.0ten × 10308. Although double precision does increase the exponent range, its primary advantage is its greater precision because of the much larger significand. These formats go beyond ARM. They are part of the IEEE 754 floating-point standard, found in virtually every computer invented since 1980. This standard has greatly improved both the ease of porting floating-point programs and the quality of computer arithmetic. To pack even more bits into the significand, IEEE 754 makes the leading 1-bit of normalized binary numbers implicit. Hence, the number is actually 24 bits long in single precision (implied 1 and a 23-bit fraction), and 53 bits long in double precision (1 + 52). To be precise, we use the term significand to represent the 24- or 53-bit number that is 1 plus the fraction, and fraction when we mean the 23- or 52-bit number. Since 0 has no leading 1, it is given the reserved exponent value 0 so that the hardware won’t attach a leading 1 to it. 04-Ch03-P374750.indd 234 7/3/09 9:00:48 AM 3.5 Floating Point 235 Thus 00 . . . 
00two represents 0; the representation of the rest of the numbers uses the form from before with the hidden 1 added: (−1)S × (1 + Fraction) × 2E where the bits of the fraction represent a number between 0 and 1 and E specifies the value in the exponent field, to be given in detail shortly. If we number the bits of the fraction from left to right s1, s2, s3, . . . , then the value is (−1)S × (1 + (s1 × 2−1) + (s2 × 2−2) + (s3 × 2−3) + (s4 × 2−4) + …) × 2E Figure 3.13 shows the encodings of IEEE 754 floating-point numbers. Other features of IEEE 754 are special symbols to represent unusual events. For example, instead of interrupting on a divide by 0, software can set the result to a bit pattern representing +∞ or −∞; the largest exponent is reserved for these special symbols. When the programmer prints the results, the program will print an infinity symbol. (For the mathematically trained, the purpose of infinity is to form topological closure of the reals.) IEEE 754 even has a symbol for the result of invalid operations, such as 0/0 or subtracting infinity from infinity. This symbol is NaN, for Not a Number. The purpose of NaNs is to allow programmers to postpone some tests and decisions to a later time in the program when they are convenient. The designers of IEEE 754 also wanted a floating-point representation that could be easily processed by integer comparisons, especially for sorting. This desire is why the sign is in the most significant bit, allowing a quick test of less than, greater than, or equal to 0. (It’s a little more complicated than a simple integer sort, since this notation is essentially sign and magnitude rather than two’s complement.) Placing the exponent before the significand also simplifies the sorting of floating-point numbers using integer comparison instructions, since numbers with bigger exponents look larger than numbers with smaller exponents, as long as both exponents have the same sign. Negative exponents pose a challenge to simplified sorting. If we use two’s complement or any other notation in which negative exponents have a 1 in the most Single precision Exponent Fraction Double precision Exponent Object represented Fraction 0 0 0 0 0 0 Nonzero 0 Nonzero ± denormalized number 1–254 Anything 1–2046 Anything ± floating-point number 255 0 2047 0 ± infinity 255 Nonzero 2047 Nonzero NaN (Not a Number) FIGURE 3.13 IEEE 754 encoding of floating-point numbers. A separate sign bit determines the sign. Denormalized numbers are described in the Elaboration on page 257. 04-Ch03-P374750.indd 235 7/3/09 9:00:48 AM 236 Chapter 3 Arithmetic for Computers significant bit of the exponent field, a negative exponent will look like a big number. For example, 1.0two × 2−1 would be represented as 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0 . . . 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (Remember that the leading 1 is implicit in the significand.) The value 1.0two × 2+1 would look like the smaller binary number 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0 . . . 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 The desirable notation must therefore represent the most negative exponent as 00 . . . 00two and the most positive as 11 . . . 11two. This convention is called biased notation, with the bias being the number subtracted from the normal, unsigned representation to determine the real value. 
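As a quick illustration (added here, not from the text) that a biased encoding makes integer comparisons sort floating-point magnitudes correctly, the value with the larger exponent gets the larger unsigned bit pattern; the helper bits_of is an invented name:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

static uint32_t bits_of(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);     /* reinterpret the 32 bits */
    return u;
}

int main(void) {
    /* 1.0two x 2^-1 and 1.0two x 2^+1 */
    printf("0.5 -> 0x%08X\n", (unsigned)bits_of(0.5f));  /* 0x3F000000 */
    printf("2.0 -> 0x%08X\n", (unsigned)bits_of(2.0f));  /* 0x40000000, larger as an unsigned integer */
    return 0;
}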
IEEE 754 uses a bias of 127 for single precision, so an exponent of −1 is represented by the bit pattern of the value −1 + 127ten, or 126ten = 0111 1110two, and +1 is represented by 1 + 127, or 128ten = 1000 0000two. The exponent bias for double precision is 1023. Biased exponent means that the value represented by a floatingpoint number is really (−1)S × (1 + Fraction) × 2(Exponent − Bias) The range of single precision numbers is then from as small as ±1.0000 0000 0000 0000 0000 000two × 2−126 to as large as ±1.1111 1111 1111 1111 1111 111two × 2+127. Let’s show the representation. Floating-Point Representation EXAMPLE ANSWER Show the IEEE 754 binary representation of the number −0.75ten in single and double precision. The number −0.75ten is also −3/4ten or −3/22ten It is also represented by the binary fraction −11two/22ten or −0.11two 04-Ch03-P374750.indd 236 7/3/09 9:00:48 AM 3.5 237 Floating Point In scientific notation, the value is −0.11two × 20 and in normalized scientific notation, it is −1.1two × 2−1 The general representation for a single precision number is (−1)S × (1 + Fraction) × 2(Exponent − 127) Subtracting the bias 127 from the exponent of −1.1two × 2−1 yields (−1)1 × (1 + .1000 0000 0000 0000 0000 000two) × 2(126−127) The single precision binary representation of −0.75ten is then 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 bit 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 8 bits 0 0 0 0 23 bits The double precision representation is (−1)1 × (1 + .1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000two) × 2(1022−1023) 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 bit 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 11 bits 0 0 0 0 0 0 0 0 0 20 bits 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 32 bits Now let’s try going the other direction. 04-Ch03-P374750.indd 237 7/3/09 9:00:48 AM 238 Chapter 3 Arithmetic for Computers Converting Binary to Decimal Floating Point What decimal number is represented by this single precision float? EXAMPLE 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 1 0 0 0 0 0 0 . . . 1 0 0 0 ANSWER 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 The sign bit is 1, the exponent field contains 129, and the fraction field contains 1 × 2−2 = 1/4, or 0.25. Using the basic equation, (−1)S × (1 + Fraction) × 2(Exponent − Bias) = (−1)1 × (1 + 0.25) × 2(129−127) = −1 × 1.25 × 22 = −1.25 × 4 = −5.0 In the next subsections, we will give the algorithms for floating-point addition and multiplication. At their core, they use the corresponding integer operations on the significands, but extra bookkeeping is necessary to handle the exponents and normalize the result. We first give an intuitive derivation of the algorithms in decimal and then give a more detailed, binary version in the figures. Elaboration: In an attempt to increase range without removing bits from the significand, some computers before the IEEE 754 standard used a base other than 2. For example, the IBM 360 and 370 mainframe computers use base 16. Since changing the IBM exponent by one means shifting the significand by 4 bits, “normalized” base 16 numbers can have up to 3 leading bits of 0s! Hence, hexadecimal digits mean that up to 3 bits must be dropped from the significand, which leads to surprising problems in the accuracy of floating-point arithmetic. Recent IBM mainframes support IEEE 754 as well as the hex format. 
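Both worked encodings can be verified with a few lines of C. The hexadecimal constants below are a packing of the bit patterns shown above (0xBF400000 for the −0.75 example, 0xC0A00000 for the decoding example); the decode function is an invented helper that pulls out the three fields and applies (−1)^S × (1 + Fraction) × 2^(Exponent − 127). This is an illustration added here, not code from the text.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Decode a single precision bit pattern by hand (normalized numbers only). */
static double decode(uint32_t bits) {
    int    sign     = bits >> 31;
    int    exponent = (bits >> 23) & 0xFF;            /* 8-bit biased exponent  */
    double fraction = (bits & 0x7FFFFF) / 8388608.0;  /* 23 fraction bits / 2^23 */
    return (sign ? -1.0 : 1.0) * (1.0 + fraction) * ldexp(1.0, exponent - 127);
}

int main(void) {
    printf("%g\n", decode(0xBF400000u));   /* prints -0.75 */
    printf("%g\n", decode(0xC0A00000u));   /* prints -5    */
    return 0;
}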
Floating-Point Addition Let’s add numbers in scientific notation by hand to illustrate the problems in floating-point addition: 9.999ten × 101 + 1.610ten × 10−1. Assume that we can store only four decimal digits of the significand and two decimal digits of the exponent. Step 1. To be able to add these numbers properly, we must align the decimal point of the number that has the smaller exponent. Hence, we need a form of the smaller number, 1.610ten × 10−1, that matches the larger exponent. 04-Ch03-P374750.indd 238 7/3/09 9:00:48 AM 3.5 Floating Point 239 We obtain this by observing that there are multiple representations of an unnormalized floating-point number in scientific notation: 1.610ten × 10−1 = 0.1610ten × 100 = 0.01610ten × 101 The number on the right is the version we desire, since its exponent matches the exponent of the larger number, 9.999ten × 101. Thus, the first step shifts the significand of the smaller number to the right until its corrected exponent matches that of the larger number. But we can represent only four decimal digits so, after shifting, the number is really 0.016ten × 101 Step 2. Next comes the addition of the significands: + 9.999ten 0.016ten 10.015ten The sum is 10.015ten × 101. Step 3. This sum is not in normalized scientific notation, so we need to adjust it: 10.015ten × 101 = 1.0015ten × 102 Thus, after the addition we may have to shift the sum to put it into normalized form, adjusting the exponent appropriately. This example shows shifting to the right, but if one number were positive and the other were negative, it would be possible for the sum to have many leading 0s, requiring left shifts. Whenever the exponent is increased or decreased, we must check for overflow or underflow—that is, we must make sure that the exponent still fits in its field. Step 4. Since we assumed that the significand can be only four digits long (excluding the sign), we must round the number. In our grammar school algorithm, the rules truncate the number if the digit to the right of the desired point is between 0 and 4 and add 1 to the digit if the number to the right is between 5 and 9. The number 1.0015ten × 102 is rounded to four digits in the significand to 1.002ten × 102 since the fourth digit to the right of the decimal point was between 5 and 9. Notice that if we have bad luck on rounding, such as adding 1 to a string of 9s, the sum may no longer be normalized and we would need to perform step 3 again. 04-Ch03-P374750.indd 239 7/3/09 9:00:48 AM 240 Chapter 3 Arithmetic for Computers Start 1. Compare the exponents of the two numbers; shift the smaller number to the right until its exponent would match the larger exponent 2. Add the significands 3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent Overflow or underflow? Yes No Exception 4. Round the significand to the appropriate number of bits No Still normalized? Yes Done FIGURE 3.14 Floating-point addition. The normal path is to execute steps 3 and 4 once, but if rounding causes the sum to be unnormalized, we must repeat step 3. Figure 3.14 shows the algorithm for binary floating-point addition that follows this decimal example. Steps 1 and 2 are similar to the example just discussed: adjust the significand of the number with the smaller exponent and then add the two significands. Step 3 normalizes the results, forcing a check for overflow or underflow. 
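The same four steps can be followed in software on a toy format. The sketch below is an illustration added here, not the book's code: the invented fp_t pair holds value = sig × 2^exp with only 4 significant bits, steps 1 through 3 are carried out literally, and step 4 is simplified to truncation (real hardware keeps guard and round bits, discussed later in this section).

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PREC 4                                   /* significant bits kept      */
typedef struct { int64_t sig; int exp; } fp_t;   /* value = sig * 2^exp        */

static fp_t fp_add(fp_t a, fp_t b) {
    /* Step 1: shift the number with the smaller exponent right to align it. */
    if (a.exp < b.exp) { fp_t t = a; a = b; b = t; }
    while (b.exp < a.exp) { b.sig /= 2; b.exp++; }        /* truncating shift  */

    /* Step 2: add the significands. */
    fp_t s = { a.sig + b.sig, a.exp };

    /* Step 3: normalize the sum; step 4 (rounding) is omitted in this toy. */
    while (llabs(s.sig) >= (1 << PREC))                   { s.sig /= 2; s.exp++; }
    while (s.sig != 0 && llabs(s.sig) < (1 << (PREC - 1))) { s.sig *= 2; s.exp--; }
    return s;
}

int main(void) {
    fp_t half    = {  8, -4 };       /*  0.5     =  1.000two x 2^-1 */
    fp_t neg7_16 = { -14, -5 };      /* -0.4375  = -1.110two x 2^-2 */
    fp_t sum = fp_add(half, neg7_16);
    printf("%lld x 2^%d\n", (long long)sum.sig, sum.exp);  /* 8 x 2^-7 = 0.0625 */
    return 0;
}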
04-Ch03-P374750.indd 240 7/3/09 9:00:48 AM 3.5 241 Floating Point The test for overflow and underflow in step 3 depends on the precision of the operands. Recall that the pattern of all 0 bits in the exponent is reserved and used for the floating-point representation of zero. Moreover, the pattern of all 1 bits in the exponent is reserved for indicating values and situations outside the scope of normal floating-point numbers (see the Elaboration on page 257). Thus, for single precision, the maximum exponent is 127, and the minimum exponent is −126. The limits for double precision are 1023 and −1022. Binary Floating-Point Addition Try adding the numbers 0.5ten and −0.4375ten in binary using the algorithm in Figure 3.14. EXAMPLE Let’s first look at the binary version of the two numbers in normalized scientific notation, assuming that we keep 4 bits of precision: ANSWER = 1/21ten 0.5ten = 1/2ten = 0.1two = 0.1two × 20 = 1.000two × 2−1 4 −0.4375ten = −7/16ten = −7/2 ten = −0.0111two = −0.0111two × 20 = −1.110two × 2−2 Now we follow the algorithm: Step 1. The significand of the number with the lesser exponent (−1.11two × 2−2) is shifted right until its exponent matches the larger number: −1.110two × 2−2 = −0.111two × 2−1 Step 2. Add the significands: 1.000two × 2−1 + (−0.111two × 2−1) = 0.001two × 2−1 Step 3. Normalize the sum, checking for overflow or underflow: 0.001two × 2−1 = 0.010two × 2−2 = 0.100two × 2−3 = 1.000two × 2−4 Since 127 ≥ −4 ≥ −126, there is no overflow or underflow. (The biased exponent would be −4 + 127, or 123, which is between 1 and 254, the smallest and largest unreserved biased exponents.) Step 4. Round the sum: 1.000two × 2−4 04-Ch03-P374750.indd 241 7/3/09 9:00:48 AM 242 Chapter 3 Arithmetic for Computers The sum already fits exactly in 4 bits, so there is no change to the bits due to rounding. This sum is then 1.000two × 2−4 = 0.0001000two = 0.0001two = 1/24ten = 1/16ten = 0.0625ten This sum is what we would expect from adding 0.5ten to −0.4375ten. Many computers dedicate hardware to run floating-point operations as fast as possible. Figure 3.15 sketches the basic organization of hardware for floating-point addition. Floating-Point Multiplication Now that we have explained floating-point addition, let’s try floating-point multiplication. We start by multiplying decimal numbers in scientific notation by hand: 1.110ten × 1010 × 9.200ten × 10−5. Assume that we can store only four digits of the significand and two digits of the exponent. Step 1. Unlike addition, we calculate the exponent of the product by simply adding the exponents of the operands together: New exponent = 10 + (−5) = 5 Let’s do this with the biased exponents as well to make sure we obtain the same result: 10 + 127 = 137, and −5 + 127 = 122, so New exponent = 137 + 122 = 259 This result is too large for the 8-bit exponent field, so something is amiss! The problem is with the bias because we are adding the biases as well as the exponents: New exponent = (10 + 127) + (−5 + 127) = (5 + 2 × 127) = 259 Accordingly, to get the correct biased sum when we add biased numbers, we must subtract the bias from the sum: New exponent = 137 + 122 − 127 = 259 − 127 = 132 = (5 + 127) and 5 is indeed the exponent we calculated initially. 
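The bias bookkeeping in this step is just as easy to get wrong in software; a small check, added here for illustration, of the rule "add the biased exponents, then subtract one bias":

#include <stdio.h>

int main(void) {
    const int bias = 127;
    int e1 = 10 + bias;           /* biased exponent of the first operand  (137) */
    int e2 = -5 + bias;           /* biased exponent of the second operand (122) */

    int naive   = e1 + e2;        /* 259: includes the bias twice                */
    int correct = e1 + e2 - bias; /* 132 = 5 + 127: the proper biased result     */

    printf("naive = %d, correct = %d\n", naive, correct);
    return 0;
}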
04-Ch03-P374750.indd 242 7/3/09 9:00:48 AM 3.5 Sign Exponent Fraction Sign 243 Floating Point Fraction Exponent Compare Small ALU exponents Exponent difference 0 1 0 1 0 1 Shift smaller Shift right Control number right Add Big ALU 0 0 1 Increment or decrement Shift left or right Rounding hardware Sign Exponent 1 Normalize Round Fraction FIGURE 3.15 Block diagram of an arithmetic unit dedicated to floating-point addition. The steps of Figure 3.14 correspond to each block, from top to bottom. First, the exponent of one operand is subtracted from the other using the small ALU to determine which is larger and by how much. This difference controls the three multiplexors; from left to right, they select the larger exponent, the significand of the smaller number, and the significand of the larger number. The smaller significand is shifted right, and then the significands are added together using the big ALU. The normalization step then shifts the sum left or right and increments or decrements the exponent. Rounding then creates the final result, which may require normalizing again to produce the final result. 04-Ch03-P374750.indd 243 7/3/09 9:00:48 AM 244 Chapter 3 Arithmetic for Computers Step 2. Next comes the multiplication of the significands: 1.110ten x 9.200ten 0000 0000 2220 9990 10212000ten There are three digits to the right of the decimal point for each operand, so the decimal point is placed six digits from the right in the product significand: 10.212000ten Assuming that we can keep only three digits to the right of the decimal point, the product is 10.212 × 105. Step 3. This product is unnormalized, so we need to normalize it: 10.212ten × 105 = 1.0212ten × 106 Thus, after the multiplication, the product can be shifted right one digit to put it in normalized form, adding 1 to the exponent. At this point, we can check for overflow and underflow. Underflow may occur if both operands are small—that is, if both have large negative exponents. Step 4. We assumed that the significand is only four digits long (excluding the sign), so we must round the number. The number 1.0212ten × 106 is rounded to four digits in the significand to 1.021ten × 106 Step 5. The sign of the product depends on the signs of the original operands. If they are both the same, the sign is positive; otherwise, it’s negative. Hence, the product is +1.021ten × 106 The sign of the sum in the addition algorithm was determined by addition of the significands, but in multiplication, the sign of the product is determined by the signs of the operands. Once again, as Figure 3.16 shows, multiplication of binary floating-point numbers is quite similar to the steps we have just completed. We start with 04-Ch03-P374750.indd 244 7/3/09 9:00:48 AM 3.5 Floating Point 245 Start 1. Add the biased exponents of the two numbers, subtracting the bias from the sum to get the new biased exponent 2. Multiply the significands 3. Normalize the product if necessary, shifting it right and incrementing the exponent Overflow or underflow? Yes No Exception 4. Round the significand to the appropriate number of bits No Still normalized? Yes 5. Set the sign of the product to positive if the signs of the original operands are the same; if they differ make the sign negative Done FIGURE 3.16 Floating-point multiplication. The normal path is to execute steps 3 and 4 once, but if rounding causes the sum to be unnormalized, we must repeat step 3. 
04-Ch03-P374750.indd 245 7/3/09 9:00:48 AM 246 Chapter 3 Arithmetic for Computers calculating the new exponent of the product by adding the biased exponents, being sure to subtract one bias to get the proper result. Next is multiplication of significands, followed by an optional normalization step. The size of the exponent is checked for overflow or underflow, and then the product is rounded. If rounding leads to further normalization, we once again check for exponent size. Finally, set the sign bit to 1 if the signs of the operands were different (negative product) or to 0 if they were the same (positive product). Binary Floating-Point Multiplication EXAMPLE ANSWER Let’s try multiplying the numbers 0.5ten and −0.4375ten, using the steps in Figure 3.16. In binary, the task is multiplying 1.000two × 2−1 by − 1.110two × 2−2. Step 1. Adding the exponents without bias: −1 + (−2) = −3 or, using the biased representation: (−1 + 127) + (−2 + 127) − 127 = (−1 − 2) + (127 + 127 − 127) = −3 + 127 = 124 Step 2. Multiplying the significands: 1.000two x 1.110two 0000 1000 1000 1000 1110000two The product is 1.110000two × 2−3, but we need to keep it to 4 bits, so it is 1.110two × 2−3. Step 3. Now we check the product to make sure it is normalized, and then check the exponent for overflow or underflow. The product is already normalized and, since 127 ≥ −3 ≥ −126, there is no overflow or underflow. (Using the biased representation, 254 ≥ 124 ≥ 1, so the exponent fits.) 04-Ch03-P374750.indd 246 7/3/09 9:00:49 AM 3.5 Floating Point 247 Step 4. Rounding the product makes no change: 1.110two × 2−3 Step 5. Since the signs of the original operands differ, make the sign of the product negative. Hence, the product is −1.110two × 2−3 Converting to decimal to check our results: −1.110two × 2−3 = −0.001110two = −0.00111two = −7/25ten = −7/32ten = −0.21875ten The product of 0.5ten and −0.4375ten is indeed −0.21875ten. Floating-Point Instructions in ARM ARM supports the IEEE 754 single precision and double precision formats with these instructions with the optional VFP set of instructions: These include ■ Floating-point addition, single (FADDS) and addition, double (FADDD) ■ Floating-point subtraction, single (FSUBS) and subtraction, double (FSUBD) ■ Floating-point multiplication, single (FMULS) and multiplication, double (FMULD) ■ Floating-point division, single (FDIVS) and division, double (FDIVD) ■ Floating-point comparison, single (FCMPS) and comparison, double (FCMPD) Floating-point comparison set floating point condition flags. To be able to branch on them, the programmer must first transfer them to the integer condition flags, which is accomplished with the FMSTAT instruction. As you might expect, BEQ, BNE, BGT, and BGE test as their namesakes suggest for floating point comparisons. However, because of the way the floating point conditions are mapped to integer condition flags, you test for less than with BMI and less than or equal with BLS. The ARM designers decided to add 32 separate floating-point registers—called s0, s1, s2, . . . , s31—used for single precision. Hence, they included separate loads and stores for floating-point registers: FLDS and FSTS. The base registers for floating-point data transfers remain integer registers. The ARM code to load two single precision numbers from memory, add them, and then store the sum might look like this: FLDS FLDS FADDS FSTS 04-Ch03-P374750.indd 247 s4,[sp,#x] s6,[sp,#y] s2,s4,s6 s2,[sp,#z] ; ; ; ; Load 32-bit F.P. number into s4 Load 32-bit F.P. 
number into s6 s2 = s4 + s6 single precision Store 32-bit F.P. number from s2 7/3/09 9:00:49 AM 248 Chapter 3 Arithmetic for Computers FP data transfers have a limited number of addressing modes. The most useful are immediate offset, immediate offset pre-indexed, and immediate offset postindexed. Note that the offsets are multiplied by 4 to extend the range of the offset, similar to what ARM does for branches. (see Chapter 2). A double precision register is really just an even-odd pair of single precision registers, using the names d0, d1, d2, . . . , d15. Thus, the pair of single precision registers s16 and s17 also form the double precision register named d8. Figure 3.17 summarizes the floating-point portion of the ARM architecture revealed in this chapter, with the additions to support floating point shown in color. ARM floating-point operands Example Name Comments 32 floating-point registers s0, s1, s2, . . . , s31 or d0, d1, d2, . . ., d15 ARM single-precision floating-point registers are used in pairs for double precision numbers. 230 memory words Memory[0], Memory[4], . . . , Memory[4294967292] Accessed only by data transfer instructions. ARM uses byte addresses, so sequential word addresses differ by 4. Memory holds data structures, such as arrays, and spilled registers, such as those saved on procedure calls. ARM floating-point assembly language Category Arithmetic Data transfer Compare FIGURE 3.17 Example Instruction Meaning FP add single FP subtract single FP multiply single FP divide single FADDS FSUBS FMULS FDIVS s2,s4,s6 s2,s4,s6 s2,s4,s6 s2,s4,s6 s2 = s4 + s6 s2 = s4 – s6 FP add double FP subtract double FP multiply double FADDD FSUBD FMULD d2,d4,d6 d2,d4,d6 d2,d4,d6 d2 = d4 + d6 d2 = d4 – d6 FP divide double FP load, single prec. FP store, single prec. FP load, double prec. FP store, double prec. FP compare single FDIVD d2,d4,d6 FLDS s1,[r1,#100] FSTS s1,[r1,#100] FLDD d1,[r1,#100] FSTD d1,[r1,#100] FCMPS s2,s4 d2 = d4 / d6 s1 = Memory[r1 + 400] FP compare double FCMPD d2,d4 if (d2 - d4) FP Move Status (for conditional branch) FMSTAT cond. flags = FP cond. flags s2 = s4 × s6 s2 = s4 / s6 d2 = d4 × d6 Memory[r1 + 400] = s1 d1 = Memory[r1 + 400] Memory[r1 + 400] = d1 if (s2 - s4) Comments FP add (single precision) FP sub (single precision) FP multiply (single precision) FP divide (single precision) FP add (double precision) FP sub (double precision) FP multiply (double precision) FP divide (double precision) 32-bit data to FP register 32-bit data to memory 64-bit data to FP register 64--bit data to memory FP compare less than single precision FP compare less than double precision Copy FP condition flags to integer condition flags ARM floating-point architecture revealed thus far. Elaboration: VPF version 3 has 32 double-precision floating-point registers. 04-Ch03-P374750.indd 248 7/3/09 9:00:49 AM 3.5 249 Floating Point One issue that architects face in supporting floating-point arithmetic is whether to use the same registers used by the integer instructions or to add a special set for floating point. Because programs normally perform integer operations and floating-point operations on different data, separating the registers will only slightly increase the number of instructions needed to execute a program. The major impact is to create a separate set of data transfer instructions to move data between floatingpoint registers and memory. 
The benefits of separate floating-point registers are having twice as many registers without using up more bits in the instruction format, having twice the register bandwidth by having separate integer and floating-point register sets, and being able to customize registers to floating point; for example, some computers convert all sized operands in registers into a single internal format.

Hardware/Software Interface

Elaboration: The V in VFP stands for vector, as the floating-point coprocessor actually supports short vector instructions. Section 7.6 of Chapter 7 describes vector instruction set architectures. The maximum length of a vector is either 8 single precision registers or 4 double precision registers. Vector length is determined by the LEN field in the floating-point status and control register of ARM (FPSCR). If it is set to 0, VFP instructions act like scalar floating-point instructions. The vector loads and stores can do unit stride accesses. If the LEN and STRIDE fields of the FPSCR are set properly, they can also support a stride of 2. The other pieces commonly found in vector architectures are missing, such as scatter-gather data transfers and conditional execution of vector elements. To be able to perform the common scalar-vector operations, where an operation combines a single scalar variable with each element of the vector, VFP divides the FP registers into those that can make up the vector registers and those that can only be scalar registers. For single precision, registers s0 to s7 are scalar, and three sets of registers can be used as vectors: s8 to s15, s16 to s23, and s24 to s31. The corresponding double precision registers are used to the same end: d0 to d3 are scalar, and the three sets of vectorizable registers are d4 to d7, d8 to d11, and d12 to d15. In both cases, if the destination register is in the first bank, then the whole operation is considered scalar.

Compiling a Floating-Point C Program into ARM Assembly Code

EXAMPLE

Let's convert a temperature in Fahrenheit to Celsius:

float f2c (float fahr)
{
  return ((5.0/9.0) * (fahr - 32.0));
}
Let’s perform matrix multiply of X = X + Y * Z. Let’s assume X, Y, and Z are all square matrices with 32 elements in each dimension. void mm (double x[][], double y[][], double z[][]) { int i, j, k; for (i = 0; i < 32; i = i + 1) for (j = 0; j < 32; j = j + 1) 04-Ch03-P374750.indd 250 7/3/09 9:00:49 AM 3.5 251 Floating Point for (k = 0; k < 32; k = k + 1) x[i][j] = x[i][j] + y[i][k] * z[k][j]; } The array starting addresses are parameters, so they are in r0, r1, and r2. Assume that the integer variables are in r3, r4, and r5, respectively. What is the ARM assembly code for the body of the procedure? Note that x[i][j] is used in the innermost loop above. Since the loop index is k, the index does not affect x[i][j], so we can avoid loading and storing x[i][j] each iteration. Instead, the compiler loads x[i][j] into a register outside the loop, accumulates the sum of the products of y[i][k] and z[k] [j] in that same register, and then stores the sum into x[i][j] upon termination of the innermost loop. As we did in Chapter 2, let’s rename the registers to make it easier to read and write the code: x y z i j k xijAddr tempAddr RN RN RN RN RN RN RN RN 0 1 2 3 4 5 6 12 ; ; ; ; ; ; ; ; ANSWER 1st argument address of x 2nd argument address of y 3rd argument address of z local variable i local variable j local variable k address of x[i][j] address of y[i][j] or z[i][j] Since r4, r5, and r6 are not preserved across the procedure call, we must save them: mm: SUB STR STR STR sp,sp,#12 r4, [sp, #8] r5, [sp, #4] r6, [sp, #0] ; ; ; ; make save save save room on stack for 3 registers r4 on stack r5 on stack r6 on stack The body of the procedure starts with initializing the three for loop variables: L1: L2: 04-Ch03-P374750.indd 251 MOV MOV MOV i, j, k, 0 0 0 ; ; ; i = 0; initialize 1st for loop j = 0; restart 2nd for loop k = 0; restart 3rd for loop 7/3/09 9:00:49 AM 252 Chapter 3 Arithmetic for Computers To calculate the address of x[i][j], we need to know how a 32 × 32, two-dimensional array is stored in memory. As you might expect, its layout is the same as if there were 32 single-dimension arrays, each with 32 elements. So the first step is to skip over the i “single-dimensional arrays,” or rows, to get the one we want. Thus, we multiply the index in the first dimension by the size of the row, 32. Since 32 is a power of 2, we can use a shift instead as part of adding the second index to select the jth element of the desired row: ADD xijAddr, j, i, LSL #5 ; xijAddr = i*size(row) + j To turn this sum into a byte index, we multiply it by the size of a matrix element in bytes. Since each element is 8 bytes for double precision, we can instead shift left by 3 as part of adding this sum to the base address of x, giving the address of x[i][j]: ADD ; xijAddr, x, xijAddr, LSL #3 ; xijAddr = byte address of x[i][j] We then load the double precision number x[i][j] into s4: FLDD s4, [xijAddr,#0] ; s4 = 8 bytes of x[i][j] The following three instructions are virtually identical to the last three: calculate the address and then load the double precision number z[k][j]. L3: ADD tempAddr, j , k, LSL #5 ; tempAddr = k * ; size(row) + j ADD tempAddr, z, tempAddr, LSL #3 ; tempAddr=byte address of z[k][j] ; s16 = 8 bytes of FLDD s16, [tempAddr,#0] z[k][j] Similarly, the next three instructions are like the last three: calculate the address and then load the double precision number y[i][k]. 
         ADD  tempAddr, k, i, LSL #5        ; tempAddr = i*size(row) + k
         ADD  tempAddr, y, tempAddr, LSL #3 ; tempAddr = byte address of y[i][k]
         FLDD s18, [tempAddr, #0]           ; s18 = 8 bytes of y[i][k]

Now that we have loaded all the data, we are finally ready to do some floating-point operations! We multiply elements of y and z located in registers s18 and s16, and then accumulate the sum in s4.

         FMULD s16, s18, s16                ; s16 = y[i][k] * z[k][j]
         FADDD s4, s4, s16                  ; s4 = x[i][j] + y[i][k] * z[k][j]

The final block increments the index k and loops back if the index is not 32. If it is 32, and thus the end of the innermost loop, we need to store the sum accumulated in s4 into x[i][j].

         ADD  k, k, #1                      ; k = k + 1
         CMP  k, #32
         BLT  L3                            ; if (k < 32) go to L3
         FSTD s4, [xijAddr, #0]             ; x[i][j] = s4

Similarly, these final six instructions increment the index variables of the middle and outermost loops, looping back if the index is less than 32 and exiting if the index is 32.

         ADD  j, j, #1                      ; j = j + 1
         CMP  j, #32
         BLT  L2                            ; if (j < 32) go to L2
         ADD  i, i, #1                      ; i = i + 1
         CMP  i, #32
         BLT  L1                            ; if (i < 32) go to L1

The end of the procedure restores r4, r5, and r6 from the stack, and then returns to the caller.
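For readers who want to run the C version, here is a sketch with explicit array dimensions (C requires all but the first dimension in the declaration) and a tiny driver; the test data and the identity-matrix check are our own additions, not part of the book's example.

    #include <stdio.h>

    #define N 32

    /* Same algorithm as the mm procedure above, with explicit dimensions. */
    void mm(double x[][N], double y[][N], double z[][N])
    {
        for (int i = 0; i < N; i = i + 1)
            for (int j = 0; j < N; j = j + 1)
                for (int k = 0; k < N; k = k + 1)
                    x[i][j] = x[i][j] + y[i][k] * z[k][j];
    }

    int main(void)
    {
        static double x[N][N], y[N][N], z[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                x[i][j] = 0.0;
                y[i][j] = (i == j) ? 1.0 : 0.0;   /* y is the identity matrix */
                z[i][j] = i + j;                  /* arbitrary test data */
            }
        mm(x, y, z);
        /* With y the identity, x should now equal z. */
        printf("x[3][5] = %g (expect %g)\n", x[3][5], z[3][5]);
        return 0;
    }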
Elaboration: The array layout discussed in the example, called row-major order, is used by C and many other programming languages. Fortran instead uses column-major order, whereby the array is stored column by column.

Elaboration: Another reason for separate integer and floating-point registers is that microprocessors in the 1980s didn't have enough transistors to put the floating-point unit on the same chip as the integer unit. Hence, the floating-point unit, including the floating-point registers, was optionally available as a second chip. Such optional accelerator chips are called coprocessors. Today, a coprocessor function may mean that it is optional for some members of the family, in that lower-cost chips might leave out coprocessors so that the chip can be smaller.

Elaboration: As mentioned in Section 3.4, accelerating division is more challenging than multiplication. In addition to SRT, another technique to leverage a fast multiplier is Newton's iteration, where division is recast as finding the zero of a function to find the reciprocal 1/x, which is then multiplied by the other operand. Iteration techniques cannot be rounded properly without calculating many extra bits. A TI chip solves this problem by calculating an extra-precise reciprocal.

Elaboration: Java embraces IEEE 754 by name in its definition of Java floating-point data types and operations. Thus, the code in the first example could well have been generated for a class method that converted Fahrenheit to Celsius. The second example uses multidimensional arrays, which are not explicitly supported in Java. Java allows arrays of arrays, but each array may have its own length, unlike multidimensional arrays in C. Like the examples in Chapter 2, a Java version of this second example would require a good deal of checking code for array bounds, including a new length calculation at the end of each row access. It would also need to check that the object reference is not null.

Accurate Arithmetic

guard: The first of two extra bits kept on the right during intermediate calculations of floating-point numbers; used to improve rounding accuracy.

round: Method to make the intermediate floating-point result fit the floating-point format; the goal is typically to find the nearest number that can be represented in the format.

Unlike integers, which can represent exactly every number between the smallest and largest number, floating-point numbers are normally approximations for a number they can't really represent. The reason is that an infinite variety of real numbers exists between, say, 0 and 1, but no more than 2^53 can be represented exactly in double precision floating point. The best we can do is to get the floating-point representation close to the actual number. Thus, IEEE 754 offers several modes of rounding to let the programmer pick the desired approximation.

Rounding sounds simple enough, but to round accurately requires the hardware to include extra bits in the calculation. In the preceding examples, we were vague on the number of bits that an intermediate representation can occupy, but clearly, if every intermediate result had to be truncated to the exact number of digits, there would be no opportunity to round. IEEE 754, therefore, always keeps two extra bits on the right during intermediate additions, called guard and round, respectively. Let's do a decimal example to illustrate their value.

Rounding with Guard Digits

EXAMPLE: Add 2.56ten × 10^0 to 2.34ten × 10^2, assuming that we have three significant decimal digits. Round to the nearest decimal number with three significant decimal digits, first with guard and round digits, and then without them.

ANSWER: First we must shift the smaller number to the right to align the exponents, so 2.56ten × 10^0 becomes 0.0256ten × 10^2. Since we have guard and round digits, we are able to represent the two least significant digits when we align exponents. The guard digit holds 5 and the round digit holds 6. The sum is

      2.3400ten
    + 0.0256ten
      2.3656ten

Thus the sum is 2.3656ten × 10^2. Since we have two digits to round, we want values 0 to 49 to round down and 51 to 99 to round up, with 50 being the tiebreaker. Rounding the sum up with three significant digits yields 2.37ten × 10^2.

Doing this without guard and round digits drops two digits from the calculation. The new sum is then

      2.34ten
    + 0.02ten
      2.36ten

The answer is 2.36ten × 10^2, off by 1 in the last digit from the sum above.

Since the worst case for rounding would be when the actual number is halfway between two floating-point representations, accuracy in floating point is normally measured in terms of the number of bits in error in the least significant bits of the significand; the measure is called the number of units in the last place, or ulp. If a number were off by 2 in the least significant bits, it would be called off by 2 ulps. Provided there are no overflow, underflow, or invalid operation exceptions, IEEE 754 guarantees that the computer uses the number that is within one-half ulp.

Elaboration: Although the example above really needed just one extra digit, multiply can need two. A binary product may have one leading 0 bit; hence, the normalizing step must shift the product one bit left. This shifts the guard digit into the least significant bit of the product, leaving the round bit to help accurately round the product.

IEEE 754 has four rounding modes: always round up (toward +∞), always round down (toward −∞), truncate, and round to nearest even. The final mode determines what to do if the number is exactly halfway in between.
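The four modes can be selected from C through the C99 <fenv.h> interface; the little program below is a sketch of that (how faithfully the run-time mode is honored depends on the compiler's floating-point settings, so treat it as illustrative rather than definitive).

    #include <stdio.h>
    #include <fenv.h>

    int main(void)
    {
        volatile double one = 1.0, three = 3.0;   /* volatile blocks constant folding */
        struct { const char *name; int mode; } modes[] = {
            { "to nearest even", FE_TONEAREST  },
            { "toward +infinity", FE_UPWARD     },
            { "toward -infinity", FE_DOWNWARD   },
            { "truncate        ", FE_TOWARDZERO },
        };
        for (int i = 0; i < 4; i++) {
            fesetround(modes[i].mode);
            /* The last digits of 1/3 change depending on the rounding mode. */
            printf("%s: 1/3 = %.20f\n", modes[i].name, one / three);
        }
        fesetround(FE_TONEAREST);   /* restore the default mode */
        return 0;
    }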
The U.S. Internal Revenue Service (IRS) always rounds 0.50 dollars up, possibly to the benefit of the IRS. A more equitable way would be to round up this case half the time and round down the other half. IEEE 754 says that if the least significant bit retained in a halfway case would be odd, add one; if it's even, truncate. This method always creates a 0 in the least significant bit in the tie-breaking case, giving the rounding mode its name. This mode is the most commonly used, and the only one that Java supports.

The goal of the extra rounding bits is to allow the computer to get the same results as if the intermediate results were calculated to infinite precision and then rounded. To support this goal and round to the nearest even, the standard has a third bit in addition to guard and round; it is set whenever there are nonzero bits to the right of the round bit. This sticky bit allows the computer to see the difference between 0.50 . . . 00ten and 0.50 . . . 01ten when rounding. The sticky bit may be set, for example, during addition, when the smaller number is shifted to the right. Suppose we added 5.01ten × 10^–1 to 2.34ten × 10^2 in the example above. Even with guard and round, we would be adding 0.0050 to 2.34, with a sum of 2.3450. The sticky bit would be set, since there are nonzero bits to the right. Without the sticky bit to remember whether any 1s were shifted off, we would assume the number is equal to 2.345000 . . . 00 and round to the nearest even of 2.34. With the sticky bit to remember that the number is larger than 2.345000 . . . 00, we round instead to 2.35.

units in the last place (ulp): The number of bits in error in the least significant bits of the significand between the actual number and the number that can be represented.

sticky bit: A bit used in rounding in addition to guard and round that is set whenever there are nonzero bits to the right of the round bit.

fused multiply add: A floating-point instruction that performs both a multiply and an add, but rounds only once after the add.

Elaboration: The PowerPC, SPARC64, and AMD SSE5 architectures provide a single instruction that does a multiply and add on three registers: a = a + (b × c). Obviously, this instruction allows potentially higher floating-point performance for this common operation. Equally important is that instead of performing two roundings—after the multiply and then after the add—which would happen with separate instructions, the multiply add instruction can perform a single rounding after the add. A single rounding step increases the precision of multiply add. Such operations with a single rounding are called fused multiply add. It was added to the revised IEEE 754 standard (see Section 3.10 on the CD).
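C99 exposes fused multiply add as fma() in <math.h>; the sketch below picks operands (our own choice) where rounding after the multiply throws away exactly the information that the fused form keeps.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double a = 1.0 + 0x1p-52;   /* 1 + 2^-52, the smallest double above 1.0 */
        double b = 1.0 - 0x1p-52;   /* 1 - 2^-52 */
        double c = -1.0;

        /* a*b is exactly 1 - 2^-104, which rounds back to 1.0 in double. */
        double separate = a * b + c;     /* rounds after the multiply, then after the add: 0 */
        double fused    = fma(a, b, c);  /* rounds only once, after the add: -2^-104 */

        printf("separate multiply+add: %.17g\n", separate);
        printf("fused multiply-add:    %.17g\n", fused);   /* about -4.9e-32 */
        return 0;
    }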
Summary

The BIG Picture that follows reinforces the stored-program concept from Chapter 2; the meaning of the information cannot be determined just by looking at the bits, for the same bits can represent a variety of objects. This section shows that computer arithmetic is finite and thus can disagree with natural arithmetic. For example, the IEEE 754 standard floating-point representation

    (−1)^S × (1 + Fraction) × 2^(Exponent − Bias)

is almost always an approximation of the real number. Computer systems must take care to minimize this gap between computer arithmetic and arithmetic in the real world, and programmers at times need to be aware of the implications of this approximation.

The BIG Picture: Bit patterns have no inherent meaning. They may represent signed integers, unsigned integers, floating-point numbers, instructions, and so on. What is represented depends on the instruction that operates on the bits in the word. The major difference between computer numbers and numbers in the real world is that computer numbers have limited size and hence limited precision; it's possible to calculate a number too big or too small to be represented in a word. Programmers must remember these limits and write programs accordingly.

Hardware/Software Interface

    C type        Java type   Data transfers   Operations
    int           int         LDR, STR         ADD, SUB, MUL, AND, ORR, MVN, MOV, CMP
    unsigned int  —           LDR, STR         ADD, SUB, MUL, AND, ORR, MVN, MOV, CMP
    char          —           LDRSB, STRB      ADD, SUB, MUL, AND, ORR, MVN, MOV, CMP
    —             char        LDRSH, STRH      ADD, SUB, MUL, AND, ORR, MVN, MOV, CMP
    float         float       FLDS, FSTS       FADDS, FSUBS, FMULS, FDIVS, FCMPS
    double        double      FLDD, FSTD       FADDD, FSUBD, FMULD, FDIVD, FCMPD

In the last chapter, we presented the storage classes of the programming language C (see the Hardware/Software Interface section in Section 2.7). The table above shows some of the C and Java data types, the ARM data transfer instructions, and the instructions that operate on those types that appear in Chapter 2 and this chapter. Note that Java omits unsigned integers.

Check Yourself: Suppose there was a 16-bit IEEE 754 floating-point format with five exponent bits. What would be the likely range of numbers it could represent?

1. 1.0000 0000 00 × 2^0 to 1.1111 1111 11 × 2^31, 0
2. ±1.0000 0000 0 × 2^−14 to ±1.1111 1111 1 × 2^15, ±0, ±∞, NaN
3. ±1.0000 0000 00 × 2^−14 to ±1.1111 1111 11 × 2^15, ±0, ±∞, NaN
4. ±1.0000 0000 00 × 2^−15 to ±1.1111 1111 11 × 2^14, ±0, ±∞, NaN

Elaboration: To accommodate comparisons that may include NaNs, the standard includes ordered and unordered as options for compares. Hence, the full ARM instruction set has many flavors of compares to support NaNs. (Java does not support unordered compares.)

In an attempt to squeeze every last bit of precision from a floating-point operation, the standard allows some numbers to be represented in unnormalized form. Rather than having a gap between 0 and the smallest normalized number, IEEE allows denormalized numbers (also known as denorms or subnormals). They have the same exponent as zero but a nonzero significand. They allow a number to degrade in significance until it becomes 0, called gradual underflow. For example, the smallest positive single precision normalized number is

    1.0000 0000 0000 0000 0000 000two × 2^–126

but the smallest single precision denormalized number is

    0.0000 0000 0000 0000 0000 001two × 2^–126, or 1.0two × 2^–149

For double precision, the denorm gap goes from 1.0 × 2^–1022 to 1.0 × 2^–1074.

The possibility of an occasional unnormalized operand has given headaches to floating-point designers who are trying to build fast floating-point units. Hence, many computers cause an exception if an operand is denormalized, letting software complete the operation. Although software implementations are perfectly valid, their lower performance has lessened the popularity of denorms in portable floating-point software. Moreover, if programmers do not expect denorms, their programs may surprise them.
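A quick way to see gradual underflow from C is sketched below; it assumes the compiler is not flushing denormals to zero, as some fast-math options do.

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void)
    {
        printf("smallest normal float:  %g\n", FLT_MIN);            /* 1.0 x 2^-126 */
        printf("smallest denorm float:  %g\n", ldexpf(1.0f, -149)); /* 1.0 x 2^-149, still nonzero */
        printf("one step further:       %g\n", ldexpf(1.0f, -150)); /* underflows to 0 */
        return 0;
    }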
3.6 Parallelism and Computer Arithmetic: Associativity

Programs have typically been written first to run sequentially before being rewritten to run concurrently, so a natural question is, "do the two versions get the same answer?" If the answer is no, you presume there is a bug in the parallel version that you need to track down. This approach assumes that computer arithmetic does not affect the results when going from sequential to parallel. That is, if you were to add a million numbers together, you would get the same results whether you used 1 processor or 1000 processors. This assumption holds for two's complement integers, even if the computation overflows. Another way to say this is that integer addition is associative. Alas, because floating-point numbers are approximations of real numbers and because computer arithmetic has limited precision, it does not hold for floating-point numbers. That is, floating-point addition is not associative.

Testing Associativity of Floating-Point Addition

EXAMPLE: See if x + (y + z) = (x + y) + z. For example, suppose x = −1.5ten × 10^38, y = 1.5ten × 10^38, and z = 1.0, and that these are all single precision numbers.

ANSWER: Given the great range of numbers that can be represented in floating point, problems occur when adding two large numbers of opposite signs plus a small number, as we shall see:

    x + (y + z) = −1.5ten × 10^38 + (1.5ten × 10^38 + 1.0)
                = −1.5ten × 10^38 + (1.5ten × 10^38)
                = 0.0
    (x + y) + z = (−1.5ten × 10^38 + 1.5ten × 10^38) + 1.0
                = (0.0ten) + 1.0
                = 1.0

Therefore x + (y + z) ≠ (x + y) + z, so floating-point addition is not associative. Since floating-point numbers have limited precision and result in approximations of real results, 1.5ten × 10^38 is so much larger than 1.0ten that 1.5ten × 10^38 + 1.0 is still 1.5ten × 10^38. That is why the sum of x, y, and z is 0.0 or 1.0, depending on the order of the floating-point additions, and hence floating-point add is not associative.

A more vexing version of this pitfall occurs on a parallel computer, where the operating system scheduler may use a different number of processors depending on what other programs are running. The unaware parallel programmer may be flummoxed by his or her program getting slightly different answers each time it is run for the same identical code and the same identical input, as the varying number of processors from each run would cause the floating-point sums to be calculated in different orders. Given this quandary, programmers who write parallel code with floating-point numbers need to verify whether the results are credible even if they don't give the same exact answer as the sequential code. The field that deals with such issues is called numerical analysis, which is the subject of textbooks in its own right. Such concerns are one reason for the popularity of numerical libraries such as LAPACK and ScaLAPACK, which have been validated in both their sequential and parallel forms.
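The example translates directly into a few lines of C; this sketch uses single precision and the same values as above, and running it shows the two orders producing 0.0 and 1.0.

    #include <stdio.h>

    int main(void)
    {
        float x = -1.5e38f, y = 1.5e38f, z = 1.0f;

        /* y + z rounds back to 1.5e38, so adding x gives 0.0 ... */
        printf("x + (y + z) = %f\n", x + (y + z));
        /* ... while x + y is exactly 0.0, so adding z gives 1.0. */
        printf("(x + y) + z = %f\n", (x + y) + z);
        return 0;
    }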
Elaboration: A subtle version of the associativity issue occurs when two processors perform a redundant computation that is executed in different order, so they get slightly different answers, although both answers are considered accurate. The bug occurs if a conditional branch compares to a floating-point number and the two processors take different branches when common sense reasoning suggests they should take the same branch.

3.7 Real Stuff: Floating Point in the x86

The main differences between the ARM and x86 arithmetic instructions are found in floating-point instructions. The x86 floating-point architecture is different from ARM and from all other computers in the world.

The x86 Floating-Point Architecture

The Intel 8087 floating-point coprocessor was announced in 1980. This architecture extended the 8086 with about 60 floating-point instructions.

Intel provided a stack architecture with its floating-point instructions: loads push numbers onto the stack, operations find operands in the two top elements of the stack, and stores can pop elements off the stack.

Intel supplemented this stack architecture with instructions and addressing modes that allow the architecture to have some of the benefits of a register-memory model. In addition to finding operands in the top two elements of the stack, one operand can be in memory or in one of the seven registers on-chip below the top of the stack. Thus, a complete stack instruction set is supplemented by a limited set of register-memory instructions. This hybrid is still a restricted register-memory model, however, since loads always move data to the top of the stack while incrementing the top-of-stack pointer, and stores can only move the top of stack to memory. Intel uses the notation ST to indicate the top of stack, and ST(i) to represent the ith register below the top of stack.

Another novel feature of this architecture is that the operands are wider in the register stack than they are stored in memory, and all operations are performed at this wide internal precision. Unlike the maximum of 64 bits on ARM, the x86 floating-point operands on the stack are 80 bits wide. Numbers are automatically converted to the internal 80-bit format on a load and converted back to the appropriate size on a store. This double extended precision is not supported by programming languages, although it has been useful to programmers of mathematical software. Memory data can be 32-bit (single precision) or 64-bit (double precision) floating-point numbers. The register-memory version of these instructions will then convert the memory operand to this Intel 80-bit format before performing the operation. The data transfer instructions also will automatically convert 16- and 32-bit integers to floating point, and vice versa, for integer loads and stores.

The x86 floating-point operations can be divided into four major classes:

1. Data movement instructions, including load, load constant, and store
2. Arithmetic instructions, including add, subtract, multiply, divide, square root, and absolute value
3. Comparison, including instructions to send the result to the integer processor so that it can branch
4. Transcendental instructions, including sine, cosine, log, and exponentiation

Figure 3.18 shows some of the 60 floating-point operations. Note that we get even more combinations when we include the operand modes for these operations. Figure 3.19 shows the many options for floating-point add.
Data transfer Arithmetic Compare Transcendental F{I}LD mem/ST(i) F{I}ST{P} mem/ ST(i) FLDPI FLD1 F{I}ADD{P} mem/ST(i) F{I}SUB{R}{P} mem/ST(i) F{I}COM{P} F{I}UCOM{P}{P} FPATAN F2XM1 F{I}MUL{P} mem/ST(i) F{I}DIV{R}{P} mem/ST(i) FSTSW AX/mem FCOS FPTAN FLDZ FSQRT FPREM FABS FSIN FRNDINT FYL2X FIGURE 3.18 The floating-point instructions of the x86. We use the curly brackets {} to show optional variations of the basic operations: {I} means there is an integer version of the instruction, {P} means this variation will pop one operand off the stack after the operation, and {R} means reverse the order of the operands in this operation. The first column shows the data transfer instructions, which move data to memory or to one of the registers below the top of the stack. The last three operations in the first column push constants on the stack: pi, 1.0, and 0.0. The second column contains the arithmetic operations described above. Note that the last three operate only on the top of stack. The third column is the compare instructions. Since there are no special floating-point branch instructions, the result of the compare must be transferred to the integer CPU via the FSTSW instruction, either into the AX register or into memory, followed by an SAHF instruction to set the condition codes. The floating-point comparison can then be tested using integer branch instructions. The final column gives the higher-level floating-point operations. Not all combinations suggested by the notation are provided. Hence, F{I}SUB{R}{P} operations represent these instructions found in the x86: FSUB, FISUB, FSUBR, FISUBR, FSUBP, FSUBRP. For the integer subtract instructions, there is no pop (FISUBP) or reverse pop (FISUBRP). 04-Ch03-P374750.indd 260 7/3/09 9:00:49 AM 3.7 Instruction Operands FADD Real Stuff: Floating Point in the x86 Comment Both operands in stack; result replaces top of stack. FADD ST(i) One source operand is ith register below the top of stack; result replaces the top of stack. FADD ST(i), ST One source operand is the top of stack; result replaces ith register below the top of stack. FADD mem32 One source operand is a 32-bit location in memory; result replaces the top of stack. FADD mem64 One source operand is a 64-bit location in memory; result replaces the top of stack. FIGURE 3.19 261 The variations of operands for floating-point add in the x86. The floating-point instructions are encoded using the ESC opcode of the 8086 and the postbyte address specifier (see Figure 2.45). The memory operations reserve 2 bits to decide whether the operand is a 32- or 64-bit floating point or a 16- or 32-bit integer. Those same 2 bits are used in versions that do not access memory to decide whether the stack should be popped after the operation and whether the top of stack or a lower register should get the result. In the past, floating-point performance of the x86 family lagged far behind other computers. As a result, Intel created a more traditional floating-point architecture as part of SSE2. The Intel Streaming SIMD Extension 2 (SSE2) Floating-Point Architecture Chapter 2 notes that in 2001 Intel added 144 instructions to its architecture, including double precision floating-point registers and operations. It includes eight 64-bit registers that can be used for floating-point operands, giving the compiler a different target for floating-point operations than the unique stack architecture. Compilers can choose to use the eight SSE2 registers as floating-point registers like those found in other computers. 
AMD expanded the number to 16 registers as part of AMD64, which Intel relabeled EM64T for its use. Figure 3.20 summarizes the SSE and SSE2 instructions.

In addition to holding a single precision or double precision number in a register, Intel allows multiple floating-point operands to be packed into a single 128-bit SSE2 register: four single precision or two double precision. Thus, the 16 floating-point registers for SSE2 are actually 128 bits wide. If the operands can be arranged in memory as 128-bit aligned data, then 128-bit data transfers can load and store multiple operands per instruction. This packed floating-point format is supported by arithmetic operations that can operate simultaneously on four singles (PS) or two doubles (PD). This architecture more than doubles performance over the stack architecture.

    Data transfer                         Arithmetic                        Compare
    MOV{A/U}{SS/PS/SD/PD} xmm, mem/xmm    ADD{SS/PS/SD/PD} xmm, mem/xmm     CMP{SS/PS/SD/PD}
    MOV{H/L}{PS/PD} xmm, mem/xmm          SUB{SS/PS/SD/PD} xmm, mem/xmm     MAX{SS/PS/SD/PD} mem/xmm
                                          MUL{SS/PS/SD/PD} xmm, mem/xmm     MIN{SS/PS/SD/PD} mem/xmm
                                          DIV{SS/PS/SD/PD} xmm, mem/xmm
                                          SQRT{SS/PS/SD/PD} mem/xmm

FIGURE 3.20 The SSE/SSE2 floating-point instructions of the x86. xmm means one operand is a 128-bit SSE2 register, and mem/xmm means the other operand is either in memory or it is an SSE2 register. We use the curly brackets {} to show optional variations of the basic operations: {SS} stands for Scalar Single precision floating point, or one 32-bit operand in a 128-bit register; {PS} stands for Packed Single precision floating point, or four 32-bit operands in a 128-bit register; {SD} stands for Scalar Double precision floating point, or one 64-bit operand in a 128-bit register; {PD} stands for Packed Double precision floating point, or two 64-bit operands in a 128-bit register; {A} means the 128-bit operand is aligned in memory; {U} means the 128-bit operand is unaligned in memory; {H} means move the high half of the 128-bit operand; and {L} means move the low half of the 128-bit operand.

Thus mathematics may be defined as the subject in which we never know what we are talking about, nor whether what we are saying is true. Bertrand Russell, Recent Words on the Principles of Mathematics, 1901

3.8 Fallacies and Pitfalls

Arithmetic fallacies and pitfalls generally stem from the difference between the limited precision of computer arithmetic and the unlimited precision of natural arithmetic.

Fallacy: Just as a left shift instruction can replace an integer multiply by a power of 2, a right shift is the same as an integer division by a power of 2.

Recall that a binary number x, where xi means the ith bit, represents the number

    . . . + (x3 × 2^3) + (x2 × 2^2) + (x1 × 2^1) + (x0 × 2^0)

Shifting the bits of x right by n bits would seem to be the same as dividing by 2^n. And this is true for unsigned integers. The problem is with signed integers. For example, suppose we want to divide −5ten by 4ten; the quotient should be −1ten. The two's complement representation of −5ten is

    1111 1111 1111 1111 1111 1111 1111 1011two

According to this fallacy, shifting right by two should divide by 4ten (2^2):

    0011 1111 1111 1111 1111 1111 1111 1110two

With a 0 in the sign bit, this result is clearly wrong. The value created by the shift right is actually 1,073,741,822ten instead of −1ten.
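The fallacy is easy to reproduce in C; this sketch assumes a 32-bit int, and the cast forces the zero-filling logical shift that the fallacy relies on.

    #include <stdio.h>

    int main(void)
    {
        int x = -5;

        /* Logical (zero-filling) right shift of the two's complement bit pattern. */
        printf("logical shift right by 2: %d\n", (int)((unsigned int)x >> 2));  /* 1073741822 */

        /* What integer division actually produces. */
        printf("-5 / 4:                   %d\n", x / 4);                        /* -1 */
        return 0;
    }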
A solution would be to have an arithmetic right shift that extends the sign bit instead of shifting in 0s. Indeed, ARM has another optional shift for the second operand of data processing instructions that performs an arithmetic right shift: ASR. A 2-bit arithmetic shift right of −5ten produces 1111 1111 1111 1111 1111 1111 1111 1110two The result is −2ten instead of −1ten; close, but no cigar. Fallacy: Only theoretical mathematicians care about floating-point accuracy. Newspaper headlines of November 1994 prove this statement is a fallacy (see Figure 3.21). The following is the inside story behind the headlines. The Pentium uses a standard floating-point divide algorithm that generates multiple quotient bits per step, using the most significant bits of divisor and dividend to guess the next 2 bits of the quotient. The guess is taken from a lookup table containing −2, −1, 0, +1, or +2. The guess is multiplied by the divisor and subtracted from the remainder to generate a new remainder. Like nonrestoring FIGURE 3.21 A sampling of newspaper and magazine articles from November 1994, including the New York Times, San Jose Mercury News, San Francisco Chronicle, and Infoworld. The Pentium floating-point divide bug even made the “Top 10 List” of the David Letterman Late Show on television. Intel eventually took a $300 million write-off to replace the buggy chips. 04-Ch03-P374750.indd 263 7/3/09 9:00:49 AM 264 Chapter 3 Arithmetic for Computers division, if a previous guess gets too large a remainder, the partial remainder is adjusted in a subsequent pass. Evidently, there were five elements of the table from the 80486 that Intel thought could never be accessed, and they optimized the PLA to return 0 instead of 2 in these situations on the Pentium. Intel was wrong: while the first 11 bits were always correct, errors would show up occasionally in bits 12 to 52, or the 4th to 15th decimal digits. The following is a timeline of the Pentium bug morality play. 04-Ch03-P374750.indd 264 ■ July 1994: Intel discovers the bug in the Pentium. The actual cost to fix the bug was several hundred thousand dollars. Following normal bug fix procedures, it will take months to make the change, reverify, and put the corrected chip into production. Intel planned to put good chips into production in January 1995, estimating that 3 to 5 million Pentiums would be produced with the bug. ■ September 1994: A math professor at Lynchburg College in Virginia, Thomas Nicely, discovers the bug. After calling Intel technical support and getting no official reaction, he posts his discovery on the Internet. It quickly gained a following, and some pointed out that even small errors become big when multiplied by big numbers: the fraction of people with a rare disease times the population of Europe, for example, might lead to the wrong estimate of the number of sick people. ■ November 7, 1994: Electronic Engineering Times puts the story on its front page, which is soon picked up by other newspapers. ■ November 22, 1994: Intel issues a press release, calling it a “glitch.” The Pentium “can make errors in the ninth digit. . . . Even most engineers and financial analysts require accuracy only to the fourth or fifth decimal point. Spreadsheet and word processor users need not worry. . . . There are maybe several dozen people that this would affect. So far, we’ve only heard from one. . . . 
[Only] theoretical mathematicians (with Pentium computers purchased before the summer) should be concerned.” What irked many was that customers were told to describe their application to Intel, and then Intel would decide whether or not their application merited a new Pentium without the divide bug. ■ December 5, 1994: Intel claims the flaw happens once in 27,000 years for the typical spreadsheet user. Intel assumes a user does 1000 divides per day and multiplies the error rate assuming floating-point numbers are random, which is one in 9 billion, and then gets 9 million days, or 27,000 years. Things begin to calm down, despite Intel neglecting to explain why a typical customer would access floating-point numbers randomly. ■ December 12, 1994: IBM Research Division disputes Intel’s calculation of the rate of errors (you can access this article by visiting www.mkp.com/ books_catalog/cod/links.htm). IBM claims that common spreadsheet programs, recalculating for 15 minutes a day, could produce Pentium-related errors as often as once every 24 days. IBM assumes 5000 divides per second, for 15 minutes, yielding 4.2 million divides per day, and does not assume 7/3/09 9:00:50 AM 3.9 Concluding Remarks 265 random distribution of numbers, instead calculating the chances as one in 100 million. As a result, IBM immediately stops shipment of all IBM personal computers based on the Pentium. Things heat up again for Intel. ■ December 21, 1994: Intel releases the following, signed by Intel’s president, chief executive officer, chief operating officer, and chairman of the board: “We at Intel wish to sincerely apologize for our handling of the recently publicized Pentium processor flaw. The Intel Inside symbol means that your computer has a microprocessor second to none in quality and performance. Thousands of Intel employees work very hard to ensure that this is true. But no microprocessor is ever perfect. What Intel continues to believe is technically an extremely minor problem has taken on a life of its own. Although Intel firmly stands behind the quality of the current version of the Pentium processor, we recognize that many users have concerns. We want to resolve these concerns. Intel will exchange the current version of the Pentium processor for an updated version, in which this floating-point divide flaw is corrected, for any owner who requests it, free of charge anytime during the life of their computer.” Analysts estimate that this recall cost Intel $500 million, and Intel engineers did not get a Christmas bonus that year. This story brings up a few points for everyone to ponder. How much cheaper would it have been to fix the bug in July 1994? What was the cost to repair the damage to Intel’s reputation? And what is the corporate responsibility in disclosing bugs in a product so widely used and relied upon as a microprocessor? In April 1997, another floating-point bug was revealed in the Pentium Pro and Pentium II microprocessors. When the floating-point-to-integer store instructions (fist, fistp) encounter a negative floating-point number that is too large to fit in a 16- or 32-bit word after being converted to integer, they set the wrong bit in the FPO status word (precision exception instead of invalid operation exception). To Intel’s credit, this time they publicly acknowledged the bug and offered a software patch to get around it—quite a different reaction from what they did in 1994. 
3.9 Concluding Remarks A side effect of the stored-program computer is that bit patterns have no inherent meaning. The same bit pattern may represent a signed integer, unsigned integer, floating-point number, instruction, and so on. It is the instruction that operates on the word that determines its meaning. Computer arithmetic is distinguished from paper-and-pencil arithmetic by the constraints of limited precision. This limit may result in invalid operations through 04-Ch03-P374750.indd 265 7/3/09 9:00:50 AM 266 Chapter 3 Arithmetic for Computers calculating numbers larger or smaller than the predefined limits. Such anomalies, called “overflow” or “underflow,” may result in exceptions or interrupts, emergency events similar to unplanned subroutine calls. Chapter 4 discusses exceptions in more detail. Floating-point arithmetic has the added challenge of being an approximation of real numbers, and care needs to be taken to ensure that the computer number selected is the representation closest to the actual number. The challenges of imprecision and limited representation are part of the inspiration for the field of numerical analysis. The recent switch to parallelism will shine the searchlight on numerical analysis again, as solutions that were long considered safe on sequential computers must be reconsidered when trying to find the fastest algorithm for parallel computers that still achieves a correct result. Over the years, computer arithmetic has become largely standardized, greatly enhancing the portability of programs. Two’s complement binary integer arithmetic and IEEE 754 binary floating-point arithmetic are found in the vast majority of computers sold today. For example, every desktop computer sold since this book was first printed follows these conventions. With the explanation of computer arithmetic in this chapter comes a description of much more of the ARM instruction set. Figure 3.22 lists the ARM instructions ARM core instructions Name Format add subtract move AND OR NOT logical shift left NOT logical shift right load register store register load register halfword store register halfword load register byte store register byte swap (atomic update) ADD SUB MOV AND ORR MVN LSL MVN LSR LDR STR LDRH STRH LDRB STRB SWP DP DP DP DP DP DP DP DP DP DT DT DT DT DT DT DT branch on x (x = eq, ne, lt, le, gt, ge) compare branch (always) BEQ BR CMP B DP BR branch and link BL BR FIGURE 3.22 04-Ch03-P374750.indd 266 ARM arithmetic core multiply floating-point add single floating-point add double floating-point subtract single floating-point subtract double floating-point multiply single floating-point multiply double floating-point divide single floating-point divide double load word to floating-point single store word to floating-point single load word to floating-point double store word to floating-point double floating-point compare single floating-point compare double FP Move Status (for conditional branch) Name Format MUL FADDS FADDD FSUBS FSUBD FMULS FMULD FDIVS FDIVD FLDS FSTS FLDD FSTD FCMPS FCMPD FMSTAT DP R R R R R R R R I I I I R R The ARM instruction set. This book concentrates on the instructions like those in the left column. 7/3/09 9:00:50 AM 267 3.9 Concluding Remarks covered in this chapter and Chapter 2. We call the set of instructions on the lefthand side of the figure the ARM core. The instructions on the right we call the ARM arithmetic core. For the application version of the ARM Instruction set, floating-point instructions are becoming standard. 
On the left of Figure 3.23 are the instructions the ARM processor executes that are not found in Figure 3.22. Figure 3.24 gives the popularity of the ARM instructions for SPEC2006 integer and floating-point benchmarks. All instructions are listed that were responsible for at least 0.3% of the instructions executed. Remaining ARMv3-v6 Name exclusive or (Rn ⊕ Rm) bit clear (Rn & ∼ Rm) arithmetic shift right (operation) rotate right (operation) count leading zeros reverse subtract add with carry sutract with carry reverse sutract with carry load register signed byte load register signed halfword swap byte (atomic update) load multiple store multiple compare nagative (Rn + Rm) test equal (Rn ⊕ Rm) test equal (Rn & Rm) multiply and add EOR BIC ASR ROR CLZ RSB ADC SBC RSC LDRSB LDRSH SWPB LDM STM CMN TEQ TST MLA FP move(S or D) FP convert integer to FP (S or D) FP convert FP (S or D) to integer FP square root (S or D) FP absolute value (S or D) FP negate (S or D) FP convert (S or D) FP compare w. exceptions (S or D) FP compare to zero w. exceptions (S or D) move from SP FP to integer move to SP FP from integer move from High half DP FP to integer move to High half DP FP from integer move from Low half DP FP to integer move to Low half DP FP from integer load multiple FP store multiple FP multiply and add long multiply - 64 bit (S or Uns.) long multiply and add (S or Uns.) load byte with user priviledge mode load word with user priviledge mode store byte with user priviledge mode store word with user priviledge mode coprocessor data operation load coprocessor register store coprocessor register move coprocessor register to regiister SMULL SMLAL LDRBT LDRT STRBT STRT CDP LDC STC MRC multiply and subtract negated multiply and add negated multiply and subtract move register to coprocessor regiister MCR move status register to regiister MRS move register to status regiister MSR breakpoint (cause exception) BKPT software inter. (cause exception) SWI Remaining ARM VFP FCPYF FSITOF FTOSIF FSQRTF FABSF FNEGF FCVTFF FCMPEF FCMPEZF FMRS FMSR FMRDH FMDHR FMRDL FMDLR LDMF STMF FMACF FMSCF FNMACF FNMSCF FIGURE 3.23 Remaining ARM instructions. F means single (S) or double (D) precision floating-point instructions, and S means signed (S) and unsigned (U) versions. The underscore represents the letter to include to represent that datatype. 04-Ch03-P374750.indd 267 7/3/09 9:00:50 AM 268 Chapter 3 Core ARM add subtract and or not move load register store register load register byte store register byte conditional branch branch and link compare Name ADD SUB AND ORR MVN MOV LDR STR LDRB STRB Bcc BL CMP Arithmetic for Computers Integer 14.2% 2.2% 0.9% 5.0% 0.4% 10.4% 18.6% 7.6% 3.7% 0.6% 17.0% 0.7% 17.5% Fl. pt. 10.7% 0.6% 0.3% 1.4% 0.2% 3.4% 5.8% 2.0% 0.1% 0.0% 4.0% 0.2% 3.5% Arithmetic core + ARMv4 FP add double FP subtract double FP multiply double FP divide double FP add single FP subtract single FP multiply single FP divide single load word to FP double store word to FP double load word to FP single store word to FP single floating-point compare double multiply load half store half Name add.d sub.d mul.d div.d add.s sub.s mul.s div.s l.d s.d l.s s.s c.x.d mul lhu sh Integer 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 1.3% 0.1% Fl. pt. 10.6% 4.9% 15.0% 0.2% 1.5% 1.8% 2.4% 0.2% 17.5% 4.9% 4.2% 1.1% 0.6% 0.2% 0.0% 0.0% FIGURE 3.24 The frequency of the ARM instructions for SPEC2006 integer and floating point. 
All instructions that accounted for at least 1% of the instructions are included in the table. Extrapolated from measuremements of MIPS programs. Note that although programmers and compiler writers have a rich menu of options, ARM core instructions dominate integer SPEC2006 execution, and the integer core plus arithmetic core dominate SPEC2006 floating point, as the table below shows. Instruction subset ARM core ARM arithmetic core Remaining ARM Integer Fl. pt. 98% 2% 0% 31% 66% 3% For the rest of the book, we concentrate on the core instructions—the integer instruction set excluding multiply and divide—to make the explanation of computer design easier. We use the MIPS instruction set core, although there are a few differences between it and ARM, as it is a little simpler. However, the same techniques apply to ARM. As you can see, the core includes the most popular instructions; be assured that understanding a computer that runs the core will give you sufficient background to understand even more ambitious computers. Gresham’s Law (“Bad money drives out Good”) for computers would say, “The Fast drives out the Slow even if the Fast is wrong.” W. Kahan, 1992 04-Ch03-P374750.indd 268 3.10 Historical Perspective and Further Reading This section surveys the history of the floating point going back to von Neumann, including the surprisingly controversial IEEE standards effort, plus the rationale for the 80-bit stack architecture for floating point in the x86. See Section 3.10. 7/3/09 9:00:50 AM 3.11 3.11 269 Exercises Never give in, never give in, never, never, never—in nothing, great or small, large or petty—never give in. Exercises Contributed by Matthew Farrens, UC Davis Exercise 3.1 The book shows how to add and subtract binary and decimal numbers. However, other numbering systems were also very popular when dealing with computers. Octal (base 8) numbering system was one of these. The following table shows pairs of octal numbers. A B a. 5323 2275 b. 0147 3457 Winston Churchill, address at Harrow School, 1941 3.1.1 [5] <3.2> What is the sum of A and B if they represent unsigned 12-bit octal numbers? The result should be written in octal. Show your work. 3.1.2 [5] <3.2> What is the sum of A and B if they represent signed 12-bit octal numbers stored in sign-magnitude format? The result should be written in octal. Show your work. 3.1.3 [10] <3.2> Convert A into a decimal number, assuming it is unsigned. Repeat assuming it stored in sign-magnitude format. Show your work. The following table also shows pairs of octal numbers. A B a. 2762 2032 b. 2646 1066 3.1.4 [5] <3.2> What is A − B if they represent unsigned 12-bit octal numbers? The result should be written in octal. Show your work. 3.1.5 [5] <3.2> What is A − B if they represent signed 12-bit octal numbers stored in sign-magnitude format? The result should be written in octal. Show your work. 3.1.6 [10] <3.2> Convert A into a binary number. What makes base 8 (octal) an attractive numbering system for representing values in computers? 04-Ch03-P374750.indd 269 7/3/09 9:00:50 AM 270 Chapter 3 Arithmetic for Computers Exercise 3.2 Hexadecimal (base 16) is also a commonly used numbering system for representing values in computers. In fact, it has become much more popular than octal. The following table shows pairs of hexadecimal numbers. A B a. 0D34 DD17 b. BA1D 3617 3.2.1 [5] <3.2> What is the sum of A and B if they represent unsigned 16-bit hexadecimal numbers? The result should be written in hexadecimal. Show your work. 
3.2.2 [5] <3.2> What is the sum of A and B if they represent signed 16-bit hexadecimal numbers stored in sign-magnitude format? The result should be written in hexadecimal. Show your work. 3.2.3 [10] <3.2> Convert A into a decimal number, assuming it is unsigned. Repeat assuming it stored in sign-magnitude format. Show your work. The following table also shows pairs of hexadecimal numbers. A B a. BA7C 241A b. AADF 47BE 3.2.4 [5] <3.2> What is A − B if they represent unsigned 16-bit hexadecimal numbers? The result should be written in hexadecimal. Show your work. 3.2.5 [5] <3.2> What is A − B if they represent signed 16-bit hexadecimal numbers stored in sign-magnitude format? The result should be written in hexadecimal. Show your work. 3.2.6 [10] <3.2> Convert A into a binary number. What makes base 16 (hexadecimal) an attractive numbering system for representing values in computers? Exercise 3.3 Overflow occurs when a result is too large to be represented accurately given a finite word size. Underflow occurs when a number is too small to be represented correctly—a negative result when doing unsigned arithmetic, for example. (The case when a positive result is generated by the addition of two negative integers is 04-Ch03-P374750.indd 270 7/3/09 9:00:50 AM 3.11 Exercises 271 also referred to as underflow by many, but in this textbook, that is considered an overflow). The following table shows pairs of decimal numbers. A B a. 69 90 b. 102 44 3.3.1 [5] <3.2> Assume A and B are unsigned 8-bit decimal integers. Calculate A − B. Is there overflow, underflow, or neither? 3.3.2 [5] <3.2> Assume A and B are signed 8-bit decimal integers stored in signmagnitude format. Calculate A + B. Is there overflow, underflow, or neither? 3.3.3 [5] <3.2> Assume A and B are signed 8-bit decimal integers stored in signmagnitude format. Calculate A − B. Is there overflow, underflow, or neither? The following table also shows pairs of decimal numbers. A B a. 200 103 b. 247 237 3.3.4 [10] <3.2> Assume A and B are signed 8-bit decimal integers stored in two’s-complement format. Calculate A + B using saturating arithmetic. The result should be written in decimal. Show your work. 3.3.5 [10] <3.2> Assume A and B are signed 8-bit decimal integers stored in two’s-complement format. Calculate A − B using saturating arithmetic. The result should be written in decimal. Show your work. 3.3.6 [10] <3.2> Assume A and B are unsigned 8-bit integers. Calculate A + B using saturating arithmetic. The result should be written in decimal. Show your work. Exercise 3.4 Let’s look in more detail at multiplication. We will use the numbers in the following table. A B a. 50 23 b. 66 04 04-Ch03-P374750.indd 271 7/3/09 9:00:50 AM 272 Chapter 3 Arithmetic for Computers 3.4.1 [20] <3.3> Using a table similar to that shown in Figure 3.7, calculate the product of the octal unsigned 6-bit integers A and B using the hardware described in Figure 3.4. You should show the contents of each register on each step. 3.4.2 [20] <3.3> Using a table similar to that shown in Figure 3.7, calculate the product of the hexadecimal unsigned 8-bit integers A and B using the hardware described in Figure 3.6. You should show the contents of each register on each step. 3.4.3 [60] <3.3> Write an ARM assembly language program to calculate the product of unsigned integers A and B, using the approach described in Figure 3.4. The following table shows pairs of octal numbers. A B a. 54 67 b. 
30 07 3.4.4 [30] <3.3> When multiplying signed numbers, one way to get the correct answer is to convert the multiplier and multiplicand to positive numbers, save the original signs, and then adjust the final value accordingly. Using a table similar to that shown in Figure 3.7, calculate the product of A and B using the hardware described in Figure 3.4. You should show the contents of each register on each step, and include the step necessary to produce the correctly signed result. Assume A and B are stored in 6-bit sign-magnitude format. 3.4.5 [30] <3.3> When shifting a register one bit to the right, there are several ways to decide what the new entering bit should be. It can always be a 0, or always a 1, or the incoming bit could be the one that is being pushed out of the right side (turning a shift into a rotate), or the value that is already in the leftmost bit can simply be retained (called an arithmetic shift right, because it preserves the sign of the number that is being shifted.) Using a table similar to that shown in Figure 3.7, calculate the product of the 6-bit two’s-complement numbers A and B using the hardware described in Figure 3.6. The right shifts should be done using an arithmetic shift right. Note that the algorithm described in the text will need to be modified slightly to make this work—in particular, things must be done differently if the multiplier is negative. You can find details by searching the Web. Show the contents of each register on each step. 3.4.6 [60] <3.3> Write an ARM assembly language program to calculate the product of the signed integers A and B. State if you are using the approach given in 3.4.4 or 3.4.5. 04-Ch03-P374750.indd 272 7/3/09 9:00:50 AM 3.11 Exercises 273 Exercise 3.5 For many reasons, we would like to design multipliers that require less time. Many different approaches have been taken to accomplish this goal. In the following table, A represents the bit width of an integer, and B represents the number of time units (tu) taken to perform a step of an operation. A B a. 4 3 tu b. 32 7 tu 3.5.1 [10] <3.3> Calculate the time necessary to perform a multiply using the approach given in Figures 3.4 and 3.5 if an integer is A bits wide and each step of the operation takes B time units. Assume that in step 1a an addition is always performed—either the multiplicand will be added, or a zero will be. Also assume that the registers have already been initialized (you are just counting how long it takes to do the multiplication loop itself). If this is being done in hardware, the shifts of the multiplicand and multiplier can be done simultaneously. If this is being done in software, they will have to be done one after the other. Solve for each case. 3.5.2 [10] <3.3> Calculate the time necessary to perform a multiply using the approach described in the text (31 adders stacked vertically) if an integer is A bits wide and an adder takes B time units. 3.5.3 [20] <3.3> Calculate the time necessary to perform a multiply using the approach given in Figure 3.8, if an integer is A bits wide and an adder takes B time units. Exercise 3.6 In this exercise we will look at a couple of other ways to improve the performance of multiplication, based primarily on doing more shifts and fewer arithmetic operations. The following table shows pairs of hexadecimal numbers. A B a. 24 c9 b. 
41 18 04-Ch03-P374750.indd 273 7/3/09 9:00:50 AM 274 Chapter 3 Arithmetic for Computers 3.6.1 [20] <3.3> As discussed in the text, one possible performance enhancement is to do a shift and add instead of an actual multiplication. Since 9 ⋅ 6, for example, can be written (2 ⋅ 2 ⋅ 2 + 1) ⋅ 6, we can calculate 9 ⋅ 6 by shifting 6 to the left three times and then adding 6 to that result. Show the best way to calculate A ⋅ B using shifts and adds/subtracts. Assume that A and B are 8-bit unsigned integers. 3.6.2 [20] <3.3> Show the best way to calculate A ⋅ B using shift and adds, if A and B are 8-bit signed integers stored in sign-magnitude format. 3.6.3 [60] <3.3> Write an ARM assembly language program that performs a multiplication on signed integers using shift and adds, as described in 3.6.1. The following table shows further pairs of hexadecimal numbers. A B a. 42 36 b. 9F 8E 3.6.4 [30] <3.3> Booth’s algorithm is another approach to reducing the number of arithmetic operations necessary to perform a multiplication. This algorithm has been around for years, and details about how it works are available on the Web. Basically, it assumes that a shift takes less time than an add or subtract, and uses this fact to reduce the number of arithmetic operations necessary to perform a multiply. It works by identifying runs of 1s and 0s, and performing shifts during the runs. Find a description of the algorithm and explain in detail how it works. 3.6.5 [30] <3.3> Show the step-by-step result of multiplying A and B, using Booth’s algorithm. Assume A and B are 8-bit two’s-complement integers, stored in hexadecimal format. 3.6.6 [60] <3.3> Write an ARM assembly language program to perform the multiplication of A and B using Booth’s algorithm. Exercise 3.7 Let’s look in more detail at division. We will use the octal numbers in the following table. 04-Ch03-P374750.indd 274 A B a. 50 23 b. 25 44 7/3/09 9:00:50 AM 3.11 Exercises 275 3.7.1 [20] <3.4> Using a table similar to that shown in Figure 3.11, calculate A divided by B using the hardware described in Figure 3.9. You should show the contents of each register on each step. Assume A and B are unsigned 6-bit integers. 3.7.2 [30] <3.4> Using a table similar to that shown in Figure 3.11, calculate A divided by B using the hardware described in Figure 3.12. You should show the contents of each register on each step. Assume A and B are unsigned 6-bit integers. This algorithm requires a slightly different approach than that shown in Figure 3.10. You will want to think hard about this, do an experiment or two, or else go to the Web to figure out how to make this work correctly. (Hint: one possible solution involves using the fact that Figure 3.12 implies the remainder register can be shifted either direction). 3.7.3 [60] <3.4> Write an ARM assembly language program to calculate A divided by B, using the approach described in Figure 3.9. Assume A and B are unsigned 6-bit integers. The following table shows further pairs of octal numbers. A B a. 55 24 b. 36 51 3.7.4 [30] <3.4> Using a table similar to that shown in Figure 3.11, calculate A divided by B using the hardware described in Figure 3.9. You should show the contents of each register on each step. Assume A and B are 6-bit signed integers in sign-magnitude format. Be sure to include how you are calculating the signs of the quotient and remainder. 3.7.5 [30] <3.4> Using a table similar to that shown in Figure 3.11, calculate A divided by B using the hardware described in Figure 3.12. 
You should show the contents of each register on each step. Assume A and B are 6-bit signed integers in sign-magnitude format. Be sure to include how you are calculating the signs of the quotient and remainder. 3.7.6 [60] <3.4> Write a ARM assembly language program to calculate A divided by B, using the approach described in Figure 3.12. Assume A and B are signed integers. Exercise 3.8 Figure 3.10 describes a restoring division algorithm, because when subtracting the divisor from the remainder produces a negative result, the divisor is added back to the remainder (thus restoring the value). However, there are other algorithms that 04-Ch03-P374750.indd 275 7/3/09 9:00:50 AM 276 Chapter 3 Arithmetic for Computers have been developed that eliminate the extra addition. Many references to these algorithms are easily found on the Web. We will explore these algorithms using the pairs of octal numbers in the following table. A B a. 75 12 b. 52 37 3.8.1 [30] <3.4> Using a table similar to that shown in Figure 3.11, calculate A divided by B using nonrestoring division. You should show the contents of each register on each step. Assume A and B are 6-bit unsigned integers. 3.8.2 [60] <3.4> Write an ARM assembly language program to calculate A divided by B using nonrestoring division. Assume A and B are 6-bit signed (two’s-complement) integers. 3.8.3 [60] <3.4> How does the performance of restoring and non-restoring division compare? Demonstrate by showing the number of steps necessary to calculate A divided by B using each method. Assume A and B are 6-bit signed (sign-magnitude) integers. Writing a program to perform the restoring and nonrestoring divisions is acceptable. The following table shows further pairs of octal numbers. A B a. 17 14 b. 70 23 3.8.4 [30] <3.4> Using a table similar to that shown in Figure 3.11, calculate A divided by B using nonperforming division. You should show the contents of each register on each step. Assume A and B are 6-bit unsigned integers. 3.8.5 [60] <3.4> Write a ARM assembly language program to calculate A divided by B using nonperforming division. Assume A and B are 6-bit two’s complement signed integers. 3.8.6 [60] <3.4> How does the performance of non-restoring and nonperforming division compare? Demonstrate by showing the number of steps necessary to calculate A divided by B using each method. Assume A and B are signed 6-bit integers, stored in sign-magnitude format. Writing a program to perform the nonperforming and non-restoring divisions is acceptable. 04-Ch03-P374750.indd 276 7/3/09 9:00:51 AM 3.11 Exercises 277 Exercise 3.9 Division is so time-consuming and difficult that the CRAY T3E Fortran Optimization guide states, “The best strategy for division is to avoid it whenever possible.” This exercise looks at the following different strategies for performing divisions. a. restoration division b. SRT division 3.9.1 [30] <3.4> Describe the algorithm in detail. 3.9.2 [60] <3.4> Use a flow chart (or a high-level code snippet) to describe how the algorithm works. 3.9.3 [60] <3.4> Write a ARM assembly language program to perform a division using the algorithm. Exercise 3.10 In a Von Neumann architecture, groups of bits have no intrinsic meanings by themselves. What a bit pattern represents depends entirely on how it is used. The following table shows bit patterns expressed in hexademical notation. a. 0x24A60004 b. 0xAFBF0000 3.10.1 [5] <3.5> What decimal number does the bit pattern represent if it is a two’s-complement integer? An unsigned integer? 
3.10.2 [10] <3.5> If this bit pattern is placed into the Instruction Register, what ARM instruction will be executed?

3.10.3 [10] <3.5> What decimal number does the bit pattern represent if it is a floating-point number? Use the IEEE 754 standard.

The following table shows decimal numbers.

a. –1609.5
b. –938.8125

3.10.4 [10] <3.5> Write down the binary representation of the decimal number, assuming the IEEE 754 single precision format.

3.10.5 [10] <3.5> Write down the binary representation of the decimal number, assuming the IEEE 754 double precision format.

3.10.6 [10] <3.5> Write down the binary representation of the decimal number assuming it was stored using the single precision IBM format (base 16, instead of base 2, with 7 bits of exponent).

Exercise 3.11

In the IEEE 754 floating-point standard the exponent is stored in "bias" (also known as "excess-N") format. This approach was selected because we want an all-zero pattern to be as close to zero as possible. Because of the use of a hidden 1, if we were to represent the exponent in two's-complement format, an all-zero pattern would actually be the number 1! (Remember, anything raised to the zeroth power is 1, so 1.0 × 2^0 = 1.) There are many other aspects of the IEEE 754 standard that exist in order to help hardware floating-point units work more quickly. However, in many older machines floating-point calculations were handled in software, and therefore other formats were used. The following table shows decimal numbers.

a. 5.00736125 × 10^5
b. –2.691650390625 × 10^–2

3.11.1 [20] <3.5> Write down the binary bit pattern assuming a format similar to that employed by the DEC PDP-8 (the leftmost 12 bits are the exponent stored as a two's-complement number, and the rightmost 24 bits are the mantissa stored as a two's-complement number). No hidden 1 is used. Comment on how the range and accuracy of this 36-bit pattern compares to the single and double precision IEEE 754 standards.

3.11.2 [20] <3.5> NVIDIA has a "half" format, which is similar to IEEE 754 except that it is only 16 bits wide. The leftmost bit is still the sign bit, the exponent is 5 bits wide and stored in excess-16 format, and the mantissa is 10 bits long. A hidden 1 is assumed. Write down the bit pattern assuming this format. Comment on how the range and accuracy of this 16-bit pattern compares to the single precision IEEE 754 standard.

3.11.3 [20] <3.5> The Hewlett-Packard 2114, 2115, and 2116 used a format with the leftmost 16 bits being the mantissa stored in two's-complement format, followed by another 16-bit field in which the leftmost 8 bits are an extension of the mantissa (making the mantissa 24 bits long), and the rightmost 8 bits represent the exponent. However, in an interesting twist, the exponent was stored in sign-magnitude format with the sign bit on the far right! Write down the bit pattern assuming this format. No hidden 1 is used. Comment on how the range and accuracy of this 32-bit pattern compares to the single precision IEEE 754 standard.

The following table shows pairs of decimal numbers.

   A                      B
a. –1.278 × 10^3          –3.90625 × 10^–1
b. 2.3109375 × 10^1       6.391601562 × 10^–1

3.11.4 [20] <3.5> Calculate the sum of A and B by hand, assuming A and B are stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in the text). Assume 1 guard bit, 1 round bit, and 1 sticky bit, and round to the nearest even. Show all the steps.
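To make the 16-bit format of 3.11.2 concrete, the following C routine decodes such a pattern into a double. It is only a sketch of the format as the exercise states it (1 sign bit, a 5-bit exponent in excess-16, a 10-bit mantissa with a hidden 1); it ignores zeros, subnormals, infinities, and NaNs, which the exercise does not define, and the excess-16 exponent differs from the bias of 15 used by the later IEEE 754 half-precision format. The function name and the test pattern are illustrative only.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Decode a value in the 16-bit format described in 3.11.2:
   sign(1) | exponent(5, excess-16) | mantissa(10, hidden 1).
   Special encodings (zero, subnormal, infinity, NaN) are not handled. */
static double half16_to_double(uint16_t h)
{
    int    sign = (h >> 15) & 1;
    int    e    = (h >> 10) & 0x1F;        /* stored exponent            */
    int    frac =  h        & 0x3FF;       /* 10 fraction bits           */
    double sig  = 1.0 + frac / 1024.0;     /* restore the hidden 1       */
    double val  = ldexp(sig, e - 16);      /* sig * 2^(e - 16)           */
    return sign ? -val : val;
}

int main(void)
{
    /* 0x4500: sign 0, exponent 17 - 16 = 1, significand 1.25, so 2.5 */
    printf("%g\n", half16_to_double(0x4500));
    return 0;
}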
3.11.5 [60] <3.5> Write an ARM assembly language program to calculate the sum of A and B, assuming they are stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in the text). Assume 1 guard bit, 1 round bit, and 1 sticky bit, and round to the nearest even.

3.11.6 [60] <3.5> Write an ARM assembly language program to calculate the sum of A and B, assuming they are stored using the format described in 3.11.1. Now modify the program to calculate the sum assuming the format described in 3.11.3. Which format is easier for a programmer to deal with? How do they each compare to the IEEE 754 format? (Do not worry about sticky bits for this question.)

Exercise 3.12

Floating-point multiplication is even more complicated and challenging than floating-point addition, and both pale in comparison to floating-point division.

   A                      B
a. 5.66015625 × 10^0      8.59375 × 10^0
b. 6.18 × 10^2            5.796875 × 10^1

3.12.1 [30] <3.5> Calculate the product of A and B by hand, assuming A and B are stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in the text). Assume 1 guard bit, 1 round bit, and 1 sticky bit, and round to the nearest even. Show all the steps; however, as is done in the example in the text, you can do the multiplication in human-readable format instead of using the techniques described in 3.4 through 3.6. Indicate if there is overflow or underflow. Write your answer as a 16-bit pattern, and also as a decimal number. How accurate is your result? How does it compare to the number you get if you do the multiplication on a calculator?

3.12.2 [60] <3.5> Write an ARM assembly language program to calculate the product of A and B, assuming they are stored in IEEE 754 format. Indicate if there is overflow or underflow. (Remember, IEEE 754 assumes 1 guard bit, 1 round bit, and 1 sticky bit, and rounds to the nearest even.)

3.12.3 [60] <3.5> Write an ARM assembly language program to calculate the product of A and B, assuming they are stored using the format described in 3.11.1. Now modify the program to calculate the product assuming the format described in 3.11.3. Which format is easier for a programmer to deal with? How do they each compare to the IEEE 754 format? (Do not worry about sticky bits for this question.)

The following table shows further pairs of decimal numbers.

   A                      B
a. 3.264 × 10^3           6.52 × 10^2
b. –2.27734375 × 10^0     1.154375 × 10^2

3.12.4 [30] <3.5> Calculate by hand A divided by B. Show all the steps necessary to achieve your answer. Assume there is a guard bit, a round bit, and a sticky bit, and use them if necessary. Write the final answer in both 16-bit floating-point format and in decimal, and compare the decimal result to that which you get if you use a calculator.

The Livermore Loops are a set of floating-point-intensive kernels taken from scientific programs run at Lawrence Livermore Laboratory. The following table identifies individual kernels from the set.

a. Livermore Loop 1
b. Livermore Loop 7

3.12.5 [60] <3.5> Write the loop in ARM assembly language.

3.12.6 [60] <3.5> Describe in detail one technique for performing floating-point division in a digital computer. Be sure to include references to the sources you used.

Exercise 3.13

Operations performed on fixed-point integers behave the way one expects: the commutative, associative, and distributive laws all hold. This is not always the case when working with floating-point numbers, however. Let's first look at the associative law.
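Before working the 16-bit cases by hand, it may help to see the same effect with ordinary C doubles. The values below are chosen only to make the rounding visible; they are not taken from the exercise table that follows. The point is that once b + c rounds back to b, the grouping of the additions changes the final answer.

#include <stdio.h>

int main(void)
{
    /* Illustrative values, not from the exercise table below. */
    double a = 1.0e20, b = -1.0e20, c = 3.14;

    double left  = (a + b) + c;   /* 0.0 + 3.14, giving 3.14            */
    double right = a + (b + c);   /* (b + c) rounds back to b, giving 0 */

    printf("(a + b) + c = %g\n", left);
    printf("a + (b + c) = %g\n", right);
    return 0;
}

Exercises 3.13.1 through 3.13.3 ask you to reproduce the same phenomenon, by hand, in the 16-bit format.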
The following table shows sets of decimal numbers.

   A                      B                     C
a. –1.6360 × 10^4         1.6360 × 10^4         1.0 × 10^0
b. 2.865625 × 10^1        4.140625 × 10^–1      1.2140625 × 10^1

3.13.1 [20] <3.2, 3.5, 3.6> Calculate (A + B) + C by hand, assuming A, B, and C are stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in the text). Assume 1 guard bit, 1 round bit, and 1 sticky bit, and round to the nearest even. Show all the steps, and write your answer both in 16-bit floating-point format and in decimal.

3.13.2 [20] <3.2, 3.5, 3.6> Calculate A + (B + C) by hand, assuming A, B, and C are stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in the text). Assume 1 guard bit, 1 round bit, and 1 sticky bit, and round to the nearest even. Show all the steps, and write your answer both in 16-bit floating-point format and in decimal.

3.13.3 [10] <3.2, 3.5, 3.6> Based on your answers to 3.13.1 and 3.13.2, does (A + B) + C = A + (B + C)?

The following table shows further sets of decimal numbers.

   A                      B                     C
a. 4.8828125 × 10^–4      1.768 × 10^3          2.50125 × 10^2
b. 4.721875 × 10^1        2.809375 × 10^1       3.575 × 10^1

3.13.4 [30] <3.3, 3.5, 3.6> Calculate (A ⋅ B) ⋅ C by hand, assuming A, B, and C are stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in the text). Assume 1 guard bit, 1 round bit, and 1 sticky bit, and round to the nearest even. Show all the steps, and write your answer both in 16-bit floating-point format and in decimal.

3.13.5 [30] <3.3, 3.5, 3.6> Calculate A ⋅ (B ⋅ C) by hand, assuming A, B, and C are stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in the text). Assume 1 guard bit, 1 round bit, and 1 sticky bit, and round to the nearest even. Show all the steps, and write your answer both in 16-bit floating-point format and in decimal.

3.13.6 [10] <3.3, 3.5, 3.6> Based on your answers to 3.13.4 and 3.13.5, does (A ⋅ B) ⋅ C = A ⋅ (B ⋅ C)?

Exercise 3.14

The associative law is not the only one that does not always hold in dealing with floating-point numbers. There are other oddities that occur as well. The following table shows sets of decimal numbers.

   A                      B                     C
a. 1.5234375 × 10^–1      2.0703125 × 10^–1     9.96875 × 10^1
b. –2.7890625 × 10^1      –8.088 × 10^3         1.0216 × 10^4

3.14.1 [30] <3.2, 3.3, 3.5, 3.6> Calculate A ⋅ (B + C) by hand, assuming A, B, and C are stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in the text). Assume 1 guard bit, 1 round bit, and 1 sticky bit, and round to the nearest even. Show all the steps, and write your answer both in 16-bit floating-point format and in decimal.

3.14.2 [30] <3.2, 3.3, 3.5, 3.6> Calculate (A ⋅ B) + (A ⋅ C) by hand, assuming A, B, and C are stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in the text). Assume 1 guard bit, 1 round bit, and 1 sticky bit, and round to the nearest even. Show all the steps, and write your answer both in 16-bit floating-point format and in decimal.

3.14.3 [10] <3.2, 3.3, 3.5, 3.6> Based on your answers to 3.14.1 and 3.14.2, does (A ⋅ B) + (A ⋅ C) = A ⋅ (B + C)?

The following table shows pairs, each consisting of a fraction and an integer.

   A      B
a. 1/3    3
b. –1/7   7

3.14.4 [10] <3.5> Using the IEEE 754 floating-point format, write down the bit pattern that would represent A. Can you represent A exactly?

3.14.5 [10] <3.2, 3.3, 3.5, 3.6> What do you get if you add A to itself B times? What is A × B? Are they the same?
What should they be? 3.14.6 [60] <3.2, 3.3, 3.4, 3.5, 3.6> What do you get if you take the square root of B and then multiply that value by itself? What should you get? Do for both 04-Ch03-P374750.indd 282 7/3/09 9:00:51 AM 3.11 Exercises 283 single and double precision floating point numbers. (Write a program to do these calculations). Exercise 3.15 Binary numbers are used in the mantissa field, but they do not have to be. IBM used base 16 numbers, for example, in some of their floating point formats. There are other approaches that are possible as well, each with their own particular advantages and disadvantages. The following table shows fractions to be represented in various floating point formats. a. 1/2 b. 1/9 3.15.1 [10] <3.5, 3.6> Write down the bit pattern in the mantissa assuming a floating point format that uses binary numbers in the mantissa (essentially what you have been doing in this chapter). Assume there are 24 bits, and you do not need to normalize. Is this representation exact? 3.15.2 [10] <3.5, 3.6> Write down the bit pattern in the mantissa assuming a floating-point format that uses Binary Coded Decimal (base 10) numbers in the mantissa instead of base 2. Assume there are 24 bits, and you do not need to normalize. Is this representation exact? 3.15.3 [10] <3.5, 3.6> Write down the bit pattern assuming that we are using base 15 numbers in the mantissa instead of base 2. (Base 16 numbers use the symbols 0–9 and A–F. Base 15 numbers would use 0–9 and A–E.) Assume there are 24 bits, and you do not need to normalize. Is this representation exact? 3.15.4 [20] <3.5, 3.6> Write down the bit pattern assuming that we are using base 30 numbers in the mantissa instead of base 2. (Base 16 numbers use the symbols 0–9 and A–F. Base 30 numbers would use 0–9 and A–T.) Assume there are 20 bits, and you do not need to normalize. Is this representation exact? Do you see any advantage to using this approach? §3.2, page 219: 3. §3.5, page 257: 3. 04-Ch03-P374750.indd 283 Answers to Check Yourself 7/3/09 9:00:51 AM B1 A P P E N D I X ARM and Thumb Assembler Instructions Andrew Sloss, ARM; Dominic Symes, ARM; Chris Wright, Ultimodule Inc. B1.1 Using This Appendix B1-3 B1.2 Syntax B1-4 B1.3 Alphabetical List of ARM and Thumb Instructions B1-8 B1.4 ARM Assembler Quick Reference B1-49 B1.5 GNU Assembler Quick Reference B1-60 This appendix lists the ARM and Thumb instructions available up to, and including, ARM architecture ARMv6, which was just released at the time of writing. We list the operations in alphabetical order for easy reference. Sections B1.5 and B1.4 give quick reference guides to the ARM and GNU assemblers armasm and gas. We have designed this appendix for practical programming use, both for writing assembly code and for interpreting disassembly output. It is not intended as a definitive architectural ARM reference. In particular, we do not list the exhaustive details of each instruction bitmap encoding and behavior. For this level of detail, see the ARM Architecture Reference Manual, edited by David Seal, published by Addison Wesley. We do give a summary of ARM and Thumb instruction set encodings in Appendix B2. B1.1 Using This Appendix Each appendix entry begins by enumerating the available instructions formats for the given instruction class. For example, the first entry for the instruction class ADD reads 1. 
ADD<cond>{S} Rd, Rn, #<rotated_immed> ARMv1 B1-4 Appendix B1 ARM and Thumb Assembler Instructions The fields <cond> and <rotated_immed> are two of a number of standard fields described in Section B1.2. Rd and Rn denote ARM registers. The instruction is only executed if the condition <cond> is passed. Each entry also describes the action of the instruction if it is executed. The {S} denotes that you may apply an optional S suffix to the instruction. Finally, the right-hand column specifies that the instruction is available from the listed ARM architecture version onwards. Table B1.1 shows the entries possible for this column. TABLE B1.1 Instruction types. Type Meaning ARMvX 32-bit ARM instruction first appearing in ARM architecture version X THUMBvX 16-bit Thumb instruction first appearing in Thumb architecture version X MACRO Assembler pseudoinstruction Note that there is no direct correlation between the Thumb architecture number and the ARM architecture number. The THUMBv1 architecture is used in ARMv4T processors; the THUMBv2 architecture, in ARMv5T processors; and the THUMBv3 architecture, in ARMv6 processors. Each instruction definition is followed by a notes section describing restrictions on the use of the instruction. When we make a statement such as “ Rd must not be pc,’’ we mean that the description of the function only applies when this condition holds. If you break the condition, then the instruction may be unpredictable or have predictable effects that we haven’t had space to describe here. Well-written programs should not need to break these conditions. B1.2 Syntax We use the following syntax and abbreviations throughout this appendix. Optional Expressions ■ {<expr>} is an optional expression. For example, LDR{B} is shorthand for LDR or LDRB. ■ {<exp1>|<exp2>|...|<expN>}, including at least one “|’’ divider, is a list of expressions. One of the listed expressions must appear. For example LDR{B|H} is shorthand for LDRB or LDRH. It does not include LDR. We would represent these three possibilities by LDR{|B|H}. B1.2 Syntax Register Names ■ Rd, Rn, Rm, Rs, RdHi, RdLo represent ARM registers in the range r0 to r15. ■ Ld, Ln, Lm, Ls represent low-numbered ARM registers in the range r0 to r7. ■ Hd, Hn, Hm, Hs represent high-numbered ARM registers in the range r8 to r15. ■ Cd, Cn, Cm represent coprocessor registers in the range c0 to c15. ■ sp, lr, pc are names for r13, r14, r15, respectively. ■ Rn[a] denotes bit a of register Rn. Therefore Rn[a] (Rn » a) & 1. ■ Rn[a:b] denotes the a 1 b bit value stored in bits a to b of Rn inclusive. ■ RdHi:RdLo represents the 64-bit value with high 32 RDHi bits and low 32 bits RdLo. Values Stored as Immediates ■ <immedN> is any unsigned N-bit immediate. For example, <immed8> represents any integer in the range 0 to 255. <immed5>*4 represents any integer in the list 0, 4, 8, ..., 124. ■ <addressN> is an address or label stored as a relative offset. The address must be in the range pc 2N address pc 2N. Here, pc is the address of the instruction plus eight for ARM state, or the address of the instruction plus four for Thumb state. The address must be four-byte aligned if the destination is an ARM instruction or two-byte aligned if the destination is a Thumb instruction. ■ <A-B> represents any integer in the range A to B inclusive. ■ <rotated_immed> is any 32-bit immediate that can be represented as an eight- bit unsigned value rotated right (or left) by an even number of bit positions. In other words, <rotated_immed> = <immed8> ROR (2*<immed4>). 
For example 0xff, 0x104, 0xe0000005, and 0x0bc00000 are possible values for <rotated_immed>. However, 0x101 and 0x102 are not. When you use a rotated immediate, <shifter_C> is set according to Table B1.3 (discussed in Section Shift Operations). A nonzero rotate may cause a change in the carry flag. For this reason, you can also specify the rotation explicitly, using the assembly syntax <immed8>, 2*<immed4>. Condition Codes and Flags ■ <cond> represents any of the standard ARM condition codes. Table B1.2 shows the possible values for <cond>. B1-5 B1-6 Appendix B1 ARM and Thumb Assembler Instructions TABLE B1.2 ARM condition mnemonics. Instruction is executed when <cond> cpsr condition ALways TRUE EQ EQual (last result zero) Z==1 NE Not Equal (last result nonzero) Z==0 {CS|HS} Carry Set, unsigned Higher or Same (following a compare) C==1 {CC|LO} Carry Clear, unsigned LOwer (following a comparison) C==0 MI MInus (last result negative) N==1 PL PLus (last result greater than or equal to zero) N==0 VS V flag Set (signed overflow on last result) V==1 VC V flag Clear (no signed overflow on last result) V==0 HI unsigned HIgher (following a comparison) LS unsigned Lower or Same (following a comparison) GE signed Greater than or Equal N==V LT signed Less Than N!=V GT signed Greater Than LE signed Less than or Equal NV NeVer—ARMv1 and ARMv2 only—DO NOT USE {|AL} C==1 && Z==0 C==0 || Z==1 N==V && Z==0 N!=V || Z==1 FALSE ■ <SignedOverflow> is a flag indicating that the result of an arithmetic operation suffered from a signed overflow. For example, 0x7fffffff + 1 = 0x80000000 produces a signed overflow because the sum of two positive 32-bit signed integers is a negative 32-bit signed integer. The V flag in the cpsr typically records signed overflows. ■ <UnsignedOverflow> is a flag indicating that the result of an arithmetic operation suffered from an unsigned overflow. For example, 0xffffffff + 1 = 0 produces an overflow in unsigned 32-bit arithmetic. The C flag in the cpsr typically records unsigned overflows. ■ <NoUnsignedOverflow> is the same as 1 – <UnsignedOverflow>. B1.2 Syntax ■ <Zero> is a flag indicating that the result of an arithmetic or logical operation is zero. The Z flag in the cpsr typically records the zero condition. ■ <Negative> is a flag indicating that the result of an arithmetic or logical operation is negative. In other words, <Negative> is bit 31 of the result. The N flag in the cpsr typically records this condition. Shift Operations ■ <imm_shift> represents a shift by an immediate specified amount. The possible shifts are LSL #<0-31>, LSR #<1-32>, ASR #<1-32>, ROR #<131>, and RRX. See Table B1.3 for the actions of each shift. ■ <reg_shift> represents a shift by a register-specified amount. The possible shifts are LSL Rs, LSR Rs, ASR Rs, and ROR Rs. Rs must not be pc . The bottom eight bits of Rs are used as the shift value k in Table B1.3. Bits Rs[31:8] are ignored. ■ <shift> is shorthand for <imm_shift> or <reg_shift>. ■ <shifted_Rm> is shorthand for the value of Rm after the specified shift has been applied. See Table B1.3. ■ <shifter_C> is shorthand for the carry value output by the shifting circuit. See Table B1.3. TABLE B1.3 Barrel shifter circuit outputs for different shift types. 
Shift k range <shifted_Rm> <shifter_C> LSL k k0 Rm C (from cpsr) LSL k 1 k 31 Rm « k Rm[32-k] LSL k k 32 0 Rm[0] LSL k k 33 0 0 LSR k k0 Rm C LSR k 1 k 31 (unsigned)Rm » k Rm[k-1] LSR k k 32 0 Rm[31] LSR k k 33 0 0 ASR k k0 Rm C B1-7 B1-8 Appendix B1 ARM and Thumb Assembler Instructions Shift k range <shifted_Rm> <shifter_C> ASR k 1 k 31 (signed)Rm»k Rm[k-1] ASR k k 32 Rm[31] Rm[31] ROR k k0 Rm C ROR k 1 k 31 ((unsigned)Rm » k)| (Rm » (32-k)) Rm[k-1] ROR k k 32 Rm ROR (k & 31) Rm[(k-1) & 31] (C « 31) | ((unsigned)Rm » 1) Rm[0] RRX Alphabetical List of ARM and Thumb Instructions B1.3 Instructions are listed in alphabetical order. However, where signed and unsigned variants of the same operation exist, the main entry is under the signed variant. ADC Add two 32-bit values and carry 1. 2. 3. ADC<cond>{S} ADC<cond>{S} ADC Rd, Rn, #<rotated_immed> Rd, Rn, Rm {, <shift>} Ld, Lm ARMv1 ARMv1 THUMBv1 Action Effect on the cpsr 1. Rd = Rn + <rotated_immed> + C 2. Rd = Rn + <shifted_Rm> + C 3. Ld = Ld + Lm + C Updated if S suffix specified Updated if S suffix specified Updated (see Notes below) Notes ■ If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z = <Zero>, C = <UnsignedOverflow>, V = <SignedOverflow>. ■ If Rd is pc, then the instruction effects a jump to the calculated address. If the operation updates the cpsr, then the processor mode must have an spsr; in this case, the cpsr is set to the value of the spsr. ■ If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes. B1.3 Alphabetical List of ARM and Thumb Instructions Examples ADDS ADC ADCS r0, r0, r2 r1, r1, r3 r0, r0, r0 ; first half of a 64-bit add ; second half of a 64-bit add ; shift r0 left, inserting carry (RLX) ADD Add two 32-bit values 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. ADD<cond>S ADD<cond>S ADD ADD ADD ADD ADD ADD ADD ADD ADD Rd, Rd, Ld, Ld, Ld, Hd, Ld, Hd, Ld, Ld, sp, Rn, #<rotated_immed> Rn, Rm {, <shift>} Ln, #<immed3> #<immed8> Ln, Lm Lm Hm Hm pc, #<immed8>*4 sp, #<immed8>*4 #<immed7>*4 Action 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. ARMv1 ARMv1 THUMBv1 THUMBv1 THUMBv1 THUMBv1 THUMBv1 THUMBv1 THUMBv1 THUMBv1 THUMBv1 Effect on the cpsr Rd Rd Ld Ld Ld Hd Ld Hd Ld Ld sp = = = = = = = = = = = Rn Rn Ln Ld Ln Hd Ld Hd pc sp sp + + + + + + + + + + + <rotated_immed> <shifted_Rm> <immed3> <immed8> Lm Lm Hm Hm 4*<immed8> 4*<immed8> 4*<immed7> Updated if S Updated if S Updated (see Updated (see Updated (see Preserved Preserved Preserved Preserved Preserved Preserved suffix suffix Notes Notes Notes specified specified below) below) below) Notes ■ If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z = <Zero>, C = <UnsignedOverflow>, V = <SignedOverflow>. ■ If Rd or Hd is pc, then the instruction effects a jump to the calculated address. If the operation updates the cpsr, then the processor mode must have an spsr; in this case, the cpsr is set to the value of the spsr. ■ If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes. ■ If Hd or Hm is pc, then the value used is the address of the instruction plus four bytes. B1-9 B1-10 Appendix B1 ARM and Thumb Assembler Instructions Examples ADD ADDS ADD ADD ADD ADDS ADR r0, r0, r0, pc, r0, pc, r1, r2, r0, pc, r1, lr, #4 r2 r0, LSL #1 r0, LSL #2 r2, ROR r3 #4 ; ; ; ; ; ; r0 = r0 = r0 = skip r0 = jump r1 + 4 r2 + r2 and flags updated 3*r0 r0+1 instructions r1 + ((r2r»3)|(r2«(32-r3)) to lr+4, restoring the cpsr Address relative 1. 
ADR{L}<cond> Rd, <address> MACRO This is not an ARM instruction, but an assembler macro that attempts to set Rd to the value <address> using a pc-relative calculation. The ADR instruction macro always uses a single ARM (or Thumb) instruction. The long-version ADRL always uses two ARM instructions and so can access a wider range of addresses. If the assembler cannot generate an instruction sequence reaching the address, then it will generate an error. The following example shows how to call the function pointed to by r9. We use ADR to set lr to the return address; in this case, it will assemble to ADD lr, pc, #4. Recall that pc reads as the address of the current instruction plus eight in this case. ADR lr, return_address MOV r0, #0 BX r9 return_address ; ; ; ; set return address set a function argument call the function resume AND Logical bitwise AND of two 32-bit values 1. AND<cond>{S} 2. AND<cond>{S} 3. AND Rd, Rn, #<rotated_immed> Rd, Rn, Rm {, <shift>} Ld, Lm Action 1. Rd = Rn & <rotated_immed> 2. Rd = Rn & <shifted_Rm> 3. Ld = Ld & Lm ARMv1 ARMv1 THUMBv1 Effect on the cpsr Updated if S suffix specified Updated if S suffix specified Updated (see Notes below) Notes ■ If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3), V is preserved. ■ If Rd is pc, then the instruction effects a jump to the calculated address. If the operation updates the cpsr, then the processor mode must have an spsr; in this case, the cpsr is set to the value of the spsr. B1.3 ■ Alphabetical List of ARM and Thumb Instructions If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes. Example AND ANDS r0, r0, #0xFF r0, r0, #1«31 ; extract the lower 8 bits of a byte ; extract sign bit ASR Arithmetic shift right for Thumb (see MOV for the ARM equivalent) 1. ASR Ld, Lm, #<immed5> 2. ASR Ld, Ls Action THUMBv1 THUMBv1 Effect on the cpsr 1. Ld = Lm ASR #<immed5> 2. Ld = Ld ASR Ls[7:0] Updated (see Notes below) Updated Note ■ B The cpsr is updated: N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3). Branch relative 1. B<cond> 2. B<cond> 3. B <address25> <address8> <address11> ARMv1 THUMBv1 THUMBv1 Branches to the given address or label. The address is stored as a relative offset. Examples B label BGT loop BIC ; branch unconditionally to a label ; conditionally continue a loop Logical bit clear (AND NOT) of two 32-bit values 1. BIC<cond>{S} Rd, Rn, #<rotated_immed> 2. BIC<cond>{S} Rd, Rn, Rm {, <shift>} 3. BIC Ld, Lm Action 1. Rd = Rn & ~<rotated_immed> 2. Rd = Rn & ~<shifted_Rm> 3. Ld = Ld & ~Lm ARMv1 ARMv1 THUMBv1 Effect on the cpsr Updated if S suffix specified Updated if S suffix specified Updated (see Notes below) B1-11 B1-12 Appendix B1 ARM and Thumb Assembler Instructions Notes ■ If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3), V is preserved. ■ If Rd is pc, then the instruction effects a jump to the calculated address. If the operation updates the cpsr, then the processor mode must have an spsr; in this case, the cpsr is set to the value of the spsr. ■ If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes. Examples BIC BKPT r0, r0, #1 « 22 ; clear bit 22 of r0 Breakpoint instruction 1. BKPT <immed16> ARMv5 2. BKPT <immed8> THUMBv2 The breakpoint instruction causes a prefetch data abort, unless overridden by debug hardware. The ARM ignores the immediate value. 
This immediate can be used to hold debug information such as the breakpoint number. Relative branch with link (subroutine call) BL 1. BL<cond> <address25> ARMv1 2. BL <address22> THUMBv1 Action Effect on the cpsr 1. lr = ret+0; pc = <address25> None 2. lr = ret+1; pc = <address22> None Note ■ These instructions set lr to the address of the following instruction ret plus the current cpsr T-bit setting. Therefore you can return from the subroutine using BX lr to resume execution address and ARM or Thumb state. Examples BL subroutine ; call subroutine (return with MOV pc,lr) BLVS overflow BLX ; call subroutine on an overflow Branch with link and exchange (subroutine call with possible state switch) 1. BLX <address25> ARMv5 B1.3 Alphabetical List of ARM and Thumb Instructions 2. BLX<cond> Rm ARMv5 3. BLX <address22> THUMBv2 4. BLX Rm THUMBv2 Action Effect on the cpsr 1. lr = ret+0; pc = <address25> T=1 (switch to Thumb state) 2. lr = ret+0; pc = Rm & 0xfffffffe T=Rm & 1 3. lr = ret+1; pc = <address22> T=0 (switch to ARM state) 4. lr = ret+1; pc = Rm & 0xfffffffe T=Rm & 1 Notes ■ These instructions set lr to the address of the following instruction ret plus the current cpsr T-bit setting. Therefore you can return from the subroutine using BX lr to resume execution address and ARM or Thumb state. ■ Rm must not be pc. ■ Rm & 3 must not be 2. This would cause a branch to an unaligned ARM instruction. Example BLX BLX BX BXJ thumb_code ; call a Thumb subroutine from ARM state r0 ; call the subroutine pointed to by r0 ; ARM code if r0 even, Thumb if r0 odd Branch with exchange (branch with possible state switch) 1. BX<cond> 2. BX 3. BXJ<cond> Rm Rm Rm ARMv4T THUMBv1 ARMv5J Action Effect on the cpsr 1. pc = Rm & 0xfffffffe 2. pc = Rm & 0xfffffffe 3. Depends on JE configuration bit T=Rm & 1 T=Rm & 1 J,T affected Notes ■ If Rm is pc and the instruction is word aligned, then Rm takes the value of the current instruction plus eight in ARM state or plus four in Thumb state. ■ Rm & 3 must not be 2. This would cause a branch to an unaligned ARM instruction. B1-13 B1-14 Appendix B1 ARM and Thumb Assembler Instructions ■ If the JE (Java Enable) configuration bit is clear, then BXJ behaves as a BX. Otherwise, the behavior is defined by the architecture of the Java Extension hardware. Typically it sets J = 1 in the cpsr and starts executing Java instructions from a general purpose register designated as the Java program counter jpc. Examples BX BX CDP lr r0 ; return from ARM or Thumb subroutine ; branch to ARM or Thumb function pointer r0 Coprocessor data processing operation 1. CDP<cond> <copro>, <op1>, Cd, Cn, Cm, <op2> 2. CDP2 <copro>, <op1>, Cd, Cn, Cm, <op2> ARMv2 ARMv5 These instructions initiate a coprocessor-dependent operation. <copro> is the number of the coprocessor in the range p0 to p15. The core takes an undefined instruction trap if the coprocessor is not present. The coprocessor operation specifiers <op1> and <op2>, and the coprocessor register numbers Cd, Cn, Cm, are interpreted by the coprocessor and ignored by the ARM. CDP2 provides an additional set of coprocessor instructions. CLZ Count leading zeros 1. CLZ<cond> Rd, Rm ARMv5 Rn is set to the maximum left shift that can be applied to Rm without unsigned overflow. Equivalently, this is the number of zeros above the highest one in the binary representation of Rm. If Rm = 0, then Rn is set to 32. 
The following example normalizes the value in r0 so that bit 31 is set CLZ r1, r0 MOV r0, r0, LSL r1 ; find normalization shift ; normalize so bit 31 is set (if r0!=0) CMN Compare negative 1. CMN<cond> 2. CMN<cond> 3. CMN Rn, #<rotated_immed> Rn, Rm {, <shift>} Ln, Lm ARMv1 ARMv1 THUMBv1 Action 1. cpsr flags set on the result of (Rn + <rotated_immed>) 2. cpsr flags set on the result of (Rn + <shifted_Rm>) 3. cpsr flags set on the result of (Ln + Lm) Notes ■ In the cpsr: N = <Negative>, Z = <Zero>, C = <Unsigned-Overflow>, V = <SignedOverflow>. These are the same flags as generated by CMP with the second operand negated. B1.3 ■ Alphabetical List of ARM and Thumb Instructions If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes. Example CMN r0, #3 ; compare r0 with -3 BLT label ; if (r0 ‹ 3) goto label CMP Compare two 32-bit integers 1. 2. 3. 4. CMP<cond> CMP<cond> CMP CMP Rn, Rn, Ln, Rn, #<rotated_immed> Rm {, <shift>} #<immed8> Rm ARMv1 ARMv1 THUMBv1 THUMBv1 Action 1. 2. 3. 4. cpsr cpsr cpsr cpsr flags flags flags flags set set set set on on on on the the the the result result result result of of of of (Rn (Rn (Ln (Rn - <rotated_immed>) <shifted_Rm>) <immed8>) Rm) Notes ■ In the cpsr: N = <Negative>, Z = <Zero>, C = <NoUnsigned-Overflow>, V = <SignedOverflow>. The carry flag is set this way because the subtract x – y is implemented as the add x + ~ y + 1. The carry flag is one if x + ~ y + 1 overflows. This happens when x y (equivalently when x – Ây doesn’t overflow). ■ If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes for ARM instructions, or plus four bytes for Thumb instructions. Example CMP BHS r0, r1, LSR#2 label ; compare r0 with (r1/4) ; if (r0 >= (r1/4)) goto label; CPS Change processor state; modifies selected bits in the cpsr 1. 2. 3. 4. 5. CPS CPSID CPSIE CPSID CPSIE #<mode> <flags> {, #<mode>} <flags> {, #<mode>} <flags> <flags> ARMv6 ARMv6 ARMv6 THUMBv3 THUMBv3 Action 1. cpsr[4:0] = <mode> 2. cpsr = cpsr | mask; { cpsr[4:0]=<mode> } B1-15 B1-16 Appendix B1 ARM and Thumb Assembler Instructions 3. cpsr = cpsr & ~mask; { cpsr[4:0]=<mode> } 4. cpsr = cpsr | mask 5. cpsr = cpsr & ~mask Bits are set in mask according to letters in the <flags> value as in Table B1.4. The ID (interrupt disable) variants mask interrupts by setting cpsr bits. The IE (interrupt enable) variants unmask interrupts by clearing cpsr bits. TABLE B1.4 CPS flags characters. Character CPY cpsr bit affected Bit set in mask a imprecise data Abort mask bit 0 100 = 1 << 8 i IRQ mask bit 0 080 = 1 << 7 f FIQ mask bit 0 040 = 1 << 6 Copy one ARM register to another without affecting the cpsr. 1. CPY<cond> 2. CPY Rd, Rm Rd, Rm ARMv6 THUMBv3 This assembles to MOV <cond> Rd, Rm except in the case of Thumb where Rd and Rm are low registers in the range r0 to r7. Then it is a new operation that sets Rd=Rm without affecting the cpsr. EOR Logical exclusive OR of two 32-bit values 1. EOR<cond>{S} Rd, Rn, #<rotated_immed> 2. EOR<cond>{S} Rd, Rn, Rm {, <shift>} 3. EOR Ld, Lm ARMv1 ARMv1 THUMBv1 Action Effect on the cpsr 1. Rd = Rn ^<rotated_immed> 2. Rd = Rn ^ <shifted_Rm> 3. Ld = Ld ^ Lm Updated if S suffix specified Updated if S suffix specified Updated (see Notes below) Notes ■ If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3), V is preserved. ■ If Rd is pc, then the instruction effects a jump to the calculated address. 
If the operation updates the cpsr, then the processor mode must have an spsr; in this case, the cpsr is set to the value of the spsr. B1.3 ■ Alphabetical List of ARM and Thumb Instructions If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes. Example EOR r0, r0, #1 16 ; toggle bit 16 LDC Load to coprocessor single or multiple 32-bit values 1. 2. 3. 4. 5. 6. LDC<cond>{L} LDC<cond>{L} LDC<cond>{L} LDC2{L} LDC2{L} LDC2{L} <copro>, <copro>, <copro>, <copro>, <copro>, <copro>, Cd, Cd, Cd, Cd, Cd, Cd, [Rn {, [Rn], [Rn], [Rn {, [Rn], [Rn], #{-}<immed8>*4}]{!} #{-}<immed8>*4 <option> #{-}<immed8>*4}]{!} #{-}<immed8>*4 <option> ARMv2 ARMv2 ARMv2 ARMv5 ARMv5 ARMv5 These instructions initiate a memory read, transferring data to the given coprocessor. <copro> is the number of the coprocessor in the range p0 to p15. The core takes an undefined instruction trap if the coprocessor is not present. The memory read consists of a sequence of words from sequentially increasing addresses. The initial address is specified by the addressing mode in Table B1.5. The coprocessor controls the number of words transferred, up to a maximum limit of 16 words. The fields {L} and Cd are interpreted by the coprocessor and ignored by the ARM. Typically Cd specifies the destination coprocessor register for the transfer. The <option> field is an eight-bit integer enclosed in { }. Its interpretation is coprocessor dependent. TABLE B1.5 LDC addressing modes. Addressing format Address accessed Value written back to Rn [Rn {,# { - } <immed>}] Rn + {{ - } <immed>} Rn preserved [Rn {,# { - } <immed>}]! Rn + {{ - } <immed>} Rn + {{ - }<immed>} [Rn], # { - } <immed> Rn Rn + { - }<immed> [Rn], < option> Rn Rn preserved If the address is not a multiple of four, then the access is unaligned. The restrictions on unaligned accesses are the same as for LDM. LDM Load multiple 32-bit words from memory to ARM registers 1. LDM<cond><amode> 2. LDMIA Rn{!}, <register_list>{^} Rn!, <register_list> ARMv1 THUMBv1 These instructions load multiple words from sequential memory addresses. The <register_list> specifies a list of registers to load, enclosed in curly brackets B1-17 B1-18 Appendix B1 ARM and Thumb Assembler Instructions { }. Although the assembler allows you to specify the registers in the list in any order, the order is not stored in the instruction, so it is good practice to write the list in increasing order of register number because this is the usual order of the memory transfer. The following pseudocode shows the normal action of LDM. We use <register_ list>[i] to denote the register appearing at position i in the list, starting at 0 for the first register. This assumes that the list is in order of increasing register number. N = the number of registers in <register_list> start = the lowest address accessed given in Table B1.6 for (i=0; i<N; i++) <register_list>[i] = memory(start+i*4, 4); if (! specified) then update Rn according to Table B1.6 Note that memory(a, 4) returns the four bytes at address a packed according to the current processor data endianness. If a is not a multiple of four, then the load is unaligned. Because the behavior of an unaligned load depends on the architecture revision, memory system, and system coprocessor (CP15) configuration, it’s best to avoid unaligned loads if possible. 
Assuming that the external memory system does not abort unaligned loads, then the following rules usually apply: ■ If the core has a system coprocessor and bit 1 (A-bit) or bit 22 (U-bit) of CP15:c1:c0:0 is set, then unaligned load multiples cause an alignment fault data abort exception. ■ Otherwise the access ignores the bottom two address bits. Table B1.6 lists the possible addressing modes specified by <amode>. If you specify the !, then the base address register is updated according to Table B1.6; otherwise it is preserved. Note that the lowest register number is always read from the lowest address. The first half of the addressing mode mnemonics stands for Increment After, Increment Before, Decrement After, and Decrement Before, respectively. Increment modes load the registers sequentially forward, starting from address Rn (increment after) or Rn + 4 (increment before). Decrement modes have the same effect as if you loaded the register list backwards from sequentially TABLE B1.6 LDM addressing modes. Addressing mode Lowest address accessed Highest address accessed Value written back to Rn if ! specified {IA|FD} Rn Rn + N*4 - 4 Rn + N*4 {IB|ED} Rn + 4 Rn + N*4 Rn + N*4 {DA|FA} Rn N*4 + 4 Rn Rn N*4 {DB|EA} Rn N*4 Rn - 4 Rn N*4 B1.3 Alphabetical List of ARM and Thumb Instructions descending memory addresses, starting from address Rn (decrement after) or Rn – 4 (decrement before). The second half of the addressing mode mnemonics stands for the stack type you can implement with that address mode: Full Descending, Empty Descending, Full Ascending, and Empty Ascending, With a full stack, Rn points to the last stacked value; with an empty stack, Rn points to the first unused stack location. ARM stacks are usually full descending. You should use full descending or empty ascending stacks by preference, since LDC also supports these addressing modes. Notes ■ For Thumb (format 2), Rn and the register list registers must be in the range r0 to r7. ■ The number of registers N in the list must be nonzero. ■ Rn must not be pc. ■ Rn must not appear in the register list if ! (writeback) is specified. ■ If pc appears in the register list, then on ARMv5 and above the processor performs a BX to the loaded address. For ARMv4 and below, the processor branches to the loaded address. ■ If ^ is specified, then the operation is modified. The processor must not be in user or system mode. If pc is not in the register list, then the registers appearing in the register list refer to the user mode versions of the registers and writeback must not be specified. If pc is in the register list, then the spsr is copied to the cpsr in addition to the standard operation. ■ The time order of the memory accesses may depend on the implementation. Be careful when using a load multiple to access I/O locations where the access order matters. If the order matters, then check that the memory locations are marked as I/O in the page tables, do not cross page boundaries, and do not use pc in the register list. Examples LDMIA LDMDB LDMEQFD LDMFD LDMFD r4!, {r0, r1} r4!, {r0, r1} sp!, {r0, pc} sp, {sp}^ sp!, {r0-pc}^ ; ; ; ; ; r0=*r4, r1=*(r4+4), r4+=8 r1=*(r4-4), r0=*(r4-8), r4-=8 if (result zero) then unstack r0, pc load sp_usr from sp_current return from exception, restore cpsr LDR Load a single value from a virtual address in memory 1. LDR<cond>{|B} 2. LDR<cond>{|B} Rd, [Rn {, #{-}<immed12>}]{!} ARMv1 Rd, [Rn, {-}Rm {,<imm_shift>}]{!} ARMv1 B1-19 B1-20 Appendix B1 ARM and Thumb Assembler Instructions 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 
13. 14. 15. 16. 17. 18. 19. LDR<cond>{|B}{T} LDR<cond>{|B}{T} LDR<cond>{H|SB|SH} LDR<cond>{H|SB|SH} LDR<cond>{H|SB|SH} LDR<cond>{H|SB|SH} LDR<cond>D LDR<cond>D LDR<cond>D LDR<cond>D LDREX<cond> LDR{|B|H} LDR{|B|H|SB|SH} LDR LDR LDR<cond><type> LDR<cond> Rd, Rd, Rd, Rd, Rd, Rd, Rd, Rd, Rd, Rd, Rd, Ld, Ld, Ld, Ld, Rd, Rd, [Rn], #{-}<immed12> [Rn], {-}Rm {,<imm_shift>} [Rn, {, #{-}<immed8>}]{!} [Rn, {-}Rm]{!} [Rn], #{-}<immed8> [Rn], {-}Rm [Rn, {, #{-}<immed8>}]{!} [Rn, {-}Rm]{!} [Rn], #{-}<immed8> [Rn], {-}Rm [Rn] [Ln, #<immed5>*<size>] [Ln, Lm] [pc, #<immed8>*4] [sp, #<immed8>*4] <label> =<32-bit-value> ARMv1 ARMv1 ARMv4 ARMv4 ARMv4 ARMv4 ARMv5E ARMv5E ARMv5E ARMv5E ARMv6 THUMBv1 THUMBv1 THUMBv1 THUMBv1 MACRO MACRO Formats 1 to 17 load a single data item of the type specified by the opcode suffix, using a preindexed or postindexed addressing mode. Tables B1.7 and B1.8 show the different addressing modes and data types. TABLES B1.7 LDR Addressing Modes. Addressing format Address a accessed Value written back to Rn [Rn {,#{-}<immed>}] Rn + {{-}<immed>} Rn preserved [Rn {,#{-}<immed>}]! Rn + {{-}<immed>} Rn + {{-}<immed>} [Rn, {-}Rm {,<shift>}] Rn + {-}<shifted_Rm> Rn preserved [Rn, {-}Rm {,<shift>}]! Rn + {-}<shifted_Rm> Rn + {-}<shifted_Rm> [Rn], #{-}<immed> Rn Rn + {-}<immed> [Rn], {-}Rm {,<shift>} Rn Rn + {-}<shifted_Rm> In Table B1.8 memory(a, n) reads n sequential bytes from address a. The bytes are packed according to the configured processor data endianness. The function memoryT(a, n) performs the same access but with user mode privileges, regardless of the current processor mode. The function memoryEx(a, n) used by LDREX performs the access and marks the access as exclusive. If address a has the shared TLB attribute, then this marks address a as exclusive to the current processor and clears any other exclusive addresses for this processor. Otherwise the processor remembers that there is an outstanding exclusive access. Exclusivity only affects the action of the STREX instruction. B1.3 Alphabetical List of ARM and Thumb Instructions TABLES B1.8 LDR datatypes. Load Datatype <size> (bytes) Action LDR word 4 Rd = memory(a, 4) LDRB unsigned Byte 1 Rd = (zero-extend)memory(a, 1) LDRBT Byte Translated 1 Rd = (zero-extend)memoryT(a, 1) LDRD Double word 8 Rd = memory(a, 4) R(d+1) = memory(a+4, 4) LDREX word EXclusive 4 Rd = memoryEx(a, 4) LDRH unsigned Halfword 2 Rd = (zero-extend)memory(a, 2) LDRSB Signed Byte 1 Rd = (sign-extend)memory(a, 1) LDRSH Signed Halfword 2 Rd = (sign-extend)memory(a, 2) LDRT word Translated 4 Rd = memoryT(a, 4) If address a is not a multiple of <size>, then the load is unaligned. Because the behavior of an unaligned load depends on the architecture revision, memory system, and system coprocessor (CP15) configuration, it’s best to avoid unaligned loads if possible. Assuming that the external memory system does not abort unaligned loads, then the following rules usually apply. In the rules, A is bit 1 of system coprocessor register CP15:c1:c0:0, and U is bit 22 of CP15:c1:c0:0, introduced in ARMv6. If there is no system coprocessor, then A = U = 0. ■ If A = 1, then unaligned loads cause an alignment fault data abort exception except that word-aligned double-word loads are supported if U = 1. ■ If A = 0 and U = 1, then unaligned loads are supported for LDR{|T|H|SH}. Word-aligned loads are supported for LDRD. A non-word-aligned LDRD generates an alignment fault data abort. ■ If A = 0 and U = 0, then LDR and LDRT return the value memory(a & ~ 3, 4) ROR ((a&3)*8). 
All other unaligned operations are unpredictable but do not generate an alignment fault. Format 18 generates a pc-relative load accessing the address specified by <label>. In other words, it assembles to LDR<cond><type> Rd, [pc, #<offset>] whenever this instruction is supported and <offset>=<label>-pc is in range. Format 19 generates an instruction to move the given 32-bit value to the register Rd. Usually the instruction is LDR<cond> Rd, [pc, #<offset>], where the 32-bit value is stored in a literal pool at address pc+<offset>. Notes ■ For double-word loads (formats 9 to 12), Rd must be even and in the range r0 to r12. ■ If the addressing mode updates Rn, then Rd and Rn must be distinct. B1-21 B1-22 Appendix B1 ARM and Thumb Assembler Instructions ■ If Rd is pc, then <size> must be 4. Up to ARMv4, the core branches to the loaded address. For ARMv5 and above, the core performs a BX to the loaded address. ■ If Rn is pc, then the addressing mode must not update Rn . The value used for Rn is the address of the instruction plus eight bytes for ARM or four bytes for Thumb. ■ Rm must not be pc. ■ For ARMv6 use LDREX and STREX to implement semaphores rather than SWP. Examples LSL LDR LDRSH LDRB LDRD r0, r0, r0, r2, [r0] [r1], #4 [r1, #-8]! [r1] ; ; ; ; LDRSB r0, [r2, #55] LDRCC LDRB pc, [pc, r0, LSL #2] ; r0, [r1], -r2, LSL #8 ; LDR r0, =0x12345678 ; ; r0 = *(int*)r0; r0 = *(short*)r1; r1 += 4; r1 -= 8; r0 = *(char*)r1; r2 =* (int*)r1; r3 =* (int*)(r1+4); r0 = *(signed char*) (r2+55); if (C==0) goto *(pc+4*r0); r0 = *(char*)r1; r1 -= 256*r2; r0 = 0x12345678; Logical shift left for Thumb (see MOV for the ARM equivalent) 1. LSL Ld, Lm, #<immed5> 2. LSL Ld, Ls THUMBv1 THUMBv1 Action Effect on the cpsr 1. Ld = Lm LSL #<immed5> 2. Ld = Ld LSL Ls[7:0] Updated (see Note below) Updated Note ■ The cpsr is updated: N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3). LSR Logical shift right for Thumb (see MOV for the ARM equivalent) 1. LSR Ld, Lm, #<immed5> 2. LSR Ld, Ls THUMBv1 THUMBv1 Action Effect on the cpsr 1. Ld = Lm LSR #<immed5> 2. Ld = Ld LSR Ls[7:0] Updated (see Note below) Updated B1.3 Alphabetical List of ARM and Thumb Instructions Note ■ MCR MCRR The cpsr is updated: N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3). Move to coprocessor from an ARM register 1. 2. 3. 4. MCR<cond> MCR2 MCRR<cond> MCRR2 <copro>, <copro>, <copro>, <copro>, <op1>, <op1>, <op1>, <op1>, Rd, Rd, Rd, Rd, Cn, Cn, Rn, Rn, Cm {, <op2>} ARMv2 Cm {, <op2>} ARMv5 Cm ARMv5E Cm ARMv6 These instructions transfer the value of ARM register Rd to the indicated coprocessor. Formats 3 and 4 also transfer a second register Rn. <copro> is the number of the coprocessor in the range p0 to p15. The core takes an undefined instruction trap if the coprocessor is not present. The coprocessor operation specifiers <op1> and <op2>, and the coprocessor register numbers Cn, Cm, are interpreted by the coprocessor, and ignored by the ARM. Rd and Rn must not be pc. Coprocessor p15 controls memory management options. For example, the following code sequence enables alignment fault checking: MRC ORR MCR MLA p15, 0, r0, c1, c0, 0 r0, r0, #2 p15, 0, r0, c1, c0, 0 ; read the MMU register, c1 ; set the A bit ; write the MMU register, c1 Multiply with accumulate 1. MLA<cond>{S} Rd, Rm, Rs, Rn ARMv2 Action Effect on the cpsr 1. Rd = Rn + Rm*Rs Updated if S suffix supplied Notes ■ Rd is set to the lower 32 bits of the result. ■ Rd, Rm, Rs, Rn must not be pc. ■ Rd and Rm must be different registers. 
■ Implementations may terminate early on the value of the Rs operand. For this reason use small or constant values for Rs where possible. See Appendix B3. ■ If the cpsr is updated, then N = <Negative>, Z = <Zero>, C is unpredictable, and V is preserved. Avoid using the instruction MLAS because implementations often B1-23 B1-24 Appendix B1 ARM and Thumb Assembler Instructions impose penalty cycles for this operation. Instead use MLA followed by a compare, and schedule the compare to avoid multiply result use interlocks. MOV Move a 32-bit value into a register 1. 2. 3. 4. 5. 6. 7. MOV<cond>{S} MOV<cond>{S} MOV MOV MOV MOV MOV Rd, Rd, Ld, Ld, Hd, Ld, Hd, #<rotated_immed> Rm {, <shift>} #<immed8> Ln Lm Hm Hm Action 1. 2. 3. 4. 5. 6. 7. Rd Rd Ld Ld Hd Ld Hd ARMv1 ARMv1 THUMBv1 THUMBv1 THUMBv1 THUMBv1 THUMBv1 Effect on the cpsr = = = = = = = <rotated_immed> <shifted_Rm> <immed8> Ln Lm Hm Hm Updated if S Updated if S Updated (see Updated (see Preserved Preserved Preserved suffix suffix Notes Notes specified specified below) below) Notes ■ If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z = <Zero>, C = <shifter_C > (see Table B1.3), and V is preserved. ■ If Rd or Hd is pc, then the instruction effects a jump to the calculated address. If the operation updates the cpsr, then the processor mode must have an spsr; in this case, the cpsr is set to the value of the spsr. ■ If Rm is pc, then the value used is the address of the instruction plus eight bytes. ■ If Hm is pc, then the value used is the address of the instruction plus four bytes. Examples MOV MOV MOV MOVS MRC r0, r0, pc, pc, #0x00ff0000 ; r0 = 0x00ff0000 r1, LSL#2 ; r0 = 4*r1 lr ; return from subroutine (pc=lr) lr ; return from exception (pc=lr, cpsr=spsr) Move to ARM register from a coprocessor MRRC 1. MRC<cond> 2. MRC2 3. MRRC<cond> 4. MRRC2 <copro>, <copro>, <copro>, <copro>, <op1>, <op1>, <op1>, <op1>, Rd, Rd, Rd, Rd, Cn, Cn, Rn, Rn, Cm , <op2> Cm , <op2> Cm Cm ARMv2 ARMv5 ARMv5E ARMv6 B1.3 Alphabetical List of ARM and Thumb Instructions These instructions transfer a 32-bit value from the indicated coprocessor to the ARM register Rd. Formats 3 and 4 also transfer a second 32-bit value to Rn. <copro> is the number of the coprocessor in the range p0 to p15. The core takes an undefined instruction trap if the coprocessor is not present. The coprocessor operation specifiers <op1> and <op2>, and the coprocessor register numbers Cn, Cm, are interpreted by the coprocessor and ignored by the ARM. For formats 1 and 2, if Rd is pc, then the top four bits of the cpsr (the NZCV condition code flags) are set from the top four bits of the 32-bit value transferred; pc is not affected. For other formats, Rd and Rn must be distinct and not pc. Coprocessor p15 controls memory management options. For example, the following instruction reads the main ID register from p15 : MRC MRS p15, 0, r0, c0, c0 ; read the MMU ID register, c0 Move to ARM register from status register ( cpsr or spsr ) 1. MRS<cond> Rd, cpsr 2. MRS<cond> Rd, spsr ARMv3 ARMv3 These instructions set Rd = cpsr and Rd = spsr, respectively. Rd must not be pc. MSR Move to status register ( cpsr or spsr ) from an ARM register 1. 2. 3. 4. MSR<cond> MSR<cond> MSR<cond> MSR<cond> cpsr_<fields>, cpsr_<fields>, spsr_<fields>, spsr_<fields>, #<rotated_immed> Rm #<rotated_immed> Rm ARMv3 ARMv3 ARMv3 ARMv3 Action 1. 2. 3. 4. 
cpsr cpsr spsr spsr = = = = (cpsr (cpsr (spsr (spsr & & & & ~<mask>) ~<mask>) ~<mask>) ~<mask>) | | | | (<rotated_immed> & <mask>) (Rm & <mask>) (<rotated_immed> & <mask>) (Rm & <mask>) These instructions alter selected bytes of the cpsr or spsr according to the value of <mask>. The <fields> specifier is a sequence of one or more letters, determining which bytes of <mask> are set. See Table B1.9. TABLE B1.9 Format of the <fields> specifier. <fields> letter Meaning c Control byte Bits set in <mask> 0x000000ff x eXtension byte 0x0000ff00 s Status byte 0x00ff0000 f Flags byte 0xff000000 B1-25 B1-26 Appendix B1 ARM and Thumb Assembler Instructions Some old ARM toolkits allowed cpsr or cpsr_all in place of cpsr_fsxc. They also used cpsr_flg and cpsr_ctl in place of cpsr_f and cpsr_c, respectively. These formats, and the spsr equivalents, are obsolete, so you should not use them. The following example changes to system mode and enables IRQ, which is useful in a reentrant interrupt handler: MRS BIC ORR MSR r0, cpsr r0, r0, #0x9f r0, r0, #0x1f cpsr_c, r0 ; ; ; ; read cpsr state clear IRQ disable and mode bits set system mode update control byte of the cpsr MUL Multiply 1. MUL<cond>{S} 2. MUL Rd, Rm, Rs Ld, Lm ARMv2 THUMBv1 Action Effect on the cpsr 1. Rd = Rm*Rs 2. Ld = Lm*Ld Updated if S suffix supplied Updated Notes ■ Rd or Ld is set to the lower 32 bits of the result. ■ Rd, Rm, Rs must not be pc. ■ Rd and Rm must be different registers. Similarly Ld and Lm must be different. ■ Implementations may terminate early on the value of the Rs or Ld operand. For this reason use small or constant values for Rs or Ld where possible. ■ If the cpsr is updated, then N = <Negative>, Z = <Zero>, C is unpredictable, and V is preserved. Avoid using the instruction MULS because implementations often impose penalty cycles for this operation. Instead use MUL followed by a compare, and schedule the compare, to avoid multiply result use interlocks. MVN Move the logical not of a 32-bit value into a register 1. MVN<cond>{S} 2. MVN<cond>{S} 3. MVN Rd, #<rotated_immed> Rd, Rm {, <shift>} Ld, Lm Action 1. Rd = ~<rotated_immed> 2. Rd = ~<shifted_Rm> 3. Ld = ~ Lm ARMv1 ARMv1 THUMBv1 Effect on the cpsr Updated if S suffix specified Updated if S suffix specified Updated (see Notes below) B1.3 Alphabetical List of ARM and Thumb Instructions Notes ■ If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3), and V is preserved. ■ If Rd is pc, then the instruction effects a jump to the calculated address. If the operation updates the cpsr, then the processor mode must have an spsr; in this case, the cpsr is set to the value of the spsr. ■ If Rm is pc, then the value used is the address of the instruction plus eight bytes. Examples MVN MVN r0, #0xff r0, #0 ; r0 = 0xffffff00 ; r0 = -1 NEG Negate value in Thumb (use RSB to negate in ARM state) 1. NEG Ld, Lm THUMBv1 Action Effect on the cpsr 1. Ld = -Lm Updated (see Notes below) Notes ■ The cpsr is updated: N = <Negative>, Z = <Zero>, C = <NoUnsignedOverflow>, V = <SignedOverflow>. Note that Z = C and V = (Ld== 0x80000000). ■ This is the same as the operation RSBS Ld, Lm, #0 in ARM state. NOP No operation 1. NOP MACRO This is not an ARM instruction. It is an assembly macro that produces an instruction having no effect other than advancing the pc as normal. In ARM state it assembles to MOV r0, r0. In Thumb state it assembles to MOV r8, r8. The operation is not guaranteed to take one processor cycle. 
In particular, if you use NOP after a load of r0, then the operation may cause pipeline interlocks. ORR Logical bitwise OR of two 32-bit values 1. ORR<cond>{S} 2. ORR<cond>{S} 3. ORR Rd, Rn, #<rotated_immed> Rd, Rn, Rm {, <shift>} Ld, Lm Action 1. Rd = Rn | <rotated_immed> 2. Rd = Rn | <shifted_Rm> 3. Ld = Ld | Lm ARMv1 ARMv1 THUMBv1 Effect on the cpsr Updated if S suffix specified Updated if S suffix specified Updated (see Notes below) B1-27 B1-28 Appendix B1 ARM and Thumb Assembler Instructions Notes ■ If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3), and V is preserved. ■ If Rd is pc, then the instruction effects a jump to the calculated address. If the operation updates the cpsr, then the processor mode must have an spsr, in this case, the cpsr is set to the value of the spsr. ■ If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes. Example ORR r0, r0,#1 ‹‹1 ; set bit 13 of r0 PKH Pack 16-bit halfwords into a 32-bit word 1. PKHBT<cond> Rd, Rn, Rm {, LSL #<0-31>} ARMv6 2. PKHTB<cond> Rd, Rn, Rm {, ASR #<1-32>} ARMv6 Action 1. Rd[15:00] = Rn[15:00]; Rd[31:16]=<shifted_Rm>[31:16] 2. Rd[31:16] = Rn[31:16]; Rd[15:00]=<shifted_Rm>[15:00] Note ■ Rd, Rn, Rm must not be pc. cpsr is not affected. Examples PKHBT PKHTB r0, r1, r2, LSL#16 ; r0 = (r2[15:00]‹‹16)|r1[15:00] r0, r2, r1, ASR#16 ; r0 = (r2[31:15]‹‹16)|r1[31:15] PLD Preload hint instruction 1. PLD [Rn {, #{-}<immed12>}] 2. PLD [Rn, {-}Rm {,<imm_shift>}] ARMv5E ARMv5E Action 1. Preloads from address (Rn + {{-}<immed12>}) 2. Preloads from address (Rn + {-}<shifted_Rm>) This instruction does not affect the processor registers (other than advancing pc). It merely hints that the programmer is likely to read from the given address in future. A cached processor may take this as a hint to load the cache line containing the address into the cache. The instruction should not generate a data abort or any other memory B1.3 Alphabetical List of ARM and Thumb Instructions system error. If Rn is pc, then the value used for Rn is the address of the instruction plus eight. Rm must not be pc. Examples PLD PLD [r0, #7] [r0, r1, LSL#2] ; Preload from r0+7 ; Preload from r0+4*r1 POP Pops multiple registers from the stack in Thumb state (for ARM state use LDM) 1. POP <regster_list> THUMBv1 Action 1. equivalent to the ARM instruction LDMFD sp!, <register_list> The <register_list> can contain registers in the range r0 to r7 and pc. The following example restores the low-numbered ARM registers and returns from a subroutine: POP {r0-r7,pc} PUSH Pushes multiple registers to the stack in Thumb state (for ARM state use STM) 1. PUSH <regster_list> THUMBv1 Action 1. equivalent to the ARM instruction STMFD sp!, <register_list> The <register_list> can contain registers in the range r0 to r7 and lr. The following example saves the low-numbered ARM registers and link register. PUSH {r0-r7,lr} QADD QDADD QDSUB QSUB Saturated signed and unsigned arithmetic 1. QADD<cond> 2. 3. 4. 5. 6. 7. 8. 9. 10. QDADD<cond> QSUB<cond> QDSUB<cond> {U}QADD16<cond> {U}QADDSUBX<cond> {U}QSUBADDX<cond> {U}QSUB16<cond> {U}QADD8<cond> {U}QSUB8<cond> Rd, Rm, Rn ARMv5E Rd, Rd, Rd, Rd, Rd, Rd, Rd, Rd, Rd, ARMv5E ARMv5E ARMv5E ARMv6 ARMv6 ARMv6 ARMv6 ARMv6 ARMv6 Rm, Rm, Rm, Rn, Rn, Rn, Rn, Rn, Rn, Rn Rn Rn Rm Rm Rm Rm Rm Rm B1-29 B1-30 Appendix B1 ARM and Thumb Assembler Instructions Action 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 
Rd = sat32(Rm+Rn) Rd = sat32(Rm+sat32(2*Rn)) Rd = sat32(Rm-Rn) Rd = sat32(Rm-sat32(2*Rn)) Rd[31:16] = sat16(Rn[31:16] Rd[15:00] = sat16(Rn[15:00] Rd[31:16] = sat16(Rn[31:16] Rd[15:00] = sat16(Rn[15:00] Rd[31:16] = sat16(Rn[31:16] Rd[15:00] = sat16(Rn[15:00] Rd[31:16] = sat16(Rn[31:16] Rd[15:00] = sat16(Rn[15:00] Rd[31:24] = sat8(Rn[31:24] Rd[23:16] = sat8(Rn[23:16] Rd[15:08] = sat8(Rn[15:08] Rd[07:00] = sat8(Rn[07:00] Rd[31:24] = sat8(Rn[31:24] Rd[23:16] = sat8(Rn[23:16] Rd[15:08] = sat8(Rn[15:08] Rd[07:00] = sat8(Rn[07:00] + + + + + + + + - Rm[31:16]); Rm[15:00]) Rm[15:00]); Rm[31:16]) Rm[15:00]); Rm[31:16]) Rm[31:16]); Rm[15:00]) Rm[31:24]); Rm[23:16]); Rm[15:08]); Rm[07:00]) Rm[31:24]); Rm[23:16]); Rm[15:08]); Rm[07:00]) Notes ■ The operations are signed unless the U prefix is present. For signed operations, satN(x) saturates x to the range –2N–1 x < 2 N–1. For unsigned operations, satN(x) saturates x to the range 0 x < 2 N. ■ The cpsr Q-flag is set if saturation occurred; otherwise it is preserved. ■ Rd, Rn, Rm must not be pc. ■ The X operations are useful for packed complex numbers. The following examples assume bits [15:00] hold the real part and [31:16] the imaginary part. Examples QDADD QADD16 QADDSUBX QSUBADDX REV r0, r0, r0, r0, r0, r1, r1, r1, r2 r2 r2 r2 ; ; ; ; add Q30 value r2 to Q31 accumulator r0 SIMD saturating add r0=r1+i*r2 in packed complex arithmetic r0=r1-i*r2 in packed complex arithmetic Reverse bytes within a word or halfword. 1. REV<cond> 2. REV16<cond> 3. REVSH<cond> Rd, Rm Rd, Rm Rd, Rm ARMv6/THUMBv3 ARMv6/THUMBv3 ARMv6/THUMBv3 B1.3 Alphabetical List of ARM and Thumb Instructions Action 1. Rd[31:24] Rd[15:08] 2. Rd[31:24] Rd[15:08] 3. Rd[31:08] = = = = = Rm[07:00]; Rd[23:16] = Rm[15:08]; Rm[23:16]; Rd[07:00] = Rm[31:24] Rm[23:16]; Rd[23:16] = Rm[31:24]; Rm[07:00]; Rd[07:00] = Rm[15:08] sign-extend(Rm[07:00]); Rd[07:00] = Rm[15:08] Notes ■ Rd and Rm must not be pc. ■ For Thumb, Rd, Rm must be in the range r0 to r7 and <cond> cannot be specified. ■ These instructions are useful to convert big-endian data to little-endian and vice versa. Examples REV REV16 REVSH r0, r0 ; switch endianness of a word r0, r0 ; switch endianness of two packed halfwords r0, r0 ; switch endianness of a signed halfword RFE Return from exception 1. RFE<amode> Rn! ARMv6 This performs the operation that LDM<amode> Rn{!}, {pc, cpsr} would perform if LDM allowed a register list of {pc, cpsr}. See the entry for LDM. ROR Rotate right for Thumb (see MOV for the ARM equivalent) 1. ROR Ld, Ls THUMBv1 Action Effect on the cpsr 1. Ld = Ld ROR Ls[7:0] Updated Notes ■ The cpsr is updated: N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3). RSB Reverse subtract of two 32-bit integers 1. RSB<cond>{S} Rd, Rn, #<rotated_immed> 2. RSB<cond>{S} Rd, Rn, Rm {, <shift>} ARMv1 ARMv1 Action Effect on the cpsr 1. Rd = <rotwated_immed> - Rn 2. Rd = <shifted_Rm> - Rn Updated if S suffix present Updated if S suffix present B1-31 B1-32 Appendix B1 ARM and Thumb Assembler Instructions Notes ■ If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z = <Zero>, C = <NoUnsignedOverflow>, and V = <SignedOverflow>. The carry flag is set this way because the subtract x – y is implemented as the add x + ~ y + 1. The carry flag is one if x + ~ y + 1 overflows. This happens when x y, when x – y doesn’t overflow. ■ If Rd is pc, then the instruction effects a jump to the calculated address. 
If the operation updates the cpsr, then the processor mode must have an spsr in this case, the cpsr is set to the value of the spsr. ■ If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes. Examples RSB RSB r0, r0, #0 r0, r1, r1, LSL#3 ; r0 = -r0 ; r0 = 7*r1 RSC Reverse subtract with carry of two 32-bit integers 1. RSC<cond>{S} Rd, Rn, #<rotated_immed> 2. RSC<cond>{S} Rd, Rn, Rm {, <shift>} ARMv1 ARMv1 Action Effect on the cpsr 1. Rd = <rotated_immed> - Rn - (~C) 2. Rd = <shifted_Rm> - Rn - (~C) Updated if S suffix present Updated if S suffix present Notes ■ If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z = <Zero>, C = <NoUnsignedOverflow>, V = <SignedOverflow>. The carry flag is set this way because the subtract x – y – ~C is implemented as the add x + ~y + C. The carry flag is one if x + ~y + C overflows. This happens when x –y – ~ C doesn’t overflow. ■ If Rd is pc, then the instruction effects a jump to the calculated address. If the operation updates the cpsr, then the processor mode must have an spsr; in this case the cpsr is set to the value of the spsr. ■ If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes. The following example negates a 64-bit integer where r0 is the low 32 bits and r1 the high 32 bits. RSBS RSC r0, r0, #0 r1, r1, #0 ; r0 = -r0 C=NOT(borrow) ; r1 = -r1-borrow SADD Parallel modulo add and subtract operations 1. {S|U}ADD16<cond> Rd, Rn, Rm 2. {S|U}ADDSUBX<cond> Rd, Rn, Rm ARMv6 ARMv6 B1.3 3. 4. 5. 6. {S|U}SUBADDX<cond> {S|U}SUB16<cond> {S|U}ADD8<cond> {S|U}SUB8<cond> Alphabetical List of ARM and Thumb Instructions Rd, Rd, Rd, Rd, Rn, Rn, Rn, Rn, Rm Rm Rm Rm ARMv6 ARMv6 ARMv6 ARMv6 Action Effect on the cpsr 1. Rd[31:16]=Rn[31:16]+Rm[31:16]; Rd[15:00]=Rn[15:00]+Rm[15:00] 2. Rd[31:16]=Rn[31:16]+Rm[15:00]; Rd[15:00]=Rn[15:00]-Rm[31:16] 3. Rd[31:16]=Rn[31:16]-Rm[15:00]; Rd[15:00]=Rn[15:00]+Rm[31:16] 4. Rd[31:16]=Rn[31:16]-Rm[31:16]; Rd[15:00]=Rn[15:00]-Rm[15:00] 5. Rd[31:24]=Rn[31:24]+Rm[31:24]; Rd[23:16]=Rn[23:16]+Rm[23:16]; Rd[15:08]=Rn[15:08]+Rm[15:08]; Rd[07:00]=Rn[07:00]+Rm[07:00] 6. Rd[31:24]=Rn[31:24]-Rm[31:24]; Rd[23:16]=Rn[23:16]-Rm[23:16]; Rd[15:08]=Rn[15:08]-Rm[15:08]; Rd[07:00]=Rn[07:00]-Rm[07:00] GE3=GE2=cmn(Rn[31:16],Rm[31:16]) GE1=GE0=cmn(Rn[15:00],Rm[15:00]) GE3=GE2=cmn(Rn[31:16],Rm[15:00]) GE1=GE0=(Rn[15:00] >= Rm[31:16]) GE3=GE2=(Rn[31:16] >= Rm[15:00]) GE1=GE0=cmn(Rn[15:00],Rm[31:16]) GE3=GE2=(Rn[31:16] >= Rm[31:16]) GE1=GE0=(Rn[15:00] >= Rm[15:00]) GE3 = cmn(Rn[31:24],Rm[31:24]) GE2 = cmn(Rn[23:16],Rm[23:16]) GE1 = cmn(Rn[15:08],Rm[15:08]) GE0 = cmn(Rn[07:00],Rm[07:00]) GE3 = (Rn[31:24] >= Rm[31:24]) GE2 = (Rn[23:16] >= Rm[23:16]) GE1 = (Rn[15:08] >= Rm[15:08]) GE0 = (Rn[07:00] >= Rm[07:00]) Notes ■ If you specify the S prefix, then all comparisons are signed. The cmn(x, y) function returns x – y or equivalently x + y 0. ■ If you specify the U prefix, then all comparisons are unsigned. The cmn(x, y) function returns x (unasigned) (–y) or equivalently if the x + y operation produces a carry. ■ Rd, Rn, and Rm must not be pc. ■ The X operations are useful for packed complex numbers. The following examples assume bits [15:00] hold the real part and [31:16] the imaginary part. Examples SADD16 SADDSUBX SSUBADDX SBC r0, r1, r2 r0, r1, r2 ; Signed 16-bit SIMD add ; r0=r1+i*r2 in packed complex arithmetic r0, r1, r2 ; r0=r1-i*r2 in packed complex arithmetic Subtract with carry 1. SBC<cond>{S} Rd, Rn, #<rotated_immed> 2. SBC<cond>{S} Rd, Rn, Rm {, <shift>} 3. 
SBC Ld, Lm ARMv1 ARMv1 THUMBv1 B1-33 B1-34 Appendix B1 ARM and Thumb Assembler Instructions Effect on the cpsr Action 1. Rd = Rn - <rotated_immed> - (~C) Updated if S suffix specified 2. Rd = Rn - <shifted_Rm> - (~C) Updated if S suffix specified 3. Ld = Ld - Lm - (~C) Updated (see Notes below) Notes ■ If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z = <Zero>, C = <NoUnsignedOverflow>, V = <SignedOverflow>. The carry flag is set this way because the subtract x – y – ~C is implemented as the add x + ~y + C. The carry flag is one if x + ~y + C overflows. This happens when x – y – ~C doesn’t overflow. ■ If Rd is pc, then the instruction effects a jump to the calculated address. If the operation updates the cpsr, then the processor mode must have an spsr. In this case the cpsr is set to the value of the spsr. ■ If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes. The following example implements a 64-bit subtract: SUBS SBC r0, r0, r2 r1, r1, r3 ; subtract low words, C=NOT(borrow) ; subtract high words and borrow SEL Select between two source operands based on the GE flags 1. SEL<cond> Rd, Rn, Rm ARMv6 Action 1. Rd[31:24] Rd[23:16] Rd[15:08] Rd[07:00] = = = = GE3 GE2 GE1 GE0 ? ? ? ? Rn[31:24] Rn[23:16] Rn[15:08] Rn[07:00] : : : : Rm[31:24]; Rm[23:16]; Rm[15:08]; Rm[07:00] Notes ■ Rd, Rn, Rm must not be pc. ■ See SADD for instructions that set the GE flags in the cpsr. SETEND Set the endianness for data accesses 1. SETEND BE 2. SETEND LE ARMv6/THUMBv3 ARMv6/THUMBv3 Action 1. In the cpsr E=1 so data accesses will be big-endian 2. In the cpsr E=0 so data accesses will be little-endian B1.3 Alphabetical List of ARM and Thumb Instructions Note ■ ARMv6 uses a byte-invariant endianness model. This means that byte loads and stores are not affected by the configured endianess. For little-endian data access the byte at the lowest address appears in the least significant byte of the loaded word. For big-endian data accesses the byte at the lowest address appears in the most significant byte of the loaded word. SHADD Parallel halving add and subtract operations 1. 2. 3. 4. 5. 6. {S|U}HADD16<cond> {S|U}HADDSUBX<cond> {S|U}HSUBADDX<cond> {S|U}HSUB16<cond> {S|U}HADD8<cond> {S|U}HSUB8<cond> Rd, Rd, Rd, Rd, Rd, Rd, Rn, Rn, Rn, Rn, Rn, Rn, Rm Rm Rm Rm Rm Rm ARMv6 ARMv6 ARMv6 ARMv6 ARMv6 ARMv6 Action 1. Rd[31:16] Rd[15:00] 2. Rd[31:16] Rd[15:00] 3. Rd[31:16] Rd[15:00] 4. Rd[31:16] Rd[15:00] 5. Rd[31:24] Rd[23:16] Rd[15:08] Rd[07:00] 6. Rd[31:24] Rd[23:16] Rd[15:08] Rd[07:00] = = = = = = = = = = = = = = = = (Rn[31:16] (Rn[15:00] (Rn[31:16] (Rn[15:00] (Rn[31:16] (Rn[15:00] (Rn[31:16] (Rn[15:00] (Rn[31:24] (Rn[23:16] (Rn[15:08] (Rn[07:00] (Rn[31:24] (Rn[23:16] (Rn[15:08] (Rn[07:00] + + + + + + + + - Rm[31:16])››1; Rm[15:00])››1 Rm[15:00])››1; Rm[31:16])››1 Rm[15:00])››1; Rm[31:16])››1 Rm[31:16])››1; Rm[15:00])››1 Rm[31:24])››1; Rm[23:16])››1; Rm[15:08])››1; Rm[07:00])››1 Rm[31:24])››1; Rm[23:16])››1; Rm[15:08])››1; Rm[07:00])››1 Notes ■ If you use the S prefix, then all operations are signed and values are sign-extended before the addition. ■ If you use the U prefix, then all operations are unsigned and values are zero-extended before the addition. ■ Rd, Rn, and Rm must not be pc. B1-35 B1-36 Appendix B1 ARM and Thumb Assembler Instructions ■ These operations provide parallel arithmetic that cannot overflow, which is useful for DSP processing of normalized signals. SMLS Signed multiply accumulate instructions SMLA 1. 2. 3. 4. 5. 6. 7. 8. 
SMLA<x><y><cond> SMLAW<y><cond> SMLAD{X}<cond> SMLSD{X}<cond> {U|S}MLAL<cond>{S} SMLAL<x><y><cond> SMLALD{X}<cond> SMLSLD{X}<cond> Rd, Rm, Rd, Rm, Rd, Rm, Rd, Rm, RdLo, RdHi, RdLo, RdHi, RdLo, RdHi, RdLo, RdHi, Rs, Rs, Rs, Rs, Rm, Rm, Rm, Rm, Rn Rn Rn Rn Rs Rs Rs Rs ARMv5E ARMv5E ARMv6 ARMv6 ARMv3M ARMv5E ARMv6 ARMv6 Action 1. 2. 3. 4. 5. 6. 7. 8. Rd = Rn + (Rm.<x> * Rs.<y>) Rd = Rn + (((signed)Rm * Rs.<y>)››16) Rd = Rn + Rm.B*<rotated_Rs>.B + Rm.T*<rotated_Rs>.T Rd = Rn + Rm.B*<rotated_Rs>.B - Rm.T*<rotated_Rs>.T RdHi:RdLo = RdHi:RdLo + (Rm * Rs) RdHi:RdLo = RdHi:RdLo + (Rm.<x> * Rm.<y>) RdHi:RdLo = RdHi:RdLo + Rm.B*<rotated_Rs>.B + Rm.T*<rotated_Rs>.T RdHi:RdLo = RdHi:RdLo + Rm.B*<rotated_Rs>.B - Rm.T*<rotated_Rs>.T Notes ■ <x> and <y> can be B or T. ■ Rm.B is shorthand for (sign-extend)Rm[15:00], the bottom 16 bits of Rm. ■ Rm.T is shorthand for (sign-extend)Rm[31:16], the top 16 bits of Rm. ■ <rotated_Rs> is Rs if you do not specify the X suffix or Rs ROR 16 if you do specify the X suffix. ■ RdHi and RdLo must be different registers. For format 5, Rm must be a different register from RdHi and RdLo. ■ Formats 1 to 4 update the cpsr Q-flag: Q = Q < SignedOverflow>. ■ Format 5 implements an unsigned multiply with the U prefix or a signed multiply with the S prefix. ■ Format 5 updates the cpsr if the S suffix is present: N = RdHi[31], Z = ( RdHi==0 && RdLo==0); the C and V flags are unpredictable. Avoid using {U|S}MLALS because implementations often impose penalty cycles for this operation. ■ Implementations may terminate early on the value of Rs. For this reason use small or constant values for Rs where possible. B1.3 ■ Alphabetical List of ARM and Thumb Instructions The X suffix and multiply subtract versions are useful for packed complex numbers. The following examples assume bits [15:00] hold the real part and [31:16] the imaginary part. Examples SMLABB SMLABT SMLAWB SMLAL SMLALTB SMLSD SMLADX r0, r0, r0, r0, r0, r0, r0, r1, r1, r1, r1, r1, r1, r1, r2, r2, r2, r2, r2, r2, r2, r0 ; r0 r0 ; r0 r0 ; r0 r3 ; acc r3 ; acc r0 ; r0 r0 ; r0 += += += += += += += (short)r1 * (short)r2 (short)r1 * ((signed)r››216) (r1*(short)r2)››16 r2*r3, acc is 64 bits [r1:r0] ((signed)r2››16)*((short)r3) real(r1*r2) in complex maths imag(r1*r2) in complex maths SMMUL Signed most significant word multiply instructions SMMLA SMMLS 1. SMMUL{R}<cond> Rd, Rm, Rs 2. SMMLA{R}<cond> Rd, Rm, Rs, Rn 3. SMMLS{R}<cond> Rd, Rm, Rs, Rn ARMv6 ARMv6 ARMv6 Action 1. Rd = ((signed)Rm*(signed)Rs + round)››32 2. Rd = ((Rn ‹‹ 32) + (signed)Rm*(signed)Rs + round)››32 3. Rd = ((Rn ‹‹ 32) - (signed)Rm*(signed)Rs + round)››32 Notes ■ If you specify the R suffix then round = 231; otherwise, round = 0. ■ Rd, Rm, Rs, and Rn must not be pc. ■ Implementations may terminate early on the value of Rs. ■ For 32-bit DSP algorithms these operations have several advantages over using the high result register from SMLAL: They often take fewer cycles than SMLAL. They also implement rounding, multiply subtract, and don’t require a temporary scratch register for the low 32 bits of result. Example SMMULR SMUL SMUA SMUS r0, r1, r2 ; r0=r1*r2/2 using Q31 arithmetic Signed multiply instructions 1. SMUL<x><y><cond> 2. SMULW<y><cond> 3. SMUAD{X}<cond> Rd, Rd, Rd, Rm, Rm, Rm, Rs Rs Rs ARMv5E ARMv5E ARMv6 B1-37 B1-38 Appendix B1 ARM and Thumb Assembler Instructions 4. SMUSD{X}<cond> Rd, 5. {U|S}MULL<cond>{S} RdLo, Rm, RdHi, Rs Rm, Rs ARMv6 ARMv3M Action 1. 2. 3. 4. 5. 
Rd = Rm.<x> * Rs.<y> Rd = (Rm * Rs.<y>)››16 Rd = Rm.B*<rotated_Rs>.B + Rm.T*<rotated_Rs>.T Rd = Rm.B*<rotated_Rs>.B - Rm.T*<rotated_Rs>.T RdHi:RdLo = Rm*Rs Notes ■ <x> and <y> can be B or T. ■ Rm.B is shorthand for (sign-extend)Rm[15:00], the bottom 16 bits of Rm. ■ Rm.T is shorthand for (sign-extend)Rm[31:16], the top 16 bits of Rm. ■ <rotated_Rs> is Rs if you do not specify the X suffix or Rs ROR 16 if you do specify the X suffix. ■ RdHi and RdLo must be different registers. For format 5, Rm must be a different register from RdHi and RdLo. ■ Format 4 updates the cpsr Q-flag: Q = Q | <SignedOverflow>. ■ Format 5 implements an unsigned multiply with the U prefix or a signed multiply with the S prefix. ■ Format 5 updates the cpsr if the S suffix is present: N = RdHi[31], Z = ( RdHi==0 && RdLo==0); the C and V flags are unpredictable. Avoid using {S|U}MULLS because implementations often impose penalty cycles for this operation. ■ Implementations may terminate early on the value of Rs. For this reason use small or constant values for Rs where possible. ■ The X suffix and multiply subtract versions are useful for packed complex numbers. The following examples assume bits [15:00] hold the real part and [31:16] the imaginary part. Examples SMULBB SMULBT SMULWB SMULL SMUADX SRS r0, r0, r0, r0, r0, r1, r1, r1, r1, r1, r2 ; r2 ; r2 ; r2, r3 ; r2 ; r0 r0 r0 acc r0 = = = = = (short)r1 * (short)r2 (short)r1 * ((signed)r2››16) (r1*(short)r2)››16 r2*r3, acc is 64 bits [r1:r0] imag(r1*r2) in complex maths Save return state 1. SRS<amode> #<mode>{!} ARMv6 B1.3 Alphabetical List of ARM and Thumb Instructions This performs the operation that STM<amode> sp_<mode>{!} , {lr, spsr} would perform if STM allowed a register list of {lr, spsr} and allowed you to reference the stack pointer of a different mode. See the entry for STM. SSAT Saturate to n bits 1. {S|U}SAT<cond> Rd, #<n>, Rm {, LSL#<0-31>} 2. {S|U}SAT<cond> Rd, #<n>, Rm {, ASR#<1-32>} 3. {S|U}SAT16<cond> Rd, #<n>, Rm Action Effect on the cpsr 1. Rd = sat(<shifted_Rm>, n); 2. Rd = sat(<shifted_Rm>, n); 3. Rd[31:16] = sat(Rm[31:16], n); Rd[15:00] = sat(Rm[15:00], n) Q=Q | 1 if saturation occurred Q=Q | 1 if saturation occurred Q=Q | 1 if saturation occurred Notes ■ If you specify the S prefix, then sat (x, n) saturates the signed value x to a signed n-bit value in the range 2n1 x 2n1. n is encoded as 1 + <immed5> for SAT and 1 + <immed4> for SAT16. ■ If you specify the U prefix, then sat (x, n) saturates the signed value x to an unsigned n-bit value in the range 0 x 2n. n is encoded as <immed5> for SAT and <immed4> for SAT16. ■ Rd and Rm must not be pc. SSUB Signed parallel subtract (see SADD) STC Store to coprocessor single or multiple 32-bit values 1. 2. 3. 4. 5. 6. STC<cond>{L} STC<cond>{L} STC<cond>{L} STC2{L} STC2{L} STC2{L} <copro>, <copro>, <copro>, <copro>, <copro>, <copro>, Cd, Cd, Cd, Cd, Cd, Cd, [Rn {, #{-}<immed8>*4}]{!} [Rn], #{-}<immed8>*4 [Rn], <option> [Rn {, #{-}<immed8>*4}]{!} [Rn], #{-}<immed8>*4 [Rn], <option> ARMv2 ARMv2 ARMv2 ARMv5 ARMv5 ARMv5 These instructions initiate a memory write, transferring data to memory from the given coprocessor. <copro> is the number of the coprocessor in the range p0 to p15. The core takes an undefined instruction trap if the coprocessor is not present. The memory write consists of a sequence of words to sequentially increasing addresses. The initial address is specified by the addressing mode in Table B1.10. 
The coprocessor controls the number of words transferred, up to a maximum B1-39 B1-40 Appendix B1 ARM and Thumb Assembler Instructions limit of 16 words. The fields {L} and Cd are interpreted by the coprocessor and ignored by the ARM. Typically Cd specifies the source coprocessor register for the transfer. The <option> field is an eight-bit integer enclosed in {}. Its interpretation is coprocessor dependent. If the address is not a multiple of four, then the access is unaligned. The restrictions on an unaligned access are the same as for STM. TABLE B1.10 STC addressing modes. Addressing format Address accessed Value written back to Rn [Rn {, #{-}<immed>}] Rn + {{-}<immed>} Rn preserved [Rn {, #{-}<immed>}]! Rn + {{-}<immed>} Rn + {{-}<immed>} [Rn], #{-}<immed> Rn Rn + {-}<immed> [Rn], <option> Rn Rn preserved STM Store multiple 32-bit registers to memory ∧ 1. STM<cond><a mode> Rn{!}, <register_list>{ } 2. STMIA Rn!, <register_list> ARMv1 THUMBv1 These instructions store multiple words to sequential memory addresses. The <register_ list> specifies a list of registers to store, enclosed in curly brackets {}. Although the assembler allows you to specify the registers in the list in any order, the order is not stored in the instruction, so it is good practice to write the list in increasing order of register number since this is the usual order of the memory transfer. The following pseudocode shows the normal action of STM. We use <register_ list>[i] to denote the register appearing at position i in the list starting at 0 for the first register. This assumes that the list is in order of increasing register number. N = the number of registers in <register_list> start = the lowest address accessed given in Table B1.11 for (i=0; i<N; i++) memory(start+i*4, 4) = <register_list>[i]; if (! specified) then update Rn according to Table B1.11 Note that memory(a, 4) refers to the four bytes at address a packed according to the current processor data endianness. If a is not a multiple of four, then the store is unaligned. Because the behavior of an unaligned store depends on the architecture revision, memory system, and system coprocessor (CP15) configuration, it is best to avoid unaligned stores if possible. Assuming that the external memory system does not abort unaligned stores, then the following rules usually apply: B1.3 Alphabetical List of ARM and Thumb Instructions ■ If the core has a system coprocessor and bit 1 ( A-bit) or bit 22 ( U-bit) of CP15: c1:c0:0 is set, then unaligned store-multiples cause an alignment fault data abort exception. ■ Otherwise, the access ignores the bottom two address bits. Table B1.11 lists the possible addressing modes specified by <amode>. If you specify the !, then the base address register is updated according to Table B1.11; otherwise, it is preserved. Note that the lowest register number is always written to the lowest address. TABLE B1.11 STM addressing modes. Addressing mode Lowest address accessed Highest address accessed Value written back to Rn if ! specified {IA|EA} Rn Rn + N*4 - 4 Rn + N*4 {IB|FA} Rn + 4 Rn + N*4 Rn + N*4 {DA|ED} Rn - N*4 + 4 Rn Rn - N*4 {DB|FD} Rn - N*4 Rn - 4 Rn - N*4 The first half of the addressing mode mnemonics stands for Increment After, Increment Before, Decrement After, and Decrement Before, respectively. Increment modes store the registers sequentially forward starting from address Rn (increment after) or Rn + 4 (increment before). 
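For instance, a minimal sketch (the register contents and the choice of r9 as a base register are assumptions for illustration):

        STMIA   r9, {r0-r2}         ; *r9=r0, *(r9+4)=r1, *(r9+8)=r2
        STMIB   r9, {r0-r2}         ; *(r9+4)=r0, *(r9+8)=r1, *(r9+12)=r2

Since neither instruction specifies !, r9 is preserved in both cases.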
Decrement modes have the same effect as if you stored the register list backwards to sequentially descending memory addresses starting from address Rn (decrement after) or Rn 4 (decrement before). The second half of the addressing mode mnemonics stands for the stack type you can implement with that address mode: Full Descending, Empty Descending, Full Ascending, and Empty Ascending. With a full stack, Rn points to the last stacked value. With an empty stack, Rn points to the first unused stack location. ARM stacks are usually full descending. You should use full descending or empty ascending stacks by preference, since STC also supports these addressing modes. Notes ■ For Thumb (format 2), Rn and the register list registers must be in the range r0 to r7. ■ The number of registers N in the list must be nonzero. ■ Rn must not be pc. ■ If Rn appears in the register list and ! (writeback) is specified, the behavior is as follows: If Rn is the lowest register number in the list, then the original value is stored; otherwise, the stored value is unpredictable. B1-41 B1-42 Appendix B1 ARM and Thumb Assembler Instructions ■ If pc appears in the register list, then the value stored is implementation defined. ■ If is specified, then the operation is modified. The processor must not be in user or system mode. The registers appearing in the register list refer to the user mode versions of the registers and writeback must not be specified. ■ The time order of the memory accesses may depend on the implementation. Be careful when using a store multiple to access I/O locations where the access order matters. If the order matters, then check that the memory locations are marked as I/O in the page tables. Do not cross page boundaries, and do not use pc in the register list. ∧ Examples STMIA STMDB STMEQFD STMFD STR r4!, r4!, sp!, sp, {r0, r1} {r0, r1} {r0, lr} ∧ {sp} ; *r4=r0, *(r4+4)=r1, r4+=8 ; *(r4-4)=r1, *(r4-8)=r0, r4-=8 ; if (result zero) then stack r0, lr ; store sp_usr on stack sp_current Store a single value to a virtual address in memory 1. STR<cond>{|B} Rd, 2. STR<cond>{|B} Rd, 3. STR<cond>{|B}{T} Rd, 4. STR<cond>{|B}{T} Rd, 5. STR<cond>{H} Rd, 6. STR<cond>{H} Rd, 7. STR<cond>{H} Rd, 8. STR<cond>{H} Rd, 9. STR<cond>D Rd, 10. STR<cond>D Rd, 11. STR<cond>D Rd, 12. STR<cond>D Rd, 13. STREX<cond> Rd, 14. STR{|B|H} Ld, 15. STR{|B|H} Ld, 16. STR Ld, 17. STR<cond><type> Rd, [Rn {, #{-}<immed12>}]{!} [Rn, {-}Rm {,<imm_shift>}]{!} [Rn], #{-}<immed12> [Rn], {-}Rm {,<imm_shift>} [Rn, {, #{-}<immed8>}]{!} [Rn, {-}Rm]{!} [Rn], #{-}<immed8> [Rn], {-}Rm [Rn, {, #{-}<immed8>}]{!} [Rn, {-}Rm]{!} [Rn], #{-}<immed8> [Rn], {-}Rm Rm, [Rn] [Ln, #<immed5>*<size>] [Ln, Lm] [sp, #<immed8>*4] <label> ARMv1 ARMv1 ARMv1 ARMv1 ARMv4 ARMv4 ARMv4 ARMv4 ARMv5E ARMv5E ARMv5E ARMv5E ARMv6 THUMBv1 THUMBv1 THUMBv1 MACRO Formats 1 to 16 store a single data item of the type specified by the opcode suffix, using a preindexed or postindexed addressing mode. Tables B1.12 and B1.13 show the different addressing modes and data types. In Table B1.13, memory (a, n) refers to n sequential bytes at address a. The bytes are packed according to the configured processor data endianness. memoryT(a, n) performs the access with user mode privileges, regardless of the current processor mode. The act of function IsExclusive(a) used by STREX depends on address a. If a has the shared TLB attribute, then IsExclusive(a) is true if address a is marked as exclusive for this processor. 
It then clears any exclusive accesses on this processor and any exclusive B1.3 Alphabetical List of ARM and Thumb Instructions accesses to address a on other processors in the system. If a does not have the shared TLB attribute, then IsExclusive(a) is true if there is an outstanding exclusive access on this processor. It then clears any such outstanding access. TABLE B1.12 STR addressing modes. Addressing format Address a accessed Value written back to Rn [Rn {,#{-}<immed>}] Rn + {{-}<immed>} Rn preserved [Rn {,#{-}<immed>}]! Rn + {{-}<immed>} Rn + {{-}<immed>} [Rn, {-}Rm {,<shift>}] Rn + {-}<shifted_Rm> Rn preserved [Rn, {-}Rm {,<shift>}]! Rn + {-}<shifted_Rm> Rn + {-}<shifted_Rm> [Rn], #{-}<immed> Rn Rn + {-}<immed> [Rn], {-}Rm {,<shift>} Rn Rn + {-}<shifted_Rm> TABLE B1.13 STR data types. Store Datatype <size> (bytes) Action STR word 4 memory(a, 4) = Rd STRB unsigned Byte 1 memory(a, 1) = (char)Rd STRBT Byte Translated 1 memoryT(a, 1) = (char)Rd STRD Double word 8 memory(a, 4) = Rd STREX word EXclusive 4 memory(a+4, 4) = R(d+1) if (IsExclsuive(a)) { memory (a, 4) = Rm; Rd = 0; } else { Rd = 1; } STRH unsigned Halfword 2 memory(a, 2) = (short) Rd STRT word Translated 4 memoryT(a, 4) = Rd If the address a is not a multiple of <size>, then the store is unaligned. Because the behavior of an unaligned store depends on the architecture revision, memory system, and system coprocessor (CP15) configuration, it is best to avoid unaligned stores if possible. Assuming that the external memory system does not abort unaligned stores, then the following rules usually apply. In the rules, A is bit 1 of system coprocessor register CP15:c1:c0:0, and U is bit 22 of CP15:c1:c0:0, introduced in ARMv6. If there is no system coprocessor, then A U 0. B1-43 B1-44 Appendix B1 ARM and Thumb Assembler Instructions ■ If A = 1, then unaligned stores cause an alignment fault data abort exception except that word-aligned double-word stores are supported if U = 1. ■ If A = 0 and U = 1, then unaligned stores are supported for STR{|T|H|SH}. Wordaligned stores are supported for STRD. A non-word-aligned STRD generates an alignment fault data abort. ■ If A = 0 and U = 0, then STR and STRT write to memory(a&~ 3, 4). All other unaligned operations are unpredictable but do not cause an alignment fault Format 17 generates a pc -relative store accessing the address specified by <label> . In other words it assembles to STR<cond><type> Rd, [pc, #<offset>] whenever this instruction is supported and <offset>=<label>-pc is in range. Notes ■ For double-word stores (formats 9 to 12), Rd must be even and in the range r0 to r12. ■ If the addressing mode updates Rn, then Rd and Rn must be distinct. ■ If Rd is pc, then <size> must be 4. The value stored is implementation defined. ■ If Rn is pc, then the addressing mode must not update Rn . The value used for Rn is the address of the instruction plus eight bytes. ■ Rm must not be pc. Examples STR STRH STRD STRB STRB SUB r0, r0, r2, r0, r0, [r0] [r1], #4 [r1, #-8]! [r2, #55] [r1], -r2, ; *(int*)r0 = r0; ; *(short*)r1 = r0; r1+=4; ; r1-=8; *(int*)r1=r2; *(int*)(r1+4)=r3 ; *(char*)(r2+55) = r0; LSL #8 ; *(char*)r1 = r0; r1-=256*r2; Subtract two 32-bit values 1. 2. 3. 4. 5. 6. SUB<cond>{S} SUB<cond>{S} SUB SUB SUB SUB Rd, Rd, Ld, Ld, Ld, sp, Rn, #<rotated_immed> Rn, Rm {, <shift>} Ln, #<immed3> #<immed8> Ln, Lm #<immed7>*4 Effect on the cpsr Action 1. Rd 2. Rd 3. Ld 4. 
Ld ARMv1 ARMv1 THUMBv1 THUMBv1 THUMBv1 THUMBv1 = = = = Rn Rn Ln Ld - <rotated_immed> <shifted_Rm> <immed3> <immed8> Updated Updated Updated Updated if S if S (see (see suffix suffix Notes Notes specified specified below) below) B1.3 Alphabetical List of ARM and Thumb Instructions 5. Ld = Ln - Lm 6. sp = sp - <immed7>*4 Updated (see Notes below) Preserved Notes ■ If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z = <Zero>, C = <NoUnsignedOverflow>, and V = <SignedOverflow>. The carry flag is set this way because the subtract x y is implemented as the add x ~ y 1. The carry flag is one if x ~ y 1 overflows. This happens when x y, when x y doesn’t overflow. ■ If Rd is pc, then the instruction effects a jump to the calculated address. If the operation updates the cpsr, then the processor mode must have an spsr; in this case, the cpsr is set to the value of the spsr. ■ If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes. Examples SUBS SUB SUBS SWI r0, r0, #1 r0, r1, r1, LSL #2 pc, lr, #4 ; r0-=1, setting flags ; r0 = -3*r1 ; jump to lr-4, set cpsr=spsr Software interrupt 1. SWI<cond> <immed24> 2. SWI <immed8> ARMv1 THUMBv1 The SWI instruction causes the ARM to enter supervisor mode and start executing from the SWI vector. The return address and cpsr are saved in lr_svc and spsr_svc, respectively. The processor switches to ARM state and IRQ interrupts are disabled. The SWI vector is at address 0x00000008, unless high vectors are configured; then it is at address 0xFFFF0008. The immediate operand is ignored by the ARM. It is normally used by the SWI exception handler as an argument determining which function to perform. Example SWI 0x123456 ; Used by the ARM tools to implement Semi-Hosting SWP Swap a word in memory with a register, without interruption 1. SWP<cond> Rd, Rm, [Rn] 2. SWP<cond>B Rd, Rm, [Rn] ARMv2a ARMv2a Action 1. temp=memory(Rn,4); memory(Rn,4)=Rm; Rd=temp; 2. temp=(zero extend)memory(Rn,1); memory(Rn,1)=(char)Rm; Rd=temp; B1-45 B1-46 Appendix B1 ARM and Thumb Assembler Instructions Notes ■ The operations are atomic. They cannot be interrupted partway through. ■ Rd, Rm, Rn must not be pc. ■ Rn and Rm must be different registers. Rn and Rd must be different registers. ■ Rn should be aligned to the size of the memory transfer. ■ If a data abort occurs on the load, then the store does not occur. If a data abort occurs on the store, then Rd is not written. You can use the SWP instruction to implement 8-bit or 32-bit semaphores on ARMv5 and below. For ARMv6 use LDREX and STREX in preference. As an example, suppose a byte semaphore register pointed to by r1 can have the value 0xFF (claimed) or 0x00 (free). The following example claims the lock. If the lock is already claimed, then the code loops, waiting for an interrupt or task switch that will free the lock. MOV loops WPB CMP BEQ r0, #0xFF r0, r0, [r1] r0, #0xFF loop ; ; ; ; value to claim the lock try and claim the lock check to see if it was already claimed if so wait for it to become free SXT Byte or halfword extract or extract with accumulate SXTA 1. 2. 3. 4. 5. 6. 7. 8. {S|U}XTB16<cond> {S|U}XTB<cond> {S|U}XTH<cond> {S|U}XTAB16<cond> {S|U}XTAB<cond> {S|U}XTAH<cond> {S|U}XTB {S|U}XTH Rd, Rd, Rd, Rd, Rd, Rd, Ld, Ld, Rm Rm Rm Rn, Rn, Rn, Lm Lm {, ROR#8*<rot> } {, ROR#8*<rot> } {, ROR#8*<rot> } Rm {, ROR#8*<rot> } Rm {, ROR#8*<rot> } Rm {, ROR#8*<rot> } THUMBv3 THUMBv3 ARMv6 ARMv6 ARMv6 ARMv6 ARMv6 ARMv6 Action 1. 
Rd[31:16] = extend(<shifted_Rm>[23:16]); Rd[15:00] = extend(<shifted_Rm>[07:00]) 2. Rd = extend(<shifted_Rm>[07:00]) 3. Rd = extend(<shifted_Rm>[15:00]) 4. Rd[31:16] = Rn[31:16] + extend(<shifted_Rm>[23:16]); 5. 6. 7. 8. Rd[15:00] = Rn[15:00] + extend(<shifted_Rm>[07:00]) Rd = Rn + extend(<shifted_Rm>[07:00]) Rd = Rn + extend(<shifted_Rm>[15:00]) Ld = extend(Lm[07:00]) Ld = extend(Lm[15:00]) B1.3 Alphabetical List of ARM and Thumb Instructions Notes ■ If you specify the S prefix, then extend( x ) sign extends x. ■ If you specify the U prefix, then extend( x ) zero extends x. ■ Rd and Rm must not be pc. ■ <rot> is an immediate in the range 0 to 3. TEQ Test for equality of two 32-bit values 1. TEQ<cond> Rn, #<rotated_immed> 2. TEQ<cond> Rn, Rm {, <shift>} ARMv1 ARMv1 Action ∧ 1. Set the cpsr on the result of (Rn <rotated_immed>) ∧ <shifted_Rm>) 2. Set the cpsr on the result of (Rn Notes ■ The cpsr is updated: N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3). ■ If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes. ■ Use this instruction instead of CMP when you want to check for equality and preserve the carry flag. Example TEQ r0, #1 ; test to see if r0==1 TST Test bits of a 32-bit value 1. TST<cond> Rn, #<rotated_immed> 2. TST<cond> Rn, Rm {, <shift>} 3. TST Ln, Lm ARMv1 ARMv1 THUMBv1 Action 1. Set the cpsr on the result of (Rn & <rotated_immed>) 2. Set the cpsr on the result of (Rn & <shifted_Rm>) 3. Set the cpsr on the result of (Ln & Lm) Notes ■ The cpsr is updated: N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3). ■ If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes. B1-47 B1-48 Appendix B1 ARM and Thumb Assembler Instructions ■ Use this instruction to test whether a selected set of bits are all zero. Example TST r0, #0xFF ; test if the bottom 8 bits of r0 are 0 UADD Unsigned parallel modulo add (see the entry for SADD) UHADD UHSUB Unsigned halving add and subtract (see the entry for SHADD) UMAAL Unsigned multiply accumulate accumulate long 1. UMAAL<cond> RdLo, RdHi, Rm, Rs ARMv6 Action 1. RdHi:RdLo (unsigned)Rm*Rs (unsigned)RdLo (unsigned)RdHi Notes ■ RdHi and RdLo must be different registers. ■ RdHi, RdLo, Rm, Rs must not be pc. ■ This operation cannot overflow because (232 1) (232 1)(232 1) (232 1) (2641). You can use it to synthesize the multiword multiplications used by public key cryptosystems. UMLAL UMULL Unsigned long multiply and multiply accumulate (see the SMLAL and SMULL entries) UQADD UQSUB Unsigned saturated add and subtract (see the QADD entry) USAD Unsigned sum of absolute differences 1. USAD8<cond> 2. USADA8<cond> Rd, Rm, Rs Rd, Rm, Rs, Rn ARMv6 ARMv6 Action 1. Rd = abs(Rm[31:24]-Rs[31:24]) + + abs(Rm[15:08]-Rs[15:08]) + 2. Rd = Rn + abs(Rm[31:24]-Rs[31:24]) + abs(Rm[15:08]-Rs[15:08]) abs(Rm[23:16]-Rs[23:16]) abs(Rm[07:00]-Rs[07:00]) + abs(Rm[23:16]-Rs[23:16]) + abs(Rm[07:00]-Rs[07:00]) B1.4 ARM Assembler Quick Reference Note ■ abs( x ) returns the absolute value of x. Rm and Rs are treated as unsigned. ■ Rd, Rm, and Rs must not be pc. ■ The sum of absolute differences operation is common in video codecs where it provides a metric to measure how similar two images are. 
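The entry above gives no worked example. As an illustrative sketch (the register assignments are assumptions), the following computes and then accumulates the sum of absolute differences of two words each holding four pixels packed one per byte:

        USAD8   r0, r1, r2          ; r0 = |r1[31:24]-r2[31:24]| + ... + |r1[07:00]-r2[07:00]|
        USADA8  r3, r1, r2, r3      ; r3 += the same sum of absolute byte differences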
USAT Unsigned saturation instruction (see the SSAT entry) USUB Unsigned parallel modulo subtracts (see the SADD entry) UXT UXTA Unsigned extract, extract with accumulate (see the entry for SXT) B1.4 ARM Assembler Quick Reference This section summarizes the more useful commands and expressions available with the ARM assembler, armasm. Each assembly line has one of the following formats: {<label>} {<instruction>} ; comment {<symbol>} <directive> ; comment {<arg_0>} <macro> {<arg_1>} {,<arg_2>} .. {,<arg_n>} ; comment where ■ <instruction> is any ARM or Thumb instruction supported by the processor you are assembling for. See Section B1.3. ■ <label> is the name of a symbol to store the address of the instruction. ■ <directive> is an ARM assembler directive. See Section ARM Assembler Directives. ■ <symbol> is the name of a symbol used by the <directive>. ■ <macro> is the name of a new directive defined using the MACRO directive. ■ <arg_k> is the kth macro argument. You must use an AREA directive to define an area before any ARM or Thumb instructions appear. All assembly files must finish with the END directive. The following example shows a simple assembly file defining a function add that returns the sum of the two input arguments: B1-49 B1-50 Appendix B1 ARM and Thumb Assembler Instructions AREA EXPORT add ADD MOV maths_routines, CODE, READONLY add ; give the symbol add external linkage r0, r0, r1 pc, lr ; add input arguments ; return from sub-routine END ARM Assembler Variables The ARM assembler supports three types of assemble time variables (see Table B1.14). Variable names are case sensitive and must be declared before use with the directives GBLx or LCLx. TABLE B1.14 ARM assembler variable types. Declare globally Declare locally to a macro Set value Unsigned 32-bit integer GBLA LCLA SETA 15, 0xab ASCII string GBLS LCLS SETS “”, “ADD” Logical GBLL LCLL SETL {TRUE}, {FALSE} Variable type Example values You can use variables in expressions (see Section ARM Assembler Labels), or substitute their value at assembly time using the $ operator. Specifically, $name. expands to the value of the variable name before the line is assembled. You can omit the final period if name is not followed by an alphanumeric or underscore. Use $$ to produce a single $. Arithmetic variables expand to an eight-digit hexadecimal string on substitution. Logical variables expand to T or F. The following example code shows how to declare and substitute variables of each type: ; arithmetic variables GBLA count ; declare an integer variable count count SETA 1 ; set count = 1 WHILE count<15 BL test$count ; call test00000001, test00000002 ... count SETA count+1 ; .... test00000000E WEND cc ; string variables GBLS cc SETS “NE” ADD$cc r0, r0, r0 STR$cc.B r0, [r1] ; ; ; ; declare a string variable called cc set cc=”NE” assembles as ADDNE r0,r0,r0 assembles as STRNEB r0,[r1] B1.4 ; logical variable GBLL debug debug SETL {TRUE} IF debug BL print_debug ENDIF ; ; ; ; ARM Assembler Quick Reference declare a logical variable called debug set debug={TRUE} if debug is TRUE then print out some debug information ARM Assembler Labels A label definition must begin on the first character of a line. The assembler treats indented text as an instruction, directive, or macro. It treats labels of the form <N><name> as a local label, where <N> is an integer in the range 0 to 99 and <name> is an optional textual name. Local labels are limited in scope by the ROUT directive. To reference a local label, you refer to it as %{|F|B}{|A|T}<N>{<name>}. 
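For instance, a minimal sketch of a countdown loop using local label 1 (the register and the surrounding code are assumptions):

1       SUBS    r0, r0, #1          ; decrement the counter
        BNE     %B1                 ; branch back to the previous definition of local label 1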
The extra prefix letters tell the assembler how to search for the label: ■ If you specify F, the assembler searches forward; if B, then the assembler searches backwards. Otherwise the assembler searches backwards and then forwards. ■ If you specify T, the assembler searches the current macro only; if A, then the assembler searches all macro levels. Otherwise the assembler searches the current and higher macro nesting levels. ARM Assembler Expressions The ARM assembler can evaluate a number of numeric, string, and logical expressions at assembly time. Table B1.15 shows some of the unary and binary operators you can use within expressions. Brackets can be used to change the order of evaluation in the usual way. TABLE B1.15: ARM assembler unary and binary operators. Expression Result Example A+B, A-B A plus or minus B A*B, A/B A multiplied by or divided by B 2*3 = 6, 7/3 = 2 A:MOD:B A modulo B 7:MOD:3 = 1 :CHR:A string with ASCII code A :CHR:32 = “ ” ‘X’ the ASCII value of X ‘a’ = 0x61 :STR:A, :STR:L A or L converted to a string :STR:32 = “00000020” : STR:{TRUE} = “T” 1-2 = 0xffffffff A‹‹B, A:SHL:B A shifted left by B bits 1 ‹‹ 3 = 8 A››B, A:SHR:B A shifted right by B bits (logical shift) 0x80000000 ›› 4 = 0x08000000 A:ROR:B, A:ROL:B A rotated right/left by B bits 1:ROR:1 = 0x80000000 0x80000000:ROL:1 = 1 B1-51 B1-52 Appendix B1 ARM and Thumb Assembler Instructions A=B, A>B, A>=B, A<B, A<=B, A/=B, A<>B comparison of arithmetic or string variables ( /= and <> both mean not equal) (1=2) = {FALSE}, (1<2) = {TRUE}, (“a”=“c”) = {FALSE}, (“a”<“c”) = {TRUE} A: AND: B, A: OR: B, A: EOR: B, :NOT:A Bitwise AND, OR, exclusive OR of A and B; bitwise NOT of A. 1:AND:3 = 1 1:OR:3 = 3:NOT:0 = 0xFFFFFFFF :LEN:S length of the string S :LEN:“ABC” = 3 S:LEFT:B, S:RIGHT:B leftmost or rightmost B characters of S “ABC”:LEFT:2 = “AB”, “ABC”: RIGHT:2 = “BC” S:CC:T the concatenation of S, T “AB”:CC:“C” = “ABC” L:LAND:M, L:LOR:M, L:LEOR:M logical AND, OR, exclusive OR of L and M {TRUE}:LAND:{FALSE} = {FALSE} :DEF:X returns TRUE if a variable called X is defined :BASE:A :INDEX:A see the MAP directive TABLE B1.16 Predefined expressions. Variable Value {ARCHITECURE} The ARM architecture of the CPU (“4T” for ARMv4T) {ARMASM_VERSION} The assembler version number {CONFIG} or {CODESIZE} The bit width of the instructions being assembled (32 for ARM state, 16 for Thumb state) {CPU} The name of the CPU being assembled for {ENDIAN} The configured endianness, “big’’ or “little” {INTER} {TRUE} if ARM/Thumb interworking is on {PC} The address of the current instruction being assembled (alias .) {ROPI}, {RWPI} {TRUE} if read-only/read-write position independent {VAR} The MAP counter (see the MAP directive) (alias @) In Table B1.15, A and B represent arbitrary integers; S and T, strings; and L and M, logical values. You can use labels and other symbols in place of integers in many expressions. Predefined Variables Table B1.16 shows a number of special variables that can appear in expressions. These are predefined by the assembler, and you cannot override them. ARM Assembler Directives Here is an alphabetical list of the more common armasm directives. B1.4 ARM Assembler Quick Reference ALIGN ALIGN {<expression>, {<offset>}} Aligns the address of the next instruction to the form q*<expression>+<offset>. The alignment is relative to the start of the ELF section so this must be aligned appropriately (see the AREA directive). <expression> must be a power of two; the default is 4. <offset> is zero if not specified. 
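For example, a brief sketch (the label and the data values are assumptions) that pads to a 16-byte boundary so that the table which follows starts at an address that is a multiple of 16:

        ALIGN   16
table   DCD     1, 2, 3, 4          ; table now starts on a 16-byte boundary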
AREA AREA <section> {,<attr_1>} {,<attr_2>} ... {,<attr_k>} Starts a new code or data section of name <section>. Table B1.17 lists the possible attributes. TABLE B1.17 AREA attributes. Attribute Meaning expression ALIGN=<expression> Align the ELF section to a 2 ASSOC=<sectionname> If this section is linked, also link <sectionname>. CODE The section contains instructions and is read only. DATA The section contains data and is read write. NOINIT The data section does not require initialization. READONLY The section is read only. READWRITE The section is read write. byte boundary. ASSERT ASSERT <logical-expression> Assemble time assert. If the logical expression is false, then assembly terminates with an error. CN <name> CN <numeric-expression> Set <name> to be an alias for coprocessor register <numeric-expression>. CODE16, CODE32 CODE16 tells the assembler to assemble the following instructions as 16-bit Thumb instructions. CODE32 indicates 32-bit ARM instructions (the default for armasm). B1-53 B1-54 Appendix B1 ARM and Thumb Assembler Instructions CP <name> CP <numeric-expression> Set <name> to be an alias for coprocessor number <numeric-expression>. DATA <label> DATA The DATA directive indicates that the label points to data rather than code. In Thumb mode this prevents the linker from setting the bottom bit of the label. Bit 0 of a function pointer or code label is 0 for ARM code and 1 for Thumb code (see the BX instruction). DCB, DCD{U}, DCI, DCQ{U}, DCW{U} These directives allocate one or more bytes of initialized memory according to Table B1.18. Follow each directive with a comma-separated list of initialization values. If you specify the optional U suffix, then the assembler does not insert any alignment padding. Examples hello powers TABLE B1.18 DCB “hello”, 0 DCD 1, 2, 4, 8, 10, 0x20, 0x40, 0x80 DCI 0xEA000000 Memory initialization directives. Directive Alias Data size (bytes) DCB = 1 byte or string 2 16-bit integer (aligned to 2 bytes) DCW DCD & Initialization value 4 32-bit integer (aligned to 4 bytes) DCQ 8 64-bit integer (aligned to 4 bytes) DCI 2 or 4 integer defining an ARM or Thumb instruction ELSE (alias |) See IF. END This directive must appear at the end of a source file. Assembler source after an END directive is ignored. B1.4 ARM Assembler Quick Reference ENDFUNC (alias ENDP), ENDIF (alias ]) See FUNCTION and IF, respectively. ENTRY This directive specifies the program entry point for the linker. The entry point is usually contained in the ARM C library. EQU (alias *) <name> EQU <numeric-expression> This directive is similar to #define in C. It defines a symbol <name> with value defined by the expression. This value cannot be redefined. See Section ARM Assembler Variables for the use of redefinable variables. EXPORT (alias GLOBAL) EXPORT <symbol>{[WEAK]} Assembler symbols are local to the object file unless exported using this command. You can link exported symbols with other object and library files. The optional [WEAK] suffix indicates that the linker should try and resolve references with other instances of this symbol before using this instance. EXTERN, IMPORT EXTERN IMPORT <symbol>{[WEAK]} <symbol>{[WEAK]} Both of these directives declare the name of an external symbol, defined in another object file or library. If you use this symbol, then the linker will resolve it at link time. For IMPORT, the symbol will be resolved even if you don’t use it. For EXTERN, only used symbols are resolved. 
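For instance (a sketch; printf is assumed to be provided by the C library you link against), an assembly routine that calls into C might declare:

        IMPORT  printf              ; defined elsewhere; resolved by the linker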
If you declare the symbol as [WEAK], then no error is generated if the linker cannot resolve the symbol; instead the symbol takes the value 0. FIELD (alias #) See MAP. FUNCTION (alias PROC) and ENDFUNC (alias ENDP) The FUNCTION and ENDFUNC directives mark the start and end of an ATPCScompliant function. Their main use is to improve the debug view and allow backtracking of function calls during debugging. They also allow the profiler to B1-55 B1-56 Appendix B1 ARM and Thumb Assembler Instructions more accurately profile assembly functions. You must precede the function directive with the ATPCS function name. For example: sub FUNCTION SUB r0, r0, r1 MOV pc, lr ENDFUNC GBLA, GBLL, GBLS Directives defining global arithmetic, logic, and string variables, respectively. See Section ARM Assembler Variables. GET See INCLUDE. GLOBAL See EXPORT. IF (alias [), ELSE (alias |), ENDIF (alias ]) These directives provide for conditional assembly. They are similar to #if, #else, #endif, available in C. The IF directive is followed by a logical expression. The ELSE directive may be omitted. For example: IF ARCHITECTURE=“5TE” SMULBB r0, r1, r1 ELSE MUL r0, r1, r1 ENDIF IMPORT See EXTERN. INCBIN INCBIN <filename> This directive includes the raw data contained in the binary file <filename> at the current point in the assembly. For example, INCBIN table.dat. INCLUDE (alias GET) INCLUDE <filename> Use this directive to include another assembly file. It is similar to the #include command in C. For example, INCLUDE header.h. B1.4 ARM Assembler Quick Reference INFO (alias !) INFO <numeric_expression>, <string_expression> If <numeric_expresssion> is nonzero, then assembly terminates with error <string_ expresssion>. Otherwise the assembler prints <string_expression> as an information message. KEEP KEEP {<symbol>} By default the assembler does not include local symbols in the object file, only exported symbols (see EXPORT). Use KEEP to include all local symbols or a specified local symbol. This aids the debug view. LCLA, LCLL, LCLS These directives declare macro-local arithmetic, logical, and string variables, respectively. See Section ARM Assembler Variables. LTORG Use LTORG to insert a literal pool. The assembler uses literal pools to store the constants appearing in the LDR Rd,=<value> instruction. See LDR format 19. Usually the assembler inserts literal pools automatically, at the end of each area. However, if an area is too large, then the LDR instruction cannot reach this literal pool using pc-relative addressing. Then you need to insert a literal pool manually, near the LDR instruction. MACRO, MEXIT, MEND Use these directives to declare a new assembler macro or pseudoinstruction. The syntax is {$<arg_0>} MACRO <macro_name> {$<arg_1>} {,$<arg_2>} ... {,$<arg_k>} <macro_code> MEND The macro parameters are stored in the dummy variables $<arg_i>. This argument is set to the empty string if you don’t supply a parameter when calling the macro. The MEXIT directive terminates the macro early and is usually used inside IF statements. For example, the following macro defines a new pseudoinstruction SMUL, which evaluates to a SMULBB on an ARMv5TE processor, and an MUL otherwise. B1-57 B1-58 Appendix B1 ARM and Thumb Assembler Instructions $label $label $label MACRO SMUL $a, $b, $c IF {ARCHITECTURE}=“5TE” SMULBB $a, $b, $c MEXIT ENDIF MUL $a, $b, $c MEND MAP (alias ˆ), FIELD (alias #) These directives define objects similar to C structures. 
MAP sets the base address or offset of a structure, and FIELD defines structure elements. The syntax is

         MAP     <base> {, <base_register>}
<name>   FIELD   <field_size_in_bytes>

The MAP directive sets the value of the special assembler variable {VAR} to the base address of the structure. This is either the value <base> or the register relative value <base_register>+<base>. Each FIELD directive sets <name> to the value VAR and increments VAR by the specified number of bytes. For register relative values, the expressions :INDEX:<name> and :BASE:<name> return the element offset from the base register, and the base register number, respectively. In practice the base register form is not that useful. Instead you can use the plain form and mention the base register explicitly in the instruction. This allows you to point to a structure of the same type with different base registers. The following example sets up a structure on the stack of two int variables:

         MAP     0                  ; structure elements offset from 0
count    FIELD   4                  ; define an int called count
type     FIELD   4                  ; define an int called type
size     FIELD   0                  ; record the struct size

         SUB     sp, sp, #size      ; make room on the stack
         MOV     r0, #0
         STR     r0, [sp, #count]   ; clear the count element
         STR     r0, [sp, #type]    ; clear the type element

NOFP

This directive bans the use of floating-point instructions in the assembly file. We don't cover floating-point instructions and directives in this appendix.

OPT

The OPT directive controls the formatting of the armasm -list option. This is seldom used now that source-level debugging is available. See the armasm documentation.

PROC

See FUNCTION.

RLIST, RN

<name>   RN      <numeric expression>
<name>   RLIST   <list of ARM registers enclosed in {}>

These directives name a list of ARM registers or a single ARM register. For example, the following code names r0 as arg and the ATPCS preserved registers as saved.

arg      RN      0
saved    RLIST   {r4-r11}

ROUT

The ROUT directive defines a new local label area. See Section ARM Assembler Labels.

SETA, SETL, SETS

These directives set the values of arithmetic, logical, and string variables, respectively. See Section ARM Assembler Variables.

SPACE (alias %)

{<label>}   SPACE   <numeric_expression>

This directive reserves <numeric_expression> bytes of space. The bytes are zero initialized.

WHILE, WEND

These directives supply an assemble-time looping structure. WHILE is followed by a logical expression. While this expression is true, the assembler repeats the code between WHILE and WEND. The following example shows how to create an array of powers of two from 1 to 65,536.

         GBLA    count
count    SETA    1
         WHILE   count<=65536
         DCD     count
count    SETA    2*count
         WEND

B1.5 GNU Assembler Quick Reference

This section summarizes the more useful commands and expressions available with the GNU assembler, gas, when you target this assembler for ARM. Each assembly line has the format

{<label>:} {<instruction or directive>} @ comment

Unlike the ARM assembler, you needn't indent instructions and directives. Labels are recognized by the following colon rather than their position at the start of the line. The following example shows a simple assembly file defining a function add that returns the sum of the two input arguments:

.section .text, "x"
.global  add                        @ give the symbol add external linkage
add:
    ADD  r0, r0, r1                 @ add input arguments
    MOV  pc, lr                     @ return from subroutine

GNU Assembler Directives

Here is an alphabetical list of the more common gas directives.
.ascii “<string>” Inserts the string as data into the assembly, as for DCB in armasm. .asciz “<string>” As for .ascii but follows the string with a zero byte. .balign <power_of_2> {,<fill_value> {,<max_padding>} } B1.5 GNU Assembler Quick Reference Aligns the address to <power_of_2> bytes. The assembler aligns by adding bytes of value <fill_value> or a suitable default. The alignment will not occur if more than <max_padding> fill bytes are required. Similar to ALIGN in armasm. .byte <byte1> {,<byte2>} ... Inserts a list of byte values as data into the assembly, as for DCB in armasm. .code <number_of_bits> Sets the instruction width in bits. Use 16 for Thumb and 32 for ARM assembly. Similar to CODE16 and CODE32 in armasm. .else Use with .if and .endif. Similar to ELSE in armasm. .end Marks the end of the assembly file. This is usually omitted. .endif Ends a conditional compilation code block. See .if, .ifdef, .ifndef. Similar to ENDIF in armasm. .endm Ends a macro definition. See .macro. Similar to MEND in armasm. .endr Ends a repeat loop. See .rept and .irp. Similar to WEND in armasm. .equ <symbol name>, <value> This directive sets the value of a symbol. It is similar to EQU in armasm. .err Causes assembly to halt with an error. .exitm Exit a macro partway through. See .macro. Similar to MEXIT in armasm. .global <symbol> This directive gives the symbol external linkage. It is similar to EXPORT in armasm. B1-61 B1-62 Appendix B1 ARM and Thumb Assembler Instructions .hword <short1> {,<short2>} ... Inserts a list of 16-bit values as data into the assembly, as for DCW in armasm. .if <logical_expression> Makes a block of code conditional. End the block using .endif. Similar to IF in armasm. See also .else. .ifdef <symbol> Include a block of code if <symbol> is defined. End the block with .endif. .ifndef <symbol> Include a block of code if <symbol> is not defined. End the block with .endif. .include “<filename>” Includes the indicated source file. Similar to INCLUDE in armasm or # include in C. .irp <param> {,<val_1>} {,<val_2>} ... Repeats a block of code, once for each value in the value list. Mark the end of the block using a .endr directive. In the repeated code block, use \<param> to substitute the associated value in the value list. .macro <name> {<arg_1>} {,<arg_1>} ... {,<arg_k>} Defines an assembler macro called <name> with k parameters. The macro definition must end with .endm. To escape from the macro at an earlier point, use .exitm. These directives are similar to MACRO, MEND, and MEXIT in armasm. You must precede the dummy macro parameters by \. For example: .macro SHIFTLEFT a, b .if \b < 0 MOV \a, \a, ASR #-\b .exitm .endif MOV \a, \a, LSL #\b .endm .rept <number_of_times> Repeats a block of code the given number of times. End the block with .endr. <register_name> .req <register_name> B1.5 GNU Assembler Quick Reference This directive names a register. It is similar to the RN directive in armasm except that you must supply a name rather than a number on the right. For example, acc .req r0. .section <section_name> {,”<flags>”} Starts a new code or data section. Usually you should call a code section .text, an initialized data section .data, and an uninitialized data section .bss . These have default flags, and the linker understands these default names. The directive is similar to the armasm directive AREA. Table B1.19 lists possible characters to appear in the <flags> string for ELF format files. 
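For example, a minimal sketch (the section name is an assumption) declaring a read-only, allocatable data section using the flag characters from Table B1.19:

.section .rodata, "a"               @ allocatable, not writable, not executable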
.set <variable_name>, <variable_value> TABLE B1.19 Flag a .section flags for ELF format files. Meaning allocatable section w writable section x executable section This directive sets the value of a variable. It is similar to SETA in armasm. .space <number_of_bytes> {,<fill_byte>} Reserves the given number of bytes. The bytes are filled with zero or <fill_byte> if specified. It is similar to SPACE in armasm. .word <word1> {,<word2>} ... Inserts a list of 32-bit word values as data into the assembly, as for DCD in armasm. B1-63 B2 A P P E N D I X ARM and Thumb Instruction Encodings Andrew Sloss, ARM; Dominic Symes, ARM; Chris Wright, Ultimodule Inc. B2.1 ARM Instruction Set Encodings B-3 B2.2 Thumb Instruction Set Encodings B-9 B2.3 Program Status Registers B-11 This appendix gives tables for the instruction set encodings of the 32-bit ARM and 16-bit Thumb instruction sets. We also describe the fields of the processor status registers cpsr and spsr. B2.1 ARM Instruction Set Encodings Table B2.1 summarizes the bit encodings for the 32-bit ARM instruction set architecture ARMv6. This table is useful if you need to decode an ARM instruction by hand. We’ve expanded the table to aid quick manual decode. Any bitmaps not listed are either unpredictable or undefined for ARMv6. To use Table B2.1 efficiently, follow this decoding procedure: ■ Look at the leading hex digit of the instruction, bits 28 to 31. If this has a value 0xF, then jump to the end of Table B2.1. Otherwise, the top hex digit represents a condition cond. Decode cond using Table B2.2. ARM instruction decode table. cond cond cond cond cond cond cond cond cond cond cond cond cond cond cond cond cond STRH | LDRH post LDRD | STRD | LDRSB | LDRSH post LDRD | STRD | LDRSB | LDRSH post MRS Rd, cpsr | MRS Rd, spsr MSR cpsr, Rm | MSR spsr, Rm BXJ SMLAxy SMLAWy SMULWy SMLALxy SMULxy TST | TEQ | CMP | CMN ORR | BIC MOV | MVN BX | BLX CLZ QADD | QSUB | QDADD | QDSUB TST | ORR | MOV | SWP | STREX LDREX TEQ | CMP | CMN BIC MVN SWPB 0 0 0 0 0 cond cond cond cond cond BKPT 0 cond cond cond cond cond cond cond 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cond 0 0 0 0 0 0 0 0 0 0 0 0 0 1 U 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 op op 0 op 1 op 0 0 0 0 0 Rn Rn Rn Rd Rd RdHi RdHi Rn Rn Rn Rn Rn 0 0 Rn Rn Rn 1 1 s x 1 1 Rd Rd Rd RdHi Rd Rn Rn 0 0 0 1 1 1 1 1 1 Rn 1 f 1 1 S S 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 S S 0 0 0 0 op 0 op 0 op op 0 op 1 0 1 0 0 0 1 0 1 1 0 1 1 op op 0 op 1 0 1 1 1 op 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 U 1 0 0 0 U 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 S S 0 S 1 S 0 0 op S 0 0 op 0 0 1 op op 0 0 0 U 1 0 0 0 0 0 0 0 0 0 0 0 immed[15:4] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 0 0 1 y x y 0 y 1 y x y x shift shift shift 0 0 op 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 1 op 1 1 1 op 1 1 0 1 1 1 1 1 1 1 0 shift 1 shift 0 0 0 0 0 Rs 0 shift Rd Rs 0 shift 0 Rd Rs 0 shift Rd 0 0 0 0 1 0 0 Rd 1 1 1 1 1 0 0 Rd 1 1 1 1 1 0 0 0 1 1 1 c 1 Rs shift_size 0 0 0 0 Rs Rn Rs RdLo Rs RdLo Rs Rd 0 0 0 0 immed Rd [7:4] Rd 0 0 0 0 immed Rd [7:4] Rd 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 Rn Rs Rn Rs 0 0 0 0 Rs RdLo Rs 0 0 0 0 Rs 0 0 0 0 shift_size Rd shift_size Rd shift_size 1 1 1 1 1 1 1 1 Rd 1 1 1 1 Rd 0 0 0 0 Rd Rd Rm Rm Rm Rm Rm immed [3:0] Rm immed [3:0] 0 0 00 Rm Rm Rm Rm Rm Rm Rm Rm Rm Rm Rm Rm Rm immed [3:0] Rm Rm Rm Rm Rm 1 1 11 Rm Rm 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 AND | EOR | SUB | RSB | 
TABLE B2.2 Decoding table for cond.

Binary  Hex  cond       Binary  Hex  cond
0000    0    EQ         1000    8    HI
0001    1    NE         1001    9    LS
0010    2    CS/HS      1010    A    GE
0011    3    CC/LO      1011    B    LT
0100    4    MI         1100    C    GT
0101    5    PL         1101    D    LE
0110    6    VS         1110    E    {AL}
0111    7    VC

TABLE B2.3 Decoding table for mode.

Binary  Hex   mode
10000   0x10  user mode (_usr)
10001   0x11  FIQ mode (_fiq)
10010   0x12  IRQ mode (_irq)
10011   0x13  supervisor mode (_svc)
10111   0x17  abort mode (_abt)
11011   0x1B  undefined mode (_und)
11111   0x1F  system mode

■ The instruction operands have the same name as in the instruction description of Appendix B1. The table uses the following abbreviations:
■ L is 1 if the L suffix applies for LDC and STC operations.
■ M is 1 if CPS changes processor mode. mode is defined in Table B2.3.
■ op1 and op2 are the opcode extension fields in coprocessor instructions.
■ post indicates a postindexed addressing mode such as [Rn], Rm or [Rn], #immed.
■ pre indicates a preindexed addressing mode such as [Rn, Rm] or [Rn, #immed].
■ register_list is a bit field with bit k set if register Rk appears in the register list.
■ rot is a byte rotate. The second operand is Rm ROR (8*rot).
■ rotate is a bit rotate. The second operand is #immed ROR (2*rotate).
■ shift and sh encode a shift type and direction. See Table B2.4.
■ U is the up/down select for addressing modes. If U = 1, then we add the offset to the base address, as in [Rn], #4 or [Rn, Rm]. If U = 0, then we subtract the offset from the base address, as in [Rn, #-4] or [Rn], -Rm.
■ unindexed indicates an addressing mode of the form [Rn], {option}.
■ R is 1 if the R (round) instruction suffix is present.
■ T is 1 if the T suffix is present on load and store instructions.
■ W is 1 if ! (writeback) is specified in the instruction mnemonic.
■ X is 1 if the X (exchange) instruction suffix is present.
■ x and y are 0 for the B suffix, 1 for the T suffix.
■ ∧ is 1 if the ∧ suffix is applied in LDM or STM instructions.
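To make the addressing-mode abbreviations concrete, here is a small C sketch of how the post, pre, U, and W fields combine to form the access address and the updated base register. The mem_access type and gen_address function are our own illustration, not part of any ARM tool.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of ARM load/store address generation.
   pre = true selects preindexed addressing ([Rn, #imm]); false selects postindexed ([Rn], #imm).
   U   = true adds the offset to the base; false subtracts it.
   W   = true means ! (writeback) was specified; postindexed forms always write back. */
typedef struct {
    uint32_t address;   /* address used for the memory access */
    uint32_t new_base;  /* value left in Rn afterwards        */
} mem_access;

static mem_access gen_address(uint32_t base, uint32_t offset,
                              bool pre, bool U, bool W)
{
    mem_access m;
    uint32_t adjusted = U ? base + offset : base - offset;

    if (pre) {
        m.address  = adjusted;             /* offset applied before the access */
        m.new_base = W ? adjusted : base;  /* writeback only when ! is given   */
    } else {
        m.address  = base;                 /* access uses the unmodified base  */
        m.new_base = adjusted;             /* base is always updated afterward */
    }
    return m;
}

int main(void)
{
    /* LDR r0, [r1, #4]! : preindexed, U = 1, writeback */
    mem_access a = gen_address(0x8000, 4, true, true, true);
    /* LDR r0, [r1], #-4 : postindexed, U = 0 */
    mem_access b = gen_address(0x8000, 4, false, false, false);
    printf("pre : address 0x%X, Rn 0x%X\n", (unsigned)a.address, (unsigned)a.new_base);
    printf("post: address 0x%X, Rn 0x%X\n", (unsigned)b.address, (unsigned)b.new_base);
    return 0;
}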
TABLE B2.4 Decoding table for shift, shift_size, and Rs.

shift  shift_size  Rs   Shift action
00     0 to 31     N/A  LSL #shift_size
00     N/A         Rs   LSL Rs
01     0           N/A  LSR #32
01     1 to 31     N/A  LSR #shift_size
01     N/A         Rs   LSR Rs
10     0           N/A  ASR #32
10     1 to 31     N/A  ASR #shift_size
10     N/A         Rs   ASR Rs
11     0           N/A  RRX
11     1 to 31     N/A  ROR #shift_size
11     N/A         Rs   ROR Rs
N/A    0 to 31     N/A  The shift value is implicit: for PKHBT it is 00, for PKHTB it is 10, and for SAT it is 2*sh.

B2.2 Thumb Instruction Set Encodings

Table B2.5 summarizes the bit encodings for the 16-bit Thumb instruction set. This table is useful if you need to decode a Thumb instruction by hand. We've expanded the table to aid quick manual decode. The table contains instruction definitions up to architecture THUMBv3. Any bitmaps not listed are either unpredictable or undefined for THUMBv3. To use the table efficiently, follow this decoding procedure:

■ Index through the table using the first hex digit of the instruction, bits 12 to 15 (shaded).
■ Index on any shaded bits from bits 0 to 11.
■ Once you have located the correct table entry, look at the bits named op. Concatenate these to form a binary number that indexes the | separated instruction list on the left. For example, if there are two op bits with the values 1 and 0, then the binary value 10 indicates instruction number 2 in the list (the third instruction).
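The same first step can be sketched for Thumb. The routine below simply exposes the index fields named in the procedure; the function name is our own, and the two example words are common Thumb encodings used only for illustration.

#include <stdint.h>
#include <stdio.h>

/* Sketch of the first decoding steps for a 16-bit Thumb instruction.
   Table B2.5 is indexed first by bits 15:12 and then by further bits
   within bits 11:0, so this routine just reports those fields. */
static void decode_thumb_top_level(uint16_t instr)
{
    unsigned top   = (instr >> 12) & 0xF;  /* first hex digit, bits 12-15 */
    unsigned bit11 = (instr >> 11) & 1;    /* examples of the lower bits  */
    unsigned bit10 = (instr >> 10) & 1;    /* that some rows index on     */

    printf("Thumb instruction 0x%04X: table row 0x%X, bit11=%u, bit10=%u\n",
           (unsigned)instr, top, bit11, bit10);
}

int main(void)
{
    decode_thumb_top_level(0x1888);  /* example word: ADD r0, r1, r2 */
    decode_thumb_top_level(0xB500);  /* example word: PUSH {lr}      */
    return 0;
}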
TABLE B2.5 Thumb instruction decode table: the bit-level encodings (bits 15 down to 0) of each Thumb instruction class, covering the shift, add/subtract, and immediate formats, the register ALU and high-register operations (including BX and BLX), pc- and sp-relative loads and stores, the register- and immediate-offset load/store formats, stack and block transfers (PUSH, POP, STMIA, LDMIA), the miscellaneous instructions (SXTH/SXTB/UXTH/UXTB, REV/REV16/REVSH, SETEND, CPSIE/CPSID, BKPT), conditional and unconditional branches, SWI, and the two-halfword BL and BLX prefix pairs.

■ The instruction operands have the same name as in the instruction description of Appendix B1. The table uses the following abbreviations:
■ register_list is a bit field with bit k set if register Rk appears in the register list.
■ R is 1 if lr is in the register list of PUSH or pc is in the register list of POP.

B2.3 Program Status Registers

Table B2.6 shows how to decode the 32-bit program status registers for ARMv6.

TABLE B2.6 cpsr and spsr decode table.

Bits    31  30  29  28  27  26:25  24  23:20  19:16    15:10  9  8  7  6  5  4:0
Field   N   Z   C   V   Q   Res    J   Res    GE[3:0]  Res    E  A  I  F  T  mode

Field    Use
N        Negative flag, records bit 31 of the result of flag-setting operations.
Z        Zero flag, records if the result of a flag-setting operation is zero.
C        Carry flag, records unsigned overflow for addition, not-borrow for subtraction, and is also used by the shifting circuit. See Table B1.3.
V        Overflow flag, records signed overflows for flag-setting operations.
Q        Saturation flag. Certain operations set this flag on saturation. See, for example, QADD in Appendix B1 (ARMv5E and above).
J        J = 1 indicates Java execution (must have T = 0). Use the BXJ instruction to change this bit (ARMv5J and above).
Res      These bits are reserved for future expansion. Software should preserve the values in these bits.
GE[3:0]  The SIMD greater-or-equal flags. See SADD in Appendix B1 (ARMv6).
E        Controls the data endianness. See SETEND in Appendix B1 (ARMv6).
A        A = 1 disables imprecise data aborts (ARMv6).
I        I = 1 disables IRQ interrupts.
F        F = 1 disables FIQ interrupts.
T        T = 1 indicates Thumb state. T = 0 indicates ARM state. Use the BX or BLX instructions to change this bit (ARMv4T and above).
mode     The current processor mode. See Table B2.3.
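As a companion to Table B2.6, the following C sketch extracts the ARMv6 fields from a cpsr value. The psr_fields structure and decode_psr function are our own names, used only for illustration.

#include <stdint.h>
#include <stdio.h>

/* ARMv6 cpsr/spsr fields from Table B2.6, extracted by bit position. */
typedef struct {
    unsigned N, Z, C, V, Q;  /* condition and saturation flags, bits 31-27 */
    unsigned J;              /* Java state, bit 24                         */
    unsigned GE;             /* SIMD greater-or-equal flags, bits 19-16    */
    unsigned E, A, I, F, T;  /* endianness, mask, and state bits, bits 9-5 */
    unsigned mode;           /* processor mode, bits 4-0 (see Table B2.3)  */
} psr_fields;

static psr_fields decode_psr(uint32_t psr)
{
    psr_fields f;
    f.N    = (psr >> 31) & 1;
    f.Z    = (psr >> 30) & 1;
    f.C    = (psr >> 29) & 1;
    f.V    = (psr >> 28) & 1;
    f.Q    = (psr >> 27) & 1;
    f.J    = (psr >> 24) & 1;
    f.GE   = (psr >> 16) & 0xF;
    f.E    = (psr >> 9)  & 1;
    f.A    = (psr >> 8)  & 1;
    f.I    = (psr >> 7)  & 1;
    f.F    = (psr >> 6)  & 1;
    f.T    = (psr >> 5)  & 1;
    f.mode = psr & 0x1F;
    return f;
}

int main(void)
{
    /* 0x600000D3: Z and C set, IRQs and FIQs disabled, supervisor mode (0x13). */
    psr_fields f = decode_psr(0x600000D3);
    printf("N=%u Z=%u C=%u V=%u Q=%u J=%u GE=0x%X E=%u A=%u I=%u F=%u T=%u mode=0x%02X\n",
           f.N, f.Z, f.C, f.V, f.Q, f.J, f.GE, f.E, f.A, f.I, f.F, f.T, f.mode);
    return 0;
}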