2
Instructions: Language of the Computer

I speak Spanish to God, Italian to women, French to men, and German to my horse.
Charles V, King of France, 1337–1380
2.1  Introduction 76
2.2  Operations of the Computer Hardware 77
2.3  Operands of the Computer Hardware 80
2.4  Signed and Unsigned Numbers 86
2.5  Representing Instructions in the Computer 93
2.6  Logical Operations 100
2.7  Instructions for Making Decisions 104
2.8  Supporting Procedures in Computer Hardware 113
2.9  Communicating with People 122
2.10 ARM Addressing for 32-Bit Immediates and More Complex Addressing Modes 127
2.11 Parallelism and Instructions: Synchronization 133
2.12 Translating and Starting a Program 135
2.13 A C Sort Example to Put It All Together 143
2.14 Arrays versus Pointers 152
2.15 Advanced Material: Compiling C and Interpreting Java 156
2.16 Real Stuff: MIPS Instructions 156
2.17 Real Stuff: x86 Instructions 161
2.18 Fallacies and Pitfalls 170
2.19 Concluding Remarks 171
2.20 Historical Perspective and Further Reading 174
2.21 Exercises 174
The Five Classic Components of a Computer
2.1 Introduction

instruction set The vocabulary of commands understood by a given architecture.
To command a computer’s hardware, you must speak its language. The words
of a computer’s language are called instructions, and its vocabulary is called an
instruction set. In this chapter, you will see the instruction set of a real computer,
both in the form written by people and in the form read by the computer. We
introduce instructions in a top-down fashion. Starting from a notation that looks
like a restricted programming language, we refine it step-by-step until you see
the real language of a real computer. Chapter 3 continues our downward descent,
unveiling the hardware for arithmetic and the representation of floating-point
numbers.
You might think that the languages of computers would be as diverse as those
of people, but in reality computer languages are quite similar, more like regional
dialects than like independent languages. Hence, once you learn one, it is easy to
pick up others. This similarity occurs because all computers are constructed from
hardware technologies based on similar underlying principles and because there
are a few basic operations that all computers must provide. Moreover, computer
designers have a common goal: to find a language that makes it easy to build the
hardware and the compiler while maximizing performance and minimizing cost
and power. This goal is time honored; the following quote was written before you
could buy a computer, and it is as true today as it was in 1947:
It is easy to see by formal-logical methods that there exist certain [instruction
sets] that are in abstract adequate to control and cause the execution of any
sequence of operations . . . . The really decisive considerations from the present
point of view, in selecting an [instruction set], are more of a practical nature:
simplicity of the equipment demanded by the [instruction set], and the clarity of
its application to the actually important problems together with the speed of its
handling of those problems.
Burks, Goldstine, and von Neumann, 1947
The “simplicity of the equipment” is as valuable a consideration for today’s
computers as it was for those of the 1950s. The goal of this chapter is to teach
an instruction set that follows this advice, showing both how it is represented
in hardware and the relationship between high-level programming languages
and this more primitive one. Our examples are in the C programming language;
Section 2.15 on the CD shows how these would change for an object-oriented
language like Java.
By learning how to represent instructions, you will also discover the secret
of computing: the stored-program concept. Moreover, you will exercise your
“foreign language” skills by writing programs in the language of the computer and
running them on the simulator that comes with this book. You will also see the
impact of programming languages and compiler optimization on performance.
We conclude with a look at the historical evolution of instruction sets and an overview of other computer dialects.
The chosen instruction set is ARM, the most popular 32-bit instruction set in the world: roughly 4 billion ARM processors were shipped in 2008. Later, we will take
a quick look at two other popular instruction sets. MIPS is quite similar to ARM,
making up in elegance what it lacks in popularity. The other example, the Intel
x86, is inside almost all of the 300 million PCs made in 2008.
We reveal the ARM instruction set a piece at a time, giving the rationale along
with the computer structures. This top-down, step-by-step tutorial weaves the components with their explanations, making the computer’s language more palatable.
Figure 2.1 gives a sneak preview of the instruction set covered in this chapter.
2.2 Operations of the Computer Hardware
There must certainly be instructions for performing the fundamental arithmetic operations.
Burks, Goldstine, and von Neumann, 1947

stored-program concept The idea that instructions and data of many types can be stored in memory as numbers, leading to the stored-program computer.

Every computer must be able to perform arithmetic. The ARM assembly language
notation
ADD a, b, c
instructs a computer to add the two variables b and c and to put their sum in a.
This notation is rigid in that each ARM arithmetic instruction performs only
one operation and must always have exactly three variables. For example, suppose
we want to place the sum of four variables b, c, d, and e into variable a. (In this
section we are being deliberately vague about what a “variable” is; in the next section we’ll explain in detail.)
The following sequence of instructions adds the four variables:
ADD a, b, c   ; The sum of b and c is placed in a.
ADD a, a, d   ; The sum of b, c, and d is now in a.
ADD a, a, e   ; The sum of b, c, d, and e is now in a.
Thus, it takes three instructions to sum the four variables.
The words to the right of the semicolon (;) on each line above are comments
for the human reader; the computer ignores them. Note that, unlike other programming languages, each line of this language can contain at most one instruction. Another difference from C is that comments always terminate at the end of
a line.
ARM operands

Name              | Example                                        | Comments
16 registers      | r0, r1, r2, . . . , r11, r12, sp, lr, pc       | Fast locations for data. In ARM, data must be in registers to perform arithmetic.
2^30 memory words | Memory[0], Memory[4], . . . , Memory[4294967292] | Accessed only by data transfer instructions. ARM uses byte addresses, so sequential word addresses differ by 4. Memory holds data structures, arrays, and spilled registers.

ARM assembly language

Category             | Instruction                       | Example             | Meaning                                     | Comments
Arithmetic           | add                               | ADD r1,r2,r3        | r1 = r2 + r3                                | 3 register operands
Arithmetic           | subtract                          | SUB r1,r2,r3        | r1 = r2 – r3                                | 3 register operands
Data transfer        | load register                     | LDR r1, [r2,#20]    | r1 = Memory[r2 + 20]                        | Word from memory to register
Data transfer        | store register                    | STR r1, [r2,#20]    | Memory[r2 + 20] = r1                        | Word from register to memory
Data transfer        | load register halfword            | LDRH r1, [r2,#20]   | r1 = Memory[r2 + 20]                        | Halfword memory to register
Data transfer        | load register halfword signed     | LDRHS r1, [r2,#20]  | r1 = Memory[r2 + 20]                        | Halfword memory to register
Data transfer        | store register halfword           | STRH r1, [r2,#20]   | Memory[r2 + 20] = r1                        | Halfword register to memory
Data transfer        | load register byte                | LDRB r1, [r2,#20]   | r1 = Memory[r2 + 20]                        | Byte from memory to register
Data transfer        | load register byte signed         | LDRBS r1, [r2,#20]  | r1 = Memory[r2 + 20]                        | Byte from memory to register
Data transfer        | store register byte               | STRB r1, [r2,#20]   | Memory[r2 + 20] = r1                        | Byte from register to memory
Data transfer        | swap                              | SWP r1, [r2,#20]    | r1 = Memory[r2 + 20], Memory[r2 + 20] = r1  | Atomic swap register and memory
Logical              | mov                               | MOV r1, r2          | r1 = r2                                     | Copy value into register
Logical              | and                               | AND r1, r2, r3      | r1 = r2 & r3                                | Three reg. operands; bit-by-bit AND
Logical              | or                                | ORR r1, r2, r3      | r1 = r2 | r3                                | Three reg. operands; bit-by-bit OR
Logical              | not                               | MVN r1, r2          | r1 = ~r2                                    | Two reg. operands; bit-by-bit NOT
Logical              | logical shift left (optional operation)  | LSL r1, r2, #10 | r1 = r2 << 10                            | Shift left by constant
Logical              | logical shift right (optional operation) | LSR r1, r2, #10 | r1 = r2 >> 10                            | Shift right by constant
Conditional branch   | compare                           | CMP r1, r2          | cond. flag = r1 – r2                        | Compare for conditional branch
Conditional branch   | branch on EQ, NE, LT, LE, GT, GE, LO, LS, HI, HS, VS, VC, MI, PL | BEQ 25 | if (r1 == r2) go to PC + 8 + 100 | Conditional test; PC-relative branch
Unconditional branch | branch (always)                   | B 2500              | go to PC + 8 + 10000                        | Branch
Unconditional branch | branch and link                   | BL 2500             | r14 = PC + 4; go to PC + 8 + 10000          | For procedure call

FIGURE 2.1 ARM assembly language revealed in this chapter. This information is also found in Column 1 of the ARM Reference Data Card at the front of this book.
The natural number of operands for an operation like addition is three: the
two numbers being added together and a place to put the sum. Requiring every
instruction to have exactly three operands, no more and no less, conforms to the
philosophy of keeping the hardware simple: hardware for a variable number of
operands is more complicated than hardware for a fixed number. This situation
illustrates the first of four underlying principles of hardware design:
Design Principle 1: Simplicity favors regularity.
We can now show, in the two examples that follow, the relationship of programs
written in higher-level programming languages to programs in this more primitive
notation.
Compiling Two C Assignment Statements into ARM

EXAMPLE
This segment of a C program contains the five variables a, b, c, d, and e. Since Java evolved from C, this example and the next few work for either high-level programming language:

a = b + c;
d = a – e;

The translation from C to ARM assembly language instructions is performed by the compiler. Show the ARM code produced by a compiler.

ANSWER
An ARM instruction operates on two source operands and places the result in one destination operand. Hence, the two simple statements above compile directly into these two ARM assembly language instructions:

ADD a, b, c
SUB d, a, e
Compiling a Complex C Assignment into ARM

EXAMPLE
A somewhat complex statement contains the five variables f, g, h, i, and j:

f = (g + h) – (i + j);

What might a C compiler produce?

ANSWER
The compiler must break this statement into several assembly instructions, since only one operation is performed per ARM instruction. The first ARM instruction calculates the sum of g and h. We must place the result somewhere, so the compiler creates a temporary variable, called t0:

ADD t0,g,h ; temporary variable t0 contains g + h
Although the next operation is subtract, we need to calculate the sum of i and
j before we can subtract. Thus, the second instruction places the sum of i and
j in another temporary variable created by the compiler, called t1:
ADD t1,i,j ; temporary variable t1 contains i + j
Finally, the subtract instruction subtracts the second sum from the first and
places the difference in the variable f, completing the compiled code:
SUB f,t0,t1 ; f gets t0 – t1, which is (g + h) – (i + j)
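One way to see what the compiler just did is to write the temporaries explicitly in C. The sketch below is ours, not the book's; the names t0 and t1 and the sample values are only for illustration.

#include <stdio.h>

int main(void) {
    int g = 1, h = 2, i = 3, j = 4;   /* sample values, chosen only for illustration */
    int t0 = g + h;                   /* ADD t0,g,h */
    int t1 = i + j;                   /* ADD t1,i,j */
    int f  = t0 - t1;                 /* SUB f,t0,t1 */
    printf("%d\n", f);                /* prints -4 = (1 + 2) - (3 + 4) */
    return 0;
}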
Check Yourself
For a given function, which programming language likely takes the most lines of
code? Put the three representations below in order.
1. Java
2. C
3. ARM assembly language
Elaboration: To increase portability, Java was originally envisioned as relying on a
software interpreter. The instruction set of this interpreter is called Java bytecodes (see
Section 2.15 on the CD), which is quite different from the ARM instruction set. To
get performance close to the equivalent C program, Java systems today typically compile
Java bytecodes into the native instruction sets like ARM. Because this compilation is
normally done much later than for C programs, such Java compilers are often called Just
In Time (JIT) compilers. Section 2.12 shows how JITs are used later than C compilers in
the start-up process, and Section 2.13 shows the performance consequences of compiling versus interpreting Java programs.
2.3 Operands of the Computer Hardware

word The natural unit of access in a computer, usually a group of 32 bits; corresponds to the size of a register in the ARM architecture.
Unlike programs in high-level languages, the operands of arithmetic instructions
are restricted; they must be from a limited number of special locations built directly
in hardware called registers. Registers are primitives used in hardware design that
are also visible to the programmer when the computer is completed, so you can
think of registers as the bricks of computer construction. The size of a register in
the ARM architecture is 32 bits; groups of 32 bits occur so frequently that they are
given the name word in the ARM architecture.
One major difference between the variables of a programming language and
registers is the limited number of registers, typically 16 to 32 on current computers.
(See Section 2.20 on the CD for the history of the number of registers.) Thus,
continuing in our top-down, stepwise evolution of the symbolic representation
of the ARM language, in this section we have added the restriction that the three
operands of ARM arithmetic instructions must each be chosen from one of the 16
32-bit registers.
The reason for the limit of 16 registers may be found in the second of our four
underlying design principles of hardware technology:
Design Principle 2: Smaller is faster.
A very large number of registers may increase the clock cycle time simply because
it takes electronic signals longer when they must travel farther.
Guidelines such as “smaller is faster” are not absolutes; 15 registers may not be
faster than 16. Yet, the truth behind such observations causes computer designers to
take them seriously. In this case, the designer must balance the craving of programs for
more registers with the designer’s desire to keep the clock cycle fast. Another reason for
not using more than 16 is the number of bits it would take in the instruction format, as
Section 2.5 demonstrates. Energy is a major concern today, so a third reason for using
fewer registers is to conserve energy.
Chapter 4 shows the central role that registers play in hardware construction; as
we shall see in this chapter, effective use of registers is critical to program performance. We use the convention r0, r1, . . . , r15 to refer to registers 0, 1, . . . , 15.
Compiling a C Assignment Using Registers

EXAMPLE
It is the compiler's job to associate program variables with registers. Take, for instance, the assignment statement from our earlier example:

f = (g + h) – (i + j);

The variables f, g, h, i, and j are assigned to the registers r0, r1, r2, r3, and r4, respectively. What is the compiled ARM code?

ANSWER
The compiled program is very similar to the prior example, except we replace the variables with the register names mentioned above plus two temporary registers, r5 and r6, which correspond to the temporary variables above:

ADD r5,r0,r1 ; register r5 contains g + h
ADD r6,r2,r3 ; register r6 contains i + j
SUB r4,r5,r6 ; r4 gets r5 – r6, which is (g + h) – (i + j)
Memory Operands
Programming languages have simple variables that contain single data elements, as
in these examples, but they also have more complex data structures—arrays and
structures. These complex data structures can contain many more data elements
than there are registers in a computer. How can a computer represent and access
such large structures?
Processor and Memory:

Address | Data
3       | 100
2       | 10
1       | 101
0       | 1

FIGURE 2.2 Memory addresses and contents of memory at those locations. If these elements were words, these addresses would be incorrect, since ARM actually uses byte addressing, with each word representing four bytes. Figure 2.3 shows the memory addressing for sequential word addresses.
data transfer instruction
A command that moves
data between memory
and registers.
address A value used to
delineate the location of
a specific data element
within a memory array.
Recall the five components of a computer introduced in Chapter 1 and repeated
on page 75. The processor can keep only a small amount of data in registers, but
computer memory contains billions of data elements. Hence, data structures
(arrays and structures) are kept in memory.
As explained above, arithmetic operations occur only on registers in ARM
instructions; thus, ARM must include instructions that transfer data between
memory and registers. Such instructions are called data transfer instructions. To
access a word in memory, the instruction must supply the memory address. Memory is just a large, single-dimensional array, with the address acting as the index
to that array, starting at 0. For example, in Figure 2.2, the address of the third data
element is 2, and the value of Memory[2] is 10.
The data transfer instruction that copies data from memory to a register is traditionally called load. The format of the load instruction is the name of the operation followed by the register to be loaded, then a constant and register used to
access memory. The sum of the constant portion of the instruction and the contents of the second register forms the memory address. The actual ARM name for
this instruction is LDR, standing for load word into register.
Compiling an Assignment When an Operand Is in Memory

EXAMPLE
Let's assume that A is an array of 100 words and that the compiler has associated the variables g and h with the registers r1 and r2 and uses r5 as a temporary register as before. Let's also assume that the starting address, or base address, of the array is in r3. Compile this C assignment statement:

g = h + A[8];

ANSWER
Although there is a single operation in this assignment statement, one of the operands is in memory, so we must first transfer A[8] to a register. The address of this array element is the sum of the base of the array A, found in register r3, plus the number to select element 8. The data should be placed in a temporary register for use in the next instruction. Based on Figure 2.2, the first compiled instruction is

LDR r5,[r3,#8] ; Temporary reg r5 gets A[8]

(On the next page we'll make a slight adjustment to this instruction, but we'll use this simplified version for now.) The following instruction can operate on the value in r5 (which equals A[8]) since it is in a register. The instruction must add h (contained in r2) to A[8] (r5) and put the sum in the register corresponding to g (associated with r1):

ADD r1,r2,r5 ; g = h + A[8]
The constant in a data transfer instruction (8) is called the offset, and the register added to form the address (r3) is called the base register.
Hardware/Software Interface
In addition to associating variables with registers, the compiler allocates data structures like arrays and structures to locations in memory. The compiler can then place the proper starting address into the data transfer instructions.

Since 8-bit bytes are useful in many programs, most architectures address individual bytes. Therefore, the address of a word matches the address of one of the 4 bytes within the word, and addresses of sequential words differ by 4. For example, Figure 2.3 shows the actual ARM addresses for the words in Figure 2.2; the byte address of the third word is 8.

Processor and Memory:

Byte Address | Data
12           | 100
8            | 10
4            | 101
0            | 1

FIGURE 2.3 Actual ARM memory addresses and contents of memory for those words. The changed addresses are highlighted to contrast with Figure 2.2. Since ARM addresses each byte, word addresses are multiples of 4: there are 4 bytes in a word.
alignment restriction A requirement that data be aligned in memory on natural boundaries.
In ARM, words must start at addresses that are multiples of 4. This requirement
is called an alignment restriction, and many architectures have it. (Chapter 4 suggests why alignment leads to faster data transfers.)
Computers divide into those that use the address of the leftmost or “big end”
byte as the word address versus those that use the rightmost or “little end” byte.
ARM is in the little-endian camp.
Byte addressing also affects the array index. To get the proper byte address in
the code above, the offset to be added to the base register r3 must be 4 × 8, or 32, so
that the load address will select A[8] and not A[8/4]. (See the related pitfall on
page 171 of Section 2.18.)
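A small C sketch of the same address arithmetic may help. It is our own illustration, and it assumes, as the text does, 4-byte words (sizeof(int) == 4 on most machines); the variables base, offset, and p are just names we chose.

#include <stdio.h>

int main(void) {
    int A[100];
    A[8] = 42;
    char *base   = (char *) A;             /* what base register r3 would hold        */
    int   offset = 8 * sizeof(int);        /* 8 * 4 = 32: the byte offset for A[8]    */
    int  *p = (int *) (base + offset);     /* the address LDR r5,[r3,#32] would form  */
    printf("%d\n", *p);                    /* prints 42                               */
    return 0;
}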
The instruction complementary to load is traditionally called store; it copies
data from a register to memory. The format of a store is similar to that of a load:
the name of the operation, followed by the register to be stored, then offset to select
the array element, and finally the base register. Once again, the ARM address is
specified in part by a constant and in part by the contents of a register. The actual
ARM name is STR, standing for store word from a register.
Compiling Using Load and Store

EXAMPLE
Assume variable h is associated with register r2 and the base address of the array A is in r3. What is the ARM assembly code for the C assignment statement below?

A[12] = h + A[8];

ANSWER
Although there is a single operation in the C statement, now two of the operands are in memory, so we need even more ARM instructions. The first two instructions are the same as the prior example, except this time we use the proper offset for byte addressing in the load word instruction to select A[8], and the ADD instruction places the sum in r5:

LDR r5,[r3,#32]   ; Temporary reg r5 gets A[8]
ADD r5,r2,r5      ; Temporary reg r5 gets h + A[8]

The final instruction stores the sum into A[12], using 48 (4 × 12) as the offset and register r3 as the base register.

STR r5,[r3,#48]   ; Stores h + A[8] back into A[12]
Load word and store word are the instructions that copy words between memory and registers in the ARM architecture. Other brands of computers use other
instructions along with load and store to transfer data. An architecture with such
alternatives is the Intel x86, described in Section 2.17.
Hardware/Software Interface
Many programs have more variables than computers have registers. Consequently, the compiler tries to keep the most frequently used variables in registers and places the rest in memory, using loads and stores to move variables between registers and memory. The process of putting less commonly used variables (or those needed later) into memory is called spilling registers.

The hardware principle relating size and speed suggests that memory must be slower than registers, since there are fewer registers. This is indeed the case; data accesses are faster if data is in registers instead of memory.

Moreover, data is more useful when in a register. An ARM arithmetic instruction can read two registers, operate on them, and write the result. An ARM data transfer instruction only reads one operand or writes one operand, without operating on it.

Thus, registers take less time to access and have higher throughput than memory, making data in registers both faster to access and simpler to use. Accessing registers also uses less energy than accessing memory. To achieve highest performance and conserve energy, compilers must use registers efficiently.
Constant or Immediate Operands
Many times a program will use a constant in an operation—for example, incrementing an index to point to the next element of an array. In fact, more than half
of the ARM arithmetic instructions have a constant as an operand when running
the SPEC2006 benchmarks.
Using only the instructions we have seen so far, we would have to load a constant from memory to use one. (The constants would have been placed in memory
when the program was loaded.) For example, to add the constant 4 to register r3,
we could use the code
LDR r5, [r1,#AddrConstant4]   ; r5 = constant 4
ADD r3,r3,r5                  ; r3 = r3 + r5 (r5 == 4)
assuming that r1 + AddrConstant4 is the memory address of the constant 4.
An alternative that avoids the load instruction is to offer the option of having one
operand of the arithmetic instructions be a constant, called an immediate operand.
To add 4 to register r3, we just write

ADD r3,r3,#4   ; r3 = r3 + 4
The sharp or hash symbol (#) means the following number is a constant.
Immediate operands illustrate the third hardware design principle, first mentioned in the Fallacies and Pitfalls of Chapter 1:
Design Principle 3: Make the common case fast.
Constant operands occur frequently, and by including constants inside arithmetic
instructions, operations are much faster and use less energy than if constants were
loaded from memory.
Check Yourself
Given the importance of registers, what is the rate of increase in the number of
registers in a chip over time?
1. Very fast: They increase as fast as Moore’s law, which predicts doubling the
number of transistors on a chip every 18 months.
2. Very slow: Since programs are usually distributed in the language of the
computer, there is inertia in instruction set architecture, and so the number
of registers increases only as fast as new instruction sets become viable.
Elaboration: The ARM offset plus base register addressing is an excellent match
to structures as well as arrays, since the register can point to the beginning of the
structure and the offset can select the desired element. We’ll see such an example in
Section 2.13.
The register in the data transfer instructions was originally invented to hold an index
of an array with the offset used for the starting address of an array. Thus, the base
register is also called the index register. Today’s memories are much larger and the software model of data allocation is more sophisticated, so the base address of the array is
normally passed in a register since it won’t fit in the offset, as we shall see.
2.4 Signed and Unsigned Numbers
First, let’s quickly review how a computer represents numbers. Humans are taught
to think in base 10, but numbers may be represented in any base. For example, 123
base 10 = 1111011 base 2.
Numbers are kept in computer hardware as a series of high and low electronic
signals, and so they are considered base 2 numbers. (Just as base 10 numbers are
called decimal numbers, base 2 numbers are called binary numbers.)
A single digit of a binary number is thus the “atom” of computing, since all
information is composed of binary digits or bits. This fundamental building block
can be one of two values, which can be thought of as several alternatives: high or
low, on or off, true or false, or 1 or 0.
binary digit Also called binary bit. One of the two numbers in base 2, 0 or 1, that are the components of information.

Generalizing the point, in any number base, the value of the ith digit d is

d × Base^i

where i starts at 0 and increases from right to left. This leads to an obvious way to
number the bits in the word: simply use the power of the base for that bit. We subscript decimal numbers with ten and binary numbers with two. For example,
1011two represents

(1 × 2^3) + (0 × 2^2) + (1 × 2^1) + (1 × 2^0)ten
= (1 × 8) + (0 × 4) + (1 × 2) + (1 × 1)ten
= 8 + 0 + 2 + 1ten
= 11ten
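The positional rule is easy to express in C. The sketch below is our own helper, not part of the book; it accumulates value = value × base + digit from left to right, which is equivalent to summing d × Base^i.

#include <stdio.h>

/* Value of a string of binary digits, most significant digit first:
   repeatedly apply value = value * base + digit.                    */
unsigned int binary_value(const char *digits) {
    unsigned int value = 0;
    for (const char *p = digits; *p != '\0'; p++)
        value = value * 2 + (unsigned int)(*p - '0');
    return value;
}

int main(void) {
    printf("%u\n", binary_value("1011"));   /* prints 11 */
    return 0;
}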
We number the bits 0, 1, 2, 3, . . . from right to left in a word. The drawing below
shows the numbering of bits within an ARM word and the placement of the number
1011two:
Bit position: 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Contents:      0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 0 0 0 0 0 0 1 0 1 1
(32 bits wide)
Since words are drawn vertically as well as horizontally, leftmost and rightmost
may be unclear. Hence, the phrase least significant bit is used to refer to the rightmost bit (bit 0 above) and most significant bit to the leftmost bit (bit 31).
least significant bit The rightmost bit in an ARM word.

most significant bit The leftmost bit in an ARM word.

The ARM word is 32 bits long, so we can represent 2^32 different 32-bit patterns.
It is natural to let these combinations represent the numbers from 0 to 2^32 − 1
(4,294,967,295ten):

0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
0000 0000 0000 0000 0000 0000 0000 0001two = 1ten
0000 0000 0000 0000 0000 0000 0000 0010two = 2ten
. . .
1111 1111 1111 1111 1111 1111 1111 1101two = 4,294,967,293ten
1111 1111 1111 1111 1111 1111 1111 1110two = 4,294,967,294ten
1111 1111 1111 1111 1111 1111 1111 1111two = 4,294,967,295ten
That is, 32-bit binary numbers can be represented in terms of the bit value times a
power of 2 (here xi means the ith bit of x):

(x31 × 2^31) + (x30 × 2^30) + (x29 × 2^29) + . . . + (x1 × 2^1) + (x0 × 2^0)
Keep in mind that the binary bit patterns above are simply representatives of
numbers. Numbers really have an infinite number of digits, with almost all being
0 except for a few of the rightmost digits. We just don’t normally show leading 0s.
Hardware can be designed to add, subtract, multiply, and divide these binary
bit patterns. If the number that is the proper result of such operations cannot be
represented by these rightmost hardware bits, overflow is said to have occurred. It’s
up to the programming language, the operating system, and the program to determine what to do if overflow occurs.
Computer programs calculate both positive and negative numbers, so we need a
representation that distinguishes the positive from the negative. The most obvious
solution is to add a separate sign, which conveniently can be represented in a single
bit; the name for this representation is sign and magnitude.
Alas, sign and magnitude representation has several shortcomings. First, it’s not
obvious where to put the sign bit. To the right? To the left? Early computers tried
both. Second, adders for sign and magnitude may need an extra step to set the sign
because we can’t know in advance what the proper sign will be. Finally, a separate
sign bit means that sign and magnitude has both a positive and a negative zero,
which can lead to problems for inattentive programmers. As a result of these shortcomings, sign and magnitude representation was soon abandoned.
In the search for a more attractive alternative, the question arose as to what
would be the result for unsigned numbers if we tried to subtract a large number
from a small one. The answer is that it would try to borrow from a string of leading
0s, so the result would have a string of leading 1s.
Given that there was no obvious better alternative, the final solution was to pick
the representation that made the hardware simple: leading 0s mean positive, and
leading 1s mean negative. This convention for representing signed binary numbers
is called two’s complement representation:
0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
0000 0000 0000 0000 0000 0000 0000 0001two = 1ten
0000 0000 0000 0000 0000 0000 0000 0010two = 2ten
...
...
0111 1111 1111 1111 1111 1111 1111 1101two = 2,147,483,645ten
0111 1111 1111 1111 1111 1111 1111 1110two = 2,147,483,646ten
0111 1111 1111 1111 1111 1111 1111 1111two = 2,147,483,647ten
1000 0000 0000 0000 0000 0000 0000 0000two = –2,147,483,648ten
1000 0000 0000 0000 0000 0000 0000 0001two = –2,147,483,647ten
1000 0000 0000 0000 0000 0000 0000 0010two = –2,147,483,646ten
. . .
. . .
1111 1111 1111 1111 1111 1111 1111 1101two = –3ten
1111 1111 1111 1111 1111 1111 1111 1110two = –2ten
1111 1111 1111 1111 1111 1111 1111 1111two = –1ten
The positive half of the numbers, from 0 to 2,147,483,647ten (2^31 − 1), use the
same representation as before. The following bit pattern (1000 . . . 0000two) represents the most negative number −2,147,483,648ten (−2^31). It is followed by a
declining set of negative numbers: −2,147,483,647ten (1000 . . . 0001two) down to
−1ten (1111 . . . 1111two).
Two’s complement does have one negative number, −2,147,483,648ten, that has
no corresponding positive number. Such imbalance was also a worry to the inattentive programmer, but sign and magnitude had problems for both the programmer and the hardware designer. Consequently, every computer today uses two’s
complement binary representations for signed numbers.
Two’s complement representation has the advantage that all negative numbers
have a 1 in the most significant bit. Consequently, hardware needs to test only this
bit to see if a number is positive or negative (with the number 0 considered positive). This bit is often called the sign bit. By recognizing the role of the sign bit, we
can represent positive and negative 32-bit numbers in terms of the bit value times
a power of 2:
(x31 × −2^31) + (x30 × 2^30) + (x29 × 2^29) + . . . + (x1 × 2^1) + (x0 × 2^0)

The sign bit is multiplied by −2^31, and the rest of the bits are then multiplied by
positive versions of their respective base values.
Binary to Decimal Conversion

EXAMPLE
What is the decimal value of this 32-bit two's complement number?

1111 1111 1111 1111 1111 1111 1111 1100two

ANSWER
Substituting the number's bit values into the formula above:

(1 × −2^31) + (1 × 2^30) + (1 × 2^29) + . . . + (1 × 2^2) + (0 × 2^1) + (0 × 2^0)
= −2^31 + 2^30 + 2^29 + . . . + 2^2 + 0 + 0
= −2,147,483,648ten + 2,147,483,644ten
= −4ten
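The same point can be checked with a short C sketch of our own: the bit pattern is only a representative, and reinterpreting it as a two's complement integer yields −4. The cast from an unsigned to a signed 32-bit value relies on the two's complement representation that, as the text says, every computer today uses.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t pattern = 0xFFFFFFFC;                /* 1111 ... 1100two                     */
    printf("%lu\n", (unsigned long) pattern);     /* 4294967292 read as an unsigned value */
    printf("%ld\n", (long)(int32_t) pattern);     /* -4 read as two's complement          */
    return 0;
}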
We’ll see a shortcut to simplify conversion from negative to positive soon.
Just as an operation on unsigned numbers can overflow the capacity of hardware to represent the result, so can an operation on two’s complement numbers.
Overflow occurs when the leftmost retained bit of the binary bit pattern is not the
same as the infinite number of digits to the left (the sign bit is incorrect): a 0 on
the left of the bit pattern when the number is negative or a 1 when the number is
positive.
Hardware/Software Interface
Unlike the numbers discussed above, memory addresses naturally start at 0 and
continue to the largest address. Put another way, negative addresses make no sense.
Thus, programs want to deal sometimes with numbers that can be positive or negative
and sometimes with numbers that can be only positive. Some programming languages
reflect this distinction. C, for example, names the former integers (declared as int in
the program) and the latter unsigned integers (unsigned int). Some C style guides
even recommend declaring the former as signed int to keep the distinction clear.
Let's examine two useful shortcuts when working with two's complement
numbers. The first shortcut is a quick way to negate a two's complement binary
number. Simply invert every 0 to 1 and every 1 to 0, then add one to the result. This
shortcut is based on the observation that the sum of a number and its inverted representation must be 111 . . . 111two, which represents −1. Since x + x̄ = −1, it follows that
x + x̄ + 1 = 0, or x̄ + 1 = −x.
Negation Shortcut

EXAMPLE
Negate 2ten, and then check the result by negating −2ten.

ANSWER
2ten = 0000 0000 0000 0000 0000 0000 0000 0010two

Negating this number by inverting the bits and adding one,

  1111 1111 1111 1111 1111 1111 1111 1101two
+                                        1two
= 1111 1111 1111 1111 1111 1111 1111 1110two
= –2ten
Going the other direction,

1111 1111 1111 1111 1111 1111 1111 1110two

is first inverted and then incremented:

  0000 0000 0000 0000 0000 0000 0000 0001two
+                                        1two
= 0000 0000 0000 0000 0000 0000 0000 0010two
= 2ten
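In C, the shortcut is literally ~x + 1. The sketch below is our own check of the values used above.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int32_t x   = 2;
    int32_t neg = ~x + 1;          /* invert every bit, then add one */
    printf("%d\n", neg);           /* prints -2                      */
    printf("%d\n", ~neg + 1);      /* negating again gives 2 back    */
    return 0;
}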
Our next shortcut tells us how to convert a binary number represented in n bits
to a number represented with more than n bits. For example, the immediate field
in the load, store, branch, ADD, and set on less than instructions contains a two’s
complement 16-bit number, representing −32,768ten (−2^15) to 32,767ten (2^15 − 1).
To add the immediate field to a 32-bit register, the computer must convert that
16-bit number to its 32-bit equivalent. The shortcut is to take the most significant
bit from the smaller quantity—the sign bit—and replicate it to fill the new bits of
the larger quantity. The old bits are simply copied into the right portion of the new
word. This shortcut is commonly called sign extension.
Sign Extension Shortcut

EXAMPLE
Convert 16-bit binary versions of 2ten and −2ten to 32-bit binary numbers.

ANSWER
The 16-bit binary version of the number 2 is

0000 0000 0000 0010two = 2ten

It is converted to a 32-bit number by making 16 copies of the value in the most significant bit (0) and placing that in the left-hand half of the word. The right half gets the old value:

0000 0000 0000 0000 0000 0000 0000 0010two = 2ten

Let's negate the 16-bit version of 2 using the earlier shortcut. Thus,

0000 0000 0000 0010two

becomes

  1111 1111 1111 1101two
+                   1two
= 1111 1111 1111 1110two

Creating a 32-bit version of the negative number means copying the sign bit 16 times and placing it on the left:

1111 1111 1111 1111 1111 1111 1111 1110two = –2ten
This trick works because positive two’s complement numbers really have an
infinite number of 0s on the left and negative two’s complement numbers have an
infinite number of 1s. The binary bit pattern representing a number hides leading
bits to fit the width of the hardware; sign extension simply restores some of them.
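C performs this sign extension automatically when a narrower signed type is widened. The sketch below, a minimal illustration of our own rather than the book's code, shows both the implicit conversion and the replicate-the-sign-bit step done by hand.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int16_t small = -2;                      /* 1111 1111 1111 1110two in 16 bits */
    int32_t wide  = small;                   /* C sign-extends on this widening   */
    printf("%d\n", wide);                    /* prints -2                         */
    printf("0x%08x\n", (unsigned) wide);     /* prints 0xfffffffe                 */

    /* The same step by hand: replicate bit 15 into bits 31..16. */
    uint32_t raw    = 0xFFFEu;               /* the 16-bit pattern, zero-extended */
    uint32_t manual = (raw & 0x8000u) ? (raw | 0xFFFF0000u) : raw;
    printf("0x%08x\n", manual);              /* prints 0xfffffffe                 */
    return 0;
}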
Summary
The main point of this section is that we need to represent both positive and negative integers within a computer word, and although there are pros and cons to any
option, the overwhelming choice since 1965 has been two’s complement.
Check Yourself
What is the decimal value of this 64-bit two's complement number?

1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1000two

1) –4ten
2) –8ten
3) –16ten
4) 18,446,744,073,709,551,609ten
Elaboration: Two's complement gets its name from the rule that the unsigned sum
of an n-bit number and its negative is 2^n; hence, the complement or negation of a two's
complement number x is 2^n – x.
A third alternative representation to two’s complement and sign and magnitude is
called one’s complement. The negative of a one’s complement is found by inverting
each bit, from 0 to 1 and from 1 to 0, which helps explain its name since the complement of x is 2^n – x – 1. It was also an attempt to be a better solution than sign and
magnitude, and several early scientific computers did use the notation. This representation is similar to two’s complement except that it also has two 0s: 00 . . . 00two is
positive 0 and 11 . . . 11two is negative 0. The most negative number, 10 . . . 000two,
represents –2,147,483,647ten, and so the positives and negatives are balanced. One’s
complement adders did need an extra step to subtract a number, and hence two’s complement dominates today.
A final notation, which we will look at when we discuss floating point in Chapter 3 is
to represent the most negative value by 00 . . . 000two and the most positive value by
11 . . . 11two, with 0 typically having the value 10 . . . 00two. This is called a biased notation, since it biases the number such that the number plus the bias has a nonnegative
representation.
Elaboration: For signed decimal numbers, we used “–” to represent negative because
there are no limits to the size of a decimal number. Given a fixed word size, binary and
hexadecimal (see Figure 2.4) bit strings can encode the sign; hence we do not normally
use “+” or “–” with binary or hexadecimal notation.
one's complement A notation that represents the most negative value by 10 . . . 000two and the most positive value by 01 . . . 11two, leaving an equal number of negatives and positives but ending up with two zeros, one positive (00 . . . 00two) and one negative (11 . . . 11two). The term is also used to mean the inversion of every bit in a pattern: 0 to 1 and 1 to 0.

biased notation A notation that represents the most negative value by 00 . . . 000two and the most positive value by 11 . . . 11two, with 0 typically having the value 10 . . . 00two, thereby biasing the number such that the number plus the bias has a nonnegative representation.

2.5 Representing Instructions in the Computer

We are now ready to explain the difference between the way humans instruct computers and the way computers see instructions.

Instructions are kept in the computer as a series of high and low electronic signals and may be represented as numbers. In fact, each piece of an instruction can be considered as an individual number, and placing these numbers side by side forms the instruction.
Translating ARM Assembly Instructions into Machine Instructions

EXAMPLE
Let's do the next step in the refinement of the ARM language as an example. We'll show the real ARM language version of the instruction represented symbolically as

ADD r5,r1,r2

first as a combination of decimal numbers and then of binary numbers.
ANSWER
The decimal representation is

14 | 0 | 0 | 4 | 0 | 1 | 5 | 2

Each of these segments of an instruction is called a field. The fourth field (containing 4 in this case) tells the ARM computer that this instruction performs addition. The sixth field gives the number of the register that is the first source operand of the addition operation (1 = r1), and the last field gives the other source operand for the addition (2 = r2). The seventh field contains the number of the register that is to receive the sum (5 = r5). Thus, this instruction adds (field 4 = 4) register r1 (field 6 = 1) to register r2 (field 8 = 2), and places the sum in register r5 (field 7 = 5); we'll reveal the purpose of the remaining four fields later.

instruction format A form of representation of an instruction composed of fields of binary numbers.

machine language Binary representation used for communication within a computer system.

hexadecimal Numbers in base 16.

This instruction can also be represented as fields of binary numbers as opposed to decimal:

1110   | 00     | 0     | 0100   | 0     | 0001   | 0101   | 000000000010
4 bits | 2 bits | 1 bit | 4 bits | 1 bit | 4 bits | 4 bits | 12 bits
This layout of the instruction is called the instruction format. As you can see
from counting the number of bits, this ARM instruction takes exactly 32 bits—the
same size as a data word. In keeping with our design principle that simplicity favors
regularity, all ARM instructions are 32 bits long.
To distinguish it from assembly language, we call the numeric version of instructions machine language and a sequence of such instructions machine code.
It would appear that you would now be reading and writing long, tedious strings
of binary numbers. We avoid that tedium by using a higher base than binary that
converts easily into binary. Since almost all computer data sizes are multiples of 4,
hexadecimal (base 16) numbers are popular. Since base 16 is a power of 2, we can
trivially convert by replacing each group of four binary digits by a single hexadecimal digit, and vice versa. Figure 2.4 converts between hexadecimal and binary.
Hexadecimal | Binary  | Hexadecimal | Binary  | Hexadecimal | Binary  | Hexadecimal | Binary
0hex        | 0000two | 4hex        | 0100two | 8hex        | 1000two | chex        | 1100two
1hex        | 0001two | 5hex        | 0101two | 9hex        | 1001two | dhex        | 1101two
2hex        | 0010two | 6hex        | 0110two | ahex        | 1010two | ehex        | 1110two
3hex        | 0011two | 7hex        | 0111two | bhex        | 1011two | fhex        | 1111two

FIGURE 2.4 The hexadecimal-binary conversion table. Just replace one hexadecimal digit by the corresponding four binary digits, and vice versa. If the length of the binary number is not a multiple of 4, go from right to left.
Because we frequently deal with different number bases, to avoid confusion we
will subscript decimal numbers with ten, binary numbers with two, and hexadecimal numbers with hex. (If there is no subscript, the default is base 10.) By the way,
C and Java use the notation 0xnnnn for hexadecimal numbers.
Binary to Hexadecimal and Back

EXAMPLE
Convert the following hexadecimal and binary numbers into the other base:

eca8 6420hex
0001 0011 0101 0111 1001 1011 1101 1111two

ANSWER
Using Figure 2.4, the answer is just a table lookup one way:

eca8 6420hex = 1110 1100 1010 1000 0110 0100 0010 0000two

And then the other direction:

0001 0011 0101 0111 1001 1011 1101 1111two = 1357 9bdfhex
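If you would rather let C do the lookup, the sketch below (our own illustration) uses the 0x notation mentioned in the text and the standard library function strtoul with base 2 for the other direction.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    unsigned long x = 0xeca86420;                    /* C hexadecimal constant (0x prefix) */
    printf("%lx\n", x);                              /* prints eca86420                    */

    /* The other direction: parse a string of binary digits (base 2). */
    unsigned long y = strtoul("00010011010101111001101111011111", NULL, 2);
    printf("%lx\n", y);                              /* prints 13579bdf                    */
    return 0;
}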
ARM Fields

ARM fields are given names to make them easier to discuss:

Cond   | F      | I     | Opcode | S     | Rn     | Rd     | Operand2
4 bits | 2 bits | 1 bit | 4 bits | 1 bit | 4 bits | 4 bits | 12 bits

opcode The field that denotes the operation and format of an instruction.

Here is the meaning of each name of the fields in ARM instructions:

■ Opcode: Basic operation of the instruction, traditionally called the opcode.

■ Rd: The register destination operand. It gets the result of the operation.

■ Rn: The first register source operand.
■ Operand2: The second source operand.

■ I: Immediate. If I is 0, the second source operand is a register. If I is 1, the second source operand is a 12-bit immediate. (Section 2.10 goes into the details of ARM immediates, but for now we'll just assume it is a plain constant.)

■ S: Set Condition Code. Described in Section 2.7, this field is related to conditional branch instructions.

■ Cond: Condition. Described in Section 2.7, this field is related to conditional branch instructions.

■ F: Instruction Format. This field allows ARM to use different instruction formats when needed.
Let’s look at the add word instruction from page 83 that has a constant
operand:
ADD r3,r3,#4   ; r3 = r3 + 4

As you might expect, the constant 4 is placed in the Operand2 field and the I field is set to 1. We make a small change to the version from before:

14 | 0 | 1 | 4 | 0 | 3 | 3 | 4

The Opcode field is still 4, so this instruction performs addition. The Rn and Rd fields give the number of the register that is the first source operand (3) and the register to receive the sum (3). Thus, this instruction adds 4 to r3 and places the sum in r3.
Now let's try the load word instruction from page 82:

LDR r5,[r3, #32]   ; Temporary reg r5 gets A[8]

Loads and stores use a different instruction format from above, with just 6 fields:

Cond   | F      | Opcode | Rn     | Rd     | Offset12
4 bits | 2 bits | 6 bits | 4 bits | 4 bits | 12 bits

To tell ARM that the format is different, the F field now has 1, meaning that this is a data transfer instruction format. The Opcode field has 24, showing that this instruction does load word. The rest of the fields are straightforward: the Rn field has 3 for the base register, the Offset12 field has 32 as the offset to add to the base register, and the Rd field has 5 for the destination register, which receives the result of the load:

14     | 1      | 24     | 3      | 5      | 32
4 bits | 2 bits | 6 bits | 4 bits | 4 bits | 12 bits
Let’s call the first option the data processing (DP) instruction format and the
second the data transfer (DT) instruction format.
Although multiple formats complicate the hardware, we can reduce the complexity by keeping the formats similar. For example, the first two fields and the last
three fields of the two formats are the same size and four of them have the same
names; the length of the Opcode field in DT format is equal to the sum of the
lengths of three fields of the DP format.
Figure 2.5 shows the numbers used in each field for the ARM instructions
covered here.
Instruction      | Format | Cond | F | I    | Op    | S    | Rn  | Rd  | Operand2
ADD              | DP     | 14   | 0 | 0    | 4ten  | 0    | reg | reg | reg
SUB (subtract)   | DP     | 14   | 0 | 0    | 2ten  | 0    | reg | reg | reg
ADD (immediate)  | DP     | 14   | 0 | 1    | 4ten  | 0    | reg | reg | constant
LDR (load word)  | DT     | 14   | 1 | n.a. | 24ten | n.a. | reg | reg | address
STR (store word) | DT     | 14   | 1 | n.a. | 25ten | n.a. | reg | reg | address

FIGURE 2.5 ARM instruction encoding. In the table above, "reg" means a register number between 0 and 15, "constant" means a 12-bit constant, and "address" means a 12-bit address. "n.a." (not applicable) means this field does not appear in this format, and Op stands for opcode.
Translating ARM Assembly Language into Machine Language

EXAMPLE
We can now take an example all the way from what the programmer writes to what the computer executes. If r3 has the base of the array A and r2 corresponds to h, the assignment statement

A[30] = h + A[30];

is compiled into

LDR r5,[r3,#120]   ; Temporary reg r5 gets A[30]
ADD r5,r2,r5       ; Temporary reg r5 gets h + A[30]
STR r5,[r3,#120]   ; Stores h + A[30] back into A[30]

What is the ARM machine language code for these three instructions?
ANSWER
For convenience, let's first represent the machine language instructions using decimal numbers. From Figure 2.5, we can determine the three machine language instructions:

Cond | F | I    | Opcode | S    | Rn | Rd | Operand2/Offset12
14   | 1 | n.a. | 24     | n.a. | 3  | 5  | 120                  (LDR)
14   | 0 | 0    | 4      | 0    | 2  | 5  | 5                    (ADD)
14   | 1 | n.a. | 25     | n.a. | 3  | 5  | 120                  (STR)

The LDR instruction is identified by 24 (see Figure 2.5) in the Opcode field. The base register 3 is specified in the Rn field, and the destination register 5 is specified in the Rd field. The offset to select A[30] (120 = 30 × 4) is found in the final field (Offset12).

The ADD instruction that follows is specified with 4 in the Opcode field. The three register operands (2, 5, and 5) are found in the Rn, Rd, and Operand2 fields.

The STR instruction is identified with 25 in the Opcode field. The rest of this final instruction is identical to the LDR instruction.

Since 120ten = 0000 0111 1000two, the binary equivalent to the decimal form is:

Cond | F  | I    | Opcode | S    | Rn   | Rd   | Operand2/Offset12
1110 | 01 | n.a. | 011000 | n.a. | 0011 | 0101 | 0000 0111 1000       (LDR)
1110 | 00 | 0    | 0100   | 0    | 0010 | 0101 | 0000 0000 0101       (ADD)
1110 | 01 | n.a. | 011001 | n.a. | 0011 | 0101 | 0000 0111 1000       (STR)
Note the similarity of the binary representations of the first and last instructions. The only difference is in the last bit of the opcode.
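As an illustration of how the DP-format fields could be packed into one 32-bit word, here is a hedged C sketch of our own. The bit positions are inferred from the field widths given in the text (4, 2, 1, 4, 1, 4, 4, and 12 bits from the most significant bit down), so treat them as an assumption rather than the official encoding.

#include <stdio.h>
#include <stdint.h>

/* Pack DP-format fields into a 32-bit word. Bit positions are assumed from
   the field widths (4, 2, 1, 4, 1, 4, 4, 12), most significant field first. */
uint32_t dp_encode(uint32_t cond, uint32_t f, uint32_t i, uint32_t opcode,
                   uint32_t s, uint32_t rn, uint32_t rd, uint32_t operand2) {
    return (cond << 28) | (f << 26) | (i << 25) | (opcode << 21) |
           (s << 20)    | (rn << 16) | (rd << 12) | (operand2 & 0xFFF);
}

int main(void) {
    /* ADD r5,r2,r5 from the example: Cond=14, F=0, I=0, Opcode=4, S=0,
       Rn=2, Rd=5, Operand2=5. */
    printf("0x%08x\n", dp_encode(14, 0, 0, 4, 0, 2, 5, 5));   /* 0xe0825005 */
    return 0;
}

The printed value, 0xe0825005, is the hexadecimal form of the binary row for the ADD instruction above.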
Figure 2.6 summarizes the portions of ARM machine language described in this
section. As we shall see in Chapter 4, the similarity of the binary representations
of related instructions simplifies hardware design. These similarities are another
example of regularity in the ARM architecture.
ARM machine language

Name       | Format | Example fields                                                | Comments
ADD        | DP     | 14, 0, 0, 4, 0, 2, 1, 3                                       | ADD r1,r2,r3
SUB        | DP     | 14, 0, 0, 2, 0, 2, 1, 3                                       | SUB r1,r2,r3
LDR        | DT     | 14, 1, 24, 2, 1, 100                                          | LDR r1,[r2,#100]
STR        | DT     | 14, 1, 25, 2, 1, 100                                          | STR r1,[r2,#100]
Field size |        | 4 bits, 2 bits, 1 bit, 4 bits, 1 bit, 4 bits, 4 bits, 12 bits | All ARM instructions are 32 bits long
DP format  | DP     | Cond, F, I, Opcode, S, Rn, Rd, Operand2                       | Arithmetic instruction format
DT format  | DT     | Cond, F, Opcode, Rn, Rd, Offset12                             | Data transfer format

FIGURE 2.6 ARM architecture revealed through Section 2.5. The two ARM instruction formats so far are DP and DT. The last 16 bits have the same sized fields: both contain an Rn field, giving one of the sources; and an Rd field, specifying the destination register.
The BIG Picture
Today's computers are built on two key principles:

1. Instructions are represented as numbers.
2. Programs are stored in memory to be read or written, just like numbers.
These principles lead to the stored-program concept; its invention let the
computing genie out of its bottle. Figure 2.7 shows the power of the concept;
specifically, memory can contain the source code for an editor program, the
corresponding compiled machine code, the text that the compiled program is
using, and even the compiler that generated the machine code.
One consequence of instructions as numbers is that programs are often
shipped as files of binary numbers. The commercial implication is that
computers can inherit ready-made software provided they are compatible
with an existing instruction set. Such “binary compatibility” often leads
industry to align around a small number of instruction set architectures.
Check Yourself
What ARM instruction does this represent? Choose from one of the four options below.

Cond | F | I | Opcode | S | Rn | Rd | Operand2
14   | 0 | 0 | 4      | 0 | 0  | 1  | 2

1. ADD r0, r1, r2
2. ADD r1, r0, r2
3. ADD r2, r1, r0
4. SUB r2, r0, r1
[Figure: a processor connected to memory. The memory holds an accounting program (machine code), an editor program (machine code), a C compiler (machine code), payroll data, book text, and the source code in C for the editor program.]

FIGURE 2.7 The stored-program concept. Stored programs allow a computer that performs accounting to become, in the blink of an eye, a computer that helps an author write a book. The switch happens simply by loading memory with programs and data and then telling the computer to begin executing at a given location in memory. Treating instructions in the same way as data greatly simplifies both the memory hardware and the software of computer systems. Specifically, the memory technology needed for data can also be used for programs, and programs like compilers, for instance, can translate code written in a notation far more convenient for humans into code that the computer can understand.
"Contrariwise," continued Tweedledee, "if it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic."
Lewis Carroll, Alice's Adventures in Wonderland, 1865

2.6 Logical Operations
Although the first computers operated on full words, it soon became clear that
it was useful to operate on fields of bits within a word or even on individual bits.
Examining characters within a word, each of which is stored as 8 bits, is one
example of such an operation (see Section 2.9). It follows that operations were
added to programming languages and instruction set architectures to simplify,
among other things, the packing and unpacking of bits into words. These instructions are called logical operations. Figure 2.8 shows logical operations in C, Java,
and ARM.
Logical operations | C operators | Java operators | ARM instructions
Bit-by-bit AND     | &           | &              | AND
Bit-by-bit OR      | |           | |              | ORR
Bit-by-bit NOT     | ~           | ~              | MVN
Shift left         | <<          | <<             | LSL
Shift right        | >>          | >>>            | LSR

FIGURE 2.8 C and Java logical operators and their corresponding ARM instructions. ARM implements bit-by-bit NOT with the MVN (move not) instruction.
AND A logical bit-by-bit operation with two operands that calculates a 1 only if there is a 1 in both operands.

The first class of such operations is AND, a useful operation that isolates fields. (We capitalize the word to avoid confusion between the operation and the English conjunction.) AND is a bit-by-bit operation that leaves a 1 in the result only if both bits of the operands are 1. For example, if register r2 contains

0000 0000 0000 0000 0000 1101 1100 0000two
and register r1 contains
0000 0000 0000 0000 0011 1100 0000 0000two
then, after executing the ARM instruction
AND r5,r1,r2   ; reg r5 = reg r1 & reg r2
the value of register r5 would be
0000 0000 0000 0000 0000 1100 0000 0000two
As you can see, AND can apply a bit pattern to a set of bits to force 0s where there
is a 0 in the bit pattern. Such a bit pattern in conjunction with AND is traditionally
called a mask, since the mask “conceals” some bits.
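In C, applying a mask is just the & operator. The sketch below is our own; it uses the hexadecimal equivalents of the register values above, plus one typical use of a mask to keep only the low byte of a word.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t r1 = 0x00003C00;                    /* 0000 ... 0011 1100 0000 0000two        */
    uint32_t r2 = 0x00000DC0;                    /* 0000 ... 0000 1101 1100 0000two        */
    printf("0x%08x\n", (unsigned)(r1 & r2));     /* 0x00000c00, the AND result shown above */

    /* Typical use of a mask: keep only the low byte of a word. */
    uint32_t word = 0x12345678;
    printf("0x%02x\n", (unsigned)(word & 0xFF)); /* 0x78 */
    return 0;
}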
OR A logical bit-by-bit operation with two operands that calculates a 1 if there is a 1 in either operand.

To place a value into one of these seas of 0s, there is the dual to AND, called OR. It is a bit-by-bit operation that places a 1 in the result if either operand bit is a 1. To elaborate, if the registers r1 and r2 are unchanged from the preceding example, the result of the ARM instruction

ORR r5,r1,r2   ; reg r5 = reg r1 | reg r2

is this value in register r5:

0000 0000 0000 0000 0011 1101 1100 0000two
NOT A logical bit-by-bit operation with one operand that inverts the bits; that is, it replaces every 1 with a 0, and every 0 with a 1.

The third logical operation is a contrarian. NOT takes one operand and places a 1 in the result if the operand bit is a 0, and vice versa. If register r1 is unchanged from the preceding example, the result of the ARM Move Not instruction (MVN)
MVN r5,r1    ; reg r5 = ~reg r1
is this value in register r5:
1111 1111 1111 1111 1100 0011 1111 1111two
Although not a logical instruction, a useful instruction related to MVN that
we’ve not listed so far simply copies one register to another without changing it.
The Move instruction (MOV) does just what you’d think:
MOV r6,r5    ; reg r6 = reg r5
Register r6 now has the value of the contents of r5.
Another class of such operations is called shifts. They move all the bits in a
word to the left or right, filling the emptied bits with 0s. For example, if register r0
contained
0000 0000 0000 0000 0000 0000 0000 1001two = 9ten
and the instruction to shift left by 4 was executed, the new value would be:
0000 0000 0000 0000 0000 0000 1001 0000two = 144ten

Shift left logical provides a bonus benefit. Shifting left by i bits gives the same result as multiplying by 2^i, just as shifting a decimal number by i digits is equivalent to multiplying by 10^i. For example, the above LSL shifts by 4, which gives the same result as multiplying by 2^4 or 16. The first bit pattern above represents 9, and
9 × 16 = 144, the value of the second bit pattern.
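The multiply-by-a-power-of-two effect is easy to check in C. This is a minimal sketch of our own, mirroring the LSL-by-4 example above.

#include <assert.h>

int main(void) {
    unsigned int x = 9;
    assert((x << 4) == 144);      /* shifting left 4 bits multiplies by 2^4 = 16 */
    assert((x << 4) == x * 16);   /* same result as an explicit multiply         */
    return 0;
}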
The dual of a shift left is a shift right. The actual names of the two ARM shift operations are logical shift left (LSL) and logical shift right (LSR).
Although ARM has shift operations, they are not separate instructions. How can
this be? Unlike any other microprocessor instruction set, ARM offers the ability to
shift the second operand as part of any data processing instruction!
For example, this variation of the add instruction adds register r1 to register r2
shifted left by 2 bits and puts the result in register r5:
ADD r5,r1,r2, LSL #2    ; r5 = r1 + (r2 << 2)
In case you were wondering, ARM hardware is designed so that the adds with
shifts are no slower than regular adds.
If you just wanted to shift register r5 right by 4 bits and place the result in r6,
you could do that with a move instruction:
MOV r6,r5, LSR #4    ; r6 = r5 >> 4
Although usually programmers want to shift by a constant, ARM allows shifting
by the value found in a register. The following instruction shifts register r5 right by
the amount in register r3 and places the result in r6.
MOV r6,r5, LSR r3    ; r6 = r5 >> r3
Figure 2.8.5 shows that these shift operations are encoded in the 12-bit Operand2 field of the Data Processing instruction format. If the Shift field is 0, the operation is logical shift left, and if it is 1, the operation is logical shift right. Here are the machine language versions of the three instructions above with shift operations:

Cond   F   I   Opcode   S   Rn   Rd   Shift_imm   Shift   Rm
 14    0   0      4     0    1    5        2        0      2     (ADD r5,r1,r2, LSL #2)
 14    0   0     13     0    0    6        4        1      5     (MOV r6,r5, LSR #4)

Cond   F   I   Opcode   S   Rn   Rd   Rs   0   Shift   1   Rm
 14    0   0     13     0    0    6    3   0     1     1    5    (MOV r6,r5, LSR r3)

Note that if these new fields are 0, the operand is not shifted, so the encodings shown in the examples in prior sections work properly.

Figure 2.8 above shows the relationship between the C and Java operators and the ARM instructions.

                    11 . . . 7    6  5    4    3 . . . 0
Immediate shift:    shift_imm     Shift   0    Rm

                    11 . . . 8    7    6  5    4    3 . . . 0
Register shift:     Rs            0    Shift   1    Rm

FIGURE 2.8.5 Encoding of shift operations inside the Operand2 field of the Data Processing instruction format.
Elaboration: The full ARM instruction set also includes exclusive or (EOR), which sets
the bit to 1 when two corresponding bits differ, and to 0 when they are the same. It also
has Bit Clear (BIC), which sets to 0 any bit that is a 1 in the second operand. C allows
bit fields or fields to be defined within words, both allowing objects to be packed within
a word and to match an externally enforced interface such as an I/O device. All fields
must fit within a single word. Fields are unsigned integers that can be as short as
1 bit. C compilers insert and extract fields using logical operations in ARM: AND,ORR,
LSL, and LSR.
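As a sketch of what a C compiler does with such fields, the code below is our own illustration with made-up field names: it packs two small unsigned fields into one word, once with a C bit field and once by hand with the masks and shifts the elaboration mentions.

#include <assert.h>

/* A hypothetical packed word: an 8-bit "opcode" field in bits 7:0
   and a 4-bit "cond" field in bits 11:8.                          */
struct packed {
    unsigned int opcode : 8;
    unsigned int cond   : 4;
};

/* The same insertion done by hand with masks and shifts, which is
   roughly what a compiler emits using AND, ORR, LSL, and LSR.     */
static unsigned int insert_fields(unsigned int opcode, unsigned int cond) {
    return (opcode & 0xFF) | ((cond & 0xF) << 8);
}

static unsigned int extract_cond(unsigned int word) {
    return (word >> 8) & 0xF;     /* shift the field down, then mask it off */
}

int main(void) {
    struct packed p = { 13, 14 };               /* opcode = 13, cond = 14 */
    unsigned int w = insert_fields(13, 14);
    assert(extract_cond(w) == 14);
    assert(p.opcode == 13 && p.cond == 14);
    return 0;
}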
Elaboration: ARM has more shift operations than LSL and LSR. Arithmetic shift right
(ASR) replicates the sign bit during a shift; we’ll see what they were hoping to do with
ASR (but failed) in the fallacy in Section 3.8 of the next chapter. Instead of discarding
the bits on a right shift, Rotate Right (ROR) brings them back into the vacated upper bits. As the name suggests, the bit pattern can be thought of as a ring rotating within the register, so the bits are never lost.
Check Yourself

Which operations can isolate a field in a word?
1. AND
2. A shift left followed by a shift right
The utility of an automatic computer lies in the possibility of using a given sequence of instructions repeatedly, the number of times it is iterated being dependent upon the results of the computation. . . . This choice can be made to depend upon the sign of a number (zero being reckoned as plus for machine purposes). Consequently, we introduce an [instruction] (the conditional transfer [instruction]) which will, depending on the sign of a given number, cause the proper one of two routines to be executed.

Burks, Goldstine, and von Neumann, 1947
2.7 Instructions for Making Decisions
What distinguishes a computer from a simple calculator is its ability to make
decisions. Based on the input data and the values created during computation, different instructions execute. Decision making is commonly represented in programming languages using the if statement, sometimes combined with go to statements
and labels. ARM assembly language includes many decision-making instructions,
similar to an if statement with a go to. Let’s start with an instruction that compares
two values, followed by an instruction that branches to L1 if the registers are equal:
CMP register1, register2
BEQ L1
This pair of instructions means go to the statement labelled L1 if the value in
register1 equals the value in register2. The mnemonic CMP stands for compare and BEQ stands for branch if equal. Another example is the instruction pair
CMP register1, register2
BNE L1
[Figure 2.9 flowchart: test i == j; if equal (then part), f = g + h; if not equal (Else), f = g – h; both paths continue at Exit.]

FIGURE 2.9 Illustration of the options in the if statement above. The left box corresponds to the then part of the if statement, and the right box corresponds to the else part.
This pair goes to the statement labelled L1 if the value in register1 does not equal the value in register2. The mnemonic BNE stands for branch if not equal. BEQ and BNE are traditionally called conditional branches.

conditional branch An instruction that requires the comparison of two values and that allows for a subsequent transfer of control to a new address in the program based on the outcome of the comparison.
Compiling if-then-else into Conditional Branches

EXAMPLE

In the following code segment, f, g, h, i, and j are variables. If the five variables f through j correspond to the five registers r0 through r4, what is the compiled ARM code for this C if statement?

if (i == j) f = g + h; else f = g – h;
ANSWER

Figure 2.9 is a flowchart of what the ARM code should do. The first expression compares for equality, so it would seem that we would want the branch if registers are equal instruction (BEQ). In general, the code will be more efficient if we test for the opposite condition to branch over the code that performs the subsequent then part of the if (the label Else is defined below), and so we use the branch if registers are not equal instruction (BNE):

CMP r3,r4
BNE Else        ; go to Else if i ≠ j

The next assignment statement performs a single operation, and if all the operands are allocated to registers, it is just one instruction:

ADD r0,r1,r2    ; f = g + h (skipped if i ≠ j)
We now need to go to the end of the if statement. This example introduces
another kind of branch, often called an unconditional branch. This instruction
says that the processor always follows the branch (the label Exit is defined
below).
B Exit          ; go to Exit
The assignment statement in the else portion of the if statement can again
be compiled into a single instruction. We just need to append the label Else
to this instruction. We also show the label Exit that is after this instruction,
showing the end of the if-then-else compiled code:
Else: SUB r0,r1,r2    ; f = g – h (skipped if i = j)
Exit:
Notice that the assembler relieves the compiler and the assembly language programmer from the tedium of calculating addresses for branches, just as it does for
calculating data addresses for loads and stores (see Section 2.12).
Hardware/Software Interface
Compilers frequently create branches and labels where they do not appear in
the programming language. Avoiding the burden of writing explicit labels and
branches is one benefit of writing in high-level programming languages and is a
reason coding is faster at that level.
Loops
Decisions are important both for choosing between two alternatives—found in if
statements—and for iterating a computation—found in loops. The same assembly
instructions are the building blocks for both cases.
Compiling a while Loop in C
EXAMPLE
Here is a traditional loop in C:
while (save[i] == k)
i += 1;
Assume that i and k correspond to registers r3 and r5 and the base of the
array save is in r6. What is the ARM assembly code corresponding to this
C segment?
ANSWER

The first step is to load save[i] into a temporary register. Before we can load save[i] into a temporary register, we need to have its address. Before we can add i to the base of array save to form the address, we must multiply the index i by 4 due to the byte addressing problem. Fortunately, we can use the logical shift left operation, since shifting left by 2 bits multiplies by 2^2 or 4 (see page 101 in the prior section). We need to add the label Loop to it so that we can branch back to that instruction at the end of the loop:

Loop: ADD r12,r6,r3, LSL #2    ; r12 = address of save[i]

Now we can use that address to load save[i] into a temporary register:

LDR r0,[r12,#0]    ; Temp reg r0 = save[i]
The next instruction pair performs the loop test, exiting if save[i] ≠ k:
CMP r0,r5
BNE Exit          ; go to Exit if save[i] ≠ k

The next instruction adds 1 to i:

ADD r3,r3,#1      ; i = i + 1

The end of the loop branches back to the while test at the top of the loop. We just add the Exit label after it, and we're done:

B Loop            ; go to Loop
Exit:
(See the exercises for an optimization of this sequence.)
Hardware/Software Interface

Such sequences of instructions that end in a branch are so fundamental to compiling that they are given their own buzzword: a basic block is a sequence of instructions without branches, except possibly at the end, and without branch targets or branch labels, except possibly at the beginning. One of the first phases of compilation is breaking the program into basic blocks.

basic block A sequence of instructions without branches (except possibly at the end) and without branch targets or branch labels (except possibly at the beginning).

The test for equality or inequality is probably the most popular test, but sometimes it is useful to see if a variable is less than another variable. For example, a for loop may want to test to see if the index variable is less than 0. Thus, ARM has a large set of conditional branches. For example, LT, LE, GT, and GE branch if the
result of the compare is less than, less than or equal, greater than, or greater than
or equal, respectively.
Conditional branches are actually based on condition flags or condition codes
that can be set by the CMP instruction. Branches then test the condition flags.
Condition flags form a special purpose register whose values can be tested by a conditional branch instruction at any time after the compare instruction, not just by
the following instruction.
Moreover, the condition flags can be set by many instructions other than
compare; in such a case, the result of the operation is compared to 0. The S bit in
Figure 2.5 is used to set the condition codes as part of data processing instructions.
The programmer specifies setting the condition flags by appending an S to the
instruction name. Thus, SUBS sets the condition flags with the result of the subtraction, while SUB leaves the condition codes unchanged.
Hardware/Software Interface
Comparison instructions must deal with the dichotomy between signed and
unsigned numbers. Sometimes a bit pattern with a 1 in the most significant bit
represents a negative number and, of course, is less than any positive number,
which must have a 0 in the most significant bit. With unsigned integers, on the
other hand, a 1 in the most significant bit represents a number that is larger than
any that begins with a 0. (We’ll soon take advantage of this dual meaning of the
most significant bit to reduce the cost of the array bounds checking.)
ARM offers more versions of the conditional branch to handle these alternatives: less than unsigned is called LO for lower; less than or equal unsigned is called LS for lower or same; greater than unsigned is called HI for higher; and greater than or equal unsigned is called HS for higher or same.
Signed versus Unsigned Comparison
EXAMPLE
Suppose register r0 has the binary number
1111 1111 1111 1111 1111 1111 1111 1111two
and that register r1 has the binary number
0000 0000 0000 0000 0000 0000 0000 0001two
and the following instruction is executed.

CMP r0, r1

Which conditional branch is taken?

BLO L1    ; unsigned branch
BLT L2    ; signed branch
ANSWER

The value in register r0 represents −1ten if it is an integer and 4,294,967,295ten if it is an unsigned integer. The value in register r1 represents 1ten in either case. The branch on lower unsigned instruction (BLO) is not taken to L1, since 4,294,967,295ten > 1ten. However, the branch on less than instruction (BLT) is taken to L2, since −1ten < 1ten.
Treating signed numbers as if they were unsigned gives us a low cost way of
checking if 0 ≤ x < y, which matches the index out-of-bounds check for arrays.
The key is that negative integers in two’s complement notation look like large
numbers in unsigned notation; that is, the most significant bit is a sign bit in the
former notation but a large part of the number in the latter. Thus, an unsigned
comparison of x < y also checks if x is negative as well as if x is less than y.
Bounds Check Shortcut

EXAMPLE

Use this shortcut to reduce an index-out-of-bounds check: branch to IndexOutOfBounds if r1 ≥ r2 or if r1 is negative.

ANSWER

The checking code just uses BHS to do both checks:

CMP r1,r2
BHS IndexOutOfBounds    ; if r1 >= r2 or r1 < 0, go to Error
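The same trick shows up in C source when the index and the size are compared as unsigned values. The sketch below is our own (the names index_ok and size are hypothetical); it performs both checks with one comparison, just as the BHS version does.

#include <stdio.h>

/* Returns 1 if i is a legal index into an array of 'size' elements.
   Casting to unsigned makes a negative i look like a huge number,
   so one comparison covers both i < 0 and i >= size.               */
static int index_ok(int i, int size) {
    return (unsigned int)i < (unsigned int)size;
}

int main(void) {
    printf("%d %d %d\n", index_ok(3, 10), index_ok(-1, 10), index_ok(10, 10));
    /* prints: 1 0 0 */
    return 0;
}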
Case/Switch Statement
Most programming languages have a case or switch statement that allows the programmer to select one of many alternatives depending on a single value. The simplest way to implement switch is via a sequence of conditional tests, turning the
switch statement into a chain of if-then-else statements.
jump address table Also called jump table. A table of addresses of alternative instruction sequences.

program counter (PC) The register containing the address of the instruction in the program being executed.
Sometimes the alternatives may be more efficiently encoded as a table of
addresses of alternative instruction sequences, called a jump address table or
jump table, and the program needs only to index into the table and then jump to
the appropriate sequence. The jump table is then just an array of words containing
addresses that correspond to labels in the code. The program needs to jump using
the address in the appropriate entry from the jump table.
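In C, a jump table corresponds to an array indexed by the selector value whose entries hold the addresses of the alternatives. The sketch below is our own illustration using an array of function pointers; the function names are made up.

#include <stdio.h>

static void case0(void) { printf("case 0\n"); }
static void case1(void) { printf("case 1\n"); }
static void case2(void) { printf("case 2\n"); }

/* The jump table: an array of addresses, one per alternative. */
static void (*const jump_table[])(void) = { case0, case1, case2 };

int main(void) {
    int k = 1;
    if ((unsigned int)k < 3)      /* bounds check before indexing the table */
        jump_table[k]();          /* index into the table, then jump        */
    return 0;
}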
ARM has a surprisingly easy way to handle such situations. Implicit in the
stored-program idea is the need to have a register to hold the address of the current instruction being executed. For historical reasons, this register is almost always
called the program counter, abbreviated pc in the ARM architecture, although a
more sensible name would have been instruction address register. Register 15 is
actually the program counter in ARM, so a LDR instruction with the destination
register 15 means an unconditional branch to the address specified in memory. In
fact, any instruction with register 15 as the destination register is an unconditional
branch to the address at that value. In Section 2.8, we’ll see how this trick is useful
when returning from a procedure.
Hardware/Software Interface

Although there are many statements for decisions and loops in programming languages like C and Java, the bedrock statement that implements them at the instruction set level is the conditional branch.
Encoding Branch Instructions in ARM
A problem occurs when an instruction needs longer fields than those shown above
in Section 2.6. For example, the longest field in the instruction formats above is
just 12 bits, suggesting that the branch field might be just 12 bits. That might limit programs to just 2^12 or 4096 bytes!
Hence, we have a conflict between the desire to keep all instructions the same
length and the desire to have a single instruction format. This leads us to the final
hardware design principle:
Design Principle 4: Good design demands good compromises.
The compromise chosen by the ARM designers is to keep all instructions the
same length, thereby requiring different kinds of instruction formats for different
kinds of instructions. We saw the DP and DT formats above, which had many similarities. Branch has a third instruction type:
Cond      12        address
4 bits    4 bits    24 bits
The Cond field encodes the many versions of the conditional branch mentioned above. Figure 2.9.5 shows those encodings.

Value   Meaning                               Value   Meaning
0       EQ (EQual)                            8       HI (unsigned HIgher)
1       NE (Not Equal)                        9       LS (unsigned Lower or Same)
2       HS (unsigned Higher or Same)          10      GE (signed Greater than or Equal)
3       LO (unsigned LOwer)                   11      LT (signed Less Than)
4       MI (MInus, <0)                        12      GT (signed Greater Than)
5       PL (PLus, >=0)                        13      LE (signed Less Than or Equal)
6       VS (oVerflow Set, overflow)           14      AL (Always)
7       VC (oVerflow Clear, no overflow)      15      NV (reserved)

FIGURE 2.9.5 Encodings of options for the Cond field.
You might think that the 24-bit address would extend the program limit to 2^24
or 16 MB, which would be fine for many programs but constrain some large ones.
An alternative would be to specify a register that would always be added to the
branch address, so that a branch instruction would calculate the following:
Program counter = Register + Branch address
This sum allows the program to be as large as 2^32, solving the branch address size
problem. Then the question is, which register?
The answer comes from seeing how conditional branches are used. Conditional
branches are found in loops and in if statements, so they tend to branch to a
nearby instruction. For example, about half of all conditional branches in SPEC
benchmarks go to locations less than 16 instructions away. Since the program
counter (PC) contains the address of the current instruction, we can branch within
± 2^24 words of the current instruction if we use the PC as the register to be added to the address. All loops and if statements are much smaller than 2^24 words, so the
PC is the ideal choice.
This form of branch addressing is called PC-relative addressing. As we shall
see in Chapter 4, it is convenient for the hardware to increment the PC early.
Hence, the ARM address is actually relative to the address of the instruction two
after the branch (PC + 8) as opposed to the current instruction (PC).
Since all ARM instructions are 4 bytes long, ARM stretches the distance of the
branch by having PC-relative addressing refer to the number of words to the next
instruction instead of the number of bytes. Thus, the 24-bit field can branch four
times as far by interpreting the field as a relative word address rather than as a relative byte address.
PC-relative addressing
An addressing regime
in which the address is
the sum of the program
counter (PC) and a constant in the instruction.
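Putting the pieces together, the branch target can be written as a small calculation. The sketch below is our own, following the PC + 8 and word-offset rules just described; the function name and sample addresses are hypothetical.

#include <stdio.h>
#include <stdint.h>

/* Target of an ARM branch: the 24-bit field is a signed word offset,
   applied relative to the address of the branch plus 8.              */
static uint32_t branch_target(uint32_t branch_address, int32_t word_offset) {
    return branch_address + 8 + 4 * word_offset;
}

int main(void) {
    /* A branch at 0x00400010 with an offset of -6 words lands 16 bytes back. */
    printf("0x%08X\n", (unsigned int)branch_target(0x00400010u, -6));
    /* prints 0x00400000 */
    return 0;
}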
Conditional Execution
Another unusual feature of ARM is that most instructions can be conditionally
executed, not just branches. That is the purpose of the 4-bit Cond field found
03-Ch02-P374750.indd 111
7/3/09 12:18:24 PM
112
Chapter 2
Instructions: Language of the Computer
in most ARM instruction formats. The assembly language programmer simply
appends the desired condition to the instruction name, telling the computer to
perform the operation only if the condition is true based on the last time the condition flags were set. That is, ADDEQ performs the addition only if the condition
flags suggest the operands were equal for the operation that set the flags.
For example, the ARM code to perform the if statement in Figure 2.9 is reproduced below:
CMP r3,r4
BNE Else              ; go to Else if i ≠ j
ADD r0,r1,r2          ; f = g + h (skipped if i ≠ j)
B Exit                ; go to Exit
Else: SUB r0,r1,r2    ; f = g – h (skipped if i = j)
Exit:
The desired result can be achieved without any branches using conditional
execution:
CMP r3,r4
ADDEQ r0,r1,r2 ; f = g + h (skipped if i ≠ j)
SUBNE r0,r1,r2 ; f = g – h (skipped if i = j)
Figure 2.9.5 shows the encodings for conditional execution of all instructions,
not just branches. Note that 14 means always execute. Thus, Figures 2.5 and 2.6 in
Section 2.5 show the value 14 in the Cond field of the machine language instructions to indicate the version of instructions that are always executed.
Conditional execution provides a technique to execute instructions depending on a test without using conditional branch instructions. Chapter 4 shows that
branches can lower the performance of pipelined computers, so removing branches
can help performance even more than the reduction in instructions suggests.
Check Yourself

I. C has many statements for decisions and loops, while ARM has few. Which of
the following do or do not explain this imbalance? Why?
1. More decision statements make code easier to read and understand.
2. Fewer decision statements simplify the task of the underlying layer that is
responsible for execution.
3. More decision statements mean fewer lines of code, which generally reduces
coding time.
4. More decision statements mean fewer lines of code, which generally results
in the execution of fewer operations.
II. Why does C provide two sets of operators for AND (& and &&) and two sets of
operators for OR (| and ||), while ARM doesn’t?
1. Logical operations AND and OR implement & and |, while conditional
branches implement && and ||.
2. The previous statement has it backwards: && and || correspond to logical
operations, while & and | map to conditional branches.
3. They are redundant and mean the same thing: && and || are simply inherited
from the programming language B, the predecessor of C.
2.8 Supporting Procedures in Computer Hardware
A procedure or function is one tool programmers use to structure programs, both
to make them easier to understand and to allow code to be reused. Procedures
allow the programmer to concentrate on just one portion of the task at a time;
parameters act as an interface between the procedure and the rest of the program
and data, since they can pass values and return results. We describe the equivalent
to procedures in Java in Section 2.15 on the CD, but Java needs everything from a
computer that C needs.
You can think of a procedure like a spy who leaves with a secret plan,
acquires resources, performs the task, covers his or her tracks, and then returns
to the point of origin with the desired result. Nothing else should be perturbed
once the mission is complete. Moreover, a spy operates on only a “need to know”
basis, so the spy can’t make assumptions about his employer.
procedure A stored subroutine that performs a specific task based on the parameters with which it is provided.

Similarly, in the execution of a procedure, the program must follow these six steps:
1. Put parameters in a place where the procedure can access them.
2. Transfer control to the procedure.
3. Acquire the storage resources needed for the procedure.
4. Perform the desired task.
5. Put the result value in a place where the calling program can access it.
6. Return control to the point of origin, since a procedure can be called from
several points in a program.
As mentioned above, registers are the fastest place to hold data in a computer,
so we want to use them as much as possible. ARM software follows the following
convention for procedure calling in allocating its 16 registers:
■ r0−r3: four argument registers in which to pass parameters
■ lr: one link register containing the return address, to return to the point of origin
Branch-and-link instruction An instruction that jumps to an address and simultaneously saves the address of the following instruction in a register (lr, or register 14, in ARM).

return address A link to the calling site that allows a procedure to return to the proper address; in ARM it is stored in register lr (register 14).

caller The program that instigates a procedure and provides the necessary parameter values.

callee A procedure that executes a series of stored instructions based on parameters provided by the caller and then returns control to the caller.

stack A data structure for spilling registers organized as a last-in-first-out queue.
In addition to allocating these registers, ARM assembly language includes an
instruction just for the procedures: it jumps to an address and simultaneously
saves the address of the following instruction in register lr. The Branch-and-Link
instruction (BL) is simply written
BL ProcedureAddress
The link portion of the name means that an address or link is formed that points to
the calling site to allow the procedure to return to the proper address. This “link,”
stored in register lr (register 14), is called the return address. The return address
is needed because the same procedure could be called from several parts of the
program.
To return, ARM just uses the move instruction to copy the link register into the
PC, which causes an unconditional branch to the address specified in a register:
MOV pc, lr
This instruction branches to the address stored in register lr—which is just what
we want. Thus, the calling program, or caller, puts the parameter values in r0−r3
and uses BL X to jump to procedure X (sometimes named the callee). The callee then performs the calculations, places the results (if any) into r0 and r1, and
returns control to the caller using MOV pc, lr. The BL instruction actually saves
PC + 4 in register lr to link to the following instruction to set up the procedure
return.
Using More Registers
Suppose a compiler needs more registers for a procedure than the four argument
and two return value registers. Since we must cover our tracks after our mission
is complete, any registers needed by the caller must be restored to the values that
they contained before the procedure was invoked. This situation is an example
in which we need to spill registers to memory, as mentioned in the Hardware/
Software Interface section.
The ideal data structure for spilling registers is a stack—a last-in-first-out
queue. A stack needs a pointer to the most recently allocated address in the stack
to show where the next procedure should place the registers to be spilled or where
old register values are found. The stack pointer is adjusted by one word for each
register that is saved or restored. ARM software reserves register 13 for the stack
pointer, giving it the obvious name sp. Stacks are so popular that they have their
own buzzwords for transferring data to and from the stack: placing data onto the
stack is called a push, and removing data from the stack is called a pop.
By historical precedent, stacks “grow” from higher addresses to lower addresses.
This convention means that you push values onto the stack by subtracting from
the stack pointer. Adding to the stack pointer shrinks the stack, thereby popping
values off the stack.
stack pointer A value denoting the most recently allocated address in a stack that shows where registers should be spilled or where old register values can be found. In ARM, it is register sp (register 13).

push Add element to stack.

pop Remove element from stack.
Compiling a C Procedure That Doesn’t Call Another Procedure
Let’s turn the example on page 79 from Section 2.2 into a C procedure:
EXAMPLE
int leaf_example (int g, int h, int i, int j)
{
int f;
f = (g + h) – (i + j);
return f;
}
What is the compiled ARM assembly code?
ANSWER

The parameter variables g, h, i, and j correspond to the argument registers r0, r1, r2, and r3, and f corresponds to r4. The compiled program starts with the label of the procedure:

leaf_example:
The next step is to save the registers used by the procedure. The C assignment
statement in the procedure body is identical to the example on page 79, which
uses two temporary registers. Thus, we need to save three registers: r4, r5, and
r6. We “push” the old values onto the stack by creating space for three words
(12 bytes) on the stack and then store them:
SUB sp, sp, #12     ; adjust stack to make room for 3 items
STR r6, [sp,#8]     ; save register r6 for use afterwards
STR r5, [sp,#4]     ; save register r5 for use afterwards
STR r4, [sp,#0]     ; save register r4 for use afterwards
[Figure: the stack (a) before, (b) during, and (c) after the call; in (b) the stack holds the contents of registers r6, r5, and r4, with sp pointing at the saved r4 at the lowest address.]

FIGURE 2.10 The values of the stack pointer and the stack (a) before, (b) during, and (c) after the procedure call. The stack pointer always points to the "top" of the stack, or the last word in the stack in this drawing.
Figure 2.10 shows the stack before, during, and after the procedure call.
The next three statements correspond to the body of the procedure, which
follows the example on page 79:
ADD r5,r0,r1 ; register r5 contains g + h
ADD r6,r2,r3 ; register r6 contains i + j
SUB r4,r5,r6 ; f gets r5 – r6, which is (g + h) – (i + j)
To return the value of f, we copy it into a return value register r0:
MOV r0,r4 ; returns f (r0 = r4)
Before returning, we restore the three old values of the registers we saved by
“popping” them from the stack:
LDR r4, [sp,#0]    ; restore register r4 for caller
LDR r5, [sp,#4]    ; restore register r5 for caller
LDR r6, [sp,#8]    ; restore register r6 for caller
ADD sp,sp,#12      ; adjust stack to delete 3 items
The procedure ends with a jump register using the return address:
MOV pc, lr ; jump back to calling routine
In the previous example, we used temporary registers and assumed their old
values must be saved and restored. To avoid saving and restoring a register whose
value is never used, which might happen with a temporary register, ARM software
separates 12 of the registers into two groups:
■ r0−r3, r12: argument or scratch registers that are not preserved by the callee (called procedure) on a procedure call
■ r4−r11: eight variable registers that must be preserved on a procedure call (if used, the callee saves and restores them)
This simple convention reduces register spilling. In the example above, if we could rewrite the code to use r12 and reuse one of r0 to r3, we could drop two stores and two loads from the code. We still must save and restore r4, since the callee must assume that the caller needs its value.
Nested Procedures
Procedures that do not call others are called leaf procedures. Life would be simple
if all procedures were leaf procedures, but they aren’t. Just as a spy might employ
other spies as part of a mission, who in turn might use even more spies, so do
procedures invoke other procedures. Moreover, recursive procedures even invoke
“clones” of themselves. Just as we need to be careful when using registers in procedures, more care must also be taken when invoking nonleaf procedures.
For example, suppose that the main program calls procedure A with an argument of 3, by placing the value 3 into register r0 and then using BL A. Then suppose that procedure A calls procedure B via BL B with an argument of 7, also placed
in r0. Since A hasn’t finished its task yet, there is a conflict over the use of register
r0. Similarly, there is a conflict over the return address in register lr, since it now
has the return address for B. Unless we take steps to prevent the problem, this conflict will eliminate procedure A’s ability to return to its caller.
One solution is to push all the other registers that must be preserved onto the
stack, just as we did with the extra registers. The caller pushes any argument registers (r0−r3) that are needed after the call. The callee pushes the return address
register lr and any variable registers (r4−r11) used by the callee. The stack pointer
sp is adjusted to account for the number of registers placed on the stack. Upon the
return, the registers are restored from memory and the stack pointer is readjusted.
Compiling a Recursive C Procedure, Showing Nested Procedure
Linking
Let’s tackle a recursive procedure that calculates factorial:
int fact (int n)
{
if (n < 1) return (1);
EXAMPLE
else return (n * fact(n – 1));
}
What is the ARM assembly code?
ANSWER
The parameter variable n corresponds to the argument register r0. The compiled program starts with the label of the procedure and then saves two registers
on the stack, the return address and r0:
fact:
SUB sp, sp, #8     ; adjust stack for 2 items
STR lr, [sp,#4]    ; save the return address
STR r0, [sp,#0]    ; save the argument n
The first time fact is called, STR saves an address in the program that called
fact. The next two instructions test whether n is less than 1, going to L1 if
n ≥ 1.
CMP r0,#1    ; compare n to 1
BGE L1       ; if n >= 1, go to L1
If n is less than 1, fact returns 1 by putting 1 into a value register: it moves
1 to r0. It then pops the two saved values off the stack and jumps to the return
address:
MOV r0,#1       ; return 1
ADD sp,sp,#8    ; pop 2 items off stack
MOV pc,lr       ; return to the caller
Before popping two items off the stack, we could have loaded r0 and lr. Since
r0 and lr don’t change when n is less than 1, we skip those instructions.
If n is not less than 1, the argument n is decremented and then fact is
called again with the decremented value:
L1: SUB r0,r0,#1    ; n >= 1: argument gets (n – 1)
    BL fact         ; call fact with (n – 1)
The next instruction is where fact returns. First we save the returned value
and restore the old return address and old argument, along with the stack
pointer:
MOV r12,r0         ; save the return value
LDR r0, [sp,#0]    ; return from BL: restore argument n
LDR lr, [sp,#4]    ; restore the return address
ADD sp, sp, #8     ; adjust stack pointer to pop 2 items
Next, the return value register r0 gets the product of old argument r0 and the
current value of the value register. We assume a multiply instruction is available, even though it is not covered until Chapter 3:
MUL r0,r0,r12    ; return n * fact(n – 1)

Finally, fact jumps again to the return address:

MOV pc,lr        ; return to the caller
Hardware/Software Interface

A C variable is generally a location in storage, and its interpretation depends both on its type and storage class. Examples include integers and characters (see Section 2.9). C has two storage classes: automatic and static. Automatic variables are local to a procedure and are discarded when the procedure exits. Static variables exist across exits from and entries to procedures. C variables declared outside all procedures are considered static, as are any variables declared using the keyword static. The rest are automatic.
Figure 2.11 summarizes what is preserved across a procedure call. Note that
several schemes preserve the stack, guaranteeing that the caller will get the same
data back on a load from the stack as it stored onto the stack. The stack above sp is
preserved simply by making sure the callee does not write above sp; sp is itself preserved by the callee adding exactly the same amount that was subtracted from it;
and the other registers are preserved by saving them on the stack (if they are used)
and restoring them from there.
Preserved                            Not preserved
Variable registers: r4–r11           Argument registers: r0–r3
Stack pointer register: sp           Intra-procedure-call scratch register: r12
Link register: lr                    Stack below the stack pointer
Stack above the stack pointer

FIGURE 2.11 What is and what is not preserved across a procedure call.
Allocating Space for New Data on the Stack
The final complexity is that the stack is also used to store variables that are local
to the procedure but do not fit in registers, such as local arrays or structures. The
segment of the stack containing a procedure’s saved registers and local variables is
called a procedure frame or activation record. Figure 2.12 shows the state of the
stack before, during, and after the procedure call.
Allocating Space for New Data on the Heap
In addition to automatic variables that are local to procedures, C programmers
need space in memory for static variables and for dynamic data structures.
Figure 2.13 shows the ARM convention for allocation of memory. The stack starts
in the high end of memory and grows down. The first part of the low end of
memory is reserved, followed by the home of the ARM machine code, traditionally
called the text segment. Above the code is the static data segment, which is the place
for constants and other static variables. Although arrays tend to be a fixed length
procedure frame Also called activation record. The segment of the stack containing a procedure's saved registers and local variables.

text segment The segment of a UNIX object file that contains the machine language code for routines in the source file.
[Figure: the stack (a) before, (b) during, and (c) after the call. During the call the procedure frame holds, from high to low addresses: saved argument registers (if any), the saved return address, saved saved registers (if any), and local arrays and structures (if any), with sp pointing at the low end.]

FIGURE 2.12 Illustration of the stack allocation (a) before, (b) during, and (c) after the procedure call. The stack pointer (sp) points to the top of the stack. The stack is adjusted to make room for all the saved registers and any memory-resident local variables. If there are no local variables on the stack within a procedure, the compiler will save time by not setting and restoring the stack pointer.
[Figure: ARM memory allocation. From low addresses to high: Reserved (starting at 0), Text (program code, starting at 0040 0000hex, where the pc points), Static data (starting at 1000 0000hex), Dynamic data (labeled 1000 8000hex in the figure, growing up), and the Stack (sp initialized to 7fff fffchex, growing down).]

FIGURE 2.13 Typical ARM memory allocation for program and data. These addresses are only a software convention, and not part of the ARM architecture. The stack pointer is initialized to 7fff fffchex and grows down toward the data segment. At the other end, the program code ("text") starts at 0040 0000hex. The static data starts at 1000 0000hex. Dynamic data, allocated by malloc in C and by new in Java, is next. It grows up toward the stack in an area called the heap.
and thus are a good match to the static data segment, data structures like linked
lists tend to grow and shrink during their lifetimes. The segment for such data
structures is traditionally called the heap, and it is placed next in memory. Note
that this allocation allows the stack and heap to grow toward each other, thereby
allowing the efficient use of memory as the two segments wax and wane.
C allocates and frees space on the heap with explicit functions. malloc()
allocates space on the heap and returns a pointer to it, and free() releases space
on the heap to which the pointer points. Memory allocation is controlled by
programs in C, and it is the source of many common and difficult bugs. Forgetting
to free space leads to a “memory leak,” which eventually uses up so much memory
that the operating system may crash. Freeing space too early leads to “dangling
pointers,” which can cause pointers to point to things that the program never
intended. Java uses automatic memory allocation and garbage collection just to
avoid such bugs.
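A minimal C sketch of the discipline this requires (our own example, not from the text): every successful malloc is matched by exactly one free, and the pointer is not used afterward.

#include <stdlib.h>
#include <string.h>

int main(void) {
    char *buf = malloc(16);          /* space allocated on the heap        */
    if (buf == NULL)
        return 1;                    /* allocation can fail                */
    strcpy(buf, "hello");            /* use the space...                   */
    free(buf);                       /* ...and release it exactly once     */
    buf = NULL;                      /* avoid a dangling pointer afterward */
    return 0;
}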
Figure 2.14 summarizes the register conventions for the ARM assembly
language.
Name    Register number    Usage                                          Preserved on call?
a1-a2   0–1                Argument / return result / scratch register    no
a3-a4   2–3                Argument / scratch register                    no
v1-v8   4–11               Variables for local routine                    yes
ip      12                 Intra-procedure-call scratch register          no
sp      13                 Stack pointer                                  yes
lr      14                 Link Register (Return address)                 yes
pc      15                 Program Counter                                n.a.

FIGURE 2.14 ARM register conventions.
Elaboration: What if there are more than four parameters? The ARM convention is
to place the extra parameters on the stack. The procedure then expects the first four
parameters to be in registers r0 through r3 and the rest in memory, addressable via
the stack pointer.
Elaboration: Some recursive procedures can be implemented iteratively without
using recursion. Iteration can significantly improve performance by removing the overhead associated with procedure calls. For example, consider a procedure used to accumulate a sum:
int sum (int n, int acc) {
if (n > 0)
return sum(n – 1, acc + n);
else
return acc;
}
Consider the procedure call sum(3,0). This will result in recursive calls to sum(2,3), sum(1,5), and sum(0,6), and then the result 6 will be returned four times.
This recursive call of sum is referred to as a tail call, and this example use of tail recursion can be implemented very efficiently (assume r0 = n and r1 = acc):
sum:      CMP r0, #0        ; test if n <= 0
          BLE sum_exit      ; go to sum_exit if n <= 0
          ADD r1, r1, r0    ; add n to acc
          SUB r0, r0, #1    ; subtract 1 from n
          B sum             ; go to sum
sum_exit:
          MOV r0, r1        ; return value acc
          MOV pc, lr        ; return to caller
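The assembly above behaves like the following iterative C version of sum (our own rewrite, not from the text), which is roughly what a compiler performing tail-call optimization produces.

int sum(int n, int acc) {
    while (n > 0) {          /* same test as CMP/BLE in the loop above */
        acc = acc + n;       /* ADD r1, r1, r0 */
        n   = n - 1;         /* SUB r0, r0, #1 */
    }
    return acc;              /* MOV r0, r1 ; MOV pc, lr */
}

With this version, sum(3, 0) returns 6 using a single stack frame instead of four.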
Check Yourself

Which of the following statements about C and Java are generally true?
1. C programmers manage data explicitly, while it’s automatic in Java.
2. C leads to more pointer bugs and memory leak bugs than does Java.
!(@ | = > (wow open
tab at bar is great)
Fourth line of the
keyboard poem “Hatless
Atlas,” 1991 (some
give names to ASCII
characters: “!” is “wow,”
“(” is open, “|” is bar,
and so on).
2.9 Communicating with People

Computers were invented to crunch numbers, but as soon as they became commercially viable they were used to process text. Most computers today offer 8-bit bytes to represent characters, with the American Standard Code for Information Interchange (ASCII) being the representation that nearly everyone follows. Figure 2.15 summarizes ASCII.

Value  Char    Value  Char    Value  Char    Value  Char    Value  Char    Value  Char
 32    space    48    0        64    @        80    P        96    `       112    p
 33    !        49    1        65    A        81    Q        97    a       113    q
 34    "        50    2        66    B        82    R        98    b       114    r
 35    #        51    3        67    C        83    S        99    c       115    s
 36    $        52    4        68    D        84    T       100    d       116    t
 37    %        53    5        69    E        85    U       101    e       117    u
 38    &        54    6        70    F        86    V       102    f       118    v
 39    '        55    7        71    G        87    W       103    g       119    w
 40    (        56    8        72    H        88    X       104    h       120    x
 41    )        57    9        73    I        89    Y       105    i       121    y
 42    *        58    :        74    J        90    Z       106    j       122    z
 43    +        59    ;        75    K        91    [       107    k       123    {
 44    ,        60    <        76    L        92    \       108    l       124    |
 45    -        61    =        77    M        93    ]       109    m       125    }
 46    .        62    >        78    N        94    ^       110    n       126    ~
 47    /        63    ?        79    O        95    _       111    o       127    DEL

FIGURE 2.15 ASCII representation of characters. Note that upper- and lowercase letters differ by exactly 32; this observation can lead to shortcuts in checking or changing upper- and lowercase. Values not shown include formatting characters. For example, 8 represents a backspace, 9 represents a tab character, and 13 a carriage return. Another useful value is 0 for null, the value the programming language C uses to mark the end of a string. This information is also found in Column 3 of the ARM Reference Data Card at the front of this book.
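Because upper- and lowercase letters differ by exactly 32 (a single bit), case conversion is a one-operation logical trick. The C sketch below is our own example of the idea.

#include <assert.h>

int main(void) {
    char c = 'a';
    char upper = c & ~0x20;       /* clear bit 5: 'a' (97) becomes 'A' (65) */
    char lower = upper | 0x20;    /* set bit 5:   'A' (65) becomes 'a' (97) */
    assert(upper == 'A' && lower == 'a');
    return 0;
}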
Hardware/Software Interface

Base 2 is not natural to human beings; we have 10 fingers and so find base 10 natural. Why didn't computers use decimal? In fact, the first commercial computer did offer decimal arithmetic. The problem was that the computer still used on and off signals, so a decimal digit was simply represented by several binary digits. Decimal proved so inefficient that subsequent computers reverted to all binary, converting to base 10 only for the relatively infrequent input/output events.
ASCII versus Binary Numbers

EXAMPLE

We could represent numbers as strings of ASCII digits instead of as integers. How much does storage increase if the number 1 billion is represented in ASCII versus a 32-bit integer?

ANSWER

One billion is 1,000,000,000, so it would take 10 ASCII digits, each 8 bits long. Thus the storage expansion would be (10 × 8)/32 or 2.5. In addition to the expansion in storage, the hardware to add, subtract, multiply, and divide such decimal numbers is difficult. Such difficulties explain why computing professionals are raised to believe that binary is natural and that the occasional decimal computer is bizarre.
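The conversion the hardware avoids by working in binary is exactly what a program must do when a number arrives as ASCII text. This C sketch (our own) turns the ten-character string into the 32-bit integer.

#include <stdio.h>

int main(void) {
    const char *s = "1000000000";            /* ten ASCII digits = 80 bits of storage  */
    unsigned int value = 0;
    for (int i = 0; s[i] != '\0'; i++)
        value = value * 10 + (s[i] - '0');   /* digit character to digit value         */
    printf("%u\n", value);                   /* prints 1000000000, held in 32 bits     */
    return 0;
}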
A series of instructions can extract a byte from a word, so load word and store
word are sufficient for transferring bytes as well as words. Because of the popularity
of text in some programs, however, ARM provides instructions to move bytes. Load
register byte (LDRB) loads a byte from memory, placing it in the rightmost 8 bits
of a register. Store register byte (STRB) takes a byte from the rightmost 8 bits of a
register and writes it to memory. Thus, we copy a byte with the sequence
LDRB r0,[sp,#0]     ; Read byte from source
STRB r0,[r10,#0]    ; Write byte to destination
Hardware/Software Interface

Signed versus unsigned applies to loads as well as to arithmetic. The function of a signed load is to copy the sign repeatedly to fill the rest of the register—called sign extension—but its purpose is to place a correct representation of the number within that register. Unsigned loads simply fill with 0s to the left of the data, since the number represented by the bit pattern is unsigned.
When loading a 32-bit word into a 32-bit register, the point is moot; signed and
unsigned loads are identical. ARM does offer two flavors of byte loads: load register
signed byte (LDRSB) treats the byte as a signed number and thus sign-extends to
fill the 24 leftmost bits of the register, while load register byte (LDRB) works with
unsigned integers. Since C programs almost always use bytes to represent characters rather than consider bytes as very short signed integers, LDRB is used practically exclusively for byte loads.
Characters are normally combined into strings, which have a variable number
of characters. There are three choices for representing a string: (1) the first position
of the string is reserved to give the length of a string, (2) an accompanying variable
has the length of the string (as in a structure), or (3) the last position of a string is
indicated by a character used to mark the end of a string. C uses the third choice,
terminating a string with a byte whose value is 0 (named null in ASCII). Thus,
the string “Cal” is represented in C by the following 4 bytes, shown as decimal
numbers: 67, 97, 108, 0. (As we shall see, Java uses the first option.)
Compiling a String Copy Procedure, Showing How to Use C Strings
EXAMPLE
The procedure strcpy copies string y to string x using the null byte
termination convention of C:
void strcpy (char x[], char y[])
{
int i;
i = 0;
while ((x[i] = y[i]) != '\0') /* copy & test byte */
i += 1;
}
What is the ARM assembly code?
ANSWER
Below is the basic ARM assembly code segment. Assume that base addresses
for arrays x and y are found in r0 and r1, while i is in r4. strcpy adjusts the
stack pointer and then saves the saved register r4 on the stack:
strcpy:
SUB sp, sp, #4     ; adjust stack for 1 more item
STR r4, [sp,#0]    ; save r4
To initialize i to 0, the next instruction moves the constant 0 into r4:

MOV r4, #0    ; i = 0

This is the beginning of the loop. The address of y[i] is first formed by adding i to y[]:

L1: ADD r2,r4,r1    ; address of y[i] in r2
Note that we don’t have to multiply i by 4 since y is an array of bytes and not
of words, as in prior examples.
To load the character in y[i], we use load register byte (LDRB), which puts the character into r3:

LDRB r3, [r2,#0]    ; r3 = y[i]
A similar address calculation puts the address of x[i] in r12, and then the
character in r3 is stored at that address.
ADD r12,r4,r0        ; address of x[i] in r12
STRB r3, [r12,#0]    ; x[i] = y[i]
Next, we exit the loop if the character was 0. That is, we exit if it is the last character of the string. A compare sets the condition flags that the branch tests:

CMP r3, #0    ; test for the terminating null
BEQ L2        ; if y[i] == 0, go to L2
If not, we increment i and loop back:
ADD r4, r4, #1    ; i = i + 1
B L1              ; go to L1
If we don’t loop back, it was the last character of the string; we restore r4 and
the stack pointer, and then return.
L2: LDR r4, [sp,#0]    ; y[i] == 0: end of string. Restore old r4
    ADD sp, sp, #4     ; pop 1 word off stack
    MOV pc, lr         ; return
String copies usually use pointers instead of arrays in C to avoid the operations on i in the code above. Section 2.10 shows more sophisticated addressing
options for loads and stores that would reduce the number of instructions in
this loop. Also, see Section 2.14 for an explanation of arrays versus pointers.
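For comparison, here is the pointer version of the string copy that the last paragraph alludes to; it is a common C idiom rather than code from the text (the name strcpy_ptr is ours), and Section 2.14 discusses the trade-off in detail.

void strcpy_ptr(char *x, const char *y) {
    /* Copy bytes until the null terminator itself has been copied. */
    while ((*x++ = *y++) != '\0')
        ;   /* empty body: the copy and the test both happen in the condition */
}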
Since the procedure strcpy above is a leaf procedure, the compiler could allocate i to a temporary register and avoid saving and restoring it. Hence, instead of
thinking of the registers r0 to r3 as being just for arguments, we can think of
them as registers that the callee should use whenever convenient. When a compiler
finds a leaf procedure, it exhausts such registers before using registers it must save.
Characters and Strings in Java
Unicode is a universal encoding of the alphabets of most human languages.
Figure 2.16 is a list of Unicode alphabets; there are almost as many alphabets in
Unicode as there are useful symbols in ASCII. To be more inclusive, Java uses Unicode for characters. By default, it uses 16 bits to represent a character.
The ARM instruction set has explicit instructions to load and store such
16-bit quantities, called halfwords. Load register half (LDRH) loads a halfword from
memory, placing it in the rightmost 16 bits of a register. Like load register byte,
load half (LDRH) treats the halfword as an unsigned number and thus zero-extends
to fill the 16 leftmost bits of the register, while load register signed halfword (LDRSH)
works with signed integers. Thus, LDRH is the more popular of the two. Store half
(STRH) takes a halfword from the rightmost 16 bits of a register and writes it to
memory. We copy a halfword with the sequence
LDRH r0,[sp,#0]      ; Read halfword (16 bits) from source
STRH r0,[r12,#0]     ; Write halfword (16 bits) to destination
Strings are a standard Java class with special built-in support and predefined
methods for concatenation, comparison, and conversion. Unlike C, Java includes a
word that gives the length of the string, similar to Java arrays.
Latin        Malayalam                              Tagbanwa                   General Punctuation
Greek        Sinhala                                Khmer                      Spacing Modifier Letters
Cyrillic     Thai                                   Mongolian                  Currency Symbols
Armenian     Lao                                    Limbu                      Combining Diacritical Marks
Hebrew       Tibetan                                Tai Le                     Combining Marks for Symbols
Arabic       Myanmar                                Kangxi Radicals            Superscripts and Subscripts
Syriac       Georgian                               Hiragana                   Number Forms
Thaana       Hangul Jamo                            Katakana                   Mathematical Operators
Devanagari   Ethiopic                               Bopomofo                   Mathematical Alphanumeric Symbols
Bengali      Cherokee                               Kanbun                     Braille Patterns
Gurmukhi     Unified Canadian Aboriginal Syllabic   Shavian                    Optical Character Recognition
Gujarati     Ogham                                  Osmanya                    Byzantine Musical Symbols
Oriya        Runic                                  Cypriot Syllabary          Musical Symbols
Tamil        Tagalog                                Tai Xuan Jing Symbols      Arrows
Telugu       Hanunoo                                Yijing Hexagram Symbols    Box Drawing
Kannada      Buhid                                  Aegean Numbers             Geometric Shapes
FIGURE 2.16 Example alphabets in Unicode. Unicode version 4.0 has more than 160 “blocks,”
which is their name for a collection of symbols. Each block is a multiple of 16. For example, Greek starts at
0370hex, and Cyrillic at 0400hex. The first three columns show 48 blocks that correspond to human languages
in roughly Unicode numerical order. The last column has 16 blocks that are multilingual and are not in order.
A 16-bit encoding, called UTF-16, is the default. A variable-length encoding, called UTF-8, keeps the ASCII
subset as eight bits and uses 16−32 bits for the other characters. UTF-32 uses 32 bits per character. To learn
more, see www.unicode.org.
Elaboration: ARM software tries to keep the stack aligned to word addresses,
allowing the program to always use LDR and STR (which must be aligned) to access
the stack. This convention means that a char variable allocated on the stack occupies
4 bytes, even though it needs less. However, a C string variable or an array of bytes will
pack 4 bytes per word, and a Java string variable or array of shorts packs 2 halfwords
per word.
Check Yourself

I. Which of the following statements about characters and strings in C and Java are true?
1. A string in C takes about half the memory as the same string in Java.
2. Strings are just an informal name for single-dimension arrays of characters
in C and Java.
3. Strings in C and Java use null (0) to mark the end of a string.
4. Operations on strings, like length, are faster in C than in Java.
II. Which type of variable that can contain 1,000,000,000ten takes the most
memory space?
1. int in C
2. string in C
3. string in Java
2.10 ARM Addressing for 32-Bit Immediates and More Complex Addressing Modes
Although keeping all ARM instructions 32 bits long simplifies the hardware, there
are times where it would be convenient to have a 32-bit constant or 32-bit address.
This section starts with how ARM builds support for certain 32-bit patterns from just 12 bits, and then shows the optimizations for addresses used in data transfers.
32-Bit Immediate Operands
Although constants are frequently short and fit into a narrow field, sometimes
they are bigger. The ARM architects believed that some 32-bit constants were more
popular than others, and so included a trick to allow ARM to specify some that
they thought would be important. The 12-bit Operand2 field in the DP format
is actually subdivided into 2 fields: an 8-bit constant field on the right and a 4-bit
rotate right field. This latter field rotates the 8-bit constant to the right by twice the
value in the rotate field.
Using unsigned numbers, this trick can represent any number of the form

X × 2^(2i)

where X is between 0 and 255 and i is between 0 and 15. It can also represent a few other patterns when the 8-bit rotated constant straddles both the most significant and least significant bits:

T × 2^30 + W,   U × 2^28 + V,   and   W × 2^26 + T

where T is between 0 and 3, U is between 0 and 15, V is between 0 and 15, and W is between 0 and 63. Since ARM has the MVN instruction, it is also quick to produce the one's complement of any of the patterns above.
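A quick way to see which constants qualify is to try every even rotation of an 8-bit value. The C sketch below is our own (the function names are made up) and simply applies the rotate-right rule described above.

#include <stdio.h>
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (n between 0 and 31). */
static uint32_t ror32(uint32_t x, unsigned int n) {
    return (n == 0) ? x : ((x >> n) | (x << (32 - n)));
}

/* A constant is encodable if it equals some 8-bit value rotated
   right by an even amount between 0 and 30.                     */
static int is_rotated_immediate(uint32_t c) {
    for (unsigned int rot = 0; rot < 32; rot += 2)
        for (uint32_t x = 0; x < 256; x++)
            if (ror32(x, rot) == c)
                return 1;
    return 0;
}

int main(void) {
    printf("%d\n", is_rotated_immediate(0x00D90000u)); /* the constant in the example below: yes */
    printf("%d\n", is_rotated_immediate(0x00102003u)); /* spans too many bit positions: no       */
    return 0;
}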
Loading a 32-Bit Constant
EXAMPLE
What is the ARM machine code to load this 32-bit constant into register r4?
0000 0000 1101 1001 0000 0000 0000 0000
ANSWER

First, we would find the pattern for the 8 nonzero bits (1101 1001), which is 217 in decimal. To move that pattern from the 8 least significant bits to bits 23 to 16, we need to rotate the 8-bit pattern to the right by 16 bits. Since ARM doubles the amount in the rotation field, we need to enter 8 in that field. The machine language MOV instruction (opcode = 13) to put this 32-bit value in r4 is:
Cond
F
I
Opcode
S
Rn
Rd
rotate-imm
mm_8
14
0
1
13
0
0
4
8
217
4 bits
2 bits
1 bit
4 bits
1 bit
4 bits
4 bits
4 bits
8 bits
The assembler should map large constants into the right instruction format so that
the programmer need not worry about it. If it doesn’t match one of these patterns,
the assembler should place the 32-bit constant in memory and then load it into the
register. This useful feature can be generalized so that the symbolic representation of
the ARM machine language is no longer limited by the hardware, but by whatever the
creator of an assembler chooses to include (see Section 2.12).
Elaboration: ARMv6T2 added instructions that allow creation of any 32-bit constant in just two instructions. MOVW writes a 16-bit immediate into the bottom half of the destination register and clears the top half; MOVT then writes a 16-bit immediate into the top half without affecting the bottom half. The ARM assembler picks the most efficient instruction sequence depending on the constant.
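As a small illustration of that two-instruction idiom (the example value is ours, not from the text), an assembler only has to split the constant into its two 16-bit halves:

#include <stdint.h>
#include <stdio.h>

/* Sketch, not the ARM assembler's actual code: split a 32-bit constant into
   the 16-bit immediates that a MOVW/MOVT pair would carry. */
int main(void)
{
    uint32_t constant = 0xDEADBEEFu;                /* arbitrary example */
    uint16_t low  = (uint16_t)(constant & 0xFFFFu); /* immediate for MOVW */
    uint16_t high = (uint16_t)(constant >> 16);     /* immediate for MOVT */
    printf("MOVW imm16 = 0x%04X, MOVT imm16 = 0x%04X\n", low, high);
    return 0;
}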
More Complex Addressing in Data Transfer Instructions
Multiple forms of addressing are generically called addressing modes. Consistent with the ARM philosophy of trying to reduce the number of instructions needed to execute a program—as exemplified by the shifting of arithmetic operands, conditional instruction execution, and rotated immediates—ARM includes many addressing modes for data transfer instructions.
The addressing mode in Section 2.6 is called immediate offset, where a constant offset is added to a base register. There are seven more:
addressing mode One of
several addressing regimes
delimited by their varied
use of operands and/or
addresses.
1. Register Offset. Instead of adding a constant to the base register, another register is added to the base register. This mode can help with an index into an
array, where the array index is in one register and the base of the array is in
another. Example: LDR r2, [r0,r1].
2. Scaled Register Offset. Just as the second operand can be optionally shifted left
or right in data processing instructions, this addressing mode allows the register to be shifted before it is added to the base register. This mode can be useful to turn an array index into a byte address by shifting it left by 2 bits. We’ll
see this mode used in Section 2.14. Example: LDR r2, [r0,r1, LSL #2]
3. Immediate Pre-Indexed. This and the following modes update the base register with the new address as part of the addressing mode. That is, on a load
instruction, the content of the destination register changes based on the
value fetched from memory and the base register changes to the address that
was used to access memory. This mode can be useful when going sequentially through an array. There are two versions—one where the address is
added to the base and one where the address is subtracted from the base—to
allow the programmer to go through the array forwards or backwards. Note
that in this mode the addition or subtraction occurs before the address is
sent to memory. Example: LDR r2, [r0, #4]!
4. Immediate Post-Indexed. Just to cover all eventualities, ARM also has a mode
just like immediate pre-indexed except the address in the base register is
used to access memory first and then the constant is added or subtracted
later. Depending on how you set up your program to access memory, either
pre-indexed or post-indexed is desired, but not likely both. We’ll see this
mode used in Section 2.14. Example: LDR r2, [r0], #4
5. Register Pre-Indexed. The same as Immediate Pre-Indexed, except you add or subtract a register instead of a constant. Example: LDR r2, [r0,r1]!
6. Scaled Register Pre-Indexed. The same as Register Pre-Indexed, except you shift the register before adding or subtracting it. Example: LDR r2, [r0,r1, LSL #2]!
7. Register Post-Indexed. The same as Immediate Post-Indexed, except you add or subtract a register instead of a constant. Example: LDR r2, [r0],r1. (A rough C analogy for the indexed modes follows this list.)
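As a rough C analogy (ours, not from the text) for the scaled-offset and the pre- and post-indexed modes above, each ARM comment below names the single LDR that does the same work as the C statements next to it:

#include <stdio.h>

int main(void)
{
    int v[4] = { 10, 20, 30, 40 };
    int i = 2;
    int *p = v;
    int x;

    /* Scaled register offset: LDR r2, [r0, r1, LSL #2] */
    x = v[i];                  /* address = base + (i * 4) */
    printf("%d\n", x);         /* 30 */

    /* Immediate pre-indexed: LDR r2, [r0, #4]!  (advance the base, then load) */
    p = p + 1;
    x = *p;
    printf("%d\n", x);         /* 20 */

    /* Immediate post-indexed: LDR r2, [r0], #4  (load, then advance the base) */
    x = *p;
    p = p + 1;
    printf("%d\n", x);         /* 20 again; p now points at v[2] */
    return 0;
}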
Some consider the operands in the data processing instructions to be addressing
modes as well. Hence, we can add the following modes to the list:
1. Immediate addressing, where the operand is a constant within the instruction itself. Example: ADD r2, r0, #5
2. Register addressing, where the operand is a register. Example: ADD r2, r0, r1
3. Scaled register addressing, where the register operand is shifted first.
Example: ADD r2, r0, r1, LSL #2
4. PC-relative addressing, where the branch address is the sum of the PC and a
constant in the instruction. Example: BEQ 1000
Since one of the 16 ARM registers is the program counter, the data transfer addressing modes can also offer PC-relative addressing, just as we saw for branches. One
reason for using PC-relative addressing with loads would be to fetch 32-bit constants that could be placed in memory along with the program.
Figure 2.17 shows how operands are identified for each addressing mode. Note
that a single operation can use more than one addressing mode. Add, for example,
uses both immediate and register addressing.
Figure 2.18 shows the ARM instruction formats covered so far. Figure 2.1 on
page 78 shows the ARM assembly language revealed in this chapter. The remaining portion of ARM instructions deals mainly with arithmetic and real numbers,
which are covered in the next chapter.
1. Immediate: ADD r2, r0, #5
2. Register: ADD r2, r0, r1
3. Scaled register: ADD r2, r0, r1, LSL #2

FIGURE 2.17 Illustration of the twelve ARM addressing modes. The operands are shaded in color. The operand of mode 3 is in memory, whereas the operand for mode 2 is a register. Note that versions of load and store access bytes, halfwords, or words. For mode 1, the operand is in the instruction itself.
4. PC-relative: BEQ 1000
5. Immediate offset: LDR r2, [r0, #8]
6. Register offset: LDR r2, [r0, r1]
7. Scaled register offset: LDR r2, [r0, r1, LSL #2]
8. Immediate offset pre-indexed: LDR r2, [r0, #4]!

FIGURE 2.17 Illustration of the twelve ARM addressing modes. (continued)
9. Immediate offset post-indexed: LDR r2, [r0], #4
10. Register offset pre-indexed: LDR r2, [r0, r1]!
11. Scaled register offset pre-indexed: LDR r2, [r0, r1, LSL #2]!
12. Register offset post-indexed: LDR r2, [r0], r1

FIGURE 2.17 Illustration of the twelve ARM addressing modes. (continued)
Name         Fields                                                                  Comments
DP format    Cond | F | I | Opcode | S | Rn | Rd | Operand2                          Arithmetic instruction format
Field size   4 bits | 2 bits | 1 bit | 4 bits | 1 bit | 4 bits | 4 bits | 12 bits
DT format    Cond | F | Opcode | Rn | Rd | Offset12                                  Data transfer format
BR format    Cond | F | Opcode | signed_immed_24                                     B and BL instructions
Field size   4 bits | 2 bits | 2 bits | 24 bits

FIGURE 2.18 ARM instruction formats. All ARM instructions are 32 bits long.
Hardware/Software Interface

Although ARM has 32-bit addresses, many microprocessors also have 64-bit address extensions in addition to 32-bit addresses. These extensions were in response to the needs of software for larger programs. The process of instruction set extension allows architectures to expand in a way that moves software compatibly upward to the next generation of architecture.

2.11 Parallelism and Instructions: Synchronization
Parallel execution is easier when tasks are independent, but often they need to
cooperate. Cooperation usually means some tasks are writing new values that others must read. To know when a task is finished writing so that it is safe for another
to read, the tasks need to synchronize. If they don’t synchronize, there is a danger of
a data race, where the results of the program can change depending on how events
happen to occur.
For example, recall the analogy of the eight reporters writing a story on page 43 of Chapter 1. Suppose one reporter needs to read all the prior sections before writing a conclusion. Hence, he or she must know when the other reporters have finished their sections, so that there is no need to worry about those sections being changed afterwards. That is, the reporters had better synchronize the writing and reading of each section so that the conclusion will be consistent with what is printed in the prior sections.
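Here is a small concrete data race (the code is ours and uses POSIX threads rather than any primitive from this section): two threads update the same counter with no synchronization, so the final value depends on how their reads and writes happen to interleave.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;          /* shared, unprotected */

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter = counter + 1;    /* read-modify-write with no lock */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, NULL);
    pthread_create(&t2, NULL, bump, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld\n", counter);     /* often less than 2000000 */
    return 0;
}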
In computing, synchronization mechanisms are typically built with user-level
software routines that rely on hardware-supplied synchronization instructions. In
this section, we focus on the implementation of lock and unlock synchronization
operations. Lock and unlock can be used straightforwardly to create regions where
only a single processor can operate, called mutual exclusion, as well as to implement
more complex synchronization mechanisms.
data race Two memory accesses form a data race if they are from different threads to the same location, at least one is a write, and they occur one after another.

The critical ability we require to implement synchronization in a multiprocessor is a set of hardware primitives with the ability to atomically read and modify a memory
location. That is, nothing else can interpose itself between the read and the write of
the memory location. Without such a capability, the cost of building basic synchronization primitives will be too high and will increase as the processor count increases.
There are a number of alternative formulations of the basic hardware primitives, all of which provide the ability to atomically read and modify a location,
together with some way to tell if the read and write were performed atomically. In
general, architects do not expect users to employ the basic hardware primitives, but
instead expect that the primitives will be used by system programmers to build a
synchronization library, a process that is often complex and tricky.
Let’s start with one such hardware primitive and show how it can be used to
build a basic synchronization primitive. One typical operation for building synchronization operations is the atomic exchange or atomic swap, which interchanges
a value in a register for a value in memory. ARM has such an instruction: SWP.
To see how to use this to build a basic synchronization primitive, assume that
we want to build a simple lock where the value 0 is used to indicate that the lock
is free and 1 is used to indicate that the lock is unavailable. A processor tries to set
the lock by doing an exchange of 1, which is in a register, with the memory address
corresponding to the lock. The value returned from the exchange instruction is 1 if
some other processor had already claimed access and 0 otherwise. In the latter case,
the value is also changed to 1, preventing any competing exchange in another processor from also retrieving a 0.
For example, consider two processors that each try to do the exchange simultaneously: this race is broken, since exactly one of the processors will perform the
exchange first, returning 0, and the second processor will return 1 when it does the
exchange. The key to using the exchange primitive to implement synchronization
is that the operation is atomic: the exchange is indivisible, and two simultaneous
exchanges will be ordered by the hardware. It is impossible for two processors
trying to set the synchronization variable in this manner to both think they have
simultaneously set the variable.
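A minimal spin-lock sketch of this idea, assuming C11 atomics as a stand-in for SWP (the code and names are ours, not from the text):

#include <stdatomic.h>

static atomic_int lock = 0;   /* 0 = free, 1 = taken, as in the text */

static void acquire(void)
{
    /* Keep exchanging 1 into the lock; the exchange that returns 0 wins. */
    while (atomic_exchange(&lock, 1) != 0)
        ;                     /* spin: another processor holds the lock */
}

static void release(void)
{
    atomic_store(&lock, 0);   /* hand the lock back */
}

int main(void)
{
    acquire();
    /* critical section: only one thread or processor at a time gets here */
    release();
    return 0;
}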
Elaboration: Although it was presented for multiprocessor synchronization, atomic exchange is also useful for the operating system in dealing with multiple processes on a single processor. To make sure nothing interferes on a single processor, the synchronization primitive must also remain atomic if the processor does a context switch between the read and the write (see Chapter 5).
Check Yourself
When do you use primitives like SWP?
1. When cooperating threads of a parallel program need to synchronize to get
proper behaviour for reading and writing shared data
2. When cooperating processes on a uniprocessor need to synchronize for
reading and writing shared data
2.12 Translating and Starting a Program
This section describes the four steps in transforming a C program in a file on disk
into a program running on a computer. Figure 2.19 shows the translation hierarchy. Some systems combine these steps to reduce translation time, but these are the
logical four phases that programs go through. This section follows this translation
hierarchy.
Compiler
The compiler transforms the C program into an assembly language program, a symbolic form of what the machine understands. High-level language programs take
C program → Compiler → Assembly language program → Assembler → Object: Machine language module (together with Object: Library routine (machine language)) → Linker → Executable: Machine language program → Loader → Memory

FIGURE 2.19 A translation hierarchy for C. A high-level language program is first compiled into an assembly language program and then assembled into an object module in machine language. The linker combines multiple modules with library routines to resolve all references. The loader then places the machine code into the proper memory locations for execution by the processor. To speed up the translation process, some steps are skipped or combined. Some compilers produce object modules directly, and some systems use linking loaders that perform the last two steps. To identify the type of file, UNIX follows a suffix convention for files: C source files are named x.c, assembly files are x.s, object files are named x.o, statically linked library routines are x.a, dynamically linked library routines are x.so, and executable files by default are called a.out. MS-DOS uses the suffixes .C, .ASM, .OBJ, .LIB, .DLL, and .EXE to the same effect.
assembly language A symbolic language that can be translated into binary machine language.
many fewer lines of code than assembly language, so programmer productivity is
much higher.
In 1975, many operating systems and assemblers were written in assembly language because memories were small and compilers were inefficient. The 500,000-fold increase in memory capacity per single DRAM chip has reduced program size
concerns, and optimizing compilers today can produce assembly language programs nearly as good as an assembly language expert, and sometimes even better
for large programs.
Assembler
pseudoinstruction
A common variation
of assembly language
instructions often treated
as if it were an instruction
in its own right.
symbol table A table
that matches names of
labels to the addresses of
the memory words that
instructions occupy.
Since assembly language is an interface to higher-level software, the assembler can
also treat common variations of machine language instructions as if they were
instructions in their own right. The hardware need not implement these instructions; however, their appearance in assembly language simplifies translation and
programming. Such instructions are called pseudoinstructions.
For example, the ARM assembler accepts this instruction:
LDR r0, #constant
and the assembler determines which instructions to use to create the constant in the
most efficient way possible. The default is loading a 32-bit constant from memory.
Assemblers will also accept numbers in a variety of bases. In addition to binary
and decimal, they usually accept a base that is more succinct than binary yet converts easily to a bit pattern. ARM assemblers use hexadecimal.
Such features are convenient, but the primary task of an assembler is assembly into machine code. The assembler turns the assembly language program into
an object file, which is a combination of machine language instructions, data, and
information needed to place instructions properly in memory.
To produce the binary version of each instruction in the assembly language
program, the assembler must determine the addresses corresponding to all labels.
Assemblers keep track of labels used in branches and data transfer instructions
in a symbol table. As you might expect, the table contains pairs of symbols and
addresses.
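A minimal sketch of such a table (the entries and addresses are made up for illustration; the structure and lookup are ours, not the assembler's actual code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Each entry pairs a label with the address of the memory word it names. */
struct symbol {
    const char *label;
    uint32_t    address;
};

static struct symbol symtab[] = {
    { "swap",    0x00400100u },   /* hypothetical addresses */
    { "for1tst", 0x00400018u },
};

/* Look a label up the way the assembler does when encoding a branch. */
static int lookup(const char *label, uint32_t *address)
{
    for (size_t k = 0; k < sizeof symtab / sizeof symtab[0]; k++) {
        if (strcmp(symtab[k].label, label) == 0) {
            *address = symtab[k].address;
            return 1;
        }
    }
    return 0;   /* undefined label: left for the linker to resolve */
}

int main(void)
{
    uint32_t addr;
    if (lookup("swap", &addr))
        printf("swap -> 0x%08X\n", addr);
    return 0;
}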
The object file for UNIX systems typically contains six distinct pieces:
■ The object file header describes the size and position of the other pieces of the object file.
■ The text segment contains the machine language code.
■ The static data segment contains data allocated for the life of the program. (UNIX allows programs to use both static data, which is allocated throughout the program, and dynamic data, which can grow or shrink as needed by the program. See Figure 2.13.)
■ The relocation information identifies instructions and data words that depend on absolute addresses when the program is loaded into memory.
■ The symbol table contains the remaining labels that are not defined, such as external references.
■ The debugging information contains a concise description of how the modules were compiled so that a debugger can associate machine instructions with C source files and make data structures readable.
The next subsection shows how to attach such routines that have already been
assembled, such as library routines.
Linker
What we have presented so far suggests that a single change to one line of one procedure requires compiling and assembling the whole program. Complete retranslation is a terrible waste of computing resources. This repetition is particularly
wasteful for standard library routines, because programmers would be compiling
and assembling routines that by definition almost never change. An alternative is
to compile and assemble each procedure independently, so that a change to one
line would require compiling and assembling only one procedure. This alternative requires a new systems program, called a link editor or linker, which takes
all the independently assembled machine language programs and “stitches” them
together.
There are three steps for the linker:
1. Place code and data modules symbolically in memory.
2. Determine the addresses of data and instruction labels.
3. Patch both the internal and external references.

linker Also called link editor. A systems program that combines independently assembled machine language programs and resolves all undefined labels into an executable file.
The linker uses the relocation information and symbol table in each object
module to resolve all undefined labels. Such references occur in branch instructions, jump instructions, and data addresses, so the job of this program is much
like that of an editor: it finds the old addresses and replaces them with the new
addresses. Editing is the origin of the name “link editor,” or linker for short. The
reason a linker is useful is that it is much faster to patch code than it is to recompile
and reassemble.
If all external references are resolved, the linker next determines the memory
locations each module will occupy. Recall that Figure 2.13 on page 120 shows
the ARM convention for allocation of program and data to memory. Since the
files were assembled in isolation, the assembler could not know where a module’s
instructions and data would be placed relative to other modules. When the linker
places a module in memory, all absolute references, that is, memory addresses that
are not relative to a register, must be relocated to reflect its true location.
The linker produces an executable file that can be run on a computer. Typically,
this file has the same format as an object file, except that it contains no unresolved
references. It is possible to have partially linked files, such as library routines, that
still have unresolved addresses and hence result in object files.
executable file A
functional program in
the format of an object
file that contains no
unresolved references.
It can contain symbol
tables and debugging
information. A “stripped
executable” does not
contain that information.
Relocation information
may be included for the
loader.
Linking Object Files
EXAMPLE
Link the two object files below. Show updated addresses of the first few
instructions of the completed executable file. We show the instructions in
assembly language just to make the example understandable; in reality, the
instructions would be numbers.
Note that in the object files we have highlighted the addresses and symbols
that must be updated in the link process: the instructions that refer to the
addresses of procedures A and B and the instructions that refer to the addresses
of data words X and Y.
Object file header       Name: Procedure A      Text size: 100hex      Data size: 20hex
Text segment             Address 0:  LDR r0, 0(r3)
                         Address 4:  BL 0
                         …
Data segment             0:  (X)
                         …
Relocation information   Address 0:  instruction type LDR, dependency X
                         Address 4:  instruction type BL,  dependency B
Symbol table             Label X, address –
                         Label B, address –

Object file header       Name: Procedure B      Text size: 200hex      Data size: 30hex
Text segment             Address 0:  STR r1, 0(r3)
                         Address 4:  BL 0
                         …
Data segment             0:  (Y)
                         …
Relocation information   Address 0:  instruction type STR, dependency Y
                         Address 4:  instruction type BL,  dependency A
Symbol table             Label Y, address –
                         Label A, address –
Procedure A needs to find the address for the variable labelled X to put in the
load instruction and to find the address of procedure B to place in the BL
instruction. Procedure B needs the address of the variable labelled Y for the
store instruction and the address of procedure A for its BL instruction.
From Figure 2.13 on page 120, we know that the text segment starts at
address 40 0000hex and the data segment at 1000 0000hex. The text of procedure A is placed at the first address and its data at the second. The object file
header for procedure A says that its text is 100hex bytes and its data is 20hex
bytes, so the starting address for procedure B text is 40 0100hex, and its data
starts at 1000 0020hex.
ANSWER
Executable file header   Text size: 300hex      Data size: 50hex
Text segment             Address             Instruction
                         0040 0000hex        LDR r0, 8000hex(r3)
                         0040 0004hex        BL 00 00F4hex
                         …                   …
                         0040 0100hex        STR r1, 8020hex(r3)
                         0040 0104hex        BL FF FEF4hex
                         …                   …
Data segment             Address
                         1000 0000hex        (X)
                         …
                         1000 0020hex        (Y)
                         …
Now the linker updates the address fields of the instructions. It uses the
instruction type field to know the format of the address to be edited. We have
two types here:
1. The BLs use PC-relative addressing. The BL at address 40 0004hex gets 40 0100hex (the address of procedure B) − (40 0004hex + 8) = 00 00F4hex in its address field. The BL at 40 0104hex needs 40 0000hex (the address of procedure A) − (40 0104hex + 8) = −10Chex. The two's complement representation is FF FEF4hex.
2. The load and store addresses are harder because they are relative to a base register. This example uses the global pointer as the base register. Assume that register r3 is initialized to 1000 8000hex. To get the address 1000 0000hex (the address of word X), we place 8000hex in the address field of LDR at address 40 0000hex. Similarly, we place 8020hex in the address field of STR at address 40 0100hex to get the address 1000 0020hex (the address of word Y).
Elaboration: Recall that ARM instructions are word aligned, so BL drops the right two bits to increase the instruction's address range. Thus, it uses 24 bits to create a 26-bit byte address. Hence, the actual value in the lower 24 bits of the BL instruction in this example is 00 003Dhex, rather than 00 00F4hex.
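A hedged sketch of that address-field computation (the helper name is ours): the byte offset is measured from the branch address plus 8, and the encoded field keeps the offset with its two low zero bits dropped.

#include <stdint.h>
#include <stdio.h>

static uint32_t bl_field(uint32_t branch_addr, uint32_t target_addr)
{
    uint32_t byte_offset = target_addr - (branch_addr + 8);  /* wraps around for backward branches */
    return (byte_offset >> 2) & 0x00FFFFFFu;                 /* 24-bit word offset */
}

int main(void)
{
    printf("%06X\n", bl_field(0x00400004u, 0x00400100u));    /* forward branch to procedure B  */
    printf("%06X\n", bl_field(0x00400104u, 0x00400000u));    /* backward branch to procedure A */
    return 0;
}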
Loader
loader A systems
program that places an
object program in main
memory so that it is ready
to execute.
Now that the executable file is on disk, the operating system reads it to memory and
starts it. The loader follows these steps in UNIX systems:
1. Reads the executable file header to determine size of the text and data
segments.
2. Creates an address space large enough for the text and data.
3. Copies the instructions and data from the executable file into memory.
4. Copies the parameters (if any) to the main program onto the stack.
5. Initializes the machine registers and sets the stack pointer to the first free
location.
6. Jumps to a start-up routine that copies the parameters into the argument registers and calls the main routine of the program. When the main
routine returns, the start-up routine terminates the program with an exit
system call.
Dynamically Linked Libraries
The first part of this section describes the traditional approach to linking libraries
before the program is run. Although this static approach is the fastest way to call
library routines, it has a few disadvantages:
dynamically linked
libraries (DLLs) Library
routines that are linked
to a program during
execution.
■ The library routines become part of the executable code. If a new version of the library is released that fixes bugs or supports new hardware devices, the statically linked program keeps using the old version.
■ It loads all routines in the library that are called anywhere in the executable, even if those calls are not executed. The library can be large relative to the program; for example, the standard C library is 2.5 MB.
These disadvantages lead to dynamically linked libraries (DLLs), where the
library routines are not linked and loaded until the program is run. Both the program and library routines keep extra information on the location of nonlocal procedures and their names. In the initial version of DLLs, the loader ran a dynamic
linker, using the extra information in the file to find the appropriate libraries and
to update all external references.
The downside of the initial version of DLLs was that it still linked all routines of the library that might be called, versus only those that are called during
the running of the program. This observation led to the lazy procedure linkage
version of DLLs, where each routine is linked only after it is called.
Like many innovations in our field, this trick relies on a level of indirection.
Figure 2.20 shows the technique. It starts with the nonlocal routines calling a set of
dummy routines at the end of the program, with one entry per nonlocal routine.
These dummy entries each contain an indirect jump.
The first time the library routine is called, the program calls the dummy entry
and follows the indirect jump. It points to code that puts a number in a register to
identify the desired library routine and then jumps to the dynamic linker/loader.
The linker/loader finds the desired routine, remaps it, and changes the address in
the indirect jump location to point to that routine. It then jumps to it. When the
a. First call to DLL routine          b. Subsequent calls to DLL routine

FIGURE 2.20 Dynamically linked library via lazy procedure linkage. (a) Steps for the first time a call is made to the DLL routine. (b) The steps to find the routine, remap it, and link it are skipped on subsequent calls. As we will see in Chapter 5, the operating system may avoid copying the desired routine by remapping it using virtual memory management.
routine completes, it returns to the original calling site. Thereafter, the call to the
library routine jumps indirectly to the routine without the extra hops.
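The level of indirection can be sketched in C with a function pointer (the names are ours and this is not an actual dynamic-linker API): the first call lands in a stub that resolves and patches the pointer, so later calls jump straight to the routine.

#include <stdio.h>

static void real_routine(void) { puts("library routine"); }

static void resolver_stub(void);                    /* forward declaration */
static void (*routine_ptr)(void) = resolver_stub;   /* the indirect-jump slot */

static void resolver_stub(void)
{
    routine_ptr = real_routine;   /* a dynamic linker would locate and remap it */
    routine_ptr();                /* then jump to the real routine */
}

int main(void)
{
    routine_ptr();   /* first call: pays the resolution overhead */
    routine_ptr();   /* later calls: a single indirect jump */
    return 0;
}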
In summary, DLLs require extra space for the information needed for dynamic
linking, but do not require that whole libraries be copied or linked. They pay a good
deal of overhead the first time a routine is called, but only a single indirect jump
thereafter. Note that the return from the library pays no extra overhead. Microsoft’s
Windows relies extensively on dynamically linked libraries, and it is also the default
when executing programs on UNIX systems today.
Starting a Java Program
Java bytecode
Instruction from an
instruction set designed to
interpret Java programs.
The discussion above captures the traditional model of executing a program, where
the emphasis is on fast execution time for a program targeted to a specific instruction set architecture, or even a specific implementation of that architecture. Indeed,
it is possible to execute Java programs just like C. Java was invented with a different
set of goals, however. One was to run safely on any computer, even if it might slow
execution time.
Figure 2.21 shows the typical translation and execution steps for Java. Rather
than compile to the assembly language of a target computer, Java is compiled first
to instructions that are easy to interpret: the Java bytecode instruction set (see
Section 2.15 on the CD). This instruction set is designed to be close to the
Java language so that this compilation step is trivial. Virtually no optimizations
are performed. Like the C compiler, the Java compiler checks the types of data and
produces the proper operation for each type. Java programs are distributed in the
binary version of these bytecodes.
Java program → Compiler → Class files (Java bytecodes) → Java Virtual Machine → Compiled Java methods (machine language); the Just In Time compiler and the Java library routines (machine language) are invoked by the JVM.

FIGURE 2.21 A translation hierarchy for Java. A Java program is first compiled into a binary version of Java bytecodes, with all addresses defined by the compiler. The Java program is now ready to run on the interpreter, called the Java Virtual Machine (JVM). The JVM links to desired methods in the Java library while the program is running. To achieve greater performance, the JVM can invoke the JIT compiler, which selectively compiles methods into the native machine language of the machine on which it is running.
A software interpreter, called a Java Virtual Machine (JVM), can execute Java
bytecodes. An interpreter is a program that simulates an instruction set architecture. For example, the ARM simulator used with this book is an interpreter. There
is no need for a separate assembly step since either the translation is so simple that
the compiler fills in the addresses or JVM finds them at runtime.
The upside of interpretation is portability. The availability of software Java virtual machines meant that most people could write and run Java programs shortly
after Java was announced. Today, Java virtual machines are found in hundreds of
millions of devices, in everything from cell phones to Internet browsers.
The downside of interpretation is lower performance. The incredible advances
in performance of the 1980s and 1990s made interpretation viable for many important applications, but the factor of 10 slowdown when compared to traditionally
compiled C programs made Java unattractive for some applications.
To preserve portability and improve execution speed, the next phase of Java
development was compilers that translated while the program was running. Such
Just In Time compilers (JIT) typically profile the running program to find where
the “hot” methods are and then compile them into the native instruction set on
which the virtual machine is running. The compiled portion is saved for the next
time the program is run, so that it can run faster each time it is run. This balance of
interpretation and compilation evolves over time, so that frequently run Java programs suffer little of the overhead of interpretation.
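A toy sketch of the hot-method idea (all names and the threshold are ours, not from the text): dispatch through a pointer, count calls in the slow path, and switch to a "compiled" version once the method proves hot.

#include <stdio.h>

static long slow_calls = 0;

static int compiled_square(int x)    { return x * x; }   /* stands in for native code */
static int interpreted_square(int x);

static int (*square)(int) = interpreted_square;          /* current dispatch target */

static int interpreted_square(int x)
{
    if (++slow_calls == 1000)         /* arbitrary hot threshold */
        square = compiled_square;     /* "compile" and patch the dispatch */
    return x * x;                     /* slow-path result */
}

int main(void)
{
    long total = 0;
    for (int i = 0; i < 10000; i++)
        total += square(i & 7);
    printf("%ld\n", total);
    return 0;
}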
As computers get faster so that compilers can do more, and as researchers invent better ways to compile Java on the fly, the performance gap between Java and C or C++ is closing. Section 2.15 on the CD goes into much greater depth on the implementation of Java, Java bytecodes, the JVM, and JIT compilers.
Java Virtual Machine (JVM) The program that interprets Java bytecodes.

Just In Time compiler (JIT) The name commonly given to a compiler that operates at runtime, translating the interpreted code segments into the native code of the computer.

Check Yourself
Which of the advantages of an interpreter over a translator do you think was most important for the designers of Java?
1. Ease of writing an interpreter
2. Better error messages
3. Smaller object code
4. Machine independence

2.13 A C Sort Example to Put It All Together
One danger of showing assembly language code in snippets is that you will have
no idea what a full assembly language program looks like. In this section, we derive
the ARM code from two procedures written in C: one to swap array elements and
one to sort them.
void swap(int v[], int k)
{
int temp;
temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;
}
FIGURE 2.22 A C procedure that swaps two locations in memory. This subsection uses this
procedure in a sorting example.
The Procedure swap
Let’s start with the code for the procedure swap in Figure 2.22. This procedure
simply swaps two locations in memory. When translating from C to assembly
language by hand, we follow these general steps:
1. Allocate registers to program variables.
2. Produce code for the body of the procedure.
3. Preserve registers across the procedure invocation.
This section describes the swap procedure in these three pieces, concluding by
putting all the pieces together.
Register Allocation for swap
As mentioned on pages 113–114, the ARM convention on parameter passing is to
use registers r0, r1, r2, and r3. Since swap has just two parameters, v and k,
they will be found in registers r0 and r1. The only other variable is temp, which
we associate with register r2 since swap is a leaf procedure (see page 116–117).
This register allocation corresponds to the variable declarations in the first part of
the swap procedure in Figure 2.22. We’ll also need two registers to hold temporary
calculations. We'll use r3 and r12, since the caller does not expect them to be preserved across a procedure call. To make the assembly language easier to read, we'll use the assembler directive that lets us associate names with registers.
v        RN 0    ; 1st argument address of v
k        RN 1    ; 2nd argument index k
temp     RN 2    ; local variable
temp2    RN 3    ; temporary for v[k+1]
vkAddr   RN 12   ; to hold address of v[k]
Code for the Body of the Procedure swap
The remaining lines of C code in swap are
temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;
Recall that the memory address for ARM refers to the byte address, and so words
are really 4 bytes apart. Hence we need to multiply the index k by 4 before adding
it to the address. Forgetting that sequential word addresses differ by 4 instead of by 1
is a common mistake in assembly language programming. Hence the first step is to
get the address of v[k] by multiplying k by 4 via a shift left by 2:
ADD
vkAddr, v, k, LSL #2; reg vkAddr = v + (k * 4)
; reg vkAddr has the address of v[k]
Now we load v[k] using vkAddr, and then v[k+1] by adding 4 to vkAddr:
LDR   temp, [vkAddr, #0]    ; temp = v[k]
LDR   temp2, [vkAddr, #4]   ; temp2 = v[k + 1]
                            ; refers to next element of v
Next we store temp and temp2 to the swapped addresses:
STR   temp2, [vkAddr, #0]   ; v[k] = temp2
STR   temp, [vkAddr, #4]    ; v[k+1] = temp
Now we have allocated registers and written the code to perform the operations
of the procedure. What is missing is the code for preserving the saved registers used
within swap. Since we are not using saved registers in this leaf procedure, there is
nothing to preserve.
The Full swap Procedure
We are now ready for the whole routine, which includes the procedure label and
the return jump. To make it easier to follow, we identify in Figure 2.23 each block
of code with its purpose in the procedure.
Procedure body
swap:  ADD  vkAddr, v, k, LSL #2   ; reg vkAddr = v + (k * 4)
                                   ; reg vkAddr has the address of v[k]
       LDR  temp, [vkAddr, #0]     ; temp = v[k]
       LDR  temp2, [vkAddr, #4]    ; temp2 = v[k + 1]
                                   ; refers to next element of v
       STR  temp2, [vkAddr, #0]    ; v[k] = temp2
       STR  temp, [vkAddr, #4]     ; v[k+1] = temp
Procedure return
       MOV  pc, lr                 ; return to calling routine

FIGURE 2.23 ARM assembly code of the procedure swap in Figure 2.22.
The Procedure sort
To ensure that you appreciate the rigor of programming in assembly language,
we’ll try a second, longer example. In this case, we’ll build a routine that calls the
swap procedure. This program sorts an array of integers, using bubble or exchange
sort, which is one of the simplest if not the fastest sorts. Figure 2.24 shows the C
version of the program. Once again, we present this procedure in several steps,
concluding with the full procedure.
void sort (int v[], int n)
{
  int i, j;
  for (i = 0; i < n; i += 1) {
    for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {
      swap(v,j);
    }
  }
}
FIGURE 2.24
A C procedure that performs a sort on the array v.
Register Allocation for sort
The two parameters of the procedure sort, v and n, are in the parameter registers r0 and r1, and we assign register r2 to local variable i and register r3 to j.
We’ll also need five more registers to hold values, which we’ll assign to r12 and r4
to r7.
To enhance code legibility, we’ll tell the assembler to rename the registers:
v        RN 0    ; 1st argument address of v
n        RN 1    ; 2nd argument index n
i        RN 2    ; local variable i
j        RN 3    ; local variable j
vjAddr   RN 12   ; to hold address of v[j]
vj       RN 4    ; to hold a copy of v[j]
vj1      RN 5    ; to hold a copy of v[j+1]
vcopy    RN 6    ; to hold a copy of v
ncopy    RN 7    ; to hold a copy of n
Code for the Body of the Procedure sort
The procedure body consists of two nested for loops and a call to swap that
includes parameters. Let’s unwrap the code from the outside to the middle.
The first translation step is the first for loop:
for (i = 0; i < n; i += 1) {
Recall that the C for statement has three parts: initialization, loop test, and iteration increment. It takes just one instruction to initialize i to 0, the first part of the
for statement:
MOV  i, #0    ; i = 0
(Remember that move is a pseudoinstruction provided by the assembler for the
convenience of the assembly language programmer; see page 136.) It also takes just
one instruction to increment i, the last part of the for statement:
ADD
i, i, #1
; i += 1
The loop should be exited if i < n is not true or, said another way, should be exited
if i ≥ n. This test takes two instructions:
for1tst: CMP  i, n      ; if i ≥ n
         BGE  exit1     ; go to exit1 if i ≥ n
The bottom of the loop just jumps back to the loop test:
B
for1tst
; branch to test of outer loop
exit1:
The skeleton code of the first for loop is then
         MOV  i, #0      ; i = 0
for1tst: CMP  i, n       ; if i ≥ n
         BGE  exit1      ; go to exit1 if i ≥ n
         ...
         (body of first for loop)
         ...
         ADD  i, i, #1   ; i += 1
         B    for1tst    ; branch to test of outer loop
exit1:
Voila! (The exercises explore writing faster code for similar loops.)
The second for loop looks like this in C:
for (j = i – 1; j >= 0 && v[j] > v[j + 1]; j –= 1) {
The initialization portion of this loop is again one instruction:
SUB
j, i, #1 ; j = i – 1
The decrement of j at the end of the loop is also one instruction:
SUB
j, j, #1 ; j –= 1
The loop test has two parts. We exit the loop if either condition fails, so the first test
must exit the loop if it fails (j < 0):
for2tst: CMP  j, #0     ; if j < 0
         BLT  exit2     ; go to exit2 if j < 0
This branch will skip over the second condition test. If it doesn’t skip, j ≥ 0.
The second test exits if v[j] > v[j + 1] is not true, or exits if v[j] ≤
v[j + 1]. First we create the address by multiplying j by 4 (since we need a byte
address) and add it to the base address of v:
ADD
vjAddr, v, j, LSL #2 ; reg vjAddr = v + (j * 4)
Now we load v[j]:
LDR  vj, [vjAddr, #0]    ; reg vj = v[j]

Since we know that the second element is just the following word, we add 4 to the address in register vjAddr to get v[j + 1]:

LDR  vj1, [vjAddr, #4]   ; reg vj1 = v[j + 1]
The test of v[j] ≤ v[j + 1] is next, so the two instructions of the exit test are
CMP  vj, vj1   ; if vj ≤ vj1
BLE  exit2     ; go to exit2 if vj ≤ vj1
The bottom of the loop jumps back to the inner loop test:
B
for2tst ; branch to test of inner loop
Combining the pieces, the skeleton of the second for loop looks like this:
         SUB  j, i, #1               ; j = i – 1
for2tst: CMP  j, #0                  ; if j < 0
         BLT  exit2                  ; go to exit2 if j < 0
         ADD  vjAddr, v, j, LSL #2   ; reg vjAddr = v + (j * 4)
         LDR  vj, [vjAddr, #0]       ; reg vj = v[j]
         LDR  vj1, [vjAddr, #4]      ; reg vj1 = v[j + 1]
         CMP  vj, vj1                ; if vj ≤ vj1
         BLE  exit2                  ; go to exit2 if vj ≤ vj1
         ...
         (body of second for loop)
         ...
         SUB  j, j, #1               ; j –= 1
         B    for2tst                ; branch to test of inner loop
exit2:
The Procedure Call in sort
The next step is the body of the second for loop:
swap(v,j);
Calling swap is easy enough:
BL  swap
Passing Parameters in sort
The problem comes when we want to pass parameters because the sort procedure needs the values in registers v and n, yet the swap procedure needs to have its
parameters placed in those same registers. One solution is to copy the parameters
for sort into other registers earlier in the procedure, making registers v and n
available for the call of swap. We first copy v and n into vcopy and ncopy during
the procedure:
MOV  vcopy, v   ; copy parameter v into vcopy (save r0)
MOV  ncopy, n   ; copy parameter n into ncopy (save r1)
Then we pass the parameters to swap with these two instructions:
MOV  r0, vcopy   ; first swap parameter is v
MOV  r1, j       ; second swap parameter is j
Preserving Registers in sort
The only remaining code is the saving and restoring of registers. Clearly, we must
save the return address in register lr, since sort is a procedure and is called itself.
The sort procedure also uses the saved registers i, j, vcopy, and ncopy, so they
must be saved. The prologue of the sort procedure is then
SUB  sp, sp, #20       ; make room on stack for 5 registers
STR  lr, [sp, #16]     ; save lr on stack
STR  ncopy, [sp, #12]  ; save ncopy on stack
STR  vcopy, [sp, #8]   ; save vcopy on stack
STR  j, [sp, #4]       ; save j on stack
STR  i, [sp, #0]       ; save i on stack
The tail of the procedure simply reverses all these instructions, then adds a MOV
pc,lr to return.
The Full Procedure sort
Now we put all the pieces together in Figure 2.25. Once again, to make the code
easier to follow, we identify each block of code with its purpose in the procedure.
In this example, nine lines of the sort procedure in C became 32 lines in the ARM
assembly language.
Elaboration: One optimization that works with this example is procedure inlining.
Instead of passing arguments in parameters and invoking the code with a BL instruction, the compiler would copy the code from the body of the swap procedure where the
call to swap appears in the code. Inlining would avoid four instructions in this example.
The downside of the inlining optimization is that the compiled code would be bigger if
the inlined procedure is called from several locations. Such a code expansion might turn
into lower performance if it increased the cache miss rate; see Chapter 5.
Saving registers
sort:    SUB  sp, sp, #20             ; make room on stack for 5 registers
         STR  lr, [sp, #16]           ; save lr on stack
         STR  ncopy, [sp, #12]        ; save ncopy on stack
         STR  vcopy, [sp, #8]         ; save vcopy on stack
         STR  j, [sp, #4]             ; save j on stack
         STR  i, [sp, #0]             ; save i on stack
Procedure body
Move parameters
         MOV  vcopy, v                ; copy parameter v into vcopy (save r0)
         MOV  ncopy, n                ; copy parameter n into ncopy (save r1)
Outer loop
         MOV  i, #0                   ; i = 0
for1tst: CMP  i, n                    ; if i ≥ n
         BGE  exit1                   ; go to exit1 if i ≥ n
Inner loop
         SUB  j, i, #1                ; j = i – 1
for2tst: CMP  j, #0                   ; if j < 0
         BLT  exit2                   ; go to exit2 if j < 0
         ADD  vjAddr, v, j, LSL #2    ; reg vjAddr = v + (j * 4)
         LDR  vj, [vjAddr, #0]        ; reg vj = v[j]
         LDR  vj1, [vjAddr, #4]       ; reg vj1 = v[j + 1]
         CMP  vj, vj1                 ; if vj ≤ vj1
         BLE  exit2                   ; go to exit2 if vj ≤ vj1
Pass parameters and call
         MOV  r0, vcopy               ; first swap parameter is v
         MOV  r1, j                   ; second swap parameter is j
         BL   swap                    ; swap code shown in Figure 2.23
Inner loop
         SUB  j, j, #1                ; j –= 1
         B    for2tst                 ; branch to test of inner loop
Outer loop
exit2:   ADD  i, i, #1                ; i += 1
         B    for1tst                 ; branch to test of outer loop
Restoring registers
exit1:   LDR  i, [sp, #0]             ; restore i from stack
         LDR  j, [sp, #4]             ; restore j from stack
         LDR  vcopy, [sp, #8]         ; restore vcopy from stack
         LDR  ncopy, [sp, #12]        ; restore ncopy from stack
         LDR  lr, [sp, #16]           ; restore lr from stack
         ADD  sp, sp, #20             ; restore stack pointer
Procedure return
         MOV  pc, lr                  ; return to calling routine

FIGURE 2.25 ARM assembly version of procedure sort in Figure 2.24.
Elaboration: ARM includes instructions to save and restore registers at procedure
call boundaries. Store Multiple (STM) and Load Multiple (LDM) can store and load up to
16 registers, which are specified by a bit mask in the instruction. Thus, the 5 stores and
the SUB could be replaced by the instruction STM sp!, {i, j, ncopy, vcopy, lr} since the
stack pointer would be updated by the pre-index addressing mode.
Understanding Program Performance

Figure 2.26 shows the impact of compiler optimization on sort program performance, compile time, clock cycles, instruction count, and CPI. Note that unoptimized code has the best CPI, and O1 optimization has the lowest instruction count, but O3 is the fastest, reminding us that time is the only accurate measure of program performance.

Figure 2.27 compares the impact of programming languages, compilation versus interpretation, and algorithms on performance of sorts. The fourth column shows that the unoptimized C program is 8.3 times faster than the interpreted Java code for Bubble Sort. Using the JIT compiler makes Java 2.1 times faster than the unoptimized C and within a factor of 1.13 of the highest optimized C code. (Section 2.15 on the CD gives more details on interpretation versus compilation of Java and the Java and ARM code for Bubble Sort.) The ratios aren't as close for Quicksort in Column 5, presumably because it is harder to amortize the cost of runtime compilation over the shorter execution time. The last column demonstrates the impact of a better algorithm, offering a performance increase of three orders of magnitude when sorting 100,000 items. Even comparing interpreted Java in Column 5 to the C compiler at highest optimization in Column 4, Quicksort beats Bubble Sort by a factor of 50 (0.05 × 2468, or 123 times faster than the unoptimized C code versus 2.41 times faster).
Elaboration: The ARM compilers always save room on the stack for the arguments
in case they need to be stored, so in reality they always decrement sp by 16 to make
room for all four argument registers (16 bytes). One reason is that C provides a vararg
option that allows a pointer to pick, say, the third argument to a procedure. When the
compiler encounters the rare vararg, it copies the four argument registers onto the
stack into the four reserved locations.
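A small illustration of the vararg feature mentioned above (the example is ours): the callee walks its argument list at run time, which is why the compiler must be able to find the arguments in memory and not only in registers.

#include <stdarg.h>
#include <stdio.h>

static int sum(int count, ...)
{
    va_list args;
    va_start(args, count);
    int total = 0;
    for (int i = 0; i < count; i++)
        total += va_arg(args, int);   /* fetch the next int argument */
    va_end(args);
    return total;
}

int main(void)
{
    printf("%d\n", sum(3, 10, 20, 30));   /* prints 60 */
    return 0;
}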
gcc optimization             Relative      Clock cycles   Instruction count   CPI
                             performance   (millions)     (millions)
None                         1.00          158,615        114,938             1.38
O1 (medium)                  2.37           66,990         37,470             1.79
O2 (full)                    2.38           66,521         39,993             1.66
O3 (procedure integration)   2.41           65,747         44,993             1.46

FIGURE 2.26 Comparing performance, instruction count, and CPI using compiler optimization for Bubble Sort. The programs sorted 100,000 words with the array initialized to random values. These programs were run on a Pentium 4 with a clock rate of 3.06 GHz and a 533 MHz system bus with 2 GB of PC2100 DDR SDRAM. It used Linux version 2.4.20.
Language   Execution method   Optimization   Bubble Sort relative   Quicksort relative   Speedup Quicksort
                                             performance            performance          vs. Bubble Sort
C          Compiler           None           1.00                   1.00                 2468
           Compiler           O1             2.37                   1.50                 1562
           Compiler           O2             2.38                   1.50                 1555
           Compiler           O3             2.41                   1.91                 1955
Java       Interpreter        –              0.12                   0.05                 1050
           JIT compiler       –              2.13                   0.29                  338

FIGURE 2.27 Performance of two sort algorithms in C and Java using interpretation and optimizing compilers relative to unoptimized C version. The last column shows the advantage in performance of Quicksort over Bubble Sort for each language and execution option. These programs were run on the same system as Figure 2.26. The JVM is Sun version 1.3.1, and the JIT is Sun Hotspot version 1.3.1.
2.14 Arrays versus Pointers
A challenge for any new C programmer is understanding pointers. Comparing
assembly code that uses arrays and array indices to the assembly code that uses
pointers offers insights about pointers. This section shows C and ARM assembly
versions of two procedures to clear a sequence of words in memory: one using
array indices and one using pointers. Figure 2.28 shows the two C procedures.
The purpose of this section is to show how pointers map into ARM instructions,
and not to endorse a dated programming style. We’ll see the impact of modern
compiler optimization on these two procedures at the end of the section.
Array Version of Clear
Let’s start with the array version, clear1, focusing on the body of the loop and
ignoring the procedure linkage code. We assume that the two parameters array and
size are found in the registers r0 and r1, and that i is allocated to register r2. We
start by renaming registers.
array   RN 0   ; 1st argument address of array
size    RN 1   ; 2nd argument size (of array)
i       RN 2   ; local variable i
zero    RN 3   ; temporary to hold constant 0
The initialization of i, the first part of the for loop, is straightforward:
MOV  i, #0    ; i = 0
We also need to put 0 into the temporary register so that we can write it to
memory:
MOV  zero, #0   ; zero = 0
clear1(int array[], int size)
{
int i;
for (i = 0; i < size; i += 1)
array[i] = 0;
}
clear2(int *array, int size)
{
int *p;
for (p = &array[0]; p < &array[size]; p = p + 1)
*p = 0;
}
FIGURE 2.28 Two C procedures for setting an array to all zeros. Clear1 uses indices, while
clear2 uses pointers. The second procedure needs some explanation for those unfamiliar with C. The
address of a variable is indicated by &, and the object pointed to by a pointer is indicated by *. The declarations declare that array and p are pointers to integers. The first part of the for loop in clear2 assigns
the address of the first element of array to the pointer p. The second part of the for loop tests to see if the
pointer is pointing beyond the last element of array. Incrementing a pointer by one, in the last part of the
for loop, means moving the pointer to the next sequential object of its declared size. Since p is a pointer to
integers, the compiler will generate ARM instructions to increment p by four, the number of bytes in a ARM
integer. The assignment in the loop places 0 in the object pointed to by p.
To set array[i] to 0 we must first get its address. We can use the scaled register offset addressing mode to multiply i by 4 to get the byte offset and then add it to the base address of the array to get the address of array[i]:
loop1:STR
zero, [array,i, LSL #2] ; array[i] = 0
This instruction is the end of the body of the loop, so the next step is to increment i:
ADD
i,i,#1
; i = i + 1
The loop test checks if i is less than size:
CMP  i, size   ; i < size
BLT  loop1     ; if (i < size) go to loop1
We have now seen all the pieces of the procedure. Here is the ARM code for
clearing an array using indices:
        MOV  i, #0                      ; i = 0
        MOV  zero, #0                   ; zero = 0
loop1:  STR  zero, [array, i, LSL #2]   ; array[i] = 0
        ADD  i, i, #1                   ; i = i + 1
        CMP  i, size                    ; i < size
        BLT  loop1                      ; if (i < size) go to loop1
(This code works as long as size is greater than 0; ANSI C requires a test of size
before the loop, but we’ll skip that legality here.)
Pointer Version of Clear
The second procedure that uses pointers allocates the two parameters array and
size to the registers r0 and r1 and allocates p to register r2, renaming the registers to do this:
array       RN 0    ; 1st argument address of array
size        RN 1    ; 2nd argument size (of array)
p           RN 2    ; local variable p
zero        RN 3    ; temporary to hold constant 0
arraySize   RN 12   ; address of array[size]
The code for the second procedure starts with assigning the pointer p to the address
of the first element of the array and setting zero:
MOV  p, array   ; p = address of array[0]
MOV  zero, #0   ; zero = 0
The next code is the body of the for loop, which simply stores 0 into memory
pointed to by p. We’ll use the immediate post-indexed addressing mode to increment p. Incrementing a pointer by 1 means moving the pointer to the next sequential object in C. Since p is a pointer to integers, each of which uses 4 bytes, the
compiler increments p by 4.
loop2: STR
zero,[p],#4
; Memory[p] = 0; p = p + 4
The loop test is next. The first step is calculating the address of the last element
of array. Start with multiplying size by 4 to get its byte address and then we add
the product to the starting address of the array to get the address of the first word
after the array:
ADD
arraySize,array,size,LSL #2 ; arraySize = address
; of array[size]
The loop test is simply to see if p is less than the last element of array:
CMP  p, arraySize   ; p < &array[size]
BLT  loop2          ; if (p < &array[size]) go to loop2
With all the pieces completed, we can show a pointer version of the code to zero
an array:
        MOV  p, array                         ; p = address of array[0]
        MOV  zero, #0                         ; zero = 0
loop2:  STR  zero, [p], #4                    ; Memory[p] = 0; p = p + 4
        ADD  arraySize, array, size, LSL #2   ; arraySize = address of array[size]
        CMP  p, arraySize                     ; p < &array[size]
        BLT  loop2                            ; if (p < &array[size]) go to loop2
As in the first example, this code assumes size is greater than 0.
Note that this program calculates the address of the end of the array in every
iteration of the loop, even though it does not change. A faster version of the code
moves this calculation outside the loop:
MOV  p, array                         ; p = address of array[0]
MOV  zero, #0                         ; zero = 0
ADD  arraySize, array, size, LSL #2   ; arraySize = address of array[size]
loop2:STR zero,[p],#4 ; Memory[p] = 0; p = p + 4
CMP  p, arraySize   ; p < &array[size]
BLT  loop2          ; if (p < &array[size]) go to loop2
Comparing the Two Versions of Clear
Comparing the two code sequences illustrates the difference between array indices and pointers (the array version is shown first, followed by the pointer version):

Array version:
        MOV  i, #0                            ; i = 0
        MOV  zero, #0                         ; zero = 0
loop1:  STR  zero, [array, i, LSL #2]         ; array[i] = 0
        ADD  i, i, #1                         ; i = i + 1
        CMP  i, size                          ; i < size
        BLT  loop1                            ; if (i < size) go to loop1

Pointer version:
        MOV  p, array                         ; p = &array[0]
        MOV  zero, #0                         ; zero = 0
        ADD  arraySize, array, size, LSL #2   ; arraySize = &array[size]
loop2:  STR  zero, [p], #4                    ; Memory[p] = 0, p = p + 4
        CMP  p, arraySize                     ; p < &array[size]
        BLT  loop2                            ; if (p < &array[size]) go to loop2
The array version must have the "multiply" and add inside the loop because i is incremented and each address must be recalculated from the new index. However, the scaled register offset addressing mode hides that extra work. The pointer version increments the pointer p directly via the post-indexed immediate addressing mode. The pointer version also moves the end-of-array calculation outside the loop, thereby reducing the instructions executed per iteration from 4 to 3. This manual optimization corresponds to the compiler optimizations of strength reduction (shift instead of multiply) and induction variable elimination (eliminating array address calculations within loops). Section 2.15 on the CD describes these two and many other optimizations.
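For readers who want to see those two optimizations at the source level, here is a hedged C sketch of the transformation they perform; the function names are illustrative, and a real compiler applies them to its intermediate representation rather than to the C text:

/* As written: every iteration scales i by 4 to form the address array + 4*i. */
void clear_indexed(int *array, int size)
{
    int i;
    for (i = 0; i < size; i = i + 1)
        array[i] = 0;
}

/* Roughly what strength reduction plus induction variable elimination yield:
   the scaled index disappears and a pointer is simply bumped past each word. */
void clear_reduced(int *array, int size)
{
    int *p   = array;
    int *end = array + size;      /* loop bound computed once, outside the loop */
    while (p < end) {
        *p = 0;
        p = p + 1;                /* replaces the per-iteration 4*i calculation */
    }
}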
Elaboration: As mentioned earlier, a C compiler would add a test to be sure that size is greater than 0. One way would be to add a branch just before the first instruction of the loop that jumps to the CMP instruction.
Understanding Program Performance

People used to be taught to use pointers in C to get greater efficiency than that available with arrays: "Use pointers, even if you can't understand the code." Modern optimizing compilers can produce code for the array version that is just as good. Most programmers today prefer that the compiler do the heavy lifting.
2.15  Advanced Material: Compiling C and Interpreting Java

object oriented language  A programming language that is oriented around objects rather than actions, or data versus logic.
This section gives a brief overview of how the C compiler works and how Java is
executed. Because the compiler will significantly affect the performance of a computer, understanding compiler technology today is critical to understanding performance. Keep in mind that the subject of compiler construction is usually taught
in a one- or two-semester course, so our introduction will necessarily only touch
on the basics.
The second part of this section is for readers interested in seeing how an object-oriented language like Java executes on an ARM architecture. It shows
the Java bytecodes used for interpretation and the ARM code for the Java version
of some of the C segments in prior sections, including Bubble Sort. It covers both
the Java Virtual Machine and JIT compilers.
The rest of this section is on the CD.
2.16  Real Stuff: MIPS Instructions
While not as popular as ARM, MIPS is an elegant instruction set that came out the same year as ARM and was inspired by similar philosophies. Figure 2.29 lists the similarities. The principal differences are that MIPS has more registers and ARM has more addressing modes.
There is a similar core of instruction sets for arithmetic-logical and data transfer
instructions for MIPS and ARM, as Figure 2.30 shows.
                                          ARM                                      MIPS
Date announced                            1985                                     1985
Instruction size (bits)                   32                                       32
Address space (size, model)               26 bits (32 starting with ARMv3), flat   32 bits, flat
Data alignment                            Aligned                                  Aligned
Data addressing modes                     9                                        3
Integer registers (number, model, size)   15 GPR × 32 bits                         31 GPR × 32 bits
I/O                                       Memory mapped                            Memory mapped

FIGURE 2.29 Similarities in ARM and MIPS instruction sets.
Instruction name                          ARM                    MIPS
Register-register
  Add                                     ADD                    addu, addiu
  Add (trap if overflow)                  ADDS; SWIVS            add
  Subtract                                SUB                    subu
  Subtract (trap if overflow)             SUBS; SWIVS            sub
  Multiply                                MUL                    mult, multu
  Divide                                  —                      div, divu
  And                                     AND                    and
  Or                                      ORR                    or
  Xor                                     EOR                    xor
  Load high part register                 MOVT                   lui
  Shift left logical                      LSL1                   sllv, sll
  Shift right logical                     LSR1                   srlv, srl
  Shift right arithmetic                  ASR1                   srav, sra
  Compare                                 CMP, CMN, TST, TEQ     slt/i, slt/iu
Data transfer
  Load byte signed                        LDRSB                  lb
  Load byte unsigned                      LDRB                   lbu
  Load halfword signed                    LDRSH                  lh
  Load halfword unsigned                  LDRH                   lhu
  Load word                               LDR                    lw
  Store byte                              STRB                   sb
  Store halfword                          STRH                   sh
  Store word                              STR                    sw
  Read, write special registers           MRS, MSR               move
  Atomic Exchange                         SWP, SWPB              ll; sc

FIGURE 2.30 ARM register-register and data transfer instructions equivalent to MIPS core. Dashes mean the operation is not available in that architecture or not synthesized in a few instructions. If there are several choices of instructions equivalent to the ARM core, they are separated by commas. ARM includes shifts as part of every data operation instruction, so the shifts with superscript 1 are just a variation of a move instruction, such as LSR1. Note that ARMv7M has divide instructions.
Data addressing mode                           ARM    MIPS
Register operand                               X      X
Immediate operand                              X      X
Register + offset (displacement or based)      X      X
Register + register (indexed)                  X      —
Register + scaled register (scaled)            X      —
Register + offset and update register          X      —
Register + register and update register        X      —
Autoincrement, autodecrement                   X      —
PC-relative data                               X      —

FIGURE 2.31 Summary of data addressing modes in ARM vs. MIPS. MIPS has three basic addressing modes. The remaining six ARM addressing modes would require another instruction to calculate the address in MIPS.
Addressing Modes
Figure 2.31 shows the data addressing modes supported by ARM. MIPS has just
three simple data addressing modes while ARM has nine, including fairly complex calculations. MIPS would require extra instructions to perform those address
calculations.
Compare and Conditional Branch
Instead of using condition codes that are set as a side effect of an arithmetic or logical instruction, as in ARM, MIPS uses the contents of registers to evaluate conditional branches. Comparisons are done with the instruction Set on Less Than (slt), which sets the contents of a register to 1 if the first operand is less than the second operand, and to 0 otherwise. The MIPS instructions Branch Equal (beq) and Branch Not Equal (bne) compare the values of two registers before deciding to branch. The comparison to zero comes for free, since one of the 32 MIPS registers, called $zero, always has the value 0.
Every ARM instruction has the option of executing conditionally, depending on
the condition codes. The MIPS instructions do not have such a field.
Figure 2.32 shows the instruction formats for ARM and MIPS. The principal
differences are the 4-bit conditional execution field in every instruction and the
smaller register field, because ARM has half the number of registers.
The MIPS 16-bit immediate field is either zero extended to 32 bits or sign extended to 32 bits depending on the instruction, and doesn't have the unusual bit-twiddling features of ARM's 12-bit immediate field.
Figure 2.33 shows a few of the ARM arithmetic-logical instructions not found in
MIPS. Since ARM does not have a dedicated register for 0, it has separate opcodes
to perform some operations that MIPS can do with $zero. MIPS instructions
are generally a subset of the ARM instruction set, although MIPS does provide a
Divide instruction (div). Figure 2.33.5 shows the MIPS instruction set used in
Chapters 4 to 6.
[Figure 2.32 diagrams the ARM and MIPS encodings for four instruction classes: register-register, data transfer, branch, and jump/call, showing the opcode, register, and constant fields of each format and their bit positions.]

FIGURE 2.32 Instruction formats for ARM and MIPS. The differences result from whether the architecture has 16 or 32 registers and whether it has the 4-bit conditional execution field.
Name                                Definition                               ARM v.6      MIPS
Load immediate                      Rd = Imm                                 MOV          addi, $0,
Not                                 Rd = ~(Rs1)                              MVN          nor, $0,
Move                                Rd = Rs1                                 MOV          or, $0,
Rotate right                        Rd = Rs >> i,                            ROR
                                    Rd(0..i–1) = Rs(31–i..31)
And not                             Rd = Rs1 & ~(Rs2)                        BIC
Reverse subtract                    Rd = Rs2 – Rs1                           RSB, RSC
Support for multiword integer add   CarryOut, Rd = Rd + Rs1 + OldCarryOut    ADCS         —
Support for multiword integer sub   CarryOut, Rd = Rd – Rs1 + OldCarryOut    SBCS         —

FIGURE 2.33 ARM arithmetic/logical instructions not found in MIPS.
MIPS operands

Name            Example                                   Comments
32 registers    $s0–$s7, $t0–$t9, $zero, $a0–$a3,         Fast locations for data. In MIPS, data must be in registers to perform
                $v0–$v1, $gp, $fp, $sp, $ra, $at          arithmetic, register $zero always equals 0, and register $at is
                                                          reserved by the assembler to handle large constants.
2^30 memory     Memory[0], Memory[4], . . . ,             Accessed only by data transfer instructions. MIPS uses byte addresses,
words           Memory[4294967292]                        so sequential word addresses differ by 4. Memory holds data
                                                          structures, arrays, and spilled registers.

MIPS assembly language

Category        Instruction             Example              Meaning                                Comments
Arithmetic      add                     add   $s1,$s2,$s3    $s1 = $s2 + $s3                        Three register operands
                subtract                sub   $s1,$s2,$s3    $s1 = $s2 – $s3                        Three register operands
                add immediate           addi  $s1,$s2,20     $s1 = $s2 + 20                         Used to add constants
Data            load word               lw    $s1,20($s2)    $s1 = Memory[$s2 + 20]                 Word from memory to register
transfer        store word              sw    $s1,20($s2)    Memory[$s2 + 20] = $s1                 Word from register to memory
                load half               lh    $s1,20($s2)    $s1 = Memory[$s2 + 20]                 Halfword memory to register
                load half unsigned      lhu   $s1,20($s2)    $s1 = Memory[$s2 + 20]                 Halfword memory to register
                store half              sh    $s1,20($s2)    Memory[$s2 + 20] = $s1                 Halfword register to memory
                load byte               lb    $s1,20($s2)    $s1 = Memory[$s2 + 20]                 Byte from memory to register
                load byte unsigned      lbu   $s1,20($s2)    $s1 = Memory[$s2 + 20]                 Byte from memory to register
                store byte              sb    $s1,20($s2)    Memory[$s2 + 20] = $s1                 Byte from register to memory
                load linked word        ll    $s1,20($s2)    $s1 = Memory[$s2 + 20]                 Load word as 1st half of atomic swap
                store condition. word   sc    $s1,20($s2)    Memory[$s2+20]=$s1; $s1=0 or 1         Store word as 2nd half of atomic swap
                load upper immed.       lui   $s1,20         $s1 = 20 * 2^16                        Loads constant in upper 16 bits
Logical         and                     and   $s1,$s2,$s3    $s1 = $s2 & $s3                        Three reg. operands; bit-by-bit AND
                or                      or    $s1,$s2,$s3    $s1 = $s2 | $s3                        Three reg. operands; bit-by-bit OR
                nor                     nor   $s1,$s2,$s3    $s1 = ~ ($s2 | $s3)                    Three reg. operands; bit-by-bit NOR
                and immediate           andi  $s1,$s2,20     $s1 = $s2 & 20                         Bit-by-bit AND reg with constant
                or immediate            ori   $s1,$s2,20     $s1 = $s2 | 20                         Bit-by-bit OR reg with constant
                shift left logical      sll   $s1,$s2,10     $s1 = $s2 << 10                        Shift left by constant
                shift right logical     srl   $s1,$s2,10     $s1 = $s2 >> 10                        Shift right by constant
Conditional     branch on equal         beq   $s1,$s2,25     if ($s1 == $s2) go to PC + 4 + 100     Equal test; PC-relative branch
branch          branch on not equal     bne   $s1,$s2,25     if ($s1 != $s2) go to PC + 4 + 100     Not equal test; PC-relative
                set on less than        slt   $s1,$s2,$s3    if ($s2 < $s3) $s1 = 1; else $s1 = 0   Compare less than; for beq, bne
                set on less than        sltu  $s1,$s2,$s3    if ($s2 < $s3) $s1 = 1; else $s1 = 0   Compare less than unsigned
                  unsigned
                set less than           slti  $s1,$s2,20     if ($s2 < 20) $s1 = 1; else $s1 = 0    Compare less than constant
                  immediate
                set less than           sltiu $s1,$s2,20     if ($s2 < 20) $s1 = 1; else $s1 = 0    Compare less than constant unsigned
                  immediate unsigned
Unconditional   jump                    j     2500           go to 10000                            Jump to target address
jump            jump register           jr    $ra            go to $ra                              For switch, procedure return
                jump and link           jal   2500           $ra = PC + 4; go to 10000              For procedure call

MIPS assembly language.
2.17  Real Stuff: x86 Instructions
Designers of instruction sets sometimes provide more powerful operations than
those found in ARM and MIPS. The goal is generally to reduce the number of
instructions executed by a program. The danger is that this reduction can occur at
the cost of simplicity, increasing the time a program takes to execute because the
instructions are slower. This slowness may be the result of a slower clock cycle time
or of requiring more clock cycles than a simpler sequence.
The path toward operation complexity is thus fraught with peril. To avoid these
problems, designers have moved toward simpler instructions. Section 2.18 demonstrates the pitfalls of complexity.
Beauty is altogether in the eye of the beholder.
Margaret Wolfe Hungerford, Molly Bawn, 1877
Evolution of the Intel x86
ARM and MIPS were the vision of single small groups in 1985; the pieces of these
architectures fit nicely together, and the whole architecture can be described succinctly. Such is not the case for the x86; it is the product of several independent
groups who evolved the architecture over 30 years, adding new features to the
original instruction set as someone might add clothing to a packed bag. Here are
important x86 milestones.
■
1978: The Intel 8086 architecture was announced as an assembly language–
compatible extension of the then successful Intel 8080, an 8-bit microprocessor. The 8086 is a 16-bit architecture, with all internal registers 16 bits wide.
Unlike ARM and MIPS, the registers have dedicated uses, and hence the 8086
is not considered a general-purpose register architecture.
■
1980: The Intel 8087 floating-point coprocessor is announced. This architecture extends the 8086 with about 60 floating-point instructions. Instead of
using registers, it relies on a stack (see Section 2.20 and Section 3.7).
■
1982: The 80286 extended the 8086 architecture by increasing the address
space to 24 bits, by creating an elaborate memory-mapping and protection
model (see Chapter 5), and by adding a few instructions to round out the
instruction set and to manipulate the protection model.
■
1985: The 80386 extended the 80286 architecture to 32 bits. In addition to
a 32-bit architecture with 32-bit registers and a 32-bit address space, the
80386 added new addressing modes and additional operations. The added
instructions make the 80386 nearly a general-purpose register machine. The
80386 also added paging support in addition to segmented addressing (see
Chapter 5). Like the 80286, the 80386 has a mode to execute 8086 programs
without change.
general-purpose register (GPR)  A register that can be used for addresses or for data with virtually any instruction.
■
1989–95: The subsequent 80486 in 1989, Pentium in 1992, and Pentium
Pro in 1995 were aimed at higher performance, with only four instructions
added to the user-visible instruction set: three to help with multiprocessing
(Chapter 7) and a conditional move instruction.
■
1997: After the Pentium and Pentium Pro were shipping, Intel announced
that it would expand the Pentium and the Pentium Pro architectures with
MMX (Multi Media Extensions). This new set of 57 instructions uses the
floating-point stack to accelerate multimedia and communication applications. MMX instructions typically operate on multiple short data elements
at a time, in the tradition of single instruction, multiple data (SIMD) architectures (see Chapter 7). Pentium II did not introduce any new instructions.
■
1999: Intel added another 70 instructions, labelled SSE (Streaming SIMD
Extensions) as part of Pentium III. The primary changes were to add eight
separate registers, double their width to 128 bits, and add a single precision
floating-point data type. Hence, four 32-bit floating-point operations can be
performed in parallel. To improve memory performance, SSE includes cache
prefetch instructions plus streaming store instructions that bypass the caches
and write directly to memory.
■
2001: Intel added yet another 144 instructions, this time labelled SSE2. The
new data type is double precision arithmetic, which allows pairs of 64-bit
floating-point operations in parallel. Almost all of these 144 instructions are
versions of existing MMX and SSE instructions that operate on 64 bits of
data in parallel. Not only does this change enable more multimedia operations, it gives the compiler a different target for floating-point operations
than the unique stack architecture. Compilers can choose to use the eight SSE
registers as floating-point registers like those found in other computers. This
change boosted the floating-point performance of the Pentium 4, the first
microprocessor to include SSE2 instructions.
■
2003: A company other than Intel enhanced the x86 architecture this time.
AMD announced a set of architectural extensions to increase the address
space from 32 to 64 bits. Similar to the transition from a 16- to 32-bit address
space in 1985 with the 80386, AMD64 widens all registers to 64 bits. It also
increases the number of registers to 16 and increases the number of 128-bit SSE registers to 16. The primary ISA change comes from adding a new
mode called long mode that redefines the execution of all x86 instructions
with 64-bit addresses and data. To address the larger number of registers, it
adds a new prefix to instructions. Depending how you count, long mode also
adds four to ten new instructions and drops 27 old ones. PC-relative data
addressing is another extension. AMD64 still has a mode that is identical
to x86 (legacy mode) plus a mode that restricts user programs to x86 but
allows operating systems to use AMD64 (compatibility mode). These modes
allow a more graceful transition to 64-bit addressing than the HP/Intel IA-64
architecture.
■
2004: Intel capitulates and embraces AMD64, relabeling it Extended Memory
64 Technology (EM64T). The major difference is that Intel added a 128-bit
atomic compare and swap instruction, which probably should have been
included in AMD64. At the same time, Intel announced another generation
of media extensions. SSE3 adds 13 instructions to support complex arithmetic,
graphics operations on arrays of structures, video encoding, floating-point
conversion, and thread synchronization (see Section 2.11). AMD will offer
SSE3 in subsequent chips and it will almost certainly add the missing atomic
swap instruction to AMD64 to maintain binary compatibility with Intel.
■
2006: Intel announces 54 new instructions as part of the SSE4 instruction set
extensions. These extensions perform tweaks like sum of absolute differences,
dot products for arrays of structures, sign or zero extension of narrow data to
wider sizes, population count, and so on. They also added support for virtual
machines (see Chapter 5).
■
2007: AMD announces 170 instructions as part of SSE5, including 46 instructions of the base instruction set that add three-operand instructions like those of ARM and MIPS.
■
2008: Intel announces the Advanced Vector Extension that expands the SSE
register width from 128 to 256 bits, thereby redefining about 250 instructions
and adding 128 new instructions.
This history illustrates the impact of the “golden handcuffs” of compatibility on
the x86, as the existing software base at each step was too important to jeopardize
with significant architectural changes. If you looked over the life of the x86, on
average the architecture has been extended by one instruction per month!
Whatever the artistic failures of the x86, keep in mind that there are more instances
of this architectural family on desktop computers than of any other architecture,
increasing by more than 250 million per year. Nevertheless, this checkered ancestry
has led to an architecture that is difficult to explain and impossible to love.
Brace yourself for what you are about to see! Do not try to read this section
with the care you would need to write x86 programs; the goal instead is to give you
familiarity with the strengths and weaknesses of the world’s most popular desktop
architecture.
Rather than show the entire 16-bit and 32-bit instruction set, in this section we
concentrate on the 32-bit subset that originated with the 80386, as this portion of
the architecture is what is used today. We start our explanation with the registers
and addressing modes, move on to the integer operations, and conclude with an
examination of instruction encoding.
x86 Registers and Data Addressing Modes
The registers of the 80386 show the evolution of the instruction set (Figure 2.34). The
80386 extended all 16-bit registers (except the segment registers) to 32 bits, prefixing
an E to their name to indicate the 32-bit version. We’ll refer to them generically as
Name      Use
EAX       GPR 0
ECX       GPR 1
EDX       GPR 2
EBX       GPR 3
ESP       GPR 4
EBP       GPR 5
ESI       GPR 6
EDI       GPR 7
CS        Code segment pointer
SS        Stack segment pointer (top of stack)
DS        Data segment pointer 0
ES        Data segment pointer 1
FS        Data segment pointer 2
GS        Data segment pointer 3
EIP       Instruction pointer (PC)
EFLAGS    Condition codes

FIGURE 2.34 The 80386 register set. Starting with the 80386, the top eight registers were extended to 32 bits and could also be used as general-purpose registers.
GPRs (general-purpose registers). The 80386 contains only eight GPRs. This means
MIPS programs can use four times as many and ARM twice as many.
Figure 2.35 shows that the arithmetic, logical, and data transfer instructions are
two-operand instructions. There are two important differences here. The x86
arithmetic and logical instructions must have one operand act as both a source
and a destination; ARM and MIPS allow separate registers for source and destination. This restriction puts more pressure on the limited registers, since one source
register must be modified. The second important difference is that one of the operands can be in memory. Thus, virtually any instruction may have one operand in
memory, unlike ARM and MIPS.
Source/destination operand type     Second source operand
Register                            Register
Register                            Immediate
Register                            Memory
Memory                              Register
Memory                              Immediate

FIGURE 2.35 Instruction types for the arithmetic, logical, and data transfer instructions. The x86 allows the combinations shown. The only restriction is the absence of a memory-memory mode. Immediates may be 8, 16, or 32 bits in length; a register is any one of the 14 major registers in Figure 2.34 (not EIP or EFLAGS).
Mode                                        Description                                    Register restrictions
Register indirect                           Address is in a register.                      Not ESP or EBP
Based mode with 8- or 32-bit displacement   Address is contents of base register           Not ESP
                                            plus displacement.
Base plus scaled index                      The address is Base + (2^Scale × Index),       Base: any GPR
                                            where Scale has the value 0, 1, 2, or 3.       Index: not ESP
Base plus scaled index with                 The address is Base + (2^Scale × Index)        Base: any GPR
8- or 32-bit displacement                   + displacement, where Scale has the            Index: not ESP
                                            value 0, 1, 2, or 3.

FIGURE 2.36 x86 32-bit addressing modes with register restrictions. The Base plus Scaled Index addressing mode, found in ARM but not MIPS, is included to avoid the multiplies by 4 (scale factor of 2) to turn an index in a register into a byte address (see Figures 2.23 and 2.25). A scale factor of 1 is used for 16-bit data, and a scale factor of 3 for 64-bit data. A scale factor of 0 means the address is not scaled. (Intel gives two different names to what is called Based addressing mode—Based and Indexed—but they are essentially identical and we combine them here.)
Data memory-addressing modes, described in detail below, offer two sizes of
addresses within the instruction. These so-called displacements can be 8 bits or 32 bits.
Although a memory operand can use any addressing mode, there are restrictions on which registers can be used in a mode. Figure 2.36 shows the x86 addressing modes and which GPRs cannot be used with each mode.
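As a C-level illustration of the most elaborate mode (the struct and function here are hypothetical), base plus scaled index with displacement matches the address arithmetic of an array-of-structures access:

struct point { int x; int y; };          /* 8 bytes per element with 4-byte ints */

int get_y(struct point *a, int i)
{
    /* address = a + (2^3 * i) + 4: base a, scaled index i, displacement 4 */
    return a[i].y;
}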
x86 Integer Operations
The 8086 provides support for both 8-bit (byte) and 16-bit (word) data types. The
80386 adds 32-bit addresses and data (double words) in the x86. (AMD64 adds
64-bit addresses and data, called quad words; we’ll stick to the 80386 in this section.) The data type distinctions apply to register operations as well as memory
accesses.
Almost every operation works on both 8-bit data and on one longer data size.
That size is determined by the mode and is either 16 bits or 32 bits.
Clearly, some programs want to operate on data of all three sizes, so the 80386
architects provided a convenient way to specify each version without expanding
code size significantly. They decided that either 16-bit or 32-bit data dominates
most programs, and so it made sense to be able to set a default large size. This
default data size is set by a bit in the code segment register. To override the default
data size, an 8-bit prefix is attached to the instruction to tell the machine to use the
other large size for this instruction.
The prefix solution was borrowed from the 8086, which allows multiple prefixes
to modify instruction behaviour. The three original prefixes override the default segment register, lock the bus to support synchronization (see Section 2.11), or repeat
the following instruction until the register ECX counts down to 0. This last prefix
was intended to be paired with a byte move instruction to move a variable number of
bytes. The 80386 also added a prefix to override the default address size.
The x86 integer operations can be divided into four major classes:
1. Data movement instructions, including move, push, and pop
2. Arithmetic and logic instructions, including test, integer, and decimal arithmetic operations
3. Control flow, including conditional branches, unconditional jumps, calls,
and returns
4. String instructions, including string move and string compare
The first two categories are unremarkable, except that the arithmetic and logic
instruction operations allow the destination to be either a register or a memory
location. Figure 2.37 shows some typical x86 instructions and their functions.
Instruction            Function
je name                if equal(condition code) {EIP=name}; EIP–128 <= name < EIP+128
jmp name               EIP=name
call name              SP=SP–4; M[SP]=EIP+5; EIP=name;
movw EBX,[EDI+45]      EBX=M[EDI+45]
push ESI               SP=SP–4; M[SP]=ESI
pop EDI                EDI=M[SP]; SP=SP+4
add EAX,#6765          EAX=EAX+6765
test EDX,#42           Set condition code (flags) with EDX and 42
movsl                  M[EDI]=M[ESI]; EDI=EDI+4; ESI=ESI+4

FIGURE 2.37 Some typical x86 instructions and their functions. A list of frequent operations appears in Figure 2.38. The CALL saves the EIP of the next instruction on the stack. (EIP is the Intel PC.)
Conditional branches on the x86 are based on condition codes or flags, like ARM.
Unlike ARM, where condition codes are set or not depending on the S bit, condition codes are set as an implicit side effect of most x86 instructions. Branches
then test the condition codes. PC-relative branch addresses must be specified in
the number of bytes, since unlike ARM and MIPS, x86 instructions are not all 4
bytes in length.
String instructions are part of the 8080 ancestry of the x86 and are not commonly executed in most programs. They are often slower than equivalent software
routines (see the fallacy on page 170).
Figure 2.38 lists some of the integer x86 instructions. Many of the instructions
are available in both byte and word formats.
Instruction          Meaning
Control              Conditional and unconditional branches
  jnz, jz            Jump if condition to EIP + 8-bit offset; JNE (for JNZ), JE (for JZ) are alternative names
  jmp                Unconditional jump—8-bit or 16-bit offset
  call               Subroutine call—16-bit offset; return address pushed onto stack
  ret                Pops return address from stack and jumps to it
  loop               Loop branch—decrement ECX; jump to EIP + 8-bit displacement if ECX ≠ 0
Data transfer        Move data between registers or between register and memory
  move               Move between two registers or between register and memory
  push, pop          Push source operand on stack; pop operand from stack top to a register
  les                Load ES and one of the GPRs from memory
Arithmetic, logical  Arithmetic and logical operations using the data registers and memory
  add, sub           Add source to destination; subtract source from destination; register-memory format
  cmp                Compare source and destination; register-memory format
  shl, shr, rcr      Shift left; shift logical right; rotate right with carry condition code as fill
  cbw                Convert byte in eight rightmost bits of EAX to 16-bit word in right of EAX
  test               Logical AND of source and destination sets condition codes
  inc, dec           Increment destination, decrement destination
  or, xor            Logical OR; exclusive OR; register-memory format
String               Move between string operands; length given by a repeat prefix
  movs               Copies from string source to destination by incrementing ESI and EDI; may be repeated
  lods               Loads a byte, word, or doubleword of a string into the EAX register

FIGURE 2.38 Some typical operations on the x86. Many operations use register-memory format, where either the source or the destination may be memory and the other may be a register or immediate operand.
x86 Instruction Encoding
Saving the worst for last, the encoding of instructions in the 80386 is complex,
with many different instruction formats. Instructions for the 80386 may vary from
1 byte, when there are no operands, up to 15 bytes.
Figure 2.39 shows the instruction format for several of the example instructions
in Figure 2.37. The opcode byte usually contains a bit saying whether the operand
is 8 bits or 32 bits. For some instructions, the opcode may include the addressing
[Figure 2.39 diagrams the encodings of six instructions: a. JE EIP + displacement; b. CALL; c. MOV EBX, [EDI + 45]; d. PUSH ESI; e. ADD EAX, #6765; f. TEST EDX, #42, showing the width in bits of each opcode, postbyte, displacement, offset, and immediate field.]

FIGURE 2.39 Typical x86 instruction formats. Figure 2.40 shows the encoding of the postbyte. Many instructions contain the 1-bit field w, which says whether the operation is a byte or a double word. The d field in MOV is used in instructions that may move to or from memory and shows the direction of the move. The ADD instruction requires 32 bits for the immediate field, because in 32-bit mode, the immediates are either 8 bits or 32 bits. The immediate field in the TEST is 32 bits long because there is no 8-bit immediate for test in 32-bit mode. Overall, instructions may vary from 1 to 17 bytes in length. The long length comes from extra 1-byte prefixes, having both a 4-byte immediate and a 4-byte displacement address, using an opcode of 2 bytes, and using the scaled index mode specifier, which adds another byte.
mode and the register; this is true in many instructions that have the form “register =
register op immediate.” Other instructions use a “postbyte” or extra opcode byte,
labeled “mod, reg, r/m,” which contains the addressing mode information. This
postbyte is used for many of the instructions that address memory. The base plus
scaled index mode uses a second postbyte, labeled “sc, index, base.”
Figure 2.40 shows the encoding of the two postbyte address specifiers for both
16-bit and 32-bit mode. Unfortunately, to understand fully which registers and
which addressing modes are available, you need to see the encoding of all addressing modes and sometimes even the encoding of the instructions.
reg  w=0   w=1          r/m   mod = 0                  mod = 1                    mod = 2                     mod = 3
           16b    32b         16b           32b        16b         32b            16b         32b
0    AL    AX     EAX    0    addr=BX+SI    =EAX       same        same           same        same            same
1    CL    CX     ECX    1    addr=BX+DI    =ECX       addr as     addr as        addr as     addr as         as
2    DL    DX     EDX    2    addr=BP+SI    =EDX       mod=0       mod=0          mod=0       mod=0           reg
3    BL    BX     EBX    3    addr=BP+DI    =EBX       + disp8     + disp8        + disp16    + disp32        field
4    AH    SP     ESP    4    addr=SI       =(sib)     SI+disp8    (sib)+disp8    SI+disp16   (sib)+disp32    "
5    CH    BP     EBP    5    addr=DI       =disp32    DI+disp8    EBP+disp8      DI+disp16   EBP+disp32      "
6    DH    SI     ESI    6    addr=disp16   =ESI       BP+disp8    ESI+disp8      BP+disp16   ESI+disp32      "
7    BH    DI     EDI    7    addr=BX       =EDI       BX+disp8    EDI+disp8      BX+disp16   EDI+disp32      "

FIGURE 2.40 The encoding of the first address specifier of the x86: mod, reg, r/m. The first four columns show the encoding of the 3-bit reg field, which depends on the w bit from the opcode and whether the machine is in 16-bit mode (8086) or 32-bit mode (80386). The remaining columns explain the mod and r/m fields. The meaning of the 3-bit r/m field depends on the value in the 2-bit mod field and the address size. Basically, the registers used in the address calculation are listed in the sixth and seventh columns, under mod = 0, with mod = 1 adding an 8-bit displacement and mod = 2 adding a 16-bit or 32-bit displacement, depending on the address mode. The exceptions are 1) r/m = 6 when mod = 1 or mod = 2 in 16-bit mode selects BP plus the displacement; 2) r/m = 5 when mod = 1 or mod = 2 in 32-bit mode selects EBP plus displacement; and 3) r/m = 4 in 32-bit mode when mod does not equal 3, where (sib) means use the scaled index mode shown in Figure 2.36. When mod = 3, the r/m field indicates a register, using the same encoding as the reg field combined with the w bit.
x86 Conclusion
Intel had a 16-bit microprocessor two years before its competitors’ more elegant
architectures, such as the Motorola 68000, and this head start led to the selection
of the 8086 as the CPU for the IBM PC. Intel engineers generally acknowledge that
the x86 is more difficult to build than computers like ARM and MIPS, but the large
market means AMD and Intel can afford more resources to help overcome the
added complexity. What the x86 lacks in style, it makes up for in quantity, making
it beautiful from the right perspective.
Its saving grace is that the most frequently used x86 architectural components are not too difficult to implement, as AMD and Intel have demonstrated
by rapidly improving performance of integer programs since 1978. To get that
performance, compilers must avoid the portions of the architecture that are hard
to implement fast.
2.18  Fallacies and Pitfalls
Fallacy: More powerful instructions mean higher performance.
Part of the power of the Intel x86 is the prefixes that can modify the execution of
the following instruction. One prefix can repeat the following instruction until
a counter counts down to 0. Thus, to move data in memory, it would seem that
the natural instruction sequence is to use move with the repeat prefix to perform
32-bit memory-to-memory moves.
An alternative method, which uses the standard instructions found in all computers, is to load the data into the registers and then store the registers back to
memory. This second version of this program, with the code replicated to reduce
loop overhead, copies at about 1.5 times faster. A third version, which uses the
larger floating-point registers instead of the integer registers of the x86, copies at
about 2.0 times faster than the complex move instruction.
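A C sketch of the "standard instructions" alternative, with the body replicated (unrolled) to reduce loop overhead; the function name and the unroll factor of four are illustrative, not the code behind the measurements quoted above:

void copy_words(unsigned int *dst, const unsigned int *src, int nwords)
{
    int i;
    for (i = 0; i + 4 <= nwords; i = i + 4) {    /* body replicated four times */
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for ( ; i < nwords; i = i + 1)               /* remaining 0 to 3 words     */
        dst[i] = src[i];
}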
Fallacy: Write in assembly language to obtain the highest performance.
At one time compilers for programming languages produced naïve instruction
sequences; the increasing sophistication of compilers means the gap between compiled code and code produced by hand is closing fast. In fact, to compete with current
compilers, the assembly language programmer needs to understand the concepts in
Chapters 4 and 5 thoroughly (processor pipelining and memory hierarchy).
This battle between compilers and assembly language coders is one situation in
which humans are losing ground. For example, C offers the programmer a chance
to give a hint to the compiler about which variables to keep in registers versus spilled
to memory. When compilers were poor at register allocation, such hints were vital
to performance. In fact, some old C textbooks spent a fair amount of time giving
examples that effectively use register hints. Today’s C compilers generally ignore such
hints, because the compiler does a better job at allocation than the programmer does.
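The hint in question is C's register storage class; a minimal sketch (as the text notes, today's compilers are free to ignore it):

int sum_array(const int *a, int n)
{
    register int i;                   /* hint: keep the loop index in a register */
    int sum = 0;
    for (i = 0; i < n; i = i + 1)
        sum = sum + a[i];
    return sum;
}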
Even if writing by hand resulted in faster code, the dangers of writing in assembly language are the longer time spent coding and debugging, the loss in portability,
and the difficulty of maintaining such code. One of the few widely accepted axioms
of software engineering is that coding takes longer if you write more lines, and it
clearly takes many more lines to write a program in assembly language than in C
or Java. Moreover, once it is coded, the next danger is that it will become a popular
program. Such programs always live longer than expected, meaning that someone
will have to update the code over several years and make it work with new releases
of operating systems and new models of machines. Writing in higher-level language instead of assembly language not only allows future compilers to tailor the
code to future machines, it also makes the software easier to maintain and allows
the program to run on more brands of computers.
Fallacy: The importance of commercial binary compatibility means successful
instruction sets don’t change.
[Figure 2.41 is a chart plotting the number of x86 instructions, from roughly 100 to 1000, against the year, from 1978 through 2008.]
FIGURE 2.41 Growth of x86 instruction set over time. While there is clear technical value to
some of these extensions, this rapid change also increases the difficulty for other companies to try to build
compatible processors.
While backwards binary compatibility is sacrosanct, Figure 2.41 shows that the
x86 architecture has grown dramatically. The average is more than one instruction
per month over its 30-year lifetime!
Pitfall: Forgetting that sequential word addresses in machines with byte addressing
do not differ by one.
Many an assembly language programmer has toiled over errors made by assuming
that the address of the next word can be found by incrementing the address in a
register by one instead of by the word size in bytes. Forewarned is forearmed!
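A small C illustration of the pitfall, assuming 4-byte words; adding 1 to a byte address lands in the middle of the current word rather than at the next one:

#include <stdio.h>

int main(void)
{
    unsigned int words[4] = {10, 20, 30, 40};
    unsigned char *addr = (unsigned char *) words;       /* byte address of words[0] */

    unsigned int *wrong = (unsigned int *) (addr + 1);   /* next byte: misaligned    */
    unsigned int *right = (unsigned int *) (addr + 4);   /* next word: words[1]      */

    printf("next word holds %u\n", *right);              /* prints 20                */
    (void) wrong;   /* dereferencing wrong would be a misaligned, buggy access       */
    return 0;
}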
Pitfall: Using a pointer to an automatic variable outside its defining procedure.
A common mistake in dealing with pointers is to pass a result from a procedure
that includes a pointer to an array that is local to that procedure. Following the
stack discipline in Figure 2.12, the memory that contains the local array will be
reused as soon as the procedure returns. Pointers to automatic variables can lead
to chaos.
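A minimal C example of the pitfall and one safe alternative; make_table and fill_table are hypothetical names:

/* BUG: returns the address of an automatic (stack-allocated) array. */
int *make_table(void)
{
    int table[4] = {1, 2, 3, 4};      /* lives in make_table's stack frame          */
    return table;                     /* that frame is reused as soon as we return  */
}

/* Safe alternative: the caller owns the storage and passes it in. */
void fill_table(int table[4])
{
    int i;
    for (i = 0; i < 4; i = i + 1)
        table[i] = i + 1;
}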
2.19  Concluding Remarks

Less is more.
Robert Browning, Andrea del Sarto, 1855
The two principles of the stored-program computer are the use of instructions that
are indistinguishable from numbers and the use of alterable memory for programs.
These principles allow a single machine to aid environmental scientists, financial
advisers, and novelists in their specialties. The selection of a set of instructions that
the machine can understand demands a delicate balance among the number of
instructions needed to execute a program, the number of clock cycles needed by
an instruction, and the speed of the clock. As illustrated in this chapter, four design
principles guide the authors of instruction sets in making that delicate balance:
1. Simplicity favors regularity. Regularity motivates many features of the ARM
instruction set: keeping all instructions a single size, always requiring three
register operands in arithmetic instructions, and keeping the register fields
in the same place in each instruction format.
2. Smaller is faster. The desire for speed is the reason that ARM has 16 registers
rather than many more.
3. Make the common case fast. Examples of making the common ARM case
fast include PC-relative addressing for conditional branches and immediate
addressing for larger constant operands.
4. Good design demands good compromises. One ARM example was the compromise between providing for larger addresses and constants in instructions and keeping all instructions the same length.
Above this machine level is assembly language, a language that humans can read.
The assembler translates it into the binary numbers that machines can understand,
and it even “extends” the instruction set by creating symbolic instructions that
aren’t in the hardware. For instance, constants or addresses that are too big are broken into properly sized pieces, common variations of instructions are given their
own name, and so on. Figure 2.42 lists the ARM instructions we have covered so
far, both real and pseudoinstructions.
Each category of ARM instructions is associated with constructs that appear in
programming languages:
■
The arithmetic instructions correspond to the operations found in assignment statements.
■
Data transfer instructions are most likely to occur when dealing with data
structures like arrays or structures.
■
The conditional branches are used in if statements and in loops.
■
The unconditional jumps are used in procedure calls and returns and for
case/switch statements.
These instructions are not born equal; the popularity of the few dominates the
many. For example, Figure 2.43 shows the popularity of each class of instructions
for SPEC2006. The varying popularity of instructions plays an important role in
the chapters on datapath, control, and pipelining.
After we explain computer arithmetic in Chapter 3, we reveal the rest of the
ARM instruction set architecture.
ARM instructions                           Name     Format
add                                        ADD      DP
subtract                                   SUB      DP
load register                              LDR      DT
store register                             STR      DT
load register halfword                     LDRH     DT
load register halfword signed              LDRSH    DT
store register halfword                    STRH     DT
load register byte                         LDRB     DT
load register byte signed                  LDRSB    DT
store register byte                        STRB     DT
swap                                       SWP      DT
mov                                        MOV      DP
and                                        AND      DP
or                                         ORR      DP
not                                        MVN      DP
logical shift left (optional operation)    LSL      DP
logical shift right (optional operation)   LSR      DP
compare                                    CMP      DP
branch on X: EQ, NE, LT, LE, GT, GE,
  LO, LS, HI, HS, VS, VC, MI, PL           Bx       BR
branch (always)                            B        BR
branch and link                            BL       BR

FIGURE 2.42 The ARM instruction set covered so far. Appendixes B1, B2 and B3 describe the full ARM architecture. Figure 2.1 shows more details of the ARM architecture revealed in this chapter.
                                                                                                     Frequency
Instruction class    ARM examples                  HLL correspondence                                Integer   Ft. pt.
Arithmetic           ADD, SUB, MOV                 Operations in assignment statements               16%       48%
Data transfer        LDR, STR, LDRB, LDRSB,        References to data structures, such as arrays     35%       36%
                     LDRH, LDRSH, STRB, STRH
Logical              AND, ORR, MVN, LSL, LSR       Operations in assignment statements               12%       4%
Conditional branch   B_, CMP                       If statements and loops                           34%       8%
Jump                 B, BL                         Procedure calls, returns, and case/switch         2%        0%
                                                   statements

FIGURE 2.43 ARM instruction classes, examples, correspondence to high-level program language constructs, and percentage of ARM instructions executed by category for the average SPEC2006 benchmarks. Figure 3.24 in Chapter 3 shows average percentage of the individual ARM instructions executed. Extrapolated from measurements of MIPS programs.
2.20  Historical Perspective and Further Reading
This section surveys the history of instruction set architectures (ISAs) over time,
and we give a short history of programming languages and compilers. ISAs include
accumulator architectures, general-purpose register architectures, stack architectures, and a brief history of ARM, MIPS, and the x86. We also review the controversial subjects of high-level-language computer architectures and reduced instruction
set computer architectures. The history of programming languages includes Fortran, Lisp, Algol, C, Cobol, Pascal, Simula, Smalltalk, C++, and Java, and the history
of compilers includes the key milestones and the pioneers who achieved them. The
rest of this section is on the CD.
2.21  Exercises
Contributed by John Oliver of Cal Poly, San Luis Obispo, with contributions from Nicole
Kaiyan (University of Adelaide) and Milos Prvulovic (Georgia Tech)
A link for an ARM simulator is provided on the CD and is helpful for these exercises.
Although the simulator accepts pseudoinstructions, try not to use pseudoinstructions
for any exercises that ask you to produce ARM code. Your goal should be to learn the
real ARM instruction set, and if you are asked to count instructions, your count should
reflect the actual instructions that will be executed and not the pseudoinstructions.
There are some cases where pseudoinstructions must be used (for example, the
la instruction when an actual value is not known at assembly time). In many cases,
they are quite convenient and result in more readable code (for example, the li
and move instructions). If you choose to use pseudoinstructions for these reasons,
please add a sentence or two to your solution stating which pseudoinstructions
you have used and why.
Exercise 2.1
The following problems deal with translating from C to ARM. Assume that the
variables g, h, i, and j are given and could be considered 32-bit integers as declared
in a C program.
a.
f = g + h + i + j;
b.
f = g + (h + 5);
2.1.1 [5] <2.2> For the C statements above, what is the corresponding ARM
assembly code? Use a minimal number of ARM assembly instructions.
2.1.2 [5] <2.2> For the C statements above, how many ARM assembly instructions are needed to perform the C statement?
2.1.3 [5] <2.2> If the variables f, g, h, i, and j have values 1, 2, 3, 4, and 5, respectively, what is the end value of f?
The following problems deal with translating from ARM to C. Assume that the
variables g, h, i, and j are given and could be considered 32-bit integers as declared
in a C program.
a.  ADD  f, g, h

b.  ADD  f, f, #1
    ADD  f, g, h
2.1.4 [5] <2.2> For the ARM statements above, what is a corresponding
C statement?
2.1.5 [5] <2.2> If the variables f, g, h, and i have values 1, 2, 3, and 4, respectively,
what is the end value of f?
Exercise 2.2
The following problems deal with translating from C to ARM. Assume that the
variables g, h, i, and j are given and could be considered 32-bit integers as declared
in a C program.
a.
f = f + f + i;
b.
f = g + (j + 2);
2.2.1 [5] <2.2> For the C statements above, what is the corresponding ARM
assembly code? Use a minimal number of ARM assembly instructions.
2.2.2 [5] <2.2> For the C statements above, how many ARM assembly instructions are needed to perform the C statement?
2.2.3 [5] <2.2> If the variables f, g, h, and i have values 1, 2, 3, and 4, respectively,
what is the end value of f?
The following problems deal with translating from ARM to C. For the following
exercise, assume that the variables g, h, i, and j are given and could be considered
32-bit integers as declared in a C program.
a.  ADD  f, f, h

b.  RSB  f, f, #0
    ADD  f, f, 1
2.2.4 [5] <2.2> For the ARM statements above, what is a corresponding C
statement?
2.2.5 [5] <2.2> If the variables f, g, h, and i have values 1, 2, 3, and 4, respectively, what is the end value of f?
Exercise 2.3
The following problems deal with translating from C to ARM. Assume that the
variables g, h, i, and j are given and could be considered 32-bit integers as declared
in a C program.
a.
f = f + g + h + i + j + 2;
b.
f = g – (f + 5);
2.3.1 [5] <2.2> For the C statements above, what is the corresponding ARM
assembly code? Use a minimal number of ARM assembly instructions.
2.3.2 [5] <2.2> For the C statements above, how many ARM assembly instructions are needed to perform the C statement?
2.3.3 [5] <2.2> If the variables f, g, h, i, and j have values 1, 2, 3, 4, and 5, respectively, what is the end value of f?
The following problems deal with translating from ARM to C. Assume that the
variables g, h, i, and j are given and could be considered 32-bit integers as declared
in a C program.
a.  ADD  f, –g, h

b.  ADD  h, f, #1
    SUB  f, g, h
2.3.4 [5] <2.2> For the ARM statements above, what is a corresponding C
statement?
2.3.5 [5] <2.2> If the variables f, g, h, and i have values 1, 2, 3, and 4, respectively, what is the end value of f?
Exercise 2.4
The following problems deal with translating from C to ARM. Assume that the variables f, g, h, i, and j are assigned to registers r0, r1, r2, r3, and r4, respectively.
Assume that the base address of the arrays A and B are in registers r6 and r7,
respectively.
a.
f = g + h + B[4];
b.
f = g – A[B[4]];
2.4.1 [10] <2.2, 2.3> For the C statements above, what is the corresponding ARM
assembly code?
2.4.2 [5] <2.2, 2.3> For the C statements above, how many ARM assembly
instructions are needed to perform the C statement?
2.4.3 [5] <2.2, 2.3> For the C statements above, how many different registers are
needed to carry out the C statement?
The following problems deal with translating from ARM to C. Assume that the
variables f, g, h, i, and j are assigned to registers r0, r1, r2, r3, and r4, respectively. Assume that the base address of the arrays A and B are in registers r6 and
r7, respectively.
a.  ADD  r0, r0, r1
    ADD  r0, r0, r2
    ADD  r0, r0, r3
    ADD  r0, r0, r4

b.  LDR  r0, 4[r6, #0x4]
2.4.4 [10] <2.2, 2.3> For the ARM assembly instructions above, what is the
corresponding C statement?
2.4.5 [5] <2.2, 2.3> For the ARM assembly instructions above, rewrite the assembly code to minimize the number of ARM instructions (if possible) needed to carry
out the same function.
2.4.6 [5] <2.2, 2.3> How many registers are needed to carry out the ARM assembly as written above? If you could rewrite the code above, what is the minimal
number of registers needed?
Exercise 2.5
In the following problems, we will be investigating memory operations in the context of a ARM processor. The table below shows the values of an array stored in
memory.
a.  Address   12   8   4   0
    Data       1   6   4   2

b.  Address   16  12   8   4   0
    Data       1   2   3   4   5
2.5.1 [10] <2.2, 2.3> For the memory locations in the table above, write C code
to sort the data from lowest-to-highest, placing the lowest value in the smallest
memory location shown in the figure. Assume that the data shown represents the
C variable called Array, which is an array of type int. Assume that this particular
machine is a byte-addressable machine and a word consists of 4 bytes.
2.5.2 [10] <2.2, 2.3> For the memory locations in the table above, write ARM
code to sort the data from lowest-to-highest, placing the lowest value in the smallest memory location. Use a minimum number of ARM instructions. Assume the
base address of Array is stored in register r6.
2.5.3 [5] <2.2, 2.3> To sort the array above, how many instructions are required
for the ARM code? If you are not allowed to use the immediate field in LDR and STR
instructions, how many ARM instructions do you need?
The following problems explore the translation of hexadecimal numbers to other
number formats.
a.
0x12345678
b.
0xbeadf00d
2.5.4 [5] <2.3> Translate the hexadecimal numbers above into decimal.
2.5.5 [5] <2.3> Show how the data in the table would be arranged in memory
of a little-endian and a big-endian machine. Assume the data is stored starting at
address 0.
Exercise 2.6
The following problems deal with translating from C to ARM. Assume that the
variables f, g, h, i, and j are assigned to registers r0, r1, r2, r3, and r4, respectively. Assume that the base address of the arrays A and B are in registers r6 and
r7, respectively.
a.
f = –g + h + B[1];
b.
f = A[B[g]+1];
2.6.1 [10] <2.2, 2.3> For the C statements above, what is the corresponding ARM
assembly code?
2.6.2 [5] <2.2, 2.3> For the C statements above, how many ARM assembly
instructions are needed to perform the C statement?
2.6.3 [5] <2.2, 2.3> For the C statements above, how many registers are needed
to carry out the C statement using ARM assembly code?
The following problems deal with translating from ARM to C. Assume that the
variables f, g, h, i, and j are assigned to registers r0, r1, r2, r3, and r4, respectively. Assume that the base address of the arrays A and B are in registers r6 and
r7, respectively.
a.
ADD r0, r0, r1
ADD r0, r3, r2
ADD r0, r0, r3
b.  ADD  r6, r6, #–20   ;(SUB r6, r6, #20)
    ADD  r6, r6, r1
    LDR  r0, [r6, #8]
2.6.4 [5] <2.2, 2.3> For the ARM assembly instructions above, what is the corresponding C statement?
2.6.5 [5] <2.2, 2.3> For the ARM assembly above, assume that the registers
r0, r1, r2, r3, contain the values 10, 20, 30, and 40, respectively. Also, assume
that register r6 contains the value 256, and that memory contains the following
values:
Address    Value
256        100
260        200
264        300
Find the value of r0 at the end of the assembly code.
2.6.6 [10] <2.3, 2.5> For each ARM instruction, show the value of the opcode, Rd,
Rn, operand2 and I fields. Remember to distinguish between DP-type and DT-type
instructions.
Exercise 2.7
The following problems explore number conversions from signed and unsigned
binary number to decimal numbers.
a.
1010 1101 0001 0000 0000 0000 0000 0010two
b.
1111 1111 1111 1111 1011 0011 0101 0011two
2.7.1 [5] <2.4> For the patterns above, what base 10 number does it represent,
assuming that it is a two’s complement integer?
2.7.2 [5] <2.4> For the patterns above, what base 10 number does it represent,
assuming that it is an unsigned integer?
2.7.3 [5] <2.4> For the patterns above, what hexadecimal number does it represent?
The following problems explore number conversions from decimal to signed and
unsigned binary numbers.
a.
2147483647ten
b.
1000ten
2.7.4 [5] <2.4> For the base ten numbers above, convert to 2’s complement
binary.
2.7.5 [5] <2.4> For the base ten numbers above, convert to 2’s complement hexadecimal.
2.7.6 [5] <2.4> For the base ten numbers above, convert the negated values from
the table to 2’s complement hexadecimal.
Exercise 2.8
The following problems deal with sign extension and overflow. Registers r0 and r1
hold the values as shown in the table below. You will be asked to perform a ARM
operation on these registers and show the result.
a.  r0 = 0x70000000sixteen, r1 = 0x0FFFFFFFsixteen

b.  r0 = 0x40000000sixteen, r1 = 0x40000000sixteen
2.8.1 [5] <2.4> For the contents of registers r0 and r1 as specified above, what is
the value of r4 for the following assembly code:
ADD r4, r0, r1
Is the result in r4 the desired result, or has there been overflow?
2.8.2 [5] <2.4> For the contents of registers r0 and r1 as specified above, what is
the value of r0 for the following assembly code:
SUB r4, r0, r1
Is the result in r4 the desired result, or has there been overflow?
2.8.3 [5] <2.4> For the contents of registers r0 and r1 as specified above, what is
the value of r4 for the following assembly code:
ADD r4, r0, r1
ADD r4, r0, r0
Is the result in r4 the desired result, or has there been overflow?
In the following problems, you will perform various ARM operations on a pair of
registers, r0 and r1. Given the values of r0 and r1 in each of the questions below,
state if there will be overflow.
a.
ADD r0, r0, r1
b.
SUB r0, r0, r1
sub r0, r0, r1
2.8.4 [5] <2.4> Assume that register r0 = 0x70000000 and r1 = 0x10000000. For
the table above, will there be overflow?
2.8.5 [5] <2.4> Assume that register r0 = 0x40000000 and r1 = 0x20000000. For
the table above, will there be overflow?
2.8.6 [5] <2.4> Assume that register r0 = 0x8FFFFFFF and r1 = 0xD0000000.
For the table above, will there be overflow?
Exercise 2.9
The table below contains various values for register r1. You will be asked to evaluate
if there would be overflow for a given operation.
a.
2147483647ten
b.
0xD0000000sixteen
2.9.1 [5] <2.4> Assume that register r0 = 0x70000000 and r1 has the value as
given in the table. If the instruction: ADD r0, r0, r1 is executed, will there be
overflow?
2.9.2 [5] <2.4> Assume that register r0 = 0x80000000 and r1 has the value as
given in the table. If the instruction: SUB r0, r0, r1 is executed, will there be
overflow?
2.9.3 [5] <2.4> Assume that register r0 = 0x7FFFFFFF and r1 has the value as
given in the table. If the instruction: SUB r0, r0, r1 is executed, will there be
overflow?
The table below contains various values for register r1. You will be asked to evaluate
if there would be overflow for a given operation.
a.
1010 1101 0001 0000 0000 0000 0000 0010two
b.
1111 1111 1111 1111 1011 0011 0101 0011two
2.9.4 [5] <2.4> Assume that register r0 = 0x70000000 and r1 has the value as
given in the table. If the instruction: ADD r0, r0, r1 is executed, will there be
overflow?
2.9.5 [5] <2.4> Assume that register r0 = 0x70000000 and r1 has the value as
given in the table. If the instruction: ADD r0, r0, r1 is executed, what is the result
in hex?
2.9.6 [5] <2.4> Assume that register r0 = 0x70000000 and r1 has the value as given
in the table. If the instruction: ADD r0, r0, r1 is executed, what is the result in base
ten?
Exercise 2.10
In the following problems, the data table contains bits that represent the opcode
of an instruction. You will be asked to translate the entries into assembly code and
determine what format of ARM instruction the bits represent.
a.
1010 1110 0000 1011 0000 0000 0000 0100two
b.
1000 1101 0000 1000 0000 0000 0100 0000two
2.10.1 [5] <2.5> For the binary entries above, what instruction do they
represent?
2.10.2 [5] <2.5> What type (DP-type, DT-type) instruction do the binary entries
above represent?
2.10.3 [5] <2.4, 2.5> If the binary entries above were data bits, what number
would they represent in hexadecimal?
In the following problems, the data table contains ARM instructions. You will be asked to translate the entries into the bits of the opcode and determine the ARM instruction format.
a.
ADD r0, r0, r5
b.
LDR r1, [r3, #4]
2.10.4 [5] <2.4, 2.5> For the instructions above, show the hexadecimal representation of each instruction.
2.10.5 [5] <2.5> What type (DP-type, DT-type) instruction do the instructions above represent?
2.10.6 [5] <2.5> What is the hexadecimal representation of the opcode, Rd, and Rn fields in these instructions? For a DP-type instruction, what is the hexadecimal representation of the Rd, I, and operand2 fields?
Exercise 2.11
In the following problems, the data table contains bits that represent the opcode
of an instruction. You will be asked to translate the entries into assembly code and
determine what format of ARM instruction the bits represent.
a.
0xE0842005
b.
0xE0423001
2.11.1 [5] <2.4, 2.5> What binary number does the above hexadecimal number
represent?
2.11.2 [5] <2.4, 2.5> What decimal number does the above hexadecimal number
represent?
2.11.3 [5] <2.5> What instruction does the above hexadecimal number
represent?
In the following problems, the data table contains the values of various fields of
ARM instructions. You will be asked to determine what the instruction is, and
find the ARM format for the instruction.
a.
Cond=14, F=1, opcode=25, Rn=6, Rd=5, operand2/offset=0
b.
Cond=14, F=0, opcode=2, Rn=6, Rd=5, operand2/offset=0
2.11.4 [5] <2.5> What type (DP-type, DT-type) instruction do the instructions
above represent?
2.11.5 [5] <2.5> What is the ARM assembly instruction described above?
2.11.6 [5] <2.4, 2.5> What is the binary representation of the instructions above?
Exercise 2.12
In the following problems, the data table contains various modifications that could
be made to the ARM instruction set architecture. You will investigate the impact of
these changes on the instruction format of the ARM architecture.
a.
8 registers & 10-bit immediate constant
b.
10-bit offset
2.12.1 [5] <2.5> If the instruction set of the ARM processor is modified, the instruction format must also be changed. For each of the suggested changes above, show the size of the bit fields of a DP-type format instruction. What is the total number of bits needed for each instruction?
2.12.2 [5] <2.5> If the instruction set of the ARM processor is modified, the instruction format must also be changed. For each of the suggested changes above, show the size of the bit fields of a DT-type format instruction. What is the total number of bits needed for each instruction?
2.12.3 [5] <2.5, 2.10> Why could the suggested change in the table above decrease the size of an ARM assembly program? Why could the suggested change in the table above increase the size of an ARM assembly program?
In the following problems, the data table contains hexadecimal values. You will be
asked to determine what ARM instruction the value represents, and find the ARM
instruction format.
a.
0xE2801000
b.
0xE1801020
2.12.4 [5] <2.5> For the entries above, what is the value of the number in decimal?
2.12.5 [5] <2.5> For the hexadecimal entries above, what instruction do they
represent?
2.12.6 [5] <2.4, 2.5> What type (DP-type, DT-type) instruction do the entries above represent? What is the value of the opcode field and the Rd field?
Exercise 2.13
In the following problems, the data table contains the values for registers r3
and r4. You will be asked to perform several ARM logical operations on these
registers.
a.
r3 = 0x55555555, r4 = 0x12345678
b.
r3 = 0xBEADFEED, r4 = 0xDEADFADE
2.13.1 [5] <2.6> For the lines above, what is the value of r5 for the following
sequence of instructions:
ORR r5, r4, r3, LSL #4
2.13.2 [5] <2.6> For the values in the table above, what is the value of r5 after the following sequence of instructions:
MVN r3, #1
AND r5, r3, r4, LSL #4
2.13.3 [5] <2.6> For the lines above, what is the value of r5 for the following
sequence of instructions:
MOV r5, 0xFFEF
AND r5, r5, r3, LSR #3
In the following exercise, the data table contains various ARM logical operations.
You will be asked to find the result of these operations given values for registers r0
and r1.
a.
ORR r2, r1, r0, LSL #1
b.
AND r2, r1, r0, LSR #1
2.13.4 [5] <2.6> Assume that r0 = 0x0000A5A5 and r1 = 0x00005A5A. What is the value of r2 after the two instructions in the table?
2.13.5 [5] <2.6> Assume that r0 = 0xA5A50000 and r1 = 0xA5A50000. What is the value of r2 after the two instructions in the table?
2.13.6 [5] <2.6> Assume that r0 = 0xA5A5FFFF and r1 = 0xA5A5FFFF. What is the value of r2 after the two instructions in the table?
Exercise 2.14
The following figure shows the placement of a bit field in register r0.
    31                     i                       j                    0
    |     31 – i bits      |   Field (i – j bits)  |       j bits       |
In the following problems, you will be asked to write ARM instructions to extract
the bits “Field” from register r0 and place them into register r1 at the location
indicated in the following table.
a.
r1: bits 31 down to i–j contain 000…000; the Field is placed in bits i–j–1 down to 0.
b.
r1: bits 31 down to 14+i–j contain 000…000; the Field occupies the i–j bits ending at bit 14; bits 13 down to 0 contain 000…000.
2.14.1 [20] <2.6> Find the shortest sequence of ARM instructions that extracts a
field from r0 for the constant values i = 22 and j = 5 and places the field into r1 in
the format shown in the data table.
2.14.2 [5] <2.6> Find the shortest sequence of ARM instructions that extracts a
field from r0 for the constant values i = 4 and j = 0 and places the field into r1 in
the format shown in the data table.
2.14.3 [5] <2.6> Find the shortest sequence of ARM instructions that extracts a
field from r0 for the constant values i = 31 and j = 28 and places the field into r1
in the format shown in the data table.
In the following problems, you will be asked to write ARM instructions to extract the bits "Field" from register r0 shown in the figure and place them into register r1 at the location indicated in the following table. The bits shown as "XXX" are to remain unchanged.
a.
r1: bits 31 down to i–j keep their original values (XXX…XXX); the Field is placed in bits i–j–1 down to 0.
b.
r1: bits 31 down to 14+i–j keep their original values (XXX…XXX); the Field occupies the i–j bits ending at bit 14; bits 13 down to 0 keep their original values (XXX…XXX).
2.14.4 [20] <2.6> Find the shortest sequence of ARM instructions that extracts
a field from r0 for the constant values i = 17 and j = 11 and places the field into r1
in the format shown in the data table.
2.14.5 [5] <2.6> Find the shortest sequence of ARM instructions that extracts a
field from r0 for the constant values i = 5 and j = 0 and places the field into r1 in
the format shown in the data table.
2.14.6 [5] <2.6> Find the shortest sequence of ARM instructions that extracts a
field from r0 for the constant values i = 31 and j = 29 and places the field into r1
in the format shown in the data table.
Exercise 2.15
For these problems, the table holds some logical operations that are not included in
the ARM instruction set. How can these instructions be implemented?
a.
ANDN r1, r2, r3
// bit-wise AND of r2, !r3
b.
XNOR r1, r2, r3
// bit-wise exclusive-NOR
2.15.1 [5] <2.6> The logical instructions above are not included in the ARM
instruction set, but are described above. If the value of r2 = 0x00FFA5A5 and the
value of r3 = 0xFFFF003C, what is the result in r1?
2.15.2 [10] <2.6> The logical instructions above are not included in the ARM
instruction set, but can be synthesized using one or more ARM assembly instructions. Provide a minimal set of ARM instructions that may be used in place of the
instructions in the table above.
2.15.3 [5] <2.6> For your sequence of instructions in 2.15.2, show the bit-level
representation of each instruction.
Various C-level logical statements are shown in the table below. In this exercise, you
will be asked to evaluate the statements and implement these C statements using
ARM assembly instructions.
a.
A = B & C[0];
b.
A = A ? B : C[0]
2.15.4 [5] <2.6> The table above shows different C statements that use logical operators. If the memory location at C[0] contains the integer value 0x00001234, and the initial integer values of A and B are 0x00000000 and 0x00002222, what is the resulting value of A?
2.15.5 [5] <2.6> For the C statements in the table above, write a minimal sequence
of ARM assembly instructions that does the identical operation.
2.15.6 [5] <2.6> For your sequence of instructions in 2.15.5, show the bit-level
representation of each instruction.
Exercise 2.16
For these problems, the table holds various binary values for register r0. Given the
value of r0, you will be asked to evaluate the outcome of different branches.
a.
1010 1101 0001 0000 0000 0000 0000 0010two
b.
1111 1111 1111 1111 1111 1111 1111 1111two
2.16.1 [5] <2.7> Suppose that register r0 contains a value from above and r1
has the value
0011 1111 1111 1000 0000 0000 0000 0000two
What is the value of r2 after the following instructions?
      MOV  r2, #0
      CMP  r0, r1
      BGE  ELSE
      B    DONE
ELSE: MOV  r2, #2
DONE:
2.16.2 [5] <2.7> Suppose that register r0 contains a value from above and r1
has the value
0011 1111 1111 1000 0000 0000 0000 0000two
What is the value of r2 after the following instructions?
      MOV  r2, #0
      CMP  r0, r1
      BLO  ELSE
      B    DONE
ELSE: MOV  r2, #2
DONE:
2.16.3 [5] <2.7> Rewrite the above code using conditional instructions of ARM.
For these problems, the table holds various binary values for register r0. Given the
value of r0, you will be asked to evaluate the outcome of different branches.
a.
0x00001000
b.
0x20001400
2.16.4 [5] <2.7> Suppose that register r0 contains a value from above. What is
the value of r2 after the following instructions?
MOV r2, #0
CMP r0, r0
BLT  ELSE
B    DONE
ELSE: ADD r2, r2, #2
DONE:
2.16.5 [5] <2.6, 2.7> Suppose that register r0 contains a value from above. What
is the value of r2 after the following instructions?
MOV r2, #0
CMP r0, r0
BHI ELSE
B    DONE
ELSE: ADD r2, r2, #2
DONE:
Exercise 2.17
For these problems, several instructions that are not included in the ARM instruction set are shown.
a.
ABS r2, r3
# r2 = |r3|
b.
SGT r1, r2, r3
# R[r1] = (R[r2] > R[r3]) ? 1:0
2.17.1 [5] <2.7> The table above contains some instructions not included in the ARM instruction set and the description of each instruction. Why are these instructions not included in the ARM instruction set?
2.17.2 [5] <2.7> The table above contains some instructions not included in the
ARM instruction set and the description of each instruction. If these instructions
were to be implemented in the ARM instruction set, what is the most appropriate
instruction format?
2.17.3 [5] <2.7> For each instruction in the table above, find the shortest
sequence of ARM instructions that performs the same operation.
For these problems, the table holds ARM assembly code fragments. You will be
asked to evaluate each of the code fragments, familiarizing you with the different
ARM branch instructions.
a.
LOOP:  CMP  r1, #0
       BLT  ELSE
       B    DONE
ELSE:  ADD  r3, r3, #2
       SUB  r1, r1, #1
       B    LOOP
DONE:
b.
LOOP:  MOV  r2, 0xA
LOOP2: ADD  r4, r4, #2
       SUB  r2, r2, #1
       CMP  r1, #0
       BNE  LOOP2
       SUB  r1, r1, #1
       BNE  LOOP
DONE:
2.17.4 [5] <2.7> For the loops written in ARM assembly above, assume that the register r1 is initialized to the value 10. What is the value in register r2, assuming that r2 is initially zero?
2.17.5 [5] <2.7> For each of the loops above, write the equivalent C code
routine. Assume that the registers r3, r4, r1, and r2 are integers A, B, i, and temp,
respectively.
2.17.6 [5] <2.7> For the loops written in ARM assembly above, assume that the
register r1 is initialized to the value N. How many ARM instructions are executed?
Exercise 2.18
For these problems, the table holds some C code. You will be asked to evaluate these
C code statements in ARM assembly code.
a.
for(i=0; i<10; i++)
a += b;
b.
while (a < 10){
D[a] = b + a;
a += 1;
}
2.18.1 [5] <2.7> For the table above, draw a control-flow graph of the C code.
2.18.2 [5] <2.7> For the table above, translate the C code to ARM assembly code.
Use a minimum number of instructions. Assume that the value a, b, i, j are in
registers r0, r1, r3, r3, respectively. Also, assume that register r1 holds the base
address of the array D.
2.18.3 [5] <2.7> How many ARM instructions does it take to implement the C code?
If the variables a and b are initialized to 10 and 1 and all elements of D are initially 0,
what is the total number of ARM instructions that is executed to complete the loop?
For these problems, the table holds ARM assembly code fragments. You will be
asked to evaluate each of the code fragments, familiarizing you with the different
ARM branch instructions.
a.
       MOV  r1, #100
LOOP:  LDR  r3, [r2, #0]
       ADD  r4, r4, r3
       ADD  r2, r2, #4
       SUB  r1, r1, #1
       CMP  r1, #0
       BNE  LOOP
b.
       ADD  r1, r2, #400
LOOP:  LDR  r3, [r2, #0]
       ADD  r4, r4, r3
       LDR  r3, [r2, #0]
       ADD  r4, r4, r3
       ADD  r2, r2, #8
       CMP  r1, r2
       BNE  LOOP
2.18.4 [5] <2.7> What is the total number of ARM instructions executed?
2.18.5 [5] <2.7> Translate the loops above into C. Assume that the C-level integer i is held in register r1, r4 holds the C-level integer called result, and r2
holds the base address of the integer MemArray.
2.18.6 [5] <2.7> Rewrite the loop in ARM assembly to reduce the number of
ARM instructions executed.
Exercise 2.19
For the following problems, the table holds C code functions. Assume that the first function listed in the table is called first. You will be asked to translate these C code routines into ARM assembly.
a.
int compare(int a, int b) {
if (sub(a, b) >= 0)
return 1;
else
return 0;
}
int sub (int a, int b) {
return a-b;
}
b.
int fib_iter(int a, int b, int n){
if(n == 0)
return b;
else
return fib_iter(a+b, a, n-1);
}
2.19.1 [15] <2.8> Implement the C code in the table in ARM assembly. What is
the total number of ARM instructions needed to execute the function?
2.19.2 [5] <2.8> Functions can often be implemented by compilers "in-line." In-lining copies the body of the function into the calling program, eliminating the overhead of the function call. Implement an "in-line" version of the C code in the table in ARM assembly. What is the reduction in the total number of ARM assembly instructions needed to complete the function? Assume that the C variable n is initialized to 5.
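As background for this problem, the small C sketch below (with hypothetical function names, not the code from the table) shows what in-lining does:

   #include <stdio.h>

   /* Out-of-line version: each use of square() pays the call/return overhead. */
   static int square(int x) { return x * x; }

   int sum_of_squares(int a, int b) {
       return square(a) + square(b);
   }

   /* In-lined version: the body of square() is copied into the caller,
      so no call or return is executed. */
   int sum_of_squares_inline(int a, int b) {
       return (a * a) + (b * b);
   }

   int main(void) {
       printf("%d %d\n", sum_of_squares(3, 4), sum_of_squares_inline(3, 4));  /* 25 25 */
       return 0;
   }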
2.19.3 [5] <2.8> For each function call, show the contents of the stack after the function call is made. Assume the stack pointer is originally at address 0x7ffffffc, and follow the register conventions as specified in Figure 2.11.
The following three problems in this exercise refer to a function f that calls another
function func. The code for C function func is already compiled in another module
using the ARM calling convention from Figure 2.14. The function declaration for func
is “int func(int a, int b);”. The code for function f is as follows:
a.
int f(int a, int b, int c){
return func(func(a,b),c);
}
b.
int f(int a, int b, int c){
return func(a,b)+func(b,c);
}
2.19.4 [10] <2.8> Translate function f into ARM assembler, also using the ARM
calling convention from Figure 2.14. If you need to use registers r4 through r11,
use the lower-numbered registers first.
2.19.5 [5] <2.8> Can we use the tail-call optimization in this function? If no,
explain why not. If yes, what is the difference in the number of executed instructions in f with and without the optimization?
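As background (independent of the functions above), a call is a tail call when its result is returned unchanged as the caller's own result, so the callee can return directly to the caller's caller. A minimal C sketch with made-up names:

   #include <stdio.h>

   static int g(int x) { return 2 * x; }

   /* Work remains after g() returns, so this is not a tail call. */
   static int not_a_tail_call(int x) { return g(x) + 1; }

   /* The call to g() is the last action and its result is returned as-is,
      so a compiler may branch to g() and let it return to our caller. */
   static int tail_call(int x) { return g(x + 1); }

   int main(void) {
       printf("%d %d\n", not_a_tail_call(3), tail_call(3));  /* prints 7 8 */
       return 0;
   }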
2.19.6 [5] <2.8> Right before your function f from Problem 2.19.4 returns,
what do we know about contents of registers and sp? Keep in mind that we know
what the entire function f looks like, but for function func we only know its
declaration.
Exercise 2.20
This exercise deals with recursive procedure calls. For the following problems, the
table has an assembly code fragment that computes the factorial of a number. However, the entries in the table have errors, and you will be asked to fix these errors.
a.
FACT:
     SUB  SP,SP,#8
     STR  LR,[SP,#4]
     STR  R0,[SP,#8]
     CMP  R0,#1
     BGE  L1
     MOV  R1,#1
     ADD  SP,SP,#8
     MOV  PC,LR
L1:  SUB  R0,R0,#1
     BL   FACT
     LDR  R0,[SP,#4]
     LDR  LR,[SP,0]
     ADD  SP,SP,#4
     MUL  R1,R0,R1
     MOV  PC,LR
b.
FACT:
     SUB  SP,SP,#8
     STR  LR,[SP,#4]
     STR  R0,[SP,#8]
     CMP  R0,#1
     BGE  L1
     MOV  R1,#1
     ADD  SP,SP,#8
     MOV  PC,LR
L1:  SUB  R0,R0,#1
     BL   FACT
     LDR  R0,[SP,#4]
     LDR  LR,[SP,0]
     ADD  SP,SP,#4
     MUL  R1,R0,R1
     MOV  PC,LR
2.20.1 [5] <2.8> The ARM assembly program above computes the factorial of
a given input. The integer input is passed through register r0, and the result is
returned in register r1. In the assembly code, there are a few errors. Correct the
ARM errors.
2.20.2 [10] <2.8> For the recursive factorial ARM program above, assume that
the input is 4. Rewrite the factorial program to operate in a nonrecursive manner.
What is the total number of instructions used to execute your solution from 2.20.2
versus the recursive version of the factorial program?
2.20.3 [5] <2.8> Show the contents of the stack after each function call, assuming that the input is 4.
For the following problems, the table has an assembly code fragment that computes a Fibonacci number. However, the entries in the table have errors, and you
will be asked to fix these errors.
a.
FIB:
     SUB  sp,sp,#12
     STR  lr,[sp,#0]
     STR  r2,[sp,#4]
     STR  r1,[sp,#8]
     CMP  r1,#1
     BGE  L1
     MOV  r0,r1
     B    EXIT
L1:  SUB  r1,r1,#1
     BL   FIB
     MOV  r2,r0
     SUB  r1,r1,#1
     BL   FIB
     ADD  r0,r0,r2
EXIT:
     LDR  lr,[sp,#0]
     LDR  r1,[sp,#8]
     LDR  r2,[sp,#4]
     ADD  sp,sp,#12
     MOV  pc,lr
b.
FIB:
     SUB  sp,sp,#12
     STR  lr,[sp,#0]
     STR  r2,[sp,#4]
     STR  r1,[sp,#8]
     CMP  r1,#1
     BGE  L1
     MOV  r0,r1
     B    EXIT
L1:  SUB  r1,r1,#1
     BL   FIB
     MOV  r2,r0
     SUB  r1,r1,#1
     BL   FIB
     ADD  r0,r0,r2
EXIT:
     LDR  lr,[sp,#0]
     LDR  r1,[sp,#8]
     LDR  r2,[sp,#4]
     ADD  sp,sp,#12
     MOV  pc,lr
2.20.4 [5] <2.8> The ARM assembly program above computes the Fibonacci of
a given input. The integer input is passed through register r1, and the result is
returned in register r0. In the assembly code, there are a few errors. Correct the
ARM errors.
2.20.5 [10] <2.8> For the recursive Fibonacci ARM program above, assume that the input is 4. Rewrite the Fibonacci program to operate in a nonrecursive manner. What is the total number of instructions used to execute your solution from 2.20.5 versus the recursive version of the Fibonacci program?
2.20.6 [5] <2.8> Show the contents of the stack after each function call, assuming that the input is 4.
Exercise 2.21
Assume that the stack and the static data segments are empty and that the stack and global pointers start at addresses 0x7fff fffc and 0x1000 8000, respectively. Assume the calling conventions as specified in Figure 2.11 and that function inputs are passed in register r0 and returned in register r1.
a.
main()
{
leaf_function(1);
}
int leaf_function (int f)
{
int result;
result = f + 1;
if (f > 5)
return result;
leaf_function(result);
}
b.
int my_global = 100;
main()
{
int x = 10;
int y = 20;
int z;
z = my_function(x, my_global);
}
int my_function(int x, int y)
{
return x - y;
}
2.21.1 [5] <2.8> Show the contents of the stack and the static data segments after
each function call.
2.21.2 [5] <2.8> Write ARM code for the code in the table above.
2.21.3 [5] <2.8> If the leaf function could use temporary registers, write the
ARM code for the code in the table above.
The following three problems in this exercise refer to this function, written in ARM
assembler following the calling conventions from Figure 2.14:
a.
f:   SUB  r4,r0,r3
     ADD  r4,r2,r4
     SUB  r4,r4,r1, LSL #1
     MOV  pc,lr
b.
f: ADD sp,sp,#8
STR lr,[sp,#4]
STR r3,[sp,#0]
MOV r3,r2
BL g
ADD r0,r0,r3
LDR lr,[sp,#4]
LDR r3,[sp,#0]
SUB sp,sp,#8
MOV pc,lr
2.21.4 [10] <2.8> This code contains a mistake that violates the ARM calling
convention. What is this mistake and how should it be fixed?
2.21.5 [10] <2.8> What is the C equivalent of this code? Assume that the function’s arguments are named a, b, c, etc. in the C version of the function.
2.21.6 [10] <2.8> At the point where this function is called, registers r0, r1, r2, and r3 have the values 1, 100, 1000, and 30, respectively. What is the value returned by this function? If another function g is called from f, assume that the value returned from g is always 500.
Exercise 2.22
This exercise explores ASCII and Unicode conversion. The following table shows
strings of characters.
a.
A byte
b.
computer
2.22.1 [5] <2.9> Translate the strings into decimal ASCII byte values.
2.22.2 [5] <2.9> Translate the strings into 16-bit Unicode (using hex notation
and the Basic Latin character set).
The following table shows hexadecimal ASCII character values.
a.
61 64 64
b.
73 68 69 66 74
2.22.3 [5] <2.5, 2.9> Translate the hexadecimal ASCII values to text.
Exercise 2.23
In this exercise, you will be asked to write an ARM assembly program that converts strings into the number format as specified in the table.
a.
positive integer decimal strings
b.
2’s complement hexadecimal integers
2.23.1 [10] <2.9> Write a program in ARM assembly language to convert an
ASCII number string with the conditions listed in the table above, to an integer.
Your program should expect register r0 to hold the address of a null-terminated
string containing some combination of the digits 0 through 9. Your program should
compute the integer value equivalent to this string of digits, then place the number
in register r1. If a non-digit character appears anywhere in the string, your program should stop with the value –1 in register r1. For example, if register r0 points
to a sequence of three bytes 50ten, 52ten, 0ten (the null-terminated string “24”), then
when the program stops, register r1 should contain the value 24ten.
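For reference only, the conversion described above (for the decimal-string case) can be sketched in C as follows; the exercise asks for the ARM assembly version, and the names here are illustrative:

   #include <stdio.h>

   /* Returns the integer value of a null-terminated decimal string,
      or -1 if any non-digit character is encountered. */
   static int str_to_int(const char *s) {
       int value = 0;
       for (; *s != '\0'; s++) {
           if (*s < '0' || *s > '9')
               return -1;                      /* non-digit: signal an error */
           value = value * 10 + (*s - '0');
       }
       return value;
   }

   int main(void) {
       printf("%d\n", str_to_int("24"));       /* prints 24 */
       printf("%d\n", str_to_int("2x4"));      /* prints -1 */
       return 0;
   }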
Exercise 2.24
Assume that the register r1 contains the address 0x1000 0000 and the register r2
contains the address 0x1000 0010.
a.
LDRB r0, [r1,#0]
STRH r0, [r2,#0]
b.
LDRB r0, [r1,#0]
STRB r0, [r2,#0]
2.24.1 [5] <2.9> Assume that the data (in hexadecimal) at address 0x1000 0000 is:
1000 0000   12  34  56  78
What value is stored at the address pointed to by register r2? Assume that the memory location pointed to by r2 is initialized to 0xFFFF FFFF.
2.24.2 [5] <2.9> Assume that the data (in hexadecimal) at address 0x1000 0000 is:
1000 0000   80  80  80  80
What value is stored at the address pointed to by register r2? Assume that the memory location pointed to by r2 is initialized to 0x0000 0000.
2.24.3 [5] <2.9> Assume that the data (in hexadecimal) at address 0x1000 0000 is:
1000 0000   11  00  00  FF
What value is stored at the address pointed to by register r2? Assume that the memory location pointed to by r2 is initialized to 0x5555 5555.
Exercise 2.25
In this exercise, you will explore 32-bit constants in ARM. For the following problems, you will be using the binary data in the table below.
a.
1010 1101 0001 0000 0000 0000 0000 0010two
b.
1111 1111 1111 1111 1111 1111 1111 1111two
2.25.1 [10] <2.10> Write the ARM code that creates the 32-bit constants listed above and stores each value in register r1.
2.25.2 [5] <2.6, 2.10> If the current value of the PC is 0x00000000, write the instruction(s) needed to get the PC to the address shown in the table above.
2.25.3 [10] <2.10> If the immediate field of an ARM instruction were only 8 bits wide, would it be possible to create 32-bit constants? If so, write the ARM code to do that.
2.25.4 [5] <2.10> How would you create the one’s complement of these numbers?
2.25.5 [5] <2.10> Write the ARM machine code for the instructions used in
2.25.1.
Exercise 2.26
For this exercise, you will explore the addressing modes in ARM. Consider the four
addressing modes given in the table below.
a.
Register offset
b.
Scaled register offset
c.
Pre-indexed
d.
Post-indexed
2.26.1 [5] <2.10> Give an example ARM instruction for each of the above ARM
addressing modes.
2.26.2 [5] <2.10> For the instructions in 2.26.1, what is the instruction format
type used for the given instruction?
2.26.3 [5] <2.10> List benefits and drawbacks of each ARM addressing mode.
Write ARM code that shows these benefits and drawbacks.
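As a rough guide before the tracing problems that follow (not an answer to them), the four modes in the table correspond approximately to the C pointer arithmetic below; base and idx are hypothetical stand-ins for Rn and Rm, and C arithmetic works in elements while the ARM forms work in bytes:

   #include <stdio.h>

   int main(void) {
       int a[8] = {10, 11, 12, 13, 14, 15, 16, 17};
       int *base = a;
       int idx = 2, x;

       x = *(base + idx);      /* register offset:        like LDR r0, [r1, r2]           */
       x = base[idx];          /* scaled register offset: like LDR r0, [r1, r2, LSL #2];
                                  C indexing already scales by the element size           */
       base += 1; x = *base;   /* pre-indexed:  like LDR r0, [r1, #4]!  (update base, then load) */
       x = *base; base += 1;   /* post-indexed: like LDR r0, [r1], #4   (load, then update base) */

       printf("last value loaded: %d\n", x);   /* prints 11 */
       return 0;
   }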
For the following problems, you will use the instructions and information given
below to determine the effect of the different addressing modes.
LDR  r0, [r1, #4]
LDR  r2, [r0, r3, LSL #2]
LDR  r3, [r1, #4]!
STR  r0, [r2], #4

r0 = 0x00000000   r1 = 0x00009000   r2 = 0x00009004   r3 = 0x00000002
mem[0x00000000] = 0x01010101
mem[0x00000004] = 0x04040404
mem[0x00000008] = 0x05050505
mem[0x00009000] = 0x02020202
mem[0x00009004] = 0x00000000
mem[0x00009008] = 0x06060606
2.26.4 [10] <2.10> What will be the values in the registers r0, r1, r2 and r3
when the above instructions are executed?
2.26.5 [10] <2.10> What will be the values in the memory locations given above?
Exercise 2.27
For this exercise, you will explore the addressing modes in ARM. Consider the four
addressing modes given in the table below.
a.
Immediate
b.
Scaled register
c.
Scaled register pre-indexed
d.
PC-relative
2.27.1 [5] <2.10> Give an example ARM instruction for each of the above ARM
addressing modes.
2.27.2 [5] <2.10> For the instructions in 2.27.1, what is the instruction format
type used for the given instruction?
2.27.3 [5] <2.10> List benefits and drawbacks of each ARM addressing mode.
Write ARM code that shows these benefits and drawbacks.
Exercise 2.28
The following table contains ARM assembly code for a lock.
try:  MOV  R3,#1
      SWP  R2,R3,[R1,#0]
      CMP  R2,#1
      BEQ  try
      LDR  R4,[R2,#0]
      ADD  R3,R4,#1
      STR  R3,[R2,#0]
      SWP  R2,R3,[R1,#0]
2.28.1 [5] <2.11> For each test and fail of the “swp”, how many instructions need
to be executed?
2.28.2 [5] <2.11> For the swp-based lock-unlock code above, explain why this
code may fail.
2.28.3 [15] <2.11> Rewrite the code above so that it operates correctly. Be sure to avoid any race conditions.
Each entry in the following table has code and also shows the contents of various registers. The notation "(r1)" denotes the contents of the memory location pointed to by register r1. The assembly code in each entry is executed in the cycles shown on parallel processors with a shared memory space.
a.
Processor 1 executes SWP R2,R3,[R1,#0] in cycle 1; Processor 2 executes SWP R2,R3,[R1,#0] in cycle 2.
(Fill in, for cycles 0 through 2, the columns Processor 1 r2, Processor 1 r3, MEM (r1), Processor 2 r2, and Processor 2 r3; the initial values given at cycle 0 are 99, 30, and 40.)
b.
Both processors execute the same lock code:
try:  MOV  R3,#1
      SWP  R2,R3,[R1,#0]
      CMP  R2,#1
      BEQ  try
Processor 1 issues its MOV, SWP, CMP, and BEQ first, and Processor 2 issues the same sequence starting one cycle later, over cycles 1 through 6.
(Fill in, for cycles 0 through 6, the columns Processor 1 r2, Processor 1 r3, MEM (r1), Processor 2 r2, and Processor 2 r3; the initial values given at cycle 0 are 1, 10, and 20.)
2.28.4 [5] <2.11> Fill out the table with the values of the registers for each given cycle.
Exercise 2.29
The first three problems in this exercise refer to a critical section of the form
lock(lk);
operation
unlock(lk);
where the “operation” updates the shared variable shvar using the local (nonshared)
variable x as follows:
Operation
a.
shvar=shvar+x;
b.
shvar=min(shvar,x);
2.29.1 [10] <2.11> Write the ARM assembly code for this critical section, assuming that the address of the lk variable is in r1, the address of the shvar variable is in r4, and the value of variable x is in r5. Your critical section should not contain any function calls, i.e., you should include the ARM instructions for the lock(), unlock(), max(), and min() operations in-line. Use SWP instructions to implement the lock() operation; the unlock() operation is simply an ordinary store instruction.
2.29.2 [10] <2.11> Repeat problem 2.29.1, but this time use swp to perform an
atomic update of the shvar variable directly, without using lock() and unlock().
Note that in this problem there is no variable lk.
2.29.3 [10] <2.11> Compare the best-case performance of your code from
2.29.1 and 2.29.2, assuming that each instruction takes one cycle to execute. Note:
best-case means that swp always succeeds, the lock is always free when we want to
lock(), and if there is a branch we take the path that completes the operation with
fewer executed instructions.
2.29.4 [10] <2.11> Using your code from 2.29.2 as an example, explain what
happens when two processors begin to execute this critical section at the same
time, assuming that each processor executes exactly one instruction per cycle.
2.29.5 [10] <2.11> Explain why in your code from 2.29.2 register r4 contains the
address of variable shvar and not the value of that variable, and why register r5
contains the value of variable x and not its address.
2.29.6 [10] <2.11> If we want to atomically perform the same operation on two
shared variables (e.g., shvar1 and shvar2) in the same critical section, we can do
this easily using the approach from 2.29.1 (simply put both updates between the
lock operation and the corresponding unlock operation). Explain why we cannot
do this using the approach from 2.29.2, i.e., why we cannot use swp to access both
shared variables in a way that guarantees that both updates are executed together
as a single atomic operation.
Exercise 2.30
Assembler pseudoinstructions are not a part of the ARM instruction set, but often
appear in ARM programs. The table below contains some ARM pseudoinstructions that, when assembled, are translated to other ARM assembly instructions.
a.
LDR r0,#constant
2.30.1 [5] <2.12> For each pseudoinstruction in the table above, give at least two different sequences of actual ARM instructions that accomplish the same thing.
2.30.2 [5] <2.12> If the constant value is 0xFFF0, which of these would you choose?
Exercise 2.31
The table below contains the link-level details of two different procedures. In this
exercise, you will be taking the place of the linker.
a.
Procedure A
  Text Segment       Address 0: LDR r0, [r3, #0]
                     Address 4: BL 0
                     …
  Data Segment       Address 0: (X)
                     …
  Relocation Info    Address 0: instruction type LDR, dependency X
                     Address 4: instruction type BL,  dependency B
  Symbol Table       Address —: symbol X
                     Address —: symbol B

Procedure B
  Text Segment       Address 0: STR r1, [r3, #0]
                     Address 4: BL 0
                     …
  Data Segment       Address 0: (Y)
                     …
  Relocation Info    Address 0: instruction type STR, dependency Y
                     Address 4: instruction type BL,  dependency A
  Symbol Table       Address —: symbol Y
                     Address —: symbol A
b.
Procedure A
  Text Segment       Address 0: LDR r0, [r3, #0]
                     Address 4: ORR r1, r0, #0
                     Address 8: BL 0
                     …
  Data Segment       Address 0: (X)
                     …
  Relocation Info    Address 0: instruction type LDR, dependency X
                     Address 4: instruction type ORR, dependency X
                     Address 8: instruction type BL,  dependency B
  Symbol Table       Address —: symbol X
                     Address —: symbol B

Procedure B
  Text Segment       Address 0: STR r0, [r3, #0]
                     Address 4: B 0
                     …
                     Address 0x180: MOV pc, lr
                     …
  Data Segment       Address 0: (Y)
                     …
  Relocation Info    Address 0: instruction type STR, dependency Y
                     Address 4: instruction type B,   dependency FOO
  Symbol Table       Address —: symbol Y
                     Address 0x180: symbol FOO
2.31.1 [5] <2.12> Link the object files above to form the executable file header.
Assume that Procedure A has a text size of 0x140, data size of 0x40 and Procedure B
has a text size of 0x300 and data size of 0x50. Also assume the memory allocation
strategy as shown in Figure 2.13.
2.31.2 [5] <2.12> What limitations, if any, are there on the size of an executable?
Exercise 2.32
The first three problems in this exercise assume that function swap, instead of the
code in Figure 2.22, is defined in C as follows:
a.
void swap(int v[], int k, int j){
int temp;
temp=v[k];
v[k]=v[j];
v[j]=temp;
}
b.
void swap(int *p){
int temp;
temp=*p;
*p=*(p+1);
*(p+1)=*p;
}
2.32.1 [10] <2.13> Translate this function into ARM assembler code.
2.32.2 [5] <2.13> What needs to change in the sort function?
2.32.3 [5] <2.13> If we were sorting 8-bit bytes, not 32-bit words, how would
your ARM code for swap in 2.32.1 change?
Exercise 2.33
The problems in this exercise refer to the following function, given as array code:
a.
int find(int a[], int n, int x){
int i;
for(i=0;i!=n;i++)
if(a[i]==x)
return i;
return -1;
}
b.
int count(int a[], int n, int x){
int res=0;
int i;
for(i=0;i!=n;i++)
if(a[i]==x)
res=res+1;
return res;
}
2.33.1 [10] <2.14> Translate this function into ARM assembly.
2.33.2 [10] <2.14> Convert this function into pointer-based code (in C).
2.33.3 [10] <2.14> Translate your pointer-based C code from 2.33.2 into ARM
assembly.
2.33.4 [5] <2.14> Compare the worst-case number of executed instructions per
nonlast loop iteration in your array-based code from 2.33.1 and your pointer-based
code from 2.33.3. Note: the worst-case occurs when branch conditions are such
that the longest path through the code is taken, i.e., if there is an if statement, the
result of the condition check is such that the path with more instructions is taken.
However, if the result of the condition check would cause the loop to exit, then we
assume that the path that keeps us in the loop is taken.
2.33.5 [5] <2.14> Compare the number of registers needed for your array-based
code from 2.33.1 and for your pointer-based code from 2.33.3.
Exercise 2.34
The table below contains ARM assembly code. In the following problems, you will
translate ARM assembly code to MIPS.
a.
LOOP:  MOV   r0, #10       ; init loop counter to 10
       ADD   r0, r1        ; add r1 to r0
       SUBS  r0, 1         ; decrement counter
       BNE   LOOP          ; if Z=0 repeat loop
b.
       ROR   r1, r2, #4    ; r1 = r2[3:0] concatenated with r2[31:4]
2.34.1 [5] <2.16> For the table above, translate this ARM assembly code to MIPS
assembly code. Assume that ARM registers r0, r1, and r2 hold the same values as
MIPS registers $s0, $s1, and $s2, respectively. Use MIPS temporary registers
($t0, etc.) where necessary.
2.34.2 [5] <2.16> For the MIPS assembly instructions in 2.34.1, indicate the
instruction types.
The table below contains MIPS assembly code. In the following problems, you will
translate MIPS assembly code to ARM.
a.
slt $t0, $s0, $s1
blt $t0, $0, FARAWAY
b.
add $s0, $s1, $s2
2.34.3 [5] <2.16> For the table above, find the ARM assembly code that corresponds to the sequence of MIPS assembly code.
2.34.4 [5] <2.16> Show the bit fields that represent the ARM assembly code.
Exercise 2.35
The ARM processor has a few different addressing modes that are not supported in
MIPS. The following problems explore how these addressing modes can be realized on MIPS.
a.
   LDR    r0, [r1]            ; r0 = memory[r1]
b.
   LDMIA  r0, {r1, r2, r4}    ; r1 = memory[r0], r2 = memory[r0+4], r4 = memory[r0+8]
2.35.1 [5] <2.16> Identify the type of addressing mode of the ARM assembly
instructions in the table above.
2.35.2 [5] <2.16> For the ARM assembly instructions above, write a sequence of MIPS assembly instructions to accomplish the same data transfer. In the following problems, you will compare code written using the ARM and MIPS instruction sets.
The following table shows code written in the ARM instruction set.
a.
        LDR   r0, =Table1    ; load base address of table
        LDR   r1, #100       ; initialize loop counter
        EOR   r2, r2, r2     ; clear r2
ADDLP:  LDR   r4, [r0]       ; get first addition operand
        ADD   r2, r2, r4     ; add to r2
        ADD   r0, r0, #4     ; increment to next table element
        SUBS  r1, r1, #1     ; decrement loop counter
        BNE   ADDLP          ; if loop counter != 0, go to ADDLP
b.
        ROR   r1, r2, #4     ; r1 = r2[3:0] concatenated with r2[31:4]
2.35.3 [10] <2.16> For the ARM assembly code above, write an equivalent MIPS
assembly code routine.
2.35.4 [5] <2.16> What is the total number of ARM assembly instructions
required to execute the code? What is the total number of MIPS assembly instructions required to execute the code?
2.35.5 [5] <2.16> Assuming that the average CPI of the MIPS assembly routine is the same as the average CPI of the ARM assembly routine, and that the MIPS processor has an operating frequency 1.5 times that of the ARM processor, how much faster is the ARM processor than the MIPS processor?
Exercise 2.36
The ARM processor has an interesting way of supporting immediate constants. This exercise investigates how that support differs from MIPS. The following table contains ARM instructions.
a.
ADD r3, r2, r1, LSL #3     ; r3 = r2 + (r1 << 3)
b.
ADD r3, r2, r1, ROR #3     ; r3 = r2 + (r1 rotated right by 3 bits)
2.36.1 [5] <2.16> Write the equivalent MIPS code for the ARM assembly code
above.
2.36.2 [5] <2.16> If the register R1 had the constant value of 8, re-write your
MIPS code to minimize the number of MIPS assembly instructions needed.
2.36.3 [5] <2.16> If the register R1 had the constant value of 0x06000000, rewrite
your MIPS code to minimize the number of MIPS assembly instructions needed.
The following table contains MIPS instructions.
a.
addi r3, r2, 0x1
b.
addi r3, r2, 0x8000
2.36.4 [5] <2.16> For the MIPS assembly code above, write the equivalent ARM
assembly code.
Exercise 2.37
This exercise explores the differences between the ARM and x86 instruction sets.
The following table contains x86 assembly code.
a.
   mov edx, [esi+4*ebx]
b.
START:
   mov  ax, 00101100b
   mov  cx, 00000011b
   mov  bx, 11110000b
   and  ax, bx
   or   ax, cx
2.37.1 [10] <2.17> Write pseudo code for the given routine.
2.37.2 [10] <2.17> What is the equivalent ARM for the given routine?
The following table contains x86 assembly instructions.
a.
mov edx, [esi+4*ebx]
b.
add
eax, 0x12345678
2.37.3 [5] <2.17> For each assembly instruction, show the size of each of the
bit fields that represent the instruction. Treat the label MY_FUNCTION as a 32-bit
constant.
2.37.4 [10] <2.17> Write equivalent ARM assembly statements.
Exercise 2.38
The x86 instruction set includes the REP prefix that causes the instruction to be
repeated a given number of times or until a condition is satisfied. The first three
problems in this exercise refer to the following x86 instruction:
      Instruction    Interpretation
a.    REP MOVSB      Repeat until ECX is zero:
                     Mem8[EDI]=Mem8[ESI], EDI=EDI+1, ESI=ESI+1, ECX=ECX-1
b.    REP MOVSD      Repeat until ECX is zero:
                     Mem32[EDI]=Mem32[ESI], EDI=EDI+4, ESI=ESI+4, ECX=ECX-1
2.38.1 [5] <2.17> What would be a typical use for this instruction?
2.38.2 [5] <2.17> Write ARM code that performs the same operation, assuming
that r0 corresponds to ECX, r1 to EDI, r2 to ESI, and r3 to EAX.
2.38.3 [5] <2.17> If the x86 instruction takes one cycle to read memory, one
cycle to write memory, and one cycle for each register update, and if ARM takes one
cycle per instruction, what is the speed-up of using this x86 instruction instead of
the equivalent ARM code when ECX is very large? Assume that the clock cycle time
for x86 and ARM is the same.
The remaining three problems in this exercise refer to the following function, given
in both C and x86 assembly. For each x86 instruction, we also show its length in the
x86 variable-length instruction format and the interpretation (what the instruction
does). Note that the x86 architecture has very few registers compared to ARM, and
as a result the x86 calling convention is to push all arguments onto the stack. The
return value of an x86 function is passed back to the caller in the EAX register.
a.
C code:
   int f(int a, int b){
     return a+b;
   }
x86 code:
   f: push %ebp              ; 1B, push %ebp to stack
      mov  %esp,%ebp         ; 2B, move %esp to %ebp
      mov  0xc(%ebp),%eax    ; 3B, load 2nd arg to %eax
      add  0x8(%ebp),%eax    ; 3B, add 1st arg to %eax
      pop  %ebp              ; 1B, restore %ebp
      ret                    ; 1B, return

b.
C code:
   void f(int *a, int *b){
     *a=*a+*b;
     *b=*a;
   }
x86 code:
   f: push %ebp              ; 1B, push %ebp to stack
      mov  %esp,%ebp         ; 2B, move %esp to %ebp
      mov  8(%ebp),%eax      ; 3B, load 1st arg into %eax
      mov  12(%ebp),%ecx     ; 3B, load 2nd arg into %ecx
      mov  (%eax),%edx       ; 2B, load *a into %edx
      add  (%ecx),%edx       ; 2B, add *b to %edx
      mov  %edx,(%eax)       ; 2B, store %edx to *a
      mov  %edx,(%ecx)       ; 2B, store %edx to *b
      pop  %ebp              ; 1B, restore %ebp
      ret                    ; 1B, return
2.38.4 [5] <2.17> Translate this function into ARM assembly. Compare the size
(how many bytes of instruction memory are needed) for this x86 code and for your
ARM code.
2.38.5 [5] <2.17> If the processor can execute two instructions per cycle, it must
at least be able to read two consecutive instructions in each cycle. Explain how it
would be done in ARM and how it would be done in x86.
2.38.6 [5] <2.17> If each ARM instruction takes one cycle, and if each x86
instruction takes one cycle plus a cycle for each memory read or write it has to
perform, what is the speed-up of using x86 instead of ARM? Assume that the clock
cycle time is the same in both x86 and ARM, and that the execution takes the shortest possible path through the function (i.e., every loop is exited immediately and
every if statement takes the direction that leads toward the return from the function). Note that the x86 ret instruction reads the return address from the stack.
Exercise 2.39
The CPI of the different instruction types is given in the following table.
        Arithmetic    Load/Store    Branch
a.      2             10            3
b.      1             10            4
2.39.1 [5] <2.18> Assume the following instruction breakdown for executing a given program:

              Instructions (in millions)
Arithmetic    500
Load/Store    300
Branch        100
What is the execution time for the processor if the operation frequency is 5 GHz?
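As a reminder of the relationship being exercised here, execution time is total cycles divided by clock rate, where total cycles sums count × CPI over the instruction classes. The C sketch below uses made-up numbers, not the ones in the tables:

   #include <stdio.h>

   int main(void) {
       /* Made-up instruction mix, not the numbers from the tables above. */
       double counts[3] = {100e6, 50e6, 25e6};   /* arithmetic, load/store, branch */
       double cpi[3]    = {1.0,   5.0,  2.0};
       double clock_hz  = 2e9;                   /* 2 GHz */
       double cycles    = 0.0;
       int i;

       for (i = 0; i < 3; i++)
           cycles += counts[i] * cpi[i];

       printf("execution time = %g seconds\n", cycles / clock_hz);  /* 0.2 seconds */
       return 0;
   }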
2.39.2 [5] <2.18> Suppose that new, more powerful arithmetic instructions are added to the instruction set. On average, through the use of these more powerful arithmetic instructions, we can reduce the number of arithmetic instructions needed to execute a program by 25%, at the cost of increasing the clock cycle time by only 10%. Is this a good design choice? Why?
2.39.3 [5] <2.18> Suppose that we find a way to double the performance of arithmetic instructions. What is the overall speedup of our machine? What if we find a way to improve the performance of arithmetic instructions by 10 times?
The following table shows the proportions of instruction execution for the different instruction types.
        Arithmetic    Load/Store    Branch
a.      60%           20%           20%
b.      80%           15%           5%
2.39.4 [5] <2.18> Given the instruction mix above and the assumption that an
arithmetic instruction requires 2 cycles, a load/store instruction takes 6 cycles, and
a branch instruction takes 3 cycles, find the average CPI.
2.39.5 [5] <2.18> For a 25% improvement in performance, how many cycles, on
average, may an arithmetic instruction take if load/store and branch instructions
are not improved at all?
2.39.6 [5] <2.18> For a 50% improvement in performance, how many cycles, on
average, may an arithmetic instruction take if load/store and branch instructions
are not improved at all?
Exercise 2.40
The first three problems in this exercise refer to the following function, given in ARM assembly. Unfortunately, the programmer of this function has fallen prey to the pitfall of assuming that ARM is a word-addressed machine, when in fact ARM is byte-addressed.
a.
; int f(int a[], int n, int x);
f:  MOV  R4,#0         ; ret=0
    MOV  R5,#0         ; i=0
L:  ADD  R6,R5,R0      ; &(a[i])
    LDR  R6,[R6,#0]    ; read a[i]
    CMP  R6,R2         ; if( a[i]==x)
    BNE  S
    ADD  R4,R4,#1      ; ret++
S:  ADD  R5,R5,#1      ; i++
    CMP  R5,R1
    BNE  L             ; repeat if i!=n
    MOV  R0,R4         ; return ret
    MOV  PC,LR
b.
; void f(int *a, int *b, int n);
f:  MOV  R4,R0         ; p=a
    MOV  R5,R1         ; q=b
    ADD  R6,R2,R0      ; &(a[n])
L:  LDR  R7,[R4,#0]    ; read *p
    LDR  R8,[R5,#0]    ; read *q
    ADD  R7,R7,R8      ; *p + *q
    STR  R7,[R4,#0]    ; *p = *p + *q
    ADD  R4,R4,#1      ; p=p+1
    ADD  R5,R5,#1      ; q=q+1
    CMP  R4,R6
    BNE  L             ; repeat if p != &(a[n])
    MOV  PC,LR         ; return
Note that in ARM assembly the “;” character denotes that the remainder of the line
is a comment.
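The byte-addressing pitfall mentioned in the introduction comes down to how element addresses are formed; here is a small C sketch (illustrative, not part of the exercise code):

   #include <stdio.h>

   int main(void) {
       int a[4] = {10, 20, 30, 40};
       int i;

       /* On a byte-addressed machine, a[i] lives 4*i bytes past a[0] for
          32-bit ints, not i bytes past it. */
       for (i = 0; i < 4; i++)
           printf("a[%d] is %ld bytes past a[0]\n",
                  i, (long)((char *)&a[i] - (char *)&a[0]));   /* prints 0, 4, 8, 12 */
       return 0;
   }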
2.40.1 [5] <2.18> The ARM architecture requires word-sized accesses (LDR and STR) to be word-aligned, i.e., the two lowermost bits of the address must both be zero. Explain how this alignment requirement affects the execution of this function.
2.40.2 [5] <2.18> If “a” was a pointer to the beginning of an array of one-byte
elements, and if we replaced LDR and STR with LDRB (load byte) and STRB (store
byte), respectively, would this function be correct?
2.40.3 [5] <2.18> Change this code to make it correct for 32-bit integers.
The remaining three problems in this exercise refer to a program that allocates
memory for an array, fills the array with some numbers, calls the sort function
from Figure 2.25, and then prints out the array. The main function of the program
is as follows (given as both C and ARM code):
main code in C
main(){
int *v;
int n=5;
v=my_alloc(5);
my_init(v,n);
sort(v,n);
.
.
.
ARM version of the main code
main:
MOV R9,#5
MOV R0,R9
BL my_alloc
MOV R10,R0
MOV R0,R10
MOV R1,R9
BL my_init
MOV R0,R10
MOV R1,R9
BL sort
The my_alloc function is defined as follows (given as both C and ARM code).
Note that the programmer of this function has fallen prey to the pitfall of using a
pointer to an automatic variable arr outside the function in which it is defined.
my_alloc in C:
   int *my_alloc(int n){
     int arr[n];
     return arr;
   }

ARM code for my_alloc:
   my_alloc:
       SUB  SP,SP,4         ; push
       STR  R11,[SP,#0]     ;   r11 to stack
       MOV  R11,SP          ; save sp in r11
       LSL  R4,R0,R2        ; We need 4*n bytes
       SUB  SP,SP,R4        ; Make room for arr
       MOV  R0,SP           ; Return address of arr
       MOV  SP,R11          ; Restore sp from r11
       LDR  R11,[SP,#0]     ; pop r11
       ADD  SP,SP,#4        ;   from stack
       MOV  PC,LR
The my_init function is defined as follows (ARM code):
a.
my_init:
    MOV  R4,#0        ; i=0
    MOV  R5,R0
L:  MOV  R6,#0
    STR  R6,[R5,#0]   ; v[i]=0
    ADD  R5,R5,#4
    ADD  R4,R4,#1     ; i=i+1
    CMP  R4,R1
    BNE  L            ; until i==n
    MOV  PC,LR
b.
my_init:
    MOV  R4,#0        ; i=0
    MOV  R5,R0
L:  SUB  R6,R1,R4     ; a[i]=n-i
    STR  R6,[R5,#0]
    ADD  R5,R5,#4
    ADD  R4,R4,#1     ; i=i+1
    CMP  R4,R1
    BNE  L            ; until i==n
    MOV  PC,LR
2.40.4 [5] <2.18> What are the contents (values of all five elements) of array v
right before the “BL sort” instruction in the main code is executed?
2.40.5 [15] <2.18, 2.13> What are the contents of array v right before the sort
function enters its outer loop for the first time? Assume that registers sp, r9, and
r10 have values of 0x1000, 20, and 40, respectively, at the beginning of the main
code.
2.40.6 [10] <2.18, 2.13> What are the contents of the 5-element array pointed to by v right after "BL sort" returns to the main code?
Answers to Check Yourself
§2.2, page 80: ARM, C, Java
§2.3, page 86: 2) Very slow
§2.4, page 92: 3) –8ten
§2.5, page 99: 4) sub r2, r0, r1
§2.6, page 104: Both. AND with a mask pattern of 1s will leave 0s everywhere but the desired field. Shifting left by the right amount removes the bits from the left of the field. Shifting right by the appropriate amount puts the field into the rightmost bits of the word, with 0s in the rest of the word. Note that AND leaves the field where it was originally, and the shift pair moves the field into the rightmost part of the word.
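A small C sketch of the two approaches just described, extracting a hypothetical 8-bit field that starts at bit 8:

   #include <stdio.h>

   int main(void) {
       unsigned int word = 0x12345678;

       /* AND with a mask of 1s: keeps the field in place, zeroes everything else. */
       unsigned int masked  = word & 0x0000FF00;        /* 0x00005600 */

       /* Shift pair: shifting left removes the bits above the field, and shifting
          right then moves the field into the rightmost bits of the word. */
       unsigned int shifted = (word << 16) >> 24;       /* 0x00000056 */

       printf("%08X %08X\n", masked, shifted);
       return 0;
   }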
§2.7, page 112: I. All are true. II. 1).
§2.8, page 122: Both are true.
§2.9, page 127: I. 2) II. 3)
§2.11, page 134: Both are true.
§2.12, page 143: 4) Machine independence.
3
Arithmetic for
Computers
Numerical precision
is the very soul
of science.
Sir D’arcy Wentworth Thompson
On Growth and Form, 1917
3.1 Introduction 216
3.2 Addition and Subtraction 216
3.3 Multiplication 220
3.4 Division 226
3.5 Floating Point 232
3.6 Parallelism and Computer Arithmetic: Associativity 258
3.7 Real Stuff: Floating Point in the x86 259
3.8 Fallacies and Pitfalls 262
3.9 Concluding Remarks 265
3.10 Historical Perspective and Further Reading 268
3.11 Exercises 269
The Five Classic Components of a Computer
3.1
Introduction
Computer words are composed of bits; thus, words can be represented as binary
numbers. Chapter 2 shows that integers can be represented either in decimal
or binary form, but what about the other numbers that commonly occur? For
example:
■ What about fractions and other real numbers?
■ What happens if an operation creates a number bigger than can be represented?
■ And underlying these questions is a mystery: How does hardware really multiply or divide numbers?
The goal of this chapter is to unravel these mysteries including representation of
real numbers, arithmetic algorithms, hardware that follows these algorithms, and
the implications of all this for instruction sets. These insights may explain quirks
that you have already encountered with computers.
Subtraction: Addition’s
Tricky Pal
No. 10, Top Ten Courses
for Athletes at a
Football Factory, David
Letterman et al., Book of
Top Ten Lists, 1990
3.2
Addition and Subtraction
Addition is just what you would expect in computers. Digits are added bit by bit
from right to left, with carries passed to the next digit to the left, just as you would
do by hand. Subtraction uses addition: the appropriate operand is simply negated
before being added.
Binary Addition and Subtraction
EXAMPLE
Let’s try adding 6ten to 7ten in binary and then subtracting 6ten from 7ten in
binary.
   0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
+  0000 0000 0000 0000 0000 0000 0000 0110two = 6ten
=  0000 0000 0000 0000 0000 0000 0000 1101two = 13ten
The 4 bits to the right have all the action; Figure 3.1 shows the sums and carries. The carries are shown in parentheses, with the arrows showing how they
are passed.
ANSWER
Subtracting 6ten from 7ten can be done directly:

   0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
–  0000 0000 0000 0000 0000 0000 0000 0110two = 6ten
=  0000 0000 0000 0000 0000 0000 0000 0001two = 1ten

or via addition using the two's complement representation of −6:

   0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
+  1111 1111 1111 1111 1111 1111 1111 1010two = –6ten
=  0000 0000 0000 0000 0000 0000 0000 0001two = 1ten
(Carries)      (0) (0) (1) (1) (0)
      . . .     0   0   0   1   1   1
      . . .     0   0   0   1   1   0
      . . .     0   0   1   1   0   1
FIGURE 3.1 Binary addition, showing carries from right to left. The rightmost bit adds
1 to 0, resulting in the sum of this bit being 1 and the carry out from this bit being 0. Hence, the operation for the second digit to the right is 0 + 1 + 1. This generates a 0 for this sum bit and a carry out of 1.
The third digit is the sum of 1 + 1 + 1, resulting in a carry out of 1 and a sum bit of 1. The fourth bit is
1 + 0 + 0, yielding a 1 sum and no carry.
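As a quick check of the subtraction-by-negation idea in the example above, a short C sketch:

   #include <stdio.h>

   int main(void) {
       int x = 7, y = 6;

       /* Two's complement negation: -y equals (~y) + 1, so x - y == x + (~y) + 1. */
       printf("%d %d\n", x - y, x + (~y) + 1);   /* both print 1 */
       return 0;
   }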
Recall that overflow occurs when the result from an operation cannot be
represented with the available hardware, in this case a 32-bit word. When can
overflow occur in addition? When adding operands with different signs, overflow
cannot occur. The reason is the sum must be no larger than one of the operands.
For example, −10 + 4 = −6. Since the operands fit in 32 bits and the sum is no larger
than an operand, the sum must fit in 32 bits as well. Therefore, no overflow can
occur when adding positive and negative operands.
There are similar restrictions to the occurrence of overflow during subtract, but
it’s just the opposite principle: when the signs of the operands are the same, overflow cannot occur. To see this, remember that x − y = x + (−y) because we subtract
by negating the second operand and then add. Therefore, when we subtract operands of the same sign we end up by adding operands of different signs. From the
prior paragraph, we know that overflow cannot occur in this case either.
Knowing when overflow cannot occur in addition and subtraction is all well
and good, but how do we detect it when it does occur? Clearly, adding or subtracting two 32-bit numbers can yield a result that needs 33 bits to be fully expressed.
The lack of a 33rd bit means that when overflow occurs, the sign bit is set with the
value of the result instead of the proper sign of the result. Since we need just one
extra bit, only the sign bit can be wrong. Hence, overflow occurs when adding two
positive numbers and the sum is negative, or vice versa. This means a carry out
occurred into the sign bit.
Overflow occurs in subtraction when we subtract a negative number from a
positive number and get a negative result, or when we subtract a positive number
from a negative number and get a positive result. This means a borrow occurred
from the sign bit. Figure 3.2 shows the combination of operations, operands, and
results that indicate an overflow.
We have just seen how to detect overflow for two’s complement numbers in
a computer. What about overflow with unsigned integers? Unsigned integers are
commonly used for memory addresses where overflows are ignored.
The computer designer must therefore provide a way to ignore overflow in some cases and to recognize it in others. The ARM solution is to have two conditional branches that test for overflow: BVS (branch if overflow set) and BVC (branch if overflow clear). The arithmetic instruction just needs an appended S to set the condition flags before the branch. Because C ignores overflows, the ARM C compilers would not set the condition flags and branch on overflow. The ARM Fortran compilers, however, would add the test if the operands were signed integers.
Arithmetic Logic Unit (ALU) Hardware that performs addition, subtraction, and usually logical operations such as AND and OR.

Operation    Operand A    Operand B    Result indicating overflow
A + B        ≥ 0          ≥ 0          < 0
A + B        < 0          < 0          ≥ 0
A – B        ≥ 0          < 0          < 0
A – B        < 0          ≥ 0          ≥ 0

FIGURE 3.2 Overflow conditions for addition and subtraction.
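The conditions in Figure 3.2 can be checked in software by comparing the signs of the operands with the sign of the 32-bit result; a minimal C sketch (the arithmetic is done in unsigned form so the wraparound itself is well defined in C):

   #include <stdio.h>
   #include <stdint.h>

   /* Overflow on A + B: the operands have the same sign but the 32-bit sum does not. */
   static int add_overflows(int32_t a, int32_t b) {
       int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b);   /* wraparound sum */
       return ((a >= 0) == (b >= 0)) && ((sum >= 0) != (a >= 0));
   }

   /* Overflow on A - B: treat it as A + (-B), as the text above explains. */
   static int sub_overflows(int32_t a, int32_t b) {
       int32_t diff = (int32_t)((uint32_t)a - (uint32_t)b);
       return ((a >= 0) != (b >= 0)) && ((diff >= 0) != (a >= 0));
   }

   int main(void) {
       printf("%d %d\n",
              add_overflows(0x70000000, 0x70000000),   /* 1: positive + positive gave a negative */
              sub_overflows(5, 3));                    /* 0: same-sign operands cannot overflow  */
       return 0;
   }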
Appendix C describes the hardware that performs addition and subtraction,
which is called an Arithmetic Logic Unit or ALU.
Arithmetic for Multimedia
Since every desktop microprocessor by definition has its own graphical displays,
as transistor budgets increased it was inevitable that support would be added for
graphics operations.
Many graphics systems originally used 8 bits to represent each of the three
primary colors plus 8 bits for a location of a pixel. The addition of speakers and
microphones for teleconferencing and video games suggested support of sound as
well. Audio samples need more than 8 bits of precision, but 16 bits are sufficient.
Every microprocessor has special support so that bytes and halfwords take up
less space when stored in memory (see Section 2.9), but due to the infrequency
of arithmetic operations on these data sizes in typical integer programs, there is
little support beyond data transfers. Architects recognized that many graphics
and audio applications would perform the same operation on vectors of this
data. By partitioning the carry chains within a 64-bit adder, a processor could
perform simultaneous operations on short vectors of eight 8-bit operands, four
16-bit operands, or two 32-bit operands. The cost of such partitioned adders was
small. These extensions have been called vector or SIMD, for single instruction,
multiple data (see Section 2.17 and Chapter 7).
One feature not generally found in general-purpose microprocessors is saturating operations. Saturation means that when a calculation overflows, the result is set to the largest positive number or the most negative number, rather than a modulo calculation as in two's complement arithmetic. Saturation is likely what you want for media operations. For example, the volume knob on a radio set would be frustrating if, as you turned it, the volume got continuously louder for a while and then immediately became very soft. A knob with saturation would stop at the highest volume no matter how far you turned it. Figure 3.3 shows arithmetic and logical operations found in many multimedia extensions to modern instruction sets.
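As a small illustration of the difference (a C sketch of the idea, not an instruction from any of these extensions), an 8-bit unsigned saturating add clamps at the largest value instead of wrapping around:

#include <stdint.h>

/* 8-bit unsigned saturating add: compute the full sum in a wider type and
   clamp it to 255 instead of letting it wrap modulo 256. */
uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (sum > 0xFF) ? 0xFF : (uint8_t)sum;   /* clamp on overflow */
}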
Instruction category      Operands
Unsigned add/subtract     Eight 8-bit or Four 16-bit
Saturating add/subtract   Eight 8-bit or Four 16-bit
Max/min/minimum           Eight 8-bit or Four 16-bit
Average                   Eight 8-bit or Four 16-bit
Shift right/left          Eight 8-bit or Four 16-bit
FIGURE 3.3  Summary of multimedia support for desktop computers.
Summary
A major point of this section is that, independent of the representation, the finite
word size of computers means that arithmetic operations can create results that are
too large to fit in this fixed word size. It's easy to detect overflow in unsigned numbers, although such overflows are almost always ignored because programs don't want to detect overflow for address arithmetic, the most common use of natural numbers.
Two’s complement presents a greater challenge, yet some software systems require
detection of overflow, so today all computers have a way to detect it.
The rising popularity of multimedia applications led to arithmetic instructions
that support narrower operations that can easily operate in parallel.
Check Yourself
Some programming languages allow two's complement integer arithmetic on variables declared byte and half. What ARM instructions would be used?
1. Load with LDRB, LDRH; arithmetic with ADD, SUB, MUL; then store using
STRB, STRH.
2. Load with LDRBS, LDRHS; arithmetic with ADD, SUB, MUL; then store using
STRB, STRH.
3. LDRBS, LDRHS; arithmetic with ADD, SUB, MUL; using AND to mask result to
8 or 16 bits after each operation; then store using STRB, STRH.
Elaboration: The speed of addition is increased by determining the carry in to the high-order bits sooner. There are a variety of schemes to anticipate the carry so that the worst-case scenario is a function of the log2 of the number of bits in the adder. These anticipatory signals are faster because they go through fewer gates in sequence, but it takes many more gates to anticipate the proper carry. The most popular is carry lookahead, which Section C.6 in Appendix C on the CD describes.
Multiplication is vexation, Division is as bad; The rule of three doth puzzle me, And practice drives me mad.
Anonymous, Elizabethan manuscript, 1570

3.3 Multiplication
Now that we have completed the explanation of addition and subtraction, we are
ready to build the more vexing operation of multiplication.
First, let’s review the multiplication of decimal numbers in longhand to remind
ourselves of the steps of multiplication and the names of the operands. For reasons
that will become clear shortly, we limit this decimal example to using only the digits 0 and 1. Multiplying 1000ten by 1001ten:
Multiplicand      1000ten
Multiplier     ×  1001ten
                  1000
                 0000
                0000
               1000
Product        1001000ten
The first operand is called the multiplicand and the second the multiplier. The
final result is called the product. As you may recall, the algorithm learned in grammar school is to take the digits of the multiplier one at a time from right to left,
multiplying the multiplicand by the single digit of the multiplier, and shifting the
intermediate product one digit to the left of the earlier intermediate products.
The first observation is that the number of digits in the product is considerably
larger than the number in either the multiplicand or the multiplier. In fact, if we ignore
the sign bits, the length of the multiplication of an n-bit multiplicand and an m-bit
multiplier is a product that is n + m bits long. That is, n + m bits are required to represent all possible products. Hence, like add, multiply must cope with overflow because
we frequently want a 32-bit product as the result of multiplying two 32-bit numbers.
In this example, we restricted the decimal digits to 0 and 1. With only two
choices, each step of the multiplication is simple:
1. Just place a copy of the multiplicand (1 × multiplicand) in the proper place
if the multiplier digit is a 1, or
2. Place 0 (0 × multiplicand) in the proper place if the digit is 0.
Although the decimal example above happens to use only 0 and 1, multiplication
of binary numbers must always use 0 and 1, and thus always offers only these two
choices.
Now that we have reviewed the basics of multiplication, the traditional next
step is to provide the highly optimized multiply hardware. We break with tradition
in the belief that you will gain a better understanding by seeing the evolution of
the multiply hardware and algorithm through multiple generations. For now, let’s
assume that we are multiplying only positive numbers.
Sequential Version of the Multiplication Algorithm
and Hardware
This design mimics the algorithm we learned in grammar school; Figure 3.4 shows
the hardware. We have drawn the hardware so that data flows from top to bottom
to resemble more closely the paper-and-pencil method.
Let’s assume that the multiplier is in the 32-bit Multiplier register and that the
64-bit Product register is initialized to 0. From the paper-and-pencil example
above, it’s clear that we will need to move the multiplicand left one digit each step,
as it may be added to the intermediate products. Over 32 steps, a 32-bit multiplicand would move 32 bits to the left. Hence, we need a 64-bit Multiplicand register,
initialized with the 32-bit multiplicand in the right half and zero in the left half.
This register is then shifted left 1 bit each step to align the multiplicand with the
sum being accumulated in the 64-bit Product register.
[Figure 3.4 datapath: a 64-bit Multiplicand register (shift left), a 32-bit Multiplier register (shift right), a 64-bit ALU, a 64-bit Product register (write), and a control test block.]
FIGURE 3.4 First version of the multiplication hardware. The Multiplicand register, ALU, and
Product register are all 64 bits wide, with only the Multiplier register containing 32 bits. ( Appendix C
describes ALUs.) The 32-bit multiplicand starts in the right half of the Multiplicand register and is shifted
left 1 bit on each step. The multiplier is shifted in the opposite direction at each step. The algorithm starts
with the product initialized to 0. Control decides when to shift the Multiplicand and Multiplier registers and
when to write new values into the Product register.
Figure 3.5 shows the three basic steps needed for each bit. The least significant
bit of the multiplier (Multiplier0) determines whether the multiplicand is added to
the Product register. The left shift in step 2 has the effect of moving the intermediate operands to the left, just as when multiplying with paper and pencil. The shift
right in step 3 gives us the next bit of the multiplier to examine in the following
iteration. These three steps are repeated 32 times to obtain the product. If each step
took a clock cycle, this algorithm would require almost 100 clock cycles to multiply
two 32-bit numbers.

[Figure 3.5 flowchart: Start. 1. Test Multiplier0. If Multiplier0 = 1, 1a. Add multiplicand to product and place the result in Product register; if Multiplier0 = 0, skip to step 2. 2. Shift the Multiplicand register left 1 bit. 3. Shift the Multiplier register right 1 bit. If fewer than 32 repetitions, repeat from step 1; after the 32nd repetition, done.]
FIGURE 3.5  The first multiplication algorithm, using the hardware shown in Figure 3.4. If the least significant bit of the multiplier is 1, add the multiplicand to the product. If not, go to the next step. Shift the multiplicand left and the multiplier right in the next two steps. These three steps are repeated 32 times.

The relative importance of arithmetic operations like multiply
varies with the program, but addition and subtraction may be anywhere from 5 to
100 times more popular than multiply. Accordingly, in many applications, multiply can take multiple clock cycles without significantly affecting performance. Yet
Amdahl’s law (see Section 1.8) reminds us that even a moderate frequency for a
slow operation can limit performance.
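As a concrete illustration, here is a minimal C sketch of the algorithm in Figure 3.5 (the function name and types are ours, not the book's); it performs the three steps literally, one set per loop iteration, rather than the refined one-cycle-per-step hardware described next:

#include <stdint.h>

/* First multiplication algorithm (Figure 3.5): 64-bit product, 64-bit
   multiplicand register shifted left, multiplier shifted right, 32 steps. */
uint64_t multiply_v1(uint32_t multiplicand, uint32_t multiplier)
{
    uint64_t product = 0;
    uint64_t mcand   = multiplicand;      /* multiplicand starts in right half */
    for (int step = 0; step < 32; step++) {
        if (multiplier & 1)               /* 1. test Multiplier0               */
            product += mcand;             /* 1a. add multiplicand to product   */
        mcand <<= 1;                      /* 2. shift Multiplicand left 1 bit  */
        multiplier >>= 1;                 /* 3. shift Multiplier right 1 bit   */
    }
    return product;
}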
This algorithm and hardware are easily refined to take 1 clock cycle per step.
The speed-up comes from performing the operations in parallel: the multiplier
and multiplicand are shifted while the multiplicand is added to the product if the
multiplier bit is a 1. The hardware just has to ensure that it tests the right bit of
the multiplier and gets the preshifted version of the multiplicand. The hardware
is usually further optimized to halve the width of the adder and registers by noticing where there are unused portions of registers and adders. Figure 3.6 shows the
revised hardware.
Hardware/Software Interface
Replacing arithmetic by shifts can also occur when multiplying by constants. Some compilers replace multiplies by short constants with a series of shifts and adds. Because one bit to the left represents a number twice as large in base 2, shifting the bits left has the same effect as multiplying by a power of 2. As mentioned in Chapter 2, almost every compiler will perform the strength reduction optimization of substituting a left shift for a multiply by a power of 2.
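For instance (an illustrative C sketch, not from the text), a compiler applying strength reduction treats these two functions as equivalent:

/* Multiplying by 8 = 2^3 and shifting left by 3 produce the same result for
   unsigned operands, so the compiler may substitute the cheaper shift. */
unsigned times_eight_multiply(unsigned x) { return x * 8;  }
unsigned times_eight_shift(unsigned x)    { return x << 3; }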
[Figure 3.6 datapath: a 32-bit Multiplicand register, a 32-bit ALU, and a 64-bit Product register (shift right, write) under a control test block.]
FIGURE 3.6 Refined version of the multiplication hardware. Compare with the first version in
Figure 3.4. The Multiplicand register, ALU, and Multiplier register are all 32 bits wide, with only the Product
register left at 64 bits. Now the product is shifted right. The separate Multiplier register also disappeared. The
multiplier is placed instead in the right half of the Product register. These changes are highlighted in color.
(The Product register should really be 65 bits to hold the carry out of the adder, but it’s shown here as 64 bits
to highlight the evolution from Figure 3.4.)
A Multiply Algorithm
EXAMPLE: Using 4-bit numbers to save space, multiply 2ten × 3ten, or 0010two × 0011two.
ANSWER: Figure 3.7 shows the value of each register for each of the steps labeled according to Figure 3.5, with the final value of 0000 0110two or 6ten. Color is used to indicate the register values that change on that step, and the bit circled is the one examined to determine the operation of the next step.
Signed Multiplication
So far, we have dealt with positive numbers. The easiest way to understand how
to deal with signed numbers is to first convert the multiplier and multiplicand to
positive numbers and then remember the original signs. The algorithms should
then be run for 31 iterations, leaving the signs out of the calculation. As we learned in grammar school, we need to negate the product only if the original signs disagree.
It turns out that the last algorithm will work for signed numbers, provided that
we remember that we are dealing with numbers that have infinite digits, and we are
only representing them with 32 bits. Hence, the shifting steps would need to extend
the sign of the product for signed numbers. When the algorithm completes, the
lower word would have the 32-bit product.
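A brief C sketch of the sign-handling rule just described (names are illustrative, not the book's): multiply the magnitudes as unsigned numbers, then negate the product only if the original signs disagree.

#include <stdint.h>

/* Convert both operands to magnitudes, multiply, and fix the sign at the end.
   The unsigned negation (0u - x) avoids overflow for the most negative value. */
int64_t signed_multiply(int32_t a, int32_t b)
{
    uint32_t ua = (a < 0) ? (uint32_t)0 - (uint32_t)a : (uint32_t)a;   /* |a| */
    uint32_t ub = (b < 0) ? (uint32_t)0 - (uint32_t)b : (uint32_t)b;   /* |b| */
    uint64_t magnitude = (uint64_t)ua * ub;
    return ((a < 0) != (b < 0)) ? -(int64_t)magnitude : (int64_t)magnitude;
}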
Iteration  Step                               Multiplier  Multiplicand  Product
0          Initial values                     0011        0000 0010     0000 0000
1          1a: 1 ⇒ Prod = Prod + Mcand        0011        0000 0010     0000 0010
1          2: Shift left Multiplicand         0011        0000 0100     0000 0010
1          3: Shift right Multiplier          0001        0000 0100     0000 0010
2          1a: 1 ⇒ Prod = Prod + Mcand        0001        0000 0100     0000 0110
2          2: Shift left Multiplicand         0001        0000 1000     0000 0110
2          3: Shift right Multiplier          0000        0000 1000     0000 0110
3          1: 0 ⇒ No operation                0000        0000 1000     0000 0110
3          2: Shift left Multiplicand         0000        0001 0000     0000 0110
3          3: Shift right Multiplier          0000        0001 0000     0000 0110
4          1: 0 ⇒ No operation                0000        0001 0000     0000 0110
4          2: Shift left Multiplicand         0000        0010 0000     0000 0110
4          3: Shift right Multiplier          0000        0010 0000     0000 0110
FIGURE 3.7 Multiply example using algorithm in Figure 3.5. The bit examined to determine the
next step is circled in color.
Faster Multiplication
Moore’s law has provided so much more in resources that hardware designers can
now build much faster multiplication hardware. Whether the multiplicand is to be
added or not is known at the beginning of the multiplication by looking at each
of the 32 multiplier bits. Faster multiplications are possible by essentially providing one 32-bit adder for each bit of the multiplier: one input is the multiplicand
ANDed with a multiplier bit, and the other is the output of a prior adder.
A straightforward approach would be to connect the outputs of adders on the
right to the inputs of adders on the left, making a stack of adders 32 high. An alternative way to organize these 32 additions is in a parallel tree, as Figure 3.8 shows.
Instead of waiting for 32 add times, we wait just the log2 (32) or five 32-bit add
times. Figure 3.8 shows how this is a faster way to connect them.
In fact, multiply can go even faster than five add times because of the use of carry save adders (see Section C.6 in Appendix C) and because it is easy to pipeline such a design to be able to support many multiplies simultaneously (see Chapter 4).
Multiply in ARM
ARM provides a multiply instruction (MUL) that puts the lower 32 bits of the product into the destination register. Since it doesn’t offer the upper 32 bits, there is no
difference between signed and unsigned multiplication.
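In C terms (an illustrative sketch, not the book's code), the reason is that the low half of a two's complement product has the same bit pattern either way:

#include <stdint.h>

/* The low 32 bits of a product do not depend on whether the operands are
   interpreted as signed or unsigned, so a lower-half-only multiply needs no
   separate signed and unsigned versions. */
uint32_t low_product(uint32_t a, uint32_t b)
{
    return a * b;   /* unsigned multiply in C keeps exactly the low 32 bits */
}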
[Figure 3.8: a tree of 32-bit adders. Each input is the multiplicand ANDed with one multiplier bit (Mplier31 • Mcand down to Mplier0 • Mcand), and the tree produces Product63 down to Product0.]
FIGURE 3.8 Fast multiplication hardware. Rather than use a single 32-bit adder 31 times, this hardware “unrolls the loop” to use 31
adders and then organizes them to minimize delay.
Summary
Multiplication hardware is simply shifts and add, derived from the paper-and-pencil method learned in grammar school. Compilers even use shift instructions
for multiplications by powers of 2.
Elaboration: The ARM multiply instruction ignores overflow, so it is up to the software
to check to see if the product is too big to fit in 32 bits.
Divide et impera.
Latin for "Divide and rule," ancient political maxim cited by Machiavelli, 1532

dividend A number being divided.
divisor A number that the dividend is divided by.

3.4 Division
The reciprocal operation of multiply is divide, an operation that is even less frequent and even more quirky. It even offers the opportunity to perform a mathematically invalid operation: dividing by 0.
Let’s start with an example of long division using decimal numbers to recall the
names of the operands and the grammar school division algorithm. For reasons
similar to those in the previous section, we limit the decimal digits to just 0 or 1.
The example is dividing 1,001,010ten by 1000ten:
                   1001ten       Quotient
Divisor  1000ten ) 1001010ten    Dividend
                  –1000
                     10
                     101
                     1010
                    –1000
                       10ten     Remainder
quotient The primary result of a division; a number that when multiplied by the divisor and added to the remainder produces the dividend.
remainder The secondary result of a division; a number that when added to the product of the quotient and the divisor produces the dividend.

Divide's two operands, called the dividend and divisor, and the result, called the quotient, are accompanied by a second result, called the remainder. Here is another way to express the relationship between the components:

Dividend = Quotient × Divisor + Remainder

where the remainder is smaller than the divisor. Infrequently, programs use the divide instruction just to get the remainder, ignoring the quotient.
The basic grammar school division algorithm tries to see how big a number can be subtracted, creating a digit of the quotient on each attempt. Our carefully selected decimal example uses only the numbers 0 and 1, so it's easy to figure out
how many times the divisor goes into the portion of the dividend: it’s either 0 times
or 1 time. Binary numbers contain only 0 or 1, so binary division is restricted to
these two choices, thereby simplifying binary division.
Let’s assume that both the dividend and the divisor are positive and hence the
quotient and the remainder are nonnegative. The division operands and both
results are 32-bit values, and we will ignore the sign for now.
A Division Algorithm and Hardware
Figure 3.9 shows hardware to mimic our grammar school algorithm. We start with
the 32-bit Quotient register set to 0. Each iteration of the algorithm needs to move
the divisor to the right one digit, so we start with the divisor placed in the left half
of the 64-bit Divisor register and shift it right 1 bit each step to align it with the
dividend. The Remainder register is initialized with the dividend.
[Figure 3.9 datapath: a 64-bit Divisor register (shift right), a 32-bit Quotient register (shift left), a 64-bit ALU, a 64-bit Remainder register (write), and a control test block.]
FIGURE 3.9 First version of the division hardware. The Divisor register, ALU, and Remainder
register are all 64 bits wide, with only the Quotient register being 32 bits. The 32-bit divisor starts in the left
half of the Divisor register and is shifted right 1 bit each iteration. The remainder is initialized with the dividend. Control decides when to shift the Divisor and Quotient registers and when to write the new value into
the Remainder register.
Figure 3.10 shows three steps of the first division algorithm. Unlike a human,
the computer isn’t smart enough to know in advance whether the divisor is smaller
than the dividend. It must first subtract the divisor in step 1; remember that this
is how we performed the comparison in the set on less than instruction. If the
result is positive, the divisor was smaller or equal to the dividend, so we generate a 1 in the quotient (step 2a). If the result is negative, the next step is to restore
the original value by adding the divisor back to the remainder and generate a 0 in
the quotient (step 2b). The divisor is shifted right and then we iterate again. The
remainder and quotient will be found in their namesake registers after the iterations are complete.
[Figure 3.10 flowchart: Start. 1. Subtract the Divisor register from the Remainder register and place the result in the Remainder register. Test Remainder: if Remainder ≥ 0, 2a. Shift the Quotient register to the left, setting the new rightmost bit to 1; if Remainder < 0, 2b. Restore the original value by adding the Divisor register to the Remainder register and placing the sum in the Remainder register, and also shift the Quotient register to the left, setting the new least significant bit to 0. 3. Shift the Divisor register right 1 bit. If fewer than 33 repetitions, repeat from step 1; after the 33rd repetition, done.]
FIGURE 3.10  A division algorithm, using the hardware in Figure 3.9. If the remainder is positive, the divisor did go into the dividend, so step 2a generates a 1 in the quotient. A negative remainder after step 1 means that the divisor did not go into the dividend, so step 2b generates a 0 in the quotient and adds the divisor to the remainder, thereby reversing the subtraction of step 1. The final shift, in step 3, aligns the divisor properly, relative to the dividend for the next iteration. These steps are repeated 33 times.
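For reference, a minimal C sketch of this restoring algorithm (names are ours; the divisor is assumed nonzero) follows Figures 3.9 and 3.10 step by step. Because the Remainder register should really be 65 bits, the sketch recovers the sign of the subtraction from a comparison before subtracting:

#include <stdint.h>

/* Restoring division (Figures 3.9 and 3.10) for 32-bit unsigned operands:
   64-bit Remainder, 64-bit Divisor starting in the left half, 33 steps. */
void divide_v1(uint32_t dividend, uint32_t divisor_in,
               uint32_t *quotient, uint32_t *remainder)
{
    uint64_t rem     = dividend;                   /* Remainder <- dividend      */
    uint64_t divisor = (uint64_t)divisor_in << 32; /* divisor in left half       */
    uint32_t q = 0;

    for (int step = 0; step < 33; step++) {
        int would_be_negative = (rem < divisor);   /* sign of the 65-bit result  */
        rem -= divisor;                            /* 1. Rem = Rem - Div         */
        if (!would_be_negative) {
            q = (q << 1) | 1;                      /* 2a. shift Q left, Q0 = 1   */
        } else {
            rem += divisor;                        /* 2b. restore the remainder  */
            q = q << 1;                            /*     shift Q left, Q0 = 0   */
        }
        divisor >>= 1;                             /* 3. shift Divisor right     */
    }
    *quotient  = q;
    *remainder = (uint32_t)rem;
}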
A Divide Algorithm
EXAMPLE: Using a 4-bit version of the algorithm to save pages, let's try dividing 7ten by 2ten, or 0000 0111two by 0010two.
ANSWER: Figure 3.11 shows the value of each register for each of the steps, with the quotient being 3ten and the remainder 1ten. Notice that the test in step 2 of whether the remainder is positive or negative simply tests whether the sign bit of the Remainder register is a 0 or 1. The surprising requirement of this algorithm is that it takes n + 1 steps to get the proper quotient and remainder.
This algorithm and hardware can be refined to be faster and cheaper. The
speed-up comes from shifting the operands and the quotient simultaneously with
the subtraction. This refinement halves the width of the adder and registers by
noticing where there are unused portions of registers and adders. Figure 3.12 shows
the revised hardware.
Signed Division
So far, we have ignored signed numbers in division. The simplest solution is to
remember the signs of the divisor and dividend and then negate the quotient if the
signs disagree.
Iteration  Step                                Quotient  Divisor    Remainder
0          Initial values                      0000      0010 0000  0000 0111
1          1: Rem = Rem – Div                  0000      0010 0000  1110 0111
1          2b: Rem < 0 ⇒ +Div, sll Q, Q0 = 0   0000      0010 0000  0000 0111
1          3: Shift Div right                  0000      0001 0000  0000 0111
2          1: Rem = Rem – Div                  0000      0001 0000  1111 0111
2          2b: Rem < 0 ⇒ +Div, sll Q, Q0 = 0   0000      0001 0000  0000 0111
2          3: Shift Div right                  0000      0000 1000  0000 0111
3          1: Rem = Rem – Div                  0000      0000 1000  1111 1111
3          2b: Rem < 0 ⇒ +Div, sll Q, Q0 = 0   0000      0000 1000  0000 0111
3          3: Shift Div right                  0000      0000 0100  0000 0111
4          1: Rem = Rem – Div                  0000      0000 0100  0000 0011
4          2a: Rem ≥ 0 ⇒ sll Q, Q0 = 1         0001      0000 0100  0000 0011
4          3: Shift Div right                  0001      0000 0010  0000 0011
5          1: Rem = Rem – Div                  0001      0000 0010  0000 0001
5          2a: Rem ≥ 0 ⇒ sll Q, Q0 = 1         0011      0000 0010  0000 0001
5          3: Shift Div right                  0011      0000 0001  0000 0001
FIGURE 3.11 Division example using the algorithm in Figure 3.10. The bit examined to
determine the next step is circled in color.
[Figure 3.12 datapath: a 32-bit Divisor register, a 32-bit ALU, and a 64-bit Remainder register (shift right, shift left, write) under a control test block.]
FIGURE 3.12 An improved version of the division hardware. The Divisor register, ALU, and
Quotient register are all 32 bits wide, with only the Remainder register left at 64 bits. Compared to Figure 3.9,
the ALU and Divisor registers are halved and the remainder is shifted left. This version also combines the
Quotient register with the right half of the Remainder register. (As in Figure 3.6, the Remainder register
should really be 65 bits to make sure the carry out of the adder is not lost.)
Elaboration: The one complication of signed division is that we must also set the
sign of the remainder. Remember that the following equation must always hold:
Dividend = Quotient × Divisor + Remainder
To understand how to set the sign of the remainder, let’s look at the example of dividing all the combinations of ±7ten by ±2ten. The first case is easy:
+7 ÷ +2: Quotient = +3, Remainder = +1
Checking the results:
7 = 3 × 2 + (+1) = 6 + 1
If we change the sign of the dividend, the quotient must change as well:
–7 ÷ +2: Quotient = –3
Rewriting our basic formula to calculate the remainder:
Remainder = (Dividend – Quotient × Divisor) = –7 – (–3 × +2) = –7–(–6) = –1
So,
–7 ÷ +2: Quotient = –3, Remainder = –1
Checking the results again:
–7 = –3 × 2 + (–1) = – 6 – 1
The reason the answer isn’t a quotient of –4 and a remainder of +1, which would also
fit this formula, is that the absolute value of the quotient would then change depending
on the sign of the dividend and the divisor! Clearly, if
–(x ÷ y) ≠ (–x) ÷ y
programming would be an even greater challenge. This anomalous behavior is avoided
by following the rule that the dividend and remainder must have the same signs, no
matter what the signs of the divisor and quotient.
We calculate the other combinations by following the same rule:
+7 ÷ –2: Quotient = –3, Remainder = +1
–7 ÷ –2: Quotient = +3, Remainder = –1
Thus the correctly signed division algorithm negates the quotient if the signs of the
operands are opposite and makes the sign of the nonzero remainder match the dividend.
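C's built-in integer division follows this same convention (since C99 the quotient truncates toward zero and the remainder takes the sign of the dividend), so a short program, shown here only as an illustrative check, reproduces the four cases above:

#include <stdio.h>

/* Print quotient and remainder for all four sign combinations of 7 and 2. */
int main(void)
{
    printf("%d %d\n",  7 /  2,  7 %  2);   /* prints  3  1 */
    printf("%d %d\n", -7 /  2, -7 %  2);   /* prints -3 -1 */
    printf("%d %d\n",  7 / -2,  7 % -2);   /* prints -3  1 */
    printf("%d %d\n", -7 / -2, -7 % -2);   /* prints  3 -1 */
    return 0;
}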
Faster Division
We used many adders to speed up multiply, but we cannot do the same trick for
divide. The reason is that we need to know the sign of the difference before we can
perform the next step of the algorithm, whereas with multiply we could calculate
the 32 partial products immediately.
There are techniques to produce more than one bit of the quotient per step.
The SRT division technique tries to guess several quotient bits per step, using a
table lookup based on the upper bits of the dividend and remainder. It relies on
subsequent steps to correct wrong guesses. A typical value today is 4 bits. The key
is guessing the value to subtract. With binary division, there is only a single choice.
These algorithms use 6 bits from the remainder and 4 bits from the divisor to index
a table that determines the guess for each step.
The accuracy of this fast method depends on having proper values in the lookup
table. The fallacy on page 263 in Section 3.8 shows what can happen if the table is
incorrect.
Divide in ARM (or lack thereof)
While there are many versions of ARM, the classic ARM instruction set had no
divide instruction. That tradition continues with the ARMv7A, although the ARMv7R and ARMv7M include signed integer divide (SDIV) and unsigned integer divide (UDIV) instructions.
Summary
The common hardware support for multiply and divide allows ARM to provide a
single pair of 32-bit registers that are used both for multiply and divide.
Elaboration: An even faster algorithm does not immediately add the divisor back if
the remainder is negative. It simply adds the dividend to the shifted remainder in the following step, since (r + d ) × 2 – d = r × 2 + d × 2 – d = r × 2 + d. This nonrestoring division
algorithm, which takes 1 clock cycle per step, is explored further in the exercises; the
algorithm here is called restoring division. A third algorithm that doesn’t save the result
of the subtract if it is negative is called a nonperforming division algorithm. It averages
one-third fewer arithmetic operations.
Speed gets you nowhere if you're headed the wrong way.
American proverb

3.5 Floating Point
Going beyond signed and unsigned integers, programming languages support
numbers with fractions, which are called reals in mathematics. Here are some
examples of reals:
3.14159265 . . . ten (pi)
2.71828 . . . ten (e)
0.000000001ten or 1.0ten × 10^−9 (seconds in a nanosecond)
3,155,760,000ten or 3.15576ten × 10^9 (seconds in a typical century)
scientific notation
A notation that renders
numbers with a single
digit to the left of the
decimal point.
normalized A number
in floating-point notation
that has no leading 0s.
floating point Computer
arithmetic that represents
numbers in which the
binary point is not fixed.
Notice that in the last case, the number didn’t represent a small fraction, but it
was bigger than we could represent with a 32-bit signed integer. The alternative
notation for the last two numbers is called scientific notation, which has a single
digit to the left of the decimal point. A number in scientific notation that has no
leading 0s is called a normalized number, which is the usual way to write it. For
example, 1.0ten × 10^−9 is in normalized scientific notation, but 0.1ten × 10^−8 and 10.0ten × 10^−10 are not.
Just as we can show decimal numbers in scientific notation, we can also show
binary numbers in scientific notation:
1.0two × 2^−1
To keep a binary number in normalized form, we need a base that we can increase
or decrease by exactly the number of bits the number must be shifted to have one
nonzero digit to the left of the decimal point. Only a base of 2 fulfills our need.
Since the base is not 10, we also need a new name for decimal point; binary point
will do fine.
Computer arithmetic that supports such numbers is called floating point
because it represents numbers in which the binary point is not fixed, as it is for
integers. The programming language C uses the name float for such numbers. Just
as in scientific notation, numbers are represented as a single nonzero digit to the
left of the binary point. In binary, the form is
1.xxxxxxxxxtwo × 2^yyyy
(Although the computer represents the exponent in base 2 as well as the rest of the
number, to simplify the notation we show the exponent in decimal.)
A standard scientific notation for reals in normalized form offers three advantages. It simplifies exchange of data that includes floating-point numbers; it simplifies the floating-point arithmetic algorithms to know that numbers will always
be in this form; and it increases the accuracy of the numbers that can be stored in
a word, since the unnecessary leading 0s are replaced by real digits to the right of
the binary point.
Floating-Point Representation
A designer of a floating-point representation must find a compromise between the
size of the fraction and the size of the exponent, because a fixed word size means
you must take a bit from one to add a bit to the other. This trade-off is between
precision and range: increasing the size of the fraction enhances the precision of
the fraction, while increasing the size of the exponent increases the range of numbers that can be represented. As our design guideline from Chapter 2 reminds us,
good design demands good compromise.
Floating-point numbers are usually a multiple of the size of a word. The representation of an ARM floating-point number is shown below, where s is the sign
of the floating-point number (1 meaning negative), exponent is the value of the
8-bit exponent field (including the sign of the exponent), and fraction is the 23-bit
number. This representation is called sign and magnitude, since the sign is a separate bit from the rest of the number.
s | exponent | fraction
(1 bit) (8 bits) (23 bits)
Bit 31 is the sign s, bits 30 to 23 hold the exponent, and bits 22 to 0 hold the fraction.

fraction The value, generally between 0 and 1, placed in the fraction field.
exponent In the numerical representation system of floating-point arithmetic, the value that is placed in the exponent field.
In general, floating-point numbers are of the form

(−1)^S × F × 2^E

F involves the value in the fraction field and E involves the value in the exponent field; the exact relationship to these fields will be spelled out soon. (We will shortly see that ARM does something slightly more sophisticated.)
These chosen sizes of exponent and fraction give ARM computer arithmetic
an extraordinary range. Fractions almost as small as 2.0ten × 10^−38 and numbers almost as large as 2.0ten × 10^38 can be represented in a computer. Alas, extraordinary differs from infinite, so it is still possible for numbers to be too large. Thus,
overflow interrupts can occur in floating-point arithmetic as well as in integer
arithmetic. Notice that overflow here means that the exponent is too large to be
represented in the exponent field.
Floating point offers a new kind of exceptional event as well. Just as programmers will want to know when they have calculated a number that is too large to be
represented, they will want to know if the nonzero fraction they are calculating has
become so small that it cannot be represented; either event could result in a program giving incorrect answers. To distinguish it from overflow, we call this event
underflow. This situation occurs when the negative exponent is too large to fit in
the exponent field.
overflow (floating-point) A situation in which a positive exponent becomes too large to fit in the exponent field.
underflow (floating-point) A situation in which a negative exponent becomes too large to fit in the exponent field.

Hardware/Software Interface
exception Also called
interrupt. An
unscheduled event
that disrupts program
execution
double precision
A floating-point value
represented in two 32-bit
words.
single precision
A floating-point value
represented in a single
32-bit word.
One of the optional modes of the IEEE floating point standard is to cause an
exception when underflow or overflow occurs. The programmer or the programming environment must then decide what to do.
An exception, also called an interrupt on many computers, is essentially an
unscheduled procedure call. The address of the instruction that overflowed is saved
in a register, and the computer jumps to a predefined address to invoke the appropriate routine for that exception. The interrupted address is saved so that in some
situations the program can continue after corrective code is executed. (Section 4.9
covers exceptions in more detail; Chapters 5 and 6 describe other situations where
exceptions and interrupts occur.)
One way to reduce chances of underflow or overflow is to offer another format
that has a larger exponent. In C this number is called double, and operations on
doubles are called double precision floating-point arithmetic; single precision
floating point is the name of the earlier format.
The representation of a double precision floating-point number takes two ARM
words, as shown below, where s is still the sign of the number, exponent is the value
of the 11-bit exponent field, and fraction is the 52-bit number in the fraction field.
s | exponent | fraction
(1 bit) (11 bits) (20 bits)
fraction (continued)
(32 bits)
The first word holds the sign, the 11-bit exponent, and the upper 20 bits of the fraction; the second word holds the remaining 32 bits of the fraction.
ARM double precision allows numbers almost as small as 2.0ten × 10^−308 and almost as large as 2.0ten × 10^308. Although double precision does increase the
exponent range, its primary advantage is its greater precision because of the much
larger significand.
These formats go beyond ARM. They are part of the IEEE 754 floating-point
standard, found in virtually every computer invented since 1980. This standard has
greatly improved both the ease of porting floating-point programs and the quality
of computer arithmetic.
To pack even more bits into the significand, IEEE 754 makes the leading 1-bit
of normalized binary numbers implicit. Hence, the number is actually 24 bits long
in single precision (implied 1 and a 23-bit fraction), and 53 bits long in double
precision (1 + 52). To be precise, we use the term significand to represent the 24- or
53-bit number that is 1 plus the fraction, and fraction when we mean the 23- or
52-bit number. Since 0 has no leading 1, it is given the reserved exponent value 0 so
that the hardware won’t attach a leading 1 to it.
Thus 00 . . . 00two represents 0; the representation of the rest of the numbers uses
the form from before with the hidden 1 added:
(−1)^S × (1 + Fraction) × 2^E

where the bits of the fraction represent a number between 0 and 1 and E specifies the value in the exponent field, to be given in detail shortly. If we number the bits of the fraction from left to right s1, s2, s3, . . . , then the value is

(−1)^S × (1 + (s1 × 2^−1) + (s2 × 2^−2) + (s3 × 2^−3) + (s4 × 2^−4) + …) × 2^E
Figure 3.13 shows the encodings of IEEE 754 floating-point numbers. Other
features of IEEE 754 are special symbols to represent unusual events. For example,
instead of interrupting on a divide by 0, software can set the result to a bit pattern
representing +∞ or −∞; the largest exponent is reserved for these special symbols.
When the programmer prints the results, the program will print an infinity symbol. (For the mathematically trained, the purpose of infinity is to form topological
closure of the reals.)
IEEE 754 even has a symbol for the result of invalid operations, such as 0/0 or
subtracting infinity from infinity. This symbol is NaN, for Not a Number. The purpose of NaNs is to allow programmers to postpone some tests and decisions to a
later time in the program when they are convenient.
The designers of IEEE 754 also wanted a floating-point representation that could
be easily processed by integer comparisons, especially for sorting. This desire is
why the sign is in the most significant bit, allowing a quick test of less than, greater
than, or equal to 0. (It’s a little more complicated than a simple integer sort, since
this notation is essentially sign and magnitude rather than two’s complement.)
Placing the exponent before the significand also simplifies the sorting of
floating-point numbers using integer comparison instructions, since numbers
with bigger exponents look larger than numbers with smaller exponents, as long as
both exponents have the same sign.
Negative exponents pose a challenge to simplified sorting. If we use two’s complement or any other notation in which negative exponents have a 1 in the most
Single precision           Double precision
Exponent   Fraction        Exponent   Fraction        Object represented
0          0               0          0               0
0          Nonzero         0          Nonzero         ± denormalized number
1–254      Anything        1–2046     Anything        ± floating-point number
255        0               2047       0               ± infinity
255        Nonzero         2047       Nonzero         NaN (Not a Number)
FIGURE 3.13 IEEE 754 encoding of floating-point numbers. A separate sign bit determines the
sign. Denormalized numbers are described in the Elaboration on page 257.
significant bit of the exponent field, a negative exponent will look like a big number.
For example, 1.0two × 2^−1 would be represented as
0 | 1111 1111 | 0000 0000 0000 0000 0000 000
(1 bit) (8 bits) (23 bits)
(Remember that the leading 1 is implicit in the significand.) The value 1.0two × 2^+1 would look like the smaller binary number
0 | 0000 0001 | 0000 0000 0000 0000 0000 000
(1 bit) (8 bits) (23 bits)
The desirable notation must therefore represent the most negative exponent as
00 . . . 00two and the most positive as 11 . . . 11two. This convention is called biased
notation, with the bias being the number subtracted from the normal, unsigned
representation to determine the real value.
IEEE 754 uses a bias of 127 for single precision, so an exponent of −1 is represented by the bit pattern of the value −1 + 127ten, or 126ten = 0111 1110two, and +1 is represented by 1 + 127, or 128ten = 1000 0000two. The exponent bias for double precision is 1023. Biased exponent means that the value represented by a floating-point number is really

(−1)^S × (1 + Fraction) × 2^(Exponent − Bias)

The range of single precision numbers is then from as small as
±1.0000 0000 0000 0000 0000 000two × 2^−126
to as large as
±1.1111 1111 1111 1111 1111 111two × 2^+127.
Let’s show the representation.
Floating-Point Representation
EXAMPLE
ANSWER
Show the IEEE 754 binary representation of the number −0.75ten in single and
double precision.
The number −0.75ten is also
−3/4ten or −3/22ten
It is also represented by the binary fraction
−11two/22ten or −0.11two
In scientific notation, the value is
−0.11two × 2^0
and in normalized scientific notation, it is
−1.1two × 2^−1
The general representation for a single precision number is
(−1)^S × (1 + Fraction) × 2^(Exponent − 127)
Subtracting the bias 127 from the exponent of −1.1two × 2^−1 yields
(−1)^1 × (1 + .1000 0000 0000 0000 0000 000two) × 2^(126−127)
The single precision binary representation of −0.75ten is then

1 | 0111 1110 | 1000 0000 0000 0000 0000 000
(1 bit) (8 bits) (23 bits)
The double precision representation is

(−1)^1 × (1 + .1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000two) × 2^(1022−1023)

1 | 011 1111 1110 | 1000 0000 0000 0000 0000
(1 bit) (11 bits) (20 bits)
0000 0000 0000 0000 0000 0000 0000 0000
(32 bits, fraction continued)
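As a quick cross-check of this worked example (an illustrative C snippet, not from the text), reinterpreting the bits of −0.75 should show sign 1, exponent 0111 1110, and a fraction whose leading bit alone is set:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* The single precision bits of -0.75 are 1 | 0111 1110 | 100...0, which is
   the hexadecimal pattern 0xBF400000. */
int main(void)
{
    float f = -0.75f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the 32 bits */
    printf("0x%08X\n", (unsigned)bits);      /* prints 0xBF400000       */
    return 0;
}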
Now let’s try going the other direction.
Converting Binary to Decimal Floating Point
EXAMPLE: What decimal number is represented by this single precision float?

1 | 1000 0001 | 0100 0000 0000 0000 0000 000
(1 bit) (8 bits) (23 bits)

ANSWER: The sign bit is 1, the exponent field contains 129, and the fraction field contains 1 × 2^−2 = 1/4, or 0.25. Using the basic equation,
(−1)^S × (1 + Fraction) × 2^(Exponent − Bias) = (−1)^1 × (1 + 0.25) × 2^(129−127)
= −1 × 1.25 × 2^2
= −1.25 × 4
= −5.0
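The same decoding can be spelled out in a few lines of C; this is only a sketch for normalized numbers (it ignores zero, denormals, infinity, and NaN), using the field names from the text:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Extract sign, exponent, and fraction of a single precision float and apply
   (-1)^S x (1 + Fraction) x 2^(Exponent - 127) by hand. */
int main(void)
{
    float f = -5.0f;                              /* the value from the example */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);               /* reinterpret the 32 bits    */

    unsigned sign     = bits >> 31;               /* bit 31        */
    int      exponent = (bits >> 23) & 0xFF;      /* bits 30 to 23 */
    uint32_t fraction = bits & 0x7FFFFF;          /* bits 22 to 0  */

    double value = (sign ? -1.0 : 1.0) *
                   ldexp(1.0 + fraction / 8388608.0, exponent - 127);  /* 2^23 = 8388608 */
    printf("S=%u exponent=%d value=%g\n", sign, exponent, value);      /* S=1 exponent=129 value=-5 */
    return 0;
}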
In the next subsections, we will give the algorithms for floating-point addition
and multiplication. At their core, they use the corresponding integer operations
on the significands, but extra bookkeeping is necessary to handle the exponents
and normalize the result. We first give an intuitive derivation of the algorithms in
decimal and then give a more detailed, binary version in the figures.
Elaboration: In an attempt to increase range without removing bits from the significand, some computers before the IEEE 754 standard used a base other than 2. For
example, the IBM 360 and 370 mainframe computers use base 16. Since changing
the IBM exponent by one means shifting the significand by 4 bits, “normalized” base
16 numbers can have up to 3 leading bits of 0s! Hence, hexadecimal digits mean that
up to 3 bits must be dropped from the significand, which leads to surprising problems
in the accuracy of floating-point arithmetic. Recent IBM mainframes support IEEE 754
as well as the hex format.
Floating-Point Addition
Let’s add numbers in scientific notation by hand to illustrate the problems in
floating-point addition: 9.999ten × 10^1 + 1.610ten × 10^−1. Assume that we can store only four decimal digits of the significand and two decimal digits of the exponent.
Step 1. To be able to add these numbers properly, we must align the decimal point of the number that has the smaller exponent. Hence, we need a form of the smaller number, 1.610ten × 10^−1, that matches the larger exponent.
We obtain this by observing that there are multiple representations of an unnormalized floating-point number in scientific notation:

1.610ten × 10^−1 = 0.1610ten × 10^0 = 0.01610ten × 10^1

The number on the right is the version we desire, since its exponent matches the exponent of the larger number, 9.999ten × 10^1. Thus, the first step shifts the significand of the smaller number to the right until its corrected exponent matches that of the larger number. But we can represent only four decimal digits so, after shifting, the number is really

0.016ten × 10^1

Step 2. Next comes the addition of the significands:

      9.999ten
   +  0.016ten
     10.015ten

The sum is 10.015ten × 10^1.

Step 3. This sum is not in normalized scientific notation, so we need to adjust it:

10.015ten × 10^1 = 1.0015ten × 10^2
Thus, after the addition we may have to shift the sum to put it into normalized form, adjusting the exponent appropriately. This example shows
shifting to the right, but if one number were positive and the other were
negative, it would be possible for the sum to have many leading 0s, requiring left shifts. Whenever the exponent is increased or decreased, we must
check for overflow or underflow—that is, we must make sure that the
exponent still fits in its field.
Step 4. Since we assumed that the significand can be only four digits long (excluding the sign), we must round the number. In our grammar school algorithm, the rules truncate the number if the digit to the right of the desired
point is between 0 and 4 and add 1 to the digit if the number to the right
is between 5 and 9. The number
1.0015ten × 10^2
is rounded to four digits in the significand to
1.002ten × 10^2
since the fourth digit to the right of the decimal point was between 5 and 9. Notice that if we have bad luck on rounding, such as adding 1 to a string of 9s, the sum may no longer be normalized and we would need to perform step 3 again.
[Figure 3.14 flowchart: Start. 1. Compare the exponents of the two numbers; shift the smaller number to the right until its exponent would match the larger exponent. 2. Add the significands. 3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent; if overflow or underflow occurs, raise an exception. 4. Round the significand to the appropriate number of bits; if the result is no longer normalized, repeat step 3; otherwise done.]
FIGURE 3.14  Floating-point addition. The normal path is to execute steps 3 and 4 once, but if rounding causes the sum to be unnormalized, we must repeat step 3.
Figure 3.14 shows the algorithm for binary floating-point addition that follows
this decimal example. Steps 1 and 2 are similar to the example just discussed: adjust
the significand of the number with the smaller exponent and then add the two significands. Step 3 normalizes the results, forcing a check for overflow or underflow.
The test for overflow and underflow in step 3 depends on the precision of the
operands. Recall that the pattern of all 0 bits in the exponent is reserved and used
for the floating-point representation of zero. Moreover, the pattern of all 1 bits in
the exponent is reserved for indicating values and situations outside the scope of
normal floating-point numbers (see the Elaboration on page 257). Thus, for single
precision, the maximum exponent is 127, and the minimum exponent is −126. The
limits for double precision are 1023 and −1022.
Binary Floating-Point Addition
EXAMPLE: Try adding the numbers 0.5ten and −0.4375ten in binary using the algorithm in Figure 3.14.
ANSWER: Let's first look at the binary version of the two numbers in normalized scientific notation, assuming that we keep 4 bits of precision:

0.5ten = 1/2ten = 1/2^1ten = 0.1two = 0.1two × 2^0 = 1.000two × 2^−1
−0.4375ten = −7/16ten = −7/2^4ten = −0.0111two = −0.0111two × 2^0 = −1.110two × 2^−2
Now we follow the algorithm:
Step 1. The significand of the number with the lesser exponent (−1.11two × 2^−2) is shifted right until its exponent matches the larger number:
−1.110two × 2^−2 = −0.111two × 2^−1
Step 2. Add the significands:
1.000two × 2^−1 + (−0.111two × 2^−1) = 0.001two × 2^−1
Step 3. Normalize the sum, checking for overflow or underflow:
0.001two × 2^−1 = 0.010two × 2^−2 = 0.100two × 2^−3 = 1.000two × 2^−4
Since 127 ≥ −4 ≥ −126, there is no overflow or underflow. (The biased exponent would be −4 + 127, or 123, which is between 1 and 254, the smallest and largest unreserved biased exponents.)
Step 4. Round the sum:
1.000two × 2^−4
The sum already fits exactly in 4 bits, so there is no change to the bits
due to rounding.
This sum is then
1.000two × 2^−4 = 0.0001000two = 0.0001two = 1/2^4ten = 1/16ten = 0.0625ten
This sum is what we would expect from adding 0.5ten to −0.4375ten.
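Because 0.5, 0.4375, and 0.0625 are all exactly representable in binary floating point, the same addition in C (an illustrative check only) gives the hand-calculated answer with no rounding error:

#include <stdio.h>

/* 0.5 = 1.000two x 2^-1 and -0.4375 = -1.110two x 2^-2 are exact in IEEE 754,
   so the sum matches the hand calculation exactly. */
int main(void)
{
    float sum = 0.5f + -0.4375f;
    printf("%g\n", sum);          /* prints 0.0625 */
    return 0;
}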
Many computers dedicate hardware to run floating-point operations as fast as
possible. Figure 3.15 sketches the basic organization of hardware for floating-point
addition.
Floating-Point Multiplication
Now that we have explained floating-point addition, let’s try floating-point multiplication. We start by multiplying decimal numbers in scientific notation by
hand: 1.110ten × 10^10 × 9.200ten × 10^−5. Assume that we can store only four digits
of the significand and two digits of the exponent.
Step 1. Unlike addition, we calculate the exponent of the product by simply
adding the exponents of the operands together:
New exponent = 10 + (−5) = 5
Let’s do this with the biased exponents as well to make sure we obtain
the same result: 10 + 127 = 137, and −5 + 127 = 122, so
New exponent = 137 + 122 = 259
This result is too large for the 8-bit exponent field, so something is
amiss! The problem is with the bias because we are adding the biases
as well as the exponents:
New exponent = (10 + 127) + (−5 + 127) = (5 + 2 × 127) = 259
Accordingly, to get the correct biased sum when we add biased numbers,
we must subtract the bias from the sum:
New exponent = 137 + 122 − 127 = 259 − 127 = 132 = (5 + 127)
and 5 is indeed the exponent we calculated initially.
[Figure 3.15 block diagram: a small ALU compares the exponents; the exponent difference controls multiplexors that select the larger exponent and shift the smaller significand right; a big ALU adds the significands; normalization shifts left or right and increments or decrements the exponent; rounding hardware then produces the result sign, exponent, and fraction.]
FIGURE 3.15 Block diagram of an arithmetic unit dedicated to floating-point addition. The steps of Figure 3.14 correspond
to each block, from top to bottom. First, the exponent of one operand is subtracted from the other using the small ALU to determine which is
larger and by how much. This difference controls the three multiplexors; from left to right, they select the larger exponent, the significand of the
smaller number, and the significand of the larger number. The smaller significand is shifted right, and then the significands are added together
using the big ALU. The normalization step then shifts the sum left or right and increments or decrements the exponent. Rounding then creates
the final result, which may require normalizing again to produce the final result.
Step 2. Next comes the multiplication of the significands:

       1.110ten
    ×  9.200ten
        0000
       0000
      2220
     9990
    10212000ten

There are three digits to the right of the decimal point for each operand, so the decimal point is placed six digits from the right in the product significand:
10.212000ten
Assuming that we can keep only three digits to the right of the decimal point, the product is 10.212 × 10^5.
Step 3. This product is unnormalized, so we need to normalize it:
10.212ten × 10^5 = 1.0212ten × 10^6
Thus, after the multiplication, the product can be shifted right one
digit to put it in normalized form, adding 1 to the exponent. At this
point, we can check for overflow and underflow. Underflow may
occur if both operands are small—that is, if both have large negative
exponents.
Step 4. We assumed that the significand is only four digits long (excluding the
sign), so we must round the number. The number
1.0212ten × 10^6
is rounded to four digits in the significand to
1.021ten × 10^6
Step 5. The sign of the product depends on the signs of the original operands. If they are both the same, the sign is positive; otherwise, it's negative. Hence, the product is
+1.021ten × 10^6
The sign of the sum in the addition algorithm was determined by
addition of the significands, but in multiplication, the sign of the
product is determined by the signs of the operands.
Once again, as Figure 3.16 shows, multiplication of binary floating-point
numbers is quite similar to the steps we have just completed. We start with
[Figure 3.16 flowchart: Start. 1. Add the biased exponents of the two numbers, subtracting the bias from the sum to get the new biased exponent. 2. Multiply the significands. 3. Normalize the product if necessary, shifting it right and incrementing the exponent; if overflow or underflow occurs, raise an exception. 4. Round the significand to the appropriate number of bits; if the result is no longer normalized, repeat step 3. 5. Set the sign of the product to positive if the signs of the original operands are the same; if they differ make the sign negative. Done.]
FIGURE 3.16  Floating-point multiplication. The normal path is to execute steps 3 and 4 once, but if rounding causes the sum to be unnormalized, we must repeat step 3.
calculating the new exponent of the product by adding the biased exponents,
being sure to subtract one bias to get the proper result. Next is multiplication of significands, followed by an optional normalization step. The size of
the exponent is checked for overflow or underflow, and then the product is
rounded. If rounding leads to further normalization, we once again check for
exponent size. Finally, set the sign bit to 1 if the signs of the operands were
different (negative product) or to 0 if they were the same (positive product).
Binary Floating-Point Multiplication
EXAMPLE: Let's try multiplying the numbers 0.5ten and −0.4375ten, using the steps in Figure 3.16.
ANSWER: In binary, the task is multiplying 1.000two × 2^−1 by −1.110two × 2^−2.

Step 1. Adding the exponents without bias:
−1 + (−2) = −3
or, using the biased representation:
(−1 + 127) + (−2 + 127) − 127 = (−1 − 2) + (127 + 127 − 127) = −3 + 127 = 124

Step 2. Multiplying the significands:

       1.000two
    ×  1.110two
        0000
       1000
      1000
     1000
     1110000two

The product is 1.110000two × 2^−3, but we need to keep it to 4 bits, so it is 1.110two × 2^−3.
Step 3. Now we check the product to make sure it is normalized, and then
check the exponent for overflow or underflow. The product is already
normalized and, since 127 ≥ −3 ≥ −126, there is no overflow or underflow. (Using the biased representation, 254 ≥ 124 ≥ 1, so the exponent
fits.)
Step 4. Rounding the product makes no change:
1.110two × 2^−3
Step 5. Since the signs of the original operands differ, make the sign of the product negative. Hence, the product is
−1.110two × 2^−3
Converting to decimal to check our results:
−1.110two × 2^−3 = −0.001110two = −0.00111two = −7/2^5ten = −7/32ten = −0.21875ten
The product of 0.5ten and −0.4375ten is indeed −0.21875ten.
Floating-Point Instructions in ARM
ARM supports the IEEE 754 single precision and double precision formats with the optional VFP set of instructions. These include
■ Floating-point addition, single (FADDS) and addition, double (FADDD)
■ Floating-point subtraction, single (FSUBS) and subtraction, double (FSUBD)
■ Floating-point multiplication, single (FMULS) and multiplication, double (FMULD)
■ Floating-point division, single (FDIVS) and division, double (FDIVD)
■ Floating-point comparison, single (FCMPS) and comparison, double (FCMPD)
Floating-point comparisons set the floating-point condition flags. To be able to branch
on them, the programmer must first transfer them to the integer condition flags,
which is accomplished with the FMSTAT instruction. As you might expect, BEQ,
BNE, BGT, and BGE test as their namesakes suggest for floating point comparisons.
However, because of the way the floating point conditions are mapped to integer
condition flags, you test for less than with BMI and less than or equal with BLS.
The ARM designers decided to add 32 separate floating-point registers—called
s0, s1, s2, . . . , s31—used for single precision. Hence, they included separate
loads and stores for floating-point registers: FLDS and FSTS. The base registers for
floating-point data transfers remain integer registers. The ARM code to load two
single precision numbers from memory, add them, and then store the sum might
look like this:
FLDS   s4,[sp,#x]   ; Load 32-bit F.P. number into s4
FLDS   s6,[sp,#y]   ; Load 32-bit F.P. number into s6
FADDS  s2,s4,s6     ; s2 = s4 + s6 single precision
FSTS   s2,[sp,#z]   ; Store 32-bit F.P. number from s2
FP data transfers have a limited number of addressing modes. The most useful are immediate offset, immediate offset pre-indexed, and immediate offset post-indexed. Note that the offsets are multiplied by 4 to extend the range of the offset, similar to what ARM does for branches (see Chapter 2).
A double precision register is really just an even-odd pair of single precision
registers, using the names d0, d1, d2, . . . , d15. Thus, the pair of single precision
registers s16 and s17 also form the double precision register named d8.
Figure 3.17 summarizes the floating-point portion of the ARM architecture
revealed in this chapter, with the additions to support floating point shown in
color.
ARM floating-point operands
Name | Example | Comments
32 floating-point registers | s0, s1, s2, . . . , s31 or d0, d1, d2, . . . , d15 | ARM single-precision floating-point registers are used in pairs for double precision numbers.
2^30 memory words | Memory[0], Memory[4], . . . , Memory[4294967292] | Accessed only by data transfer instructions. ARM uses byte addresses, so sequential word addresses differ by 4. Memory holds data structures, such as arrays, and spilled registers, such as those saved on procedure calls.

ARM floating-point assembly language
Category | Instruction | Example | Meaning | Comments
Arithmetic | FP add single | FADDS s2,s4,s6 | s2 = s4 + s6 | FP add (single precision)
Arithmetic | FP subtract single | FSUBS s2,s4,s6 | s2 = s4 – s6 | FP sub (single precision)
Arithmetic | FP multiply single | FMULS s2,s4,s6 | s2 = s4 × s6 | FP multiply (single precision)
Arithmetic | FP divide single | FDIVS s2,s4,s6 | s2 = s4 / s6 | FP divide (single precision)
Arithmetic | FP add double | FADDD d2,d4,d6 | d2 = d4 + d6 | FP add (double precision)
Arithmetic | FP subtract double | FSUBD d2,d4,d6 | d2 = d4 – d6 | FP sub (double precision)
Arithmetic | FP multiply double | FMULD d2,d4,d6 | d2 = d4 × d6 | FP multiply (double precision)
Arithmetic | FP divide double | FDIVD d2,d4,d6 | d2 = d4 / d6 | FP divide (double precision)
Data transfer | FP load, single prec. | FLDS s1,[r1,#100] | s1 = Memory[r1 + 400] | 32-bit data to FP register
Data transfer | FP store, single prec. | FSTS s1,[r1,#100] | Memory[r1 + 400] = s1 | 32-bit data to memory
Data transfer | FP load, double prec. | FLDD d1,[r1,#100] | d1 = Memory[r1 + 400] | 64-bit data to FP register
Data transfer | FP store, double prec. | FSTD d1,[r1,#100] | Memory[r1 + 400] = d1 | 64-bit data to memory
Compare | FP compare single | FCMPS s2,s4 | if (s2 - s4) | FP compare less than single precision
Compare | FP compare double | FCMPD d2,d4 | if (d2 - d4) | FP compare less than double precision
Compare | FP Move Status (for conditional branch) | FMSTAT | cond. flags = FP cond. flags | Copy FP condition flags to integer condition flags

FIGURE 3.17  ARM floating-point architecture revealed thus far.
Elaboration: VFP version 3 has 32 double-precision floating-point registers.
Hardware/Software Interface

One issue that architects face in supporting floating-point arithmetic is whether to use the same registers used by the integer instructions or to add a special set for floating point. Because programs normally perform integer operations and floating-point operations on different data, separating the registers will only slightly increase the number of instructions needed to execute a program. The major impact is to create a separate set of data transfer instructions to move data between floating-point registers and memory.

The benefits of separate floating-point registers are having twice as many registers without using up more bits in the instruction format, having twice the register bandwidth by having separate integer and floating-point register sets, and being able to customize registers to floating point; for example, some computers convert all sized operands in registers into a single internal format.
Elaboration: The V in VFP stands for vector, as the floating-point coprocessor actually supports short vector instructions. Section 7.6 of Chapter 7 describes vector instruction set architectures. The maximum length of a vector is either 8 single precision registers or 4 double precision registers. Vector length is determined by a LEN field in the floating-point status and control register of ARM (FPSCR). If it is set to 0, a VFP operation acts like a scalar floating-point instruction. The vector loads and stores can do unit stride accesses. If the LEN and STRIDE fields of the FPSCR are set properly, they can also support a stride of 2. The other pieces commonly found in vector architectures are missing, such as scatter-gather data transfers and conditional execution of vector elements.

To be able to perform the common scalar-vector operations, where an operation combines a single scalar variable with each element of a vector, VFP divides the FP registers into those that can make up the vector registers and those that can only be scalar registers. For single precision, registers s0 to s7 are scalar and three sets of registers can be used as vectors: s8 to s15, s16 to s23, and s24 to s31. The corresponding double precision registers serve the same end: scalars are d0 to d3, and the three sets of vectorizable registers are d4 to d7, d8 to d11, and d12 to d15. In both cases, if the destination register is in the first bank, then the whole operation is considered scalar.
Compiling a Floating-Point C Program into ARM Assembly Code
Let’s convert a temperature in Fahrenheit to Celsius:
EXAMPLE
float f2c (float fahr)
{
return ((5.0/9.0) * (fahr – 32.0));
}
Assume that the floating-point argument fahr is passed in s12 and the result
should go in s0. (Unlike integer registers, floating-point register 0 can contain
a number.) What is the ARM assembly code?
ANSWER
We assume that the compiler places the three floating-point constants in
memory within easy reach of the register r12. The first two instructions load
the constants 5.0 and 9.0 into floating-point registers:
f2c:  FLDS s16,[r12,const5]   ; s16 = 5.0 (5.0 in memory)
      FLDS s18,[r12,const9]   ; s18 = 9.0 (9.0 in memory)
They are then divided to get the fraction 5.0/9.0:
FDIVS s16, s16, s18 ; s16 = 5.0 / 9.0
(Many compilers would divide 5.0 by 9.0 at compile time and save the single
constant 5.0/9.0 in memory, thereby avoiding the divide at runtime.) Next, we
load the constant 32.0 and then subtract it from fahr (s12):
FLDS s18,[r12,const32]; s18 = 32.0
FSUBS s18, s12, s18 ; s18 = fahr – 32.0
Finally, we multiply the two intermediate results, placing the product in s0
as the return result, and then return
      FMULS s0, s16, s18      ; s0 = (5/9)*(fahr - 32.0)
      MOV   pc, lr            ; return
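A small C harness can check the routine; this is a sketch using single precision constants to mirror the single precision assembly, and the test values are our own:

#include <stdio.h>

/* Reference C version of f2c, using float constants throughout. */
float f2c(float fahr) {
    return (5.0f / 9.0f) * (fahr - 32.0f);
}

int main(void) {
    printf("%f\n", f2c(32.0f));    /* 0.000000 exactly                       */
    printf("%f\n", f2c(212.0f));   /* roughly 100, off in the last digits,   */
    return 0;                      /* because 5/9 is rounded to a float      */
}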
Now let’s perform floating-point operations on matrices, code commonly found
in scientific programs.
Compiling Floating-Point C Procedure with Two-Dimensional
Matrices into ARM
EXAMPLE
Most floating-point calculations are performed in double precision. Let’s perform matrix multiply of X = X + Y * Z. Let’s assume X, Y, and Z are all square
matrices with 32 elements in each dimension.
void mm (double x[][32], double y[][32], double z[][32])
{
  int i, j, k;
  for (i = 0; i < 32; i = i + 1)
    for (j = 0; j < 32; j = j + 1)
      for (k = 0; k < 32; k = k + 1)
        x[i][j] = x[i][j] + y[i][k] * z[k][j];
}
The array starting addresses are parameters, so they are in r0, r1, and r2.
Assume that the integer variables are in r3, r4, and r5, respectively. What is
the ARM assembly code for the body of the procedure?
ANSWER

Note that x[i][j] is used in the innermost loop above. Since the loop index is k, the index does not affect x[i][j], so we can avoid loading and storing x[i][j] each iteration. Instead, the compiler loads x[i][j] into a register outside the loop, accumulates the sum of the products of y[i][k] and z[k][j] in that same register, and then stores the sum into x[i][j] upon termination of the innermost loop.
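In C, that transformation might look like the following sketch (the local variable sum plays the role of the register that will hold x[i][j]; the function name mm_hoisted is ours for illustration):

void mm_hoisted(double x[][32], double y[][32], double z[][32])
{
    int i, j, k;
    for (i = 0; i < 32; i = i + 1)
        for (j = 0; j < 32; j = j + 1) {
            double sum = x[i][j];              /* load x[i][j] once, before the k loop */
            for (k = 0; k < 32; k = k + 1)
                sum = sum + y[i][k] * z[k][j]; /* accumulate in a register             */
            x[i][j] = sum;                     /* store once, when the k loop ends     */
        }
}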
As we did in Chapter 2, let’s rename the registers to make it easier to read
and write the code:
x        RN 0    ; 1st argument address of x
y        RN 1    ; 2nd argument address of y
z        RN 2    ; 3rd argument address of z
i        RN 3    ; local variable i
j        RN 4    ; local variable j
k        RN 5    ; local variable k
xijAddr  RN 6    ; address of x[i][j]
tempAddr RN 12   ; address of y[i][j] or z[i][j]
Since r4, r5, and r6 are not preserved across the procedure call, we must
save them:
mm:   SUB  sp,sp,#12     ; make room on stack for 3 registers
      STR  r4, [sp, #8]  ; save r4 on stack
      STR  r5, [sp, #4]  ; save r5 on stack
      STR  r6, [sp, #0]  ; save r6 on stack
The body of the procedure starts with initializing the three for loop
variables:
      MOV  i, #0         ; i = 0; initialize 1st for loop
L1:   MOV  j, #0         ; j = 0; restart 2nd for loop
L2:   MOV  k, #0         ; k = 0; restart 3rd for loop
To calculate the address of x[i][j], we need to know how a 32 × 32,
two-dimensional array is stored in memory. As you might expect, its layout is
the same as if there were 32 single-dimension arrays, each with 32 elements.
So the first step is to skip over the i “single-dimensional arrays,” or rows, to
get the one we want. Thus, we multiply the index in the first dimension by the
size of the row, 32. Since 32 is a power of 2, we can use a shift instead as part of
adding the second index to select the jth element of the desired row:
      ADD  xijAddr, j, i, LSL #5         ; xijAddr = i*size(row) + j
To turn this sum into a byte index, we multiply it by the size of a matrix
element in bytes. Since each element is 8 bytes for double precision, we can
instead shift left by 3 as part of adding this sum to the base address of x, giving the address of x[i][j]:
      ADD  xijAddr, x, xijAddr, LSL #3   ; xijAddr = byte address of x[i][j]
We then load the double precision number x[i][j] into s4:
      FLDD s4, [xijAddr,#0]              ; s4 = 8 bytes of x[i][j]
The following three instructions are virtually identical to the last three:
calculate the address and then load the double precision number z[k][j].
L3:   ADD  tempAddr, j, k, LSL #5        ; tempAddr = k * size(row) + j
      ADD  tempAddr, z, tempAddr, LSL #3 ; tempAddr = byte address of z[k][j]
      FLDD s16, [tempAddr,#0]            ; s16 = 8 bytes of z[k][j]
Similarly, the next three instructions are like the last three: calculate the
address and then load the double precision number y[i][k].
      ADD  tempAddr, k, i, LSL #5        ; tempAddr = i * size(row) + k
      ADD  tempAddr, y, tempAddr, LSL #3 ; tempAddr = byte address of y[i][k]
      FLDD s18, [tempAddr,#0]            ; s18 = 8 bytes of y[i][k]
Now that we have loaded all the data, we are finally ready to do some
floating-point operations! We multiply elements of y and z located in registers s18 and s16, and then accumulate the sum in s4.
      FMULD s16, s18, s16                ; s16 = y[i][k] * z[k][j]
      FADDD s4, s4, s16                  ; s4 = x[i][j] + y[i][k] * z[k][j]
The final block increments the index k and loops back if the index is not
32. If it is 32, and thus the end of the innermost loop, we need to store the sum
accumulated in s4 into x[i][j].
      ADD  k, k, #1                      ; k = k + 1
      CMP  k, #32
      BLT  L3                            ; if (k < 32) go to L3
      FSTD s4, [xijAddr,#0]              ; x[i][j] = s4
Similarly, these final four instructions increment the index variable of the
middle and outermost loops, looping back if the index is less than 32 and exiting if the index is 32.
      ADD  j, j, #1                      ; j = j + 1
      CMP  j, #32
      BLT  L2                            ; if (j < 32) go to L2
      ADD  i, i, #1                      ; i = i + 1
      CMP  i, #32
      BLT  L1                            ; if (i < 32) go to L1
The end of the procedure restores r4, r5, and r6 from the stack, and then returns to the caller.
Elaboration: The array layout discussed in the example, called row-major order, is
used by C and many other programming languages. Fortran instead uses column-major
order, whereby the array is stored column by column.
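The shift-and-add address arithmetic in the example corresponds to the following C sketch for row-major order (the helper name addr_xij is ours; column-major order would use j*32 + i instead):

#include <stdint.h>

/* Byte address of element [i][j] in a 32 x 32 array of doubles stored
   in row-major order: base + (i*32 + j)*8. */
double *addr_xij(double *base, int i, int j) {
    uint32_t index = ((uint32_t)i << 5) + (uint32_t)j;  /* i*size(row) + j  */
    return (double *)((uint8_t *)base + (index << 3));  /* scale by 8 bytes */
}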
Elaboration: Another reason for separate integer and floating-point registers is that microprocessors in the 1980s didn't have enough transistors to put the floating-point unit on the same chip as the integer unit. Hence, the floating-point unit, including the floating-point registers, was optionally available as a second chip. Such optional accelerator chips are called coprocessors. Today a coprocessor function may mean it is optional for some members of the family, in that lower-cost chips might leave out the coprocessor so that the chip can be smaller.
Elaboration: As mentioned in Section 3.4, accelerating division is more challenging
than multiplication. In addition to SRT, another technique to leverage a fast multiplier
is Newton’s iteration, where division is recast as finding the zero of a function to find
the reciprocal 1/x, which is then multiplied by the other operand. Iteration techniques
cannot be rounded properly without calculating many extra bits. A TI chip solves this
problem by calculating an extra-precise reciprocal.
Elaboration: Java embraces IEEE 754 by name in its definition of Java floating-point
data types and operations. Thus, the code in the first example could have well been
generated for a class method that converted Fahrenheit to Celsius.
The second example uses multiple dimensional arrays, which are not explicitly
supported in Java. Java allows arrays of arrays, but each array may have its own length,
unlike multiple dimensional arrays in C. Like the examples in Chapter 2, a Java version
of this second example would require a good deal of checking code for array bounds,
including a new length calculation at the end of row access. It would also need to check
that the object reference is not null.
Accurate Arithmetic
guard The first of two extra bits kept on the right during intermediate calculations of floating-point numbers; used to improve rounding accuracy.

round Method to make the intermediate floating-point result fit the floating-point format; the goal is typically to find the nearest number that can be represented in the format.
Unlike integers, which can represent exactly every number between the smallest and largest number, floating-point numbers are normally approximations for a number they can't really represent. The reason is that an infinite variety of real numbers exists between, say, 0 and 1, but no more than 2^53 can be represented exactly in double precision floating point. The best we can do is getting the floating-point representation close to the actual number. Thus, IEEE
754 offers several modes of rounding to let the programmer pick the desired
approximation.
Rounding sounds simple enough, but to round accurately requires the hardware
to include extra bits in the calculation. In the preceding examples, we were vague
on the number of bits that an intermediate representation can occupy, but clearly,
if every intermediate result had to be truncated to the exact number of digits, there
would be no opportunity to round. IEEE 754, therefore, always keeps two extra bits
on the right during intermediate additions, called guard and round, respectively.
Let’s do a decimal example to illustrate their value.
Rounding with Guard Digits
EXAMPLE

Add 2.56ten × 10^0 to 2.34ten × 10^2, assuming that we have three significant decimal digits. Round to the nearest decimal number with three significant decimal digits, first with guard and round digits, and then without them.

ANSWER

First we must shift the smaller number to the right to align the exponents, so 2.56ten × 10^0 becomes 0.0256ten × 10^2. Since we have guard and round digits, we are able to represent the two least significant digits when we align exponents. The guard digit holds 5 and the round digit holds 6. The sum is

    2.3400ten
  + 0.0256ten
  -----------
    2.3656ten

Thus the sum is 2.3656ten × 10^2. Since we have two digits to round, we want values 0 to 49 to round down and 51 to 99 to round up, with 50 being the tiebreaker. Rounding the sum up with three significant digits yields 2.37ten × 10^2.
Doing this without guard and round digits drops two digits from the calculation. The new sum is then

    2.34ten
  + 0.02ten
  ---------
    2.36ten

The answer is 2.36ten × 10^2, off by 1 in the last digit from the sum above.
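The effect can be imitated in C by rounding each result to three significant decimal digits; this is a rough sketch in which round3 (our own helper) uses round-half-up on positive values only, just to mimic the hand calculation, not IEEE rounding:

#include <stdio.h>
#include <math.h>

/* Round a positive value to 3 significant decimal digits. */
static double round3(double v) {
    double scale = pow(10.0, 2 - floor(log10(v)));
    return floor(v * scale + 0.5) / scale;
}

int main(void) {
    /* With guard and round digits the shifted operand keeps 0.0256e2. */
    printf("%g\n", round3(2.34e2 + 0.0256e2));  /* 237 */
    /* Without them it is chopped to 0.02e2 before the add. */
    printf("%g\n", round3(2.34e2 + 0.02e2));    /* 236 */
    return 0;
}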
Since the worst case for rounding would be when the actual number is halfway
between two floating-point representations, accuracy in floating point is normally
measured in terms of the number of bits in error in the least significant bits of the
significand; the measure is called the number of units in the last place, or ulp. If
a number were off by 2 in the least significant bits, it would be called off by 2 ulps.
Provided there are no overflow, underflow, or invalid operation exceptions, IEEE 754 guarantees that the computer uses the number that is within one-half ulp.
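You can see the size of an ulp directly with the C99 math library (a minimal sketch; link with -lm on most systems):

#include <stdio.h>
#include <math.h>

int main(void) {
    /* One ulp at 1.0f is the gap to the next representable float: 2^-23. */
    float x = 1.0f;
    printf("ulp(1.0f) = %g\n", nextafterf(x, 2.0f) - x);  /* about 1.19e-07 */
    return 0;
}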
Elaboration: Although the example above really needed just one extra digit, multiply
can need two. A binary product may have one leading 0 bit; hence, the normalizing step
must shift the product one bit left. This shifts the guard digit into the least significant bit
of the product, leaving the round bit to help accurately round the product.
IEEE 754 has four rounding modes: always round up (toward +∞), always round
down (toward −∞), truncate, and round to nearest even. The final mode determines what
to do if the number is exactly halfway in between. The U.S. Internal Revenue Service
(IRS) always rounds 0.50 dollars up, possibly to the benefit of the IRS. A more equitable
way would be to round up this case half the time and round down the other half. IEEE
754 says that if the least significant bit retained in a halfway case would be odd, add
one; if it’s even, truncate. This method always creates a 0 in the least significant bit in
the tie-breaking case, giving the rounding mode its name. This mode is the most commonly used, and the only one that Java supports.
The goal of the extra rounding bits is to allow the computer to get the same results
as if the intermediate results were calculated to infinite precision and then rounded. To
support this goal and round to the nearest even, the standard has a third bit in addition
to guard and round; it is set whenever there are nonzero bits to the right of the round
bit. This sticky bit allows the computer to see the difference between 0.50 . . . 00 ten
and 0.50 . . . 01ten when rounding.
The sticky bit may be set, for example, during addition, when the smaller number is
shifted to the right. Suppose we added 5.01ten × 10^−1 to 2.34ten × 10^2 in the example
above. Even with guard and round, we would be adding 0.0050 to 2.34, with a sum of
2.3450. The sticky bit would be set, since there are nonzero bits to the right. Without
the sticky bit to remember whether any 1s were shifted off, we would assume the number
is equal to 2.345000 . . . 00 and round to the nearest even of 2.34. With the sticky bit to
remember that the number is larger than 2.345000 . . . 00, we round instead to 2.35.
units in the last place
(ulp) The number of
bits in error in the least
significant bits of the
significand between
the actual number and
the number that can be
represented.
sticky bit A bit used in
rounding in addition to
guard and round that is
set whenever there are
nonzero bits to the right
of the round bit.
fused multiply add A floating-point instruction that performs both a multiply and an add, but rounds only once after the add.

Elaboration: PowerPC, SPARC64, and AMD SSE5 architectures provide a single instruction that does a multiply and add on three registers: a = a + (b × c). Obviously, this instruction allows potentially higher floating-point performance for this common operation. Equally important is that instead of performing two roundings (after the multiply and then after the add), which would happen with separate instructions, the multiply add instruction can perform a single rounding after the add. A single rounding step increases the precision of multiply add. Such operations with a single rounding are called fused multiply add. It was added to the revised IEEE 754 standard (see Section 3.10 on the CD).
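C99 exposes this operation as fma() in <math.h>; the sketch below (our own choice of operands) shows a case where the single rounding keeps a low-order term that two roundings would lose, provided the compiler does not already contract the plain expression into a fused multiply add:

#include <stdio.h>
#include <math.h>

int main(void) {
    double a = -1.0;
    double b = 1.0 + 0x1p-27, c = 1.0 + 0x1p-27; /* b*c = 1 + 2^-26 + 2^-54       */
    printf("two roundings: %.17g\n", a + b * c); /* the 2^-54 term is rounded away */
    printf("one rounding : %.17g\n", fma(b, c, a)); /* multiply-add rounded once   */
    return 0;
}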
Summary
The Big Picture that follows reinforces the stored-program concept from Chapter 2;
the meaning of the information cannot be determined just by looking at the bits, for
the same bits can represent a variety of objects. This section shows that computer
arithmetic is finite and thus can disagree with natural arithmetic. For example, the
IEEE 754 standard floating-point representation

(−1)^S × (1 + Fraction) × 2^(Exponent − Bias)
is almost always an approximation of the real number. Computer systems must
take care to minimize this gap between computer arithmetic and arithmetic in the
real world, and programmers at times need to be aware of the implications of this
approximation.
The BIG Picture
Bit patterns have no inherent meaning. They may represent signed integers, unsigned integers, floating-point numbers, instructions, and so on.
What is represented depends on the instruction that operates on the bits
in the word.
The major difference between computer numbers and numbers in the
real world is that computer numbers have limited size and hence limited
precision; it’s possible to calculate a number too big or too small to be represented in a word. Programmers must remember these limits and write
programs accordingly.
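A few lines of C make the first point concrete (a sketch; memcpy reinterprets the bits rather than converting the value, and a 32-bit unsigned int is assumed):

#include <stdio.h>
#include <string.h>

int main(void) {
    float f = -0.21875f;             /* the product from the worked example     */
    unsigned int bits;               /* assumes unsigned int is 32 bits         */
    memcpy(&bits, &f, sizeof bits);  /* same 32 bits, viewed as an unsigned int */
    printf("as float   : %f\n", f);
    printf("as unsigned: 0x%08x\n", bits);  /* 0xbe600000 with IEEE 754 floats  */
    return 0;
}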
C type         Java type   Data transfers   Operations
int            int         LDR, STR         ADD, SUB, MUL, AND, ORR, MVN, MOV, CMP
unsigned int   —           LDR, STR         ADD, SUB, MUL, AND, ORR, MVN, MOV, CMP
char           —           LDRSB, STRB      ADD, SUB, MUL, AND, ORR, MVN, MOV, CMP
—              char        LDRSH, STRH      ADD, SUB, MUL, AND, ORR, MVN, MOV, CMP
float          float       FLDS, FSTS       FADDS, FSUBS, FMULS, FDIVS, FCMPS
double         double      FLDD, FSTD       FADDD, FSUBD, FMULD, FDIVD, FCMPD
Hardware/Software Interface

In the last chapter, we presented the storage classes of the programming language C (see the Hardware/Software Interface section in Section 2.7). The table above shows some of the C and Java data types, the ARM data transfer instructions, and the instructions that operate on those types that appear in Chapter 2 and this chapter. Note that Java omits unsigned integers.
Check Yourself

Suppose there was a 16-bit IEEE 754 floating-point format with five exponent bits. What would be the likely range of numbers it could represent?

1. 1.0000 0000 00 × 2^0 to 1.1111 1111 11 × 2^31, 0
2. ±1.0000 0000 0 × 2^−14 to ±1.1111 1111 1 × 2^15, ±0, ±∞, NaN
3. ±1.0000 0000 00 × 2^−14 to ±1.1111 1111 11 × 2^15, ±0, ±∞, NaN
4. ±1.0000 0000 00 × 2^−15 to ±1.1111 1111 11 × 2^14, ±0, ±∞, NaN
Elaboration: To accommodate comparisons that may include NaNs, the standard
includes ordered and unordered as options for compares. Hence, the full ARM instruction
set has many flavors of compares to support NaNs. (Java does not support unordered
compares.)
In an attempt to squeeze every last bit of precision from a floating-point operation,
the standard allows some numbers to be represented in unnormalized form. Rather than
having a gap between 0 and the smallest normalized number, IEEE allows denormalized
numbers (also known as denorms or subnormals). They have the same exponent as
zero but a nonzero significand. They allow a number to degrade in significance until it
becomes 0, called gradual underflow. For example, the smallest positive single precision
normalized number is
1.0000 0000 0000 0000 0000 000two × 2–126
but the smallest single precision denormalized number is
0.0000 0000 0000 0000 0000 001two × 2–126, or 1.0two × 2–149
For double precision, the denorm gap goes from 1.0 × 2–1022 to 1.0 × 2–1074.
The possibility of an occasional unnormalized operand has given headaches to
floating-point designers who are trying to build fast floating-point units. Hence, many
computers cause an exception if an operand is denormalized, letting software complete
the operation. Although software implementations are perfectly valid, their lower performance has lessened the popularity of denorms in portable floating-point software.
Moreover, if programmers do not expect denorms, their programs may surprise them.
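On a machine that supports gradual underflow you can observe both boundaries from C (a minimal sketch; link with -lm):

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    /* Smallest normalized single precision number, 1.0two x 2^-126. */
    printf("FLT_MIN         = %g\n", FLT_MIN);
    /* Smallest positive denorm, 2^-149, if denormalized numbers are supported. */
    printf("smallest denorm = %g\n", nextafterf(0.0f, 1.0f));
    return 0;
}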
3.6 Parallelism and Computer Arithmetic: Associativity
Programs have typically been written first to run sequentially before being rewritten to run concurrently, so a natural question is, “do the two versions get the same
answer?” If the answer is no, you presume there is a bug in the parallel version that
you need to track down.
This approach assumes that computer arithmetic does not affect the results
when going from sequential to parallel. That is, if you were to add a million numbers together, you would get the same results whether you used 1 processor or 1000
processors. This assumption holds for two’s complement integers, even if the computation overflows. Another way to say this is that integer addition is associative.
Alas, because floating-point numbers are approximations of real numbers and because computer arithmetic has limited precision, it does not hold for floating-point numbers. That is, floating-point addition is not associative.
Testing Associativity of Floating-Point Addition
EXAMPLE
See if x + (y + z) = (x + y) + z. For example, suppose x = −1.5ten × 10^38, y = 1.5ten × 10^38, and z = 1.0, and that these are all single precision numbers.
ANSWER
Given the great range of numbers that can be represented in floating point,
problems occur when adding two large numbers of opposite signs plus a small
number, as we shall see:
x + (y + z) = −1.5ten × 10^38 + (1.5ten × 10^38 + 1.0)
            = −1.5ten × 10^38 + (1.5ten × 10^38)
            = 0.0
(x + y) + z = (−1.5ten × 10^38 + 1.5ten × 10^38) + 1.0
            = (0.0ten) + 1.0
            = 1.0

Therefore x + (y + z) ≠ (x + y) + z, so floating-point addition is not associative. Since floating-point numbers have limited precision and result in approximations of real results, 1.5ten × 10^38 is so much larger than 1.0ten that 1.5ten × 10^38 + 1.0 is still 1.5ten × 10^38. That is why the sum of x, y, and z is 0.0 or 1.0, depending on the order of the floating-point additions, and hence floating-point add is not associative.
A more vexing version of this pitfall occurs on a parallel computer where the
operating system scheduler may use a different number of processors depending
on what other programs are running on a parallel computer. The unaware parallel
programmer may be flummoxed by his or her program getting slightly different
answers each time it is run for the same identical code and the same identical input,
as the varying number of processors from each run would cause the floating-point
sums to be calculated in different orders.
Given this quandary, programmers who write parallel code with floating-point
numbers need to verify whether the results are credible even if they don’t give the
same exact answer as the sequential code. The field that deals with such issues is
called numerical analysis, which is the subject of textbooks in its own right. Such
concerns are one reason for the popularity of numerical libraries such as LAPACK
and ScaLAPACK, which have been validated in both their sequential and parallel
forms.
Elaboration: A subtle version of the associativity issue occurs when two processors
perform a redundant computation that is executed in different order so they get slightly different answers, although both answers are considered accurate. The bug occurs if a conditional branch compares to a floating-point number and the two processors take different
branches when common sense reasoning suggests they should take the same branch.
3.7 Real Stuff: Floating Point in the x86

The main differences between ARM and x86 arithmetic instructions are found in floating-point instructions. The x86 floating-point architecture is different from ARM and from all other computers in the world.
The x86 Floating-Point Architecture
The Intel 8087 floating-point coprocessor was announced in 1980. This architecture extended the 8086 with about 60 floating-point instructions.
Intel provided a stack architecture with its floating-point instructions: loads
push numbers onto the stack, operations find operands in the two top elements of
the stack, and stores can pop elements off the stack. Intel supplemented this stack
architecture with instructions and addressing modes that allow the architecture
to have some of the benefits of a register-memory model. In addition to finding
operands in the top two elements of the stack, one operand can be in memory or in
one of the seven registers on-chip below the top of the stack. Thus, a complete stack
instruction set is supplemented by a limited set of register-memory instructions.
This hybrid is still a restricted register-memory model, however, since loads always
move data to the top of the stack while incrementing the top-of-stack pointer, and
stores can only move the top of stack to memory. Intel uses the notation ST to indicate the top of stack, and ST(i) to represent the ith register below the top of stack.
Another novel feature of this architecture is that the operands are wider in the
register stack than they are stored in memory, and all operations are performed
at this wide internal precision. Unlike the maximum of 64 bits on ARM, the x86
floating-point operands on the stack are 80 bits wide. Numbers are automatically
converted to the internal 80-bit format on a load and converted back to the appropriate size on a store. This double extended precision is not supported by programming
languages, although it has been useful to programmers of mathematical software.
Memory data can be 32-bit (single precision) or 64-bit (double precision)
floating-point numbers. The register-memory version of these instructions will
then convert the memory operand to this Intel 80-bit format before performing the operation. The data transfer instructions also will automatically convert
16- and 32-bit integers to floating point, and vice versa, for integer loads and stores.
The x86 floating-point operations can be divided into four major classes:
1. Data movement instructions, including load, load constant, and store
2. Arithmetic instructions, including add, subtract, multiply, divide, square
root, and absolute value
3. Comparison, including instructions to send the result to the integer processor
so that it can branch
4. Transcendental instructions, including sine, cosine, log, and exponentiation
Figure 3.18 shows some of the 60 floating-point operations. Note that we get even
more combinations when we include the operand modes for these operations.
Figure 3.19 shows the many options for floating-point add.
Data transfer         Arithmetic                   Compare            Transcendental
F{I}LD mem/ST(i)      F{I}ADD{P} mem/ST(i)         F{I}COM{P}         FPATAN
F{I}ST{P} mem/ST(i)   F{I}SUB{R}{P} mem/ST(i)      F{I}UCOM{P}{P}     F2XM1
FLDPI                 F{I}MUL{P} mem/ST(i)         FSTSW AX/mem       FCOS
FLD1                  F{I}DIV{R}{P} mem/ST(i)                         FPTAN
FLDZ                  FSQRT                                           FPREM
                      FABS                                            FSIN
                      FRNDINT                                         FYL2X
FIGURE 3.18 The floating-point instructions of the x86. We use the curly brackets {} to show
optional variations of the basic operations: {I} means there is an integer version of the instruction, {P}
means this variation will pop one operand off the stack after the operation, and {R} means reverse the
order of the operands in this operation. The first column shows the data transfer instructions, which move
data to memory or to one of the registers below the top of the stack. The last three operations in the first
column push constants on the stack: pi, 1.0, and 0.0. The second column contains the arithmetic operations
described above. Note that the last three operate only on the top of stack. The third column is the compare
instructions. Since there are no special floating-point branch instructions, the result of the compare must
be transferred to the integer CPU via the FSTSW instruction, either into the AX register or into memory, followed by an SAHF instruction to set the condition codes. The floating-point comparison can then be tested
using integer branch instructions. The final column gives the higher-level floating-point operations. Not all
combinations suggested by the notation are provided. Hence, F{I}SUB{R}{P} operations represent these
instructions found in the x86: FSUB, FISUB, FSUBR, FISUBR, FSUBP, FSUBRP. For the integer subtract
instructions, there is no pop (FISUBP) or reverse pop (FISUBRP).
Instruction   Operands     Comment
FADD                       Both operands in stack; result replaces top of stack.
FADD          ST(i)        One source operand is ith register below the top of stack; result replaces the top of stack.
FADD          ST(i), ST    One source operand is the top of stack; result replaces ith register below the top of stack.
FADD          mem32        One source operand is a 32-bit location in memory; result replaces the top of stack.
FADD          mem64        One source operand is a 64-bit location in memory; result replaces the top of stack.

FIGURE 3.19 The variations of operands for floating-point add in the x86.
The floating-point instructions are encoded using the ESC opcode of the 8086
and the postbyte address specifier (see Figure 2.45). The memory operations reserve
2 bits to decide whether the operand is a 32- or 64-bit floating point or a 16- or
32-bit integer. Those same 2 bits are used in versions that do not access memory to
decide whether the stack should be popped after the operation and whether the top
of stack or a lower register should get the result.
In the past, floating-point performance of the x86 family lagged far behind other
computers. As a result, Intel created a more traditional floating-point architecture
as part of SSE2.
The Intel Streaming SIMD Extension 2 (SSE2)
Floating-Point Architecture
Chapter 2 notes that in 2001 Intel added 144 instructions to its architecture, including double precision floating-point registers and operations. It includes eight
64-bit registers that can be used for floating-point operands, giving the compiler
a different target for floating-point operations than the unique stack architecture.
Compilers can choose to use the eight SSE2 registers as floating-point registers like
those found in other computers. AMD expanded the number to 16 registers as part
of AMD64, which Intel relabeled EM64T for its use. Figure 3.20 summarizes the
SSE and SSE2 instructions.
In addition to holding a single precision or double precision number in a register, Intel allows multiple floating-point operands to be packed into a single 128-bit
SSE2 register: four single precision or two double precision. Thus, the 16 floating-point registers for SSE2 are actually 128 bits wide. If the operands can be arranged
in memory as 128-bit aligned data, then 128-bit data transfers can load and store
multiple operands per instruction. This packed floating-point format is supported
by arithmetic operations that can operate simultaneously on four singles (PS) or
two doubles (PD). This architecture more than doubles performance over the stack
architecture.
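Compilers reach these instructions through intrinsics; the sketch below uses the <emmintrin.h> intrinsics and the unaligned forms so no special alignment is needed, adding two packed doubles with a single ADDPD:

#include <stdio.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

int main(void) {
    double a[2] = {1.5, 2.5};
    double b[2] = {10.0, 20.0};
    double c[2];

    __m128d va = _mm_loadu_pd(a);     /* 128-bit load: two doubles           */
    __m128d vb = _mm_loadu_pd(b);
    __m128d vc = _mm_add_pd(va, vb);  /* ADDPD: both adds in one instruction */
    _mm_storeu_pd(c, vc);             /* aligned forms need 16-byte data     */

    printf("%f %f\n", c[0], c[1]);    /* 11.500000 22.500000 */
    return 0;
}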
Data transfer                        Arithmetic                       Compare
MOV{A/U}{SS/PS/SD/PD} xmm, mem/xmm   ADD{SS/PS/SD/PD} xmm, mem/xmm    CMP{SS/PS/SD/PD}
MOV{H/L}{PS/PD} xmm, mem/xmm         SUB{SS/PS/SD/PD} xmm, mem/xmm
                                     MUL{SS/PS/SD/PD} xmm, mem/xmm
                                     DIV{SS/PS/SD/PD} xmm, mem/xmm
                                     SQRT{SS/PS/SD/PD} mem/xmm
                                     MAX{SS/PS/SD/PD} mem/xmm
                                     MIN{SS/PS/SD/PD} mem/xmm
FIGURE 3.20 The SSE/SSE2 floating-point instructions of the x86. xmm means one operand is
a 128-bit SSE2 register, and mem/xmm means the other operand is either in memory or it is an SSE2 register.
We use the curly brackets {} to show optional variations of the basic operations: {SS} stands for Scalar Single
precision floating point, or one 32-bit operand in a 128-bit register; {PS} stands for Packed Single precision
floating point, or four 32-bit operands in a 128-bit register; {SD} stands for Scalar Double precision floating
point, or one 64-bit operand in a 128-bit register; {PD} stands for Packed Double precision floating point, or
two 64-bit operands in a 128-bit register; {A} means the 128-bit operand is aligned in memory; {U} means
the 128-bit operand is unaligned in memory; {H} means move the high half of the 128-bit operand; and {L}
means move the low half of the 128-bit operand.
Thus mathematics
may be defined as the
subject in which we
never know what
we are talking about,
nor whether what we
are saying is true.
Bertrand Russell, Recent
Words on the Principles
of Mathematics, 1901
3.8 Fallacies and Pitfalls
Arithmetic fallacies and pitfalls generally stem from the difference between the
limited precision of computer arithmetic and the unlimited precision of natural
arithmetic.
Fallacy: Just as a left shift instruction can replace an integer multiply by a power of
2, a right shift is the same as an integer division by a power of 2.
Recall that a binary number x, where xi means the ith bit, represents the number

. . . + (x3 × 2^3) + (x2 × 2^2) + (x1 × 2^1) + (x0 × 2^0)

Shifting the bits of x right by n bits would seem to be the same as dividing by 2^n. And this is true for unsigned integers. The problem is with signed integers. For
example, suppose we want to divide −5ten by 4ten; the quotient should be −1ten. The
two’s complement representation of −5ten is
1111 1111 1111 1111 1111 1111 1111 1011two
According to this fallacy, shifting right by two should divide by 4ten (2^2):
0011 1111 1111 1111 1111 1111 1111 1110two
04-Ch03-P374750.indd 262
7/3/09 9:00:49 AM
3.8
Fallacies and Pitfalls
263
With a 0 in the sign bit, this result is clearly wrong. The value created by the shift
right is actually 1,073,741,822ten instead of −1ten.
A solution would be to have an arithmetic right shift that extends the sign bit
instead of shifting in 0s. Indeed, ARM has another optional shift for the second
operand of data processing instructions that performs an arithmetic right shift:
ASR. A 2-bit arithmetic shift right of −5ten produces
1111 1111 1111 1111 1111 1111 1111 1110two
The result is −2ten instead of −1ten; close, but no cigar.
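The same comparison in C (a sketch; note that right-shifting a negative number is implementation-defined in C, though most machines use an arithmetic shift like ARM's ASR):

#include <stdio.h>

int main(void) {
    int x = -5;
    printf("-5 / 4  = %d\n", x / 4);   /* -1: C division truncates toward zero   */
    printf("-5 >> 2 = %d\n", x >> 2);  /* typically -2: arithmetic shift rounds  */
    return 0;                          /* toward minus infinity                  */
}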
Fallacy: Only theoretical mathematicians care about floating-point accuracy.
Newspaper headlines of November 1994 prove this statement is a fallacy (see
Figure 3.21). The following is the inside story behind the headlines.
The Pentium uses a standard floating-point divide algorithm that generates multiple quotient bits per step, using the most significant bits of divisor and
dividend to guess the next 2 bits of the quotient. The guess is taken from a lookup
table containing −2, −1, 0, +1, or +2. The guess is multiplied by the divisor and
subtracted from the remainder to generate a new remainder. Like nonrestoring
FIGURE 3.21 A sampling of newspaper and magazine articles from November 1994,
including the New York Times, San Jose Mercury News, San Francisco Chronicle, and Infoworld. The
Pentium floating-point divide bug even made the “Top 10 List” of the David Letterman Late Show on television. Intel eventually took a $300 million write-off to replace the buggy chips.
division, if a previous guess gets too large a remainder, the partial remainder is
adjusted in a subsequent pass.
Evidently, there were five elements of the table from the 80486 that Intel thought
could never be accessed, and they optimized the PLA to return 0 instead of 2 in these
situations on the Pentium. Intel was wrong: while the first 11 bits were always correct,
errors would show up occasionally in bits 12 to 52, or the 4th to 15th decimal digits.
The following is a timeline of the Pentium bug morality play.
■ July 1994: Intel discovers the bug in the Pentium. The actual cost to fix the bug
was several hundred thousand dollars. Following normal bug fix procedures, it
will take months to make the change, reverify, and put the corrected chip into
production. Intel planned to put good chips into production in January 1995,
estimating that 3 to 5 million Pentiums would be produced with the bug.
■ September 1994: A math professor at Lynchburg College in Virginia, Thomas
Nicely, discovers the bug. After calling Intel technical support and getting
no official reaction, he posts his discovery on the Internet. It quickly gained
a following, and some pointed out that even small errors become big when
multiplied by big numbers: the fraction of people with a rare disease times
the population of Europe, for example, might lead to the wrong estimate of
the number of sick people.
■ November 7, 1994: Electronic Engineering Times puts the story on its front
page, which is soon picked up by other newspapers.
■ November 22, 1994: Intel issues a press release, calling it a “glitch.” The Pentium
“can make errors in the ninth digit. . . . Even most engineers and financial analysts require accuracy only to the fourth or fifth decimal point. Spreadsheet
and word processor users need not worry. . . . There are maybe several dozen
people that this would affect. So far, we’ve only heard from one. . . . [Only]
theoretical mathematicians (with Pentium computers purchased before the
summer) should be concerned.” What irked many was that customers were
told to describe their application to Intel, and then Intel would decide whether
or not their application merited a new Pentium without the divide bug.
■ December 5, 1994: Intel claims the flaw happens once in 27,000 years for the
typical spreadsheet user. Intel assumes a user does 1000 divides per day and
multiplies the error rate assuming floating-point numbers are random, which
is one in 9 billion, and then gets 9 million days, or 27,000 years. Things begin
to calm down, despite Intel neglecting to explain why a typical customer
would access floating-point numbers randomly.
■ December 12, 1994: IBM Research Division disputes Intel’s calculation
of the rate of errors (you can access this article by visiting www.mkp.com/
books_catalog/cod/links.htm). IBM claims that common spreadsheet programs, recalculating for 15 minutes a day, could produce Pentium-related
errors as often as once every 24 days. IBM assumes 5000 divides per second,
for 15 minutes, yielding 4.2 million divides per day, and does not assume
random distribution of numbers, instead calculating the chances as one in
100 million. As a result, IBM immediately stops shipment of all IBM personal
computers based on the Pentium. Things heat up again for Intel.
■ December 21, 1994: Intel releases the following, signed by Intel’s president,
chief executive officer, chief operating officer, and chairman of the board:
“We at Intel wish to sincerely apologize for our handling of the recently
publicized Pentium processor flaw. The Intel Inside symbol means that
your computer has a microprocessor second to none in quality and performance. Thousands of Intel employees work very hard to ensure that
this is true. But no microprocessor is ever perfect. What Intel continues
to believe is technically an extremely minor problem has taken on a life
of its own. Although Intel firmly stands behind the quality of the current
version of the Pentium processor, we recognize that many users have concerns. We want to resolve these concerns. Intel will exchange the current
version of the Pentium processor for an updated version, in which this
floating-point divide flaw is corrected, for any owner who requests it, free
of charge anytime during the life of their computer.”
Analysts estimate that this recall cost Intel $500 million, and Intel engineers did not
get a Christmas bonus that year.
This story brings up a few points for everyone to ponder. How much cheaper
would it have been to fix the bug in July 1994? What was the cost to repair the
damage to Intel’s reputation? And what is the corporate responsibility in disclosing
bugs in a product so widely used and relied upon as a microprocessor?
In April 1997, another floating-point bug was revealed in the Pentium Pro and
Pentium II microprocessors. When the floating-point-to-integer store instructions
(fist, fistp) encounter a negative floating-point number that is too large to fit in a
16- or 32-bit word after being converted to integer, they set the wrong bit in the FPU
status word (precision exception instead of invalid operation exception). To Intel’s
credit, this time they publicly acknowledged the bug and offered a software patch
to get around it—quite a different reaction from what they did in 1994.
3.9 Concluding Remarks
A side effect of the stored-program computer is that bit patterns have no inherent
meaning. The same bit pattern may represent a signed integer, unsigned integer,
floating-point number, instruction, and so on. It is the instruction that operates on
the word that determines its meaning.
Computer arithmetic is distinguished from paper-and-pencil arithmetic by the
constraints of limited precision. This limit may result in invalid operations through
calculating numbers larger or smaller than the predefined limits. Such anomalies,
called “overflow” or “underflow,” may result in exceptions or interrupts, emergency
events similar to unplanned subroutine calls. Chapter 4 discusses exceptions in
more detail.
Floating-point arithmetic has the added challenge of being an approximation
of real numbers, and care needs to be taken to ensure that the computer number
selected is the representation closest to the actual number. The challenges of
imprecision and limited representation are part of the inspiration for the field of
numerical analysis. The recent switch to parallelism will shine the searchlight on
numerical analysis again, as solutions that were long considered safe on sequential
computers must be reconsidered when trying to find the fastest algorithm for parallel computers that still achieves a correct result.
Over the years, computer arithmetic has become largely standardized, greatly
enhancing the portability of programs. Two’s complement binary integer arithmetic and IEEE 754 binary floating-point arithmetic are found in the vast majority of
computers sold today. For example, every desktop computer sold since this book
was first printed follows these conventions.
With the explanation of computer arithmetic in this chapter comes a description
of much more of the ARM instruction set. Figure 3.22 lists the ARM instructions
ARM core instructions

Name                                       Mnemonic   Format
add                                        ADD        DP
subtract                                   SUB        DP
move                                       MOV        DP
AND                                        AND        DP
OR                                         ORR        DP
NOT                                        MVN        DP
logical shift left                         LSL        DP
logical shift right                        LSR        DP
load register                              LDR        DT
store register                             STR        DT
load register halfword                     LDRH       DT
store register halfword                    STRH       DT
load register byte                         LDRB       DT
store register byte                        STRB       DT
swap (atomic update)                       SWP        DT
branch on x (x = eq, ne, lt, le, gt, ge)   BEQ        BR
compare                                    CMP        DP
branch (always)                            B          BR
branch and link                            BL         BR

ARM arithmetic core

Name                                       Mnemonic   Format
multiply                                   MUL        DP
floating-point add single                  FADDS      R
floating-point add double                  FADDD      R
floating-point subtract single             FSUBS      R
floating-point subtract double             FSUBD      R
floating-point multiply single             FMULS      R
floating-point multiply double             FMULD      R
floating-point divide single               FDIVS      R
floating-point divide double               FDIVD      R
load word to floating-point single         FLDS       I
store word to floating-point single        FSTS       I
load word to floating-point double         FLDD       I
store word to floating-point double        FSTD       I
floating-point compare single              FCMPS      R
floating-point compare double              FCMPD      R
FP Move Status (for conditional branch)    FMSTAT

FIGURE 3.22 The ARM instruction set. This book concentrates on the instructions like those in the ARM core (first) table.
covered in this chapter and Chapter 2. We call the first set of instructions in the figure the ARM core. The second set we call the ARM arithmetic core. For the application version of the ARM instruction set, floating-point instructions are becoming standard.
the instructions the ARM processor executes that are not found in Figure 3.22.
Figure 3.24 gives the popularity of the ARM instructions for SPEC2006 integer and
floating-point benchmarks. All instructions are listed that were responsible for at
least 0.3% of the instructions executed.
Remaining ARMv3-v6

Name                                     Mnemonic
exclusive or (Rn ⊕ Rm)                   EOR
bit clear (Rn & ∼Rm)                     BIC
arithmetic shift right (operation)       ASR
rotate right (operation)                 ROR
count leading zeros                      CLZ
reverse subtract                         RSB
add with carry                           ADC
subtract with carry                      SBC
reverse subtract with carry              RSC
load register signed byte                LDRSB
load register signed halfword            LDRSH
swap byte (atomic update)                SWPB
load multiple                            LDM
store multiple                           STM
compare negative (Rn + Rm)               CMN
test equal (Rn ⊕ Rm)                     TEQ
test (Rn & Rm)                           TST
multiply and add                         MLA
long multiply - 64 bit (S or Uns.)       SMULL
long multiply and add (S or Uns.)        SMLAL
load byte with user privilege mode       LDRBT
load word with user privilege mode       LDRT
store byte with user privilege mode      STRBT
store word with user privilege mode      STRT
coprocessor data operation               CDP
load coprocessor register                LDC
store coprocessor register               STC
move coprocessor register to register    MRC
move register to coprocessor register    MCR
move status register to register         MRS
move register to status register         MSR
breakpoint (cause exception)             BKPT
software interrupt (cause exception)     SWI

Remaining ARM VFP

Name                                         Mnemonic
FP move (S or D)                             FCPYF
FP convert integer to FP (S or D)            FSITOF
FP convert FP (S or D) to integer            FTOSIF
FP square root (S or D)                      FSQRTF
FP absolute value (S or D)                   FABSF
FP negate (S or D)                           FNEGF
FP convert (S or D)                          FCVTFF
FP compare w. exceptions (S or D)            FCMPEF
FP compare to zero w. exceptions (S or D)    FCMPEZF
move from SP FP to integer                   FMRS
move to SP FP from integer                   FMSR
move from High half DP FP to integer         FMRDH
move to High half DP FP from integer         FMDHR
move from Low half DP FP to integer          FMRDL
move to Low half DP FP from integer          FMDLR
load multiple FP                             LDMF
store multiple FP                            STMF
multiply and add                             FMACF
multiply and subtract                        FMSCF
negated multiply and add                     FNMACF
negated multiply and subtract                FNMSCF

FIGURE 3.23 Remaining ARM instructions. F means single (S) or double (D) precision floating-point instructions, and S means signed (S) and unsigned (U) versions. The underscore represents the letter to include to represent that datatype.
Core ARM                 Name    Integer    Fl. pt.
add                      ADD      14.2%      10.7%
subtract                 SUB       2.2%       0.6%
and                      AND       0.9%       0.3%
or                       ORR       5.0%       1.4%
not                      MVN       0.4%       0.2%
move                     MOV      10.4%       3.4%
load register            LDR      18.6%       5.8%
store register           STR       7.6%       2.0%
load register byte       LDRB      3.7%       0.1%
store register byte      STRB      0.6%       0.0%
conditional branch       Bcc      17.0%       4.0%
branch and link          BL        0.7%       0.2%
compare                  CMP      17.5%       3.5%

Arithmetic core + ARMv4          Name    Integer    Fl. pt.
FP add double                    add.d    0.0%      10.6%
FP subtract double               sub.d    0.0%       4.9%
FP multiply double               mul.d    0.0%      15.0%
FP divide double                 div.d    0.0%       0.2%
FP add single                    add.s    0.0%       1.5%
FP subtract single               sub.s    0.0%       1.8%
FP multiply single               mul.s    0.0%       2.4%
FP divide single                 div.s    0.0%       0.2%
load word to FP double           l.d      0.0%      17.5%
store word to FP double          s.d      0.0%       4.9%
load word to FP single           l.s      0.0%       4.2%
store word to FP single          s.s      0.0%       1.1%
floating-point compare double    c.x.d    0.0%       0.6%
multiply                         mul      0.0%       0.2%
load half                        lhu      1.3%       0.0%
store half                       sh       0.1%       0.0%

FIGURE 3.24 The frequency of the ARM instructions for SPEC2006 integer and floating point. All instructions that accounted for at least 1% of the instructions are included in the table. Extrapolated from measurements of MIPS programs.
Note that although programmers and compiler writers have a rich menu of options,
ARM core instructions dominate integer SPEC2006 execution, and the integer core
plus arithmetic core dominate SPEC2006 floating point, as the table below shows.
Instruction subset        Integer    Fl. pt.
ARM core                    98%        31%
ARM arithmetic core          2%        66%
Remaining ARM                0%         3%
For the rest of the book, we concentrate on the core instructions—the integer
instruction set excluding multiply and divide—to make the explanation of computer design easier. We use the MIPS instruction set core, although there are a few
differences between it and ARM, as it is a little simpler. However, the same techniques
apply to ARM. As you can see, the core includes the most popular instructions; be
assured that understanding a computer that runs the core will give you sufficient
background to understand even more ambitious computers.
Gresham’s Law (“Bad
money drives out
Good”) for computers
would say, “The Fast
drives out the Slow
even if the Fast is
wrong.”
W. Kahan, 1992
3.10 Historical Perspective and Further Reading
This section surveys the history of the floating point going back to von Neumann,
including the surprisingly controversial IEEE standards effort, plus the rationale
for the 80-bit stack architecture for floating point in the x86. See Section 3.10.
Never give in, never give in, never, never, never—in nothing, great or small, large or petty—never give in.
Winston Churchill, address at Harrow School, 1941

3.11 Exercises
Contributed by Matthew Farrens, UC Davis
Exercise 3.1
The book shows how to add and subtract binary and decimal numbers. However,
other numbering systems were also very popular when dealing with computers.
The octal (base 8) numbering system was one of these. The following table shows pairs
of octal numbers.
     A      B
a.   5323   2275
b.   0147   3457
3.1.1 [5] <3.2> What is the sum of A and B if they represent unsigned 12-bit
octal numbers? The result should be written in octal. Show your work.
3.1.2 [5] <3.2> What is the sum of A and B if they represent signed 12-bit octal
numbers stored in sign-magnitude format? The result should be written in octal.
Show your work.
3.1.3 [10] <3.2> Convert A into a decimal number, assuming it is unsigned.
Repeat assuming it is stored in sign-magnitude format. Show your work.
The following table also shows pairs of octal numbers.
     A      B
a.   2762   2032
b.   2646   1066
3.1.4 [5] <3.2> What is A − B if they represent unsigned 12-bit octal numbers?
The result should be written in octal. Show your work.
3.1.5 [5] <3.2> What is A − B if they represent signed 12-bit octal numbers stored
in sign-magnitude format? The result should be written in octal. Show your work.
3.1.6 [10] <3.2> Convert A into a binary number. What makes base 8 (octal) an
attractive numbering system for representing values in computers?
Exercise 3.2
Hexadecimal (base 16) is also a commonly used numbering system for representing values in computers. In fact, it has become much more popular than octal. The
following table shows pairs of hexadecimal numbers.
     A      B
a.   0D34   DD17
b.   BA1D   3617
3.2.1 [5] <3.2> What is the sum of A and B if they represent unsigned 16-bit
hexadecimal numbers? The result should be written in hexadecimal. Show your
work.
3.2.2 [5] <3.2> What is the sum of A and B if they represent signed 16-bit hexadecimal numbers stored in sign-magnitude format? The result should be written
in hexadecimal. Show your work.
3.2.3 [10] <3.2> Convert A into a decimal number, assuming it is unsigned.
Repeat assuming it is stored in sign-magnitude format. Show your work.
The following table also shows pairs of hexadecimal numbers.
     A      B
a.   BA7C   241A
b.   AADF   47BE
3.2.4 [5] <3.2> What is A − B if they represent unsigned 16-bit hexadecimal
numbers? The result should be written in hexadecimal. Show your work.
3.2.5 [5] <3.2> What is A − B if they represent signed 16-bit hexadecimal numbers
stored in sign-magnitude format? The result should be written in hexadecimal.
Show your work.
3.2.6 [10] <3.2> Convert A into a binary number. What makes base 16 (hexadecimal) an attractive numbering system for representing values in computers?
Exercise 3.3
Overflow occurs when a result is too large to be represented accurately given a
finite word size. Underflow occurs when a number is too small to be represented
correctly—a negative result when doing unsigned arithmetic, for example. (The
case when a positive result is generated by the addition of two negative integers is
also referred to as underflow by many, but in this textbook, that is considered an
overflow). The following table shows pairs of decimal numbers.
     A     B
a.   69    90
b.   102   44
3.3.1 [5] <3.2> Assume A and B are unsigned 8-bit decimal integers. Calculate
A − B. Is there overflow, underflow, or neither?
3.3.2 [5] <3.2> Assume A and B are signed 8-bit decimal integers stored in sign-magnitude format. Calculate A + B. Is there overflow, underflow, or neither?
3.3.3 [5] <3.2> Assume A and B are signed 8-bit decimal integers stored in sign-magnitude format. Calculate A − B. Is there overflow, underflow, or neither?
The following table also shows pairs of decimal numbers.
     A     B
a.   200   103
b.   247   237
3.3.4 [10] <3.2> Assume A and B are signed 8-bit decimal integers stored in
two’s-complement format. Calculate A + B using saturating arithmetic. The result
should be written in decimal. Show your work.
3.3.5 [10] <3.2> Assume A and B are signed 8-bit decimal integers stored in
two’s-complement format. Calculate A − B using saturating arithmetic. The result
should be written in decimal. Show your work.
3.3.6 [10] <3.2> Assume A and B are unsigned 8-bit integers. Calculate A + B
using saturating arithmetic. The result should be written in decimal. Show your
work.
Exercise 3.4
Let’s look in more detail at multiplication. We will use the numbers in the following
table.
     A     B
a.   50    23
b.   66    04
3.4.1 [20] <3.3> Using a table similar to that shown in Figure 3.7, calculate the
product of the octal unsigned 6-bit integers A and B using the hardware described
in Figure 3.4. You should show the contents of each register on each step.
3.4.2 [20] <3.3> Using a table similar to that shown in Figure 3.7, calculate the
product of the hexadecimal unsigned 8-bit integers A and B using the hardware
described in Figure 3.6. You should show the contents of each register on each step.
3.4.3 [60] <3.3> Write an ARM assembly language program to calculate the
product of unsigned integers A and B, using the approach described in Figure 3.4.
The following table shows pairs of octal numbers.
     A     B
a.   54    67
b.   30    07
3.4.4 [30] <3.3> When multiplying signed numbers, one way to get the correct
answer is to convert the multiplier and multiplicand to positive numbers, save the
original signs, and then adjust the final value accordingly. Using a table similar
to that shown in Figure 3.7, calculate the product of A and B using the hardware
described in Figure 3.4. You should show the contents of each register on each step,
and include the step necessary to produce the correctly signed result. Assume A and
B are stored in 6-bit sign-magnitude format.
3.4.5 [30] <3.3> When shifting a register one bit to the right, there are several
ways to decide what the new entering bit should be. It can always be a 0, or always
a 1, or the incoming bit could be the one that is being pushed out of the right side
(turning a shift into a rotate), or the value that is already in the leftmost bit can
simply be retained (called an arithmetic shift right, because it preserves the sign of
the number that is being shifted.) Using a table similar to that shown in Figure 3.7,
calculate the product of the 6-bit two’s-complement numbers A and B using the
hardware described in Figure 3.6. The right shifts should be done using an arithmetic shift right. Note that the algorithm described in the text will need to be modified
slightly to make this work—in particular, things must be done differently if the
multiplier is negative. You can find details by searching the Web. Show the contents
of each register on each step.
3.4.6 [60] <3.3> Write an ARM assembly language program to calculate the
product of the signed integers A and B. State if you are using the approach given
in 3.4.4 or 3.4.5.
Exercise 3.5
For many reasons, we would like to design multipliers that require less time. Many
different approaches have been taken to accomplish this goal. In the following
table, A represents the bit width of an integer, and B represents the number of time
units (tu) taken to perform a step of an operation.
     A     B
a.   4     3 tu
b.   32    7 tu
3.5.1 [10] <3.3> Calculate the time necessary to perform a multiply using the
approach given in Figures 3.4 and 3.5 if an integer is A bits wide and each step
of the operation takes B time units. Assume that in step 1a an addition is always
performed—either the multiplicand will be added, or a zero will be. Also assume
that the registers have already been initialized (you are just counting how long it
takes to do the multiplication loop itself). If this is being done in hardware, the
shifts of the multiplicand and multiplier can be done simultaneously. If this is being
done in software, they will have to be done one after the other. Solve for each case.
3.5.2 [10] <3.3> Calculate the time necessary to perform a multiply using the
approach described in the text (31 adders stacked vertically) if an integer is A bits
wide and an adder takes B time units.
3.5.3 [20] <3.3> Calculate the time necessary to perform a multiply using the
approach given in Figure 3.8, if an integer is A bits wide and an adder takes B time
units.
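
One possible way to organize the arithmetic for 3.5.1 through 3.5.3 is sketched below in C. The step counts encode one reading of the stated assumptions (an add plus a combined shift per bit in hardware, separate shifts in software, A − 1 stacked adders in series, and a tree whose depth is the number of doublings needed to reach A); they are an illustration, not the official answers, and the variables are simply the row a values.

    #include <stdio.h>

    int main(void) {
        int A = 4;        /* operand width in bits (row a); row b uses A = 32 */
        int B = 3;        /* time units per step or per adder (row a); row b uses B = 7 */

        int hw_sequential = A * 2 * B;       /* add + simultaneous shifts, once per bit */
        int sw_sequential = A * 3 * B;       /* add + two separate shifts, once per bit */
        int adder_stack   = (A - 1) * B;     /* A-1 adders stacked vertically, in series */

        int levels = 0;                      /* smallest L with 2^L >= A (tree depth) */
        while ((1 << levels) < A)
            levels++;
        int adder_tree = levels * B;

        printf("sequential (hardware): %d time units\n", hw_sequential);
        printf("sequential (software): %d time units\n", sw_sequential);
        printf("stacked adders       : %d time units\n", adder_stack);
        printf("adder tree           : %d time units\n", adder_tree);
        return 0;
    }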
Exercise 3.6
In this exercise we will look at a couple of other ways to improve the performance
of multiplication, based primarily on doing more shifts and fewer arithmetic operations. The following table shows pairs of hexadecimal numbers.
     A     B
a.   24    c9
b.   41    18
3.6.1 [20] <3.3> As discussed in the text, one possible performance enhancement
is to do a shift and add instead of an actual multiplication. Since 9 ⋅ 6, for example,
can be written (2 ⋅ 2 ⋅ 2 + 1) ⋅ 6, we can calculate 9 ⋅ 6 by shifting 6 to the left three
times and then adding 6 to that result. Show the best way to calculate A ⋅ B using
shifts and adds/subtracts. Assume that A and B are 8-bit unsigned integers.
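
The decomposition in 3.6.1 amounts to scanning the bits of the multiplier and adding a shifted copy of the multiplicand for each bit that is set. The following C sketch (an illustration, not a required answer format) shows that loop for unsigned 8-bit operands.

    #include <stdio.h>
    #include <stdint.h>

    /* Multiply two unsigned 8-bit values using only shifts and adds:
     * for each set bit i of the multiplier b, add (a << i) to the product. */
    unsigned shift_add_multiply(uint8_t a, uint8_t b) {
        unsigned product = 0;
        for (int i = 0; i < 8; i++)
            if ((b >> i) & 1)
                product += (unsigned)a << i;
        return product;
    }

    int main(void) {
        printf("%u\n", shift_add_multiply(9, 6));       /* 54 = (6 << 3) + 6, as in the text */
        printf("%u\n", shift_add_multiply(0x24, 0xc9)); /* the row a values */
        return 0;
    }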
3.6.2 [20] <3.3> Show the best way to calculate A ⋅ B using shift and adds, if
A and B are 8-bit signed integers stored in sign-magnitude format.
3.6.3 [60] <3.3> Write an ARM assembly language program that performs a
multiplication on signed integers using shift and adds, as described in 3.6.1.
The following table shows further pairs of hexadecimal numbers.
     A     B
a.   42    36
b.   9F    8E
3.6.4 [30] <3.3> Booth’s algorithm is another approach to reducing the number
of arithmetic operations necessary to perform a multiplication. This algorithm has
been around for years, and details about how it works are available on the Web.
Basically, it assumes that a shift takes less time than an add or subtract, and uses
this fact to reduce the number of arithmetic operations necessary to perform a
multiply. It works by identifying runs of 1s and 0s, and performing shifts during
the runs. Find a description of the algorithm and explain in detail how it works.
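
As a point of reference while researching 3.6.4, the following C sketch implements the usual radix-2 form of Booth's algorithm for 8-bit two's-complement operands: it recodes runs of 1s so that only the start and end of each run costs an add or subtract. It models the recoding only, not the register-level shifting of Figure 3.6.

    #include <stdio.h>
    #include <stdint.h>

    /* Radix-2 Booth multiplication of two 8-bit two's-complement values.
     * Examine bit pairs (b_i, b_{i-1}) of the multiplier: 01 ends a run of 1s
     * (add the shifted multiplicand), 10 starts a run (subtract it),
     * 00 and 11 inside a run only shift. */
    int16_t booth_multiply(int8_t multiplicand, int8_t multiplier) {
        int32_t acc = 0;
        int32_t m = multiplicand;              /* shifted left one place per step */
        int prev = 0;                          /* implicit bit to the right of bit 0 */
        for (int i = 0; i < 8; i++) {
            int cur = (multiplier >> i) & 1;
            if (cur == 0 && prev == 1) acc += m;   /* end of a run of 1s */
            if (cur == 1 && prev == 0) acc -= m;   /* start of a run of 1s */
            m <<= 1;
            prev = cur;
        }
        return (int16_t)acc;
    }

    int main(void) {
        printf("%d\n", booth_multiply(0x42, 0x36));                 /* row a: both positive */
        printf("%d\n", booth_multiply((int8_t)0x9F, (int8_t)0x8E)); /* row b: both negative */
        return 0;
    }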
3.6.5 [30] <3.3> Show the step-by-step result of multiplying A and B, using
Booth’s algorithm. Assume A and B are 8-bit two’s-complement integers, stored in
hexadecimal format.
3.6.6 [60] <3.3> Write an ARM assembly language program to perform the multiplication of A and B using Booth’s algorithm.
Exercise 3.7
Let’s look in more detail at division. We will use the octal numbers in the following
table.
     A     B
a.   50    23
b.   25    44
3.7.1 [20] <3.4> Using a table similar to that shown in Figure 3.11, calculate A
divided by B using the hardware described in Figure 3.9. You should show the contents of each register on each step. Assume A and B are unsigned 6-bit integers.
3.7.2 [30] <3.4> Using a table similar to that shown in Figure 3.11, calculate A
divided by B using the hardware described in Figure 3.12. You should show the
contents of each register on each step. Assume A and B are unsigned 6-bit integers.
This algorithm requires a slightly different approach than that shown in Figure 3.10.
You will want to think hard about this, do an experiment or two, or else go to the
Web to figure out how to make this work correctly. (Hint: one possible solution
involves using the fact that Figure 3.12 implies the remainder register can be shifted
either direction).
3.7.3 [60] <3.4> Write an ARM assembly language program to calculate A
divided by B, using the approach described in Figure 3.9. Assume A and B are
unsigned 6-bit integers.
The following table shows further pairs of octal numbers.
     A     B
a.   55    24
b.   36    51
3.7.4 [30] <3.4> Using a table similar to that shown in Figure 3.11, calculate A
divided by B using the hardware described in Figure 3.9. You should show the contents of each register on each step. Assume A and B are 6-bit signed integers in
sign-magnitude format. Be sure to include how you are calculating the signs of the
quotient and remainder.
3.7.5 [30] <3.4> Using a table similar to that shown in Figure 3.11, calculate
A divided by B using the hardware described in Figure 3.12. You should show the
contents of each register on each step. Assume A and B are 6-bit signed integers in
sign-magnitude format. Be sure to include how you are calculating the signs of the
quotient and remainder.
3.7.6 [60] <3.4> Write an ARM assembly language program to calculate A divided
by B, using the approach described in Figure 3.12. Assume A and B are signed
integers.
Exercise 3.8
Figure 3.10 describes a restoring division algorithm, because when subtracting the
divisor from the remainder produces a negative result, the divisor is added back to
the remainder (thus restoring the value). However, there are other algorithms that
have been developed that eliminate the extra addition. Many references to these
algorithms are easily found on the Web. We will explore these algorithms using the
pairs of octal numbers in the following table.
     A     B
a.   75    12
b.   52    37
3.8.1 [30] <3.4> Using a table similar to that shown in Figure 3.11, calculate A
divided by B using nonrestoring division. You should show the contents of each
register on each step. Assume A and B are 6-bit unsigned integers.
3.8.2 [60] <3.4> Write an ARM assembly language program to calculate A
divided by B using nonrestoring division. Assume A and B are 6-bit signed
(two’s-complement) integers.
3.8.3 [60] <3.4> How does the performance of restoring and non-restoring
division compare? Demonstrate by showing the number of steps necessary to
calculate A divided by B using each method. Assume A and B are 6-bit signed
(sign-magnitude) integers. Writing a program to perform the restoring and nonrestoring divisions is acceptable.
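
For comparison while working through 3.8.1 to 3.8.3, this C sketch shows a bit-serial restoring divider for 6-bit unsigned values: when the trial subtraction goes negative, the divisor is added back (the "restore") and a 0 enters the quotient. It models the algorithm, not the register layout of Figure 3.9, and the octal constants are simply the row a values.

    #include <stdio.h>
    #include <stdint.h>

    /* Restoring division of 6-bit unsigned values, one quotient bit per step. */
    void restoring_divide(uint8_t dividend, uint8_t divisor,
                          uint8_t *quotient, uint8_t *remainder) {
        int rem = 0;
        int q = 0;
        for (int i = 5; i >= 0; i--) {           /* 6 bits, most significant first */
            rem = (rem << 1) | ((dividend >> i) & 1);
            rem -= divisor;                      /* trial subtraction */
            if (rem < 0) {                       /* went negative: restore, quotient bit 0 */
                rem += divisor;
                q = q << 1;
            } else {                             /* stayed non-negative: quotient bit 1 */
                q = (q << 1) | 1;
            }
        }
        *quotient = (uint8_t)q;
        *remainder = (uint8_t)rem;
    }

    int main(void) {
        uint8_t q, r;
        restoring_divide(075, 012, &q, &r);      /* octal values from row a */
        printf("quotient = %o (octal), remainder = %o (octal)\n", q, r);
        return 0;
    }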
The following table shows further pairs of octal numbers.
     A     B
a.   17    14
b.   70    23
3.8.4 [30] <3.4> Using a table similar to that shown in Figure 3.11, calculate A
divided by B using nonperforming division. You should show the contents of each
register on each step. Assume A and B are 6-bit unsigned integers.
3.8.5 [60] <3.4> Write an ARM assembly language program to calculate A divided
by B using nonperforming division. Assume A and B are 6-bit two’s complement
signed integers.
3.8.6 [60] <3.4> How does the performance of non-restoring and nonperforming
division compare? Demonstrate by showing the number of steps necessary to calculate A divided by B using each method. Assume A and B are signed 6-bit integers,
stored in sign-magnitude format. Writing a program to perform the nonperforming and non-restoring divisions is acceptable.
Exercise 3.9
Division is so time-consuming and difficult that the CRAY T3E Fortran Optimization guide states, “The best strategy for division is to avoid it whenever possible.”
This exercise looks at the following different strategies for performing divisions.
a.   restoration division
b.   SRT division
3.9.1 [30] <3.4> Describe the algorithm in detail.
3.9.2 [60] <3.4> Use a flow chart (or a high-level code snippet) to describe how
the algorithm works.
3.9.3 [60] <3.4> Write an ARM assembly language program to perform a division
using the algorithm.
Exercise 3.10
In a Von Neumann architecture, groups of bits have no intrinsic meanings by
themselves. What a bit pattern represents depends entirely on how it is used. The
following table shows bit patterns expressed in hexadecimal notation.
a.   0x24A60004
b.   0xAFBF0000
3.10.1 [5] <3.5> What decimal number does the bit pattern represent if it is a
two’s-complement integer? An unsigned integer?
3.10.2 [10] <3.5> If this bit pattern is placed into the Instruction Register, what
ARM instruction will be executed?
3.10.3 [10] <3.5> What decimal number does the bit pattern represent if it is a
floating point number? Use the IEEE 754 standard.
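
One way to see that a bit pattern has no intrinsic meaning is to reinterpret the same 32 bits several ways, as in the C sketch below. It assumes the host machine uses 32-bit two's-complement integers and IEEE 754 single precision for float, and it is a checking aid rather than part of the exercise.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        uint32_t pattern = 0x24A60004;       /* pattern from row a; row b is 0xAFBF0000 */

        uint32_t as_unsigned = pattern;
        int32_t  as_signed;                  /* two's-complement view */
        float    as_float;                   /* IEEE 754 single precision view */

        memcpy(&as_signed, &pattern, sizeof as_signed);
        memcpy(&as_float, &pattern, sizeof as_float);

        printf("unsigned         = %u\n", as_unsigned);
        printf("two's complement = %d\n", as_signed);
        printf("IEEE 754 float   = %g\n", as_float);
        return 0;
    }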
The following table shows decimal numbers.
a.   –1609.5
b.   –938.8125
3.10.4 [10] <3.5> Write down the binary representation of the decimal number,
assuming the IEEE 754 single precision format.
3.10.5 [10] <3.5> Write down the binary representation of the decimal number,
assuming the IEEE 754 double precision format.
3.10.6 [10] <3.5> Write down the binary representation of the decimal number
assuming it was stored using the single precision IBM format (base 16, instead of
base 2, with 7 bits of exponent).
Exercise 3.11
In the IEEE 754 floating point standard the exponent is stored in “bias” (also
known as “Excess-N”) format. This approach was selected because we want an all
zero pattern to be as close to zero as possible. Because of the use of a hidden 1, if
we were to represent the exponent in two’s-complement format an all-zero pattern
would actually be the number 1! (Remember, anything raised to the zeroth power
is 1, so 1.0 x 2^0 = 1.) There are many other aspects of the IEEE 754 standard that exist
in order to help hardware floating point units work more quickly. However, in
many older machines floating point calculations were handled in software, and
therefore other formats were used. The following table shows decimal numbers.
a.   5.00736125 x 10^5
b.   –2.691650390625 x 10^–2
3.11.1 [20] <3.5> Write down the binary bit pattern assuming a format similar
to that employed by the DEC PDP-8 (left most 12 bits are the exponent stored as a
two’s-complement number, and the rightmost 24 bits are the mantissa stored as a
two’s-complement number.) No hidden 1 is used. Comment on how the range and
accuracy of this 36-bit pattern compares to the single and double precision IEEE
754 standards.
3.11.2 [20] <3.5> NVIDIA has a “half ” format, which is similar to IEEE 754
except that it is only 16 bits wide. The left-most bit is still the sign bit, the exponent
is 5 bits wide and stored in excess-16 format, and the mantissa is 10 bits long. A
hidden 1 is assumed. Write down the bit pattern assuming this format. Comment
on how the range and accuracy of this 16-bit pattern compares to the single precision IEEE 754 standard.
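
A decoder for the 16-bit format as described in 3.11.2 (1 sign bit, a 5-bit exponent stored in excess-16, and a 10-bit mantissa with a hidden 1) might look like the C sketch below. Zero, denormals, infinities, and NaNs are ignored in this sketch, and the test pattern 0x4180 is made up purely to exercise the function.

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    /* Decode a 16-bit value in the format of 3.11.2 into a double. */
    double decode_half16(uint16_t bits) {
        int sign     = (bits >> 15) & 0x1;
        int exponent = ((bits >> 10) & 0x1F) - 16;      /* remove the excess-16 bias */
        int mantissa = bits & 0x3FF;
        double value = (1.0 + mantissa / 1024.0) * pow(2.0, exponent);
        return sign ? -value : value;
    }

    int main(void) {
        printf("%g\n", decode_half16(0x4180));          /* hypothetical test pattern */
        return 0;
    }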
3.11.3 [20] <3.5> The Hewlett-Packard 2114, 2115, and 2116 used a format with
the left-most 16 bits being the mantissa stored in two’s-complement format, followed by another 16-bit field which had the leftmost 8 bits an extension of the
mantissa (making the mantissa 24 bits long), and the right-most 8 bits representing
the exponent. However, in an interesting twist, the exponent was stored in sign-magnitude format with the sign bit on the far right! Write down the bit pattern
assuming this format. No hidden 1 is used. Comment on how the range and accuracy of this 32-bit pattern compares to the single precision IEEE 754 standard.
The following table shows pairs of decimal numbers.
     A                    B
a.   –1278 x 10^3         –3.90625 x 10^–1
b.   2.3109375 x 10^1     6.391601562 x 10^–1
3.11.4 [20] <3.5> Calculate the sum of A and B by hand, assuming A and B are
stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in the
text). Assume 1 guard, 1 round bit and 1 sticky bit, and round to the nearest even.
Show all the steps.
3.11.5 [60] <3.5> Write an ARM assembly language program to calculate the
sum of A and B, assuming they are stored in the 16-bit NVIDIA format described
in 3.11.2 (and also described in the text). Assume 1 guard, 1 round and 1 sticky bit,
and round to the nearest even.
3.11.6 [60] <3.5> Write an ARM assembly language program to calculate the
sum of A and B, assuming they are stored using the format described in 3.11.1.
Now modify the program to calculate the sum assuming the format described
in 3.11.3. Which format is easier for a programmer to deal with? How do they
each compare to the IEEE 754 format? (Do not worry about sticky bits for this
question.)
Exercise 3.12
Floating-point multiplication is even more complicated and challenging than
floating-point addition, and both pale in comparison to floating-point division.
     A                     B
a.   5.66015625 x 10^0     8.59375 x 10^0
b.   6.18 x 10^2           5.796875 x 10^1
3.12.1 [30] <3.5> Calculate the product of A and B by hand, assuming A and B
are stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in
the text). Assume 1 guard, 1 round bit and 1 sticky bit, and round to the nearest
even. Show all the steps; however, as is done in the example in the text, you can
do the multiplication in human-readable format instead of using the techniques
described in 3.4 through 3.6. Indicate if there is overflow or underflow. Write your
answer as a 16-bit pattern, and also as a decimal number. How accurate is your
result? How does it compare to the number you get if you do the multiplication on
a calculator?
3.12.2 [60] <3.5> Write an ARM assembly language program to calculate the
product of A and B, assuming they are stored in IEEE 754 format. Indicate if there
is overflow or underflow. (Remember, IEEE 754 assumes 1 guard, 1 round and 1
sticky bit, and rounds to the nearest even.)
3.12.3 [60] <3.5> Write an ARM assembly language program to calculate
the product of A and B, assuming they are stored using the format described
in 3.11.1. Now modify the program to calculate the product assuming the format described
How do they each compare to the IEEE 754 format? (Do not worry about sticky
bits for this question.)
The following table shows further pairs of decimal numbers.
     A                      B
a.   3.264 x 10^3           6.52 x 10^2
b.   –2.27734375 x 10^0     1.154375 x 10^2
3.12.4 [30] <3.5> Calculate by hand A divided by B. Show all the steps necessary to achieve your answer. Assume there is a guard, round, and sticky bit, and
use them if necessary. Write the final answer in both 16-bit floating-point format
and in decimal and compare the decimal result to that which you get if you use a
calculator.
The Livermore Loops are a set of floating-point intensive kernels taken from
scientific programs run at Lawrence Livermore Laboratory. The following table
identifies individual kernels from the set.
a.   Livermore Loop 1
b.   Livermore Loop 7
3.12.5 [60] <3.5> Write the loop in ARM assembly language.
3.12.6 [60] <3.5> Describe in detail one technique for performing floating-point
division in a digital computer. Be sure to include references to the sources you
used.
Exercise 3.13
Operations performed on fixed-point integers behave the way one expects—
the commutative, associative, and distributive laws all hold. This is not always the
case when working with floating-point numbers, however. Let’s first look at the
associative law. The following table shows sets of decimal numbers.
     A                    B                    C
a.   –1.6360 x 10^4       1.6360 x 10^4        1.0 x 10^0
b.   2.865625 x 10^1      4.140625 x 10^–1     1.2140625 x 10^1
3.13.1 [20] <3.2, 3.5, 3.6> Calculate (A + B) + C by hand, assuming A, B, and C
are stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in
the text). Assume 1 guard, 1 round bit and 1 sticky bit, and round to the nearest
even. Show all the steps, and write your answer both in 16-bit floating point format
and in decimal.
3.13.2 [20] <3.2, 3.5, 3.6> Calculate A + (B + C) by hand, assuming A, B, and C
are stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in
the text). Assume 1 guard, 1 round bit and 1 sticky bit, and round to the nearest
even. Show all the steps, and write your answer both in 16-bit floating point format
and in decimal.
3.13.3 [10] <3.2, 3.5, 3.6> Based on your answers to 3.13.1 and 3.13.2, does
(A + B) + C = A + (B + C)?
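
The effect being probed in 3.13.1 through 3.13.3 is easy to reproduce on any machine with IEEE 754 single precision; the C sketch below uses illustrative values chosen so the difference shows up in 32-bit float arithmetic (they are not the table's values, which are meant for the 16-bit format).

    #include <stdio.h>

    int main(void) {
        float a = -1.0e8f;
        float b =  1.0e8f;
        float c =  1.0f;

        printf("(a + b) + c = %g\n", (a + b) + c);   /* 1: the cancellation happens first */
        printf("a + (b + c) = %g\n", a + (b + c));   /* 0: the 1.0 is lost when added to 1.0e8 */
        return 0;
    }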
The following table shows further sets of decimal numbers.
     A                     B                   C
a.   4.8828125 x 10^–4     1.768 x 10^3        2.50125 x 10^2
b.   4.721875 x 10^1       2.809375 x 10^1     3.575 x 10^1
3.13.4 [30] <3.3, 3.5, 3.6> Calculate (A ⋅ B) ⋅ C by hand, assuming A, B, and C are
stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in the
text). Assume 1 guard, 1 round bit and 1 sticky bit, and round to the nearest even.
Show all the steps, and write your answer both in 16-bit floating point format and
in decimal.
3.13.5 [30] <3.3, 3.5, 3.6> Calculate A ⋅ (B ⋅ C) by hand, assuming A, B, and C are
stored in the 16-bit NVIDIA format described in 3.11.2 (and also described in the
text). Assume 1 guard, 1 round bit and 1 sticky bit, and round to the nearest even.
Show all the steps, and write your answer both in 16-bit floating point format and
in decimal.
3.13.6 [10] <3.3, 3.5, 3.6> Based on your answers to 3.13.4 and 3.13.5, does
(A ⋅ B) ⋅ C = A ⋅ (B ⋅ C)?
Exercise 3.14
The associative law is not the only one that does not always hold in dealing with
floating point numbers. There are other oddities that occur as well. The following
table shows sets of decimal numbers.
     A                      B                  C
a.   1.5234375 x 10^–1      9.96875 x 10^1     2.0703125 x 10^–1
b.   –2.7890625 x 10^1      –8.088 x 10^3      1.0216 x 10^4
3.14.1 [30] <3.2, 3.3, 3.5, 3.6> Calculate A ⋅ (B + C) by hand, assuming A, B, and
C are stored in the 16-bit NVIDIA format described in 3.11.2 (and also described
in the text). Assume 1 guard, 1 round bit and 1 sticky bit, and round to the nearest
even. Show all the steps, and write your answer both in 16-bit floating point format
and in decimal.
3.14.2 [30] <3.2, 3.3, 3.5, 3.6> Calculate (A × B) + (A × C) by hand, assuming
A, B, and C are stored in the 16-bit NVIDIA format described in 3.11.2 (and also
described in the text). Assume one guard, one round bit and one sticky bit, and
round to the nearest even. Show all the steps, and write your answer both in 16-bit
floating-point format and in decimal.
3.14.3 [10] <3.2, 3.3, 3.5, 3.6> Based on your answers to 3.14.1 and 3.14.2, does
(A ⋅ B) + (A ⋅ C) = A ⋅ (B + C)?
The following table shows pairs, each consisting of a fraction and an integer.
     A       B
a.   1/3     3
b.   –1/7    7
3.14.4 [10] <3.5> Using the IEEE 754 floating point format, write down the bit
pattern that would represent A. Can you represent A exactly?
3.14.5 [10] <3.2, 3.3, 3.5, 3.6> What do you get if you add A to itself B times?
What is A × B? Are they the same? What should they be?
3.14.6 [60] <3.2, 3.3, 3.4, 3.5, 3.6> What do you get if you take the square root
of B and then multiply that value by itself? What should you get? Do for both
single and double precision floating point numbers. (Write a program to do these
calculations).
Exercise 3.15
Binary numbers are used in the mantissa field, but they do not have to be. IBM used
base 16 numbers, for example, in some of their floating point formats. There are
other approaches that are possible as well, each with their own particular advantages and disadvantages. The following table shows fractions to be represented in
various floating point formats.
a.   1/2
b.   1/9
3.15.1 [10] <3.5, 3.6> Write down the bit pattern in the mantissa assuming a
floating point format that uses binary numbers in the mantissa (essentially what
you have been doing in this chapter). Assume there are 24 bits, and you do not need
to normalize. Is this representation exact?
3.15.2 [10] <3.5, 3.6> Write down the bit pattern in the mantissa assuming a
floating-point format that uses Binary Coded Decimal (base 10) numbers in the
mantissa instead of base 2. Assume there are 24 bits, and you do not need to normalize. Is this representation exact?
3.15.3 [10] <3.5, 3.6> Write down the bit pattern assuming that we are using
base 15 numbers in the mantissa instead of base 2. (Base 16 numbers use the symbols 0–9 and A–F. Base 15 numbers would use 0–9 and A–E.) Assume there are 24
bits, and you do not need to normalize. Is this representation exact?
3.15.4 [20] <3.5, 3.6> Write down the bit pattern assuming that we are using
base 30 numbers in the mantissa instead of base 2. (Base 16 numbers use the symbols 0–9 and A–F. Base 30 numbers would use 0–9 and A–T.) Assume there are 20
bits, and you do not need to normalize. Is this representation exact? Do you see any
advantage to using this approach?
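
All four representations in 3.15.1 through 3.15.4 come from the same repeated-multiplication procedure: multiply the fraction by the base, peel off the integer part as the next digit, and continue with what is left. The C sketch below illustrates it; using double for the running fraction is an approximation, but it is accurate well past the digit counts asked for here.

    #include <stdio.h>

    /* Print the first n fractional digits of x (0 <= x < 1) in the given base (<= 30). */
    void fraction_digits(double x, int base, int n) {
        const char *symbols = "0123456789ABCDEFGHIJKLMNOPQRST";
        printf("0.");
        for (int i = 0; i < n; i++) {
            x *= base;
            int digit = (int)x;       /* integer part is the next digit */
            putchar(symbols[digit]);
            x -= digit;               /* keep only the fractional part */
        }
        printf(" (base %d)\n", base);
    }

    int main(void) {
        fraction_digits(1.0 / 9.0, 2, 24);    /* 24 binary mantissa bits of 1/9 */
        fraction_digits(1.0 / 9.0, 10, 8);    /* the decimal-digit view of 3.15.2 */
        fraction_digits(1.0 / 9.0, 15, 6);    /* base-15 digits, as in 3.15.3 */
        return 0;
    }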
Answers to Check Yourself
§3.2, page 219: 3.
§3.5, page 257: 3.
APPENDIX B1

ARM and Thumb Assembler Instructions

Andrew Sloss, ARM; Dominic Symes, ARM; Chris Wright, Ultimodule Inc.
B1.1  Using This Appendix  B1-3
B1.2  Syntax  B1-4
B1.3  Alphabetical List of ARM and Thumb Instructions  B1-8
B1.4  ARM Assembler Quick Reference  B1-49
B1.5  GNU Assembler Quick Reference  B1-60
This appendix lists the ARM and Thumb instructions available up to, and
including, ARM architecture ARMv6, which was just released at the time of
writing. We list the operations in alphabetical order for easy reference. Sections
B1.4 and B1.5 give quick reference guides to the ARM and GNU assemblers
armasm and gas.
We have designed this appendix for practical programming use, both for writing
assembly code and for interpreting disassembly output. It is not intended as a
definitive architectural ARM reference. In particular, we do not list the exhaustive
details of each instruction bitmap encoding and behavior. For this level of detail, see
the ARM Architecture Reference Manual, edited by David Seal, published by Addison
Wesley. We do give a summary of ARM and Thumb instruction set encodings in
Appendix B2.
B1.1 Using This Appendix

Each appendix entry begins by enumerating the available instruction formats for the given instruction class. For example, the first entry for the instruction class ADD reads

1. ADD<cond>{S} Rd, Rn, #<rotated_immed>    ARMv1
The fields <cond> and <rotated_immed> are two of a number of standard fields
described in Section B1.2. Rd and Rn denote ARM registers. The instruction is only
executed if the condition <cond> is passed. Each entry also describes the action of
the instruction if it is executed.
The {S} denotes that you may apply an optional S suffix to the instruction.
Finally, the right-hand column specifies that the instruction is available from the
listed ARM architecture version onwards. Table B1.1 shows the entries possible
for this column.
TABLE B1.1  Instruction types.

Type       Meaning
ARMvX      32-bit ARM instruction first appearing in ARM architecture version X
THUMBvX    16-bit Thumb instruction first appearing in Thumb architecture version X
MACRO      Assembler pseudoinstruction
Note that there is no direct correlation between the Thumb architecture
number and the ARM architecture number. The THUMBv1 architecture is used in
ARMv4T processors; the THUMBv2 architecture, in ARMv5T processors; and the
THUMBv3 architecture, in ARMv6 processors.
Each instruction definition is followed by a notes section describing restrictions
on the use of the instruction. When we make a statement such as “Rd must not be
pc,” we mean that the description of the function only applies when this condition
holds. If you break the condition, then the instruction may be unpredictable or
have predictable effects that we haven’t had space to describe here. Well-written
programs should not need to break these conditions.
B1.2 Syntax
We use the following syntax and abbreviations throughout this appendix.
Optional Expressions
■
{<expr>} is an optional expression. For example, LDR{B} is shorthand for
LDR or LDRB.
■
{<exp1>|<exp2>|...|<expN>}, including at least one “|’’ divider, is a list
of expressions. One of the listed expressions must appear. For example
LDR{B|H} is shorthand for LDRB or LDRH. It does not include LDR. We would
represent these three possibilities by LDR{|B|H}.
Register Names
■
Rd, Rn, Rm, Rs, RdHi, RdLo represent ARM registers in the range r0 to r15.
■
Ld, Ln, Lm, Ls represent low-numbered ARM registers in the range r0 to r7.
■
Hd, Hn, Hm, Hs represent high-numbered ARM registers in the range r8 to
r15.
■
Cd, Cn, Cm represent coprocessor registers in the range c0 to c15.
■
sp, lr, pc are names for r13, r14, r15, respectively.
■
Rn[a] denotes bit a of register Rn. Therefore Rn[a] = (Rn » a) & 1.
■
Rn[a:b] denotes the (a − b + 1)-bit value stored in bits a to b of Rn inclusive.
■
RdHi:RdLo represents the 64-bit value with high 32 bits RdHi and low 32 bits RdLo.
Values Stored as Immediates
■ <immedN>
is any unsigned N-bit immediate. For example, <immed8>
represents any integer in the range 0 to 255. <immed5>*4 represents any
integer in the list 0, 4, 8, ..., 124.
■ <addressN>
is an address or label stored as a relative offset. The address must be in the range pc − 2^N ≤ address < pc + 2^N. Here, pc is the address of the instruction plus eight for ARM state, or the address of the instruction plus four for Thumb state. The address must be four-byte aligned if the destination is an ARM instruction or two-byte aligned if the destination is a Thumb instruction.
■ <A-B>
represents any integer in the range A to B inclusive.
■ <rotated_immed> is any 32-bit immediate that can be represented as an eight-
bit unsigned value rotated right (or left) by an even number of bit positions.
In other words, <rotated_immed> = <immed8> ROR (2*<immed4>). For
example 0xff, 0x104, 0xe0000005, and 0x0bc00000 are possible values
for <rotated_immed>. However, 0x101 and 0x102 are not. When you use a
rotated immediate, <shifter_C> is set according to Table B1.3 (discussed in
Section Shift Operations). A nonzero rotate may cause a change in the carry
flag. For this reason, you can also specify the rotation explicitly, using the
assembly syntax <immed8>, 2*<immed4>.
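
A quick way to check whether a constant fits this encoding is to try every even rotation and see whether the result fits in eight bits, as in the following C sketch (an illustration only; assemblers perform an equivalent test internally).

    #include <stdio.h>
    #include <stdint.h>

    /* Return nonzero if value can be encoded as an ARM rotated immediate:
     * an 8-bit constant rotated right by an even amount (0, 2, ..., 30). */
    int is_rotated_immed(uint32_t value) {
        for (int rot = 0; rot < 32; rot += 2) {
            /* Undo a rotate-right by rot: rotate left and check it fits in 8 bits. */
            uint32_t undone = (rot == 0) ? value
                            : (value << rot) | (value >> (32 - rot));
            if (undone <= 0xFF) return 1;
        }
        return 0;
    }

    int main(void) {
        printf("%d\n", is_rotated_immed(0xFF));         /* 1: examples from the text */
        printf("%d\n", is_rotated_immed(0x104));        /* 1 */
        printf("%d\n", is_rotated_immed(0xE0000005));   /* 1 */
        printf("%d\n", is_rotated_immed(0x101));        /* 0 */
        return 0;
    }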
Condition Codes and Flags
■ <cond>
represents any of the standard ARM condition codes. Table B1.2
shows the possible values for <cond>.
TABLE B1.2  ARM condition mnemonics.

<cond>     Instruction is executed when                                cpsr condition
{|AL}      ALways                                                      TRUE
EQ         EQual (last result zero)                                    Z==1
NE         Not Equal (last result nonzero)                             Z==0
{CS|HS}    Carry Set, unsigned Higher or Same (following a compare)    C==1
{CC|LO}    Carry Clear, unsigned LOwer (following a comparison)        C==0
MI         MInus (last result negative)                                N==1
PL         PLus (last result greater than or equal to zero)            N==0
VS         V flag Set (signed overflow on last result)                 V==1
VC         V flag Clear (no signed overflow on last result)            V==0
HI         unsigned HIgher (following a comparison)                    C==1 && Z==0
LS         unsigned Lower or Same (following a comparison)             C==0 || Z==1
GE         signed Greater than or Equal                                N==V
LT         signed Less Than                                            N!=V
GT         signed Greater Than                                         N==V && Z==0
LE         signed Less than or Equal                                   N!=V || Z==1
NV         NeVer—ARMv1 and ARMv2 only—DO NOT USE                       FALSE
■ <SignedOverflow>
is a flag indicating that the result of an arithmetic
operation suffered from a signed overflow. For example, 0x7fffffff + 1 =
0x80000000 produces a signed overflow because the sum of two positive
32-bit signed integers is a negative 32-bit signed integer. The V flag in the cpsr
typically records signed overflows.
■ <UnsignedOverflow>
is a flag indicating that the result of an arithmetic
operation suffered from an unsigned overflow. For example, 0xffffffff + 1
= 0 produces an overflow in unsigned 32-bit arithmetic. The C flag in the cpsr
typically records unsigned overflows.
■ <NoUnsignedOverflow>
is the same as 1 – <UnsignedOverflow>.
■ <Zero> is a flag indicating that the result of an arithmetic or logical operation
is zero. The Z flag in the cpsr typically records the zero condition.
■ <Negative>
is a flag indicating that the result of an arithmetic or logical
operation is negative. In other words, <Negative> is bit 31 of the result. The
N flag in the cpsr typically records this condition.
Shift Operations
■ <imm_shift>
represents a shift by an immediate specified amount. The
possible shifts are LSL #<0-31>, LSR #<1-32>, ASR #<1-32>, ROR #<1-31>, and RRX. See Table B1.3 for the actions of each shift.
■ <reg_shift>
represents a shift by a register-specified amount. The
possible shifts are LSL Rs, LSR Rs, ASR Rs, and ROR Rs. Rs must not be pc .
The bottom eight bits of Rs are used as the shift value k in Table B1.3. Bits
Rs[31:8] are ignored.
■ <shift>
is shorthand for <imm_shift> or <reg_shift>.
■ <shifted_Rm>
is shorthand for the value of Rm after the specified shift has
been applied. See Table B1.3.
■ <shifter_C>
is shorthand for the carry value output by the shifting circuit.
See Table B1.3.
TABLE B1.3  Barrel shifter circuit outputs for different shift types.

Shift    k range       <shifted_Rm>                          <shifter_C>
LSL k    k = 0         Rm                                    C (from cpsr)
LSL k    1 ≤ k ≤ 31    Rm « k                                Rm[32-k]
LSL k    k = 32        0                                     Rm[0]
LSL k    k ≥ 33        0                                     0
LSR k    k = 0         Rm                                    C
LSR k    1 ≤ k ≤ 31    (unsigned)Rm » k                      Rm[k-1]
LSR k    k = 32        0                                     Rm[31]
LSR k    k ≥ 33        0                                     0
ASR k    k = 0         Rm                                    C
ASR k    1 ≤ k ≤ 31    (signed)Rm » k                        Rm[k-1]
ASR k    k ≥ 32        -Rm[31]                               Rm[31]
ROR k    k = 0         Rm                                    C
ROR k    1 ≤ k ≤ 31    ((unsigned)Rm » k) | (Rm « (32-k))    Rm[k-1]
ROR k    k ≥ 32        Rm ROR (k & 31)                       Rm[(k-1) & 31]
RRX                    (C « 31) | ((unsigned)Rm » 1)         Rm[0]
B1.3 Alphabetical List of ARM and Thumb Instructions
Instructions are listed in alphabetical order. However, where signed and unsigned
variants of the same operation exist, the main entry is under the signed variant.
ADC Add two 32-bit values and carry
1. ADC<cond>{S}    Rd, Rn, #<rotated_immed>    ARMv1
2. ADC<cond>{S}    Rd, Rn, Rm {, <shift>}      ARMv1
3. ADC             Ld, Lm                      THUMBv1

Action                              Effect on the cpsr
1. Rd = Rn + <rotated_immed> + C    Updated if S suffix specified
2. Rd = Rn + <shifted_Rm> + C       Updated if S suffix specified
3. Ld = Ld + Lm + C                 Updated (see Notes below)
Notes
■
If the operation updates the cpsr and Rd is not pc, then N = <Negative>,
Z = <Zero>, C = <UnsignedOverflow>, V = <SignedOverflow>.
■
If Rd is pc, then the instruction effects a jump to the calculated address. If the
operation updates the cpsr, then the processor mode must have an spsr; in this
case, the cpsr is set to the value of the spsr.
■
If Rn or Rm is pc, then the value used is the address of the instruction plus
eight bytes.
Examples
ADDS    r0, r0, r2    ; first half of a 64-bit add
ADC     r1, r1, r3    ; second half of a 64-bit add
ADCS    r0, r0, r0    ; shift r0 left, inserting carry (RLX)
ADD Add two 32-bit values
1.  ADD<cond>{S}    Rd, Rn, #<rotated_immed>    ARMv1
2.  ADD<cond>{S}    Rd, Rn, Rm {, <shift>}      ARMv1
3.  ADD             Ld, Ln, #<immed3>           THUMBv1
4.  ADD             Ld, #<immed8>               THUMBv1
5.  ADD             Ld, Ln, Lm                  THUMBv1
6.  ADD             Hd, Lm                      THUMBv1
7.  ADD             Ld, Hm                      THUMBv1
8.  ADD             Hd, Hm                      THUMBv1
9.  ADD             Ld, pc, #<immed8>*4         THUMBv1
10. ADD             Ld, sp, #<immed8>*4         THUMBv1
11. ADD             sp, #<immed7>*4             THUMBv1

Action                          Effect on the cpsr
1.  Rd = Rn + <rotated_immed>   Updated if S suffix specified
2.  Rd = Rn + <shifted_Rm>      Updated if S suffix specified
3.  Ld = Ln + <immed3>          Updated (see Notes below)
4.  Ld = Ld + <immed8>          Updated (see Notes below)
5.  Ld = Ln + Lm                Updated (see Notes below)
6.  Hd = Hd + Lm                Preserved
7.  Ld = Ld + Hm                Preserved
8.  Hd = Hd + Hm                Preserved
9.  Ld = pc + 4*<immed8>        Preserved
10. Ld = sp + 4*<immed8>        Preserved
11. sp = sp + 4*<immed7>        Preserved
Notes
■
If the operation updates the cpsr and Rd is not pc, then N = <Negative>,
Z = <Zero>, C = <UnsignedOverflow>, V = <SignedOverflow>.
■
If Rd or Hd is pc, then the instruction effects a jump to the calculated address.
If the operation updates the cpsr, then the processor mode must have an spsr;
in this case, the cpsr is set to the value of the spsr.
■
If Rn or Rm is pc, then the value used is the address of the instruction plus
eight bytes.
■
If Hd or Hm is pc, then the value used is the address of the instruction plus
four bytes.
Examples
ADD     r0, r1, #4            ; r0 = r1 + 4
ADDS    r0, r2, r2            ; r0 = r2 + r2 and flags updated
ADD     r0, r0, r0, LSL #1    ; r0 = 3*r0
ADD     pc, pc, r0, LSL #2    ; skip r0+1 instructions
ADD     r0, r1, r2, ROR r3    ; r0 = r1 + ((r2»r3)|(r2«(32-r3)))
ADDS    pc, lr, #4            ; jump to lr+4, restoring the cpsr

ADR Address relative

1. ADR{L}<cond> Rd, <address>    MACRO
This is not an ARM instruction, but an assembler macro that attempts to set Rd
to the value <address> using a pc-relative calculation. The ADR instruction macro
always uses a single ARM (or Thumb) instruction. The long-version ADRL always
uses two ARM instructions and so can access a wider range of addresses. If the
assembler cannot generate an instruction sequence reaching the address, then it
will generate an error.
The following example shows how to call the function pointed to by r9. We use
ADR to set lr to the return address; in this case, it will assemble to ADD lr, pc, #4.
Recall that pc reads as the address of the current instruction plus eight in this case.
    ADR    lr, return_address    ; set return address
    MOV    r0, #0                ; set a function argument
    BX     r9                    ; call the function
return_address                   ; resume
AND Logical bitwise AND of two 32-bit values
1. AND<cond>{S}    Rd, Rn, #<rotated_immed>    ARMv1
2. AND<cond>{S}    Rd, Rn, Rm {, <shift>}      ARMv1
3. AND             Ld, Lm                      THUMBv1

Action                           Effect on the cpsr
1. Rd = Rn & <rotated_immed>     Updated if S suffix specified
2. Rd = Rn & <shifted_Rm>        Updated if S suffix specified
3. Ld = Ld & Lm                  Updated (see Notes below)
Notes
■
If the operation updates the cpsr and Rd is not pc, then N = <Negative>,
Z = <Zero>, C = <shifter_C> (see Table B1.3), V is preserved.
■
If Rd is pc, then the instruction effects a jump to the calculated address. If the
operation updates the cpsr, then the processor mode must have an spsr; in this
case, the cpsr is set to the value of the spsr.
■
If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes.
Example
AND     r0, r0, #0xFF    ; extract the lower 8 bits of a byte
ANDS    r0, r0, #1«31    ; extract sign bit
ASR Arithmetic shift right for Thumb (see MOV for the ARM equivalent)
1. ASR    Ld, Lm, #<immed5>    THUMBv1
2. ASR    Ld, Ls               THUMBv1

Action                         Effect on the cpsr
1. Ld = Lm ASR #<immed5>       Updated (see Note below)
2. Ld = Ld ASR Ls[7:0]         Updated
Note
■
The cpsr is updated: N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3).

B Branch relative
1. B<cond>    <address25>    ARMv1
2. B<cond>    <address8>     THUMBv1
3. B          <address11>    THUMBv1
Branches to the given address or label. The address is stored as a relative
offset.
Examples
B      label    ; branch unconditionally to a label
BGT    loop     ; conditionally continue a loop

BIC Logical bit clear (AND NOT) of two 32-bit values
1. BIC<cond>{S}    Rd, Rn, #<rotated_immed>    ARMv1
2. BIC<cond>{S}    Rd, Rn, Rm {, <shift>}      ARMv1
3. BIC             Ld, Lm                      THUMBv1

Action                            Effect on the cpsr
1. Rd = Rn & ~<rotated_immed>     Updated if S suffix specified
2. Rd = Rn & ~<shifted_Rm>        Updated if S suffix specified
3. Ld = Ld & ~Lm                  Updated (see Notes below)
Notes
■
If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z =
<Zero>, C = <shifter_C> (see Table B1.3), V is preserved.
■
If Rd is pc, then the instruction effects a jump to the calculated address. If the
operation updates the cpsr, then the processor mode must have an spsr; in this
case, the cpsr is set to the value of the spsr.
■
If Rn or Rm is pc, then the value used is the address of the instruction plus
eight bytes.
Examples
BIC    r0, r0, #1 « 22    ; clear bit 22 of r0

BKPT Breakpoint instruction
1. BKPT    <immed16>    ARMv5
2. BKPT    <immed8>     THUMBv2
The breakpoint instruction causes a prefetch abort, unless overridden by
debug hardware. The ARM ignores the immediate value. This immediate can be
used to hold debug information such as the breakpoint number.
BL Relative branch with link (subroutine call)

1. BL<cond>    <address25>    ARMv1
2. BL          <address22>    THUMBv1
Action                                  Effect on the cpsr
1. lr = ret+0; pc = <address25>         None
2. lr = ret+1; pc = <address22>         None
Note
■
These instructions set lr to the address of the following instruction ret plus
the current cpsr T-bit setting. Therefore you can return from the subroutine
using BX lr to resume execution address and ARM or Thumb state.
Examples
BL      subroutine    ; call subroutine (return with MOV pc,lr)
BLVS    overflow      ; call subroutine on an overflow

BLX Branch with link and exchange (subroutine call with possible state switch)

1. BLX         <address25>    ARMv5
2. BLX<cond>   Rm             ARMv5
3. BLX         <address22>    THUMBv2
4. BLX         Rm             THUMBv2

Action                                  Effect on the cpsr
1. lr = ret+0; pc = <address25>         T=1 (switch to Thumb state)
2. lr = ret+0; pc = Rm & 0xfffffffe     T=Rm & 1
3. lr = ret+1; pc = <address22>         T=0 (switch to ARM state)
4. lr = ret+1; pc = Rm & 0xfffffffe     T=Rm & 1
Notes
■
These instructions set lr to the address of the following instruction ret plus
the current cpsr T-bit setting. Therefore you can return from the subroutine
using BX lr to resume execution address and ARM or Thumb state.
■
Rm must not be pc.
■
Rm & 3 must not be 2. This would cause a branch to an unaligned ARM
instruction.
Example
BLX    thumb_code    ; call a Thumb subroutine from ARM state
BLX    r0            ; call the subroutine pointed to by r0
                     ; ARM code if r0 even, Thumb if r0 odd

BX, BXJ Branch with exchange (branch with possible state switch)
1. BX<cond>     Rm    ARMv4T
2. BX           Rm    THUMBv1
3. BXJ<cond>    Rm    ARMv5J

Action                                  Effect on the cpsr
1. pc = Rm & 0xfffffffe                 T=Rm & 1
2. pc = Rm & 0xfffffffe                 T=Rm & 1
3. Depends on JE configuration bit      J,T affected
Notes
■
If Rm is pc and the instruction is word aligned, then Rm takes the value of the
current instruction plus eight in ARM state or plus four in Thumb state.
■
Rm & 3 must not be 2. This would cause a branch to an unaligned ARM
instruction.
B1-13
B1-14
Appendix B1 ARM and Thumb Assembler Instructions
■
If the JE (Java Enable) configuration bit is clear, then BXJ behaves as a BX.
Otherwise, the behavior is defined by the architecture of the Java Extension
hardware. Typically it sets J = 1 in the cpsr and starts executing Java instructions
from a general purpose register designated as the Java program counter jpc.
Examples
BX    lr    ; return from ARM or Thumb subroutine
BX    r0    ; branch to ARM or Thumb function pointer r0

CDP Coprocessor data processing operation
1. CDP<cond>    <copro>, <op1>, Cd, Cn, Cm, <op2>    ARMv2
2. CDP2         <copro>, <op1>, Cd, Cn, Cm, <op2>    ARMv5
These instructions initiate a coprocessor-dependent operation. <copro> is the
number of the coprocessor in the range p0 to p15. The core takes an undefined
instruction trap if the coprocessor is not present. The coprocessor operation
specifiers <op1> and <op2>, and the coprocessor register numbers Cd, Cn, Cm,
are interpreted by the coprocessor and ignored by the ARM. CDP2 provides an
additional set of coprocessor instructions.
CLZ Count leading zeros

1. CLZ<cond>    Rd, Rm    ARMv5
Rd is set to the maximum left shift that can be applied to Rm without unsigned overflow. Equivalently, this is the number of zeros above the highest one in the binary representation of Rm. If Rm = 0, then Rd is set to 32. The following example normalizes the value in r0 so that bit 31 is set:
CLZ    r1, r0            ; find normalization shift
MOV    r0, r0, LSL r1    ; normalize so bit 31 is set (if r0!=0)
CMN Compare negative
1. CMN<cond>    Rn, #<rotated_immed>    ARMv1
2. CMN<cond>    Rn, Rm {, <shift>}      ARMv1
3. CMN          Ln, Lm                  THUMBv1

Action
1. cpsr flags set on the result of (Rn + <rotated_immed>)
2. cpsr flags set on the result of (Rn + <shifted_Rm>)
3. cpsr flags set on the result of (Ln + Lm)
Notes
■
In the cpsr: N = <Negative>, Z = <Zero>, C = <UnsignedOverflow>, V =
<SignedOverflow>. These are the same flags as generated by CMP with the
second operand negated.
■
If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes.
Example
CMN    r0, #3    ; compare r0 with -3
BLT    label     ; if (r0 < -3) goto label
CMP Compare two 32-bit integers
1. CMP<cond>    Rn, #<rotated_immed>    ARMv1
2. CMP<cond>    Rn, Rm {, <shift>}      ARMv1
3. CMP          Ln, #<immed8>           THUMBv1
4. CMP          Rn, Rm                  THUMBv1

Action
1. cpsr flags set on the result of (Rn - <rotated_immed>)
2. cpsr flags set on the result of (Rn - <shifted_Rm>)
3. cpsr flags set on the result of (Ln - <immed8>)
4. cpsr flags set on the result of (Rn - Rm)
Notes
■
In the cpsr: N = <Negative>, Z = <Zero>, C = <NoUnsignedOverflow>, V = <SignedOverflow>. The carry flag is set this way because the subtract x – y is implemented as the add x + ~y + 1. The carry flag is one if x + ~y + 1 overflows. This happens when x ≥ y (equivalently, when x – y doesn’t overflow).
■
If Rn or Rm is pc, then the value used is the address of the instruction plus eight
bytes for ARM instructions, or plus four bytes for Thumb instructions.
Example
CMP    r0, r1, LSR#2    ; compare r0 with (r1/4)
BHS    label            ; if (r0 >= (r1/4)) goto label;
CPS Change processor state; modifies selected bits in the cpsr
1. CPS      #<mode>                ARMv6
2. CPSID    <flags> {, #<mode>}    ARMv6
3. CPSIE    <flags> {, #<mode>}    ARMv6
4. CPSID    <flags>                THUMBv3
5. CPSIE    <flags>                THUMBv3
Action
1. cpsr[4:0] = <mode>
2. cpsr = cpsr | mask; { cpsr[4:0]=<mode> }
3. cpsr = cpsr & ~mask; { cpsr[4:0]=<mode> }
4. cpsr = cpsr | mask
5. cpsr = cpsr & ~mask
Bits are set in mask according to letters in the <flags> value as in Table B1.4. The ID
(interrupt disable) variants mask interrupts by setting cpsr bits. The IE (interrupt
enable) variants unmask interrupts by clearing cpsr bits.
TABLE B1.4  CPS flags characters.

Character    cpsr bit affected                Bit set in mask
a            imprecise data Abort mask bit    0x100 = 1 << 8
i            IRQ mask bit                     0x080 = 1 << 7
f            FIQ mask bit                     0x040 = 1 << 6

CPY Copy one ARM register to another without affecting the cpsr.
1. CPY<cond>    Rd, Rm    ARMv6
2. CPY          Rd, Rm    THUMBv3
This assembles to MOV <cond> Rd, Rm except in the case of Thumb where Rd and
Rm are low registers in the range r0 to r7. Then it is a new operation that sets Rd=Rm
without affecting the cpsr.
EOR Logical exclusive OR of two 32-bit values
1. EOR<cond>{S}    Rd, Rn, #<rotated_immed>    ARMv1
2. EOR<cond>{S}    Rd, Rn, Rm {, <shift>}      ARMv1
3. EOR             Ld, Lm                      THUMBv1

Action                           Effect on the cpsr
1. Rd = Rn ^ <rotated_immed>     Updated if S suffix specified
2. Rd = Rn ^ <shifted_Rm>        Updated if S suffix specified
3. Ld = Ld ^ Lm                  Updated (see Notes below)
Notes
■
If the operation updates the cpsr and Rd is not pc, then N = <Negative>,
Z = <Zero>, C = <shifter_C> (see Table B1.3), V is preserved.
■
If Rd is pc, then the instruction effects a jump to the calculated address. If the
operation updates the cpsr, then the processor mode must have an spsr; in this
case, the cpsr is set to the value of the spsr.
■
If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes.
Example
EOR    r0, r0, #1 « 16    ; toggle bit 16
LDC Load to coprocessor single or multiple 32-bit values
1. LDC<cond>{L}    <copro>, Cd, [Rn {, #{-}<immed8>*4}]{!}    ARMv2
2. LDC<cond>{L}    <copro>, Cd, [Rn], #{-}<immed8>*4          ARMv2
3. LDC<cond>{L}    <copro>, Cd, [Rn], <option>                ARMv2
4. LDC2{L}         <copro>, Cd, [Rn {, #{-}<immed8>*4}]{!}    ARMv5
5. LDC2{L}         <copro>, Cd, [Rn], #{-}<immed8>*4          ARMv5
6. LDC2{L}         <copro>, Cd, [Rn], <option>                ARMv5
These instructions initiate a memory read, transferring data to the given coprocessor.
<copro> is the number of the coprocessor in the range p0 to p15. The core takes an
undefined instruction trap if the coprocessor is not present. The memory read consists
of a sequence of words from sequentially increasing addresses. The initial address is
specified by the addressing mode in Table B1.5. The coprocessor controls the number
of words transferred, up to a maximum limit of 16 words. The fields {L} and Cd are
interpreted by the coprocessor and ignored by the ARM. Typically Cd specifies the
destination coprocessor register for the transfer. The <option> field is an eight-bit
integer enclosed in { }. Its interpretation is coprocessor dependent.
TABLE B1.5  LDC addressing modes.

Addressing format         Address accessed       Value written back to Rn
[Rn {, #{-}<immed>}]      Rn + {{-}<immed>}      Rn preserved
[Rn {, #{-}<immed>}]!     Rn + {{-}<immed>}      Rn + {{-}<immed>}
[Rn], #{-}<immed>         Rn                     Rn + {-}<immed>
[Rn], <option>            Rn                     Rn preserved
If the address is not a multiple of four, then the access is unaligned. The restrictions
on unaligned accesses are the same as for LDM.
LDM Load multiple 32-bit words from memory to ARM registers

1. LDM<cond><amode>    Rn{!}, <register_list>{^}    ARMv1
2. LDMIA               Rn!, <register_list>         THUMBv1
These instructions load multiple words from sequential memory addresses. The
<register_list> specifies a list of registers to load, enclosed in curly brackets
{ }. Although the assembler allows you to specify the registers in the list in any order,
the order is not stored in the instruction, so it is good practice to write the list in
increasing order of register number because this is the usual order of the memory
transfer.
The following pseudocode shows the normal action of LDM. We use <register_
list>[i] to denote the register appearing at position i in the list, starting at 0 for the
first register. This assumes that the list is in order of increasing register number.
N = the number of registers in <register_list>
start = the lowest address accessed, given in Table B1.6
for (i = 0; i < N; i++)
    <register_list>[i] = memory(start + i*4, 4);
if (! specified) then update Rn according to Table B1.6
Note that memory(a, 4) returns the four bytes at address a packed according
to the current processor data endianness. If a is not a multiple of four, then
the load is unaligned. Because the behavior of an unaligned load depends on
the architecture revision, memory system, and system coprocessor (CP15)
configuration, it’s best to avoid unaligned loads if possible. Assuming that the
external memory system does not abort unaligned loads, then the following rules
usually apply:
■
If the core has a system coprocessor and bit 1 (A-bit) or bit 22 (U-bit) of
CP15:c1:c0:0 is set, then unaligned load multiples cause an alignment fault
data abort exception.
■
Otherwise the access ignores the bottom two address bits.
Table B1.6 lists the possible addressing modes specified by <amode>. If you
specify the !, then the base address register is updated according to Table B1.6;
otherwise it is preserved. Note that the lowest register number is always read from
the lowest address.
The first half of the addressing mode mnemonics stands for Increment
After, Increment Before, Decrement After, and Decrement Before, respectively.
Increment modes load the registers sequentially forward, starting from address
Rn (increment after) or Rn + 4 (increment before). Decrement modes have
the same effect as if you loaded the register list backwards from sequentially
TABLE B1.6  LDM addressing modes.

Addressing mode    Lowest address accessed    Highest address accessed    Value written back to Rn if ! specified
{IA|FD}            Rn                         Rn + N*4 - 4                Rn + N*4
{IB|ED}            Rn + 4                     Rn + N*4                    Rn + N*4
{DA|FA}            Rn - N*4 + 4               Rn                          Rn - N*4
{DB|EA}            Rn - N*4                   Rn - 4                      Rn - N*4
descending memory addresses, starting from address Rn (decrement after) or
Rn – 4 (decrement before).
The second half of the addressing mode mnemonics stands for the stack type you
can implement with that address mode: Full Descending, Empty Descending, Full
Ascending, and Empty Ascending. With a full stack, Rn points to the last stacked value;
with an empty stack, Rn points to the first unused stack location. ARM stacks are
usually full descending. You should use full descending or empty ascending stacks by
preference, since LDC also supports these addressing modes.
Notes
■
For Thumb (format 2), Rn and the register list registers must be in the range r0
to r7.
■
The number of registers N in the list must be nonzero.
■
Rn must not be pc.
■
Rn must not appear in the register list if ! (writeback) is specified.
■
If pc appears in the register list, then on ARMv5 and above the processor performs
a BX to the loaded address. For ARMv4 and below, the processor branches to the
loaded address.
■
If ^ is specified, then the operation is modified. The processor must not be in user
or system mode. If pc is not in the register list, then the registers appearing in the
register list refer to the user mode versions of the registers and writeback must
not be specified. If pc is in the register list, then the spsr is copied to the cpsr in
addition to the standard operation.
■
The time order of the memory accesses may depend on the implementation. Be
careful when using a load multiple to access I/O locations where the access order
matters. If the order matters, then check that the memory locations are marked
as I/O in the page tables, do not cross page boundaries, and do not use pc in the
register list.
Examples
LDMIA      r4!, {r0, r1}    ; r0=*r4, r1=*(r4+4), r4+=8
LDMDB      r4!, {r0, r1}    ; r1=*(r4-4), r0=*(r4-8), r4-=8
LDMEQFD    sp!, {r0, pc}    ; if (result zero) then unstack r0, pc
LDMFD      sp, {sp}^        ; load sp_usr from sp_current
LDMFD      sp!, {r0-pc}^    ; return from exception, restore cpsr
LDR Load a single value from a virtual address in memory
1.  LDR<cond>{|B}         Rd, [Rn {, #{-}<immed12>}]{!}        ARMv1
2.  LDR<cond>{|B}         Rd, [Rn, {-}Rm {,<imm_shift>}]{!}    ARMv1
3.  LDR<cond>{|B}{T}      Rd, [Rn], #{-}<immed12>              ARMv1
4.  LDR<cond>{|B}{T}      Rd, [Rn], {-}Rm {,<imm_shift>}       ARMv1
5.  LDR<cond>{H|SB|SH}    Rd, [Rn {, #{-}<immed8>}]{!}         ARMv4
6.  LDR<cond>{H|SB|SH}    Rd, [Rn, {-}Rm]{!}                   ARMv4
7.  LDR<cond>{H|SB|SH}    Rd, [Rn], #{-}<immed8>               ARMv4
8.  LDR<cond>{H|SB|SH}    Rd, [Rn], {-}Rm                      ARMv4
9.  LDR<cond>D            Rd, [Rn {, #{-}<immed8>}]{!}         ARMv5E
10. LDR<cond>D            Rd, [Rn, {-}Rm]{!}                   ARMv5E
11. LDR<cond>D            Rd, [Rn], #{-}<immed8>               ARMv5E
12. LDR<cond>D            Rd, [Rn], {-}Rm                      ARMv5E
13. LDREX<cond>           Rd, [Rn]                             ARMv6
14. LDR{|B|H}             Ld, [Ln, #<immed5>*<size>]           THUMBv1
15. LDR{|B|H|SB|SH}       Ld, [Ln, Lm]                         THUMBv1
16. LDR                   Ld, [pc, #<immed8>*4]                THUMBv1
17. LDR                   Ld, [sp, #<immed8>*4]                THUMBv1
18. LDR<cond><type>       Rd, <label>                          MACRO
19. LDR<cond>             Rd, =<32-bit-value>                  MACRO
Formats 1 to 17 load a single data item of the type specified by the opcode suffix,
using a preindexed or postindexed addressing mode. Tables B1.7 and B1.8 show the
different addressing modes and data types.
TABLE B1.7  LDR addressing modes.

Addressing format          Address a accessed       Value written back to Rn
[Rn {,#{-}<immed>}]        Rn + {{-}<immed>}        Rn preserved
[Rn {,#{-}<immed>}]!       Rn + {{-}<immed>}        Rn + {{-}<immed>}
[Rn, {-}Rm {,<shift>}]     Rn + {-}<shifted_Rm>     Rn preserved
[Rn, {-}Rm {,<shift>}]!    Rn + {-}<shifted_Rm>     Rn + {-}<shifted_Rm>
[Rn], #{-}<immed>          Rn                       Rn + {-}<immed>
[Rn], {-}Rm {,<shift>}     Rn                       Rn + {-}<shifted_Rm>
In Table B1.8 memory(a, n) reads n sequential bytes from address a. The bytes
are packed according to the configured processor data endianness. The function
memoryT(a, n) performs the same access but with user mode privileges, regardless of
the current processor mode. The function memoryEx(a, n) used by LDREX performs
the access and marks the access as exclusive. If address a has the shared TLB attribute,
then this marks address a as exclusive to the current processor and clears any other
exclusive addresses for this processor. Otherwise the processor remembers that there
is an outstanding exclusive access. Exclusivity only affects the action of the STREX
instruction.
TABLE B1.8  LDR datatypes.

Load     Datatype             <size> (bytes)    Action
LDR      word                 4                 Rd = memory(a, 4)
LDRB     unsigned Byte        1                 Rd = (zero-extend)memory(a, 1)
LDRBT    Byte Translated      1                 Rd = (zero-extend)memoryT(a, 1)
LDRD     Double word          8                 Rd = memory(a, 4); R(d+1) = memory(a+4, 4)
LDREX    word EXclusive       4                 Rd = memoryEx(a, 4)
LDRH     unsigned Halfword    2                 Rd = (zero-extend)memory(a, 2)
LDRSB    Signed Byte          1                 Rd = (sign-extend)memory(a, 1)
LDRSH    Signed Halfword      2                 Rd = (sign-extend)memory(a, 2)
LDRT     word Translated      4                 Rd = memoryT(a, 4)
If address a is not a multiple of <size>, then the load is unaligned. Because the
behavior of an unaligned load depends on the architecture revision, memory system,
and system coprocessor (CP15) configuration, it’s best to avoid unaligned loads if
possible. Assuming that the external memory system does not abort unaligned loads,
then the following rules usually apply. In the rules, A is bit 1 of system coprocessor
register CP15:c1:c0:0, and U is bit 22 of CP15:c1:c0:0, introduced in ARMv6. If there is
no system coprocessor, then A = U = 0.
■
If A = 1, then unaligned loads cause an alignment fault data abort exception
except that word-aligned double-word loads are supported if U = 1.
■
If A = 0 and U = 1, then unaligned loads are supported for LDR{|T|H|SH}.
Word-aligned loads are supported for LDRD. A non-word-aligned LDRD generates
an alignment fault data abort.
■
If A = 0 and U = 0, then LDR and LDRT return the value memory(a & ~ 3, 4) ROR
((a&3)*8). All other unaligned operations are unpredictable but do not
generate an alignment fault.
Format 18 generates a pc-relative load accessing the address specified by <label>.
In other words, it assembles to LDR<cond><type> Rd, [pc, #<offset>]
whenever this instruction is supported and <offset>=<label>-pc is in range.
Format 19 generates an instruction to move the given 32-bit value to the register
Rd. Usually the instruction is LDR<cond> Rd, [pc, #<offset>], where the 32-bit
value is stored in a literal pool at address pc+<offset>.
Notes
■
For double-word loads (formats 9 to 12), Rd must be even and in the range r0
to r12.
■
If the addressing mode updates Rn, then Rd and Rn must be distinct.
■
If Rd is pc, then <size> must be 4. Up to ARMv4, the core branches to the loaded
address. For ARMv5 and above, the core performs a BX to the loaded address.
■
If Rn is pc, then the addressing mode must not update Rn . The value used for
Rn is the address of the instruction plus eight bytes for ARM or four bytes for
Thumb.
■
Rm must not be pc.
■
For ARMv6 use LDREX and STREX to implement semaphores rather than SWP.
Examples
LDR    r0, [r0]               ; r0 = *(int*)r0;
LDRSH  r0, [r1], #4           ; r0 = *(short*)r1; r1 += 4;
LDRB   r0, [r1, #-8]!         ; r1 -= 8; r0 = *(char*)r1;
LDRD   r2, [r1]               ; r2 = *(int*)r1; r3 = *(int*)(r1+4);
LDRSB  r0, [r2, #55]          ; r0 = *(signed char*)(r2+55);
LDRCC  pc, [pc, r0, LSL #2]   ; if (C==0) goto *(pc+4*r0);
LDRB   r0, [r1], -r2, LSL #8  ; r0 = *(char*)r1; r1 -= 256*r2;
LDR    r0, =0x12345678        ; r0 = 0x12345678;

LSL  Logical shift left for Thumb (see MOV for the ARM equivalent)
1. LSL Ld, Lm, #<immed5>   THUMBv1
2. LSL Ld, Ls              THUMBv1

Action                     Effect on the cpsr
1. Ld = Lm LSL #<immed5>   Updated (see Note below)
2. Ld = Ld LSL Ls[7:0]     Updated
Note
■
The cpsr is updated: N = <Negative>, Z = <Zero>, C = <shifter_C>
(see Table B1.3).
LSR Logical shift right for Thumb (see MOV for the ARM equivalent)
1. LSR Ld, Lm, #<immed5>   THUMBv1
2. LSR Ld, Ls              THUMBv1

Action                     Effect on the cpsr
1. Ld = Lm LSR #<immed5>   Updated (see Note below)
2. Ld = Ld LSR Ls[7:0]     Updated
Note
■ The cpsr is updated: N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3).

MCR, MCRR  Move to coprocessor from an ARM register
1. MCR<cond>   <copro>, <op1>, Rd, Cn, Cm {, <op2>}   ARMv2
2. MCR2        <copro>, <op1>, Rd, Cn, Cm {, <op2>}   ARMv5
3. MCRR<cond>  <copro>, <op1>, Rd, Rn, Cm             ARMv5E
4. MCRR2       <copro>, <op1>, Rd, Rn, Cm             ARMv6
These instructions transfer the value of ARM register Rd to the indicated coprocessor.
Formats 3 and 4 also transfer a second register Rn. <copro> is the number of the
coprocessor in the range p0 to p15. The core takes an undefined instruction trap if the
coprocessor is not present. The coprocessor operation specifiers <op1> and <op2>,
and the coprocessor register numbers Cn, Cm, are interpreted by the coprocessor, and
ignored by the ARM. Rd and Rn must not be pc. Coprocessor p15 controls memory
management options. For example, the following code sequence enables alignment
fault checking:
MRC   p15, 0, r0, c1, c0, 0   ; read the MMU register, c1
ORR   r0, r0, #2              ; set the A bit
MCR   p15, 0, r0, c1, c0, 0   ; write the MMU register, c1

MLA  Multiply with accumulate
1. MLA<cond>{S} Rd, Rm, Rs, Rn   ARMv2

Action               Effect on the cpsr
1. Rd = Rn + Rm*Rs   Updated if S suffix supplied
Notes
■
Rd is set to the lower 32 bits of the result.
■
Rd, Rm, Rs, Rn must not be pc.
■
Rd and Rm must be different registers.
■
Implementations may terminate early on the value of the Rs operand. For this
reason use small or constant values for Rs where possible. See Appendix B3.
■
If the cpsr is updated, then N = <Negative>, Z = <Zero>, C is unpredictable, and
V is preserved. Avoid using the instruction MLAS because implementations often
impose penalty cycles for this operation. Instead use MLA followed by a compare,
and schedule the compare to avoid multiply result use interlocks.
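As an illustration of this pattern (a sketch, not from the original text; the intervening ADD stands in for unrelated work and the branch target zero_case is hypothetical):

MLA   r0, r1, r2, r3   ; r0 = r3 + r1*r2, flags untouched
ADD   r4, r4, #1       ; unrelated work scheduled to hide the multiply latency
CMP   r0, #0           ; set the flags here instead of using MLAS
BEQ   zero_case        ; zero_case is a hypothetical label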
MOV
Move a 32-bit value into a register
1. MOV<cond>{S} Rd, #<rotated_immed>   ARMv1
2. MOV<cond>{S} Rd, Rm {, <shift>}     ARMv1
3. MOV          Ld, #<immed8>          THUMBv1
4. MOV          Ld, Ln                 THUMBv1
5. MOV          Hd, Lm                 THUMBv1
6. MOV          Ld, Hm                 THUMBv1
7. MOV          Hd, Hm                 THUMBv1

Action                    Effect on the cpsr
1. Rd = <rotated_immed>   Updated if S suffix specified
2. Rd = <shifted_Rm>      Updated if S suffix specified
3. Ld = <immed8>          Updated (see Notes below)
4. Ld = Ln                Updated (see Notes below)
5. Hd = Lm                Preserved
6. Ld = Hm                Preserved
7. Hd = Hm                Preserved
Notes
■
If the operation updates the cpsr and Rd is not pc, then N = <Negative>,
Z = <Zero>, C = <shifter_C > (see Table B1.3), and V is preserved.
■
If Rd or Hd is pc, then the instruction effects a jump to the calculated address.
If the operation updates the cpsr, then the processor mode must have an spsr; in
this case, the cpsr is set to the value of the spsr.
■
If Rm is pc, then the value used is the address of the instruction plus eight bytes.
■
If Hm is pc, then the value used is the address of the instruction plus four bytes.
Examples
MOV   r0, #0x00ff0000   ; r0 = 0x00ff0000
MOV   r0, r1, LSL#2     ; r0 = 4*r1
MOV   pc, lr            ; return from subroutine (pc=lr)
MOVS  pc, lr            ; return from exception (pc=lr, cpsr=spsr)

MRC, MRRC  Move to ARM register from a coprocessor
1. MRC<cond>   <copro>, <op1>, Rd, Cn, Cm, <op2>   ARMv2
2. MRC2        <copro>, <op1>, Rd, Cn, Cm, <op2>   ARMv5
3. MRRC<cond>  <copro>, <op1>, Rd, Rn, Cm          ARMv5E
4. MRRC2       <copro>, <op1>, Rd, Rn, Cm          ARMv6
These instructions transfer a 32-bit value from the indicated coprocessor to the ARM
register Rd. Formats 3 and 4 also transfer a second 32-bit value to Rn. <copro> is
the number of the coprocessor in the range p0 to p15. The core takes an undefined
instruction trap if the coprocessor is not present. The coprocessor operation specifiers
<op1> and <op2>, and the coprocessor register numbers Cn, Cm, are interpreted by
the coprocessor and ignored by the ARM. For formats 1 and 2, if Rd is pc, then the top
four bits of the cpsr (the NZCV condition code flags) are set from the top four bits of
the 32-bit value transferred; pc is not affected. For other formats, Rd and Rn must be
distinct and not pc.
Coprocessor p15 controls memory management options. For example, the
following instruction reads the main ID register from p15 :
MRC   p15, 0, r0, c0, c0   ; read the MMU ID register, c0

MRS  Move to ARM register from status register (cpsr or spsr)
1. MRS<cond> Rd, cpsr   ARMv3
2. MRS<cond> Rd, spsr   ARMv3
These instructions set Rd = cpsr and Rd = spsr, respectively. Rd must not be pc.
MSR
Move to status register ( cpsr or spsr ) from an ARM register
1. MSR<cond> cpsr_<fields>, #<rotated_immed>   ARMv3
2. MSR<cond> cpsr_<fields>, Rm                 ARMv3
3. MSR<cond> spsr_<fields>, #<rotated_immed>   ARMv3
4. MSR<cond> spsr_<fields>, Rm                 ARMv3

Action
1. cpsr = (cpsr & ~<mask>) | (<rotated_immed> & <mask>)
2. cpsr = (cpsr & ~<mask>) | (Rm & <mask>)
3. spsr = (spsr & ~<mask>) | (<rotated_immed> & <mask>)
4. spsr = (spsr & ~<mask>) | (Rm & <mask>)
These instructions alter selected bytes of the cpsr or spsr according to the value of
<mask>. The <fields> specifier is a sequence of one or more letters, determining which
bytes of <mask> are set. See Table B1.9.
TABLE B1.9  Format of the <fields> specifier.

<fields> letter   Meaning          Bits set in <mask>
c                 Control byte     0x000000ff
x                 eXtension byte   0x0000ff00
s                 Status byte      0x00ff0000
f                 Flags byte       0xff000000
Some old ARM toolkits allowed cpsr or cpsr_all in place of cpsr_fsxc. They also used
cpsr_flg and cpsr_ctl in place of cpsr_f and cpsr_c, respectively. These formats, and the spsr
equivalents, are obsolete, so you should not use them. The following example changes to
system mode and enables IRQ, which is useful in a reentrant interrupt handler:
MRS   r0, cpsr        ; read cpsr state
BIC   r0, r0, #0x9f   ; clear IRQ disable and mode bits
ORR   r0, r0, #0x1f   ; set system mode
MSR   cpsr_c, r0      ; update control byte of the cpsr
MUL Multiply
1. MUL<cond>{S} Rd, Rm, Rs   ARMv2
2. MUL          Ld, Lm       THUMBv1

Action          Effect on the cpsr
1. Rd = Rm*Rs   Updated if S suffix supplied
2. Ld = Lm*Ld   Updated
Notes
■
Rd or Ld is set to the lower 32 bits of the result.
■
Rd, Rm, Rs must not be pc.
■
Rd and Rm must be different registers. Similarly Ld and Lm must be different.
■
Implementations may terminate early on the value of the Rs or Ld operand. For
this reason use small or constant values for Rs or Ld where possible.
■
If the cpsr is updated, then N = <Negative>, Z = <Zero>, C is
unpredictable, and V is preserved. Avoid using the instruction MULS because
implementations often impose penalty cycles for this operation. Instead use
MUL followed by a compare, and schedule the compare, to avoid multiply
result use interlocks.
MVN
Move the logical not of a 32-bit value into a register
1. MVN<cond>{S} Rd, #<rotated_immed>   ARMv1
2. MVN<cond>{S} Rd, Rm {, <shift>}     ARMv1
3. MVN          Ld, Lm                 THUMBv1

Action                     Effect on the cpsr
1. Rd = ~<rotated_immed>   Updated if S suffix specified
2. Rd = ~<shifted_Rm>      Updated if S suffix specified
3. Ld = ~Lm                Updated (see Notes below)
Notes
■
If the operation updates the cpsr and Rd is not pc, then N = <Negative>,
Z = <Zero>, C = <shifter_C> (see Table B1.3), and V is preserved.
■
If Rd is pc, then the instruction effects a jump to the calculated address. If the
operation updates the cpsr, then the processor mode must have an spsr; in this
case, the cpsr is set to the value of the spsr.
■
If Rm is pc, then the value used is the address of the instruction plus eight bytes.
Examples
MVN   r0, #0xff   ; r0 = 0xffffff00
MVN   r0, #0      ; r0 = -1
NEG Negate value in Thumb (use RSB to negate in ARM state)
1. NEG Ld, Lm   THUMBv1

Action        Effect on the cpsr
1. Ld = -Lm   Updated (see Notes below)
Notes
■
The cpsr is updated: N = <Negative>, Z = <Zero>, C = <NoUnsignedOverflow>,
V = <SignedOverflow>. Note that Z = C and V = (Ld== 0x80000000).
■
This is the same as the operation RSBS Ld, Lm, #0 in ARM state.
NOP
No operation
1. NOP
MACRO
This is not an ARM instruction. It is an assembly macro that produces an instruction
having no effect other than advancing the pc as normal. In ARM state it assembles
to MOV r0, r0. In Thumb state it assembles to MOV r8, r8. The operation is not
guaranteed to take one processor cycle. In particular, if you use NOP after a load of r0,
then the operation may cause pipeline interlocks.
ORR
Logical bitwise OR of two 32-bit values
1. ORR<cond>{S} Rd, Rn, #<rotated_immed>   ARMv1
2. ORR<cond>{S} Rd, Rn, Rm {, <shift>}     ARMv1
3. ORR          Ld, Lm                     THUMBv1

Action                         Effect on the cpsr
1. Rd = Rn | <rotated_immed>   Updated if S suffix specified
2. Rd = Rn | <shifted_Rm>      Updated if S suffix specified
3. Ld = Ld | Lm                Updated (see Notes below)
Notes
■
If the operation updates the cpsr and Rd is not pc, then N = <Negative>,
Z = <Zero>, C = <shifter_C> (see Table B1.3), and V is preserved.
■
If Rd is pc, then the instruction effects a jump to the calculated address. If the
operation updates the cpsr, then the processor mode must have an spsr, in this
case, the cpsr is set to the value of the spsr.
■
If Rn or Rm is pc, then the value used is the address of the instruction plus eight
bytes.
Example
ORR   r0, r0, #1<<13   ; set bit 13 of r0
PKH Pack 16-bit halfwords into a 32-bit word
1. PKHBT<cond> Rd, Rn, Rm {, LSL #<0-31>} ARMv6
2. PKHTB<cond> Rd, Rn, Rm {, ASR #<1-32>} ARMv6
Action
1. Rd[15:00] = Rn[15:00]; Rd[31:16]=<shifted_Rm>[31:16]
2. Rd[31:16] = Rn[31:16]; Rd[15:00]=<shifted_Rm>[15:00]
Note
■
Rd, Rn, Rm must not be pc. cpsr is not affected.
Examples
PKHBT  r0, r1, r2, LSL#16   ; r0 = (r2[15:00]<<16) | r1[15:00]
PKHTB  r0, r2, r1, ASR#16   ; r0 = (r2[31:16]<<16) | r1[31:16]
PLD Preload hint instruction
1. PLD [Rn {, #{-}<immed12>}]
2. PLD [Rn, {-}Rm {,<imm_shift>}]
ARMv5E
ARMv5E
Action
1. Preloads from address (Rn + {{-}<immed12>})
2. Preloads from address (Rn + {-}<shifted_Rm>)
This instruction does not affect the processor registers (other than advancing pc). It
merely hints that the programmer is likely to read from the given address in future. A
cached processor may take this as a hint to load the cache line containing the address
into the cache. The instruction should not generate a data abort or any other memory
system error. If Rn is pc, then the value used for Rn is the address of the instruction plus
eight. Rm must not be pc.
Examples
PLD   [r0, #7]          ; Preload from r0+7
PLD   [r0, r1, LSL#2]   ; Preload from r0+4*r1
POP Pops multiple registers from the stack in Thumb state (for ARM state use LDM)
1. POP <register_list>   THUMBv1
Action
1. equivalent to the ARM instruction LDMFD sp!, <register_list>
The <register_list> can contain registers in the range r0 to r7 and pc. The following
example restores the low-numbered ARM registers and returns from a subroutine:
POP {r0-r7,pc}
PUSH Pushes multiple registers to the stack in Thumb state (for ARM state use STM)
1. PUSH <register_list>   THUMBv1
Action
1. equivalent to the ARM instruction STMFD sp!, <register_list>
The <register_list> can contain registers in the range r0 to r7 and lr. The
following example saves the low-numbered ARM registers and link register.
PUSH {r0-r7,lr}
QADD, QDADD, QDSUB, QSUB  Saturated signed and unsigned arithmetic
1.  QADD<cond>         Rd, Rm, Rn   ARMv5E
2.  QDADD<cond>        Rd, Rm, Rn   ARMv5E
3.  QSUB<cond>         Rd, Rm, Rn   ARMv5E
4.  QDSUB<cond>        Rd, Rm, Rn   ARMv5E
5.  {U}QADD16<cond>    Rd, Rn, Rm   ARMv6
6.  {U}QADDSUBX<cond>  Rd, Rn, Rm   ARMv6
7.  {U}QSUBADDX<cond>  Rd, Rn, Rm   ARMv6
8.  {U}QSUB16<cond>    Rd, Rn, Rm   ARMv6
9.  {U}QADD8<cond>     Rd, Rn, Rm   ARMv6
10. {U}QSUB8<cond>     Rd, Rn, Rm   ARMv6
Action
1.  Rd = sat32(Rm+Rn)
2.  Rd = sat32(Rm+sat32(2*Rn))
3.  Rd = sat32(Rm-Rn)
4.  Rd = sat32(Rm-sat32(2*Rn))
5.  Rd[31:16] = sat16(Rn[31:16] + Rm[31:16]); Rd[15:00] = sat16(Rn[15:00] + Rm[15:00])
6.  Rd[31:16] = sat16(Rn[31:16] + Rm[15:00]); Rd[15:00] = sat16(Rn[15:00] - Rm[31:16])
7.  Rd[31:16] = sat16(Rn[31:16] - Rm[15:00]); Rd[15:00] = sat16(Rn[15:00] + Rm[31:16])
8.  Rd[31:16] = sat16(Rn[31:16] - Rm[31:16]); Rd[15:00] = sat16(Rn[15:00] - Rm[15:00])
9.  Rd[31:24] = sat8(Rn[31:24] + Rm[31:24]); Rd[23:16] = sat8(Rn[23:16] + Rm[23:16]);
    Rd[15:08] = sat8(Rn[15:08] + Rm[15:08]); Rd[07:00] = sat8(Rn[07:00] + Rm[07:00])
10. Rd[31:24] = sat8(Rn[31:24] - Rm[31:24]); Rd[23:16] = sat8(Rn[23:16] - Rm[23:16]);
    Rd[15:08] = sat8(Rn[15:08] - Rm[15:08]); Rd[07:00] = sat8(Rn[07:00] - Rm[07:00])
Notes
■
The operations are signed unless the U prefix is present. For signed operations,
satN(x) saturates x to the range -2^(N-1) <= x < 2^(N-1). For unsigned operations,
satN(x) saturates x to the range 0 <= x < 2^N.
■
The cpsr Q-flag is set if saturation occurred; otherwise it is preserved.
■
Rd, Rn, Rm must not be pc.
■
The X operations are useful for packed complex numbers. The following examples
assume bits [15:00] hold the real part and [31:16] the imaginary part.
Examples
QDADD     r0, r0, r2   ; add Q30 value r2 to Q31 accumulator r0
QADD16    r0, r1, r2   ; SIMD saturating add
QADDSUBX  r0, r1, r2   ; r0=r1+i*r2 in packed complex arithmetic
QSUBADDX  r0, r1, r2   ; r0=r1-i*r2 in packed complex arithmetic

REV  Reverse bytes within a word or halfword
1. REV<cond>    Rd, Rm   ARMv6/THUMBv3
2. REV16<cond>  Rd, Rm   ARMv6/THUMBv3
3. REVSH<cond>  Rd, Rm   ARMv6/THUMBv3

Action
1. Rd[31:24] = Rm[07:00]; Rd[23:16] = Rm[15:08]; Rd[15:08] = Rm[23:16]; Rd[07:00] = Rm[31:24]
2. Rd[31:24] = Rm[23:16]; Rd[23:16] = Rm[31:24]; Rd[15:08] = Rm[07:00]; Rd[07:00] = Rm[15:08]
3. Rd[31:08] = sign-extend(Rm[07:00]); Rd[07:00] = Rm[15:08]
Notes
■
Rd and Rm must not be pc.
■
For Thumb, Rd, Rm must be in the range r0 to r7 and <cond> cannot be
specified.
■
These instructions are useful to convert big-endian data to little-endian and vice
versa.
Examples
REV    r0, r0   ; switch endianness of a word
REV16  r0, r0   ; switch endianness of two packed halfwords
REVSH  r0, r0   ; switch endianness of a signed halfword
RFE Return from exception
1. RFE<amode> Rn!
ARMv6
This performs the operation that LDM<amode> Rn{!}, {pc, cpsr} would perform
if LDM allowed a register list of {pc, cpsr}. See the entry for LDM.
ROR Rotate right for Thumb (see MOV for the ARM equivalent)
1. ROR Ld, Ls   THUMBv1

Action                   Effect on the cpsr
1. Ld = Ld ROR Ls[7:0]   Updated
Notes
■
The cpsr is updated: N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3).
RSB Reverse subtract of two 32-bit integers
1. RSB<cond>{S} Rd, Rn, #<rotated_immed>   ARMv1
2. RSB<cond>{S} Rd, Rn, Rm {, <shift>}     ARMv1

Action                         Effect on the cpsr
1. Rd = <rotated_immed> - Rn   Updated if S suffix present
2. Rd = <shifted_Rm> - Rn      Updated if S suffix present
Notes
■
If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z = <Zero>,
C = <NoUnsignedOverflow>, and V = <SignedOverflow>. The carry flag is set this
way because the subtract x – y is implemented as the add x + ~ y + 1. The carry flag is
one if x + ~y + 1 overflows. This happens when x >= y, when x - y doesn't overflow.
■
If Rd is pc, then the instruction effects a jump to the calculated address. If the
operation updates the cpsr, then the processor mode must have an spsr in this
case, the cpsr is set to the value of the spsr.
■
If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes.
Examples
RSB   r0, r0, #0          ; r0 = -r0
RSB   r0, r1, r1, LSL#3   ; r0 = 7*r1
RSC Reverse subtract with carry of two 32-bit integers
1. RSC<cond>{S} Rd, Rn, #<rotated_immed>   ARMv1
2. RSC<cond>{S} Rd, Rn, Rm {, <shift>}     ARMv1

Action                                Effect on the cpsr
1. Rd = <rotated_immed> - Rn - (~C)   Updated if S suffix present
2. Rd = <shifted_Rm> - Rn - (~C)      Updated if S suffix present
Notes
■
If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z =
<Zero>, C = <NoUnsignedOverflow>, V = <SignedOverflow>. The carry flag
is set this way because the subtract x – y – ~C is implemented as the add x +
~y + C. The carry flag is one if x + ~y + C overflows. This happens when
x - y - ~C doesn't overflow.
■
If Rd is pc, then the instruction effects a jump to the calculated address. If the
operation updates the cpsr, then the processor mode must have an spsr; in this
case the cpsr is set to the value of the spsr.
■
If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes.
The following example negates a 64-bit integer where r0 is the low 32 bits and r1
the high 32 bits.
RSBS  r0, r0, #0   ; r0 = -r0, C=NOT(borrow)
RSC   r1, r1, #0   ; r1 = -r1-borrow
SADD Parallel modulo add and subtract operations
1. {S|U}ADD16<cond>    Rd, Rn, Rm   ARMv6
2. {S|U}ADDSUBX<cond>  Rd, Rn, Rm   ARMv6
3. {S|U}SUBADDX<cond>  Rd, Rn, Rm   ARMv6
4. {S|U}SUB16<cond>    Rd, Rn, Rm   ARMv6
5. {S|U}ADD8<cond>     Rd, Rn, Rm   ARMv6
6. {S|U}SUB8<cond>     Rd, Rn, Rm   ARMv6
Action
1. Rd[31:16]=Rn[31:16]+Rm[31:16]; Rd[15:00]=Rn[15:00]+Rm[15:00]
2. Rd[31:16]=Rn[31:16]+Rm[15:00]; Rd[15:00]=Rn[15:00]-Rm[31:16]
3. Rd[31:16]=Rn[31:16]-Rm[15:00]; Rd[15:00]=Rn[15:00]+Rm[31:16]
4. Rd[31:16]=Rn[31:16]-Rm[31:16]; Rd[15:00]=Rn[15:00]-Rm[15:00]
5. Rd[31:24]=Rn[31:24]+Rm[31:24]; Rd[23:16]=Rn[23:16]+Rm[23:16];
   Rd[15:08]=Rn[15:08]+Rm[15:08]; Rd[07:00]=Rn[07:00]+Rm[07:00]
6. Rd[31:24]=Rn[31:24]-Rm[31:24]; Rd[23:16]=Rn[23:16]-Rm[23:16];
   Rd[15:08]=Rn[15:08]-Rm[15:08]; Rd[07:00]=Rn[07:00]-Rm[07:00]

Effect on the cpsr
1. GE3=GE2=cmn(Rn[31:16],Rm[31:16]); GE1=GE0=cmn(Rn[15:00],Rm[15:00])
2. GE3=GE2=cmn(Rn[31:16],Rm[15:00]); GE1=GE0=(Rn[15:00] >= Rm[31:16])
3. GE3=GE2=(Rn[31:16] >= Rm[15:00]); GE1=GE0=cmn(Rn[15:00],Rm[31:16])
4. GE3=GE2=(Rn[31:16] >= Rm[31:16]); GE1=GE0=(Rn[15:00] >= Rm[15:00])
5. GE3=cmn(Rn[31:24],Rm[31:24]); GE2=cmn(Rn[23:16],Rm[23:16]);
   GE1=cmn(Rn[15:08],Rm[15:08]); GE0=cmn(Rn[07:00],Rm[07:00])
6. GE3=(Rn[31:24] >= Rm[31:24]); GE2=(Rn[23:16] >= Rm[23:16]);
   GE1=(Rn[15:08] >= Rm[15:08]); GE0=(Rn[07:00] >= Rm[07:00])
Notes
■
If you specify the S prefix, then all comparisons are signed. The cmn(x, y)
function returns x >= -y, or equivalently whether x + y >= 0.
■
If you specify the U prefix, then all comparisons are unsigned. The cmn(x, y)
function returns x >= (unsigned)(-y), or equivalently whether the x + y operation
produces a carry.
■
Rd, Rn, and Rm must not be pc.
■
The X operations are useful for packed complex numbers. The following
examples assume bits [15:00] hold the real part and [31:16] the imaginary part.
Examples
SADD16    r0, r1, r2   ; Signed 16-bit SIMD add
SADDSUBX  r0, r1, r2   ; r0=r1+i*r2 in packed complex arithmetic
SSUBADDX  r0, r1, r2   ; r0=r1-i*r2 in packed complex arithmetic

SBC  Subtract with carry
1. SBC<cond>{S} Rd, Rn, #<rotated_immed>   ARMv1
2. SBC<cond>{S} Rd, Rn, Rm {, <shift>}     ARMv1
3. SBC          Ld, Lm                     THUMBv1
Effect on the cpsr
Action
1. Rd = Rn - <rotated_immed> - (~C) Updated if S suffix specified
2. Rd = Rn - <shifted_Rm> - (~C)
Updated if S suffix specified
3. Ld = Ld - Lm - (~C)
Updated (see Notes below)
Notes
■
If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z =
<Zero>, C = <NoUnsignedOverflow>, V = <SignedOverflow>. The carry flag is
set this way because the subtract x – y – ~C is implemented as the add x + ~y +
C. The carry flag is one if x + ~y + C overflows. This happens when x – y – ~C
doesn’t overflow.
■
If Rd is pc, then the instruction effects a jump to the calculated address. If the
operation updates the cpsr, then the processor mode must have an spsr. In this
case the cpsr is set to the value of the spsr.
■
If Rn or Rm is pc, then the value used is the address of the instruction plus eight
bytes.
The following example implements a 64-bit subtract:
SUBS  r0, r0, r2   ; subtract low words, C=NOT(borrow)
SBC   r1, r1, r3   ; subtract high words and borrow
SEL Select between two source operands based on the GE flags
1. SEL<cond> Rd, Rn, Rm
ARMv6
Action
1. Rd[31:24] = GE3 ? Rn[31:24] : Rm[31:24];
   Rd[23:16] = GE2 ? Rn[23:16] : Rm[23:16];
   Rd[15:08] = GE1 ? Rn[15:08] : Rm[15:08];
   Rd[07:00] = GE0 ? Rn[07:00] : Rm[07:00]
Notes
■
Rd, Rn, Rm must not be pc.
■
See SADD for instructions that set the GE flags in the cpsr.
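As an illustration (a sketch, not from the original text), a per-byte maximum can be built from an unsigned parallel subtract, which sets the GE flags, followed by SEL:

USUB8  r3, r0, r1   ; GE[i]=1 where byte i of r0 >= byte i of r1; r3 is discarded
SEL    r2, r0, r1   ; r2 = byte-wise max(r0, r1)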
SETEND Set the endianness for data accesses
1. SETEND BE
2. SETEND LE
ARMv6/THUMBv3
ARMv6/THUMBv3
Action
1. In the cpsr E=1 so data accesses will be big-endian
2. In the cpsr E=0 so data accesses will be little-endian
Note
■
ARMv6 uses a byte-invariant endianness model. This means that byte loads
and stores are not affected by the configured endianness. For little-endian data
accesses the byte at the lowest address appears in the least significant byte of
the loaded word. For big-endian data accesses the byte at the lowest address
appears in the most significant byte of the loaded word.
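For example (a sketch, not from the original text), code reading big-endian protocol data can bracket the accesses with SETEND instead of reversing bytes by hand:

SETEND  BE         ; data accesses are now big-endian
LDR     r0, [r1]   ; load a big-endian word directly
SETEND  LE         ; restore little-endian data accesses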
SHADD Parallel halving add and subtract operations
1. {S|U}HADD16<cond>    Rd, Rn, Rm   ARMv6
2. {S|U}HADDSUBX<cond>  Rd, Rn, Rm   ARMv6
3. {S|U}HSUBADDX<cond>  Rd, Rn, Rm   ARMv6
4. {S|U}HSUB16<cond>    Rd, Rn, Rm   ARMv6
5. {S|U}HADD8<cond>     Rd, Rn, Rm   ARMv6
6. {S|U}HSUB8<cond>     Rd, Rn, Rm   ARMv6
Action
1. Rd[31:16] = (Rn[31:16] + Rm[31:16])>>1; Rd[15:00] = (Rn[15:00] + Rm[15:00])>>1
2. Rd[31:16] = (Rn[31:16] + Rm[15:00])>>1; Rd[15:00] = (Rn[15:00] - Rm[31:16])>>1
3. Rd[31:16] = (Rn[31:16] - Rm[15:00])>>1; Rd[15:00] = (Rn[15:00] + Rm[31:16])>>1
4. Rd[31:16] = (Rn[31:16] - Rm[31:16])>>1; Rd[15:00] = (Rn[15:00] - Rm[15:00])>>1
5. Rd[31:24] = (Rn[31:24] + Rm[31:24])>>1; Rd[23:16] = (Rn[23:16] + Rm[23:16])>>1;
   Rd[15:08] = (Rn[15:08] + Rm[15:08])>>1; Rd[07:00] = (Rn[07:00] + Rm[07:00])>>1
6. Rd[31:24] = (Rn[31:24] - Rm[31:24])>>1; Rd[23:16] = (Rn[23:16] - Rm[23:16])>>1;
   Rd[15:08] = (Rn[15:08] - Rm[15:08])>>1; Rd[07:00] = (Rn[07:00] - Rm[07:00])>>1
Notes
■
If you use the S prefix, then all operations are signed and values are
sign-extended before the addition.
■
If you use the U prefix, then all operations are unsigned and values are
zero-extended before the addition.
■
Rd, Rn, and Rm must not be pc.
■
These operations provide parallel arithmetic that cannot overflow, which is
useful for DSP processing of normalized signals.
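For example (a sketch, not from the original text), UHADD8 averages four packed unsigned bytes, such as gray-scale pixels, in a single instruction:

UHADD8  r2, r0, r1   ; each byte of r2 = (byte of r0 + byte of r1)>>1, with no overflow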
SMLA, SMLS  Signed multiply accumulate instructions
1. SMLA<x><y><cond>    Rd, Rm, Rs, Rn       ARMv5E
2. SMLAW<y><cond>      Rd, Rm, Rs, Rn       ARMv5E
3. SMLAD{X}<cond>      Rd, Rm, Rs, Rn       ARMv6
4. SMLSD{X}<cond>      Rd, Rm, Rs, Rn       ARMv6
5. {U|S}MLAL<cond>{S}  RdLo, RdHi, Rm, Rs   ARMv3M
6. SMLAL<x><y><cond>   RdLo, RdHi, Rm, Rs   ARMv5E
7. SMLALD{X}<cond>     RdLo, RdHi, Rm, Rs   ARMv6
8. SMLSLD{X}<cond>     RdLo, RdHi, Rm, Rs   ARMv6
Action
1. Rd = Rn + (Rm.<x> * Rs.<y>)
2. Rd = Rn + (((signed)Rm * Rs.<y>)>>16)
3. Rd = Rn + Rm.B*<rotated_Rs>.B + Rm.T*<rotated_Rs>.T
4. Rd = Rn + Rm.B*<rotated_Rs>.B - Rm.T*<rotated_Rs>.T
5. RdHi:RdLo = RdHi:RdLo + (Rm * Rs)
6. RdHi:RdLo = RdHi:RdLo + (Rm.<x> * Rs.<y>)
7. RdHi:RdLo = RdHi:RdLo + Rm.B*<rotated_Rs>.B + Rm.T*<rotated_Rs>.T
8. RdHi:RdLo = RdHi:RdLo + Rm.B*<rotated_Rs>.B - Rm.T*<rotated_Rs>.T
Notes
■
<x> and <y> can be B or T.
■
Rm.B is shorthand for (sign-extend)Rm[15:00], the bottom 16 bits of Rm.
■
Rm.T is shorthand for (sign-extend)Rm[31:16], the top 16 bits of Rm.
■
<rotated_Rs> is Rs if you do not specify the X suffix or Rs ROR 16 if you do
specify the X suffix.
■
RdHi and RdLo must be different registers. For format 5, Rm must be a
different register from RdHi and RdLo.
■
Formats 1 to 4 update the cpsr Q-flag: Q = Q | <SignedOverflow>.
■
Format 5 implements an unsigned multiply with the U prefix or a signed
multiply with the S prefix.
■
Format 5 updates the cpsr if the S suffix is present: N = RdHi[31], Z = ( RdHi==0
&& RdLo==0); the C and V flags are unpredictable. Avoid using {U|S}MLALS
because implementations often impose penalty cycles for this operation.
■
Implementations may terminate early on the value of Rs. For this reason use
small or constant values for Rs where possible.
■
The X suffix and multiply subtract versions are useful for packed complex
numbers. The following examples assume bits [15:00] hold the real part and
[31:16] the imaginary part.
Examples
SMLABB   r0, r1, r2, r0   ; r0 += (short)r1 * (short)r2
SMLABT   r0, r1, r2, r0   ; r0 += (short)r1 * ((signed)r2>>16)
SMLAWB   r0, r1, r2, r0   ; r0 += (r1*(short)r2)>>16
SMLAL    r0, r1, r2, r3   ; acc += r2*r3, acc is 64 bits [r1:r0]
SMLALTB  r0, r1, r2, r3   ; acc += ((signed)r2>>16)*((short)r3)
SMLSD    r0, r1, r2, r0   ; r0 += real(r1*r2) in complex maths
SMLADX   r0, r1, r2, r0   ; r0 += imag(r1*r2) in complex maths
SMMUL, SMMLA, SMMLS  Signed most significant word multiply instructions
1. SMMUL{R}<cond>  Rd, Rm, Rs       ARMv6
2. SMMLA{R}<cond>  Rd, Rm, Rs, Rn   ARMv6
3. SMMLS{R}<cond>  Rd, Rm, Rs, Rn   ARMv6
Action
1. Rd = ((signed)Rm*(signed)Rs + round)>>32
2. Rd = ((Rn << 32) + (signed)Rm*(signed)Rs + round)>>32
3. Rd = ((Rn << 32) - (signed)Rm*(signed)Rs + round)>>32
Notes
■
If you specify the R suffix, then round = 2^31; otherwise, round = 0.
■
Rd, Rm, Rs, and Rn must not be pc.
■
Implementations may terminate early on the value of Rs.
■
For 32-bit DSP algorithms these operations have several advantages over
using the high result register from SMLAL: They often take fewer cycles than
SMLAL. They also implement rounding, multiply subtract, and don’t require
a temporary scratch register for the low 32 bits of result.
Example
SMMULR  r0, r1, r2   ; r0 = r1*r2/2 using Q31 arithmetic

SMUL, SMUAD, SMUSD  Signed multiply instructions
1. SMUL<x><y><cond>    Rd, Rm, Rs           ARMv5E
2. SMULW<y><cond>      Rd, Rm, Rs           ARMv5E
3. SMUAD{X}<cond>      Rd, Rm, Rs           ARMv6
4. SMUSD{X}<cond>      Rd, Rm, Rs           ARMv6
5. {U|S}MULL<cond>{S}  RdLo, RdHi, Rm, Rs   ARMv3M
Action
1. Rd = Rm.<x> * Rs.<y>
2. Rd = (Rm * Rs.<y>)>>16
3. Rd = Rm.B*<rotated_Rs>.B + Rm.T*<rotated_Rs>.T
4. Rd = Rm.B*<rotated_Rs>.B - Rm.T*<rotated_Rs>.T
5. RdHi:RdLo = Rm*Rs
Notes
■
<x> and <y> can be B or T.
■
Rm.B is shorthand for (sign-extend)Rm[15:00], the bottom 16 bits of Rm.
■
Rm.T is shorthand for (sign-extend)Rm[31:16], the top 16 bits of Rm.
■
<rotated_Rs> is Rs if you do not specify the X suffix or Rs ROR 16 if you do
specify the X suffix.
■
RdHi and RdLo must be different registers. For format 5, Rm must be a
different register from RdHi and RdLo.
■
Format 4 updates the cpsr Q-flag: Q = Q | <SignedOverflow>.
■
Format 5 implements an unsigned multiply with the U prefix or a signed
multiply with the S prefix.
■
Format 5 updates the cpsr if the S suffix is present: N = RdHi[31], Z = ( RdHi==0
&& RdLo==0); the C and V flags are unpredictable. Avoid using {S|U}MULLS
because implementations often impose penalty cycles for this operation.
■
Implementations may terminate early on the value of Rs. For this reason use
small or constant values for Rs where possible.
■
The X suffix and multiply subtract versions are useful for packed complex
numbers. The following examples assume bits [15:00] hold the real part and
[31:16] the imaginary part.
Examples
SMULBB  r0, r1, r2       ; r0 = (short)r1 * (short)r2
SMULBT  r0, r1, r2       ; r0 = (short)r1 * ((signed)r2>>16)
SMULWB  r0, r1, r2       ; r0 = (r1*(short)r2)>>16
SMULL   r0, r1, r2, r3   ; acc = r2*r3, acc is 64 bits [r1:r0]
SMUADX  r0, r1, r2       ; r0 = imag(r1*r2) in complex maths

SRS  Save return state
1. SRS<amode> #<mode>{!}   ARMv6
This performs the operation that STM<amode> sp_<mode>{!} , {lr, spsr} would
perform if STM allowed a register list of {lr, spsr} and allowed you to reference the
stack pointer of a different mode. See the entry for STM.
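SRS pairs naturally with RFE in an ARMv6 exception handler. The following is an illustrative sketch (not from the original text; the label irq_handler is hypothetical) that stays in IRQ mode, mode number 0x12, and uses a full descending stack:

irq_handler
        SUB    lr, lr, #4   ; adjust the IRQ return address
        SRSFD  #0x12!       ; push lr_irq and spsr_irq onto the IRQ-mode stack
        ; ... handle the interrupt ...
        RFEFD  sp!          ; return: pc and cpsr are reloaded from the stack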
SSAT Saturate to n bits
1. {S|U}SAT<cond>
Rd, #<n>, Rm {, LSL#<0-31>}
2. {S|U}SAT<cond>
Rd, #<n>, Rm {, ASR#<1-32>}
3. {S|U}SAT16<cond> Rd, #<n>, Rm
Action                              Effect on the cpsr
1. Rd = sat(<shifted_Rm>, n)        Q=Q | 1 if saturation occurred
2. Rd = sat(<shifted_Rm>, n)        Q=Q | 1 if saturation occurred
3. Rd[31:16] = sat(Rm[31:16], n);   Q=Q | 1 if saturation occurred
   Rd[15:00] = sat(Rm[15:00], n)
Notes
■
If you specify the S prefix, then sat(x, n) saturates the signed value x to a signed
n-bit value in the range -2^(n-1) <= x <= 2^(n-1)-1. n is encoded as 1 + <immed5> for
SAT and 1 + <immed4> for SAT16.
■
If you specify the U prefix, then sat(x, n) saturates the signed value x to an
unsigned n-bit value in the range 0 <= x <= 2^n-1. n is encoded as <immed5> for
SAT and <immed4> for SAT16.
■
Rd and Rm must not be pc.
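For example (a sketch, not from the original text):

SSAT  r0, #16, r1          ; clamp r1 to the signed 16-bit range [-32768, 32767]
USAT  r0, #8, r1, ASR #4   ; r0 = (r1 >> 4) clamped to the unsigned 8-bit range [0, 255]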
SSUB Signed parallel subtract (see SADD)
STC Store to coprocessor single or multiple 32-bit values
1. STC<cond>{L}  <copro>, Cd, [Rn {, #{-}<immed8>*4}]{!}   ARMv2
2. STC<cond>{L}  <copro>, Cd, [Rn], #{-}<immed8>*4         ARMv2
3. STC<cond>{L}  <copro>, Cd, [Rn], <option>               ARMv2
4. STC2{L}       <copro>, Cd, [Rn {, #{-}<immed8>*4}]{!}   ARMv5
5. STC2{L}       <copro>, Cd, [Rn], #{-}<immed8>*4         ARMv5
6. STC2{L}       <copro>, Cd, [Rn], <option>               ARMv5
These instructions initiate a memory write, transferring data to memory from the
given coprocessor. <copro> is the number of the coprocessor in the range p0 to
p15. The core takes an undefined instruction trap if the coprocessor is not present.
The memory write consists of a sequence of words to sequentially increasing
addresses. The initial address is specified by the addressing mode in Table B1.10.
The coprocessor controls the number of words transferred, up to a maximum
limit of 16 words. The fields {L} and Cd are interpreted by the coprocessor and
ignored by the ARM. Typically Cd specifies the source coprocessor register for the
transfer. The <option> field is an eight-bit integer enclosed in {}. Its interpretation
is coprocessor dependent.
If the address is not a multiple of four, then the access is unaligned. The
restrictions on an unaligned access are the same as for STM.
TABLE B1.10  STC addressing modes.

Addressing format       Address accessed      Value written back to Rn
[Rn {, #{-}<immed>}]    Rn + {{-}<immed>}     Rn preserved
[Rn {, #{-}<immed>}]!   Rn + {{-}<immed>}     Rn + {{-}<immed>}
[Rn], #{-}<immed>       Rn                    Rn + {-}<immed>
[Rn], <option>          Rn                    Rn preserved
STM  Store multiple 32-bit registers to memory
1. STM<cond><amode>  Rn{!}, <register_list>{^}   ARMv1
2. STMIA             Rn!, <register_list>        THUMBv1
These instructions store multiple words to sequential memory addresses. The <register_
list> specifies a list of registers to store, enclosed in curly brackets {}. Although the
assembler allows you to specify the registers in the list in any order, the order is not
stored in the instruction, so it is good practice to write the list in increasing order of
register number since this is the usual order of the memory transfer.
The following pseudocode shows the normal action of STM. We use <register_
list>[i] to denote the register appearing at position i in the list starting at
0 for the first register. This assumes that the list is in order of increasing register
number.
N = the number of registers in <register_list>
start = the lowest address accessed given in Table B1.11
for (i=0; i<N; i++)
memory(start+i*4, 4) = <register_list>[i];
if (! specified) then update Rn according to Table B1.11
Note that memory(a, 4) refers to the four bytes at address a packed according
to the current processor data endianness. If a is not a multiple of four, then the store
is unaligned. Because the behavior of an unaligned store depends on the architecture
revision, memory system, and system coprocessor (CP15) configuration, it is best to
avoid unaligned stores if possible. Assuming that the external memory system does not
abort unaligned stores, then the following rules usually apply:
■
If the core has a system coprocessor and bit 1 ( A-bit) or bit 22 ( U-bit) of CP15:
c1:c0:0 is set, then unaligned store-multiples cause an alignment fault data abort
exception.
■
Otherwise, the access ignores the bottom two address bits.
Table B1.11 lists the possible addressing modes specified by <amode>. If you specify
the !, then the base address register is updated according to Table B1.11; otherwise,
it is preserved. Note that the lowest register number is always written to the lowest
address.
TABLE B1.11  STM addressing modes.

Addressing   Lowest address   Highest address   Value written back
mode         accessed         accessed          to Rn if ! specified
{IA|EA}      Rn               Rn + N*4 - 4      Rn + N*4
{IB|FA}      Rn + 4           Rn + N*4          Rn + N*4
{DA|ED}      Rn - N*4 + 4     Rn                Rn - N*4
{DB|FD}      Rn - N*4         Rn - 4            Rn - N*4
The first half of the addressing mode mnemonics stands for Increment After,
Increment Before, Decrement After, and Decrement Before, respectively. Increment
modes store the registers sequentially forward starting from address Rn (increment
after) or Rn + 4 (increment before). Decrement modes have the same effect as if you
stored the register list backwards to sequentially descending memory addresses starting
from address Rn (decrement after) or Rn - 4 (decrement before).
The second half of the addressing mode mnemonics stands for the stack type you
can implement with that address mode: Full Descending, Empty Descending, Full
Ascending, and Empty Ascending. With a full stack, Rn points to the last stacked value.
With an empty stack, Rn points to the first unused stack location. ARM stacks are
usually full descending. You should use full descending or empty ascending stacks by
preference, since STC also supports these addressing modes.
Notes
■
For Thumb (format 2), Rn and the register list registers must be in the range r0 to r7.
■
The number of registers N in the list must be nonzero.
■
Rn must not be pc.
■
If Rn appears in the register list and ! (writeback) is specified, the behavior is as
follows: If Rn is the lowest register number in the list, then the original value is
stored; otherwise, the stored value is unpredictable.
■
If pc appears in the register list, then the value stored is implementation defined.
■
If ^ is specified, then the operation is modified. The processor must not be in
user or system mode. The registers appearing in the register list refer to the user
mode versions of the registers, and writeback must not be specified.
■
The time order of the memory accesses may depend on the implementation. Be
careful when using a store multiple to access I/O locations where the access order
matters. If the order matters, then check that the memory locations are marked
as I/O in the page tables. Do not cross page boundaries, and do not use pc in the
register list.
Examples
STMIA    r4!, {r0, r1}   ; *r4=r0, *(r4+4)=r1, r4+=8
STMDB    r4!, {r0, r1}   ; *(r4-4)=r1, *(r4-8)=r0, r4-=8
STMEQFD  sp!, {r0, lr}   ; if (result zero) then stack r0, lr
STMFD    sp, {sp}^       ; store sp_usr on stack sp_current

STR  Store a single value to a virtual address in memory
1.  STR<cond>{|B}     Rd, [Rn {, #{-}<immed12>}]{!}       ARMv1
2.  STR<cond>{|B}     Rd, [Rn, {-}Rm {,<imm_shift>}]{!}   ARMv1
3.  STR<cond>{|B}{T}  Rd, [Rn], #{-}<immed12>             ARMv1
4.  STR<cond>{|B}{T}  Rd, [Rn], {-}Rm {,<imm_shift>}      ARMv1
5.  STR<cond>{H}      Rd, [Rn {, #{-}<immed8>}]{!}        ARMv4
6.  STR<cond>{H}      Rd, [Rn, {-}Rm]{!}                  ARMv4
7.  STR<cond>{H}      Rd, [Rn], #{-}<immed8>              ARMv4
8.  STR<cond>{H}      Rd, [Rn], {-}Rm                     ARMv4
9.  STR<cond>D        Rd, [Rn {, #{-}<immed8>}]{!}        ARMv5E
10. STR<cond>D        Rd, [Rn, {-}Rm]{!}                  ARMv5E
11. STR<cond>D        Rd, [Rn], #{-}<immed8>              ARMv5E
12. STR<cond>D        Rd, [Rn], {-}Rm                     ARMv5E
13. STREX<cond>       Rd, Rm, [Rn]                        ARMv6
14. STR{|B|H}         Ld, [Ln, #<immed5>*<size>]          THUMBv1
15. STR{|B|H}         Ld, [Ln, Lm]                        THUMBv1
16. STR               Ld, [sp, #<immed8>*4]               THUMBv1
17. STR<cond><type>   Rd, <label>                         MACRO
Formats 1 to 16 store a single data item of the type specified by the opcode suffix,
using a preindexed or postindexed addressing mode. Tables B1.12 and B1.13 show the
different addressing modes and data types.
In Table B1.13, memory(a, n) refers to n sequential bytes at address a. The bytes are
packed according to the configured processor data endianness. memoryT(a, n) performs
the access with user mode privileges, regardless of the current processor mode. The
action of the function IsExclusive(a) used by STREX depends on address a. If a has the
shared TLB attribute, then IsExclusive(a) is true if address a is marked as exclusive for
this processor. It then clears any exclusive accesses on this processor and any exclusive
accesses to address a on other processors in the system. If a does not have the shared
TLB attribute, then IsExclusive(a) is true if there is an outstanding exclusive access
on this processor. It then clears any such outstanding access.
TABLE B1.12  STR addressing modes.

Addressing format         Address a accessed      Value written back to Rn
[Rn {,#{-}<immed>}]       Rn + {{-}<immed>}       Rn preserved
[Rn {,#{-}<immed>}]!      Rn + {{-}<immed>}       Rn + {{-}<immed>}
[Rn, {-}Rm {,<shift>}]    Rn + {-}<shifted_Rm>    Rn preserved
[Rn, {-}Rm {,<shift>}]!   Rn + {-}<shifted_Rm>    Rn + {-}<shifted_Rm>
[Rn], #{-}<immed>         Rn                      Rn + {-}<immed>
[Rn], {-}Rm {,<shift>}    Rn                      Rn + {-}<shifted_Rm>
TABLE B1.13  STR data types.

Store   Datatype            <size> (bytes)   Action
STR     word                4                memory(a, 4) = Rd
STRB    unsigned Byte       1                memory(a, 1) = (char)Rd
STRBT   Byte Translated     1                memoryT(a, 1) = (char)Rd
STRD    Double word         8                memory(a, 4) = Rd;
                                             memory(a+4, 4) = R(d+1)
STREX   word EXclusive      4                if (IsExclusive(a)) {
                                               memory(a, 4) = Rm;
                                               Rd = 0;
                                             } else {
                                               Rd = 1;
                                             }
STRH    unsigned Halfword   2                memory(a, 2) = (short)Rd
STRT    word Translated     4                memoryT(a, 4) = Rd
If the address a is not a multiple of <size>, then the store is unaligned. Because the
behavior of an unaligned store depends on the architecture revision, memory system,
and system coprocessor (CP15) configuration, it is best to avoid unaligned stores if
possible. Assuming that the external memory system does not abort unaligned stores,
then the following rules usually apply. In the rules, A is bit 1 of system coprocessor
register CP15:c1:c0:0, and U is bit 22 of CP15:c1:c0:0, introduced in ARMv6. If there is
no system coprocessor, then A = U = 0.
■
If A = 1, then unaligned stores cause an alignment fault data abort exception
except that word-aligned double-word stores are supported if U = 1.
■
If A = 0 and U = 1, then unaligned stores are supported for STR{|T|H}. Word-aligned
stores are supported for STRD. A non-word-aligned STRD generates an
alignment fault data abort.
■
If A = 0 and U = 0, then STR and STRT write to memory(a & ~3, 4). All other
unaligned operations are unpredictable but do not cause an alignment fault.
Format 17 generates a pc -relative store accessing the address specified by <label> . In
other words it assembles to STR<cond><type> Rd, [pc, #<offset>] whenever this
instruction is supported and <offset>=<label>-pc is in range.
Notes
■
For double-word stores (formats 9 to 12), Rd must be even and in the range r0
to r12.
■
If the addressing mode updates Rn, then Rd and Rn must be distinct.
■
If Rd is pc, then <size> must be 4. The value stored is implementation defined.
■
If Rn is pc, then the addressing mode must not update Rn . The value used for Rn
is the address of the instruction plus eight bytes.
■
Rm must not be pc.
Examples
STR   r0, [r0]                ; *(int*)r0 = r0;
STRH  r0, [r1], #4            ; *(short*)r1 = r0; r1+=4;
STRD  r2, [r1, #-8]!          ; r1-=8; *(int*)r1=r2; *(int*)(r1+4)=r3
STRB  r0, [r2, #55]           ; *(char*)(r2+55) = r0;
STRB  r0, [r1], -r2, LSL #8   ; *(char*)r1 = r0; r1-=256*r2;

SUB  Subtract two 32-bit values
1. SUB<cond>{S}  Rd, Rn, #<rotated_immed>   ARMv1
2. SUB<cond>{S}  Rd, Rn, Rm {, <shift>}     ARMv1
3. SUB           Ld, Ln, #<immed3>          THUMBv1
4. SUB           Ld, #<immed8>              THUMBv1
5. SUB           Ld, Ln, Lm                 THUMBv1
6. SUB           sp, #<immed7>*4            THUMBv1

Action                         Effect on the cpsr
1. Rd = Rn - <rotated_immed>   Updated if S suffix specified
2. Rd = Rn - <shifted_Rm>      Updated if S suffix specified
3. Ld = Ln - <immed3>          Updated (see Notes below)
4. Ld = Ld - <immed8>          Updated (see Notes below)
5. Ld = Ln - Lm                Updated (see Notes below)
6. sp = sp - <immed7>*4        Preserved
Notes
■
If the operation updates the cpsr and Rd is not pc, then N = <Negative>, Z
= <Zero>, C = <NoUnsignedOverflow>, and V = <SignedOverflow>. The
carry flag is set this way because the subtract x - y is implemented as the add
x + ~y + 1. The carry flag is one if x + ~y + 1 overflows. This happens when
x >= y, when x - y doesn't overflow.
■
If Rd is pc, then the instruction effects a jump to the calculated address. If the
operation updates the cpsr, then the processor mode must have an spsr; in this
case, the cpsr is set to the value of the spsr.
■
If Rn or Rm is pc, then the value used is the address of the instruction plus eight
bytes.
Examples
SUBS  r0, r0, #1           ; r0 -= 1, setting flags
SUB   r0, r1, r1, LSL #2   ; r0 = -3*r1
SUBS  pc, lr, #4           ; jump to lr-4, set cpsr=spsr

SWI  Software interrupt
1. SWI<cond> <immed24>   ARMv1
2. SWI <immed8>          THUMBv1
The SWI instruction causes the ARM to enter supervisor mode and start executing
from the SWI vector. The return address and cpsr are saved in lr_svc and spsr_svc,
respectively. The processor switches to ARM state and IRQ interrupts are disabled. The
SWI vector is at address 0x00000008, unless high vectors are configured; then it is at
address 0xFFFF0008.
The immediate operand is ignored by the ARM. It is normally used by the SWI
exception handler as an argument determining which function to perform.
Example
SWI 0x123456 ; Used by the ARM tools to implement Semi-Hosting
SWP
Swap a word in memory with a register, without interruption
1. SWP<cond> Rd, Rm, [Rn]
2. SWP<cond>B Rd, Rm, [Rn]
ARMv2a
ARMv2a
Action
1. temp=memory(Rn,4); memory(Rn,4)=Rm; Rd=temp;
2. temp=(zero extend)memory(Rn,1); memory(Rn,1)=(char)Rm; Rd=temp;
Notes
■
The operations are atomic. They cannot be interrupted partway through.
■
Rd, Rm, Rn must not be pc.
■
Rn and Rm must be different registers. Rn and Rd must be different registers.
■
Rn should be aligned to the size of the memory transfer.
■
If a data abort occurs on the load, then the store does not occur. If a data abort
occurs on the store, then Rd is not written.
You can use the SWP instruction to implement 8-bit or 32-bit semaphores on ARMv5
and below. For ARMv6 use LDREX and STREX in preference. As an example, suppose
a byte semaphore register pointed to by r1 can have the value 0xFF (claimed) or 0x00
(free). The following example claims the lock. If the lock is already claimed, then the
code loops, waiting for an interrupt or task switch that will free the lock.
      MOV   r0, #0xFF      ; value to claim the lock
loop  SWPB  r0, r0, [r1]   ; try and claim the lock
      CMP   r0, #0xFF      ; check to see if it was already claimed
      BEQ   loop           ; if so wait for it to become free
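On ARMv6 the same lock can be claimed with LDREX and STREX instead. The following sketch (not from the original text; the label try is ours) treats the lock as a word at the address in r1; STREX writes 0 to its status register only if the exclusive access was still intact:

      MOV    r0, #0xFF      ; value to claim the lock
try   LDREX  r2, [r1]       ; read the lock and mark an exclusive access
      CMP    r2, #0xFF      ; already claimed?
      BEQ    try            ; if so, wait for it to become free
      STREX  r2, r0, [r1]   ; try to claim it: r2=0 on success, 1 on failure
      CMP    r2, #0
      BNE    try            ; the exclusive access was lost, so retry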
SXT, SXTA  Byte or halfword extract or extract with accumulate
1. {S|U}XTB16<cond>   Rd, Rm {, ROR#8*<rot>}       ARMv6
2. {S|U}XTB<cond>     Rd, Rm {, ROR#8*<rot>}       ARMv6
3. {S|U}XTH<cond>     Rd, Rm {, ROR#8*<rot>}       ARMv6
4. {S|U}XTAB16<cond>  Rd, Rn, Rm {, ROR#8*<rot>}   ARMv6
5. {S|U}XTAB<cond>    Rd, Rn, Rm {, ROR#8*<rot>}   ARMv6
6. {S|U}XTAH<cond>    Rd, Rn, Rm {, ROR#8*<rot>}   ARMv6
7. {S|U}XTB           Ld, Lm                       THUMBv3
8. {S|U}XTH           Ld, Lm                       THUMBv3
Action
1. Rd[31:16] = extend(<shifted_Rm>[23:16]); Rd[15:00] = extend(<shifted_Rm>[07:00])
2. Rd = extend(<shifted_Rm>[07:00])
3. Rd = extend(<shifted_Rm>[15:00])
4. Rd[31:16] = Rn[31:16] + extend(<shifted_Rm>[23:16]);
   Rd[15:00] = Rn[15:00] + extend(<shifted_Rm>[07:00])
5. Rd = Rn + extend(<shifted_Rm>[07:00])
6. Rd = Rn + extend(<shifted_Rm>[15:00])
7. Ld = extend(Lm[07:00])
8. Ld = extend(Lm[15:00])
Notes
■
If you specify the S prefix, then extend( x ) sign extends x.
■
If you specify the U prefix, then extend( x ) zero extends x.
■
Rd and Rm must not be pc.
■
<rot> is an immediate in the range 0 to 3.
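For example (a sketch, not from the original text):

SXTB   r0, r1            ; r0 = (signed char)r1
UXTH   r0, r1, ROR #16   ; r0 = zero-extended r1[31:16]
UXTAB  r0, r2, r3        ; r0 = r2 + (unsigned char)r3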
TEQ
Test for equality of two 32-bit values
1. TEQ<cond> Rn, #<rotated_immed>
2. TEQ<cond> Rn, Rm {, <shift>}
ARMv1
ARMv1
Action
1. Set the cpsr on the result of (Rn ^ <rotated_immed>)
2. Set the cpsr on the result of (Rn ^ <shifted_Rm>)
Notes
■
The cpsr is updated: N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table
B1.3).
■
If Rn or Rm is pc, then the value used is the address of the instruction plus eight
bytes.
■
Use this instruction instead of CMP when you want to check for equality and
preserve the carry flag.
Example
TEQ  r0, #1   ; test to see if r0==1
TST Test bits of a 32-bit value
1. TST<cond> Rn, #<rotated_immed>
2. TST<cond> Rn, Rm {, <shift>}
3. TST
Ln, Lm
ARMv1
ARMv1
THUMBv1
Action
1. Set the cpsr on the result of (Rn & <rotated_immed>)
2. Set the cpsr on the result of (Rn & <shifted_Rm>)
3. Set the cpsr on the result of (Ln & Lm)
Notes
■
The cpsr is updated: N = <Negative>, Z = <Zero>, C = <shifter_C> (see Table B1.3).
■
If Rn or Rm is pc, then the value used is the address of the instruction plus eight bytes.
■
Use this instruction to test whether a selected set of bits are all zero.
Example
TST  r0, #0xFF   ; test if the bottom 8 bits of r0 are 0
UADD
Unsigned parallel modulo add (see the entry for SADD)
UHADD
UHSUB
Unsigned halving add and subtract (see the entry for SHADD)
UMAAL
Unsigned multiply accumulate accumulate long
1. UMAAL<cond> RdLo, RdHi, Rm, Rs
ARMv6
Action
1. RdHi:RdLo = (unsigned)Rm*Rs + (unsigned)RdLo + (unsigned)RdHi
Notes
■
RdHi and RdLo must be different registers.
■
RdHi, RdLo, Rm, Rs must not be pc.
■
This operation cannot overflow because (2^32 - 1)*(2^32 - 1) + (2^32 - 1) +
(2^32 - 1) = 2^64 - 1. You can use it to synthesize the multiword multiplications
used by public key cryptosystems.
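For example (a sketch, not from the original text), one inner step of a multiword multiply can fold the word product, the running partial-sum word, and the incoming carry word into a single instruction:

UMAAL  r4, r5, r2, r3   ; r5:r4 = r2*r3 + r4 + r5 (r4 = partial sum word, r5 = carry word)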
UMLAL
UMULL
Unsigned long multiply and multiply accumulate (see the SMLAL and
SMULL entries)
UQADD
UQSUB
Unsigned saturated add and subtract (see the QADD entry)
USAD
Unsigned sum of absolute differences
1. USAD8<cond>
2. USADA8<cond>
Rd, Rm, Rs
Rd, Rm, Rs, Rn
ARMv6
ARMv6
Action
1. Rd =      abs(Rm[31:24]-Rs[31:24]) + abs(Rm[23:16]-Rs[23:16])
           + abs(Rm[15:08]-Rs[15:08]) + abs(Rm[07:00]-Rs[07:00])
2. Rd = Rn + abs(Rm[31:24]-Rs[31:24]) + abs(Rm[23:16]-Rs[23:16])
           + abs(Rm[15:08]-Rs[15:08]) + abs(Rm[07:00]-Rs[07:00])
Notes
■
abs( x ) returns the absolute value of x. Rm and Rs are treated as unsigned.
■
Rd, Rm, and Rs must not be pc.
■
The sum of absolute differences operation is common in video codecs where it
provides a metric to measure how similar two images are.
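For example (a sketch, not from the original text), accumulating the metric over a row of packed pixels:

USAD8   r2, r0, r1       ; r2 = sum of absolute differences of the four bytes in r0 and r1
USADA8  r4, r0, r1, r4   ; r4 += sum of absolute differences of the four bytes in r0 and r1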
USAT
Unsigned saturation instruction (see the SSAT entry)
USUB
Unsigned parallel modulo subtracts (see the SADD entry)
UXT
UXTA
Unsigned extract, extract with accumulate (see the entry for SXT)
B1.4
ARM Assembler Quick Reference
This section summarizes the more useful commands and expressions available with
the ARM assembler, armasm. Each assembly line has one of the following formats:
{<label>} {<instruction>}
; comment
{<symbol>} <directive>
; comment
{<arg_0>} <macro> {<arg_1>} {,<arg_2>} .. {,<arg_n>} ; comment
where
■
<instruction> is any ARM or Thumb instruction supported by the processor
you are assembling for. See Section B1.3.
■
<label> is the name of a symbol to store the address of the instruction.
■
<directive> is an ARM assembler directive. See Section ARM Assembler Directives.
■
<symbol> is the name of a symbol used by the <directive>.
■
<macro> is the name of a new directive defined using the MACRO
directive.
■
<arg_k> is the kth macro argument.
You must use an AREA directive to define an area before any ARM or Thumb
instructions appear. All assembly files must finish with the END directive. The
following example shows a simple assembly file defining a function add that returns
the sum of the two input arguments:
        AREA    maths_routines, CODE, READONLY
        EXPORT  add               ; give the symbol add external linkage
add     ADD     r0, r0, r1        ; add input arguments
        MOV     pc, lr            ; return from sub-routine
        END
ARM Assembler Variables
The ARM assembler supports three types of assemble time variables (see Table
B1.14). Variable names are case sensitive and must be declared before use with the
directives GBLx or LCLx.
TABLE B1.14  ARM assembler variable types.

Variable type             Declare globally   Declare locally to a macro   Set value   Example values
Unsigned 32-bit integer   GBLA               LCLA                         SETA        15, 0xab
ASCII string              GBLS               LCLS                         SETS        "", "ADD"
Logical                   GBLL               LCLL                         SETL        {TRUE}, {FALSE}
You can use variables in expressions (see Section ARM Assembler Labels), or substitute
their value at assembly time using the $ operator. Specifically, $name. expands to the
value of the variable name before the line is assembled. You can omit the final period
if name is not followed by an alphanumeric or underscore. Use $$ to produce a single
$. Arithmetic variables expand to an eight-digit hexadecimal string on substitution.
Logical variables expand to T or F.
The following example code shows how to declare and substitute variables of
each type:
; arithmetic variables
        GBLA    count            ; declare an integer variable count
count   SETA    1                ; set count = 1
        WHILE   count<15
        BL      test$count       ; call test00000001, test00000002 ...
count   SETA    count+1          ; ... test0000000E
        WEND

; string variables
        GBLS    cc               ; declare a string variable called cc
cc      SETS    "NE"             ; set cc="NE"
        ADD$cc  r0, r0, r0       ; assembles as ADDNE r0,r0,r0
        STR$cc.B r0, [r1]        ; assembles as STRNEB r0,[r1]

; logical variable
        GBLL    debug            ; declare a logical variable called debug
debug   SETL    {TRUE}           ; set debug={TRUE}
        IF debug                 ; if debug is TRUE then
        BL      print_debug      ; print out some debug information
        ENDIF
ARM Assembler Labels
A label definition must begin on the first character of a line. The assembler treats
indented text as an instruction, directive, or macro. It treats labels of the form
<N><name> as a local label, where <N> is an integer in the range 0 to 99 and <name>
is an optional textual name. Local labels are limited in scope by the ROUT directive.
To reference a local label, you refer to it as %{|F|B}{|A|T}<N>{<name>}. The
extra prefix letters tell the assembler how to search for the label:
■
If you specify F, the assembler searches forward; if B, then the assembler
searches backwards. Otherwise the assembler searches backwards and then
forwards.
■
If you specify T, the assembler searches the current macro only; if A, then
the assembler searches all macro levels. Otherwise the assembler searches the
current and higher macro nesting levels.
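For example (a sketch, not from the original text), a simple countdown loop using local label 1:

        MOV   r2, #10
1       SUBS  r2, r2, #1   ; local label 1 is defined here
        BNE   %B1          ; branch back to the most recent definition of label 1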
ARM Assembler Expressions
The ARM assembler can evaluate a number of numeric, string, and logical
expressions at assembly time. Table B1.15 shows some of the unary and binary
operators you can use within expressions. Brackets can be used to change the order
of evaluation in the usual way.
TABLE B1.15  ARM assembler unary and binary operators.

Expression                    Result                                      Example
A+B, A-B                      A plus or minus B                           1-2 = 0xffffffff
A*B, A/B                      A multiplied by or divided by B             2*3 = 6, 7/3 = 2
A:MOD:B                       A modulo B                                  7:MOD:3 = 1
:CHR:A                        string with ASCII code A                    :CHR:32 = " "
'X'                           the ASCII value of X                        'a' = 0x61
:STR:A, :STR:L                A or L converted to a string                :STR:32 = "00000020",
                                                                          :STR:{TRUE} = "T"
A<<B, A:SHL:B                 A shifted left by B bits                    1 << 3 = 8
A>>B, A:SHR:B                 A shifted right by B bits (logical shift)   0x80000000 >> 4 = 0x08000000
A:ROR:B, A:ROL:B              A rotated right/left by B bits              1:ROR:1 = 0x80000000,
                                                                          0x80000000:ROL:1 = 1
A=B, A>B, A>=B, A<B,          comparison of arithmetic or string          (1=2) = {FALSE}, (1<2) = {TRUE},
A<=B, A/=B, A<>B              variables (/= and <> both mean not equal)   ("a"="c") = {FALSE}, ("a"<"c") = {TRUE}
A:AND:B, A:OR:B,              bitwise AND, OR, exclusive OR of A and B;   1:AND:3 = 1, 1:OR:3 = 3,
A:EOR:B, :NOT:A               bitwise NOT of A                            :NOT:0 = 0xFFFFFFFF
:LEN:S                        length of the string S                      :LEN:"ABC" = 3
S:LEFT:B, S:RIGHT:B           leftmost or rightmost B characters of S     "ABC":LEFT:2 = "AB", "ABC":RIGHT:2 = "BC"
S:CC:T                        the concatenation of S, T                   "AB":CC:"C" = "ABC"
L:LAND:M, L:LOR:M, L:LEOR:M   logical AND, OR, exclusive OR of L and M    {TRUE}:LAND:{FALSE} = {FALSE}
:DEF:X                        returns {TRUE} if a variable called X is defined
:BASE:A, :INDEX:A             see the MAP directive
TABLE B1.16  Predefined expressions.

Variable                 Value
{ARCHITECTURE}           The ARM architecture of the CPU ("4T" for ARMv4T)
{ARMASM_VERSION}         The assembler version number
{CONFIG} or {CODESIZE}   The bit width of the instructions being assembled (32 for ARM state, 16 for Thumb state)
{CPU}                    The name of the CPU being assembled for
{ENDIAN}                 The configured endianness, "big" or "little"
{INTER}                  {TRUE} if ARM/Thumb interworking is on
{PC} (alias .)           The address of the current instruction being assembled
{ROPI}, {RWPI}           {TRUE} if read-only/read-write position independent
{VAR} (alias @)          The MAP counter (see the MAP directive)
In Table B1.15, A and B represent arbitrary integers; S and T, strings; and L and M,
logical values. You can use labels and other symbols in place of integers in many
expressions.
Predefined Variables
Table B1.16 shows a number of special variables that can appear in expressions.
These are predefined by the assembler, and you cannot override them.
ARM Assembler Directives
Here is an alphabetical list of the more common armasm directives.
ALIGN
ALIGN
{<expression>, {<offset>}}
Aligns the address of the next instruction to the form q*<expression>+<offset>.
The alignment is relative to the start of the ELF section so this must be aligned
appropriately (see the AREA directive). <expression> must be a power of two; the
default is 4. <offset> is zero if not specified.
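For example (a sketch, not from the original text; the label table is illustrative), to start a data table on an eight-byte boundary:

        ALIGN   8
table   DCD     0, 0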
AREA
AREA
<section> {,<attr_1>} {,<attr_2>} ... {,<attr_k>}
Starts a new code or data section of name <section>. Table B1.17 lists the possible
attributes.
TABLE B1.17  AREA attributes.

Attribute             Meaning
ALIGN=<expression>    Align the ELF section to a 2^<expression>-byte boundary.
ASSOC=<sectionname>   If this section is linked, also link <sectionname>.
CODE                  The section contains instructions and is read only.
DATA                  The section contains data and is read write.
NOINIT                The data section does not require initialization.
READONLY              The section is read only.
READWRITE             The section is read write.
ASSERT
ASSERT
<logical-expression>
Assemble time assert. If the logical expression is false, then assembly terminates
with an error.
CN
<name>  CN  <numeric-expression>
Set <name> to be an alias for coprocessor register <numeric-expression>.
CODE16, CODE32
CODE16 tells the assembler to assemble the following instructions as 16-bit Thumb
instructions. CODE32 indicates 32-bit ARM instructions (the default for armasm).
CP
<name>  CP  <numeric-expression>
Set <name> to be an alias for coprocessor number <numeric-expression>.
DATA
<label> DATA
The DATA directive indicates that the label points to data rather than code. In
Thumb mode this prevents the linker from setting the bottom bit of the label. Bit
0 of a function pointer or code label is 0 for ARM code and 1 for Thumb code (see
the BX instruction).
DCB, DCD{U}, DCI, DCQ{U}, DCW{U}
These directives allocate one or more bytes of initialized memory according to
Table B1.18. Follow each directive with a comma-separated list of initialization
values. If you specify the optional U suffix, then the assembler does not insert any
alignment padding.
Examples
hello    DCB  "hello", 0
powers   DCD  1, 2, 4, 8, 0x10, 0x20, 0x40, 0x80
         DCI  0xEA000000

TABLE B1.18  Memory initialization directives.

Directive   Alias   Data size (bytes)   Initialization value
DCB         =       1                   byte or string
DCW                 2                   16-bit integer (aligned to 2 bytes)
DCD         &       4                   32-bit integer (aligned to 4 bytes)
DCQ                 8                   64-bit integer (aligned to 4 bytes)
DCI                 2 or 4              integer defining an ARM or Thumb instruction
ELSE (alias |)
See IF.
END
This directive must appear at the end of a source file. Assembler source after an
END directive is ignored.
ENDFUNC (alias ENDP), ENDIF (alias ])
See FUNCTION and IF, respectively.
ENTRY
This directive specifies the program entry point for the linker. The entry point is
usually contained in the ARM C library.
EQU (alias *)
<name> EQU <numeric-expression>
This directive is similar to #define in C. It defines a symbol <name> with value
defined by the expression. This value cannot be redefined. See Section ARM
Assembler Variables for the use of redefinable variables.
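For example (the symbol name is invented for this sketch):

BUFSIZE EQU     0x100               ; like #define BUFSIZE 0x100 in C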
EXPORT (alias GLOBAL)
EXPORT <symbol>{[WEAK]}
Assembler symbols are local to the object file unless exported using this command.
You can link exported symbols with other object and library files. The optional
[WEAK] suffix indicates that the linker should try and resolve references with other
instances of this symbol before using this instance.
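A minimal sketch (the section and function names are invented for this example):

        AREA    utils, CODE, READONLY
        EXPORT  do_nothing          ; make do_nothing visible to other object files
do_nothing
        MOV     pc, lr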
EXTERN, IMPORT
EXTERN <symbol>{[WEAK]}
IMPORT <symbol>{[WEAK]}
Both of these directives declare the name of an external symbol, defined in another
object file or library. If you use this symbol, then the linker will resolve it at link
time. For IMPORT, the symbol will be resolved even if you don’t use it. For EXTERN,
only used symbols are resolved. If you declare the symbol as [WEAK], then no error
is generated if the linker cannot resolve the symbol; instead the symbol takes the
value 0.
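For example (my_table is an invented symbol for this sketch):

        IMPORT  printf              ; resolved at link time even if this file never uses it
        EXTERN  my_table            ; resolved only if my_table is actually referenced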
FIELD (alias #)
See MAP.
FUNCTION (alias PROC) and ENDFUNC (alias ENDP)
The FUNCTION and ENDFUNC directives mark the start and end of an ATPCS-compliant function. Their main use is to improve the debug view and allow
backtracking of function calls during debugging. They also allow the profiler to
more accurately profile assembly functions. You must precede the function directive
with the ATPCS function name. For example:
sub     FUNCTION
        SUB     r0, r0, r1
        MOV     pc, lr
        ENDFUNC
GBLA, GBLL, GBLS
Directives defining global arithmetic, logic, and string variables, respectively. See
Section ARM Assembler Variables.
GET
See INCLUDE.
GLOBAL
See EXPORT.
IF (alias [), ELSE (alias |), ENDIF (alias ])
These directives provide for conditional assembly. They are similar to #if, #else,
#endif, available in C. The IF directive is followed by a logical expression. The
ELSE directive may be omitted. For example:
        IF      {ARCHITECTURE}="5TE"
        SMULBB  r0, r1, r1
        ELSE
        MUL     r0, r1, r1
        ENDIF
IMPORT
See EXTERN.
INCBIN
INCBIN <filename>
This directive includes the raw data contained in the binary file <filename> at the
current point in the assembly. For example, INCBIN table.dat.
INCLUDE (alias GET)
INCLUDE <filename>
Use this directive to include another assembly file. It is similar to the #include
command in C. For example, INCLUDE header.h.
INFO (alias !)
INFO <numeric_expression>, <string_expression>
If <numeric_expression> is nonzero, then assembly terminates with error <string_expression>.
Otherwise the assembler prints <string_expression> as an information message.
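A minimal sketch using the predefined {CODESIZE} variable; the message strings are invented for this example:

        IF      {CODESIZE} = 16
        INFO    0, "assembling the Thumb build"         ; zero value: informational message
        ELSE
        INFO    1, "this file must be built as Thumb"   ; nonzero value: assembly error
        ENDIF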
KEEP
KEEP {<symbol>}
By default the assembler does not include local symbols in the object file, only
exported symbols (see EXPORT). Use KEEP to include all local symbols or a specified
local symbol. This aids the debug view.
LCLA, LCLL, LCLS
These directives declare macro-local arithmetic, logical, and string variables,
respectively. See Section ARM Assembler Variables.
LTORG
Use LTORG to insert a literal pool. The assembler uses literal pools to store the
constants appearing in the LDR Rd,=<value> instruction. See LDR format 19.
Usually the assembler inserts literal pools automatically, at the end of each area.
However, if an area is too large, then the LDR instruction cannot reach this literal
pool using pc-relative addressing. Then you need to insert a literal pool manually,
near the LDR instruction.
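A minimal sketch of manual literal pool placement (the label and constant are invented for this example):

getmask
        LDR     r0, =0x55AA55AA     ; the constant is stored in a literal pool
        MOV     pc, lr
        LTORG                       ; emit the pool here, within range of the LDR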
MACRO, MEXIT, MEND
Use these directives to declare a new assembler macro or pseudoinstruction. The
syntax is
        MACRO
{$<arg_0>} <macro_name> {$<arg_1>} {,$<arg_2>} ... {,$<arg_k>}
        <macro_code>
        MEND
The macro parameters are stored in the dummy variables $<arg_i>. This
argument is set to the empty string if you don’t supply a parameter when
calling the macro. The MEXIT directive terminates the macro early and is usually
used inside IF statements. For example, the following macro defines a new
pseudoinstruction SMUL, which evaluates to a SMULBB on an ARMv5TE processor,
and a MUL otherwise.
        MACRO
$label  SMUL    $a, $b, $c
        IF {ARCHITECTURE}="5TE"
$label  SMULBB  $a, $b, $c
        MEXIT
        ENDIF
$label  MUL     $a, $b, $c
        MEND
MAP (alias ˆ), FIELD (alias #)
These directives define objects similar to C structures. MAP sets the base address or
offset of a structure, and FIELD defines structure elements. The syntax is
        MAP     <base> {, <base_register>}
<name>  FIELD   <field_size_in_bytes>
The MAP directive sets the value of the special assembler variable {VAR} to the base
address of the structure. This is either the value <base> or the register relative value
<base_register>+<base>. Each FIELD directive sets <name> to the value of {VAR}
and increments {VAR} by the specified number of bytes. For register relative values,
the expressions :INDEX:<name> and :BASE:<name> return the element offset from
base register, and base register number, respectively.
In practice the base register form is not that useful. Instead you can use the plain
form and mention the base register explicitly in the instruction. This allows you to
point to a structure of the same type with different base registers. The following
example sets up a structure on the stack of two int variables:
        MAP     0                   ; structure elements offset from 0
count   FIELD   4                   ; define an int called count
type    FIELD   4                   ; define an int called type
size    FIELD   0                   ; record the struct size

        SUB     sp, sp, #size       ; make room on the stack
        MOV     r0, #0
        STR     r0, [sp, #count]    ; clear the count element
        STR     r0, [sp, #type]     ; clear the type element
NOFP
This directive bans the use of floating-point instructions in the assembly file. We
don’t cover floating-point instructions and directives in this appendix.
OPT
The OPT directive controls the formatting of the armasm -list option. This is
seldom used now that source-level debugging is available. See the armasm
documentation.
PROC
See FUNCTION.
RLIST, RN
<name>  RN <numeric-expression>
<name>  RLIST <list of ARM registers enclosed in {}>
These directives name a list of ARM registers or a single ARM register. For example,
the following code names r0 as arg and the ATPCS preserved registers as saved.

arg     RN      0
saved   RLIST   {r4-r11}
ROUT
The ROUT directive defines a new local label area. See Section ARM Assembler Labels.
SETA, SETL, SETS
These directives set the values of arithmetic, logical, and string variables, respectively.
See Section ARM Assembler Variables.
SPACE (alias %)
{<label>} SPACE <numeric_expression>
This directive reserves <numeric_expression> bytes of space. The bytes are zero
initialized.
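For example (the section and label names are invented for this sketch):

        AREA    scratch, DATA, READWRITE
buffer  SPACE   256                 ; reserve 256 zero-initialized bytes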
WHILE, WEND
These directives supply an assemble-time looping structure. WHILE is followed by
a logical expression. While this expression is true, the assembler repeats the code
between WHILE and WEND. The following example shows how to create an array of
powers of two from 1 to 65,536.
        GBLA    count
count   SETA    1
        WHILE   count<=65536
        DCD     count
count   SETA    2*count
        WEND

B1.5
GNU Assembler Quick Reference
This section summarizes the more useful commands and expressions available with
the GNU assembler, gas, when you target this assembler for ARM. Each assembly
line has the format
{<label>:} {<instruction or directive>} @ comment
Unlike the ARM assembler, you needn’t indent instructions and directives. Labels
are recognized by the following colon rather than their position at the start of the
line. The following example shows a simple assembly file defining a function add
that returns the sum of the two input arguments:
        .section .text, "x"
        .global  add         @ give the symbol add external linkage
add:
        ADD      r0, r0, r1  @ add input arguments
        MOV      pc, lr      @ return from subroutine
GNU Assembler Directives
Here is an alphabetical list of the more common gas directives.
.ascii “<string>”
Inserts the string as data into the assembly, as for DCB in armasm.
.asciz “<string>”
As for .ascii but follows the string with a zero byte.
.balign <power_of_2> {,<fill_value> {,<max_padding>} }
Aligns the address to <power_of_2> bytes. The assembler aligns by adding bytes of
value <fill_value> or a suitable default. The alignment will not occur if more than
<max_padding> fill bytes are required. Similar to ALIGN in armasm.
.byte <byte1> {,<byte2>} ...
Inserts a list of byte values as data into the assembly, as for DCB in armasm.
.code <number_of_bits>
Sets the instruction width in bits. Use 16 for Thumb and 32 for ARM assembly.
Similar to CODE16 and CODE32 in armasm.
.else
Use with .if and .endif. Similar to ELSE in armasm.
.end
Marks the end of the assembly file. This is usually omitted.
.endif
Ends a conditional compilation code block. See .if, .ifdef, .ifndef. Similar
to ENDIF in armasm.
.endm
Ends a macro definition. See .macro. Similar to MEND in armasm.
.endr
Ends a repeat loop. See .rept and .irp. Similar to WEND in armasm.
.equ <symbol name>, <value>
This directive sets the value of a symbol. It is similar to EQU in armasm.
.err
Causes assembly to halt with an error.
.exitm
Exit a macro partway through. See .macro. Similar to MEXIT in armasm.
.global <symbol>
This directive gives the symbol external linkage. It is similar to EXPORT in armasm.
.hword <short1> {,<short2>} ...
Inserts a list of 16-bit values as data into the assembly, as for DCW in armasm.
.if <logical_expression>
Makes a block of code conditional. End the block using .endif. Similar to IF in
armasm. See also .else.
.ifdef <symbol>
Include a block of code if <symbol> is defined. End the block with .endif.
.ifndef <symbol>
Include a block of code if <symbol> is not defined. End the block with .endif.
.include “<filename>”
Includes the indicated source file. Similar to INCLUDE in armasm or #include in C.
.irp <param> {,<val_1>} {,<val_2>} ...
Repeats a block of code, once for each value in the value list. Mark the end of
the block using a .endr directive. In the repeated code block, use \<param> to
substitute the associated value in the value list.
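A minimal sketch; the loop below expands to one MOV per listed register:

        .irp    reg, r4, r5, r6, r7
        MOV     \reg, #0            @ clear each register in the list
        .endr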
.macro <name> {<arg_1>} {,<arg_2>} ... {,<arg_k>}
Defines an assembler macro called <name> with k parameters. The macro
definition must end with .endm. To escape from the macro at an earlier point, use
.exitm. These directives are similar to MACRO, MEND, and MEXIT in armasm. You
must precede the dummy macro parameters by \. For example:
.macro SHIFTLEFT a, b
.if \b < 0
MOV \a, \a, ASR #-\b
.exitm
.endif
MOV \a, \a, LSL #\b
.endm
.rept <number_of_times>
Repeats a block of code the given number of times. End the block with .endr.
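For example, the following sketch emits four zero words:

        .rept   4
        .word   0
        .endr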
<name> .req <register_name>
This directive names a register. It is similar to the RN directive in armasm except that
you must supply a name rather than a number on the right. For example, acc .req r0.
.section <section_name> {,”<flags>”}
Starts a new code or data section. Usually you should call a code section .text,
an initialized data section .data, and an uninitialized data section .bss . These
have default flags, and the linker understands these default names. The directive is
similar to the armasm directive AREA. Table B1.19 lists possible characters to appear
in the <flags> string for ELF format files.
TABLE B1.19 .section flags for ELF format files.

Flag   Meaning
a      allocatable section
w      writable section
x      executable section

.set <variable_name>, <variable_value>
This directive sets the value of a variable. It is similar to SETA in armasm.
.space <number_of_bytes> {,<fill_byte>}
Reserves the given number of bytes. The bytes are filled with zero or <fill_byte> if
specified. It is similar to SPACE in armasm.
.word <word1> {,<word2>} ...
Inserts a list of 32-bit word values as data into the assembly, as for DCD in armasm.
APPENDIX B2
ARM and Thumb Instruction Encodings
Andrew Sloss, ARM; Dominic Symes, ARM; Chris Wright, Ultimodule Inc.
B2.1
ARM Instruction Set Encodings B-3
B2.2
Thumb Instruction Set Encodings B-9
B2.3
Program Status Registers B-11
This appendix gives tables for the instruction set encodings of the 32-bit ARM and
16-bit Thumb instruction sets. We also describe the fields of the processor status
registers cpsr and spsr.
B2.1
ARM Instruction Set Encodings
Table B2.1 summarizes the bit encodings for the 32-bit ARM instruction set
architecture ARMv6. This table is useful if you need to decode an ARM instruction
by hand. We’ve expanded the table to aid quick manual decode. Any bitmaps not
listed are either unpredictable or undefined for ARMv6.
To use Table B2.1 efficiently, follow this decoding procedure:
■ Look at the leading hex digit of the instruction, bits 28 to 31. If this has a value 0xF, then jump to the end of Table B2.1. Otherwise, the top hex digit represents a condition cond. Decode cond using Table B2.2.
TABLE B2.1 ARM instruction decode table (instruction classes indexed by op).

[The bit-level encoding columns of Table B2.1, which give bits 31 to 0 for each ARMv6 instruction class, could not be recovered from this extraction. The instruction classes listed in the table are: the data-processing instructions (AND | EOR | SUB | RSB | ADD | ADC | SBC | RSC, TST | TEQ | CMP | CMN, ORR | BIC, MOV | MVN in register, shifted-register, and immediate forms); the multiply instructions (MUL, MLA, UMAAL, UMULL | UMLAL | SMULL | SMLAL, SMULxy, SMLAxy, SMULWy, SMLAWy, SMLALxy, SMLAD | SMLSD, SMUAD | SMUSD, SMLALD | SMLSLD, SMMLA | SMMLS, SMMUL); the load/store instructions (STRH | LDRH, LDRD | STRD | LDRSB | LDRSH, STR | LDR | STRB | LDRB in postindexed and preindexed forms, SWP | SWPB, STREX, LDREX, STMDA | LDMDA | STMIA | LDMIA, STMDB | LDMDB | STMIB | LDMIB); status register transfers (MRS, MSR); branches (B, BL, BX | BLX, BLX immediate, BXJ); saturating and packing instructions (QADD | QSUB | QDADD | QDSUB, {S|U}SAT, {S|U}SAT16, PKHBT | PKHTB, SEL); the SIMD arithmetic instructions ({ |S|Q|SH| |U|UQ|UH}ADD16 | ADDSUBX | SUBADDX | SUB16 | ADD8 | SUB8); byte-reversal and extension instructions (REV | REV16 | REVSH, {S|U}XTAB16, {S|U}XTB16, {S|U}XTAB, {S|U}XTB, {S|U}XTAH, {S|U}XTH); sum-of-absolute-differences instructions (USADA8, USAD8); miscellaneous instructions (CLZ, BKPT, SWI, CPS | CPSIE | CPSID, SETEND LE | SETEND BE, PLD, RFEDA | RFEIA | RFEDB | RFEIB, SRSDA | SRSIA | SRSDB | SRSIB); the coprocessor instructions (MCRR | MRRC, STC{L} | LDC{L}, CDP, MCR | MRC) and their unconditional variants (MCRR2 | MRRC2, STC2{L} | LDC2{L}, CDP2, MCR2 | MRC2); and encodings that are undefined and expected to stay so.]
TABLE B2.2 Decoding table for cond.

Binary   Hex   cond
0000     0     EQ
0001     1     NE
0010     2     CS/HS
0011     3     CC/LO
0100     4     MI
0101     5     PL
0110     6     VS
0111     7     VC
1000     8     HI
1001     9     LS
1010     A     GE
1011     B     LT
1100     C     GT
1101     D     LE
1110     E     {AL}
■ Index through Table B2.1 using the second hex digit, bits 24 to 27 (shaded).
■ Index using bit 4, then bit 7 or bit 23 of the instruction where these bits are shaded.
■ Once you have located the correct table entry, look at the bits named op. Concatenate these to form a binary number that indexes the | separated instruction list on the left. For example, if there are two op bits with values 1 and 0, then the binary value 10 indicates instruction number 2 in the list (the third instruction).
■ The instruction operands have the same name as in the instruction description of Appendix B1.
The table uses the following abbreviations:
■ L is 1 if the L suffix applies for LDC and STC operations.
■ M is 1 if CPS changes processor mode. mode is defined in Table B2.3.
TABLE B2.3 Decoding table for mode.

Binary   Hex    mode
10000    0x10   user mode (_usr)
10001    0x11   FIQ mode (_fiq)
10010    0x12   IRQ mode (_irq)
10011    0x13   supervisor mode (_svc)
10111    0x17   abort mode (_abt)
11011    0x1B   undefined mode (_und)
11111    0x1F   system mode
TABLE B2.4 Decoding table for shift, shift_size, and Rs.

shift   shift_size   Rs    Shift action
00      0 to 31      N/A   LSL #shift_size
00      N/A          Rs    LSL Rs
01      0            N/A   LSR #32
01      1 to 31      N/A   LSR #shift_size
01      N/A          Rs    LSR Rs
10      0            N/A   ASR #32
10      1 to 31      N/A   ASR #shift_size
10      N/A          Rs    ASR Rs
11      0            N/A   RRX
11      1 to 31      N/A   ROR #shift_size
11      N/A          Rs    ROR Rs
N/A     0 to 31      N/A   The shift value is implicit: For PKHBT it is 00. For PKHTB it is 10. For SAT it is 2*sh.
■ op1 and op2 are the opcode extension fields in coprocessor instructions.
■ post indicates a postindexed addressing mode such as [Rn], Rm or [Rn], #immed.
■ pre indicates a preindexed addressing mode such as [Rn, Rm] or [Rn, #immed].
■ register_list is a bit field with bit k set if register Rk appears in the register list.
■ rot is a byte rotate. The second operand is Rm ROR (8*rot).
■ rotate is a bit rotate. The second operand is #immed ROR (2*rotate).
■ shift and sh encode a shift type and direction. See Table B2.4.
■ U is the up/down select for addressing modes. If U = 1, then we add the offset to the base address, as in [Rn], #4 or [Rn, Rm]. If U = 0, then we subtract the offset from the base address, as in [Rn, #-4] or [Rn], -Rm.
■ unindexed indicates an addressing mode of the form [Rn], {option}.
■ R is 1 if the R (round) instruction suffix is present.
■ T is 1 if the T suffix is present on load and store instructions.
■ W is 1 if ! (writeback) is specified in the instruction mnemonic.
■ X is 1 if the X (exchange) instruction suffix is present.
■ x and y are 0 for the B suffix, 1 for the T suffix.
■ ^ is 1 if the ^ suffix is applied in LDM or STM instructions.
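As a short worked example of the decoding procedure, consider a register-form ADD; the encoding shown is the standard ARM ADD (register) encoding, stated here for illustration rather than read from the table extract above:

        ADD     r0, r1, r2          ; encodes as 0xE0810002
                                    ; bits [31:28] = 1110 = 0xE, so cond = AL (Table B2.2)
                                    ; bits [27:25] = 000 and bit [4] = 0: data processing, register operand
                                    ; bits [24:21] = 0100, selecting ADD; bit [20] = S = 0 (flags not updated)
                                    ; Rn = r1, Rd = r0, Rm = r2 with no shift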
B2.2
Thumb Instruction Set Encodings
Table B2.5 summarizes the bit encodings for the 16-bit Thumb instruction set. This
table is useful if you need to decode a Thumb instruction by hand. We’ve expanded
the table to aid quick manual decode. The table contains instruction definitions
up to architecture THUMBv3. Any bitmaps not listed are either unpredictable or
undefined for THUMBv3.
To use the table efficiently, follow this decoding procedure:
■ Index through the table using the first hex digit of the instruction, bits 12 to 15 (shaded).
■ Index on any shaded bits from bits 0 to 11.
■ Once you have located the correct table entry, look at the bits named op. Concatenate these to form a binary number that indexes the | separated instruction list on the left. For example, if there are two op bits with values 1 and 0, then the binary value 10 indicates instruction number 2 in the list (the third instruction).
TABLE B2.5 Thumb instruction decode table (instruction classes indexed by op).

[The bit-level encoding columns of Table B2.5, which give bits 15 to 0 for each Thumb instruction class, could not be recovered from this extraction. The instruction classes listed in the table are: shift and arithmetic instructions (LSL | LSR, ASR, ADD | SUB in register and 3-bit immediate forms, MOV | CMP and ADD | SUB with an 8-bit immediate); register data-processing operations (AND | EOR | LSL | LSR, ASR | ADC | SBC | ROR, TST | NEG | CMP | CMN, ORR | MUL | BIC | MVN); high-register operations and branch exchange (CPY Ld, Lm; ADD | MOV and CMP with high registers; BX | BLX); loads and stores (LDR Ld, [pc, #immed*4]; STR | STRH | STRB | LDRSB and LDR | LDRH | LDRB | LDRSH with register offset; STR | LDR Ld, [Ln, #immed*4]; STRB | LDRB Ld, [Ln, #immed]; STRH | LDRH Ld, [Ln, #immed*2]; STR | LDR Ld, [sp, #immed*4]; STMIA | LDMIA Ln!, {register-list}); address generation and stack adjustment (ADD Ld, pc, #immed*4 | ADD Ld, sp, #immed*4; ADD sp, #immed*4 | SUB sp, #immed*4; PUSH | POP); extension, reversal, and system instructions (SXTH | SXTB | UXTH | UXTB, REV | REV16 | REVSH, SETEND LE | SETEND BE, CPSIE | CPSID, BKPT immed8); branches and calls (B<cond>, B, SWI immed8, and BL and BLX formed from a branch prefix instruction followed by a relative BL or BLX instruction); and encodings that are undefined and expected to remain so.]
■ The instruction operands have the same name as in the instruction description of Appendix B1.
The table uses the following abbreviations:
■ register_list is a bit field with bit k set if register Rk appears in the register list.
■ R is 1 if lr is in the register list of PUSH or pc is in the register list of POP.
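As a short worked example, consider the register-form ADD again; the encoding shown is the standard 16-bit Thumb ADD (register) encoding, stated here for illustration rather than read from the table extract above:

        ADD     r0, r1, r2          ; encodes as 0x1888
                                    ; bits [15:11] = 00011: the ADD | SUB class
                                    ; bit [10] = 0 (register form); op bit [9] = 0 selects ADD
                                    ; bits [8:6] = 010 (Lm = r2), bits [5:3] = 001 (Ln = r1), bits [2:0] = 000 (Ld = r0)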
B2.3
Program Status Registers
Table B2.6 shows how to decode the 32-bit program status registers for ARMv6.
TABLE B2.6 cpsr and spsr decode table.

Bits      Field
[31:28]   N Z C V
[27]      Q
[26:25]   Res
[24]      J
[23:20]   Res
[19:16]   GE[3:0]
[15:10]   Res
[9]       E
[8]       A
[7]       I
[6]       F
[5]       T
[4:0]     mode

Field     Use
N         Negative flag, records bit 31 of the result of flag-setting operations.
Z         Zero flag, records if the result of a flag-setting operation is zero.
C         Carry flag, records unsigned overflow for addition, not-borrow for subtraction, and is also used by the shifting circuit. See Table B1.3.
V         Overflow flag, records signed overflows for flag-setting operations.
Q         Saturation flag. Certain operations set this flag on saturation. See for example QADD in Appendix B1 (ARMv5E and above).
J         J = 1 indicates Java execution (must have T = 0). Use the BXJ instruction to change this bit (ARMv5J and above).
Res       These bits are reserved for future expansion. Software should preserve the values in these bits.
GE[3:0]   The SIMD greater-or-equal flags. See SADD in Appendix B1 (ARMv6).
E         Controls the data endianness. See SETEND in Appendix B1 (ARMv6).
A         A = 1 disables imprecise data aborts (ARMv6).
I         I = 1 disables IRQ interrupts.
F         F = 1 disables FIQ interrupts.
T         T = 1 indicates Thumb state. T = 0 indicates ARM state. Use the BX or BLX instructions to change this bit (ARMv4T and above).
mode      The current processor mode. See Table B2.3.