Chapter 8
CPU and Memory: Design, Enhancement, and Implementation
8.1 (BL3) To keep the x86 current, Intel has added numerous features over the past thirty years since the
basic design was first released. The goal has been to maintain the ability to execute older programs
while providing increased capabilities for modern systems. The result is an instruction set with
instruction words ranging in length from a single byte up to fifteen bytes, a variety of op codes of different
lengths, a number of different and inconsistent addressing formats, registers ranging in size from
eight bits to 128 bits, and more. All of these modifications have made support for superscalar
processing difficult. Nonetheless, Intel has come up with a number of creative solutions to make
superscalar processing possible and effective.
(1) Since instructions don't fall on well-defined, consistent address boundaries, early x86
processors used a buffer to load a large number of bytes into the CPU for instruction analysis and
preprocessing. This allowed the CPU to identify upcoming instructions in advance of their
execution. More recently, Intel expanded this buffer into a large instruction pool.
(2) Intel created a secondary, uniform-length internal instruction set. Commonly used instructions
are reformatted into the secondary instruction set for pipelining. Rarely used instructions are
executed as built-in software subroutines.
(3) Intel created multiple execution units with different pipelines for different types of instructions
and replaceable sets of registers that could be used for the temporary storage of data. This allowed
Intel to create algorithms for determining data dependencies and allowing out-of-order processing.
An instruction completion pool area allows the machine to complete instructions in the correct
order. The parallel execution units make it possible to process multiple instructions in parallel.
Essentially, the CPU pulls instructions from the pool as they become available for execution,
requiring only that doing so does not affect the outcome of the original program. This allows the
CPU to keep the various pipelines filled as much as possible.
With this flexibility in place, the x86 is able to implement superscalar processing.
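To make the instruction-pool idea concrete, the following toy sketch (in Python) issues instructions out of order as their operands become available, with results arriving after a stated latency. The instruction format, register names, and latencies are all invented for illustration; this is not Intel's actual mechanism.

    # Each entry: (text, destination register, source registers, latency in cycles)
    program = [
        ("MUL R5,R1,R2", "R5", ["R1", "R2"], 4),   # long-latency multiply
        ("ADD R6,R5,R3", "R6", ["R5"], 1),         # depends on the MUL's result
        ("ADD R7,R3,R4", "R7", ["R3", "R4"], 1),   # independent of the MUL
    ]
    available = {"R1", "R2", "R3", "R4"}   # register values valid at the start
    pool = list(program)                   # the instruction pool
    pending = []                           # (cycle the result arrives, register)
    cycle = 0
    while pool or pending:
        cycle += 1
        for item in [p for p in pending if p[0] <= cycle]:
            available.add(item[1])         # a result has arrived
            pending.remove(item)
        for instr in pool:
            text, dest, sources, latency = instr
            if all(s in available for s in sources):   # dependencies satisfied?
                print(f"cycle {cycle}: issue {text}")
                pending.append((cycle + latency, dest))
                pool.remove(instr)
                break                      # one issue per cycle in this toy model
    print(f"all results complete after {cycle} cycles")

The independent ADD issues on cycle 2, ahead of the ADD that must wait for the multiply, which is exactly the pool behavior described in point (3).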
8.2
(BL2) The instruction that requires four execution steps stalls in the pipeline, while the instruction
behind it waits for three instruction cycles, creating a three-cycle delay.
8.3 This problem illustrates the issue of data dependencies.
a. (BL3) If both instructions access an integer unit simultaneously, the ADD instruction will
operate on the initial data in R5, 0, instead of the result from the multiplication, as obviously
intended. Furthermore, when the MULTIPLY completes and stores its own value, 24, into R5, it
destroys the ADD's result. It is as though the
ADD instruction never executed. If data dependencies are taken into account, the CPU
recognizes that the MULTIPLY instruction must complete first, before the ADD instruction can
proceed. Therefore, the execution unit with the ADD instruction sits idle for 15 cycles.
b. (BL3) If out-of-order execution is permitted, the idle execution unit could execute other, nondependent instructions while it is waiting. The problem states that the later ADD instruction
meets the requirements. Although it can perform the addition, it must wait for
availability of the R5 register to hold the result until the previous MULTIPLY and ADD are
completed, so there is no real gain. If the system provides rename registers, however, the later
ADD can be executed and its result stored in a rename register standing in for R5. When the instruction is
retired, the result is then moved to the actual R5 register, preserving the correct flow of the
program. In this case, we have gained the cycles normally required for the last ADD instruction
by executing it out of order.
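A minimal sketch of the renaming idea in part b, with an invented pool of physical registers:

    free_physical = ["P0", "P1", "P2", "P3"]   # unused rename registers
    rename_map = {}                            # architectural name -> physical register

    def write_target(arch_reg):
        # an instruction that writes arch_reg is given a fresh physical register
        phys = free_physical.pop(0)
        rename_map[arch_reg] = phys
        return phys

    mul_dest = write_target("R5")    # the MULTIPLY delivers its result to P0
    add1_src = rename_map["R5"]      # the first ADD reads the MULTIPLY's P0
    add2_dest = write_target("R5")   # the later ADD gets a different physical
                                     # register, so it can execute immediately
                                     # without clobbering the earlier value
    print(mul_dest, add1_src, add2_dest)   # P0 P0 P1 -- no false dependency

At retirement, the rename map determines which physical value is copied to the "real" R5, preserving the program order of results.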
8.4
(BL3) A branch dependency requires a delay while the pipeline is refilled with instructions from
the correct branch. If it is known that the CPU always executes the two instructions following the
branch instruction, then those instructions can be executed while the refilling operation takes
place, eliminating the delay. To be useful, this technique requires reordering instructions in such a
way that the two instructions following a branch are free of dependencies and would normally be
executed prior to the branch.
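The accounting behind the technique is straightforward; a sketch with invented cycle counts:

    refill_cycles = 2     # cycles to refill the pipeline after a branch (assumed)
    filled_slots = 2      # dependency-free instructions moved into the branch slots
    wasted = max(0, refill_cycles - filled_slots)
    print("cycles lost per taken branch:", wasted)   # 0 when both slots are filled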
8.5
(BL3-) The student would observe that most backward conditional branches occur as a result of
loops, and most forward branches result from IF-THEN-ELSE conditions. These results are
consistent with normal programming situations. IF-THEN-ELSE situations always execute a block
of code following the decision point; statistically, true conditions are executed more frequently
than false conditions, so that a program design that branches forward on true conditions will predict
correctly more than 50% of the time. The prediction that a backward branch will be taken will
perform even better, since most loops are repeated multiple times, often many times.
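These observations can be tallied with a short sketch. The trace below is invented, but its proportions follow the reasoning in the answer (loops usually repeat; forward branches taken on true conditions are taken more often than not):

    trace = ([("backward", True)] * 90 + [("backward", False)] * 10 +
             [("forward", True)] * 60 + [("forward", False)] * 40)
    predict_taken_correct = sum(1 for _, taken in trace if taken)
    backward = [taken for direction, taken in trace if direction == "backward"]
    print(f"predicting every branch taken: {predict_taken_correct / len(trace):.0%}")
    print(f"backward branches alone:       {sum(backward) / len(backward):.0%}")

With these made-up proportions, predicting "taken" is right 75% of the time overall, and 90% of the time for the backward branches alone.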
8.6
(BL2+ - 4) This is a creative problem, with no single best answer, but one would expect a solution
that provides multiple personnel, representing each execution unit, and possibly one or more
assembly lines to represent pipelining. One solution suggests a fast food restaurant in which orders
are taken and hung in order on clips awaiting action by the kitchen personnel. A number of cooks,
representing execution units, take the orders and prepare them. To keep the diners happy, orders
for a single table are delivered together (data dependencies) and orders are delivered to tables in
the order that they were taken (in-order completion).
8.7
a. (BL2+) The LOAD and STORE each take five steps (or four, if you assume that the PC may
be incremented in parallel with other operations). The ADD and SUBTRACT also require five
(or four) steps, IN and OUT require four (or three), SKIPs require four (or three), and JUMPs
require three. Then a typical program mix requires
S = 0.25 (5 + 5) + 0.10 (5 + 5 + 4 + 4) + 0.05 (4 + 3) = 4.65 steps per instruction on
average.
If the clock ticks at 10 MHz, the number of instructions executed in a second is
N = 10,000,000 / 4.65 = approximately 2.15 million instructions per second.
If we assume that the PC is incremented in parallel with execution, then the result is
N = 10,000,000 / ( 0.25 (4 + 4) + 0.10 (4 + 4 + 3 + 3) + 0.05 (3 + 3) )
= approx. 2.7 million IPS
b. (BL2+) With pipelining, each instruction is reduced by the two steps required for the fetch.
(This, of course, assumes that bubbles in the pipeline can usually be avoided.) Then,
N = 10,000,000 / ( 0.25 (2 + 2) + 0.10 (2 + 2 + 1 + 1) + 0.05 (2 + 1) )
= approx. 5.7 million IPS
Although these calculations are only rough estimates of CPU performance, they illustrate clearly
the substantial gains that have resulted from sophisticated modern design techniques.
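The arithmetic above is easy to reproduce; this short check uses exactly the weights and step counts from the solution (the 10 MHz clock is given by the problem):

    clock = 10_000_000   # ticks per second
    steps_serial    = 0.25*(5 + 5) + 0.10*(5 + 5 + 4 + 4) + 0.05*(4 + 3)   # 4.65
    steps_parallel  = 0.25*(4 + 4) + 0.10*(4 + 4 + 3 + 3) + 0.05*(3 + 3)   # 3.70
    steps_pipelined = 0.25*(2 + 2) + 0.10*(2 + 2 + 1 + 1) + 0.05*(2 + 1)   # 1.75
    for label, steps in [("serial fetch-execute", steps_serial),
                         ("parallel PC increment", steps_parallel),
                         ("pipelined", steps_pipelined)]:
        print(f"{label:22}: {clock / steps / 1e6:.2f} million IPS")

The three printed figures are 2.15, 2.70, and 5.71 million IPS, matching the estimates above.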
8.8 (BL1) 2 billion instructions per second; approximately 6 billion instructions per second, assuming
that instructions can be processed continuously, with no delays for memory access and data
dependencies.
8.9 (BL3) To see the interaction between the program and the cache memory, first suppose that the
program processes the array one column at a time. Each cache line holds sixteen values. Since the
cache line holds the next sixteen values in column order, all sixteen values will be used
immediately, after which the program is finished with those values for good. This will be true for the entire
array, so the total number of cache transfers is 160,000/16, or 10,000 transfers.
Now consider processing the array one row at a time. This situation is more difficult. As the
program processes the first row, each value is found in a separate part of linear memory, and a
new cache line must be loaded for each value in the row. When the 301st value is accessed, the
cache is full, so it is necessary to replace a previous cache line. No matter which cache line is
replaced, values in that line will be required for the next several rows, and the line will have to be
reloaded. There are a number of different ways of selecting a cache line for replacement, but for
simplicity, we can assume that the oldest line is replaced. If this is the case, the first 100 lines
loaded will be evicted in order to finish the first row. Processing the second row then requires
reloading each of those lines, and doing so evicts the next hundred in turn. The result is a new
cache line transfer for every value in the array. Although it would be possible to design a
replacement algorithm that would reduce the number of transfers somewhat, the improved
algorithm would be more complex, and would only apply to particular cases. This makes such an
algorithm impractical and not worthwhile.
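The two counts can be verified by simulation. Consistent with the numbers in the answer, the sketch below assumes a 400 x 400 array (160,000 values) stored in column order, sixteen values per cache line, a 300-line cache, and oldest-first replacement:

    from collections import OrderedDict

    ROWS = COLS = 400
    VALUES_PER_LINE = 16
    CACHE_LINES = 300

    def transfers(accesses):
        cache = OrderedDict()                  # insertion order tracks line age
        count = 0
        for addr in accesses:
            line = addr // VALUES_PER_LINE
            if line not in cache:
                count += 1                     # line loaded from memory
                if len(cache) >= CACHE_LINES:
                    cache.popitem(last=False)  # evict the oldest line
                cache[line] = True
        return count

    address = lambda row, col: col * ROWS + row   # column-major storage
    by_column = [address(r, c) for c in range(COLS) for r in range(ROWS)]
    by_row    = [address(r, c) for r in range(ROWS) for c in range(COLS)]
    print("column at a time:", transfers(by_column))   # 10000
    print("row at a time:   ", transfers(by_row))      # 160000

The row-order traversal produces one transfer per value, 160,000 in all, confirming the reasoning above.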
8.10
(BL1+) When a cache miss occurs, a new cache line must be moved from memory to cache before
the memory access can occur. In the simplest case, there is space available in cache, and the only
delay is the additional time required to access conventional memory to load the cache line. If it is
necessary to write back another cache line to make space for the new line, however, there is an
additional delay for writing the old cache line to memory.
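These delays can be combined into a rough average-cost formula; every number below is invented for illustration:

    hit_time = 1        # cycles for a cache hit (assumed)
    line_load = 50      # cycles to load a line from memory (assumed)
    write_back = 50     # extra cycles to write back an evicted dirty line (assumed)
    hit_ratio = 0.95    # assumed
    dirty_rate = 0.30   # fraction of misses that must write back first (assumed)
    average = hit_time + (1 - hit_ratio) * (line_load + dirty_rate * write_back)
    print(f"average access time: {average:.2f} cycles")   # 4.25 with these numbers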
8.11
(BL1+) The tag identifies the actual physical memory addresses corresponding to the data stored
in a particular cache block. As such, the tags together act as a directory to identify the physical
memory addresses for everything stored in cache memory.
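For a simple direct-mapped cache, the relationship between an address and its tag can be shown directly; the line size and slot count below are assumptions for illustration:

    LINE_BYTES = 64   # assumed line size
    SLOTS = 1024      # assumed number of cache slots

    def split(address):
        offset = address % LINE_BYTES            # byte within the line
        slot = (address // LINE_BYTES) % SLOTS   # which cache slot
        tag = address // (LINE_BYTES * SLOTS)    # stored to identify the line
        return tag, slot, offset

    # Two addresses that land in the same slot are told apart by their tags.
    print(split(0x12345678))   # (4660, 345, 56)
    print(split(0x00045678))   # (4, 345, 56) -- same slot, different tag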
8.12
(BL1+) Both memory write-through and write-back result from the fact that writes to cache are
actually destined for physical memory, so that writes to cache must ultimately be transferred to
physical memory. The write-through technique copies every write to cache through to memory immediately.
This assures that physical memory is always current, and simplifies the design. The cost is that a
memory write is required for every cache write, which slows down performance. The write-back
technique only writes to physical memory when a cache line is replaced, which is faster, but there
is a small risk of data loss, and careful design is required to assure that data loss doesn't occur.
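A toy count makes the trade-off visible; the access pattern below is invented:

    stores = 1000                    # repeated writes to one cached value
    write_through_traffic = stores   # every cache write is copied to memory
    write_back_traffic = 1           # one transfer when the dirty line is evicted
    print(f"write-through: {write_through_traffic} memory writes")
    print(f"write-back:    {write_back_traffic} memory write")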
8.13. (BL 2) Master-slave multiprocessing requires the use of a single CPU core to manage and delegate
services to the slave cores. The master CPU represents a single point of failure for the system. In
addition, the master CPU carries the load of delegating work to the slaves. If the system load is
heavy, the master may itself be too busy to keep the slaves fully occupied. Conversely,
master-slave multiprocessing is useful when the slaves are assigned specific, known tasks that can
be executed relatively independently. This is the case, for example, with the Cell processor, which
is used in the Sony PlayStation 3 game console. The slaves perform specifically assigned graphics
processing tasks which are compute-intensive, reducing the burden on the master CPU.
For most general purpose processing, symmetric multiprocessing is to be preferred. Work can be
divided evenly among the various CPU cores, and each has access to the data and program
instructions, so that any CPU can pick up work as it becomes available. There is no single point of
failure. However, the shifting of a program from core to core during execution can make it more
difficult to implement cache memory in such a way that data integrity is maintained. Furthermore,
the operating system requires more sophistication to delegate work effectively and efficiently.
In general, you would select symmetric multiprocessing for fault tolerant systems, since the failure
of a single CPU slows down, but does not normally disable, system processing. This is true
because in symmetrical multiprocessing any CPU can execute any task.
8.14. (BL 2) The tasks performed by the slave processors are determined, of course, by the game
designers or programmers for the applications that are run on the processor. Typically, the assigned
tasks represent processing-intensive, repetitive tasks that can take advantage of the vector
processing capability of a slave and that can be run relatively independently of the master
CPU. For gaming, these tasks include terrain generation, lighting and shadowing, MPEG encoding
and decoding, transformation of 3-D graphics, audio decoding, and other vector and multimedia
processing tasks.
The primary role of the master processor is to manage the system and to delegate work to the slave
processors. This configuration of master-slave multiprocessing is well suited to
multimedia-intensive applications, because typical applications of this type require intensive
processing that is well-defined and easily isolated from the main application.
Other applications of this type include many scientific applications, in particular, problems with
large amounts of data to be analyzed. Current problems observed in the literature include cancer
analysis, physics particle analysis, weather analysis, and DNA analysis.
8.15. (BL 3) When the number of cores grows large, there is usually a diminishing rate of return. There
are two primary reasons for this.
(1). The number of processors exceeds the number of programs that require execution time, on
average, at any given time; therefore a number of the processors are sitting idle, not performing
useful work. Even when the number of programs being executed is relatively large, typically some
of them would be waiting for I/O, and therefore not capable of execution at the time.
(2). Each core must have access to memory to execute its instructions. As the number of cores
grows large, the time required for each memory access becomes a limiting resource. Although the
use of cache memory for each core alleviates this problem to some extent, cache memory creates
other, additional problems. (See (3) below.) Also, cache misses require access to memory to
replace cache lines. As the number of cores grows, the number of cache lines that must be loaded
also grows, consuming memory access time.
(3). The amount of overhead required by the operating system to manage and keep track of each
program being executed grows large. A considerable amount of effort, for example, is needed to
keep track of data changes in the cache memory of each processor, to make sure that the data
throughout the cache memory in every core is consistent.
The number of cores can be increased somewhat if we can assure that the programs running in
each core are relatively independent of each other (which solves problem (3)), and if the system is
designed in such a way that each processor has access to its own local memory (which solves
problem (2)). In this case, access to the main, shared computer memory is limited to block DMA
transfers, treating main memory as I/O. This is the case for master-slave multiprocessors like the
Cell processor. Similar designs have been created in the past for transaction processing
applications, where each processor handled its own transactions independently.
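Point (2) can be illustrated with a deliberately crude model in which all cores share a single path to memory; the 10% figure below is invented:

    memory_fraction = 0.10   # fraction of each core's time needing memory (assumed)
    for cores in (1, 2, 4, 8, 16, 32):
        throughput = min(cores, 1 / memory_fraction)   # the shared path saturates
        print(f"{cores:2} cores -> relative throughput {throughput:4.1f}")

Once the combined demand saturates the shared path, at ten cores in this model, additional cores add no throughput at all, which is the diminishing return described above.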