Chapter 4
#1
Pipeline stalls are necessary when an instruction must wait due to unresolved
hazards or resource conflicts, effectively pausing the instruction flow and allowing
the pipeline to synchronize and resolve the _________.
#2 An _________ occurs when instructions i and j write the same register or
memory location. The ordering between the instructions must be preserved to
ensure that the value finally written corresponds to instruction j.
#3
__________ is a technique to get more performance from loops that access
arrays, in which multiple copies of the loop body are made and instructions from
different iterations are scheduled together.
#4
Many superscalars extend the basic framework of dynamic issue decisions to
include __________, which chooses which instructions to execute in a given clock
cycle while trying to avoid hazards and stalls.
#5
Explain the three primary units of a dynamically scheduled pipeline.
#6
_______ is a situation in pipelined execution when an instruction blocked from
executing does not cause the following instructions to wait.
#7
_______ is a small memory that is indexed by the lower portion of the address
of the branch instruction and that contains one or more bits indicating whether
the branch was recently taken or not.
#8
Consider a loop branch that branches nine times in a row, and then is not
taken once. What is the prediction accuracy for this branch, assuming the
prediction bit for this branch remains in the prediction buffer?
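One way to check the answer is to simulate a single-bit predictor over many repetitions of the pattern; this sketch assumes the predictor simply records the last outcome.

```python
# Simulate a 1-bit branch predictor on a loop branch that is
# taken nine times in a row and then not taken once, repeated.
pattern = [True] * 9 + [False]   # True = taken
prediction = True                # assume the bit starts at "taken"
correct = 0
history = pattern * 100
for actual in history:
    if prediction == actual:
        correct += 1
    prediction = actual          # 1-bit predictor: remember the last outcome
accuracy = correct / len(history)
print(f"prediction accuracy: {accuracy:.1%}")
```

At steady state the predictor mispredicts twice per ten branches (the final not-taken branch and the first taken branch of the next pass), giving 80% accuracy.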
#9
Explain the 2-bit prediction scheme.
#10
Generally, we can use the term ________ to refer to any unexpected change
in control flow without distinguishing whether the cause is internal or external; we
can use the term ________ only when the event is externally caused.
#11
In a __________, the address to which control is transferred is determined by
the cause of the exception, possibly added to a base register that points to a
memory range for the corresponding handlers.
#12
________ is a style of instruction set architecture that launches many
operations that are defined to be independent in a single-wide instruction, typically
with many separate opcode fields.
#13
An __________ between instructions i and j occurs when instruction j writes a
register or memory location that instruction i reads. The original ordering must be
preserved to ensure that i reads the correct value.
#14
Generally, the term _________ can refer to any unexpected change in control
flow without distinguishing the cause, while the term _________ refers to externally
caused events.
#15
In a _________, the address to which control is transferred is determined by
the cause of the exception, possibly added to a base register.
#16
A pipeline _________ is a temporary halt in the execution of instructions to
resolve data hazards or control hazards.
#17
Dynamic issue decisions in superscalars are extended to include _________,
which chooses which instructions to execute in a given clock cycle to avoid
hazards.
#18
Static branch prediction relies on predetermined patterns, such as assuming
a branch will always be taken or not taken, based on typical branch behavior,
which helps in simplifying the branch prediction process but may not always
accurately reflect dynamic _________ patterns.
#19
Dynamic branch prediction, on the other hand, uses hardware to track the
actual behavior of branches over time and adapt predictions accordingly, providing
a more accurate method for handling branches by learning from recent _________
outcomes.
#20
The critical path analysis determines the longest delay in a pipeline stage,
which affects the clock period and overall pipeline performance, highlighting the
importance of minimizing delays to enhance _________.
#21
Forwarding, also referred to as bypassing, is a technique used to resolve data
hazards by directly routing the output of one pipeline stage to a subsequent stage
without waiting for it to be written back to the register file, thus reducing
unnecessary _________.
#22
The three primary units of a dynamically scheduled pipeline include the
_________ unit, _________ unit, and the _________ unit.
#23
#24
#25
Chapter 5
#1. ________ locality is the locality principle stating that if a data location is
referenced, then it will tend to be referenced again soon.
#2. ________ locality is the locality principle stating that if a data location is
referenced, data locations with nearby addresses will tend to be referenced soon.
#3. ________ is a structure that uses multiple levels of memories; as the distance
from the processor increases, the size of the memories and the access time both
increase while the cost per bit decreases.
#4. Which of the following statements are generally true?
1. Memory hierarchies take advantage of temporal locality.
2. On a read, the value returned depends on which blocks are in the cache.
3. Most of the cost of the memory hierarchy is at the highest level.
4. Most of the capacity of the memory hierarchy is at the lowest level.
#5. _______ is the time required for the desired sector of a disk to rotate under
the read/write head; usually assumed to be half the rotation time.
#6. How many total bits are required for a direct-mapped cache with 16 KiB of
data and four-word blocks, assuming a 64-bit address?
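A quick sketch of the arithmetic, assuming 4-byte words (so a four-word block is 16 bytes) and one valid bit per block:

```python
# Total bits for a direct-mapped cache: 16 KiB of data, 4-word blocks,
# 64-bit addresses, assuming 4-byte words and a valid bit per block.
addr_bits   = 64
data_bytes  = 16 * 1024
block_words = 4
word_bytes  = 4

block_bytes = block_words * word_bytes           # 16 bytes per block
n_blocks    = data_bytes // block_bytes          # 1024 blocks
index_bits  = n_blocks.bit_length() - 1          # 10
offset_bits = block_bytes.bit_length() - 1       # 4 (byte offset within a block)
tag_bits    = addr_bits - index_bits - offset_bits   # 50
bits_per_block = block_bytes * 8 + tag_bits + 1      # 128 data + 50 tag + 1 valid
total_bits  = n_blocks * bits_per_block
print(total_bits)   # 1024 * 179 = 183296 bits
```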
#7. Consider a cache with 64 blocks and a block size of 16 bytes. To what block
number does byte address 1200 map?
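The mapping can be checked directly: the block address is the byte address divided by the block size, taken modulo the number of blocks.

```python
block_size = 16                  # bytes per block
n_blocks   = 64
byte_addr  = 1200
block_addr  = byte_addr // block_size    # block address 75
cache_block = block_addr % n_blocks      # maps to cache block 11
print(cache_block)   # 11
```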
#8. _______ is a scheme that handles writes by updating values only to the block
in the cache, then writing the modified block to the lower level of the hierarchy.
#9. _______ is a scheme in which a level of the memory hierarchy is composed of
two independent caches that operate in parallel with each other, with one handling
instructions and one handling data.
#10. Assume the miss rate of an instruction cache is 2% and the miss rate of the
data cache is 4%. If a processor has a CPI of 2 without any memory stalls, and
the miss penalty is 100 cycles for all misses, determine how much faster a
processor would run with a perfect cache that never missed. Assume the
frequency of all loads and stores is 36%.
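A sketch of the standard memory-stall calculation, assuming every instruction accesses the instruction cache and 36% of instructions also access the data cache:

```python
# Effective CPI with cache misses, then the speedup of a perfect cache.
base_cpi     = 2.0
miss_penalty = 100
i_miss_rate  = 0.02    # instruction cache
d_miss_rate  = 0.04    # data cache
mem_frac     = 0.36    # fraction of instructions that are loads/stores

i_stalls = i_miss_rate * miss_penalty              # 2.00 cycles per instruction
d_stalls = mem_frac * d_miss_rate * miss_penalty   # 1.44 cycles per instruction
cpi_with_stalls = base_cpi + i_stalls + d_stalls   # 5.44
speedup = cpi_with_stalls / base_cpi               # 2.72x faster with perfect cache
print(f"speedup: {speedup:.2f}")
```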
#11. Find the average memory access time (AMAT) for a processor with a 1 ns
clock cycle time, a miss penalty of 20 clock cycles, a miss rate of 0.05 misses per
instruction, and a cache access time (including hit detection) of 1 clock cycle.
Assume that the read and write miss penalties are the same and ignore other
write stalls.
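The AMAT formula (hit time plus miss rate times miss penalty) can be evaluated directly:

```python
clock_ns     = 1.0     # 1 ns clock cycle
hit_cycles   = 1       # cache access time, including hit detection
miss_rate    = 0.05
miss_penalty = 20      # clock cycles

amat_cycles = hit_cycles + miss_rate * miss_penalty   # 2.0 cycles
amat_ns     = amat_cycles * clock_ns                  # 2.0 ns
print(amat_ns)
```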
#12. Assume there are three small caches, each consisting of four one-word
blocks. One cache is fully associative, a second is two-way set associative, and the
third is direct-mapped. Find the number of misses for each cache organization
given the following sequence of block addresses: 0, 8, 0, 6, and 8.
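One way to verify the counts is a small cache simulator; this sketch assumes LRU replacement for the associative organizations and models each organization as a set of per-set LRU queues.

```python
# LRU cache simulator for four one-word blocks at three associativities,
# run on the block-address sequence from the question.
from collections import OrderedDict

def count_misses(addresses, n_blocks, assoc):
    n_sets = n_blocks // assoc
    sets = [OrderedDict() for _ in range(n_sets)]   # insertion order = LRU order
    misses = 0
    for a in addresses:
        s = sets[a % n_sets]
        if a in s:
            s.move_to_end(a)                        # hit: mark most recently used
        else:
            misses += 1
            if len(s) == assoc:
                s.popitem(last=False)               # evict the least recently used
            s[a] = True
    return misses

seq = [0, 8, 0, 6, 8]
direct_misses = count_misses(seq, 4, 1)   # direct-mapped
twoway_misses = count_misses(seq, 4, 2)   # two-way set associative
full_misses   = count_misses(seq, 4, 4)   # fully associative
print(direct_misses, twoway_misses, full_misses)   # 5 4 3
```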
#13. Increasing associativity requires more comparators and more tag bits per
cache block. Assuming a cache of 4096 blocks, a four-word block size, and a
32-bit address, find the total number of sets and the total number of tag bits for
caches that are direct-mapped, two-way and four-way set associative, and fully
associative.
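The set and tag counts follow mechanically from the geometry; this sketch assumes a 16-byte (four-word) block, giving a 4-bit block offset:

```python
addr_bits   = 32
n_blocks    = 4096
offset_bits = 4                                   # 16-byte (four-word) blocks
block_addr_bits = addr_bits - offset_bits         # 28

results = {}
for assoc in (1, 2, 4, n_blocks):                 # DM, 2-way, 4-way, fully assoc.
    n_sets = n_blocks // assoc
    index_bits = n_sets.bit_length() - 1          # log2 of the set count
    tag_bits = block_addr_bits - index_bits
    results[assoc] = (n_sets, tag_bits * n_blocks)
    print(f"{assoc}-way: {n_sets} sets, {tag_bits * n_blocks} total tag bits")
```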
#14. Suppose we have a processor with a base CPI of 1.0, assuming all references
hit in the primary cache, and a clock rate of 4 GHz. Assume a main memory
access time of 100 ns, including all the miss handling. Suppose the miss rate per
instruction at the primary cache is 2%. How much faster will the processor be if
we add a secondary cache that has a 5 ns access time for either a hit or a miss
and is large enough to reduce the miss rate to main memory to 0.5%?
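A sketch of the two-level CPI comparison, converting each access time into 4 GHz clock cycles and treating the 0.5% figure as the global miss rate to main memory:

```python
clock_ghz  = 4.0
cycle_ns   = 1 / clock_ghz                 # 0.25 ns per cycle
base_cpi   = 1.0
mem_ns     = 100
l1_miss    = 0.02                          # primary-cache miss rate per instruction

mem_cycles = mem_ns / cycle_ns             # 400 cycles to main memory
cpi_one_level = base_cpi + l1_miss * mem_cycles            # 1 + 8 = 9.0

l2_cycles   = 5 / cycle_ns                 # 20 cycles for the L2 access
global_miss = 0.005                        # misses that still go to main memory
cpi_two_level = base_cpi + l1_miss * l2_cycles + global_miss * mem_cycles
speedup = cpi_one_level / cpi_two_level    # 9.0 / 3.4, about 2.6x
print(f"speedup with L2: {speedup:.2f}")
```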
#15. Assume a one-byte data value is 10011010 (binary). First show the Hamming
ECC code for that byte, and then invert bit 10 and show that the ECC code finds
and corrects the single-bit error. Leaving spaces for the parity bits, the 12-bit
pattern is _ _ 1 _ 0 0 1 _ 1 0 1 0.
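A minimal sketch of the even-parity SEC Hamming code assumed here: parity bits sit at positions 1, 2, 4, and 8, data bits fill the remaining positions left to right (matching the 12-bit pattern in the question), and the syndrome, the sum of the failing parity positions, names the flipped bit.

```python
def hamming_encode(data_bits):
    """Build a 12-bit SEC code word (returned as a list, position 1 first)."""
    code = [0] * 13                                           # index 1..12
    data_positions = [p for p in range(1, 13) if p & (p - 1)] # non-powers of two
    for pos, bit in zip(data_positions, data_bits):
        code[pos] = bit
    for p in (1, 2, 4, 8):   # even parity over every position whose index has bit p
        code[p] = sum(code[i] for i in range(1, 13) if i & p) % 2
    return code[1:]

def syndrome(code_word):
    """0 means no error; otherwise the 1-indexed position of the flipped bit."""
    bits = [0] + list(code_word)
    return sum(p for p in (1, 2, 4, 8)
               if sum(bits[i] for i in range(1, 13) if i & p) % 2)

word = hamming_encode([1, 0, 0, 1, 1, 0, 1, 0])   # data byte 10011010
print("".join(map(str, word)))                    # 011100101010
word[9] ^= 1                                      # invert bit 10 (1-indexed)
print("syndrome:", syndrome(word))                # 10 -> flip bit 10 back to correct
```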
#16. ______ is a cache that keeps track of recently used address mappings to try
to avoid an access to the page table.
#17. In a memory hierarchy like that of Figure 5.30, which includes a TLB and a
cache organized as shown, a memory reference can encounter three different
types of misses: a TLB miss, a page fault, and a cache miss. Consider all the
combinations of these three events with one or more occurring (seven
possibilities). For each possibility, state whether this event can actually occur and
under what circumstances.
#18. ______ addressed cache: A cache that is accessed with a virtual address
rather than a physical address. ______ addressed cache: A cache that is addressed
by a physical address.
#19. ______ is a changing of the internal state of the processor to allow a
different process to use the processor that includes saving the state needed to
return to the currently executing process.
#20. ______ is a cache miss that occurs in a set-associative or direct-mapped
cache when multiple blocks compete for the same set, and that would be
eliminated in a fully associative cache of the same size.
#21. Given that a multicore multiprocessor means multiple processors on a single
chip, these processors very likely share a common physical address space.
Caching shared data introduces a new problem, because the view of memory held
by two different processors is through their individual caches, which, without any
additional precautions, could end up seeing two distinct values. Figure 5.40
illustrates the problem and shows how two different processors can have two
different values for the same location. This difficulty is generally referred to as the
_______ problem.
#22. In _________ architecture, multiple instructions can be issued per clock cycle,
increasing the instruction throughput of the processor.
#23. _________ Execution allows instructions to be executed as resources become
available, rather than strictly in the order they appear, improving performance by
better utilizing available execution units.
#24. ________ is a special cache that stores recent translations of virtual
addresses to physical addresses, speeding up memory access.
#25. When data that is not reused frequently fills up the cache, it leads to
_________, which wastes cache space and degrades performance.
#26. If a virtual memory system has 2^48 bytes of virtual address space and a
page size of 4 KiB, how many pages can it address?
#27. A processor has a CPI of 2 without memory stalls, and the miss penalty is 50
cycles. If the instruction cache has a miss rate of 3%, what is the effective CPI?
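The effective CPI follows from adding the per-instruction stall cycles, assuming only instruction-cache misses stall the pipeline here:

```python
base_cpi  = 2.0
miss_rate = 0.03     # instruction cache
penalty   = 50       # cycles

effective_cpi = base_cpi + miss_rate * penalty   # 2 + 1.5 = 3.5
print(effective_cpi)
```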
#28. Consider a two-level cache hierarchy where the Level 1 (L1) cache has a hit
rate of 95% and the Level 2 (L2) cache has a hit rate of 90%. The access time for
the L1 cache is 1 ns, the access time for the L2 cache is 10 ns, and the access
time for main memory is 100 ns. Calculate the overall average memory access
time (AMAT).
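One common reading of the numbers (local hit rates, with each level's access time paid only when that level is visited) gives:

```python
l1_hit, l2_hit = 0.95, 0.90    # L2 hit rate taken as local (of L1 misses)
l1_t, l2_t, mem_t = 1, 10, 100 # access times in ns

amat = l1_t + (1 - l1_hit) * (l2_t + (1 - l2_hit) * mem_t)
print(amat)   # 1 + 0.05 * (10 + 0.1 * 100) = 2.0 ns
```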
#29. A disk has a track-to-track seek time of 3 ms and a full stroke seek time of
10 ms. If the average seek time is typically one-third of the full stroke seek time,
calculate the average seek time for this disk.
Chapter 6
#1 According to _______ law, the potential speedup of a process using multiple
processors is limited by the time needed for the sequential portions of the
process.
#2 ______ scaling is speed-up achieved on a multiprocessor without increasing the
size of the problem. ____ scaling is speed-up achieved on a multiprocessor while
increasing the size of the problem proportionally to the increase in the number
of processors.
#3 Strong scaling evaluates a system's performance improvement as more
processors are added while keeping the problem size _______. In weak scaling,
the system's performance is measured as the problem size increases in _______
with the number of processors, maintaining a constant workload per processor.
#4 To achieve the speed-up of 20.5 on the previous larger problem with 40
processors, we assumed the load was perfectly balanced. That is, each of the 40
processors had 2.5% of the work to do. Instead, show the impact on speed-up if
one processor’s load is higher than all the rest. Calculate at twice the load (5%)
and five times the load (12.5%) for that hardest working processor. How well
utilized are the rest of the processors?
#5 ______ is a set of computers connected over a local area network that function
as a single large multiprocessor.
#6 True or false: Clusters have separate memories and thus need many copies of
the operating system.
#7 _______ is a parallel computing architecture developed by NVIDIA, allowing
developers to use GPUs for general-purpose processing tasks beyond graphics.
#8 True or false: GPUs rely on graphics DRAM chips to reduce memory latency
and thereby increase performance on graphics applications.
#9 A _______ Architecture is designed to handle specific types of tasks or
applications more efficiently than general-purpose processors, often used in
areas like AI and graphics.
#10 A _______ Processing Unit is specialized hardware developed by Google to
accelerate machine learning workloads and improve the efficiency of neural
network computations.
#11 True or False? DSAs are more effective than CPUs or GPUs in their domains
primarily because you can justify using a much larger die for a domain.
#12 Explain roofline model.
#13 The _______ model provides a graphical representation of a computer system’s
performance based on its computational intensity and memory bandwidth.
#14 _______ Memory Access describes a memory architecture where the access
time to memory is the same for all processors in the system. In _______ Memory
Access, memory access times vary depending on the processor’s distance from
the memory location, impacting performance.
#15 ______ is a type of single address space multiprocessor in which some
memory accesses are much faster than others depending on which processor
asks for which word.
#16 True or false: Shared memory multiprocessors cannot take advantage of
task-level parallelism.
#17 ______ is a parallel processor with a single physical address space.
#18 _______ allows multiple processors to access shared memory and I/O devices
in a symmetrical and balanced manner, enhancing performance.
#19 Cache _______ mechanisms ensure that all processors in a multiprocessor
system have a consistent view of shared memory, preventing data inconsistency.
#20 ______ is a version of multithreading that lowers the cost of multithreading by
utilizing the resources needed for multiple issue, dynamically scheduled
microarchitecture.
#21 ______ is a version of hardware multithreading that implies switching between
threads after every instruction. ______ is a version of hardware multithreading
that implies switching between threads only after significant events, such as a
last-level cache miss.
#22 True or false: Both multithreading and multicore rely on parallelism to get
more efficiency from a chip.
#23 True or false: Simultaneous multithreading (SMT) uses threads to improve
resource utilization of a dynamically scheduled, out-of-order processor.
#24 ______ is the conventional MIMD programming model, where a single program
runs across all processors.
#25 True or false: As exemplified in the x86, multimedia extensions can be thought
of as a vector architecture with short vectors that support only contiguous
vector data transfers.
#26 _______ is a cloud computing model where software applications are delivered
over the Internet, making them accessible from anywhere and scalable on
demand.
#27 In a shared memory system, each of the 16 processors has a local cache. If
the cache hit rate is 95% and the average memory access time is 100
nanoseconds for a cache hit and 500 nanoseconds for a cache miss, what is the
average memory access time?
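This is a weighted average of the two access times:

```python
hit_rate = 0.95
hit_ns, miss_ns = 100, 500     # given access times for a hit and a miss

amat = hit_rate * hit_ns + (1 - hit_rate) * miss_ns   # 95 + 25 = 120 ns
print(amat)
```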
#28 Calculate the speedup achieved on a 20-processor system if the workload is
85% parallelizable and the remaining 15% must be executed sequentially.
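Amdahl's Law gives the bound directly:

```python
p, n = 0.85, 20                       # parallel fraction, processor count
speedup = 1 / ((1 - p) + p / n)       # 1 / (0.15 + 0.0425), about 5.19
print(f"speedup: {speedup:.2f}")
```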
#29 In a 32-processor NUMA system, the average remote memory access latency
is 300 nanoseconds, and the local memory access latency is 100 nanoseconds. If
25% of the memory accesses are remote, what is the overall average memory
access latency?
#30 A program that initially takes 250 seconds to run on a single processor has a
parallel fraction of 70%. How long will it take to run on 25 processors?
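The same Amdahl-style split, applied to execution time:

```python
t1   = 250.0                     # seconds on one processor
p, n = 0.70, 25                  # parallel fraction, processor count

t_n = t1 * ((1 - p) + p / n)     # 250 * (0.30 + 0.028) = 82 seconds
print(t_n)
```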