Computer architecture questions

These questions were collected from previous exams and tests, so you will find a
new set of processor specifications inserted at various locations: the questions
that follow each set use those specifications. You will also find some essentially
identical questions! “Re-use” is a well-established software engineering
principle: we use it for exam questions too!
Except where otherwise indicated, use the following operating system and processor
characteristics in all questions.
Your operating system uses 8 kbyte pages. The machine you are using has a 4-way
set associative 32 kbyte unified L1 cache and a 64 entry fully associative TLB.
Cache lines contain 32 bytes. Integer registers are 32 bits wide. Physical addresses
are also 32 bits. It supports virtual addresses of 46 bits. 1 Gbyte of main memory is
installed.
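The geometry implied by this specification recurs in several questions below. A short illustrative sketch (not part of the original question set; the helper name is my own) derives the quantities once, as a sanity check:

```python
# Hypothetical helper: derives the cache and TLB geometry implied by the
# specification above (32 kbyte 4-way cache, 32-byte lines, 64-entry TLB,
# 8 kbyte pages). Sizes are powers of 2, so bit_length()-1 gives log2.

def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, index_bits, offset_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    index_bits = sets.bit_length() - 1
    offset_bits = line_bytes.bit_length() - 1
    return sets, index_bits, offset_bits

sets, index_bits, offset_bits = cache_geometry(32 * 1024, 4, 32)
print(sets, index_bits, offset_bits)   # 256 sets, 8 index bits, 5 offset bits

# TLB coverage: entries x page size.
print(64 * 8 * 1024)                   # 524288 bytes = 512 kbyte
```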
a) Give one advantage of a direct mapped cache.
b) What is the main disadvantage of a direct mapped cache?
c) How many sets does the cache contain?
d) How many comparators does the cache require?
e) How many bits do these comparators work on?
f) Your program is a text processor for large documents: in an initial check, it scans
the document looking for illegal characters. For an 8 Mbyte document, what
would you expect the L1 cache hit rate to be during the initial check? (You are
expected to do a calculation and give an approximate numeric answer!)
g) Your program manipulates large arrays of data. In order to obtain consistently good
performance, you should avoid one thing. What is it? (Be precise – a numeric
answer relevant to the processor described above and an explanation is required
here.)
h) What is the alternative to a unified cache? What advantages does it provide?
i) In addition to data and tags, a cache will have additional bits associated with each
entry. List these bits and add a short phrase describing the purpose of each bit (or
set of bits). (In all cases, make your answers concise: simply list any differences
from a preceding answer.)
   (i) A set-associative write-back cache
   (ii) A set-associative write-through cache
   (iii) A direct mapped cache
   (iv) A fully associative cache
j) 32 processes are currently running. If the OS permitted each process to use the
maximum possible address space, how many page table entries are required?
   (i) Conventional page tables
   (ii) Inverted page tables
k) Draw a diagram showing how the bits of a virtual address are used to generate a
32-bit physical address.
l) “A program which simply copies a large block of data from one memory location
to another exhibits little locality of reference, therefore its performance is not
improved by the presence of a cache.” Comment on this statement. Is it strictly
true, mostly true or not true at all? Explain your answer. Assume you are running
programs on the processor described at the beginning of this section.
m) You are advising a team of programmers writing a large scientific simulation
program. The team mainly consists of CS graduates who skipped any study of
computer architecture in their degrees. Performance is critical.
List some simple things that you would advise them to do when writing code for
this system. Provide a one sentence explanation for each point of advice.
(1 mark for each valid piece of advice, 1 for explaining it and 1 for adding a
number that makes the advice specific to the processor described earlier.)
-----------------------------------------------------------------------------------------------------------
Except where otherwise indicated, use the following operating system and processor
characteristics in all questions.
Your operating system uses 8 kbyte pages. The machine you are using has a 4-way
set associative 32 kbyte unified L1 cache and a 128 entry fully associative TLB.
Integer registers are 32 bits wide. Physical addresses are also 32 bits. It supports
virtual addresses of 44 bits. The bus is 64 bits wide: the most usual bus transaction
has four data cycles. 1 Gbyte of main memory is installed.
n) Why would a processor execute both statements s1 and s2 from a compound
statement:
   if ( condition ) s1;
   else s2;
o) What would you expect the cache line length to be?
p) How many comparators does the cache require?
q) How many bits do these comparators work on?
r) How many comparators does the TLB require?
s) How many bits do these comparators work on?
t) Your program is a text processor for large documents: in an initial check, it scans
the document looking for illegal characters. For an 8 Mbyte document, what
would you expect the hit rates to be during the initial check? (You are expected to
do a calculation and give an approximate numeric answer!)
   (i) Cache
   (ii) TLB
u) Under what conditions would you expect to achieve 100% TLB hits? (Two
answers required. For one, you are expected to do a calculation and give an
approximate numeric answer!)
v) Why are caches built with long (ie more than 8 byte) lines? (Two reasons
needed.)
w) Your program manipulates large arrays of data. In order to obtain consistently good
performance, you should avoid one thing. What is it? (Be precise – a numeric
answer relevant to the processor described above and an explanation is required
here.)
x) What is the maximum number of page faults that can be generated by a single memory
access? Explain your answer. (Assume the page fault handler is locked in
memory and no other page faults are generated for pages of instructions.)
y) A system interface unit will often change the order in which memory accesses
generated by the program are placed on the system bus. Give two examples of
such re-orderings and explain why the order is changed.
z) Why does a read transaction check the write queue in a system interface unit?
aa) If you had only a limited number of transistors available for improving branch
performance, what prediction logic would you add? Why will it work?
bb) Why does successful branch prediction improve the performance of a processor?
cc) Give two examples of speculation in high performance processors. Add a
sentence to explain how this improves processor performance.
dd) Considering all the ‘caches’ present in a high performance processor (instruction
and data cache, TLB, branch history buffer, etc), which ones increase the
performance of a program which simply copies data from one location to another?
Which ones have little or no effect? A good answer will list each cache and add a
sign for increase or decrease and add a single phrase of explanation.
ee) On a system with a 128Kbyte L1 cache for a program with a working data set of
2Mbytes:
   a. Calculate the expected cache hit rate when no assumptions can be made about
      data access patterns.
   b. Would you expect the actual hit rate to be better or worse than this? Why?
ff) Your system has a 4-way set associative cache with 4 32-bit words per cache line.
   a. If the total cache size is 64kbytes, how many sets are there?
   b. A physical address on this system has 32 bits. How much main memory can
      it accommodate?
   c. What is the total number of bits needed for the cache? It’s a write-back
      cache.
   d. How many comparators does this cache require?
   e. How are the real addresses of lines in the same set related?
   f. If this cache was fully associative,
      i. How many comparators would be needed?
      ii. How many overhead bits would be required?
      iii. What would be the advantage (if any) obtained from the extra resources?
gg) Your OS uses a page size of 8kbytes.
   a. What is the coverage of a 64 entry TLB?
   b. What will the TLB hit rate be for a 2Mbyte working data set program (making
      no assumptions about access patterns)?
   c. If the program sweeps through the data accessing 64-bit doubles in
      sequential order, what will the TLB hit rate be?
hh) It’s necessary for a processor to provide an instruction to invalidate all or part of a
cache. Why?
   To allow the locations into which a DMA operation will be performed to be
   flushed from the cache.
   To flush data belonging to an old page out of the cache on page swap.
ii) The PowerPC allows you to mark some pages of memory as “not to be cached”.
Why would you want to do this?
   If you mark I/O buffers like this, then writes to them always go directly to memory
   (not wasting cache space) and they don’t need to be invalidated when new DMA
   operations are performed. Most OSs will copy data from read buffers to a user’s
   address space immediately after it’s been read from the device, so there’s no
   advantage in caching it: it’s only ever read once.
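TLB questions of the kind above can be approached the same way as the cache ones. A hedged sketch under the stated assumptions (8 kbyte pages, 64-entry TLB, sequential sweep, working set much larger than the TLB coverage):

```python
# Illustrative TLB arithmetic: coverage, and the hit rate for a sequential
# sweep through 8-byte doubles (one TLB miss per page, hits on the rest).

PAGE = 8 * 1024      # page size from the OS specification
ENTRIES = 64         # fully associative TLB entries

coverage = ENTRIES * PAGE
print(coverage)                   # 524288 bytes = 512 kbyte

accesses_per_page = PAGE // 8     # 8-byte doubles, accessed sequentially
hit_rate = (accesses_per_page - 1) / accesses_per_page
print(hit_rate)                   # 1023/1024, roughly 99.9%
```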
jj) What benefit would you expect from a fully associative cache (compared to other
cache organizations)?
kk) Despite this, fully associative data caches are rarely found. Why?
ll) TLBs are often fully associative caches. Referring to your answer to the previous
question, explain why.
mm) What distinguishes write-through and write-back caches?
nn) Your processor has a 64kbyte 8-way set associative L1 cache with 32 byte lines.
When writing programs that need to perform well on this machine, list two simple
things you could do to get the maximum performance. (Your answer may
mention things you would not do if you prefer!)
oo) How many sets does this cache have?
pp) What is the relationship between addresses of lines in the same set?
qq) Why is the previous answer relevant to ensuring that programs run efficiently?
rr) What is a potential pitfall of writing a program that uses the answer to (pp) above
to ensure good performance?
ss) Your program spends most of its time scanning through documents which are
usually about 2Mbytes long looking for key words. What would you expect the
L1 cache hit rate to be? Your answer should consider only the hit rate for the
document data, ie it can ignore small perturbations caused by hits or misses in
program code, OS interrupts, etc.
tt) An OS supports pages of 4 kbytes. Virtual addresses are 44 bits long. The TLB
has 84 entries and is fully associative. What is the TLB coverage?
uu) “TLBs are just caches.” What is the ‘data’ stored in a TLB?
vv) Does this TLB present the same problem that questions (g), (h) and (i) refer to?
Why?
ww) How many bits does the tag in this TLB have?
xx) How does an OS share pages between different processes or users?
yy) What benefits result from sharing pages? At least two answers required.
zz) Give an example of an instruction sequence which contains a data dependency.
Indicate the dependency present.
aaa) Give an example of an instruction sequence which benefits from value forwarding
hardware. Indicate why value forwarding helps.
-------------------------------------------------------------------------------------------------------
Except where otherwise stated, assume that all caches in the following questions
have lines of 32 bytes and a total capacity of 64kbytes.
bbb) Why would a cache be built with such a long line? (Two reasons needed.)
ccc) If the cache is direct mapped, how many comparators are needed?
ddd) If the cache is fully associative, how many comparators does it need?
eee) If the cache is 8-way set associative, how many sets does it have?
fff) If the cache is 4-way set associative, what is the relationship between lines in the
same set?
ggg) Virtual addresses have 48 bits and physical addresses have 32 bits. The OS uses a
page size of 16kbytes. The cache is 4-way set associative. How many tags are
present in the whole cache?
hhh) How many bits are needed for each tag?
iii) How many additional bits are needed per line? Indicate the purpose of each bit.
jjj) Draw a diagram showing how the bits of a 40-bit virtual address are used to
generate a 32-bit physical address. Assume the cache is 4-way set associative and
the OS has set the page size to 8kbytes.
kkk) What fields would you expect to find in a page table entry? Add a short phrase
indicating the purpose of each field.
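For (ggg) and (hhh), a worked sketch under the assumptions stated in this section (64 kbyte 4-way cache, 32-byte lines) and a physically tagged cache, so the tag comes from the 32-bit physical address:

```python
# Tag count and tag width for a 64 kbyte, 4-way, 32-byte-line cache with
# 32-bit physical addresses. Assumes physical tagging; a virtually tagged
# cache would draw its tag from the 48-bit virtual address instead.

SIZE, WAYS, LINE, PHYS_BITS = 64 * 1024, 4, 32, 32

lines = SIZE // LINE                 # one tag per line
sets = lines // WAYS
index_bits = sets.bit_length() - 1   # log2(512) = 9
offset_bits = LINE.bit_length() - 1  # log2(32) = 5
tag_bits = PHYS_BITS - index_bits - offset_bits

print(lines)       # 2048 tags in the whole cache
print(tag_bits)    # 32 - 9 - 5 = 18 bits per tag
```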
lll) Why does a system interface unit provide separate queues for read and write
transactions?
mmm) Why does a read transaction check the write queue in a system interface unit?
nnn) If a processor with a simple branch predictor sees a conditional branch, under
what circumstances can it make a prediction which has a high probability of being
successful? Why?
ooo) Why is the branch processing unit placed as early in the pipeline as possible?
ppp) Irrespective of the number and power of the individual processing elements in a
parallel processor, what factor primarily determines whether the processor will be
efficient?
qqq) Under what conditions would you expect to achieve 100% TLB hits? (Two
answers required. For one, you are expected to do a calculation and give an
approximate numeric answer!)
rrr) Why are caches built with long (ie more than 8 byte) lines? (Two reasons
needed.)
sss) A superscalar processor has 6 functional units. What determines the maximum
number of instructions that this processor can start every cycle?
ttt) List the functions of the instruction issue unit of a superscalar processor. No
marks for “issue instructions” (somewhat obvious!) – list the other functions that
the IIU performs.
uuu) You are trying to estimate the performance on your application of a superscalar
processor with 8 functional units and a clock speed of 2GHz. It’s a conventional
RISC machine. You decide to start by working out how many instructions the
processor can complete every second. The application is a commercial one with
no floating point operations. What question do you need to ask before you can
make this estimate? (Alternatively: what piece of information do you need to find
in the processor’s data sheets?)
vvv) Have dataflow architectures gone the way of the dinosaurs? Explain your
answer.
www) Describe a situation in which it is beneficial for an OS to share pages between
different processes or users.
xxx) In a multiprocessor based on a common bus, why must the bus support a READ-MODIFY-WRITE transaction?
yyy) What is the purpose of the ‘snooping’ circuitry in a high performance processor?
zzz) The number of high performance processors that can be placed on a single bus
when constructing a parallel processor is limited. How many processors would
you expect to be able to support on a single bus? (You are not expected to give a
precise answer: an approximate one appropriate to current technology will
suffice.)
aaaa) Give as many reasons as you can for this limitation.
bbbb) What is an SIMD processor? Explain the acronym and describe the basic
characteristics of an SIMD processor.
cccc) Give one example of a problem that can be solved effectively with an SIMD
architecture processor.
dddd) Some modern high performance processors have capabilities of an SIMD machine.
Explain this statement.
eeee) Sizes of components in computer architectures are usually powers of 2.
Which of the following must be a power of 2?
Answer ‘yes’ if it would require extraordinary effort on behalf of a compiler or
operating system or a large amount of additional circuitry to manage a value other
than a power of 2.
(i) The maximum number of bytes in a physical address space
YES / NO
(ii) The actual amount of memory installed in a computer
YES / NO
(iii) The number of lines in a direct mapped cache
YES / NO
(iv) The number of lines in a fully associative cache
YES / NO
(v) The number of ways in a set associative cache
YES / NO
(vi) The number of entries in a fully associative TLB
YES / NO
(vii) The number of bytes in a page
YES / NO
(viii) The number of functional units in a superscalar machine
YES / NO
(ix) The number of general purpose registers
YES / NO
(x) The number of stack entries in a stack machine
YES / NO
(xi) The number of stages in a pipeline
YES/ NO
(xii) The maximum number of primary op-codes (distinct instructions)
in an architecture
YES / NO
(xiii) The number of bus clocks for transferring data in the data phase of a bus
transaction
YES / NO
ffff) Describe a scenario in which you would prefer a write-back cache to a write-through
one. Explain why a write-back cache should perform better.
You can describe the scenario in English, pseudo-code, actual code or any other
way that would describe it clearly and unambiguously.
gggg) Describe a scenario in which you would prefer a write-through cache to a write-back
one. Explain why a write-through cache should perform better.
hhhh) Give one advantage and one disadvantage of a direct mapped cache.
Except where otherwise stated, assume that all caches in the following questions are 4-way
set-associative, have lines of 64 bytes and a total capacity of 64kbytes.
iiii) The system bus is 64 bits wide. How many bus clock cycles are used to transfer
data for the most common bus transaction?
jjjj) The system has a split address and data bus. List the overhead bus cycles needed
for the common bus transaction. An overhead cycle is one which transfers no
data. A simple name implying a function for each cycle will suffice. The list is
started for you.
   Address Bus Request
kkkk) The cache is write-through. The machine emits 64-bit addresses. One bit is used
for an LRU algorithm. What is the total number of bits in the cache? Count all
overhead bits.
Since you do not have a calculator, show your working leading to a numeric
expression which would, if fed into a calculator, give the final answer.
llll) What is the relationship between lines in the same set?
mmmm) Virtual addresses have 48 bits and physical addresses have 32 bits. The
OS uses a page size of 16kbytes. How many tags are present in the whole cache?
nnnn) What fields would you expect to find in a page table entry? Add a short phrase
indicating the purpose of each field.
oooo) Why does a read transaction check the write queue in a system interface unit?
An OS uses 8kbyte pages. It’s running on a system with a 44-bit virtual address
space. A page table entry requires 4 bytes. Physical addresses are 32 bits.
How much space is needed for the page table for each user?
pppp) If the page tables are inverted and the system can handle 256 simultaneous
processes, how much space is needed for page tables?
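The page-table sizing above can be estimated with a short sketch. Assumptions (mine, not stated by the question): a flat single-level table for the conventional case, and one entry per physical frame for the inverted case; real systems use multi-level or hashed tables precisely to avoid the first number below.

```python
# Back-of-envelope page-table sizes: 44-bit virtual addresses, 32-bit
# physical addresses, 8 kbyte pages, 4-byte page table entries.

VA_BITS, PA_BITS, PAGE, PTE = 44, 32, 8 * 1024, 4

offset_bits = PAGE.bit_length() - 1           # 13 bits of page offset
virtual_pages = 1 << (VA_BITS - offset_bits)  # 2**31 pages per process
print(virtual_pages * PTE)                    # 8 GiB per user for a flat table

# Inverted tables: one entry per physical frame, shared by all processes,
# so the 256-process limit does not multiply the table size.
frames = 1 << (PA_BITS - offset_bits)         # 2**19 frames
print(frames * PTE)                           # 2 MiB
```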
qqqq) Show how the address emitted by a program running on the system in Q4 is
translated into a physical address.
rrrr) Discuss the benefits (if any) of having separate TLBs for instructions and data.
ssss) A system with a 64kb cache exhibits a hit rate of 95% on a benchmark program.
The cache access time is 1.8 cycles, so the pipeline is stalled for 1 cycle.
Increasing the cache size to 128kb increases the hit rate to 98% and the cache
access time to 2.0 cycles. The access time for a main memory access is 15 cycles.
Is increasing the cache size a good idea?
tttt) Your processor’s L1 cache contains 16kB of data; it is organized as an 8-way set
associative cache with lines of 64 bytes each. The processor has a 64 bit data bus.
   a. How many data cycles would you expect in the most common bus
      transaction?
   b. How many sets does this cache contain?
   c. You are designing a program to process matrices: what situation would
      you look out for? Be precise – supply a number in your answer!
   d. How many comparators are required?
   e. How many bits will be in each tag?
   f. How many tags will this cache hold?
   g. For an image processing program that works its way sequentially through
      2Mbyte monochrome images (each pixel is one byte), what would you
      expect the hit rate to be?
   h. If the image is stored in row-major order and a program processes the
      image column-by-column, what would you expect the hit rate to be?
   i. For an engineering program that processes streams of double precision
      floats that have been captured on disc, what would you expect the hit rate
      to be?
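The contrast between parts (g) and (h) can be sketched numerically. Assumed model (mine): one compulsory miss per 64-byte line for the sequential sweep, and an image far larger than the 16 kbyte cache:

```python
# Hit-rate reasoning for sequential (g) vs column-by-column (h) image access.

LINE, CACHE = 64, 16 * 1024

# (g) Row-major sweep over 1-byte pixels: one miss per line, then 63 hits.
print((LINE - 1) / LINE)          # 0.984375

# (h) Column-by-column over a row-major image: successive accesses are a
# full row apart, so each touches a different line. A single column touches
# far more lines than the cache can hold, so lines are evicted before the
# next column could reuse them: the hit rate approaches 0.
print(CACHE // LINE)              # only 256 lines in the whole cache
```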
uuuu) List the advantages of separate instruction and data caches.
vvvv) The OS manages pages of 8kB. The TLB has 128 entries.
   a. What is the coverage of this TLB?
   b. Your program needs to multiply matrices which contain 100x100 doubles.
      How would you expect the TLB to perform?
      Would there be any advantage to padding the matrices out to 128x128
      elements? (Don’t forget the cache!)
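Rough numbers behind (vvvv), under the stated assumptions (8 kB pages, 128-entry TLB, 8-byte doubles); the closing observation about padding is a general caution about power-of-2 strides, not the full model answer:

```python
# TLB coverage versus matrix footprint for the multiplication in (vvvv).

PAGE, ENTRIES = 8 * 1024, 128
matrix_bytes = 100 * 100 * 8        # one matrix of 100x100 doubles

print(ENTRIES * PAGE)               # 1048576 bytes = 1 Mbyte of coverage
print(matrix_bytes)                 # 80000 bytes: only ~10 pages per matrix

# Three such matrices fit comfortably under the coverage, so the TLB should
# perform well. Padding to 128x128 makes the row stride a power of 2, which
# can map rows onto the same cache sets and cause conflict misses.
```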
wwww) Why does a typical branch predictor count the number of times that it
predicted the branch direction successfully?
xxxx) Most elements of a typical processor are replicated 2^k times where k is an integer.
Which of the following need to be 2^k in size? Interpret ‘need’ here to mean
either that a considerable amount of extra circuitry would be required or that
software would become considerably more complicated.
   a. Maximum physical memory supported by a processor.
   b. Actual amount of physical memory installed in a computer.
   c. Number of lines in a
      i. direct mapped cache
      ii. fully associative cache
      iii. set-associative cache
   d. Number of entries in a TLB
   e. Size of a page
   f. Number of entries in a page table
   g. Number of data phases in the most common bus transaction
yyyy) On a system with a 128Kbyte L1 cache for a program with a working data set of
2Mbytes:
   a. Calculate the expected cache hit rate when no assumptions can be made
      about data access patterns.
   b. Would you expect the actual hit rate to be better or worse than this? Why?
   Better – this assumes perfectly random access to everything. Loop variables, constants,
   etc are likely to have much better hit rates.
zzzz) Your system has a 4-way set associative cache with 4 32-bit words per cache line.
   a. If the total cache size is 64kbytes, how many sets are there?
   b. A physical address on this system has 32 bits. How much main memory
      can it accommodate?
   c. What is the total number of bits needed for the cache? It’s a write-back
      cache.
   d. How many comparators does this cache require?
   e. How are the real addresses of lines in the same set related?
   f. If this cache was fully associative,
      i. How many comparators would be needed?
      ii. How many overhead bits would be required?
      iii. What would be the advantage (if any) obtained from the extra
           resources?
aaaaa) Your OS uses a page size of 8kbytes.
   a. What is the coverage of a 64 entry TLB?
   b. What will the TLB hit rate be for a 2Mbyte working data set program
      (making no assumptions about access patterns)?
   c. If the program sweeps through the data accessing 64-bit doubles in
      sequential order, what will the TLB hit rate be?
bbbbb) An OS uses 8kbyte pages. It’s running on a system with a 44-bit virtual address
space. A page table entry requires 4 bytes. Physical addresses are 32 bits. How
much space is needed for the page table for each user?
ccccc) If the page tables are inverted and the system can handle 256 simultaneous
processes, how much space is needed for page tables?
ddddd) Show how the address emitted by a program running on the system in Q4 is
translated into a physical address.
eeeee) A system with a 64kb cache exhibits a hit rate of 95% on a benchmark program.
The cache access time is 1.8 cycles, so the pipeline is stalled for 1 cycle.
Increasing the cache size to 128kb increases the hit rate to 98% and the cache
access time to 2.0 cycles. The access time for a main memory access is 15 cycles.
Is increasing the cache size a good idea?
fffff) What benefit would you expect from a fully associative cache (compared to other
cache organizations)?
Parallel Processing
1.1 Superscalar processors
1. Draw a diagram showing how the instruction fetch and execution units of a
superscalar processor are connected. Show the widths of the datapath (in words - not
bits; your diagram should be relevant to a 32-bit or 64-bit processor). Which factor
primarily determines performance: the instruction issue width (number of instructions
issued per cycle) or the number of functional units?
2. List the capabilities of the instruction fetch/despatch unit needed to make an effective
superscalar processor.
1.2 Branch Prediction
3. Why does branch prediction speed up a processor?
Two reasons – one to do with the effect of branches on performance, the other to do
with the likelihood that prediction is possible.
4. If you only had a few transistors to implement a branch prediction system, what
would you do? Why would it be effective?
5. In addition to the answer that you almost certainly gave for the previous question,
describe a scenario where you would expect branch prediction to be successful.
6. Describe the status bits in a branch target buffer.
7. Does it make sense to have both branch prediction and speculative execution in the
same processor? Explain your answer.
Atomic Instructions
8. Is an atomic instruction, such as a test-and-set instruction, necessary for (a) a single
processor running a multi-threaded OS and (b) a shared memory parallel processor?
In each case, explain your answer.
9. A computer bus must support READ and WRITE commands. List some other
commands that it must support. Consider also the situation when the processors have
snooping caches.
1.3 Programming Model
10. What does the ‘Shared Memory’ programming model imply?
11. Distinguish between a ‘Uniform Memory Access’ system and a ‘Non-Uniform
Memory Access’ system.
12. When using a shared memory, cache coherence transactions are expensive and could
potentially clog up a bus so much that the bandwidth available to useful transactions
becomes very low. A hacker (who’s spent his time reading the manual for his
processor on the net instead of going to SE363 lectures) discovers that there’s an
instruction that will disable the cache and decides that this will solve the problem.
Why is this likely to be a bad idea?
13. (State transition diagram for the MESI protocol inserted here.)
14. Explain the significance of each state in a MESI protocol.
15. Using the diagram, describe a scenario that would lead to the …………… transition
in the diagram.
1.4 Other Parallel Processors
16. Why does a VLIW machine need a good optimizing compiler?
17. Where can you find a small dataflow machine in every high performance processor?