Computer architecture questions

These questions were collected from previous exams and tests, so you will find a new set of processor specifications inserted at various locations: the questions following use those processor specifications. You will also find some essentially identical questions! “Re-use” is a well-established software engineering principle: we use it for exam questions too!

Except where otherwise indicated, use the following operating system and processor characteristics in all questions. Your operating system uses 8 kbyte pages. The machine you are using has a 4-way set associative 32 kbyte unified L1 cache and a 64 entry fully associative TLB. Cache lines contain 32 bytes. Integer registers are 32 bits wide. Physical addresses are also 32 bits. It supports virtual addresses of 46 bits. 1 Gbyte of main memory is installed.

a) Give one advantage of a direct mapped cache.
b) What is the main disadvantage of a direct mapped cache?
c) How many sets does the cache contain?
d) How many comparators does the cache require?
e) How many bits do these comparators work on?
f) Your program is a text processor for large documents: in an initial check, it scans the document looking for illegal characters. For an 8 Mbyte document, what would you expect the L1 cache hit rate to be during the initial check? (You are expected to do a calculation and give an approximate numeric answer!)
g) Your program manipulates large arrays of data. In order to get consistently good performance, you should avoid one thing. What is it? (Be precise – a numeric answer relevant to the processor described above and an explanation is required here.)
h) What is the alternative to a unified cache? What advantages does it provide?
i) In addition to data and tags, a cache will have additional bits associated with each entry. List these bits and add a short phrase describing the purpose of each bit (or set of bits).
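For the set-count, comparator, and scan hit-rate questions above, a quick sanity check under the stated spec (32 kbyte 4-way unified L1, 32-byte lines, 32-bit physical addresses) can be sketched in a few lines of Python. The hit-rate figure assumes the text processor reads the document byte by byte in sequential order, which is an assumption, not something the question states:

```python
# Sanity check for the first processor spec: 32 kbyte 4-way set
# associative unified L1, 32-byte lines, 32-bit physical addresses.
CACHE_BYTES = 32 * 1024
WAYS = 4
LINE_BYTES = 32
PHYS_BITS = 32

sets = CACHE_BYTES // (WAYS * LINE_BYTES)        # 256 sets
comparators = WAYS                               # one tag compare per way
offset_bits = (LINE_BYTES - 1).bit_length()      # 5 bits of byte offset
index_bits = (sets - 1).bit_length()             # 8 bits of set index
tag_bits = PHYS_BITS - index_bits - offset_bits  # 19 bits per comparator

# Sequential byte scan: one compulsory miss per line, then 31 hits.
hit_rate = (LINE_BYTES - 1) / LINE_BYTES         # 31/32, roughly 97%

print(sets, comparators, tag_bits, hit_rate)
```

The same arithmetic, with the parameters swapped, answers the corresponding questions for the later processor specifications in this collection.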
(In all cases, make your answers concise: simply list any differences from a preceding answer.)
(i) A set-associative write-back cache
(ii) A set-associative write-through cache
(iii) A direct mapped cache
(iv) A fully associative cache
j) 32 processes are currently running. If the OS permitted each process to use the maximum possible address space, how many page table entries are required?
(i) Conventional page tables
(ii) Inverted page tables
k) Draw a diagram showing how the bits of a virtual address are used to generate a 32-bit physical address.
l) “A program which simply copies a large block of data from one memory location to another exhibits little locality of reference, therefore its performance is not improved by the presence of a cache.” Comment on this statement. Is it strictly true, mostly true or not true at all? Explain your answer. Assume you are running programs on the processor described at the beginning of this section.
m) You are advising a team of programmers writing a large scientific simulation program. The team mainly consists of CS graduates who skipped any study of computer architecture in their degrees. Performance is critical. List some simple things that you would advise them to do when writing code for this system. Provide a one sentence explanation for each point of advice. (1 mark for each valid piece of advice, 1 for explaining it and 1 for adding a number that makes the advice specific to the processor described earlier.)
-----------------------------------------------------------------------------------------------------------
Except where otherwise indicated, use the following operating system and processor characteristics in all questions. Your operating system uses 8 kbyte pages. The machine you are using has a 4-way set associative 32 kbyte unified L1 cache and a 128 entry fully associative TLB. Integer registers are 32 bits wide. Physical addresses are also 32 bits. It supports virtual addresses of 44 bits.
The bus is 64 bits wide: the most usual bus transaction has four data cycles. 1 Gbyte of main memory is installed.
n) Why would a processor execute both statements s1 and s2 from a compound statement: if ( condition ) s1; else s2;
o) What would you expect the cache line length to be?
p) How many comparators does the cache require?
q) How many bits do these comparators work on?
r) How many comparators does the TLB require?
s) How many bits do these comparators work on?
t) Your program is a text processor for large documents: in an initial check, it scans the document looking for illegal characters. For an 8 Mbyte document, what would you expect the hit rates to be during the initial check? (You are expected to do a calculation and give an approximate numeric answer!)
(i) Cache
(ii) TLB
u) Under what conditions would you expect to achieve 100% TLB hits? (Two answers required. For one, you are expected to do a calculation and give an approximate numeric answer!)
v) Why are caches built with long (ie more than 8 byte) lines? (Two reasons needed.)
w) Your program manipulates large arrays of data. In order to get consistently good performance, you should avoid one thing. What is it? (Be precise – a numeric answer relevant to the processor described above and an explanation is required here.)
x) What is the maximum number of page faults that can be generated by a single memory access? Explain your answer. (Assume the page fault handler is locked in memory and no other page faults are generated for pages of instructions.)
y) A system interface unit will often change the order in which memory accesses generated by the program are placed on the system bus. Give two examples of such re-orderings and explain why the order is changed.
z) Why does a read transaction check the write queue in a system interface unit?
aa) If you had only a limited number of transistors available for improving branch performance, what prediction logic would you add?
Why will it work?
bb) Why does successful branch prediction improve the performance of a processor?
cc) Give two examples of speculation in high performance processors. Add a sentence to explain how this improves processor performance.
dd) Considering all the ‘caches’ present in a high performance processor (instruction and data cache, TLB, branch history buffer, etc), which ones increase the performance of a program which simply copies data from one location to another? Which ones have little or no effect? A good answer will list each cache and add a sign for increase or decrease and add a single phrase of explanation.
ee) On a system with a 128Kbyte L1 cache for a program with a working data set of 2Mbytes:
a. Calculate the expected cache hit rate when no assumptions can be made about data access patterns.
b. Would you expect the actual hit rate to be better or worse than this? Why?
ff) Your system has a 4-way set associative cache with 4 32-bit words per cache line.
a. If the total cache size is 64kbytes, how many sets are there?
b. A physical address on this system has 32 bits. How much main memory can it accommodate?
c. What is the total number of bits needed for the cache? It’s a write-back cache.
d. How many comparators does this cache require?
e. How are the real addresses of lines in the same set related?
f. If this cache was fully associative,
i. How many comparators would be needed?
ii. How many overhead bits would be required?
iii. What would be the advantage (if any) obtained from the extra resources?
gg) Your OS uses a page size of 8kbytes.
a. What is the coverage of a 64 entry TLB?
b. What will the TLB hit rate be for a 2Mb working data set program (making no assumptions about access patterns)?
c. If the program sweeps through the data accessing 64-bit doubles in sequential order, what will the TLB hit rate be?
hh) It’s necessary for a processor to provide an instruction to invalidate all or part of a cache. Why?
To allow the locations into which a DMA operation will be performed to be flushed from the cache. To flush data belonging to an old page out of the cache on a page swap.
ii) The PowerPC allows you to mark some pages of memory as “not to be cached”. Why would you want to do this?
If you mark I/O buffers like this, then writes to them always go directly to memory (not wasting cache space) and they don’t need to be invalidated when new DMA operations are performed. Most OS’s will copy data from read buffers to a user’s address space immediately after it’s been read from the device, so there’s no advantage in caching it: it’s only ever read once.
jj) What benefit would you expect from a fully associative cache (compared to other cache organizations)?
kk) Despite this, fully associative data caches are rarely found. Why?
ll) TLBs are often fully associative caches. Referring to your answer to the previous question, explain why.
mm) What distinguishes write-through and write-back caches?
nn) Your processor has a 64kbyte 8-way set associative L1 cache with 32 byte lines. When writing programs that need to perform well on this machine, list two simple things you could do to get the maximum performance. (Your answer may mention things you would not do if you prefer!)
oo) How many sets does this cache have?
pp) What is the relationship between addresses of lines in the same set?
qq) Why is the previous answer relevant to ensuring that programs run efficiently?
rr) What is a potential pitfall of writing a program that uses the answer to (pp) above to ensure good performance?
ss) Your program spends most of its time scanning through documents which are usually about 2Mbytes long looking for key words. What would you expect the L1 cache hit rate to be? Your answer should consider only the hit rate for the document data, ie it can ignore small perturbations caused by hits or misses in program code, OS interrupts, etc.
tt) An OS supports pages of 4 kbytes.
Virtual addresses are 44 bits long. The TLB has 84 entries and is fully associative. What is the TLB coverage?
uu) “TLBs are just caches.” What is the ‘data’ stored in a TLB?
vv) Does this TLB present the same problem that questions (g), (h) and (i) refer to? Why?
ww) How many bits does the tag in this TLB have?
xx) How does an OS share pages between different processes or users?
yy) What benefits result from sharing pages? At least two answers required.
zz) Give an example of an instruction sequence which contains a data dependency. Indicate the dependency present.
aaa) Give an example of an instruction sequence which benefits from value forwarding hardware. Indicate why value forwarding helps.
-------------------------------------------------------------------------------------------------------
Except where otherwise stated, assume that all caches in the following questions have lines of 32 bytes and a total capacity of 64kbytes.
bbb) Why would a cache be built with such a long line? (Two reasons needed.)
ccc) If the cache is direct mapped, how many comparators are needed?
ddd) If the cache is fully associative, how many comparators does it need?
eee) If the cache is 8-way set associative, how many sets does it have?
fff) If the cache is 4-way set associative, what is the relationship between lines in the same set?
ggg) Virtual addresses have 48 bits and physical addresses have 32 bits. The OS uses a page size of 16kbytes. The cache is 4-way set associative. How many tags are present in the whole cache?
hhh) How many bits are needed for each tag?
iii) How many additional bits are needed per line? Indicate the purpose of each bit.
jjj) Draw a diagram showing how the bits of a 40-bit virtual address are used to generate a 32-bit physical address. Assume the cache is 4-way set associative and the OS has set the page size to 8kbytes.
kkk) What fields would you expect to find in a page table entry? Add a short phrase indicating the purpose of each field.
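The TLB coverage and tag questions above, and the tag-count question for the 64 kbyte 4-way cache with 32-byte lines, can be checked with a short calculation. Treating the cache as physically tagged is an assumption here (the question does not say):

```python
# TLB: 4 kbyte pages, 44-bit virtual addresses, 84-entry fully
# associative TLB. Note 84 is deliberately NOT a power of 2: a fully
# associative structure needs no index field, so any entry count works.
PAGE = 4 * 1024
VA_BITS = 44
TLB_ENTRIES = 84

tlb_coverage = TLB_ENTRIES * PAGE      # 344064 bytes = 336 kbytes
offset_bits = (PAGE - 1).bit_length()  # 12
tlb_tag_bits = VA_BITS - offset_bits   # fully associative: whole VPN, 32 bits

# Cache: 64 kbytes, 4-way, 32-byte lines, physical tags assumed.
CACHE, LINE, WAYS, PA_BITS = 64 * 1024, 32, 4, 32
tags = CACHE // LINE                   # 2048 tags, one per line
sets = tags // WAYS                    # 512 sets
cache_tag_bits = PA_BITS - (sets - 1).bit_length() - (LINE - 1).bit_length()  # 18
```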
lll) Why does a system interface unit provide separate queues for read and write transactions?
mmm) Why does a read transaction check the write queue in a system interface unit?
nnn) If a processor with a simple branch predictor sees a conditional branch, under what circumstances can it make a prediction which has a high probability of being successful? Why?
ooo) Why is the branch processing unit placed as early in the pipeline as possible?
ppp) Irrespective of the number and power of the individual processing elements in a parallel processor, what factor primarily determines whether the processor will be efficient?
qqq) Under what conditions would you expect to achieve 100% TLB hits? (Two answers required. For one, you are expected to do a calculation and give an approximate numeric answer!)
rrr) Why are caches built with long (ie more than 8 byte) lines? (Two reasons needed.)
sss) A superscalar processor has 6 functional units. What determines the maximum number of instructions that this processor can start every cycle?
ttt) List the functions of the instruction issue unit of a superscalar processor. No marks for “issue instructions” (somewhat obvious!) – list the other functions that the IIU performs.
uuu) You are trying to estimate the performance on your application of a superscalar processor with 8 functional units and a clock speed of 2GHz. It’s a conventional RISC machine. You decide to start by working out how many instructions the processor can complete every second. The application is a commercial one with no floating point operations. What question do you need to ask before you can make this estimate? (Alternatively: what piece of information do you need to find in the processor’s data sheets?)
vvv) Have dataflow architectures gone the way of the dinosaurs? Explain your answer.
www) Describe a situation in which it is beneficial for an OS to share pages between different processes or users.
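The 100% TLB hit question and the superscalar throughput estimate above can both be reduced to one-line calculations. The 128-entry TLB and 8 kbyte pages come from the earlier spec in this collection; the issue width of 4 is a made-up illustrative value, since not knowing it is precisely the point of the question:

```python
# 100% TLB hits (the calculated one of the two answers): the program's
# working set fits entirely inside TLB coverage.
TLB_ENTRIES, PAGE = 128, 8 * 1024       # from the earlier spec
max_working_set = TLB_ENTRIES * PAGE    # 1 Mbyte

# Superscalar estimate: peak throughput is bounded by the issue
# (or completion) width, not by the 8 functional units.
CLOCK_HZ = 2_000_000_000
ISSUE_WIDTH = 4                         # hypothetical data-sheet value
peak_instructions_per_second = ISSUE_WIDTH * CLOCK_HZ
```

The missing data-sheet figure in the estimate question is exactly this issue/completion width, which is usually smaller than the number of functional units.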
xxx) In a multiprocessor based on a common bus, why must the bus support a READ-MODIFY-WRITE transaction?
yyy) What is the purpose of the ‘snooping’ circuitry in a high performance processor?
zzz) The number of high performance processors that can be placed on a single bus when constructing a parallel processor is limited. How many processors would you expect to be able to support on a single bus? (You are not expected to give a precise answer: an approximate one appropriate to current technology will suffice.)
aaaa) Give as many reasons as you can for this limitation.
bbbb) What is an SIMD processor? Explain the acronym and describe the basic characteristics of an SIMD processor.
cccc) Give one example of a problem that can be solved effectively with an SIMD architecture processor.
dddd) Some modern high performance processors have capabilities of an SIMD machine. Explain this statement.
eeee) Sizes of components in computer architectures are usually powers of 2. Which of the following must be a power of 2? Answer ‘yes’ if it would require extraordinary effort on behalf of a compiler or operating system or a large amount of additional circuitry to manage a value other than a power of 2.
(i) The maximum number of bytes in a physical address space YES / NO
(ii) The actual amount of memory installed in a computer YES / NO
(iii) The number of lines in a direct mapped cache YES / NO
(iv) The number of lines in a fully associative cache YES / NO
(v) The number of ways in a set associative cache YES / NO
(vi) The number of entries in a fully associative TLB YES / NO
(vii) The number of bytes in a page YES / NO
(viii) The number of functional units in a superscalar machine YES / NO
(ix) The number of general purpose registers YES / NO
(x) The number of stack entries in a stack machine YES / NO
(xi) The number of stages in a pipeline YES / NO
(xii) The maximum number of primary op-codes (distinct instructions) in an architecture YES / NO
(xiii) The number of bus clocks for transferring data in the data phase of a bus transaction YES / NO
ffff) Describe a scenario in which you would prefer a write-back cache to a write-through one. Explain why a write-back cache should perform better. You can describe the scenario in English, pseudo-code, actual code or any other way that would describe it clearly and unambiguously.
gggg) Describe a scenario in which you would prefer a write-through cache to a write-back one. Explain why a write-through cache should perform better.
hhhh) Give one advantage and one disadvantage of a direct mapped cache.
Except where otherwise stated, assume that all caches in the following questions are 4-way set-associative, have lines of 64 bytes and a total capacity of 64kbytes.
iiii) The system bus is 64 bits wide. How many bus clock cycles are used to transfer data for the most common bus transaction?
jjjj) The system has a split address and data bus. List the overhead bus cycles needed for the common bus transaction. An overhead cycle is one which transfers no data. A simple name implying a function for each cycle will suffice. The list is started for you.
Address Bus Request
kkkk) The cache is write-through.
The machine emits 64-bit addresses. One bit is used for an LRU algorithm. What is the total number of bits in the cache? Count all overhead bits. Since you do not have a calculator, show your working, leading to a numeric expression which would, if fed into a calculator, give the final answer.
llll) What is the relationship between lines in the same set?
mmmm) Virtual addresses have 48 bits and physical addresses have 32 bits. The OS uses a page size of 16kbytes. How many tags are present in the whole cache?
nnnn) What fields would you expect to find in a page table entry? Add a short phrase indicating the purpose of each field.
oooo) Why does a read transaction check the write queue in a system interface unit?
An OS uses 8kbyte pages. It’s running on a system with a 44-bit virtual address space. A page table entry requires 4 bytes. Physical addresses are 32 bits. How much space is needed for the page table for each user?
pppp) If the page tables are inverted and the system can handle 256 simultaneous processes, how much space is needed for page tables?
qqqq) Show how the address emitted by a program running on the system in Q4 is translated into a physical address.
rrrr) Discuss the benefits (if any) of having separate TLBs for instructions and data.
ssss) A system with a 64kb cache exhibits a hit rate of 95% on a benchmark program. A cache access time is 1.8 cycles, so the pipeline is stalled for 1 cycle. Increasing the cache size to 128kb increases the hit rate to 98% and the cache access time to 2.0 cycles. The access time for a main memory access is 15 cycles. Is increasing the cache size a good idea?
tttt) Your processor’s L1 cache contains 16kB of data; it is organized as an 8-way set associative cache with lines of 64 bytes each. The processor has a 64 bit data bus.
a. How many data cycles would you expect in the most common bus transaction?
b. How many sets does this cache contain?
c.
You are designing a program to process matrices: what situation would you look out for? Be precise – supply a number in your answer!
d. How many comparators are required?
e. How many bits will be in each tag?
f. How many tags will this cache hold?
g. For an image processing program that works its way sequentially through 2Mbyte monochrome images (each pixel is one byte), what would you expect the hit rate to be?
h. If the image is stored in row-major order and a program processes the image column-by-column, what would you expect the hit rate to be?
i. For an engineering program that processes streams of double precision floats that have been captured on disc, what would you expect the hit rate to be?
uuuu) List the advantages of separate instruction and data caches.
vvvv) The OS manages pages of 8kB. The TLB has 128 entries.
a. What is the coverage of this TLB?
b. Your program needs to multiply matrices which contain 100x100 doubles. How would you expect the TLB to perform? Would there be any advantage to padding the matrices out to 128x128 elements? (Don’t forget the cache!)
wwww) Why does a typical branch predictor count the number of times that it predicted the branch direction successfully?
xxxx) Most elements of a typical processor are replicated 2^k times where k is an integer. Which of the following need to be 2^k in size? Interpret ‘need’ here to mean either that a considerable amount of extra circuitry would be required or that software would become considerably more complicated.
a. Maximum physical memory supported by a processor
b. Actual amount of physical memory installed in a processor
c. Number of lines in a
i. direct mapped cache
ii. fully associative cache
iii. set-associative cache
d. Number of entries in a TLB
e. Size of a page
f. Number of entries in a page table
g. Number of data phases in the most common bus transaction
yyyy) On a system with a 128Kbyte L1 cache for a program with a working data set of 2Mbytes:
a.
Calculate the expected cache hit rate when no assumptions can be made about data access patterns.
b. Would you expect the actual hit rate to be better or worse than this? Why?
Better – this assumes perfectly random access to everything. Loop variables, constants, etc are likely to have much better hit rates.
zzzz) Your system has a 4-way set associative cache with 4 32-bit words per cache line.
a. If the total cache size is 64kbytes, how many sets are there?
b. A physical address on this system has 32 bits. How much main memory can it accommodate?
c. What is the total number of bits needed for the cache? It’s a write-back cache.
d. How many comparators does this cache require?
e. How are the real addresses of lines in the same set related?
f. If this cache was fully associative,
i. How many comparators would be needed?
ii. How many overhead bits would be required?
iii. What would be the advantage (if any) obtained from the extra resources?
aaaaa) Your OS uses a page size of 8kbytes.
a. What is the coverage of a 64 entry TLB?
b. What will the TLB hit rate be for a 2Mb working data set program (making no assumptions about access patterns)?
c. If the program sweeps through the data accessing 64-bit doubles in sequential order, what will the TLB hit rate be?
bbbbb) An OS uses 8kbyte pages. It’s running on a system with a 44-bit virtual address space. A page table entry requires 4 bytes. Physical addresses are 32 bits. How much space is needed for the page table for each user?
ccccc) If the page tables are inverted and the system can handle 256 simultaneous processes, how much space is needed for page tables?
ddddd) Show how the address emitted by a program running on the system in Q4 is translated into a physical address.
eeeee) A system with a 64kb cache exhibits a hit rate of 95% on a benchmark program. A cache access time is 1.8 cycles, so the pipeline is stalled for 1 cycle.
Increasing the cache size to 128kb increases the hit rate to 98% and the cache access time to 2.0 cycles. The access time for a main memory access is 15 cycles. Is increasing the cache size a good idea?
fffff) What benefit would you expect from a fully associative cache (compared to other cache organizations)?

Parallel Processing

1.1 Superscalar processors
1. Draw a diagram showing how the instruction fetch and execution units of a superscalar processor are connected. Show the widths of the datapaths (in words – not bits; your diagram should be relevant to a 32-bit or 64-bit processor). Which factor primarily determines performance: the instruction issue width (number of instructions issued per cycle) or the number of functional units?
2. List the capabilities of the instruction fetch/despatch unit needed to make an effective superscalar processor.

1.2 Branch Prediction
3. Why does branch prediction speed up a processor? Two reasons – one to do with the effect of branches on performance, the other to do with the likelihood that prediction is possible.
4. If you only had a few transistors to implement a branch prediction system, what would you do? Why would it be effective?
5. In addition to the answer that you almost certainly gave for the previous question, describe a scenario where you would expect branch prediction to be successful.
6. Describe the status bits in a branch target buffer.
7. Does it make sense to have both branch prediction and speculative execution in the same processor? Explain your answer.

Atomic Instructions
8. Is an atomic instruction, such as a test-and-set instruction, necessary for (a) a single processor running a multi-threaded OS and (b) a shared memory parallel processor? In each case, explain your answer.
9. A computer bus must support READ and WRITE commands. List some other commands that it must support. Consider also the situation when the processors have snooping caches.

1.3 Programming Model
10.
What does the ‘Shared Memory’ programming model imply?
11. Distinguish between a ‘Uniform Memory Access’ system and a ‘Non-Uniform Memory Access’ system.
12. When using a shared memory, cache coherence transactions are expensive and could potentially clog up a bus so much that the bandwidth available to useful transactions becomes very low. A hacker (who’s spent his time reading the manual for his processor on the net instead of going to SE363 lectures) discovers that there’s an instruction that will disable the cache and decides that this will solve the problem. Why is this likely to be a bad idea?
13. (State transition diagram for the MESI protocol inserted here.)
14. Explain the significance of each state in a MESI protocol.
15. Using the diagram, describe a scenario that would lead to the …………… transition in the diagram.

1.4 Other Parallel Processors
16. Why does a VLIW machine need a good optimizing compiler?
17. Where can you find a small dataflow machine in every high performance processor?