SOFTENG 363
THE UNIVERSITY OF AUCKLAND
FIRST SEMESTER, 2008
Campus: City

SOFTWARE ENGINEERING
Computer Architecture
Mid-Semester Test
(Time allowed: 75 mins)

NOTE: Permitted aids: One A4 sheet of handwritten notes. This sheet may not be a photocopy.

Instructions:
Calculators are not permitted. Annotated numeric expressions that can be trivially reduced to the correct answer with a calculator will be accepted for full marks. You are expected to be able to do arithmetic on powers of 2.
Answer as many questions as you can in the time allowed. There are 113 marks on this paper: a reasonable target is 100 marks.
Write the answers to questions in Section A on the examination paper. If you need additional space or wish to change your answer, you may use your answer book. Make sure to indicate clearly when answers to Section A questions appear in your answer book. A large “SAB” (See Answer Book) over wrong answers will suffice.
When you have finished, hand in Section A of the examination paper, your answer book and your A4 sheet of notes.

Section A
Short Answer Questions

The questions in this section can be answered with no more than two concise sentences, a short calculation or a simple diagram. Sometimes a single word will suffice. The marks allocated to each question give some indication of the length of answer required. Questions marked with an * continue on from the previous question: read all the questions in the group before answering.

Question 1 (a) (b) (c) (d)* (e)* (f)* (g)* (h)*

Except where otherwise indicated, use the following operating system and processor characteristics in all questions.
Your operating system uses 8 kbyte pages. The machine you are using has a 4-way set associative 32 kbyte unified L1 cache and a 64 entry fully associative TLB. Cache lines contain 32 bytes. Integer registers are 32 bits wide. Physical addresses are also 32 bits. The machine supports virtual addresses of 46 bits. 1 Gbyte of main memory is installed.
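The characteristics above fix a handful of derived quantities (line offset bits, number of sets, tag width, page offset bits) that the questions in this section rely on. The following is a minimal sketch of that arithmetic, not part of the original paper; the constant and function names are illustrative assumptions.

```python
# Values copied from the processor description above.
PAGE_BYTES = 8 * 1024      # 8 kbyte pages
CACHE_BYTES = 32 * 1024    # 32 kbyte unified L1 cache
WAYS = 4                   # 4-way set associative
LINE_BYTES = 32            # 32-byte cache lines
PA_BITS = 32               # physical address width
VA_BITS = 46               # virtual address width


def log2_exact(n):
    """Return log2(n), insisting that n is a positive power of two."""
    if n <= 0:
        raise ValueError(f"{n} is not a power of two")
    bits = n.bit_length() - 1
    if (1 << bits) != n:
        raise ValueError(f"{n} is not a power of two")
    return bits


def derive_geometry():
    """Derive cache and paging bit widths from the stated characteristics."""
    line_offset_bits = log2_exact(LINE_BYTES)
    num_sets = CACHE_BYTES // (WAYS * LINE_BYTES)
    set_index_bits = log2_exact(num_sets)
    page_offset_bits = log2_exact(PAGE_BYTES)
    return {
        "line_offset_bits": line_offset_bits,
        "num_sets": num_sets,
        "set_index_bits": set_index_bits,
        # Whatever is left of the physical address after index and offset.
        "tag_bits": PA_BITS - set_index_bits - line_offset_bits,
        "page_offset_bits": page_offset_bits,
        "virtual_page_number_bits": VA_BITS - page_offset_bits,
        "physical_frame_number_bits": PA_BITS - page_offset_bits,
    }


if __name__ == "__main__":
    for name, value in derive_geometry().items():
        print(f"{name}: {value}")
```

Keeping the inputs as named constants makes it easy to repeat the calculation for a different cache size or associativity.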
Give one advantage of a direct mapped cache.
1. Simple, fast hardware
2. Smaller circuitry (essentially the same as ‘simple’)
3. Only one comparator (essentially the same as ‘simple’)

What is the main disadvantage of a direct mapped cache?
Too many conflicts

How many sets does the cache contain?
Line: 32 = 2^5 bytes; set: 4 x 2^5 = 2^7 bytes
Cache capacity: 32 kbyte = 2^15 bytes
Number of sets = 2^15 / 2^7 = 2^8 = 256

How many comparators does the cache require?
One per way, so 4

How many bits do these comparators work on?
PA = 32 bits, divided into:
  Tag (remaining bits) | Set address (8 bits) | Address of byte within line (5 bits)
So tag = 32 - 8 - 5 = 19 bits
Comparators are 19-bit comparators

Your program is a text processor for large documents: in an initial check, it scans the document looking for illegal characters. For an 8 Mbyte document, what would you expect the L1 cache hit rate to be during the initial check? (You are expected to do a calculation and give an approximate numeric answer!)
This program works sequentially through the data. It generates a cache miss for the first byte in every line, and a cache hit for the remaining 31.
So L1 cache hit rate = 31/32 (sufficient for 4 marks) = 97%
Note that this is independent of the data size! Question (i) deals with the same issue.

Your program manipulates large arrays of data. In order to achieve consistently good performance, you should avoid one thing. What is it? (Be precise: a numeric answer relevant to the processor described above and an explanation is required here.)
General answer: Matrices with rows containing 2^(k+m) bytes, where k is the number of bits to address a set (= 8 here) and m is the number of bits to address a byte within a line (= 5 here).
So rows of 2^13 bytes will have a high probability of generating conflicts: elements at the same offset in successive rows map to the same set, which holds only 4 lines.
Alternative: Arrays of any structure containing 2^13 bytes may exhibit the same problem.

What is the alternative to a unified cache? What advantages does it provide?
Separate instruction and data caches
Main reason: Greater bandwidth to the caches; ability to access both caches simultaneously
Secondary reasons: Instruction cache can be simpler (so faster and uses fewer transistors); ability to store pre-decoded instructions in I-cache

Marks: 1 2 2 1 2 4 4 3

(i)* (j)* (k)* (l) (m) (v)

In addition to data and tags, a cache will have additional bits associated with each entry. List these bits and add a short phrase describing the purpose of each bit (or set of bits). (In all cases, make your answers concise: simply list any differences from a preceding answer.)
(i) A set-associative write-back cache
    Valid: cache data is valid
    Dirty: cache data is modified and needs to be written back
    LRU: used by the LRU algorithm to identify ‘old’ data or the best candidate for replacement
(ii) A set-associative write-through cache
    As (i), but Dirty not needed
(iii) A direct mapped cache
    As (i), but LRU not needed
(iv) A fully associative cache
    As (i), but more bits required for LRU

32 processes are currently running. If the OS permitted each process to use the maximum possible address space, how many page table entries are required?
(i) Conventional page tables
    Virtual address space = 2^46 bytes
    Page size = 2^13 bytes
    Number of PTEs per user = 2^46 / 2^13 = 2^33
    PTEs for 32 users = 2^33 x 2^5 = 2^38
(ii) Inverted page tables
    One PTE per page of physical memory
    Physical memory = 1 Gbyte = 2^30 bytes
    Number of PTEs = 2^30 / 2^13 = 2^17

Draw a diagram showing how the bits of a virtual address are used to generate a 32-bit physical address.
Diagram from lecture notes, annotated with the correct number of bits for this processor, needed for full marks.
Low order bits of VA: offset within page, 13 bits copied directly to the physical address (PA).
High order 33 bits of VA address a PTE, from which the high order 19 bits of PA are extracted.
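The translation just described can be sketched in a few lines. This is an illustrative model only, not taken from the lecture notes: a Python dict stands in for the page table, no TLB is modelled, and all function names are assumptions. The same bit widths also reproduce the page table entry counts from the preceding question.

```python
PAGE_OFFSET_BITS = 13   # 8 kbyte pages
VA_BITS = 46            # virtual address width
PA_BITS = 32            # physical address width
VPN_BITS = VA_BITS - PAGE_OFFSET_BITS   # 33-bit virtual page number
PFN_BITS = PA_BITS - PAGE_OFFSET_BITS   # 19-bit physical frame number


def split_virtual_address(va):
    """Split a virtual address into (virtual page number, page offset)."""
    if not 0 <= va < (1 << VA_BITS):
        raise ValueError("virtual address out of range")
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    vpn = va >> PAGE_OFFSET_BITS
    return vpn, offset


def translate(va, page_table):
    """Translate va using page_table, a dict mapping VPN -> frame number.

    The dict is a hypothetical stand-in for a real page table; a missing
    key (KeyError) plays the role of a page fault.
    """
    vpn, offset = split_virtual_address(va)
    pfn = page_table[vpn]
    if not 0 <= pfn < (1 << PFN_BITS):
        raise ValueError("frame number does not fit in the physical address")
    # The offset is copied unchanged; only the high-order bits are replaced.
    return (pfn << PAGE_OFFSET_BITS) | offset


def pte_counts(processes=32, memory_bytes=1 << 30):
    """Return (conventional, inverted) page table entry counts."""
    conventional = processes * (1 << VPN_BITS)
    inverted = memory_bytes >> PAGE_OFFSET_BITS
    return conventional, inverted


if __name__ == "__main__":
    table = {5: 3}                      # virtual page 5 -> physical frame 3
    va = (5 << PAGE_OFFSET_BITS) | 0x123
    print(hex(translate(va, table)))    # frame 3, same offset 0x123
    print(pte_counts())
```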
“A program which simply copies a large block of data from one memory location to another exhibits little locality of reference, therefore its performance is not improved by the presence of a cache.”
Comment on this statement. Is it strictly true, mostly true or not true at all? Explain your answer. Assume you are running programs on the processor described at the beginning of this section.
Not true: the best thing for the copying program to do is to work sequentially through memory. This means that it will exploit spatial locality: once one word from a line is loaded, all remaining words in the same cache line will lead to cache hits.

You are advising a team of programmers writing a large scientific simulation program. The team mainly consists of CS graduates who skipped any study of computer architecture in their degrees. Performance is critical. List some simple things that you would advise them to do when writing code for this system. Provide a one sentence explanation for each point of advice.
(1 mark for each valid piece of advice, 1 for explaining it and 1 for adding a number that makes the advice specific to the processor described earlier.)
1. Put frequently accessed data in small blocks (ensure spatial locality), to benefit from long cache lines; specifically, pack data into 32-byte chunks wherever possible.
2. Do not make matrices with rows that are 2^x bytes long, to reduce cache conflicts; specifically, avoid x = 13 for this processor.
3. Pack as much data into pages as possible, to reduce page thrashing; a page is 2^13 bytes on this processor.
4. Do not make random jumps through memory when accessing large structures, to keep spatial locality at all levels; packing into 32-byte lines or 8 kbyte pages is best for this processor.
5. Small ‘tight’ program loops will use the I-cache efficiently; critical sizes are 32 bytes or 8 instructions and 32 kbytes (L1 cache size).

This question deliberately left blank. No marks even if you do know the answer.
Predictably, several readers of Douglas Adams wrote 42 here

Marks: 8 2 2 4 Max 12 0

CONTINUED
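As a rough check on the Section A answers about sequential scans and rows of 2^13 bytes, the L1 cache described at the start of the section can be simulated in a few lines. This is a sketch under stated assumptions, not part of the original paper: it assumes true LRU replacement, models only the tag store (no data, no write policy), and the function names are illustrative.

```python
from collections import OrderedDict

LINE_BYTES = 32   # bytes per cache line
WAYS = 4          # 4-way set associative
NUM_SETS = 256    # 32 kbyte / (4 ways x 32 bytes)


def hit_rate(addresses):
    """Return the fraction of hits for a stream of byte addresses.

    Assumes LRU replacement within each set; each set is an OrderedDict
    whose keys are tags, ordered from least to most recently used.
    """
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // LINE_BYTES
        index = line % NUM_SETS
        tag = line // NUM_SETS
        ways = sets[index]
        if tag in ways:
            hits += 1
            ways.move_to_end(tag)          # mark as most recently used
        else:
            if len(ways) == WAYS:
                ways.popitem(last=False)   # evict least recently used
            ways[tag] = True
    return hits / total if total else 0.0


def sequential_scan(num_bytes):
    """Byte-by-byte scan, as in the illegal-character check."""
    return range(num_bytes)


def column_walk(row_bytes, rows, passes):
    """Visit the first byte of each row of a matrix, `passes` times over."""
    for _ in range(passes):
        for r in range(rows):
            yield r * row_bytes


if __name__ == "__main__":
    print(hit_rate(sequential_scan(1 << 16)))
    print(hit_rate(column_walk(1 << 13, 8, 10)))
    print(hit_rate(column_walk((1 << 13) + 32, 8, 10)))
```

Under these assumptions the sequential scan hits on 31 of every 32 accesses regardless of its length, the walk down eight rows of 2^13 bytes never hits because all eight lines compete for one 4-way set, and padding each row by one 32-byte line spreads the rows across different sets so that only the first pass misses.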