Test 2008: Section B answers

SOFTENG 363
THE UNIVERSITY OF AUCKLAND
FIRST SEMESTER, 2008
Campus: City
SOFTWARE ENGINEERING
Computer Architecture
Mid-Semester Test
(Time allowed: 75mins)
NOTE:
Permitted aids:
One A4 sheet of handwritten notes. This sheet may not be a photocopy.
Instructions:
Calculators are not permitted. Annotated numeric expressions that can be
trivially reduced to the correct answer with a calculator will be accepted for
full marks. You are expected to be able to do arithmetic on powers of 2.
Answer as many questions as you can in the time allowed. There are 113
marks on this paper: a reasonable target is 100 marks.
Write the answers to questions in Section A on the examination paper. If you
need additional space or wish to change your answer, you may use your answer
book. Make sure to indicate clearly when answers to Section A questions
appear in your answer book. A large “SAB” (See Answer Book) over wrong
answers will suffice.
When you have finished, hand in Section A of the examination paper, your
answer book and your A4 sheet of notes.
Section A
Short Answer Questions
The questions in this section can be answered with no more than two concise sentences, a short calculation or a
simple diagram. Sometimes a single word will suffice. The marks allocated to each question give some indication
of the length of answer required. Questions marked with an * continue on from the previous question: read all
the questions in the group before answering.
Question 1
Parts: (a) (b) (c) (d) * (e) * (f) * (g) * (h) *
Except where otherwise indicated, use the following operating system and processor
characteristics in all questions.
Your operating system uses 8 kbyte pages. The machine you are using has a 4-way set
associative 32 kbyte unified L1 cache and a 64 entry fully associative TLB. Cache lines
contain 32 bytes. Integer registers are 32 bits wide. Physical addresses are also 32 bits. It
supports virtual addresses of 46 bits. 1 Gbyte of main memory is installed.
Give one advantage of a direct mapped cache.
1. Simple, fast hardware
2. Smaller circuitry (essentially same as ‘simple’)
3. Only one comparator (essentially same as ‘simple’)
What is the main disadvantage of a direct mapped cache?
Too many conflicts
How many sets does the cache contain?
Line: 32 = 2^5 bytes; set: 4 x 2^5 = 2^7 bytes
Cache capacity: 32 kbyte = 2^15 bytes
Number of sets = 2^15 / 2^7 = 2^8 = 256
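The calculation above can be checked with a few lines of Python. This is only a sketch: the constant and function names are illustrative, and the values are the ones given in the question preamble.

```python
# Cache parameters taken from the question preamble.
CACHE_BYTES = 32 * 1024   # 32 kbyte unified L1 cache
LINE_BYTES = 32           # bytes per cache line
WAYS = 4                  # 4-way set associative


def num_sets(cache_bytes, line_bytes, ways):
    """Number of sets = capacity / (bytes per line * lines per set)."""
    return cache_bytes // (line_bytes * ways)


sets = num_sets(CACHE_BYTES, LINE_BYTES, WAYS)
print(sets)  # 256
```

With `ways=1` the same function gives 1024 sets, i.e. the direct mapped version of this cache.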
How many comparators does the cache require?
One per way, so 4
How many bits do these comparators work on?
PA = 32 bits, divided into three fields:
Tag: 19 bits
Set address: 8 bits
Address of byte within line: 5 bits
So tag = 32 - 8 - 5 = 19 bits
Comparators are 19-bit comparators
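As an illustration of this field layout, the following sketch splits a 32-bit physical address into tag, set address and byte offset. The function name and the example address are hypothetical; the field widths are the ones derived above.

```python
PA_BITS = 32
SET_BITS = 8       # 256 sets
OFFSET_BITS = 5    # 32-byte lines
TAG_BITS = PA_BITS - SET_BITS - OFFSET_BITS


def split_address(pa):
    """Return (tag, set address, byte offset) for a physical address."""
    offset = pa & ((1 << OFFSET_BITS) - 1)
    set_index = (pa >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = pa >> (OFFSET_BITS + SET_BITS)
    return tag, set_index, offset


print(TAG_BITS)  # 19
tag, set_index, offset = split_address(0x12345678)
print(hex(tag), hex(set_index), hex(offset))
```

Each of the four comparators compares the 19-bit tag of the incoming address with the tag stored in one way of the selected set.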
Your program is a text processor for large documents: in an initial check, it scans the
document looking for illegal characters. For an 8 Mbyte document, what would you expect
the L1 cache hit rate to be during the initial check? (You are expected to do a calculation and
give an approximate numeric answer!)
This program works sequentially through the data.
It generates a cache miss for the first byte in every line,
and a cache hit for the remaining 31.
So L1 cache hit rate = 31/32 (sufficient for 4 marks), which is approximately 97%
Note that this is independent of the data size! Question (i) deals with the same issue.
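The 31/32 figure can be reproduced with a small simulation. The sketch below models a 4-way, 256-set cache with 32-byte lines and LRU replacement (the replacement policy is an assumption, as the question does not state one) and scans a buffer byte by byte. A 256 kbyte buffer is used to keep the run short; it is already eight times larger than the cache.

```python
from collections import OrderedDict

LINE_BYTES, WAYS, SETS = 32, 4, 256


def hit_rate(addresses):
    """Fraction of accesses that hit in a set-associative LRU cache."""
    sets = [OrderedDict() for _ in range(SETS)]
    hits = total = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        ways = sets[line % SETS]
        tag = line // SETS
        total += 1
        if tag in ways:
            hits += 1
            ways.move_to_end(tag)          # mark as most recently used
        else:
            if len(ways) == WAYS:
                ways.popitem(last=False)   # evict least recently used
            ways[tag] = True
    return hits / total


rate = hit_rate(range(1 << 18))  # sequential byte-by-byte scan
print(rate)  # 0.96875
```

Changing the buffer size leaves the result unchanged, because every line contributes exactly one miss and 31 hits.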
Your program manipulates large arrays of data. In order to obtain consistently good performance, you
should avoid one thing. What is it? (Be precise – a numeric answer relevant to the
processor described above and an explanation is required here.)
General answer: Matrices with rows containing 2^(k+m) bytes,
where k is the number of bits needed to address a set (= 8 here)
and m is the number of bits needed to address a byte within a line (= 5 here).
So rows of 2^13 bytes will have a high probability of generating conflicts.
Alternative: Arrays of any structure containing 2^13 bytes may exhibit the same
problem.
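To see why a row length of 2^13 bytes is a problem, the sketch below computes the set address of the first element of each row when walking down a column. The padding of one cache line per row is a common workaround, shown here as an illustration rather than as part of the model answer.

```python
LINE_BYTES, SETS = 32, 256


def set_index(addr):
    """Set address selected by a byte address."""
    return (addr // LINE_BYTES) % SETS


ROW_BYTES = 1 << 13  # 8 kbyte rows

# Walking down a column: one access per row, ROW_BYTES apart.
column = [row * ROW_BYTES for row in range(8)]
print(sorted({set_index(a) for a in column}))   # [0]

# Same walk with each row padded by one cache line.
padded = [row * (ROW_BYTES + LINE_BYTES) for row in range(8)]
print(sorted({set_index(a) for a in padded}))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

With unpadded rows all eight accesses compete for the four ways of a single set; with padding they spread over eight different sets.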
What is the alternative to a unified cache? What advantages does it provide?
Separate instruction and data caches
Main reason: Greater bandwidth to the caches; ability to access both caches
simultaneously
Secondary reasons:
Instruction cache can be simpler (so faster and uses fewer transistors)
Ability to store pre-decoded instructions in I-cache
Marks for parts (a) to (h): 1, 2, 2, 1, 2, 4, 4, 3
Parts: (i) * (j) * (k) * (l) (m) (v)
In addition to data and tags, a cache will have additional bits associated with each entry. List
these bits and add a short phrase describing the purpose of each bit (or set of bits). (In all
cases, make your answers concise: simply list any differences from a preceding answer.)
(i) A set-associative write-back cache
Valid – cache data is valid
Dirty – cache data is modified and needs to be written back
LRU – LRU algorithm to identify ‘old’ data or best candidate for replacement
(ii) A set-associative write-through cache
As (i) but Dirty not needed
(iii) A direct mapped cache
As (i) but LRU not needed
(iv) A fully associative cache
As (i), but more bits required for LRU
32 processes are currently running. If the OS permitted each process to use the maximum
possible address space, how many page table entries are required?
(i) Conventional page tables
Virtual address space = 2^46 bytes
Page size = 2^13 bytes
Number of PTEs per user = 2^46 / 2^13 = 2^33
PTEs for 32 users = 2^33 x 2^5 = 2^38
(ii) Inverted page tables
One PTE per page of physical memory
Physical memory = 1 Gbyte = 2^30 bytes
Number of PTEs = 2^30 / 2^13 = 2^17
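Both page table sizes can be computed directly from the parameters in the question preamble. The variable names below are illustrative.

```python
VA_BITS = 46          # virtual address width
PAGE_BITS = 13        # 8 kbyte pages
PROCESSES = 32
MEMORY_BYTES = 1 << 30  # 1 Gbyte installed

# Conventional: one PTE per virtual page, per process.
conventional_ptes = PROCESSES * (1 << (VA_BITS - PAGE_BITS))

# Inverted: one PTE per page of physical memory, shared by all processes.
inverted_ptes = MEMORY_BYTES >> PAGE_BITS

print(conventional_ptes == 2**38)  # True
print(inverted_ptes == 2**17)      # True
```

The inverted table is smaller by a factor of 2^21, and its size does not depend on the number of processes.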
Draw a diagram showing how the bits of a virtual address are used to generate a 32-bit
physical address.
Diagram from lecture notes annotated with correct number of bits for this processor
needed for full marks.
Low order bits of VA: offset within page, 13 bits copied directly to physical address
(PA)
High order 33 bits of VA address a PTE, from which high order 19 bits of PA are
extracted.
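The translation described above can be sketched in code. The page table is modelled as a plain dictionary from virtual page number to physical page number, which is a simplification for illustration only; a real system would consult the TLB first and use a hardware-defined PTE format.

```python
PAGE_BITS = 13   # offset within page, copied directly to the PA
VA_BITS = 46     # so the virtual page number is 46 - 13 = 33 bits
PA_BITS = 32     # so the physical page number is 32 - 13 = 19 bits


def translate(va, page_table):
    """Map a virtual address to a physical address via a page table."""
    vpn = va >> PAGE_BITS                    # high-order 33 bits of the VA
    offset = va & ((1 << PAGE_BITS) - 1)     # low-order 13 bits of the VA
    ppn = page_table[vpn]                    # 19-bit physical page number
    return (ppn << PAGE_BITS) | offset


# Hypothetical mapping: virtual page 5 lives in physical page 0x7FFFF.
page_table = {5: 0x7FFFF}
va = (5 << PAGE_BITS) | 0x123
print(hex(translate(va, page_table)))  # 0xffffe123
```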
“A program which simply copies a large block of data from one memory location to another
exhibits little locality of reference, therefore its performance is not improved by the presence
of a cache.” Comment on this statement. Is it strictly true, mostly true or not true at all?
Explain your answer. Assume you are running programs on the processor described at the
beginning of this section.
Not true: the best thing for the copying program to do is to work sequentially through
memory. This means that it will exploit spatial locality: once one word from a line is
loaded, all remaining words in the same cache line will lead to cache hits.
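A small simulation supports this answer. The sketch below copies a 64 kbyte block one 32-bit word at a time through a model of the 4-way, 256-set cache. LRU replacement and allocation of a line on a write miss are assumptions, and the source and destination addresses are hypothetical.

```python
from collections import OrderedDict

LINE_BYTES, WAYS, SETS, WORD_BYTES = 32, 4, 256, 4


def hit_rate(addresses):
    """Fraction of accesses that hit in a set-associative LRU cache."""
    sets = [OrderedDict() for _ in range(SETS)]
    hits = total = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        ways = sets[line % SETS]
        tag = line // SETS
        total += 1
        if tag in ways:
            hits += 1
            ways.move_to_end(tag)          # mark as most recently used
        else:
            if len(ways) == WAYS:
                ways.popitem(last=False)   # evict least recently used
            ways[tag] = True               # allocate on read or write miss
    return hits / total


def copy_trace(src, dst, nbytes):
    """Addresses touched by a word-by-word copy: read src, write dst."""
    for i in range(0, nbytes, WORD_BYTES):
        yield src + i
        yield dst + i


rate = hit_rate(copy_trace(0, 1 << 20, 64 * 1024))
print(rate)  # 0.875
```

Each line holds 8 words, so one miss is followed by 7 hits on both the source and the destination side.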
You are advising a team of programmers writing a large scientific simulation program. The
team mainly consists of CS graduates who skipped any study of computer architecture in their
degrees. Performance is critical.
List some simple things that you would advise them to do when writing code for this system.
Provide a one sentence explanation for each point of advice.
(1 mark for each valid piece of advice, 1 for explaining it and 1 for adding a number that
makes the advice specific to the processor described earlier.)
1. Put frequently accessed data in small blocks (ensure spatial locality), to benefit
from long cache lines: specifically, pack data into 32-byte chunks wherever possible.
2. Do not make matrices with rows that are 2^x bytes long, to reduce cache conflicts:
specifically, avoid x = 13 for this processor.
3. Pack as much data into pages as possible, to reduce page thrashing: a page is 2^13
bytes on this processor.
4. Do not make random jumps through memory when accessing large structures, to preserve
spatial locality at all levels: packing into 32-byte lines or 8 kbyte pages is best for this
processor.
5. Small 'tight' program loops will use the I-cache efficiently: critical sizes are 32 bytes
(or 8 instructions) and 32 kbytes (L1 cache size).
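One way to make the locality advice concrete is to count how many distinct cache lines and pages an access pattern touches. The sketch below compares 1024 accesses to consecutive 4-byte elements with 1024 accesses spaced one page apart; the function name and the access patterns are illustrative.

```python
LINE_BYTES = 32
PAGE_BYTES = 8 * 1024


def footprint(addresses):
    """Return (distinct cache lines, distinct pages) touched."""
    addresses = list(addresses)
    lines = {a // LINE_BYTES for a in addresses}
    pages = {a // PAGE_BYTES for a in addresses}
    return len(lines), len(pages)


sequential = footprint(i * 4 for i in range(1024))
scattered = footprint(i * PAGE_BYTES for i in range(1024))
print(sequential)  # (128, 1)
print(scattered)   # (1024, 1024)
```

The scattered pattern needs a new cache line and a new page translation for every access, far more than the 64-entry TLB can hold.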
This question deliberately left blank.
No marks even if you do know the answer.
Predictably, several readers of Douglas Adams wrote 42 here
Marks: 8, 2, 2, 4, Max 12, 0