Name:__________________________
CSE 30321 – Computer Architecture I – Fall 2009
Final Exam
December 18, 2009
Test Guidelines:
1. Place your name on EACH page of the test in the space provided.
2. Answer every question in the space provided. If separate sheets are needed, make sure to
include your name and clearly identify the problem being solved.
3. Read each question carefully. Ask questions if anything needs to be clarified.
4. The exam is open book and open notes.
5. All other points of the ND Honor Code are in effect!
6. Upon completion, please turn in the test and any scratch paper that you used.
Suggestion:
- Whenever possible, show your work and your thought process. This will make it easier for us to
give you partial credit.
Question    Possible Points    Your Points
   1              15
   2              10
   3              20
   4              15
   5              15
   6              10
   7              15
 Total           100
Problem 1: (15 points)
Question A: (5 points)
Briefly (in 4-5 sentences or a bulleted list) explain why many of the transistors on a modern
microprocessor chip are devoted to Level 1, Level 2, and sometimes Level 3 cache. Your answer must
fit in the box below!
Students' answers should say something to the effect of:
- Processing logic is faster than off-chip, 1-transistor DRAM – and this performance gap has been
continually growing
- Idea: bring in subsets of (off-chip) main memory into faster on-chip memory (SRAM) that can
operate at the speed of the processor
- Caches help provide faster “data supply times” so that logic is not left idle for large
numbers of CCs (e.g. the time to access off-chip memory)
- If instruction encodings and data for load instructions could not be accessed in 1-2 CCs, CPU
performance would be significantly (and negatively) impacted.
- A discussion of spatial vs. temporal locality should receive little to no credit – speed differentials
are the most important consideration in this answer.
In HW 8, you saw that some versions of the Pentium 4 microprocessor have two 8 Kbyte, Level 1
caches – one for data and one for instructions. However, a design team is considering another option –
a single, 16 Kbyte cache that holds both instructions and data.
Additional specs for the 16 Kbyte cache include:
- Each block will hold 32 bytes of data (not including tag, valid bit, etc.)
- The cache would be 2-way set associative
- Physical addresses are 32 bits
- Data is addressed to the word and words are 32 bits
Question B: (3 points)
How many blocks would be in this cache?
Answer
- The cache holds 2^14 bytes of data
- Each block holds 2^5 bytes
- Thus, there are 2^14 / 2^5 = 2^9 = 512 blocks
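For reference, the same arithmetic as a short Python sketch (variable names are ours, purely for illustration):

    cache_bytes = 16 * 1024          # 16 Kbyte cache
    block_bytes = 32                 # 32 bytes of data per block
    num_blocks  = cache_bytes // block_bytes
    print(num_blocks)                # 512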
Question C: (3 points)
How many bits of tag are stored with each block entry?
Answer
We need to figure out how many bits are dedicated to the offset, index, and tag. (Basically, this
question asks how many bits of tag are needed.)
- Index:
o # of sets: 512 / 2 = 256 = 2^8
o Therefore 8 bits of index are needed
- Offset:
o # of words per block = 32 / 4 = 8
o 2^3 = 8
o Therefore 3 bits of offset
- Tag:
o 32 – 3 – 8 = 21 bits of tag
Therefore, 21 bits of tag need to be stored with each block entry.
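A small Python check of the bit breakdown (illustrative only; it just repeats the arithmetic above):

    import math

    num_blocks  = 512
    num_sets    = num_blocks // 2                  # 2-way set associative
    index_bits  = int(math.log2(num_sets))         # 8
    offset_bits = int(math.log2(32 // 4))          # 8 words per block -> 3
    tag_bits    = 32 - index_bits - offset_bits    # 21
    print(index_bits, offset_bits, tag_bits)       # 8 3 21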
Question D: (4 points)
Each instruction fetch means a reference to the instruction cache and 35% of all instructions reference
data memory. With the first implementation:
- The average miss rate in the L1 instruction cache was 2%
- The average miss rate in the L1 data cache was 10%
- In both cases, the miss penalty is 9 CCs
For the new design, the average miss rate is 3% for the cache as a whole, and the miss penalty is
again 9 CCs.
Which design is better and by how much?
Answer
Miss penalty_v1 = (1)(0.02)(9) + (0.35)(0.1)(9) = 0.18 + 0.315 = 0.495 CCs per instruction
Miss penalty_v2 = (0.03)(9) = 0.270 CCs per instruction
V2 is the right design choice – it adds only 0.270 stall CCs per instruction versus 0.495 for V1
(a difference of 0.225 CCs per instruction).
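The comparison can be sanity-checked with a couple of lines of Python (extra miss CCs per instruction for each design; names are ours):

    penalty_v1 = 1.00 * 0.02 * 9 + 0.35 * 0.10 * 9   # split I/D caches
    penalty_v2 = 1.00 * 0.03 * 9                     # unified cache, as computed above
    print(penalty_v1, penalty_v2, penalty_v1 - penalty_v2)   # ~0.495  ~0.27  ~0.225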
Problem 2: (10 points)
Question A: (4 points)
Explain the advantages and disadvantages (in 4-5 sentences or a bulleted list) of using a direct
mapped cache instead of an 8-way set associative cache. Your answer must fit in the box below!
Answer
- A direct mapped cache should have a faster hit time; there is only one block that data for a
physical address can be mapped to
- The above “pro” can also be a “con”; if there are successive reads to 2 separate addresses that
map to the same cache block, then there may never be a cache hit. This will significantly
degrade performance.
- In contrast, with an 8-way set associative cache, an address can map to any of 8 blocks within a
set. Thus, if the situation described above were to occur, both addresses could reside in the cache
at the same time and there would be no conflict misses.
- However, a set associative cache will take a bit longer to search – could decrease clock rate.
Question B: (2 points)
Assume you have a 2-way set associative cache.
- Words are 4 bytes
- Addresses are to the byte
- Each block holds 512 bytes
- There are 1024 blocks in the cache
If you reference a 32-bit physical address – and the cache is initially empty – how many data words are
brought into the cache with this reference?
Answer
- The entire block will be filled
- If words are 4 bytes long and each block holds 512 bytes, there are 2^9 / 2^2 words in the block
- i.e. there are 2^7 or 128 words in each block
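As a one-line check (illustrative):

    words_per_block = 512 // 4       # bytes per block / bytes per word
    print(words_per_block)           # 128 words are brought in to fill the block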
Question C: (4 points)
Which set does the data that is brought in go to if the physical address F A B 1 2 3 8 9 (in hex) is
supplied to the cache?
Answer
We need to determine what the index bits are. From above, we know the offset is 9 bits (remember,
data is byte addressable) – so we will need to break up the hex address into binary:

FAB12389 (hex) = 1111  1010  1011  0001  0010  0011  1000  1001 (binary)

Our offset for this address is the low 9 bits: 1  1000  1001

# of sets: 1024 / 2 = 2^10 / 2^1 = 512 = 2^9 – therefore 9 bits of index are required.

These are: 01  0010  001, which implies the address maps to set 145.
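The same bit slicing in Python (a sketch; the constants come from the cache parameters above):

    addr        = 0xFAB12389
    offset_bits = 9                      # 512-byte, byte-addressed blocks
    index_bits  = 9                      # 1024 blocks / 2 ways = 512 sets
    offset = addr & ((1 << offset_bits) - 1)
    index  = (addr >> offset_bits) & ((1 << index_bits) - 1)
    print(bin(offset), index)            # 0b110001001  145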
Problem 3: (20 points)
Question A: (5 points)
Explain (in 4-5 sentences or via a short bulleted list) why there is a translation lookaside buffer on the
virtual-to-physical address critical path. Your answer must fit in the box below!
Answer
- The page table for a process/program can be huge – and the entire page table will almost
certainly not be cacheable.
- As a page table reference is required for every address translation, if every instruction and every
data reference required a main memory lookup, performance would quickly and significantly
degrade.
- The TLB is a fast cache for part of the page table – and if (a) the TLB contains 64-128 entries
and (b) each page has 2^14 – 2^16 addressable entries, the seemingly small TLB can provide wide
coverage
For the next question, refer to the snapshot of TLB and page table state shown below.
Initial TLB State:
(Note that '1' = “Most Recently Used” and '4' = “Least Recently Used”)

Valid   LRU   Tag    Physical Page #
  1      3    1111        0001
  1      4    0011        0010
  1      2    1000        1000
  1      1    0100        1010
Initial Page Table State:

Index   Valid   Physical Page #
0000      0          0011
0001      1          1001
0010      1          0000
0011      1          0010
0100      1          1010
0101      0          0100
0110      1          1011
0111      0          0101
1000      1          1000
1001      1          0110
1010      1          1111
1011      1          1101
1100      1          0111
1101      0          1110
1110      1          1100
1111      1          0001
Also:
1. Pages are 4 KB
2. There is a 4-entry, fully-associative TLB
3. The TLB uses a true, least-recently-used replacement policy
Question B: (5 points)
Assume that the Page Table Register is equal to 0.
The virtual address supplied is:
(MSB) 1100  0010  0010  0100 (LSB)
What physical address is calculated? If you cannot calculate the physical address because of a page
fault, please just write “Page Fault”.
Answer:
- Our VPN is 1100
- This entry is not in one of the tag fields in the TLB … so we need to look in the page table
- The entry in the page table is valid – and suggests that the PFN is 0111
- We can then concatenate 0111 to 0010 0010 0100
- Thus, our PA is: 0111  0010  0010  0100 (or 7224 hex)
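A minimal Python sketch of this translation, with the TLB and (partial) page-table state from the figure typed in by hand (the dictionary structure is ours, for illustration):

    va = 0b1100_0010_0010_0100
    vpn, offset = va >> 12, va & 0xFFF          # 4 KB pages -> 12-bit offset

    tlb = {0b1111: 0b0001, 0b0011: 0b0010, 0b1000: 0b1000, 0b0100: 0b1010}  # tag -> PPN (all valid)
    page_table = {0b1100: (1, 0b0111)}          # VPN -> (valid, PPN); only the entry we need

    ppn = tlb.get(vpn)
    if ppn is None:                             # TLB miss -> walk the page table
        valid, ppn = page_table[vpn]
        assert valid, "Page Fault"
    print(hex((ppn << 12) | offset))            # 0x7224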
Question C: (5 points)
Consider the following:
- Virtual addresses are 32 bits
- Pages have 65,536 (or 2^16) addressable entries
- Each page table entry has:
o 1 valid bit
o 1 dirty bit
o The physical frame number
- Physical addresses are 30 bits long
How much memory would we need to simultaneously hold the page tables for two different processes?
Answer:
- Our VPN is 16 bits long – so each page table will have 2^16 entries
- There are also 16 bits left over for the offset
- Thus, each physical frame number is 30 – 16 = 14 bits
- Each page table entry will hold the above 14-bit PFN + 1 valid bit + 1 dirty bit
o Thus, each page table entry is 2 bytes.
- There are 2^16, 2-byte entries per process
o Thus, each page table requires 2^17 bytes – or 128 Kbytes
- Because each process has its own page table, a total of 2^18 bytes – or 256 Kbytes – is needed
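The same sizing in Python (illustrative):

    pte_bits    = 14 + 1 + 1                 # PFN + valid + dirty
    pte_bytes   = pte_bits // 8              # 2 bytes
    per_process = (2 ** 16) * pte_bytes      # 131072 bytes = 128 Kbytes
    print(per_process, 2 * per_process)      # 131072  262144 (256 Kbytes for two processes)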
Question D: (5 points)
Assume that for a given system, virtual addresses are 40 bits long and physical addresses are 30 bits
long. There are 8 Kbytes of addressable entries per page. The TLB in the address translation path has
128 entries. How many virtual addresses can be quickly translated by the TLB? Would changing the
page size make your answer better or worse? – Note that there are 2 questions here!
Answer:
There are 128 (2^7) entries in the TLB and there are 8,192 (2^13) addressable entries per page.
Therefore 2^7 x 2^13 = 2^20 (or ~1 M) addressable entries are covered by the TLB.
Increasing the page size would make the answer better, because each TLB entry would then cover
more addressable entries (the 2^13 term would grow).
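Again as a short Python check (illustrative):

    tlb_entries      = 128                       # 2**7
    entries_per_page = 8 * 1024                  # 2**13 addressable entries per page
    print(tlb_entries * entries_per_page)        # 1048576 = 2**20 entries covered by the TLB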
Problem 4: (15 points)
This question considers the basic, MIPS, 5-stage pipeline. (For this problem, you may assume that
there is full forwarding.)
Question A: (5 points)
Explain how pipelining can improve the performance of a given instruction mix. Answer in the box.
Answer
- Pipelining provides “pseudo-parallelism” – instructions are still issued sequentially, but their
overall execution (i.e. from fetch to write back) overlaps
o (note – a completely right answer really needs to make this point)
- Performance is improved via higher throughput
o (note – a completely right answer really needs to make this point too)
o An instruction (ideally) finishes every short CC
Question B: (6 points)
Show how the instructions will flow through the pipeline:
lw $10, 0($11)
add $9, $11, $11
sub $8, $10, $9
lw $7, 0($8)
sw $7, 4($8)
                    1   2   3   4   5   6   7   8   9
lw  $10, 0($11)     F   D   E   M   W
add $9, $11, $11        F   D   E   M   W
sub $8, $10, $9             F   D   E   M   W
lw  $7, 0($8)                   F   D   E   M   W
sw  $7, 4($8)                       F   D   E   M   W
Question C: (4 points)
Where might the sw instruction get its data from? Be very specific. (i.e. “from the lw instruction” is not
a good answer!)
Answer
- Data is available from the lw in the MEM / WB register
- There could be a feedback path from this register back to one of the inputs to data memory
- A control signal could then select the appropriate input to data memory to support this lw – sw
input combination
Problem 5: (15 points)
This question considers the basic, MIPS, 5-stage pipeline. (For this problem, you may assume that
there is full forwarding.)
Question A: (5 points)
Using pipelining as context, explain why very accurate branch prediction is important in advanced
computer architectures. Answer in the box below.
Answer
- If we need to wait for the branch to be resolved (e.g. after 3 stages) performance will be
adversely affected – especially given that ~1 in every 6 instructions is a branch
o 3 stall CCs would be tacked on to the base CPI of 1
Question B: (7 points)
Show how the instructions will flow through the pipeline.
- You should assume that branches are predicted to be taken.
- You should assume that 0($2) == 4
lw   $1, 0($2)
bneq $0, $1, X
add  $2, $1, $1
X: add  $2, $1, $0
sw   $2, 4($2)

                    1   2   3   4   5   6   7   8   9
lw   $1, 0($2)      F   D   E   M   W
bneq $0, $1, X          F   D   D   E   M   W
X: add $2, $1, $0           F   F   D   E   M   W
sw   $2, 4($2)                      F   D   E   M   W

(The branch is predicted taken, so the fall-through add $2, $1, $1 is never fetched. bneq needs $1
from the lw, so even with forwarding it stalls one cycle in decode, and the fetch of the target
instruction at X stalls behind it.)
Question C: (3 points)
What is the value that will be written to 4($2) in the last instruction?
Answer
lw $1, 0($2):        $1 ← 4
bneq $0, $1, X:      $0 != $1, therefore goto X (also predicted taken)
add $2, $1, $0:      $2 ← 4 + 0 = 4
sw $2, 4($2):        4($2) ← 4

Thus, the value written to 4($2) is 4.
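The same trace as a tiny Python sketch (register names as dictionary keys; the initial value of $2 does not matter as long as 0($2) holds 4):

    R = {'$0': 0}
    R['$1'] = 4                      # lw  $1, 0($2)   -- 0($2) == 4
    taken = R['$0'] != R['$1']       # bneq $0, $1, X  -> True, so add $2,$1,$1 is skipped
    R['$2'] = R['$1'] + R['$0']      # X: add $2, $1, $0
    stored = R['$2']                 # sw  $2, 4($2) stores this value
    print(taken, stored)             # True 4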
Problem 6: (10 points)
Consider an Intel P4 microprocessor with a 16 Kbyte unified L1 cache. The miss rate for this cache is
3% and the hit time is 2 CCs. The processor also has an 8 Mbyte, on-chip L2 cache. 95% of the time,
data requests to the L2 cache are found. If data is not found in the L2 cache, a request is made to a 4
Gbyte main memory; if the data is not in main memory either, servicing that request takes 100,000 CCs.
On average, it takes 3.5 CCs to process a memory request. How often is data found in main memory?
Average memory access time = Hit Time + (Miss Rate x Miss Penalty)

Average memory access time = Hit Time_L1 + (Miss Rate_L1 x Miss Penalty_L1)
Miss Penalty_L1            = Hit Time_L2 + (Miss Rate_L2 x Miss Penalty_L2)
Miss Penalty_L2            = Hit Time_Main + (Miss Rate_Main x Miss Penalty_Main)

3.5  = 2 + 0.03 (15 + 0.05 (200 + X (100,000)))
3.5  = 2 + 0.03 (15 + 10 + 5,000X)
3.5  = 2 + 0.03 (25 + 5,000X)
3.5  = 2 + 0.75 + 150X
3.5  = 2.75 + 150X
0.75 = 150X
X    = 0.005
Thus, 99.5% of the time, we find the data we are looking for in main memory.
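The algebra can be double-checked with a short Python sketch (the 15 CC L2 hit time and 200 CC main-memory access time come from the substitution above):

    def amat(x):
        # AMAT: L1 hit = 2 CCs, L1 miss = 3%, L2 hit = 15 CCs, L2 miss = 5%,
        # main memory = 200 CCs, miss beyond main memory = 100,000 CCs
        return 2 + 0.03 * (15 + 0.05 * (200 + x * 100_000))

    x = (3.5 - 2.75) / 150        # algebra from above
    print(x, amat(x), 1 - x)      # 0.005  ~3.5  0.995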
Problem 7: (15 points)
- Assume that for a given problem, N CCs are needed to process each data element.
- The microprocessor that this problem will run on has 4 cores.
- If we want to solve part of the problem on another core, we will need to spend 250 CCs for each
instantiation on a new core.
o (e.g. if a problem is split up on 2 cores, 1 instantiation is needed)
- Also, if the problem is to run on multiple cores, an overhead of 10 CCs per data element is
associated with each instantiation.
- We want to process M data elements.
Question A: (8 points)
If N is equal to 500 and M is equal to 100000, what speedup do we get if we run the program on 4
cores instead of 1? (Hint – start by writing an expression!)
Answer:
- To run the program on 1 core, M x N CCs are required:
o Thus, 500 x 100,000 = 50,000,000 CCs
- To run the program on 4 cores, we need to consider overhead:
o We need to start jobs on 3 other cores…
 Thus, 250 CCs x 3 = 750 CCs are needed
o We need to send a certain number of data elements to each core
 i.e. 100,000 / 4 = 25,000 elements must be sent to each core
 Thus, an overhead of 25,000 x 10 x 3 = 750,000 CCs is required
o We still need to do the computation itself…
 … but now this is split up over 4 cores…
 Thus, the computation time is equal to 500 x 100,000 / 4 = 12,500,000 CCs
- We can use this analysis to create a generalized expression for running jobs on a single core
and running jobs on multiple cores:
o Single core:
 Execution time_single core = M x N
o Multiple cores:
 Execution time_multiple cores = Time to process + Time to instantiate + Time to send data
   = [(N x M) / # of cores] + [(# of cores – 1)(250)] + [(M / # of cores)(# of cores – 1)(10)]
- Using these expressions, we can calculate single-core and multi-core execution times:
o Single core = 500 x 100,000 = 50,000,000
o Multi-core  = (500 x 100,000 / 4) + (3 x 250) + [(100,000 / 4)(3)(10)]
             = 12,500,000 + 750 + 750,000
             = 13,250,750
o Speedup     = 50,000,000 / 13,250,750 = 3.77
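The general expression can also be written as a small Python function (names are ours; it simply evaluates the formula above):

    def exec_time(N, M, cores, instantiate=250, per_elem=10):
        """CCs to process M elements, each needing N CCs, spread over 'cores' cores."""
        if cores == 1:
            return N * M
        return (N * M) // cores + (cores - 1) * instantiate + (M // cores) * (cores - 1) * per_elem

    single = exec_time(500, 100_000, 1)      # 50,000,000
    multi  = exec_time(500, 100_000, 4)      # 13,250,750
    print(single / multi)                    # ~3.77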
Question B: (7 points)
If N is equal to 250, M is equal to 120, the communication overhead increases to 100 CC per element,
and the instantiation overhead remains the same (at 250 CC), how many cores should we run the
problem on? Explain why.
Answer:
- We can use the expressions generated above:
o Single core:
 Execution time_single core = M x N
o Multiple cores:
 Execution time_multiple cores = Time to process + Time to instantiate + Time to send data
   = [(N x M) / # of cores] + [(# of cores – 1)(250)] + [(M / # of cores)(# of cores – 1)(100)]
- The easiest way to solve this problem is to just compare all the cases by doing some trivial math:
o Single core
 250 x 120 = 30,000
o 2 cores
 = (30,000 / 2) + (1)(250) + (120/2)(1)(100)
 = 15,000 + 250 + 6,000
 = 21,250
o 3 cores
 = (30,000 / 3) + (2)(250) + (120/3)(2)(100)
 = 10,000 + 500 + 8,000
 = 18,500
o 4 cores
 = (30,000 / 4) + (3)(250) + (120/4)(3)(100)
 = 7,500 + 750 + 9,000
 = 17,250
- Thus, 4 cores still makes sense!
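The same comparison as a short, self-contained Python loop (illustrative; it just evaluates the expression above for 1-4 cores):

    # N = 250 CCs per element, M = 120 elements, 100 CCs per element of communication overhead
    for c in (1, 2, 3, 4):
        t = 250 * 120 if c == 1 else (250 * 120) // c + (c - 1) * 250 + (120 // c) * (c - 1) * 100
        print(c, t)      # 1 30000, 2 21250, 3 18500, 4 17250 -> 4 cores gives the lowest time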