1. Answer the following questions:
a. (multiple choice) Many new CMPs have private L1s and a shared L2 as the last-level
cache. What is the advantage of having a shared L2 instead of private L2s?
1. Improved overall cache capacity
2. Fast L2 access time
3. Simplified cache coherence
4. Flexible cache space sharing
5. Enables another L3 cache level
6. Reduced cache miss penalty
7. None of the above
b. (multiple choice) What are the essential factors for today's microprocessors?
1. Super-pipelined and aggressively speculative designs, since the transistor count
on a chip has continued to follow Moore's Law
2. Branch prediction is the most important performance factor
3. With emerging 3D VLSI stacking technology, on-chip caches become less
critical due to high off-chip bandwidth
4. Power has become a limiting factor in future high-performance processors
5. CMPs have arrived, and simple pipelines are the trend
c. Do you agree that a 3.8GHz Pentium 4 delivers better performance than a 1.5GHz
Itanium 2? Why?
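A quick way to reason about question 1c is the classic execution-time equation: time = instruction count × CPI / clock rate. The sketch below uses made-up instruction counts and CPIs (assumptions, not measured values) purely to show that a lower-clocked machine can still finish first.

```python
# Hedged illustration for question 1c: clock rate alone does not determine
# performance. All workload numbers below are hypothetical.

def exec_time(instr_count, cpi, clock_hz):
    """Execution time = IC x CPI / clock rate."""
    return instr_count * cpi / clock_hz

# Hypothetical: same program compiled for two different ISAs.
p4  = exec_time(1.0e9, cpi=3.0, clock_hz=3.8e9)   # deep pipeline, higher CPI
it2 = exec_time(0.8e9, cpi=1.2, clock_hz=1.5e9)   # wide in-order EPIC, lower CPI
print(p4 > it2)   # -> True: the 1.5 GHz machine finishes first in this example
```

Under these assumed numbers the Pentium 4 takes about 0.79 s versus 0.64 s for the Itanium 2, despite the 2.5x clock advantage.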
d. The AMD Opteron requires 9 index bits to access the first-level (L1) data cache before
the address translation. However, only 7 bits are available from the page offset, which
needs no address translation. How does the AMD Opteron handle the extra 2 index bits?
What is the issue with this solution?
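The bit counting behind question 1d can be checked mechanically. The geometry below (32 KB, 2-way, 32-byte blocks, 4 KB pages) is an assumption chosen only because it reproduces the question's 9-index-bit / 7-untranslated-bit split; it is not necessarily the actual Opteron configuration.

```python
# Hedged sketch of the index-bit arithmetic in question 1d.
# Cache and page parameters here are assumptions, not Opteron's real geometry.
from math import log2

def l1_index_bits(cache_bytes, ways, block_bytes):
    # The number of sets determines how many index bits the cache needs.
    sets = cache_bytes // (ways * block_bytes)
    return int(log2(sets))

PAGE_OFFSET_BITS = 12                    # 4 KB pages (assumption)
BLOCK_OFFSET_BITS = int(log2(32))        # 32-byte blocks (assumption)

idx = l1_index_bits(32 * 1024, 2, 32)    # hypothetical 32 KB, 2-way L1
untranslated = PAGE_OFFSET_BITS - BLOCK_OFFSET_BITS
print(idx, untranslated, idx - untranslated)   # -> 9 7 2
```

The two leftover index bits fall inside the virtual page number, so the cache is indexed with (partly) virtual bits before translation completes, which is exactly the situation the question asks about.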
2. Assume a cache consists of 4 cache blocks. Construct a sequence of 12 requests to
show that a direct-mapped cache performs better than a 2-way set-associative cache, and
that the 2-way cache performs better than a fully-associative cache. Each block address
is expressed as a 4-bit binary number (with its decimal value) and carries no block offset.
The cache content is expressed set by set (sets separated by ";"), with the blocks within
a set listed from MRU to LRU (separated by ","). All caches are initially empty and use
LRU replacement. For each access, you must indicate a hit or a miss along with the miss
type: CM (compulsory), CF (conflict), or CP (capacity).
Block Address | Direct-Map      | 2-way         | Fully-asso.
--------------|-----------------|---------------|---------------
0000 (0)      | 0; -; -; - (CM) | 0,-; -,- (CM) | 0,-,-,- (CM)
0001 (1)      |                 |               |
0010 (2)      |                 |               |
0011 (3)      |                 |               |
1000 (8)      |                 |               |
0000 (0)      |                 |               |
0001 (1)      |                 |               |
0010 (2)      |                 |               |
0011 (3)      |                 |               |
1000 (8)      |                 |               |
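The table above can be checked with a small LRU cache simulator. This is a sketch, not part of the assignment: it counts only hits and misses for the ten block addresses listed, leaving the miss-type classification (CM/CF/CP) to the reader, as the question requires.

```python
# Hedged sketch: simulate the three organizations from question 2.
# 4 blocks total; ways=1 is direct-mapped, ways=4 is fully associative.
from collections import OrderedDict

def simulate(addresses, num_blocks=4, ways=1):
    """Set-associative cache with LRU replacement; returns hit/miss per access."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]  # block -> None, MRU last
    results = []
    for addr in addresses:
        s = sets[addr % num_sets]
        if addr in s:
            s.move_to_end(addr)        # refresh LRU position
            results.append("hit")
        else:
            if len(s) == ways:
                s.popitem(last=False)  # evict the LRU block
            s[addr] = None
            results.append("miss")
    return results

trace = [0b0000, 0b0001, 0b0010, 0b0011, 0b1000,
         0b0000, 0b0001, 0b0010, 0b0011, 0b1000]
for label, ways in [("direct-mapped", 1), ("2-way", 2), ("fully-assoc.", 4)]:
    hits = simulate(trace, ways=ways).count("hit")
    print(label, hits, "hits")
```

For this trace the simulator reports 3 hits for direct-mapped, 2 for 2-way, and 0 for fully-associative, confirming the ordering the question asks you to demonstrate.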
3. A write-invalidate cache coherence protocol is implemented in a snooping-bus SMP
system. Fill in the blanks in the following table to simulate the cache coherence activities.
Note that each processor executes instructions independently; however, they can only
access the bus one at a time. Assume block X is initially located in both P1's and P2's
caches in the shared state.
Step | Coherence Action in P1              | Coherence Action in P2       | Bus Activity               | Processor Activity
-----|-------------------------------------|------------------------------|----------------------------|--------------------------------------------
1    | Cache hit; X in shared state        | Cache hit; X in shared state | None                       | P1 read X=0; P2 read X=0
2    | Cache miss to X in the shared state |                              |                            | P1 write X=1; P2 write X=2 (simultaneously)
3    |                                     |                              | P1 won the bus arbitration |
4    |                                     |                              |                            | P2 re-issues write X=2; P1 read X
5    |                                     |                              |                            |

(See Figure 4.23)
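To sanity-check answers for the table above, the write-invalidate behavior can be sketched as a tiny MSI-style state machine. This is a simplification under assumed semantics (no bus timing, write-back details omitted), not the exact protocol of any particular machine.

```python
# Hedged sketch of a write-invalidate (MSI-style) snooping protocol.
# States: "M" (modified), "S" (shared), "I" (invalid).

class Cache:
    def __init__(self, name):
        self.name, self.state = name, "I"

def write(writer, others):
    # A write needs the block in M; from S or I the writer first broadcasts
    # an invalidate on the bus, forcing every other copy to I.
    for c in others:
        c.state = "I"
    writer.state = "M"

def read(reader, others):
    if reader.state == "I":
        # Read miss: any M copy is downgraded (and written back) to S.
        for c in others:
            if c.state == "M":
                c.state = "S"
        reader.state = "S"

p1, p2 = Cache("P1"), Cache("P2")
p1.state = p2.state = "S"   # block X shared in both caches initially
write(p1, [p2])             # step 3: P1 wins arbitration, writes X=1
write(p2, [p1])             # step 4: P2 re-issues its write of X=2
read(p1, [p2])              # step 4: P1 reads X; P2's copy downgrades
print(p1.state, p2.state)   # -> S S
```

Tracing the states after each call is one way to fill in the blank coherence-action cells step by step.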
4. Compare a 3-state ESI protocol with a 4-state MESI protocol.
5. Simulate snooping-bus or directory-based coherence actions with a sequence of
instructions from multiple processors.