Erik Lee
EN 525.712 Advanced Computer Architecture
Charles B. Cameron
Homework Set 1 (P2.2, P2.13, P2.17)
2/14/12

Problem 2.2

Using Equation 2.7, the cost/performance-optimal pipeline depth k_opt can be computed using parameters G, T, L, and S. Compute k_opt for the pipelined floating-point multiplier example in Section 2.1 by using the chip count as the cost terms (G = 175 chips, L = 82/2 = 41 chips per interstage latch) and the delays shown for T and S (T = 400 ns, S = 22 ns). How different is k_opt from the proposed pipelined design?

The parameters of Equation 2.7 are:

  G = cost of the nonpipelined hardware    = 175 chips
  L = cost of each interstage latch        = 82/2 = 41 chips
  T = latency of the nonpipelined design   = 400 ns
  S = delay added by each latch            = 22 ns

Equation 2.7 gives the cost/performance-optimal pipeline depth:

  k_opt = \sqrt{\dfrac{G \cdot T}{L \cdot S}}

Plugging in the values, and rounding down so the design is not over-pipelined:

  k_opt = \sqrt{\dfrac{175 \cdot 400}{41 \cdot 22}} \approx 8.8, so k_opt = 8 stages.

The proposed design has the same number of stages as calculated by the equation.

Problem 2.13

Given the IBM experience outlined in Section 2.2.4.3, compute the CPI impact of the addition of a level-zero data cache that is able to supply the data operand in a single cycle, but only 75% of the time. The level-zero and level-one caches are accessed in parallel, so that when the level-zero cache misses, the level-one cache returns the result in the next cycle, resulting in a load-delay slot. Assume uniform distribution of level-zero hits across load-delay slots that can and cannot be filled. Show your work.

Three features can save a load instruction from its penalty, and each saves a single clock cycle: forwarding hardware saves a cycle for every load, scheduling the load-delay slot saves another cycle 75% of the time, and with the added level-zero cache a hit saves another cycle for 75% of loads.

  Forwarding    Scheduled    Level-zero hit    Load penalty
  Yes (100%)    Yes (75%)    Yes (75%)         0 cycles
  Yes (100%)    Yes (75%)    No (25%)          0 cycles
  Yes (100%)    No (25%)     Yes (75%)         0 cycles
  Yes (100%)    No (25%)     No (25%)          1 cycle

Similarly, three features can save a branch instruction from its penalty, each worth a single clock cycle: having a PC-relative target address, being unconditional or schedulable (an unconditional branch gives the same benefit as a schedulable one), and hitting in the level-zero cache.

  Unconditional    PC-relative    Scheduled    Level-zero hit    Branch penalty
  Yes (33%)        Yes (90%)      -            Yes (75%)         0 cycles
  Yes (33%)        Yes (90%)      -            No (25%)          0 cycles
  Yes (33%)        No (10%)       -            Yes (75%)         0 cycles
  Yes (33%)        No (10%)       -            No (25%)          1 cycle
  No (66%)         Yes (90%)      Yes (50%)    Yes (75%)         0 cycles
  No (66%)         Yes (90%)      Yes (50%)    No (25%)          0 cycles
  No (66%)         Yes (90%)      No (50%)     Yes (75%)         0 cycles
  No (66%)         Yes (90%)      No (50%)     No (25%)          1 cycle
  No (66%)         No (10%)       Yes (50%)    Yes (75%)         0 cycles
  No (66%)         No (10%)       Yes (50%)    No (25%)          1 cycle
  No (66%)         No (10%)       No (50%)     Yes (75%)         1 cycle
  No (66%)         No (10%)       No (50%)     No (25%)          2 cycles

In calculating the penalties for these scenarios: if at least two cycle-saving features apply, there is no penalty; if only one applies, there is a one-cycle penalty; and if none apply, there is a two-cycle penalty.
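As a cross-check of the two tables above, the following short Python sketch (my own sanity check, not part of the assigned solution method) enumerates the same scenarios with the same probabilities used in this write-up (25% loads, 20% branches, 33%/66% unconditional/conditional, 90% PC-relative, 50% schedulable, 75% level-zero hit) and sums the expected penalty cycles. It should reproduce the hand calculation that follows.

from itertools import product

P_LOAD, P_BRANCH = 0.25, 0.20   # instruction mix used throughout this write-up

def expected_load_penalty():
    """Loads: forwarding always applies; only an unscheduled load that also
    misses the level-zero cache pays the one-cycle load-delay slot."""
    total = 0.0
    for (sched, p_s), (hit, p_h) in product([(True, 0.75), (False, 0.25)],
                                            [(True, 0.75), (False, 0.25)]):
        penalty = 0 if (sched or hit) else 1
        total += p_s * p_h * penalty
    return P_LOAD * total

def expected_branch_penalty():
    """Branches start with a two-cycle penalty; a PC-relative target, being
    unconditional or scheduled, and a level-zero hit each remove one cycle
    (never below zero), exactly as in the table above."""
    total = 0.0
    for uncond, p_u in [(True, 0.33), (False, 0.66)]:
        # Scheduling only applies to conditional branches (the "-" entries).
        sched_opts = [(False, 1.0)] if uncond else [(True, 0.5), (False, 0.5)]
        for (pcrel, p_p), (sched, p_s), (hit, p_h) in product(
                [(True, 0.90), (False, 0.10)], sched_opts,
                [(True, 0.75), (False, 0.25)]):
            saved = int(pcrel) + int(uncond or sched) + int(hit)
            total += p_u * p_p * p_s * p_h * max(0, 2 - saved)
    return P_BRANCH * total

cpi = 1 + expected_load_penalty() + expected_branch_penalty()
print(round(cpi, 6))   # 1.042025, matching the hand calculation below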
The CPI is the base of one cycle per instruction plus the expected penalty cycles from every scenario in the tables above that still pays a penalty, where 25% of instructions are loads and 20% are branches:

  CPI = 1 (base cycle per instruction)
      + load, not scheduled, level-zero miss:                                 25% * 25% * 25% * 1
      + branch, unconditional, not PC-relative, level-zero miss:              20% * 33% * 10% * 25% * 1
      + branch, conditional, PC-relative, not scheduled, level-zero miss:     20% * 66% * 90% * 50% * 25% * 1
      + branch, conditional, not PC-relative, scheduled, level-zero miss:     20% * 66% * 10% * 50% * 25% * 1
      + branch, conditional, not PC-relative, not scheduled, level-zero hit:  20% * 66% * 10% * 50% * 75% * 1
      + branch, conditional, not PC-relative, not scheduled, level-zero miss: 20% * 66% * 10% * 50% * 25% * 2

  CPI = 1
      + 0.25 * 0.25 * 0.25
      + 0.20 * 0.33 * 0.10 * 0.25
      + 0.20 * 0.66 * 0.90 * 0.50 * 0.25
      + 0.20 * 0.66 * 0.10 * 0.50 * 0.25
      + 0.20 * 0.66 * 0.10 * 0.50 * 0.75
      + 0.20 * 0.66 * 0.10 * 0.50 * 0.25 * 2

  CPI = 1.042025

Problem 2.17

The MIPS pipeline shown in Table 2.7 employs a two-phase clocking scheme that makes efficient use of a shared TLB, since instruction fetch accesses the TLB in phase one and data fetch accesses it in phase two. However, when resolving a conditional branch, both the branch target address and the branch fall-through address need to be translated during phase one (in parallel with the branch condition check in phase one of the ALU stage) to enable instruction fetch from either the target or the fall-through path during phase two. This seems to imply a dual-ported TLB. Suggest an architected solution to this problem that avoids dual-porting the TLB.

The translation for the branch instruction itself was already obtained from the TLB when the branch was fetched, during the IF2 and RD1 phases. Because the fall-through address is simply the next sequential address, it normally lies in the same page as the branch, so its translation is already known; only the branch target needs a TLB lookup in phase one, and no second TLB port is required. The only thing we have to check for is whether the fall-through address crosses a page boundary. When it does, we incur a penalty cycle and query the TLB for the correct translation.
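To make the boundary check concrete, here is a minimal sketch (my own illustration; it assumes 4 KiB pages and a simple sequential 4-byte fall-through, neither of which is specified by the problem) of the decision to reuse the branch's own translation versus paying the penalty and querying the TLB:

PAGE_SHIFT = 12   # assumed 4 KiB pages, purely illustrative
INSTR_BYTES = 4   # fixed-length MIPS instructions

def fall_through_needs_tlb(branch_pc: int) -> bool:
    """True only when the fall-through instruction lies in a different page
    than the branch itself, so the translation obtained when the branch was
    fetched cannot be reused and a (penalized) TLB lookup is required."""
    fall_through = branch_pc + INSTR_BYTES
    return (branch_pc >> PAGE_SHIFT) != (fall_through >> PAGE_SHIFT)

# A branch in the middle of a page reuses its own translation;
# a branch in the last word of a page must go back to the TLB.
assert not fall_through_needs_tlb(0x00001FF0)
assert fall_through_needs_tlb(0x00001FFC)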