Question 1 (a) Describe the concept of register renaming. Use a simple program example to illustrate your explanation.

Each time an instruction that writes into a register is dispatched for execution, the register is given a new identification tag. Subsequent instructions that use the register as an operand also get the latest tag. When an instruction has executed, its result emerges from the arithmetic unit with the destination register's tag, and is delivered to any instruction waiting on this tag as an operand. When a waiting instruction has received all its operands, it enters the arithmetic unit for execution.

E.g. before renaming:

    I1  mov r1, 0
    I2  add r2, r3, r1
    I3  mul r1, r3, 3
    I4  sub r4, r1, r2
    I5  div r2, r2, r1

Attach a tag to each register. The tag is incremented each time an instruction that modifies the register is dispatched. All subsequent references to the same register receive the most up-to-date tag:

    I1  mov r1-1, 0
    I2  add r2-1, r3-1, r1-1
    I3  mul r1-2, r3-1, 3
    I4  sub r4-1, r1-2, r2-1
    I5  div r2-2, r2-1, r1-2

The Tomasulo (or equivalent) pipeline ensures that the same register with different tags (e.g. r1-1 and r1-2) behaves as two different registers.

Question 1 (b) Explain why register renaming prevents Read-Write hazard, and why it does not completely prevent Write-Write hazard.

The result of a write carries a different tag from the operand tag of earlier instructions that use the same register. Hence, the new result cannot replace the operand of an earlier instruction. E.g. if I3 executes before I2, there is no problem, because I2 reads from r1-1 while I3 writes to r1-2, which are effectively two different registers.

Two writes have different tags, and their results are delivered to different consumer instructions, so they cannot be confused. If I3 completes before I1, I2 will not receive anything for r1, since I3 broadcasts results for r1-2 and I2 is waiting for results for r1-1. Likewise, when I1 completes, I4 and I5 will not receive an erroneous r1 from I1, since they expect r1-2 and I1 produces r1-1. So the read-write (WAR) hazard is resolved.

But if I3 completes before I1 and updates r1 with r1-2, and later I1 completes and updates r1 with r1-1, we get an erroneous update of r1 (WAW hazard). Hence we need to do re-ordering. This can be done by making use of the tags: the results for r1-2 and r1-1 are written to the re-order buffer (ROB). Suppose I3 produces value y and I1 produces value x.

I3 completes:

    Destination  Tag  Value  Committed
    r1           2    y      n
    r1           1    -      n

I1 completes:

    Destination  Tag  Value  Committed
    r1           2    y      n
    r1           1    x      n

I1's result is committed to r1:

    Destination  Tag  Value  Committed
    r1           2    y      n
    r1           1    x      y

I3's result is committed to r1; the WAW hazard is eliminated:

    Destination  Tag  Value  Committed
    r1           2    y      y
    r1           1    x      y

Of course, we can optimize this by taking only the most up-to-date copy of r1 in the ROB (the one with the highest tag) and writing it to r1. However, this has serious complications for interrupt processing. For example, if the process is interrupted at I2, then even with r1-2 and r1-1 in the ROB, the correct value of r1 is r1-1 and not r1-2; an optimization that skips r1-1 in favor of r1-2 would leave the wrong value in r1 when an interrupt triggers at I2.
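To make the tagging mechanics concrete, the following is a minimal C sketch (not part of the original answer) of tag assignment at dispatch for the five-instruction example above. The Instr encoding, the cur_tag/written arrays and rename_instr are all invented for illustration; a real renamer would also allocate physical registers and reservation-station entries.

    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_REGS 8

    typedef struct { int dst, src1, src2; } Instr;  /* register numbers; -1 = none */

    int  cur_tag[NUM_REGS];     /* current tag of each architectural register */
    bool written[NUM_REGS];     /* has the register been written yet?         */

    void rename_instr(int n, const Instr *in)
    {
        /* Sources first: they must see the tag of the most recent writer. */
        if (in->src1 >= 0) printf("I%d reads  r%d-%d\n", n, in->src1, cur_tag[in->src1]);
        if (in->src2 >= 0) printf("I%d reads  r%d-%d\n", n, in->src2, cur_tag[in->src2]);
        /* The destination gets a fresh tag, so a later write can never
           clobber the operand of an earlier reader (no WAR hazard). For
           brevity the first write reuses tag 1, matching the labelling
           in the example above. */
        if (in->dst >= 0) {
            if (written[in->dst]) cur_tag[in->dst]++;
            written[in->dst] = true;
            printf("I%d writes r%d-%d\n", n, in->dst, cur_tag[in->dst]);
        }
    }

    int main(void)
    {
        Instr prog[] = {      /* the example above                 */
            {1, -1, -1},      /* I1: mov r1, 0                     */
            {2,  3,  1},      /* I2: add r2, r3, r1                */
            {1,  3, -1},      /* I3: mul r1, r3, 3                 */
            {4,  1,  2},      /* I4: sub r4, r1, r2                */
            {2,  2,  1},      /* I5: div r2, r2, r1                */
        };
        for (int i = 0; i < NUM_REGS; i++) cur_tag[i] = 1;  /* initial contents carry tag 1 */
        for (int i = 0; i < 5; i++) rename_instr(i + 1, &prog[i]);
        return 0;
    }

Running this prints exactly the tag assignments shown in the renamed program, e.g. I3 writes r1-2 while I2 still reads r1-1.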
Question 2 (a) File processing programs often use both indexed addressing mode and displacement addressing mode; explain by providing one example of each. (Note: this question is specifically in the file-processing context. Other examples are not relevant. You are not required to provide program examples. Descriptive explanations are fine.)

Main gist: indexed addressing is useful for accessing arrays; displacement addressing is useful for accessing particular records.

Why: indexed addressing encodes the base address in the instruction and the index in a register. Accessing the next element is accomplished just by incrementing the register by 1. The final address is base + index * size, where size is the size of a data item in words or bytes.

Displacement addressing encodes the displacement in the instruction and the base address in a register, and hence is convenient mainly for accessing single items/records. The final address is displ * size + base, where size is the size of a data item in words or bytes.

The two addressing modes are not generally interchangeable (e.g. you cannot obtain indexed addressing by encoding the base in the displacement portion of the instruction and placing the index in the register), because the displacement portion of a displacement-mode instruction is generally shorter than the base portion of an indexed-mode instruction. Hence indexed mode is useful for accessing an array of records, while displacement mode is useful for accessing a particular record.
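Although the question does not require program examples, a short C sketch makes the contrast concrete; the record layout and all names below are invented for illustration. A compiler would typically use indexed addressing for the array scan and displacement addressing for the field access.

    /* A hypothetical fixed-size file record. */
    struct record {
        int  key;
        char name[60];
    };

    struct record table[100];   /* an array of records, as in a data file */

    /* Scanning the array: naturally compiled with INDEXED addressing.
       The base address of `table` sits in the instruction, i in a
       register; final address = base + i * sizeof(struct record). */
    int find(int key)
    {
        for (int i = 0; i < 100; i++)
            if (table[i].key == key)
                return i;
        return -1;
    }

    /* Accessing one record's fields: naturally compiled with
       DISPLACEMENT addressing. The record's address is in a register,
       the small field offset is encoded in the instruction. */
    int get_key(const struct record *r)
    {
        return r->key;
    }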
Question 2 (b) Explain why RISC processors normally use the Execute stage of the pipeline to perform address computation.

Primary reason: savings in hardware.

Why: in a RISC architecture, arithmetic instructions (e.g. add, sub, mul, div, shl, shr, and, or, not, etc.) cannot access memory and hence do not require address computation; the arithmetic unit can be dedicated totally to processing the instruction. Instructions that require address computation (e.g. load, store) cannot perform arithmetic (add, sub, etc.), so for them the arithmetic unit is used solely for computing the address. There is therefore no need for a separate arithmetic unit to do address computation: the unit in the EX stage is "re-used". Economical.

Downside: the memory address is known only at the end of the EX stage.

Question 3 (a) Explain the idea of "set-associative" in cache, translation-lookaside buffer or other devices that use the idea.

Side issues (background, not the question itself):

Direct-mapped cache: the rightmost n bits of an address (or VPN) index a cache with 2^n blocks. The data (or PPN) for this address then goes to the indexed block. The bits to the left of the index bits are stored as a "tag" to determine cache hit/miss. E.g. for a DM cache with n = 4:

    Data for address 001101 goes to block 13, tag 00
    Data for address 001010 goes to block 10, tag 00
    Data for address 011101 goes to block 13, tag 01
    Data for address 111001 goes to block 9,  tag 11

Good: simple to implement.
Bad: frequent collisions result in a poor hit rate. E.g. repeated alternating accesses to locations 001101 and 011101 (access 001101, then 011101, then 001101, then 011101, ...) give a hit rate of 0.

Fully associative cache/TLB: the entire address/VPN is stored as a tag in any block of the cache. When an address/VPN is presented, a parallel search across all blocks is made for a matching tag. On a hit, data is read out of/written to that block.

Good: flexibility to implement clever replacement algorithms to maximize hit rate.
Bad: needs a comparator for every block; becomes too big and slow when many blocks are involved. Hence limited to small caches.

Actual answer. Set-associative cache/TLB: a compromise between DM and FA. The rightmost m bits of the address index a set of blocks; blocks within the set are fully associative. There are 2^m sets, and a certain number of blocks in each set. E.g. if m = 3, there are 8 sets. Suppose each set has 2 blocks. Then:

    Data for address 001101 goes to set 5, block 0, tag 001
    Data for address 001010 goes to set 2, block 0, tag 001
    Data for address 011101 goes to set 5, block 1, tag 011
    Data for address 111001 goes to set 1, block 0, tag 111

To access, take the rightmost 3 bits (e.g. 101b = 5), then search both blocks in set 5 for the tag 001. If found, hit; else miss.

Good: has the flexibility offered by FA in allocating blocks to maximize hit rate, yet only requires as many comparators as there are blocks in each set. So if each set has 2 blocks, we only need 2 comparators. Fast, cheap.
Bad: less associativity (and a lower hit rate) than FA; more complex than DM.
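A minimal C sketch of the 2-way set-associative lookup described above; the data-structure names are invented, and the per-way comparisons that hardware performs in parallel are written as a loop.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS 8          /* 2^m sets, m = 3       */
    #define WAYS     2          /* blocks (ways) per set */

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint32_t data;
    } Block;

    Block cache[NUM_SETS][WAYS];

    /* Look up an address: the rightmost 3 bits select the set, the
       remaining bits form the tag; only WAYS comparators are needed. */
    bool lookup(uint32_t addr, uint32_t *out)
    {
        uint32_t set = addr % NUM_SETS;    /* rightmost m bits */
        uint32_t tag = addr / NUM_SETS;    /* bits to the left */
        for (int w = 0; w < WAYS; w++) {   /* parallel in hardware */
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                *out = cache[set][w].data; /* hit */
                return true;
            }
        }
        return false;                      /* miss */
    }

E.g. address 001101b = 13 gives set 13 % 8 = 5 and tag 13 / 8 = 001b, matching the table above.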
Question 3 (b) In a multiprocessor shared-memory system, one processor may invalidate data in another processor. Explain why this is useful, and describe one example of such invalidation, going through it step by step from the initial condition making invalidation necessary to subsequent events canceling the invalidation.

When a memory block is shared and has multiple copies in different caches, a processor that modifies its copy must invalidate the other copies, so that the other processors obtain the latest values the next time their copies are used, instead of using outdated copies.

The events are (note: this is NOT the MESI protocol, but an imaginary one):
(1) Processor A first uses block X, which is tagged Exclusive.
(2) Processor B also uses block X, and both copies are tagged Shared.
(3) Processor A modifies its copy, which is again marked Exclusive and written through to memory, while B's copy is marked Invalid.
(4) An access by B to its copy causes a cache miss, fetching the latest copy from memory. Both copies are now marked Shared again.

Question 4 (a) Explain why the Ethernet protocol is not suitable for linking processors with memory modules, which usually go through a bus or switch.

Main point: transfers to/from memory are usually small and frequent. Hence low overhead, low latency and predictable timing are important.

Characteristics of Ethernet:
i) High overheads: Ethernet has a complex packet and framing structure, and requires line encoding (e.g. Manchester encoding in classic Ethernet) before transmitting and decoding upon receiving. All of this adds overhead. Buses and switches generally use plain binary levels (e.g. +5 V/0 V); routing is done once and data is sent rapidly.
ii) Slow: Ethernet is a serial medium, transmitting 1 bit at a time. Buses and switches are parallel architectures, with widths matching the machine word size (or more!), and hence fast to transmit.
iii) Ethernet arbitration is by contention. A station monitors the line until it is quiet, then transmits; as it transmits it continues to monitor for collisions. If a collision occurs, the transmitting stations back off for a random period of time and try again. Hence latencies are unpredictable. Buses and switches, once set up and routed properly, have fixed latencies (bus propagation delay + switch latencies).
iv) The contention-based protocol also causes bandwidth to decay rapidly when many stations (CPUs) are transmitting (sending/receiving to/from memory); at worst it can decay to almost 0 bps. Buses and switches, once set up and routed, are dedicated channels between CPU and memory.

Question 4 (b) A polynomial a + bz + cz^2 + dz^3 + ez^4 + fz^5 + gz^6 + hz^7 + ... may be computed on an N-node distributed machine in O(log N) steps: node i multiplies its coefficient by z, z^2 and z^4 according to the binary digits of i:

    node:         0  1  2  3  4  5  6  7  ...
    coefficient:  a  b  c  d  e  f  g  h  ...
    * z:             *     *     *     *
    * z^2:              *  *        *  *
    * z^4:                    *  *  *  *

This is followed by a summation, which takes log N steps as shown in a tutorial. Outline a simple program doing the first part shown above. There is no need to show the rest of the program, which is the summation part; the two parts are quite similar, as the computation in each iteration depends on a binary digit of the node index.

Analysis: at each step, v (initially z) needs to be squared, so that we can compute z * z to give us z^2, z * z^2 to give us z^3, z^2 * z^2 to give us z^4, z * z^4 to give us z^5, etc. Also, at some point we need to multiply in the coefficient. We can actually do this at any time, as long as it is done only once; we multiply in the coefficient whenever I is odd. The table below shows how it works (r is the running result, v the current power of z, I the remaining index bits):

    Node  Initially        After step 1       After step 2         After step 3
    P0    I=0 r=a          r=a                r=a                  r=a
    P1    I=1 r=b v=z      I=0 r=bz  v=z^2    r=bz                 r=bz
    P2    I=2 r=c v=z      I=1 r=c   v=z^2    I=0 r=cz^2  v=z^4    r=cz^2
    P3    I=3 r=d v=z      I=1 r=dz  v=z^2    I=0 r=dz^3  v=z^4    r=dz^3
    P4    I=4 r=e v=z      I=2 r=e   v=z^2    I=1 r=e     v=z^4    I=0 r=ez^4  v=z^8
    P5    I=5 r=f v=z      I=2 r=fz  v=z^2    I=1 r=fz    v=z^4    I=0 r=fz^5  v=z^8
    P6    I=6 r=g v=z      I=3 r=g   v=z^2    I=1 r=gz^2  v=z^4    I=0 r=gz^6  v=z^8
    P7    I=7 r=h v=z      I=3 r=hz  v=z^2    I=1 r=hz^3  v=z^4    I=0 r=hz^7  v=z^8

The algorithm takes log2 N steps to complete.

Program:

    r = coeff;       // depends on the node: node 0 has coeff = a, node 1 has coeff = b, etc.
    v = z;
    I = nodenum();   // get node number
    while (I > 0) {
        if (odd(I)) r = r * v;
        v = v * v;
        I = I / 2;
    }
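To check the outline, here is a runnable single-machine simulation of the first part, with invented example values (coefficients 1..8 and z = 2); the loop over node stands in for running the program on N = 8 nodes, and nodenum() becomes the loop index.

    #include <stdio.h>

    int main(void)
    {
        double coeff[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* a..h */
        double z = 2.0;
        double term[8];

        for (int node = 0; node < 8; node++) {
            double r = coeff[node];
            double v = z;
            int I = node;                 /* nodenum() on a real machine */
            while (I > 0) {               /* at most log2(N) iterations  */
                if (I & 1) r = r * v;     /* multiply the coefficient in
                                             when the current bit is set */
                v = v * v;                /* z, z^2, z^4, ...            */
                I >>= 1;
            }
            term[node] = r;               /* node i now holds coeff_i * z^i */
            printf("node %d: %g\n", node, r);
        }

        /* A serial sum stands in for the log N parallel summation half,
           just to verify the terms. */
        double sum = 0;
        for (int i = 0; i < 8; i++) sum += term[i];
        printf("polynomial value: %g\n", sum);
        return 0;
    }

For instance node 5 computes f*z then f*z*z^4 = f*z^5, exactly as in the trace table.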
Question 5 (a) Describe the use of parity checks in storage systems and ATM switches. For each example, describe the situation when the parity checks are needed to solve a problem, and how this solution is carried out.

Storage systems: parity is computed by adding (in practice XOR, i.e. addition modulo 2) across the same block of all disks, and across all blocks within the same disk. Recovery of a single disk or block is achieved by subtracting the sum of all the other disks/blocks from the parity. It is possible to do multiple-disk recovery (see tutorial question).

ATM switch: the switch computes a checksum across all data in the packet and compares it with the checksum stored in the packet. If they match, the packet is sent along to the next node in the virtual circuit. If they mismatch, the switch asks the previous node to re-send the packet.

Question 5 (b) The Itanium processor is classified as an EPIC, explicitly parallel instruction computer. Explain the concept EPIC. Also explain how Itanium handles branches and how this is different from MIPS processors.

Reference: Computer Organization and Architecture, 5ed, William Stallings.

Key point: explicit parallelism.

i) EPIC is a VLIW (very long instruction word) architecture.
   i. In this architecture, multiple instructions can be placed into a single instruction word.
   ii. Each instruction word is loaded, and the multiple instructions are dispatched to separate execution units.
   iii. Dependencies are NOT checked by hardware!! The IA-64 can handle up to 3 instructions per instruction word.
ii) Instructions that can be executed in parallel must be bundled together. Bits are set in the "template" portion of the instruction word to tell the CPU which instructions are independent. In the IA-64/Itanium the independent instructions can span instruction words.
   i. The compiler re-arranges instructions so that independent instructions are placed into the same instruction word or contiguous instruction words. So if the compiler can find 8 independent instructions, they span a total of 2 2/3 instruction words (each holding 3 instructions). The template bits are set to show that these eight instructions are independent.
   ii. The CPU takes these 8 instructions and executes them in parallel.
iii) Hence parallelism must be explicitly marked by the compiler in the instruction -> Explicitly Parallel Instruction Computer.

Branches:

MIPS handles branches by attempting to predict the direction of the branch (taken/not taken). If predicted taken, instructions are fetched and executed from the target; if predicted not taken, instructions are fetched and executed following the branch. A mis-prediction requires the pipeline to be flushed before the correct instructions are fetched and executed.

Itanium handles branches by predication. E.g.:

    if (x == 1) {
        y = y + 1;
        z = y * 2;
    } else {
        y = y - 2;
        z = y / 3;
    }

Let x be register r1, y be r2, z be r3. Conventional MIPS style:

            cmpi  r0, r1, 1      // r0 is 0 if r1 != 1
            jz    r0, ELSE_PART
            addi  r2, r2, 1
            multi r3, r2, 2
            j     EXIT
    ELSE_PART:
            subi  r2, r2, 2
            divi  r3, r2, 3
    EXIT:
            ...

But in Itanium style, a predicate Px is appended to instructions in the TRUE portion, and a predicate Py is appended to instructions in the FALSE portion. There is no need to append predicates to any other instructions.

    <P1, P2> = cmp(r1 == 1)      // P1 = true predicate,
                                 // P2 = false predicate
    <P1> addi  r2, r2, 1
    <P1> multi r3, r2, 2
    <P2> subi  r2, r2, 2
    <P2> divi  r3, r2, 3

Predication allows both the TRUE and FALSE portions to be executed simultaneously. When the outcome of the comparison is known, only the results of the correct predicate are committed; the results of the wrong predicate are discarded. This is trivially achieved by writing the results of both sides (<P1> and <P2>) to the ROB (re-order buffer), attaching a tag to each result indicating which predicate it belongs to. The results carrying the correct tag are then committed, while those with the wrong tag are purged from the ROB.

This simplifies branch handling, eliminates the costs associated with pipeline flushing, and the predicated streams (which are naturally independent) can be executed in parallel, giving the compiler more independent instructions with which to fill instruction bundles.
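The effect of predication can be sketched in plain C: compute both arms, then commit only the results selected by the predicate, mirroring the ROB tagging just described. This is only an analogy (compilers typically turn such code into conditional-move instructions); the function name is invented.

    /* Branch-free version of the if/else above: both arms execute,
       the predicate p selects which results are committed. */
    void branch_free(int x, int *y, int *z)
    {
        int p = (x == 1);                       /* <P1,P2> = cmp(x == 1) */
        int y_t = *y + 1, z_t = (*y + 1) * 2;   /* <P1> (TRUE)  arm      */
        int y_f = *y - 2, z_f = (*y - 2) / 3;   /* <P2> (FALSE) arm      */
        *y = p ? y_t : y_f;                     /* commit correct arm,   */
        *z = p ? z_t : z_f;                     /* discard the other     */
    }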
Side issues: the Itanium also has speculative loading. A speculative load returns immediately (instead of waiting for the load to complete) while the load goes on independently. Moreover, the result of the load is not committed to the register until a later time. This allows the compiler to move all loads to the start of the program and then schedule other instructions while the loads are going on, until the results of the loads are actually needed.

Of course, a load may trigger an exception (e.g. loading from another process's address space, causing a protection violation). It would not be useful to interrupt the process there and then, because the load may never actually be used (e.g. it may have been pulled up from a portion of code that never would have been executed). Instead, a check is done just before the result of the load is used, and if an error had occurred, the exception is triggered at this point. This checking operation also commits the result of the load to the register, since the speculative load takes place at the start of the program rather than at the "correct" position.

E.g. original code without speculation or predication:

            cmpi r0, r1, 1
            jz   r0, ELSE
            lw   r4, 0(r7)
            addi r4, r4, 2
            mul  r7, r4, r3
            j    EXIT
    ELSE:
            lw   r5, 0(r1)
            addi r5, r5, 2
            divi r7, r5, 7
    EXIT:
            ...

With predication:

    <p1, p2> = cmpi(r1 == 1)
    <p1> lw   r4, 0(r7)
    <p1> addi r4, r4, 2
    <p1> mul  r7, r4, r3
    <p2> lw   r5, 0(r1)
    <p2> addi r5, r5, 2
    <p2> divi r7, r5, 7
    ...

Pull all lw to the top and replace them with speculative loads, s.lw. Replace each original lw in the predicated instructions with an s.check to do the checking for exceptions:

         s.lw r4, 0(r7)
         s.lw r5, 0(r1)
    <p1, p2> = cmpi(r1 == 1)
    <p1> s.check r4
    <p1> addi r4, r4, 2
    <p1> mul  r7, r4, r3
    <p2> s.check r5
    <p2> addi r5, r5, 2
    <p2> divi r7, r5, 7
    ...

The s.check checks for exceptions and commits the result of the load to the register.
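The deferred-exception behaviour of s.lw/s.check can be mimicked in C. In the sketch below, all names are invented and a NULL pointer stands in for a faulting address; the fault is recorded at load time and raised only when the value is checked before use, roughly like the IA-64's NaT ("not a thing") bit.

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        int  value;
        bool deferred_fault;    /* set instead of raising an exception */
    } SpecReg;

    /* s.lw: load speculatively, deferring any fault. */
    SpecReg spec_load(const int *addr)
    {
        SpecReg r = {0, false};
        if (addr == NULL)            /* stand-in for a protection violation */
            r.deferred_fault = true; /* defer: the value may never be used  */
        else
            r.value = *addr;
        return r;
    }

    /* s.check: raise the deferred exception at first real use,
       otherwise commit the loaded value. */
    int spec_check(SpecReg r)
    {
        if (r.deferred_fault) {
            fprintf(stderr, "deferred load fault\n");
            exit(1);
        }
        return r.value;
    }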