Dynamically Translating Instructions to Cope with Faulty Hardware

Kevin Kauffman (Kevin.Kauffman@duke.edu)
Jeremy Walch (Jeremy.Walch@duke.edu)
Department of Electrical and Computer Engineering, Pratt School of Engineering, Duke University

Abstract

In any computing application, uptime is of utmost importance; downtime on a machine will most likely result in a loss of money. Hardware faults are a common occurrence when working with technology at the nanometer scale, and their frequency will only increase as processes shrink. Even with millions or billions of transistors on a chip, a single error can cripple a processor and make its output useless. We propose a system by which faulty processing cores can be made usable by translating instructions on the fly to route the datapath around the faulty hardware. Our goal was a fault-tolerant core that appears identical from the outside while coping with a variety of faults, where the only external sign of a fault is longer execution time. We accomplish this by adding a stage to the pipeline between fetch and decode which pulls translations out of a dedicated ROM when a fetched instruction would use known faulty hardware.

1. Introduction

With transistors at the nanometer scale, faults in hardware are becoming ever more common; the soft error rate in logic at 50nm is expected to be around 100 FIT, approximately nine orders of magnitude higher than at 600nm [1]. A standard non-fault-tolerant processor is "dumb" in that it will continue to pump instructions through faulty components, generating garbage results. The goal of a fault-tolerant processor is to generate correct results even in the presence of faulty hardware. In a wide processor, hardware faults are of little concern because natural component redundancy is built into the design of a superscalar processor.
Similarly, multi-core and multi-threaded processors can easily be leveraged for fault tolerance by running the same instructions on multiple hardware resources [2][3]. A more interesting challenge is coping with faulty hardware in a simple one-wide core, where redundant hardware is not available. Before one can build a system which copes with faulty hardware, there must be a mechanism in place to determine whether hardware is faulty. Systems which accomplish this at relatively low hardware cost have been devised previously [4][5]. In particular, [4] is shown to be highly effective at detecting faults in simple cores, and thus for the remainder of this paper we consider the fault detector to be ideal (i.e., it detects all faults).

Dynamically Translating Instructions to Cope with Faulty Hardware, December 2009

Instruction translation is simply the case in which the code executing in the core is not the same code which was sent to the processor to be executed. Numerous architectures take advantage of instruction translation for various purposes. Transmeta used instruction translation that effectively served as a VLIW hardware compiler, so the code executing on the hardware was already optimized, translated code [6]. Intel uses dynamic translation when it converts incoming, cumbersome x86 instructions into proprietary micro-ops which are natively executed on the hardware [7]. In both cases, the translation happens at runtime and is invisible to the outside: to the external viewer, execution looks identical even though the instructions are translated to something different inside the processor core. In terms of fault-tolerant computing, dynamic instruction translation is a natural solution.
Although it is possible to expose known defects through a static architecture to a compiler [8], ideally a faulty processor should appear to operate identically to a non-faulty one. One approach is to modify code so that it can verify data integrity via a clever software algorithm, as has previously been proposed [9]; however, this incurs the overhead of re-compiling code. Translations that occur at runtime circumvent this penalty. Furthermore, in many cases the existence of a fault may not be known until runtime, so a static solution may not be practical. One difference between our translations for fault tolerance and Intel's micro-ops is that fault-tolerant translations occur conditionally: Intel translates every instruction after it is fetched, whereas in the scheme we present, only instructions which would use faulty hardware are translated.

The rest of this paper is organized as follows. Section 2 discusses previous work on detouring. Section 3 presents our implementation of dynamic translation. Section 4 describes our evaluation methodology, section 5 presents the results, and section 6 discusses them. Section 7 concludes, and section 8 discusses future directions.

2. Detouring

Detouring, as proposed by [10], is the idea of rewriting instructions to use different hardware than originally intended, such that the datapath "detours" around the hard faults. The goal is to find cheap (in terms of cycles) translations which simulate the original instructions while avoiding known faulty hardware. An example is synthesizing a left shift using additions, as shown in Figure 1. The scheme also leverages detailed knowledge about the faults: for example, if a 32x32 multiplier is known to fault only in the 37th bit of the result, it can still be used as a 16x16 multiplier without error.
We do not consider the case of partially functional arithmetic units, though such techniques could also be applied to our scheme. In either case, the software must be translated such that the originally written code is not executed on the machine (assuming the core has some non-zero number of faults). In detouring as originally presented, when faults are found in the core, the code must be recompiled with the correct translations of instructions.

Figure 1: Example detour

We build upon this idea with a separate scheme in which instructions are dynamically translated into a different instruction sequence by the hardware at runtime if a fault is detected, in a manner similar to the ACFs proposed in DISE [11]. This method has advantages over detouring because the same binaries can be fed to both faulty and non-faulty cores and produce identical outputs, except for the extra latency incurred during translation in the faulty core. This means that the system is not limited to multi-core processors or to systems capable of recompiling software. It also means that the processor does not incur the overhead of recompiling code once a fault is detected; the only penalty is the time it takes to simulate the instruction(s) which require faulty hardware.

3. Dynamic Translation

3.1 Assumptions

Before continuing to our hardware implementation of dynamic translation, we must make some assumptions about the types of cores we are dealing with. The cores we consider are strictly small, simple cores implementing a RISC architecture. The core executes in order and is one-wide with a shallow pipeline (e.g., a prototypical 5-stage pipeline). We limit the scope of this paper because of the nature of complex cores: as processors rise in complexity, they inevitably trend toward superscalar designs.
At that point, instruction translation becomes a poor solution due to the natural redundancy built into the system: if a multiplier goes down, there is no reason to take a performance hit synthesizing multiplies out of shifts and adds when the same task can be completed much faster by waiting for another available multiplier. Beyond these practical considerations, our scheme should be implementable on any microarchitecture in which a stage can be inserted between fetch and decode.

3.2 The Translation Stage

To translate instructions on the fly, we add an extra pipeline stage between the fetch and decode stages. Instructions fetched in program order are fed into this stage. Its purpose is to examine each instruction, determine whether it is affected by faulty hardware, and then either pass it to the decode stage or feed the known translation back into itself (stalling fetch if necessary). When a faulty instruction is received, the stage must pass through each instruction in the detour before proceeding to the next instruction of the original program.

3.3 The Valid Table

A key hardware structure resident in this translation stage is the valid table. It is indexed by opcode, and each entry contains a valid bit and the location of the translation routine, as shown in Figure 2. The valid bit signifies whether that instruction can be executed (i.e., whether it would use faulty hardware). The valid bits are set by the error detection scheme, however it is implemented. When an instruction is fetched, the first step is to look it up in the valid table. If the valid bit is high, the instruction is simply passed along to the next stage. If the valid bit is low, the original instruction cannot be executed on the currently operable hardware.
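The valid-table decision can be sketched in software. The opcodes, entry encoding, and ROM addresses below are our own hypothetical choices, not the paper's actual encoding:

```python
# A sketch of a valid-table entry and the translation stage's decision:
# pass the instruction to decode, or redirect fetch into the translation
# ROM at the stored pointer.

valid_table = {
    # opcode: (valid bit, translation-ROM pointer)
    "ADDQ": (True,  None),
    "SLL":  (False, 0x40),   # shifter marked faulty by the error detector
    "MULL": (False, 0x80),
}

def translation_stage(opcode):
    valid, rom_ptr = valid_table[opcode]
    return ("PASS_TO_DECODE", None) if valid else ("FETCH_FROM_ROM", rom_ptr)

assert translation_stage("ADDQ") == ("PASS_TO_DECODE", None)
assert translation_stage("SLL") == ("FETCH_FROM_ROM", 0x40)
```

In hardware this is a single indexed lookup per fetched instruction, so the common (fault-free) path adds only the latency of the extra stage.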
Figure 2: Valid Table

3.4 Auxiliary Registers

Each translation is, of course, comprised of a set of instructions whose eventual output matches the output of the original instruction, had it been able to execute properly. Since each intermediate instruction must store its result somewhere that does not interfere with the rest of the program, we propose adding a small separate register file, perhaps only eight registers, which holds the intermediate operands and is usable only by translated instructions. This is a small hardware price to pay to ensure that no critical registers in the architected register file are overwritten. An alternative would be to save registers to memory and load the values back at the end of a translated routine, but this adds extra latency to the completion of every translated instruction. While the register-file approach does have an associated hardware cost, we believe it is preferable to the potential performance penalty of the second option. Therefore, when the processor is in "detour mode" as opposed to normal operation, register codes are directed to the auxiliary register file instead of the normal file. Special registers such as the zero register may overlap between the two. Another area of overlap is that a translated subroutine needs to use operands from the architected register file rather than the auxiliary one. To accomplish this, when the translation stage initiates a translation, the register names of the original operands must be passed to the subroutine. The operand values are passed into the auxiliary file via a bus; this necessitates that the original operand values be bypassed in to ensure that the subroutine gets the correct values. The result is copied back from the auxiliary register file to the original location in the architected register file via the same mechanism.
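The operand marshalling described above can be sketched as follows. The register counts, slot numbers, and helper names are our own illustrative choices (the fixed-slot convention itself is described in section 3.5), and bypassing is elided:

```python
# A sketch of operand marshalling between the architected register file and
# the auxiliary file at translation entry and exit.

ARCH_REGS = 32
AUX_REGS = 8
OP_A, OP_B, RESULT = 0, 1, 2   # fixed auxiliary slots, by convention

def enter_detour(arch, ra, rb):
    """Copy the original operand values over the bus into fixed aux slots."""
    aux = [0] * AUX_REGS
    aux[OP_A] = arch[ra]       # bypassed-in value of the first operand
    aux[OP_B] = arch[rb]
    return aux

def exit_detour(arch, aux, rc):
    """Copy the result back to the original architected destination."""
    arch[rc] = aux[RESULT]

arch = [0] * ARCH_REGS
arch[3], arch[4] = 6, 7
aux = enter_detour(arch, 3, 4)
aux[RESULT] = aux[OP_A] + aux[OP_B]   # the translation routine runs here
exit_detour(arch, aux, 5)
assert arch[5] == 13
```

Because the slots are fixed, the translation routines in the ROM can be written once, independent of which architected registers the original instruction named.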
3.5 Accessing Translations

To actually access the translations, we propose a microcode ROM of available translations. The valid table contains a pointer to a ROM location for each instruction that has an available translation. For each instruction that is part of a translation, there is certain operand information that differs from regular instructions: namely, there must be a way to access the operands of the original instruction. This is accomplished by careful placement of the operands when they are first copied to the auxiliary register file. As long as they are moved to the same registers in the auxiliary file each time (for example, the two lowest registers), those registers can represent the operands of the original instruction in every translation, assuming we are careful not to overwrite them. Similarly, if the result is always placed in the same location in the auxiliary file, we can always pass back values from that register. To fetch instructions out of our translation ROM, we must maintain a pointer to the next instruction. Therefore, the auxiliary register file is associated with a program counter which is only relevant to the translation routine. It is treated just like a regular program counter, and any control flow instructions which are part of the translation modify this special program counter (except for special cases where the main program counter needs to be modified). In this regard, the translation routines have characteristics common to independent threads. As mentioned previously, the valid table contains a pointer to the first instruction of the translation. In reality, this pointer value is stored in the special PC and then fetch begins from the translation ROM instead of from the I-cache. From that point the special PC is incremented normally and instructions are fetched sequentially from the translation ROM.
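Putting the valid table, the special PC, and the end-of-translation sentinel together, the fetch-source selection can be modeled as a small loop. This is a toy model of our own (not cycle-accurate); the opcode names and the END_XLATE sentinel are illustrative:

```python
# A toy model of fetch-source selection: while a translation is active,
# instructions come from the translation ROM via the special PC; otherwise
# from the I-cache via the main PC.

def run(icache, rom, valid_table):
    executed = []
    main_pc = 0
    while main_pc < len(icache):
        inst = icache[main_pc]
        valid, rom_ptr = valid_table.get(inst, (True, None))
        if valid:
            executed.append(inst)            # normal path: straight to decode
        else:
            special_pc = rom_ptr             # valid table supplies the entry point
            while rom[special_pc] != "END_XLATE":
                executed.append(rom[special_pc])
                special_pc += 1              # special PC increments normally
        main_pc += 1                         # resume the original program
    return executed

# a faulty shifter: SLL detours to two ROM instructions
rom = ["ADDQ", "ADDQ", "END_XLATE"]
print(run(["SLL", "AND"], rom, {"SLL": (False, 0)}))  # ['ADDQ', 'ADDQ', 'AND']
```

The outside view is unchanged: the same main-PC sequence is consumed, with extra dynamic instructions interposed only for invalid opcodes.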
Allowing control flow within our translation routines is a critical feature in maintaining acceptable performance. A number of instructions can be synthesized in a variable number of simpler instructions. Consider a simple left shift: if the shift amount is one, it can be accomplished with a single add; if the shift amount is larger, the addition must be repeated several times, incurring higher latency. While a compiler could optimize for these cases when the shift amount is known statically, allowing control flow in the subroutine lets the hardware perform fewer iterations when the operand values are not known until runtime. A multiply instruction is another prototypical example: when the operands of a multiply are relatively small, it takes relatively few iterated shifts and adds to get the result, compared to larger operands. We theorize that these instructions often operate on operands with relatively few bits, allowing us to preserve performance by reducing the number of dynamic instructions executed in a translation routine.

To determine when a given translation is finished, we insert special instructions into the translation ROM which alert the translation stage to halt and resume taking instructions from the I-cache instead of the translation ROM. Since we have control over which instructions are in the translation, we can set aside an arbitrary instruction which is not used in any translation and use it as our "end translation" instruction. A simple hard-wired comparator determines when this instruction has been fetched; at that point the translation is complete and the results are passed back to the architected register file. Core execution can then continue as it would have had no translation taken place.

A special case of the PC is when a control flow instruction needs to be translated. In this case, a result of the translation needs to update the main PC rather than the architected register file. To accomplish this, there must be a bus between the main PC and the auxiliary register file. Since the main PC already has the potential to be updated from different busses, this is just a matter of updating the control logic governing the multiplexor that selects which bus updates the PC. When this occurs, the translation stage automatically resumes passing original program instructions from the I-cache at the intended location once the translation is finished.

Figure 3: Register Stack on Nested Translations

3.6 The Translation Stack

Another key feature of our translation stage is that it is easily extensible to multiple nested translations, as shown in Figure 3. Suppose, for instance, that both the multiplier and the shifter are found to be faulty, and consider a multiply instruction in the original program. The multiply instruction is broken down into sequences of shifts and adds; the shifter, though, is also faulty, so those shifts must themselves be translated into additions even though we are already executing translation code. We solve this problem with a hardware stack. Each stack frame consists of a register file and a PC, and each successive frame specifies another nested translation. The bottom frame of the stack represents the original program code being fetched from the I-cache with the actual PC and the architected register file. Each subsequent frame has a PC pointing into the translation ROM and its own register file holding the data used in that translation's calculations.
Interactions between any two stack frames are the same as the interactions described previously for a single translation. When a translation is initiated, the operands are passed into the next register file, and that register file is "activated." Instructions are then fetched from the next frame's PC until the end-translation instruction is reached, at which time the result register is passed down to the previous frame and control returns to the previous frame's PC. Since there are multiple frames, control lines must specify at any given time which register file is active, so that the execution stage fetches operands from the correct registers. Just as the outside viewer should not be aware that the core is not executing the instructions as written, any particular stack frame should have no connection with any frame that does not neighbor it. For instance, it should be impossible for a nested translation to pass any information to the original program; only the first translation should be able to do this. The original program should be completely unaware of any translation other than the one it called. Because each stack frame has its own register file and PC, it can effectively be thought of as a thread, where any thread below it in the stack cannot execute until the thread above has produced its output value. Obviously, the stack must have a finite size, because it is impossible to have arbitrarily many register files in hardware. Not only do they take up die area, but access latency increases with each possible frame, as there must be additional multiplexer lines for each one. We realize, though, that the maximum number of stack frames needed is actually relatively small.
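The bounded frame stack can be sketched in software as follows. This is our own toy model, assuming the three-frame limit our design adopts and the fixed operand/result slots from section 3.5; it is not cycle-accurate:

```python
# A toy model of the bounded translation stack: frame 0 is the original
# program; each push enters a nested translation with its own PC (into the
# translation ROM) and its own small register file.

MAX_FRAMES = 3

class TranslationStack:
    def __init__(self):
        self.frames = [{"pc": 0, "regs": [0] * 8}]  # frame 0: the program

    def push(self, rom_ptr, operands):
        """Enter a nested translation; the new PC points into the ROM."""
        if len(self.frames) == MAX_FRAMES:
            raise RuntimeError("no fault-free translation path: core shuts down")
        regs = [0] * 8
        regs[0], regs[1] = operands          # operands bused into fixed slots
        self.frames.append({"pc": rom_ptr, "regs": regs})

    def pop(self):
        """End-translation: copy the result down and resume the caller."""
        done = self.frames.pop()
        self.frames[-1]["regs"][2] = done["regs"][2]
```

The active frame is always the top of the stack, so the control lines selecting the active register file correspond directly to the stack depth, and a frame can only ever exchange values with its immediate neighbor.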
There are a finite number of components in a processor, and after two nested translations it becomes likely that we would be trying to translate onto hardware which has already been found faulty and whose translations call for other faulty hardware. In an arbitrarily large stack, this would create a loop of two translations calling each other indefinitely. Furthermore, after a certain number of nested translations, the performance impact would almost certainly become prohibitive. Therefore, we limit the stack to three frames: the first frame is the actual program, the second is any translation needed by the program, and the third is any translation needed by translation code. Beyond this point, it is unlikely that any remaining translation avoids faulty hardware. If another translation is called on a full stack, the processor cannot execute the program, and the core should shut down.

In terms of overall hardware cost, the largest additions are the valid table and the translation ROM; also included are the additional register files and the additional control hardware. In the end, the tables and ROM dwarf the rest of the additions in size.

4. Methods of Evaluation

To evaluate our idea, we first had to generate translations of instructions to determine the added latency incurred by each translation. We chose the Alpha ISA because it is a RISC architecture which is supported by simple core designs and has a readily available simulator (SimpleScalar). The translations we created are similar to those discussed in [10]. In developing the translations, we generally took the simplest available path. We also generally assume that functional units are independent, meaning, for example, that the addition and AND units do not overlap.
Some operations, such as multiplication, can be split into two halves, each executed on the same (functional) half of the multiplier, with the results subsequently reassembled. Here, however, we took complete detours, meaning we avoided the multiplier altogether. This is because we did not want to assume that our fault detection hardware could identify which bits of a component were faulty, only that it would inform us that a multiplication instruction was faulty. As shown in the example of Figure 1, we implemented instructions using instructions from the ISA itself, simplifying the binary translation. More example detours are shown in the appendix.

Once translations had been synthesized, we needed to make some assumptions about the microarchitecture and latencies in order to obtain the translation latencies in cycles. For simplicity, we assume that the ALU operations under consideration all take a single cycle, with the exception of multiplication, which we take to be a four-cycle operation. We also assume perfect branch prediction within our translations. Once we had obtained the cycle counts for each of the instruction detours, we were able to simulate the execution time for various benchmarks (anagram, go, gcc, compress95). The goal of the design is not to demonstrate an increase in IPC, as in many other design ideas, but to reasonably limit the performance penalty of running on faulty hardware. In a certain respect, we have infinite speedup over the baseline, since the IPC without fault tolerance is effectively zero: the outputs are incorrect, so any correct execution is a gain. We modeled several possible combinations of faulty hardware units.
First we simulated baseline performance with an extra cycle of branch mispredict penalty due to the extra pipe stage, followed by detouring each of the following types of instructions individually: multiply, sign-extension, byte-extraction, shifts, masks, (unconditional) jumps, branches, and adds/subtracts. We then simulated arbitrary groupings of instructions simultaneously needing to be detoured: three pairs (masks and byte-extraction; shifts and jumps; branches and jumps), two trios (shifts, byte-extraction, and sign-extension; multiply, masks, and jumps), and one group of five (multiply, masks, jumps, byte-extraction, and sign-extension). In light of the results, we also simulated the performance impact if the add/subtract subroutine required fewer instructions ("cheaper add"). Upon seeing preliminary data, we also concluded it would be useful to know how many times each instruction type was dynamically executed as a fraction of all dynamically executed instructions, so we modified SimpleScalar to include new counters for each classification of instruction considered.

Figure 4: Performance Results

5. Results

The performance results of our simulations can be found in Figure 4. Along the x-axis are the various faults injected into the system; along the y-axis is the speedup relative to a fully functioning processor with no modifications (baseline). Figures 5-8 contain the instruction breakdowns for each of the benchmarks used.

6. Discussion

Sign extensions and jumps caused very little slowdown. For the sign extensions, this is clearly because they are not used in the course of the benchmarks.
In terms of the jumps, it is likely because the detour is only two cycles which, coupled with jumps being a small percentage of the total program, did not result in much slowdown.

Much of the results are highly dependent on the benchmarks (as was also found by [10]): if a benchmark contains many instructions which require a faulty component, there will likely be a larger performance hit. In our results, different faults caused varying amounts of slowdown.

Figure 5: Anagram Instruction Breakdown
Figure 6: Go Instruction Breakdown
Figure 7: Gcc Instruction Breakdown

Surprisingly, the multiplier fault did not result in much slowdown at all, ranging from 0% to 3%. We had expected the multiply fault to be highly detrimental to performance due to common usage and its high-cost detour. From our data, we can draw two conclusions: one is that the benchmarks we chose did not contain many multiplications; the second is that the multiplications were small (that is, had small operands), which are cheaper to detour. Moving up a step in slowdown, we come to byte-extraction, masking, and branching. We expect the slowdown for the former two is due to roughly five cycles of detour latency each, yet they are still not used frequently enough to make a large dent in IPC. Branching, on the other hand, produced about 8% slowdown across all benchmarks despite only a two-cycle penalty. This is likely because branches occur frequently enough in all the benchmarks that their sheer number more than compensates for the small per-detour latency.
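The branch observation can be checked with back-of-the-envelope arithmetic. Below is our own illustrative model (the mix and cycle numbers are examples, not our measured values), treating baseline CPI as 1 and each detoured instruction class as costing its detour latency:

```python
# Estimate relative cycle cost from a dynamic instruction mix and per-class
# detour latencies (in cycles). Non-detoured classes cost the baseline 1 cycle.

def estimated_slowdown(mix, detour_cycles):
    """mix: {class: fraction of dynamic instructions};
    detour_cycles: {class: cycles} for classes using faulty units."""
    faulty_cpi = sum(frac * detour_cycles.get(cls, 1.0)
                     for cls, frac in mix.items())
    return faulty_cpi / 1.0  # baseline CPI assumed to be 1

# e.g. if branches are 14% of instructions and a branch detour costs 2 cycles:
mix = {"branch": 0.14, "other": 0.86}
print(estimated_slowdown(mix, {"branch": 2.0}))  # ~1.14x cycles
```

Even a cheap two-cycle detour on a frequent class inflates total cycles noticeably, while an expensive detour on a rare class (e.g., sign extension) barely registers.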
This leads to two instruction types which are not only used fairly commonly in our benchmarks but also have relatively long detour latencies. The first is shifts. Shifts, right shifts especially, have huge delays associated with their detours: right shifts require the use of the left-shift routine, so a right shift pays the left-shift latency each time it executes. Further, a right shift actually had lower latency for large operands than for small ones. We believe both of these issues are particular to our synthesized translations; a different ISA might allow a more compact detour.

Figure 8: Compress95 Instruction Breakdown

Since right shift is a very common operation, it makes sense that the slowdown when the shifter was faulty reached as high as 57% in the case of gcc. Anagram and compress95 likely use few shifts, given their relatively minor slowdowns. The most detrimental single fault was to the adder. This is likely because addition is the most common operation, compounded by the extremely high latency of its synthesis routine. Performance loss was above 90% on all benchmarks, even after supposing a significantly lower latency for the subroutine ("cheaper add"). This suggests that adds/subtracts are simply so common that any penalty becomes prohibitive. We also simulated a variety of combined faults. As expected, in the first three results, which pair only two faults, the slowdown was not much worse than with each fault individually. The most interesting result comes from the shift + byte-extraction + sign-extension test, where IPC plummeted for three of the four benchmarks. This is likely because the byte-extraction and sign-extension detours both require shifts to complete.
Therefore, one not only loses performance when there are shifts to be done, but every time one of those other operations arrives, the shift penalty is paid as well. It makes sense that anagram had the largest fall from the shift fault to the shift + byte-extraction fault, because its byte-extraction slowdown implies that it contains more of that type of operation than the others. In the end, we find it interesting that the shifter plays the most crucial role in determining how fast a faulty processor can run: not only is the shifter detour long, but many other detours depend on the shifter working properly.

7. Conclusion

In an era where the simple single-wide core is making a comeback in embedded systems and multi-core processors, building a system that can tolerate faults in cores which have little inherent redundancy will become essential. We feel that we have created a scheme that would not only allow cores to continue to operate under a multitude of hard faults, but to do so while maintaining the illusion of normal operation. Our system places no burden on the software (e.g., code recompilation) because everything occurs at runtime. The hardware additions are reasonably small, apart from the extra independent stage in the pipeline.

8. Future Work

In the future, we would like to expand our scheme to handle faults not considered in this paper, such as PC logic faults and memory errors. We would like to develop a working prototype of a core using our scheme which can operate under a variety of faults while still maintaining independence and correctness. We would also like to interface our design with a fault-detection system such as Argus, which would provide a fully fault-tolerant core: the ability to detect faults, and to fix them. Future work should also explore the impact of the ISA on the efficiency of translation routines. It is possible that some results are particular to the Alpha ISA. Would all RISC architectures yield similar results? Would translation routines in a CISC architecture be more efficient? Would the best results lie somewhere in the middle, or does the ISA not matter at all? These are questions worth exploring.

Acknowledgements

We would like to thank Dan Sorin for his guidance and inspiration in the preparation of this paper.

References

[1] P. Shivakumar et al. Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic. In Proceedings of the 2002 International Conference on Dependable Systems and Networks.
[2] E. Rotenberg. AR-SMT: A Microarchitectural Approach to Fault Tolerance.
[3] M. Cirinei et al. A Flexible Scheme for Scheduling Fault-Tolerant Real-Time Tasks on Multiprocessors.
[4] D. J. Sorin et al. Argus: Low-Cost, Comprehensive Error Detection in Simple Cores. In Proceedings of the 40th Annual International Symposium on Microarchitecture, Dec. 2007.
[5] T. M. Austin. DIVA: A Dynamic Approach to Microprocessor Verification. Journal of Instruction-Level Parallelism, 2, May 2000.
[6] A. Klaiber. The Technology Behind Crusoe Processors. Jan. 2000.
[7] G. Hinton et al. The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal, Q1 2001.
[8] P. Shivakumar et al. Fault Aware Instruction Placement for Static Architectures.
[9] A. Li, B. Hong. A Low-Cost Correction Algorithm for Transient Data Errors. Ubiquity, Vol. 7, Issue 22.
[10] A. Meixner, D. J. Sorin. Detouring: Translating Software to Circumvent Hard Faults in Simple Cores. In Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Jun. 2008.
[11] M. L. Corliss et al. DISE: A Programmable Macro Engine for Customizing Applications.
Appendix: Example Translations

MULL:
    ZAP   Ra, 0xF0, TempReg1
    ZAP   Rb, 0xF0, TempReg2
    ADDQ  R31, R31, Rc
    ADDL  R31, TempReg1, TempReg3
    ADDL  R31, TempReg2, TempReg4
    XOR   TempReg3, TempReg4, TempReg4
    ZAP   TempReg4, 0x0F, TempReg4
LOOP:
    AND   TempReg2, 1, TempReg3
    SUBQ  R31, TempReg3, TempReg3
    AND   TempReg1, TempReg3, TempReg3
    ADDQ  Rc, TempReg3, Rc
    SLL   TempReg1, 1, TempReg1
    SRL   TempReg2, 1, TempReg2
    BNE   TempReg2, LOOP
    OR    Rc, TempReg4, Rc

ADDQ:
    OR    R31, Ra, TempReg1
    OR    R31, Rb, TempReg2
    OR    R31, 1, TempReg3
    OR    R31, R31, TempReg4
    OR    R31, R31, Rc
Loop:
    AND   TempReg1, TempReg3, TempReg5
    AND   TempReg2, TempReg3, TempReg6
    XOR   TempReg5, TempReg4, TempReg7
    XOR   TempReg7, TempReg6, TempReg7
    OR    TempReg7, Rc, Rc
    AND   TempReg5, TempReg4, TempReg7
    AND   TempReg6, TempReg4, TempReg8
    OR    TempReg7, TempReg8, TempReg4
    AND   TempReg5, TempReg6, TempReg7
    OR    TempReg7, TempReg4, TempReg4
    SLL   TempReg4, 1, TempReg4
    OR    TempReg3, TempReg1, TempReg1
    OR    TempReg3, TempReg2, TempReg2
    XOR   TempReg3, TempReg1, TempReg1
    XOR   TempReg3, TempReg2, TempReg2
    OR    TempReg1, TempReg2, TempReg5
    OR    TempReg5, TempReg4, TempReg5
    SLL   TempReg3, 1, TempReg3
    BNE   TempReg5, Loop

BEQ:
    ADDQ    GPC, disp, TempReg1
    CMOVEQ  Ra, TempReg1, GPC

SLL:
    ADDQ  R31, Rb, TempReg1
    ADDQ  R31, Ra, Rc
LOOP:
    ADDQ  Rc, Rc, Rc
    SUBQI TempReg1, 1, TempReg1
    BNE   TempReg1, LOOP

JMP:
    ADDQ  GPC, 4, Ra
    ADDQ  R31, Rb, GPC
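As a sanity check on the appendix routines, their core loops can be modeled in software. The following is our own simplified Python rendering (unsigned 64-bit values, not cycle-accurate, and with BEQ's displacement semantics simplified); it captures the carry loop of the ADDQ routine, the data-dependent shift-and-add loop of the MULL routine, and the conditional PC update of the BEQ routine.

```python
# Simplified software models of three appendix translation routines.

MASK64 = (1 << 64) - 1

def addq_detour(a, b):
    """ADDQ without an adder: XOR gives the carry-less sum, AND the carries;
    iterate until the carries die out (the Loop in the ADDQ routine)."""
    while b != 0:
        carry = ((a & b) << 1) & MASK64
        a = a ^ b
        b = carry
    return a

def mull_detour(a, b):
    """MULL without a multiplier: shift-and-add with a data-dependent
    iteration count; small operands finish in few iterations."""
    result = 0
    while b != 0:
        if b & 1:
            result = (result + a) & MASK64
        a = (a << 1) & MASK64   # these shifts may themselves need detouring
        b >>= 1                 # if the shifter is also faulty (section 3.6)
    return result

def beq_detour(gpc, ra, disp):
    """BEQ: compute the taken target, then conditionally move it into the
    main PC (the CMOVEQ in the BEQ routine)."""
    temp = gpc + disp
    return temp if ra == 0 else gpc

assert addq_detour(12, 9) == 21
assert mull_detour(7, 5) == 35
assert beq_detour(100, 0, 16) == 116 and beq_detour(100, 3, 16) == 100
```

The iteration counts of these loops are exactly what gives the detours their operand-dependent latencies discussed in sections 3.5 and 6.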