Dynamically Translating Instructions to
Cope with Faulty Hardware
Kevin Kauffman
Jeremy Walch
Kevin.Kauffman@duke.edu
Jeremy.Walch@duke.edu
Department of Electrical and Computer Engineering
Pratt School of Engineering
Duke University
December 2009
Abstract
In any computing application, uptime is
of utmost importance. Downtime on a machine
will most likely result in a loss of money.
Hardware faults are a common occurrence when
working with technology at the nanometer scale,
and their frequency will only increase as
processes become smaller. Even with millions or
billions of transistors on a chip, a single error
can cripple a processor and make its output
useless. We propose a system by which faulty
processing cores can be made usable by
translating instructions on the fly to route the
datapath around the faulty hardware. Our goal
was to make a system by which the fault-tolerant
core would appear identical to the outside while
being able to cope with a variety of faults, where
the only outside sign of a fault is longer
execution time. We accomplish this by adding a
stage to the pipeline between fetch and decode
which pulls translations out of a dedicated ROM
when a fetched instruction would use known
faulty hardware.
1. Introduction
With transistors at the nanometer scale,
faults in hardware are becoming ever more
common; the soft error rate in logic at 50nm is
expected to be around 100 FIT – approximately
nine orders of magnitude higher than at 600nm!
[1] In a standard non-fault-tolerant processor,
the processor is “dumb” in that it will continue to
pump instructions through faulty components,
thus generating garbage results. The goal of a
fault-tolerant processor is to be able to generate
correct results even in the presence of faulty
hardware. In the case of a wide processor,
hardware faults are of little concern because
there is natural component redundancy built into
the design of a superscalar processor. Similarly,
multi-core and multi-threaded processors can
easily be leveraged for fault-tolerance by
running the same instructions on multiple
hardware resources [2][3]. A more interesting
challenge is being able to cope with faulty
hardware in a simple one-wide core, where
redundant hardware is not available.
Before one can build a system which
copes with faulty hardware, there must be a
mechanism in place to determine whether
hardware is faulty or not. Systems which
accomplish this at relatively low hardware cost
have been devised previously [4][5]. In
particular, [4] is shown to be phenomenally
effective at detecting faults in simple cores, and
thus for the remainder of this paper, we will
consider the fault detector to be ideal (i.e.
detects all faults).
Instruction translation is simply the case
in which the code executing in the core is not the
same code which is being sent to the processor
to be executed. There are numerous instances
of architectures which take advantage of
instruction translation for various purposes.
Transmeta used instruction translation that
effectively served as a VLIW hardware compiler,
which meant that the code executing on the
hardware was already optimized, translated
code [6]. Intel uses dynamic translation when it
translates incoming cumbersome x86
are natively executed on the hardware [7]. In
both cases, the translation happens at runtime
and is invisible to the outside. To the external
viewer, execution looks identical regardless of
the fact that the instructions are translated to
something different inside the processor core.
In terms of fault-tolerant computing,
dynamic instruction translation is a natural
solution. Although it is possible to expose known
defects through a static architecture to a
compiler [8], it is ideal that a faulty processor
should appear to operate identically to a non-faulty processor. One approach is to modify
code such that it can verify data integrity via a
clever software algorithm, as has previously
been proposed [9]. However, this incurs the
overhead of re-compiling code. Translations that
occur at runtime circumvent this penalty.
Furthermore, in many cases the existence of a
fault may not be known until runtime, suggesting
a static solution may not be practical. One
difference between our translations for fault-tolerance and Intel’s micro-ops is that
fault-tolerant translations occur conditionally.
Intel translates every instruction after it is
fetched. In the scheme we present, only
instructions which are limited by faulty hardware
are translated.
The rest of this paper is organized as
follows: In section 2, we discuss previous work
on detouring. In section 3, we discuss our
implementation for dynamic translations. In
section 4, we discuss our methods for
evaluation. In section 5, we present the results.
In section 6, we discuss the results. In section 7,
we conclude our work, and in section 8 we
discuss future directions.
2. Detouring
Detouring, as proposed by [10], is the
basic idea of rewriting instructions to use
different hardware than originally intended, so
that the datapath “detours” around the hard
faults. The goal is to come up with cheap
(in terms of cycles) translations of the
instructions which can simulate the original
instructions while not using known faulty
hardware. An example is synthesizing a left shift
using additions as shown in Figure 1. The
scheme also leverages detailed knowledge
about the faults. For example, if a 32x32
multiplier is known to be faulting only in the 37th
bit of the result, then it could still be used as a
16x16 multiplier without any faults. We do not
consider the case of partially functional
arithmetic units, though the techniques used
could also be applied to our scheme. In either
case, the software must be translated such that
the originally written code is not executed on the
machine (assuming the core has some non-zero
number of faults). As presented in detouring,
when faults are found in the core, the code must
be recompiled with the correct translations of
instructions.
Figure 1: Example detour
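The Figure 1 idea can be sketched in software. The following hypothetical Python model (the function name, width handling, and assertion are ours, not the paper's) synthesizes a left shift purely from additions, exploiting the fact that x + x doubles x:

```python
# Hypothetical software model of a Figure 1-style detour: a left shift
# synthesized from additions, for a core whose shifter is faulty.
def detour_sll(value: int, amount: int, width: int = 64) -> int:
    """Left-shift `value` by `amount` using only additions."""
    mask = (1 << width) - 1
    result = value & mask
    for _ in range(amount):
        # x + x == x << 1, so one addition doubles the value.
        result = (result + result) & mask
    return result

assert detour_sll(3, 4) == 3 << 4
```

The cost is one addition per shifted bit position, which is exactly the kind of cheap-but-slower tradeoff detouring accepts.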
We build upon this idea by developing a
separate scheme by which instructions are
dynamically translated into a different set of
instructions by the hardware at runtime if a fault
is detected, in a manner similar to the ACFs
proposed in DISE [11]. This method has
advantages over detouring because the same
binaries can be input to both faulty and non-faulty cores and produce identical outputs,
except for the extra latency which is incurred
during translation in the faulty core. This means
that the system is not limited to multi-core
processors or to systems which are capable of
recompiling software. It also means that the
processor does not need to incur the overhead
of recompiling code once a fault is detected - the
only penalty incurred by faulty hardware is the
time it takes to simulate the instruction(s) which
require faulty hardware.
3. Dynamic Translation
3.1 Assumptions
Before continuing on to our hardware
implementations for dynamic translations, we
must make some assumptions about the types of
cores we are dealing with. The cores we
consider here are strictly small, simple cores
implementing a RISC architecture. The core
executes in order, and is one-wide with a
shallow pipeline (e.g., a prototypical 5-stage
pipeline). The reason we limit the scope of this
paper is due to the nature of complex cores. As
processors rise in complexity, they inevitably
trend toward superscalar. At that point,
instruction translation becomes a poor solution
due to the natural redundancy built into the
system. If a multiplier goes down, there is no
reason to take a performance hit synthesizing
multiplies out of shifts and adds when the same
task can be completed much faster by waiting
for another available multiplier. Beyond these
practical considerations, our scheme should be
implementable on any microarchitecture for
which there can be a stage inserted in between
fetch and decode.
3.2 The Translation Stage
In order to effectively translate
instructions on the fly, we deemed it necessary
to add an extra pipeline stage between the fetch
and decode stages. Instructions being fetched in
program order are fed into this stage. The
purpose of this additional stage is to examine
each individual instruction, determine if it is
affected by faulty hardware and then either pass
it off to the decode stage, or feed the known
translation back into itself (stalling fetch if
necessary). When a faulty instruction is
received, the stage must pass through each
instruction in the detour before proceeding to the
next instruction of the original program.
3.3 The Valid Table
A key hardware structure which is
resident in this translation stage is the valid
table. It consists of a table which is indexed by
opcodes and each entry contains a valid bit and
the location of the translation routine, as shown
in Figure 2. The valid bit signifies whether that
instruction can be executed (whether or not it
would use faulty hardware). The valid bits are
set by the error detection scheme, however it
might be implemented. When an instruction is
fetched, the first step is to look up whether it is
valid in the valid table. If the valid bit is high,
then the instruction is simply passed along
normally to the next stage. If the valid bit is low,
then it means that the original instruction cannot
be executed on the currently operable hardware.
Figure 2: Valid Table
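A minimal software model of this lookup, assuming the structure just described (the opcodes, ROM addresses, and function names below are illustrative, not the paper's):

```python
# Sketch of the valid table (Figure 2): one entry per opcode holding a
# valid bit and a pointer into the translation ROM. All values invented.
VALID_TABLE = {
    # opcode: (valid_bit, translation_rom_address)
    "ADDQ": (True,  None),   # adder healthy: pass through
    "SLL":  (False, 0x40),   # shifter faulty: detour routine at ROM 0x40
    "MULL": (False, 0x80),   # multiplier faulty
}

def translation_stage(opcode: str):
    """Return ('pass', None) or ('translate', rom_address) for a fetched opcode."""
    valid, rom_addr = VALID_TABLE.get(opcode, (True, None))
    if valid:
        return ("pass", None)          # hand the instruction to decode unchanged
    return ("translate", rom_addr)     # begin fetching the detour from the ROM
```

The error-detection scheme would clear valid bits in this table as faults are discovered, with no change to the lookup logic itself.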
3.4 Auxiliary Registers
Each translation is, of course, composed
of a set of instructions whose eventual output
matches the output of the original instruction,
had it been able to execute
properly. Since each intermediate instruction
must store its result somewhere which does not
interfere with the execution of the rest of the
program, we propose adding a small separate
register file, perhaps only eight registers, that
holds the intermediate operands and is usable
only by the translated instructions. This
is a small hardware price to pay to ensure that
no critical registers in the architected register file
are overwritten. Another alternative might have
been to save the registers to memory and then
load the values back out at the end of a
translated routine, which adds extra latency to
the completion of every translated instruction.
While the former alternative does have an
associated hardware cost, we believe it is
preferable to the potential performance penalty
of the second option. Therefore, when the
processor is in “detour mode” as opposed to
normal operation, the register codes will be
directed to the auxiliary register file instead of
the normal file. Special registers such as the
zero register may overlap between the two.
Another area of overlap is that a translated
subroutine will need to use operands from the
architected register file rather than the auxiliary
one. In order to accomplish this, at the time
when the translation stage initiates a translation,
the register names of the original operands must
be passed to the subroutine. The operands can
be passed into the auxiliary file via a bus. This
necessitates that the original operand values be
bypassed in to ensure that the subroutine gets
the correct values. The result is copied back
from the auxiliary register file to the original
location in the architected register file via the
same mechanism.
3.5 Accessing Translations
In terms of actually accessing the
translations, we propose a microcode ROM of
available translations. The valid table contains a
pointer to a ROM location for each instruction
that has an available translation. For each
instruction that is part of the translation, there is
certain information on operands that differs from
regular instructions. Namely, there must be a
way to access the operands of the original
instruction. This is accomplished by careful
placement of the operands when they are first
copied to the auxiliary register file. As long as they
are moved to the same registers in the auxiliary
file each time (for example, the two lowest
registers) then we can use those registers in the
translation to represent the operands of the
original instruction in all cases, assuming we are
careful not to overwrite them. Similarly, if we
always place the result in the same location in
the auxiliary file, we can always pass back
values from that register.
In order to effectively fetch instructions
out of our translation ROM, we must maintain a
pointer to the next instruction. Therefore, the
auxiliary register file is associated with a
program counter which is only relevant to the
translation routine. It is treated just like a regular
program counter, and any control flow
instructions which are part of the translation will
instead modify this special program counter
(except for special cases where the main
program counter needs to be modified). In this
regard, the translation routines have
characteristics common to independent threads.
As mentioned previously, the valid table
contains a pointer to the first instruction of the
translation. What happens in reality is that this
pointer value is stored in the special PC and
then fetch begins from the translation ROM
instead of from the I-cache. At this point the
special PC is incremented normally and
instructions are fetched sequentially from the
translation ROM.
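The fetch behavior described above can be sketched as follows. This is a hedged software model with invented instruction names, an assumed end-of-translation marker, and a dictionary standing in for the ROM:

```python
# Sketch of Section 3.5's fetch logic: a special PC walks the translation
# ROM until an "end translation" marker is reached, then fetch resumes
# from the I-cache. Encodings and the marker name are our assumptions.
END_TRANSLATION = "XLATE_END"

def fetch_stream(program, rom, valid_table):
    """Yield the dynamic instruction stream, expanding invalid opcodes."""
    for instr in program:
        valid, rom_pc = valid_table.get(instr, (True, None))
        if valid:
            yield instr
        else:
            # Fetch sequentially from the ROM via the special PC.
            while rom[rom_pc] != END_TRANSLATION:
                yield rom[rom_pc]
                rom_pc += 1

rom = {0: "ADDQ", 1: "ADDQ", 2: END_TRANSLATION}
vt = {"SLL": (False, 0)}
assert list(fetch_stream(["SLL", "SUBQ"], rom, vt)) == ["ADDQ", "ADDQ", "SUBQ"]
```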
Allowing control flow within our
translation routines is a critical feature in
maintaining acceptable performance. A number
of instructions can be synthesized in a variable
number of simpler instructions. Consider the
case of a simple left shift. If the shift amount is
only one, this can be accomplished with a
simple add. If the shift amount is larger, then
the addition would need to be repeated several
times, incurring higher latency. While a compiler
could optimize for these cases when the shift
amount is known statically, allowing control flow
in the subroutine allows the hardware to
potentially perform fewer iterations when the
operand values are not known until runtime. A
multiply instruction is also a prototypical
example. When the operands of a multiply are
small, it takes relatively few iterated shifts and
adds to get the result, as compared to larger
operands. We
theorize that often these instructions perform
operations on operands with relatively few bits,
therefore allowing us to preserve performance
by reducing the number of dynamic instructions
executed in a translation routine.
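To illustrate why data-dependent iteration helps, consider this hypothetical shift-and-add multiply detour; the loop exits as soon as the remaining multiplier bits are zero, so small operands finish in few iterations. The code is our own sketch, not the actual ROM routine:

```python
# Illustrative model of a data-dependent detour: a multiply synthesized
# from shifts and adds, where the dynamic iteration count depends on the
# operand value at runtime. Names and structure are ours.
def detour_mul(a: int, b: int, width: int = 64):
    """Return (product, iterations) of a shift-and-add multiply."""
    mask = (1 << width) - 1
    product, iterations = 0, 0
    while b != 0:                     # loop exits early for small operands
        if b & 1:
            product = (product + a) & mask
        a = (a << 1) & mask
        b >>= 1
        iterations += 1
    return product, iterations

assert detour_mul(7, 3)[0] == 21
assert detour_mul(7, 3)[1] < detour_mul(7, 3_000_000)[1]  # small b: fewer iterations
```

A static compiler could only unroll this for shift or multiply amounts known at compile time; hardware control flow gets the short path even for runtime values.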
In order to determine when a given
translation is finished, we must insert special
instructions into the translation ROM which alert
the translation stage to halt and resume taking
instructions from the I-cache rather than the
translation ROM. Since we have control over
which instructions are in the translations, we can
set aside an arbitrary instruction which is not
used in any translation and use it as our “end
translation” instruction. A simple hard-wired
comparator is used to determine that this
instruction has been fetched; at that point the
translation is completed and the results are
passed back to the architected register file. The
translation stage then automatically resumes
passing original program instructions from the
I-cache at the intended location, and core
execution commences as it would have had no
translation taken place.
A special case of the PC is when a
control flow instruction needs to be translated. In
this case, a result of the translation needs to
update the main PC rather than the architected
register file. To accomplish this, there must be a
bus between the main PC and the auxiliary
register file. Since the main PC normally has the
potential to be updated from different busses,
this is just a matter of updating the control logic
governing the multiplexor that selects which bus
updates the PC.
Figure 3: Register Stack on Nested Translations
3.6 The Translation Stack
Another key feature of the design of our
translation stage is that it is easily extensible to
multiple nested translations, as shown in Figure
3. Suppose, for instance, that both the multiplier
and the shifter are found to be faulty. Consider a
multiply instruction in the original program. The
multiply instruction is then broken down into
sequences of shifts and adds. The shifter,
though, is also faulty, so those shifts must be
able to be translated into additions even though
we are already executing translation code. We
solve this problem with the use of a hardware
stack. Each stack frame consists of a register
file and a PC, and each successive frame
specifies another nested translation. Thus, the
bottom frame of the stack represents the original
program code being fetched from the I-cache
with the actual PC and the architected register
file. Each subsequent frame has a PC pointing
to the instruction in the translation ROM and its
own register file representing the data being
used in the calculations for that translation.
Interactions between any two stack frames are
the same as the interactions described
previously for a single translation. When a
translation is initiated, the operands are passed
into the next register file, and that register file is
“activated.” Instructions are then fetched from
the next frame’s PC until the end translation
instruction is reached, at which time the result
register is passed down to the previous frame,
and control is returned to the previous frame’s
PC. Since there are multiple frames, at any
given time, it is necessary to have control lines
specifying which register file is active so that the
execution stage fetches operands from the
correct registers. Much as the outside viewer
should not be aware that the core is not
executing the instructions as expected, any
particular stack frame should have no
connection with any frame that does not
neighbor it. For instance, it should be impossible
for a nested translation to be able to pass any
information to the original program - only the first
translation should be able to do this. The original
program should be completely unaware of any
translation that is not the one which it called.
Because each stack frame has its own register
file and PC, it can be effectively thought of as a
thread, where any thread under it in the stack
cannot execute until the top thread has
produced its output value.
Obviously, the stack must have a finite
size because it is impossible to have arbitrarily
many register files present in hardware. Not only
does such a stack take up die area, but the
access latency increases with each possible
frame, as there must be additional multiplexer
lines for each one. We realize, though, that the
maximum number of stack frames is actually
relatively small. There are a finite number of
components to a processor, and after two
nested translations, it becomes likely that we will
be trying to translate onto hardware which has
already been found faulty and whose
translations call for other faulty hardware to be
used. In an arbitrarily large stack, this would
create a loop of two translations calling each
other over and over. Furthermore, after a certain
number of nested translations, the performance
impact would almost certainly become
prohibitive. Therefore, we limit the stack to three
frames: the first frame is the actual program, the
second is any translation needed by the
program, and the third is any translation needed
by translation code. After this point, it becomes
unlikely that there are any other translations
which can be called which don’t use hardware
that is faulty. If another translation is called on a
full stack, the processor will not be able to
execute the program, and the core should shut
down.
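The stack discipline above can be sketched in software. This is a simplified model under our own assumptions; the frame layout, register-file sizes, and the fixed result-passing register are illustrative:

```python
# Software sketch of the Section 3.6 translation stack: each frame pairs
# a PC with a register file, and depth is capped at three frames.
MAX_FRAMES = 3

class TranslationStack:
    def __init__(self, main_pc):
        # Bottom frame: the real PC and the architected register file.
        self.frames = [{"pc": main_pc, "regs": [0] * 32}]

    def push(self, rom_pc):
        """Enter a nested translation; a full stack means the core must shut down."""
        if len(self.frames) >= MAX_FRAMES:
            raise RuntimeError("translation stack overflow: core must shut down")
        # New frame: small auxiliary register file, PC into the translation ROM.
        self.frames.append({"pc": rom_pc, "regs": [0] * 8})

    def pop(self, result):
        """End a translation, passing the result down to the caller's frame only."""
        self.frames.pop()
        self.frames[-1]["regs"][0] = result  # fixed result-passing register

stack = TranslationStack(main_pc=0x1000)
stack.push(rom_pc=0x40)   # a multiply detoured to shifts and adds
stack.push(rom_pc=0x80)   # a shift inside that detour, itself detoured to adds
```

Note that `pop` only ever touches the immediately enclosing frame, matching the requirement that non-neighboring frames cannot communicate.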
In terms of overall hardware cost, the
largest additions are the valid table and the
translation ROM. Also included are the
additional register files and the additional
hardware control. In the end, the tables and
ROM dwarf the rest of the additions in terms of
size.
4. Methods of Evaluation
In order to evaluate our idea, we first
had to generate translations of instructions to
determine the added latency incurred for each
translation. We chose to use the Alpha ISA
because it is a RISC architecture which is
supported by simple core designs and for which
there is a readily available simulator
(SimpleScalar). The translations we created are
similar to the ones discussed in [10]. In
developing the translations, we generally took
the simplest available path. We assume that
functional units are independent, meaning, for
example, that an addition and an AND operation
do not overlap. Some operations such
as multiplication can be easily split into two
halves, each executed on the same half of the
multiplier and subsequently reassembled. Here
we took complete detours, meaning we did not
use the multiplier at all. This is because we
did not want to make the assumption that our
fault detection hardware could identify which bits
of a component were faulty, just that it would
inform us that a multiplication instruction was
faulty. As shown in the example of Figure 1, we
implemented instructions using instructions from
the ISA itself, simplifying the binary translation.
More example detours are shown in the
appendix.
Once translations had been
synthesized, we needed to make some
assumptions about the microarchitecture and
latencies in order to obtain the translation
latencies in terms of cycles. For simplicity, we
assume that the ALU operations under
consideration all take a single cycle, with the
exception of multiplication, which we take to be
a four-cycle operation. We also assume that
there is perfect branch prediction within our
translations.
Once we had obtained the cycle counts
for each of the instruction detours, we were able
to effectively simulate the execution time for
various benchmarks (anagram, go, gcc, and
compress95). The goal of the design here was
not to demonstrate an increase in IPC as in
many other design ideas, but to reasonably limit
the performance penalty associated with running
on faulty hardware. In a certain respect, we have
infinite speedup over the baseline, since the IPC
without fault-tolerance is 0, as the outputs are
incorrect. Therefore any correct execution is a
gain.
We modeled several possible
combinations of faulty hardware units. First we
simulated baseline performance with an extra
cycle of branch mis-predict penalty due to the
extra pipe stage, followed by detouring each of
the following types of instructions individually:
multiply, sign-extension, byte-extraction, shifts,
masks, (unconditional) jumps, branches, and
adds/subtracts. We then simulated the results of
arbitrary groupings of instructions
simultaneously needing to be detoured. We
consider three pairs (masks and byte-extraction;
shifts and jumps; branches and jumps), two trios
(shifts, byte-extraction, and sign-extension;
multiply, masks, and jumps), and one group of
five (multiply, masks, jumps, byte-extraction, and
sign-extension). In light of the results, we also
simulated the performance impact if the
add/subtract subroutine required fewer
instructions (“cheaper add”).
Upon seeing preliminary data, we also
concluded it would be useful to have data on
how many times a given instruction was
dynamically executed as compared to the total
number of dynamically executed instructions.
Thus we also modified SimpleScalar to include
several new counters for all the classifications of
instructions considered.
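As a rough illustration of how per-detour cycle counts and instruction mixes combine into a slowdown figure, here is a hypothetical model; the mixes and detour latencies below are invented for illustration, not the measured values:

```python
# Back-of-the-envelope slowdown model: given a benchmark's dynamic
# instruction mix and per-instruction detour latencies (in cycles),
# estimate the slowdown relative to a fault-free core. All numbers
# here are illustrative placeholders.
def slowdown(mix, detour_cycles, faulty):
    """mix: {instr: fraction}; detour_cycles: {instr: cycles}; faulty: set of instrs."""
    base = 1.0  # assume 1 cycle per instruction on the fault-free core
    extra = sum(mix[i] * (detour_cycles[i] - 1) for i in faulty)
    return (base + extra) / base

mix = {"add": 0.46, "branch": 0.16, "shift": 0.02}
detours = {"add": 20, "branch": 3, "shift": 10}
print(slowdown(mix, detours, faulty={"shift"}))  # rare instructions: modest penalty
```

The model makes plain why a cheap detour on a frequent instruction (adds, branches) can hurt more than an expensive detour on a rare one.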
Figure 4: Performance Results (speedup relative to baseline, 0.0-1.2, for Anagram, Go, Gcc, and Compress95)
5. Results
The results of our simulations detailing
performance can be found in Figure 4. Along the
x-axis are the various faults injected into the
system. Along the y-axis is the speedup relative
to a fully functioning processor with no
modifications (baseline). Figures 5-8 contain the
instruction breakdowns for each of the
benchmarks used.
6. Discussion
Much of the results are highly
dependent on the benchmarks (as was also
found by [10]). If a benchmark contains many
instructions which require the use of a
component which is faulty, there will likely be a
larger performance hit. In our results, there were
varying amounts of slowdown caused by the
various faults.
Sign extensions and jumps caused very
little slowdown. For the sign extensions, it is
clearly because they are not used in the course
of the benchmarks. In terms of the jumps, it is
likely because the jump detour is only two
cycles, which, coupled with jumps being a small
percentage of the total program, did not result in
much slowdown.
Figure 5: Anagram Instruction Breakdown (Add 44%, Other 35%, Mask 14%, ByteExtract 3%, Shift 2%, Branch 1%, Jump 1%, SignExtend 0%, Multiply 0%)
Figure 6: Go Instruction Breakdown (Add 59%, Other 28%, Branch 12%, Jump 1%, all others 0%)
Figure 7: Gcc Instruction Breakdown (Add 46%, Other 32%, Branch 16%, Jump 2%, Shift 2%, ByteExtract 2%, all others 0%)
Surprisingly, the multiplier did not result
in much slowdown at all, the range being from
0-3%. We had expected the multiply to be highly
detrimental to performance due to common
usage and high-cost detour. From our data, we
can draw two conclusions. One is that the
benchmarks we chose did not contain many
multiplications. The second is that they were small
multiplications (that is, small operands), which
have lower latency to detour.
Moving up a step in slowdown we come
to byte-extraction, masking, and branching. We
expect that there was some slowdown for the
former two due to about five cycles of latency for
detouring each, yet they are still not used
frequently enough to make a large dent in IPC.
Branching, on the other hand, produced about
8% slowdown across all benchmarks despite the
fact that it is only a two cycle penalty. This is
likely because branches are used frequently
enough in all the benchmarks that the large
number of branches more than compensates for
the small latency.
This leads into two instructions which
are not only used fairly commonly in our
benchmarks, but also have relatively long
latencies associated with their detours. The first
is shifts. Right shifts especially have huge
delays associated with their detours: a right shift
requires the use of the left-shift module, and
thus incurs the left-shift latency each
time it must be done. Further, a right shift
actually had a lower latency for large operands
than for small ones. We believe both of these
issues are particular to our synthesized
translations.
Figure 8: Compress95 Instruction Breakdown (Add 92%, Branch 6%, Other 2%, all others 0%)
A different ISA might allow for a
more compact detour. Since right shift is a very
common operation, it makes sense that the
slowdown when the shifter was out was as high
as 57% in the case of gcc. The relatively minor
slowdowns of anagram and compress95
suggest that they use few shifts.
The most detrimental single fault was to
the adder. This is likely due to the fact that it is
the most common operation, and compounded
by the fact that its synthesis routine has an
extremely high latency associated with it.
Performance loss was above 90% on all
benchmarks, even after supposing a significantly
lower latency for the subroutine. This suggests
that adds/subtracts are just so common that any
penalty becomes prohibitive.
We also simulated a variety of
combination faults. As expected, in the first three
results, which pair only two faults together,
the slowdown was not much worse than with
each fault individually. The most interesting
result comes from the shift + byte-extraction +
sign-extension test, where IPC plummeted for
three of the four benchmarks. This is likely
because the byte-extraction and sign-extension
detours both require shifts to complete.
Therefore, performance is lost not only when
there are shifts to be done, but also every time
one of those other operations arrives. It makes sense
that anagram had the largest fall from the shift to
the shift + byte-extraction faults because its
byte-extraction slowdown implies that it has
more of that type of operation than the others.
In the end, we find it interesting that the
shifter is what plays the most crucial role in
determining how fast a faulty processor can run,
because it seems that not only is the shifter
detour long, but many other detours depend on
the shifter to work properly.
7. Conclusion
In an era where the simple single-wide
core is making a comeback in embedded
systems and multi-core processors, building a
system that can tolerate faults in cores which
have little inherent redundancy will become
essential. We feel that we have created a
scheme that would not only allow cores to
continue to operate under a multitude of hard
faults, but to do so while maintaining the illusion
of normal operation. Our system places no
burden on the software (e.g., code
recompilation) because everything occurs at
runtime. The hardware additions are reasonably
small other than the addition of an extra
independent stage in the pipeline.
8. Future Work
In the future, we would like to expand our
scheme to work with other faults which are not
considered in this paper, such as PC logic faults
and memory errors. We would look to develop a
working prototype of a core using our scheme
which can operate under a variety of faults while
still maintaining independence and correctness.
We would also like to interface our design with a
fault-detection system such as Argus, which
could provide a fully fault-tolerant core: the
ability to detect faults, and to fix them.
Future work should also explore the impact
of the ISA on the efficiency of translation
routines. It is possible that some results are
particular to the Alpha ISA. Would all RISC
architectures yield similar results? Would
translation routines in a CISC architecture be
more efficient? Would the best results fall
somewhere in the middle, or does the ISA not
matter at all? These are questions worth
exploring.
Acknowledgements
We would like to thank Dan Sorin for his
guidance and inspiration in the preparation of
this paper.
References
[1] P. Shivakumar et al. Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic. In Proceedings of the 2002 International Conference on Dependable Systems and Networks.
[2] E. Rotenberg. AR-SMT: A Microarchitectural Approach to Fault Tolerance.
[3] M. Cirinei et al. A Flexible Scheme for Scheduling Fault-Tolerant Real-Time Tasks on Multiprocessors.
[4] D.J. Sorin et al. Argus: Low-Cost, Comprehensive Error Detection in Simple Cores. In Proceedings of the 40th Annual International Symposium on Microarchitecture, Dec. 2007.
[5] T.M. Austin. DIVA: A Dynamic Approach to Microprocessor Verification. Journal of Instruction-Level Parallelism, 2, May 2000.
[6] A. Klaiber. The Technology Behind Crusoe Processors. Jan. 2000.
[7] G. Hinton et al. The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal, Q1 2001.
[8] P. Shivakumar et al. Fault Aware Instruction Placement for Static Architectures.
[9] A. Li, B. Hong. A Low-Cost Correction Algorithm for Transient Data Errors. Ubiquity, Vol. 7, Issue 22.
[10] A. Meixner, D.J. Sorin. Detouring: Translating Software to Circumvent Hard Faults in Simple Cores. In Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Jun. 2008.
[11] M.L. Corliss et al. DISE: A Programmable Macro Engine for Customizing Applications.
Appendix
Example Translations
MULL:
ZAP Ra, 0xF0,TempReg1
ZAP Rb, 0xF0,TempReg2
ADDQ R31,R31,Rc
ADDL R31,TempReg1,TempReg3
ADDL R31,TempReg2,TempReg4
XOR TempReg3,TempReg4,TempReg4
ZAP TempReg4,0x0F,TempReg4
LOOP:
AND TempReg2,1, TempReg3
SUBQ R31,TempReg3,TempReg3
AND TempReg1,TempReg3,TempReg3
ADDQ Rc,TempReg3,Rc
SLL TempReg1,1,TempReg1
SRL TempReg2,1,TempReg2
BNE TempReg2,LOOP
OR Rc,TempReg4,Rc
ADDQ:
OR R31, Ra, TempReg1
OR R31, Rb, TempReg2
OR R31, 1, TempReg3
OR R31, R31, TempReg4
OR R31,R31, Rc
Loop:
AND TempReg1,TempReg3, TempReg5
AND TempReg2, TempReg3, TempReg6
XOR TempReg5, TempReg4, TempReg7
XOR TempReg7, TempReg6, TempReg7
OR TempReg7, Rc, Rc
AND TempReg5, TempReg4, TempReg7
AND TempReg6, TempReg4, TempReg8
OR TempReg7, TempReg8, TempReg4
AND TempReg5, TempReg6, TempReg7
OR TempReg7, TempReg4, TempReg4
SLL TempReg4, 1, TempReg4
OR TempReg3, TempReg1, TempReg1
OR TempReg3, TempReg2, TempReg2
XOR TempReg3, TempReg1, TempReg1
XOR TempReg3, TempReg2, TempReg2
OR TempReg1, TempReg2, TempReg5
OR TempReg5, TempReg4, TempReg5
SLL TempReg3, 1, TempReg3
BNE TempReg5, Loop
BEQ:
ADDQ GPC,disp,TempReg1
CMOVEQ Ra, TempReg1, GPC
SLL:
ADDQ R31,Rb,TempReg1
ADDQ R31,Ra,Rc
LOOP:
ADDQ Rc,Rc,Rc
SUBQI TempReg1,1,TempReg1
BNE TempReg1, LOOP
JMP:
ADDQ GPC,4,Ra
ADDQ R31,Rb,GPC
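The ADDQ detour above can be sanity-checked in software. The following Python port mirrors the listing's bit-serial carry computation; the word-width masking and variable names are our additions:

```python
# Python port of the ADDQ detour: bit-serial addition built only from
# AND/OR/XOR/shift, mirroring the TempReg usage in the Alpha listing.
def detour_addq(a: int, b: int, width: int = 64) -> int:
    mask = (1 << width) - 1
    t1, t2, t3, t4, rc = a & mask, b & mask, 1, 0, 0
    while True:
        t5, t6 = t1 & t3, t2 & t3          # current bit of each operand
        rc |= (t5 ^ t4) ^ t6               # sum bit: a ^ b ^ carry
        carry = (t5 & t4) | (t6 & t4) | (t5 & t6)
        t4 = (carry << 1) & mask           # propagate carry to the next bit
        t1 &= ~t3                          # clear the consumed bit
        t2 &= ~t3                          # (the OR-then-XOR pair in the listing)
        t3 = (t3 << 1) & mask
        if (t1 | t2 | t4) == 0:            # BNE TempReg5, Loop
            break
    return rc

assert detour_addq(5, 3) == 8
assert detour_addq(2**64 - 1, 1) == 0      # wraps like 64-bit ADDQ
```

As the loop exits once the remaining operand bits and carry are all zero, the iteration count tracks the position of the highest set bit, another example of a data-dependent detour latency.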