Final Paper
Prabhjot S. Saini
Digital Design 1
The Transmeta Crusoe Microprocessor
Advances in wireless networking technology have engendered a new paradigm of
computing, called mobile computing, in which users carrying portable devices have
access to a shared infrastructure independent of their physical location. This has led to
rapid growth in mobile computing and to an aggressive drive to optimize these devices
and squeeze out every bit of performance possible.
One of the main challenges faced by the industry was to provide low-power processing, so
as to extend battery life, without sacrificing performance. Standard applications still
needed to run, so there was a continuing need for sufficient processing capability
together with a longer battery run-time. The existing superscalar designs that delivered
the performance were expensive in terms of power, while new processor approaches were
hindered by the popularity of the x86 architecture: any new design had to be compatible
with the industry standard set by x86.
The objectives, thus, were to offer strong performance, low power consumption and x86
compatibility. Transmeta came up with a novel collection of solutions to address these
problems at both the hardware and the software level, and the rest of this paper
discusses these solutions and the thinking behind them.
Hardware Solution
The hardware designers created a very simple, high-performance VLIW (very long
instruction word) engine with two integer units, one floating-point unit, a memory
load/store unit and a branch unit. An instruction word in this architecture is either
64 bits or 128 bits long; it is known as a molecule and contains up to four RISC-like
instructions, called atoms. An important thing to note is that all the atoms of a
molecule are executed in parallel, and the format of the molecule determines how the
atoms get routed to the functional units. This greatly simplifies the decode and
dispatch hardware. To keep the processor running at full speed, molecules are packed
as fully as possible with atoms.
The integer register file has 64 registers, %r0 to %r63. The Code Morphing software
allocates some of these to hold x86 state, while others contain state internal to the
system or can be used as temporary registers, e.g., for register renaming in software.
The assembly code examples in this paper are written one molecule per line, with atoms
separated by semicolons. The destination register of an atom is specified first; a “.c”
opcode suffix designates an operation that sets the condition codes. Where a register
holds x86 state, we use the x86 name for that register (e.g., %eax instead of the
less descriptive %r0).
Because the x86 instruction set is so complex, decoding and dispatching it in hardware
requires a large number of transistors, all of which consume power. The Crusoe
microprocessor therefore uses software techniques in addition to the above-mentioned
hardware techniques to achieve its overall power savings.
Software Solution
Crusoe processors consist of a hardware engine logically surrounded by a software layer.
This layer is the Code Morphing software developed by Transmeta's engineers. It is
dynamic translation and instruction scheduling software that converts code from the
x86 architecture to the VLIW architecture. The software resides in a serial ROM and is
copied to RAM at startup for faster access, and it is the only software written for the
VLIW core. It is essentially an on-the-fly compiler/interpreter that breaks all x86
instructions (including those of the BIOS and OS) into atoms, then schedules them into
molecules just as a VLIW compiler would.
Code Morphing software includes a number of advanced features to achieve good
system-level performance. Code Morphing software support facilities are also built into
the underlying processor hardware. Code Morphing software is fundamentally a dynamic
translation system - a program that compiles instructions for one instruction set
architecture (in this case, the x86 target ISA) into instructions for another ISA (the VLIW
host ISA). Code Morphing software is the first program to start executing when the
processor boots. All x86 code sees only the x86 ISA that the Code Morphing software
supports. The only program written directly for the VLIW engine is the Code Morphing
software itself. The typical behavior of Code Morphing software is to execute a loop that
decodes and executes x86 instructions. The first few times a specific x86 code sequence
is executed, Code Morphing software interprets the code by decoding the instructions one
at a time and then dispatching execution to corresponding VLIW native instruction
subroutines. Once the x86 code has been executed several times, Code Morphing
software translates the x86 instructions into highly optimized and extremely fast native
VLIW instructions, executes the translated code, and caches the native instruction
translations for future use. If the same x86 code is required to execute again, the
high-performance cached translations are executed immediately and no re-translation is
required.
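The interpret-first, translate-when-hot behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name CodeMorpher, the HOT_THRESHOLD value and the interpret/translate callbacks are all hypothetical, and the real Code Morphing software is native VLIW code whose heuristics are not described in this paper.

```python
HOT_THRESHOLD = 3  # hypothetical: interpreted executions before a block is translated


class CodeMorpher:
    """Toy model of the interpret-then-translate loop (all names are illustrative)."""

    def __init__(self, interpret, translate):
        self.interpret = interpret      # callable(address): execute the block by interpretation
        self.translate = translate      # callable(address) -> callable(): build a native translation
        self.exec_counts = {}           # x86 block address -> number of executions seen
        self.translation_cache = {}     # x86 block address -> cached native translation

    def run_block(self, address):
        cached = self.translation_cache.get(address)
        if cached is not None:
            return cached()             # fast path: reuse the cached translation
        count = self.exec_counts.get(address, 0) + 1
        self.exec_counts[address] = count
        if count > HOT_THRESHOLD:
            native = self.translate(address)
            self.translation_cache[address] = native
            return native()
        return self.interpret(address)  # cold code is interpreted


log = []


def interpret(address):
    log.append("interpret")


def translate(address):
    log.append("translate")
    return lambda: log.append("native")


morpher = CodeMorpher(interpret, translate)
for _ in range(6):
    morpher.run_block(0x1000)
```

In this run the block is interpreted three times, translated once on the fourth execution, and served from the cache after that, so the translation cost is paid only once.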
The flexibility of the software translation approach comes at a price - the processor has to
dedicate some of its operating cycles to running the Code Morphing software. These
extra operating cycles are cycles that a conventional x86 processor could use to execute
application code. To deliver good overall system performance, Code Morphing software
has been carefully designed for maximum efficiency and low overhead. Application code
developed for use on the Crusoe processor can also benefit from a few simple guidelines
that likewise improve code execution efficiency and minimize Code Morphing software
overhead.
Making a Translation
The following simple example shows how the Code Morphing system translates a chunk of
x86 code into equivalent code for the Crusoe processor’s VLIW engine. Assume that the
filtering and path selection algorithms have chosen the following four x86 instructions,
(A) through (D), for translation.
A. addl %eax,(%esp)    // load data from stack, add to %eax
B. addl %ebx,(%esp)    // ditto, for %ebx
C. movl %esi,(%ebp)    // load %esi from memory
D. subl %ecx,5         // subtract 5 from %ecx register
In a first pass, the frontend of the translation system decodes the x86 instructions and
translates them into a simple sequence of atoms. At this stage, it is still fairly easy to
discern the correspondence between the original and generated code. (Registers %r30
and %r31 are used as temporaries for the memory-load operations.)
ld %r30,[%esp]          // load from stack, into temporary
add.c %eax,%eax,%r30    // add to %eax, set condition codes.
ld %r31,[%esp]
add.c %ebx,%ebx,%r31
ld %esi,[%ebp]
sub.c %ecx,%ecx,5
In a second pass, the optimizer applies well-known compiler optimizations to the code,
such as common subexpression elimination, loop invariant removal, or dead code
elimination (including unnecessary settings of the condition codes). This exemplifies
optimizations that a hardware-only x86 implementation cannot do: a software-based
translation system can actually eliminate atoms from the instruction stream,
rather than just reorder them. In this example, all but the last of the condition-code
settings are unnecessary (allowing for greater flexibility in scheduling), and one of the
load atoms is redundant, leaving fewer atoms to be executed.
ld %r30,[%esp]          // load from stack only once
add %eax,%eax,%r30
add %ebx,%ebx,%r30      // reuse data loaded earlier
ld %esi,[%ebp]
sub.c %ecx,%ecx,5       // only this last condition code needed
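The two optimizations applied in this pass can be sketched in Python as follows. This is a minimal illustration, not Transmeta's optimizer: the Atom class and both function names are hypothetical, and the sketch assumes that the block contains no stores, that each %rNN temporary is written at most once, that no atom inside the block reads the condition codes, and that the condition codes are live when the block exits.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    op: str                 # e.g. "ld", "add", "sub"
    dest: str               # destination register
    srcs: tuple             # source operands: registers, immediates or "[reg]" addresses
    sets_cc: bool = False   # True for the ".c" opcode suffix

    def __str__(self):
        suffix = ".c" if self.sets_cc else ""
        return f"{self.op}{suffix} {self.dest}," + ",".join(self.srcs)


def is_temp(reg):
    return reg.startswith("%r")  # %r30, %r31, ... as opposed to x86 names like %eax


def eliminate_redundant_loads(atoms):
    available = {}  # address operand -> temporary that already holds the loaded value
    renamed = {}    # eliminated temporary -> temporary to read instead
    out = []
    for atom in atoms:
        srcs = tuple(renamed.get(s, s) for s in atom.srcs)
        addr = srcs[0]
        if atom.op == "ld" and addr in available and is_temp(atom.dest):
            renamed[atom.dest] = available[addr]  # drop the load, reuse the earlier value
            continue
        # Overwriting a register invalidates values held in it or addressed through it.
        available = {a: r for a, r in available.items()
                     if r != atom.dest and a != f"[{atom.dest}]"}
        if atom.op == "ld" and is_temp(atom.dest):
            available[addr] = atom.dest
        out.append(Atom(atom.op, atom.dest, srcs, atom.sets_cc))
    return out


def drop_dead_condition_codes(atoms):
    out = []
    cc_needed = True  # assumption: the flags are live at the end of the block
    for atom in reversed(atoms):
        keep = atom.sets_cc and cc_needed
        if atom.sets_cc:
            cc_needed = False  # earlier settings are overwritten before anyone reads them
        out.append(Atom(atom.op, atom.dest, atom.srcs, keep))
    out.reverse()
    return out


block = [
    Atom("ld", "%r30", ("[%esp]",)),
    Atom("add", "%eax", ("%eax", "%r30"), sets_cc=True),
    Atom("ld", "%r31", ("[%esp]",)),
    Atom("add", "%ebx", ("%ebx", "%r31"), sets_cc=True),
    Atom("ld", "%esi", ("[%ebp]",)),
    Atom("sub", "%ecx", ("%ecx", "5"), sets_cc=True),
]
optimized = drop_dead_condition_codes(eliminate_redundant_loads(block))
for atom in optimized:
    print(atom)
```

Applied to the six atoms produced by the first pass, the sketch yields the five atoms listed above: the second load from [%esp] disappears and only the final sub keeps its “.c” suffix.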
In a final pass, the scheduler reorders the remaining atoms and groups them into
individual molecules. This process is similar to what out-of-order processors do in their
dispatch hardware. However, by using software to schedule the code, it becomes feasible
to use more effective scheduling algorithms and consider a larger window of instructions
than would be reasonable in hardware. After scheduling, we have reduced the four
original x86 instructions down to just two molecules:
1. ld %r30,[%esp]; sub.c %ecx,%ecx,5
2. ld %esi,[%ebp]; add %eax,%eax,%r30; add %ebx,%ebx,%r30
There are two important points to observe here:
• Though the molecules are executed in-order by the hardware, they perform the work of
the original x86 instructions out of order.
• The molecules explicitly encode the instruction-level parallelism, hence they can be
executed by a simple (and hence fast and low-power) VLIW engine; the hardware need
not perform any complex instruction reordering itself.
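The grouping of atoms into molecules can be illustrated with a small greedy list scheduler in Python. This is a sketch under assumptions, not the scheduler Transmeta uses: the Atom class, the UNIT_SLOTS limits (two integer atoms and one memory atom per molecule) and the rule that a result only becomes visible to the following molecule are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    text: str          # assembly text of the atom
    unit: str          # functional unit class: "alu" or "mem"
    writes: frozenset  # registers (and "cc") written
    reads: frozenset   # registers read


UNIT_SLOTS = {"alu": 2, "mem": 1}  # assumed per-molecule limits
MAX_ATOMS = 4                      # a molecule holds at most four atoms


def depends_on(later, earlier):
    return bool(later.reads & earlier.writes      # read after write
                or later.writes & earlier.writes  # write after write
                or later.writes & earlier.reads)  # write after read


def schedule(atoms):
    remaining = list(atoms)
    molecules = []
    while remaining:
        chosen, free = [], dict(UNIT_SLOTS)
        for i, atom in enumerate(remaining):
            # An atom may not pass, or share a molecule with, an earlier atom it conflicts with.
            blocked = any(depends_on(atom, earlier) for earlier in remaining[:i])
            if not blocked and free[atom.unit] > 0 and len(chosen) < MAX_ATOMS:
                chosen.append(i)
                free[atom.unit] -= 1
        molecules.append([remaining[i].text for i in chosen])
        remaining = [a for i, a in enumerate(remaining) if i not in chosen]
    return molecules


def atom(text, unit, writes, reads):
    return Atom(text, unit, frozenset(writes), frozenset(reads))


optimized = [
    atom("ld %r30,[%esp]", "mem", {"%r30"}, {"%esp"}),
    atom("add %eax,%eax,%r30", "alu", {"%eax"}, {"%eax", "%r30"}),
    atom("add %ebx,%ebx,%r30", "alu", {"%ebx"}, {"%ebx", "%r30"}),
    atom("ld %esi,[%ebp]", "mem", {"%esi"}, {"%ebp"}),
    atom("sub.c %ecx,%ecx,5", "alu", {"%ecx", "cc"}, {"%ecx"}),
]
molecules = schedule(optimized)
for number, molecule in enumerate(molecules, start=1):
    print(f"{number}. " + "; ".join(molecule))
```

On the five optimized atoms this produces the same two molecules as above. The atoms of the second molecule come out in a different textual order, which does not matter because all atoms of a molecule execute in parallel.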
At times, x86 instructions in memory get overwritten, either because the operating system
is loading a new program, or because an application is using self-modifying code. When
this happens to code that has already been translated, the Code Morphing software needs
to be notified to keep it from erroneously executing a translation for the old code. To this
end, whenever the system translates a block of x86 code, it write-protects the page of x86
memory containing that code. It does so by setting a dedicated “translated” bit in that
page’s entry in the processor’s memory management unit. (As with other details of
the VLIW hardware, that bit is invisible to x86 software.) When a protected page is
written to, the simplest remedy is to invalidate the affected translation(s). As the runtime
system dynamically learns more about the program’s behavior, it switches to more
sophisticated strategies (beyond the scope of this paper).
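The simplest remedy, invalidating every translation on a page when that page is written, might look like the following Python sketch. The class and method names are hypothetical, the 4 KiB page size is an assumption, and each translated block is assumed to lie within a single page; in the real processor the “translated” bit lives in the memory management unit rather than in a dictionary.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB x86 pages


class TranslationTracker:
    """Toy model of page-level protection for translated x86 code (names are illustrative)."""

    def __init__(self):
        self.translations = {}      # x86 block address -> translation
        self.translated_pages = {}  # page number -> block addresses translated from that page

    def add_translation(self, address, translation):
        self.translations[address] = translation
        # Recording the page plays the role of setting its "translated" bit.
        self.translated_pages.setdefault(address // PAGE_SIZE, set()).add(address)

    def is_protected(self, address):
        return address // PAGE_SIZE in self.translated_pages

    def on_write(self, address):
        # A write to a protected page invalidates every translation made from it.
        stale = self.translated_pages.pop(address // PAGE_SIZE, set())
        for block in stale:
            del self.translations[block]
        return stale


tracker = TranslationTracker()
tracker.add_translation(0x1000, "translation A")
tracker.add_translation(0x1010, "translation B")
tracker.add_translation(0x3000, "translation C")
stale = tracker.on_write(0x1FF0)  # a store that lands on the page holding A and B
```

Here the store invalidates both translations on the written page and leaves the translation on the other page untouched. Invalidating at page granularity is coarse, which is presumably one reason the runtime later switches to more sophisticated strategies.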
Code Morphing software can also adjust the Crusoe processor’s clock frequency and
voltage on the fly (since a lower voltage can be used at a lower operating frequency).
Because power varies linearly with clock speed and with the square of the voltage,
adjusting both can produce cubic reductions in power consumption, whereas a conventional
CPU can adjust power only linearly. For example, assume an application program only
requires 90% of the processor’s speed. On a conventional processor, throttling back the
processor speed by 10% cuts power by 10%, whereas under the same conditions, LongRun
power management can reduce power by almost 30%.
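The arithmetic behind these figures can be checked with a short Python calculation. It assumes, as a simplification, that the voltage can be lowered by the same fraction as the clock frequency; the function name relative_power is illustrative.

```python
def relative_power(freq_scale, voltage_scale=1.0):
    """Power relative to full speed, using P proportional to f * V^2."""
    return freq_scale * voltage_scale ** 2


# Conventional throttling: frequency down 10%, voltage unchanged.
freq_only = relative_power(0.9)
# Frequency and voltage both down 10% (assumed proportional scaling).
freq_and_voltage = relative_power(0.9, 0.9)

print(f"frequency only:        {1 - freq_only:.1%} saved")
print(f"frequency and voltage: {1 - freq_and_voltage:.1%} saved")
```

Scaling only the frequency leaves 90% of the original power, while scaling both leaves 0.9 cubed, or about 72.9%, which is a saving of roughly 27% and matches the “almost 30%” quoted above.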
Conclusion
Rather than “throwing hardware” at design problems, Transmeta's designers chose an innovative approach
that employs a unique combination of hardware and software. Using software to
decompose complex instructions into simple atoms and to schedule and optimize the
atoms for parallel execution saves millions of logic transistors and cuts power
consumption on the order of 60–70% over conventional approaches—while at the same
time enabling aggressive code optimization techniques that are simply not feasible in
traditional x86 implementations. Transmeta’s Code Morphing software and fast VLIW
hardware, working together, achieve low power consumption without sacrificing high
performance for real-world applications.
References
http://www.lems.brown.edu/~iris/en291s9-02/lectures/Jeryl-crusoe.pdf
http://www.charmed.com/products/specs/datasheets/TM5800.pdf
http://www.transmeta.com/crusoe_docs/Crusoe_SWOptGuide_8-3-01.pdf