Research Paper

Final Paper
Prabhjot S. Saini
Digital Design 1

The Transmeta Crusoe Microprocessor

Advances in wireless networking technology have engendered a new paradigm of computing, called mobile computing, in which users carrying portable devices have access to a shared infrastructure independent of their physical location. This has led to an explosion in mobile computing and to an extensive, aggressive drive to optimize these devices and squeeze out every bit of performance possible. A central challenge for the industry was to provide low-power processing that extends battery life without sacrificing performance. Standard applications still had to run, so there was a continual need for sufficient processing capability combined with a longer battery run-time. The existing superscalar designs that delivered the performance were expensive in terms of power, while any new processor approach was constrained by the popularity of the x86 architecture: a new design had to be compatible with the industry standards set by x86. The objectives, then, were strong performance, low power consumption and x86 compatibility. Transmeta came up with a novel collection of solutions to these problems at both the hardware and the software level, and the rest of this paper discusses how these solutions work and the thinking behind them.

Hardware Solution

The hardware designers created a very simple, high-performance VLIW (very long instruction word) engine with two integer units, one floating-point unit, a memory load/store unit and a branch unit. An instruction word in this architecture is either 64 or 128 bits long; it is known as a molecule and contains up to four RISC-like instructions, called atoms. All the atoms of a molecule are executed in parallel, and the format of the molecule determines how the atoms get routed to the functional units.
This fixed format greatly simplifies the decoding and dispatch logic. To keep the processor running at full speed, each molecule is packed with as many atoms as possible. The integer register file has 64 registers, %r0 to %r63. The Code Morphing software allocates some of these to hold x86 state, while others contain state internal to the system or can be used as temporary registers, e.g., for register renaming in software.

The assembly code examples in this paper are written one molecule per line, with atoms separated by semicolons. The destination register of an atom is specified first; a “.c” opcode suffix designates an operation that sets the condition codes. Where a register holds x86 state, we use the x86 name for that register (e.g., %eax instead of the less descriptive %r0).

Because the x86 instruction set is so complex, decoding and dispatching it in hardware requires a large number of transistors, all of which consume power. The Crusoe microprocessor therefore combines the hardware techniques above with software techniques to achieve its overall power savings.

Software Solution

Crusoe processors consist of a hardware engine logically surrounded by a software layer. That layer is the Code Morphing software developed by Transmeta's engineers and scientists. It is dynamic translation and instruction scheduling software that converts x86 code into code for the VLIW architecture. The software resides in a serial ROM and is copied to RAM at startup for faster access; it is the only software written for the VLIW core. It is essentially an on-the-fly compiler/interpreter that breaks all x86 instructions (including those of the BIOS and the operating system) into atoms and then schedules them into molecules, just as a VLIW compiler would.

Code Morphing software includes a number of advanced features to achieve good system-level performance.
Facilities that support the Code Morphing software are also built into the underlying processor hardware.

Code Morphing software is fundamentally a dynamic translation system: a program that compiles instructions for one instruction set architecture (in this case, the x86 target ISA) into instructions for another ISA (the VLIW host ISA). It is the first program to start executing when the processor boots. All x86 code sees only the x86 ISA that the Code Morphing software supports; the only program written directly for the VLIW engine is the Code Morphing software itself.

The typical behavior of Code Morphing software is to execute a loop that decodes and executes x86 instructions. The first few times a specific x86 code sequence is executed, Code Morphing software interprets the code by decoding the instructions one at a time and dispatching execution to the corresponding native VLIW subroutines. Once the x86 code has been executed several times, Code Morphing software translates the x86 instructions into optimized native VLIW instructions, executes the translated code, and caches the translation for future use. If the same x86 code is executed again, the cached high-performance translation runs immediately and no re-translation is required.

The flexibility of the software translation approach comes at a price: the processor has to dedicate some of its operating cycles to running the Code Morphing software, cycles that a conventional x86 processor could spend executing application code. To deliver good overall system performance, Code Morphing software has been carefully designed for maximum efficiency and low overhead. Application code developed for the Crusoe processor can also benefit from a few simple guidelines that improve execution efficiency and minimize Code Morphing software overhead.
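The interpret-first, translate-when-hot behavior described above can be illustrated with a small sketch. This is not Transmeta's implementation: the toy instruction format, the names Morpher, interpret and translate, and the threshold value are all assumptions made for illustration (the text only says a sequence is translated after it has executed "several times"), and a Python closure stands in for native VLIW code.

```python
from typing import Callable, Dict, List, Tuple

Instr = Tuple[str, str, int]   # toy instruction: (opcode, register, immediate)
Regs = Dict[str, int]

HOT_THRESHOLD = 3  # assumed value; the paper only says "several times"


def interpret(block: List[Instr], regs: Regs) -> None:
    """Slow path: decode and execute one toy instruction at a time."""
    for op, reg, imm in block:
        if op == "add":
            regs[reg] += imm
        elif op == "sub":
            regs[reg] -= imm
        else:
            raise ValueError(f"unknown opcode: {op}")


def translate(block: List[Instr]) -> Callable[[Regs], None]:
    """Stand-in for translation: decode once, fold into per-register deltas."""
    deltas: Dict[str, int] = {}
    for op, reg, imm in block:
        if op not in ("add", "sub"):
            raise ValueError(f"unknown opcode: {op}")
        deltas[reg] = deltas.get(reg, 0) + (imm if op == "add" else -imm)

    def translated(regs: Regs) -> None:
        for reg, delta in deltas.items():
            regs[reg] += delta

    return translated


class Morpher:
    """Hypothetical dispatcher: interpret cold code, translate and cache hot code."""

    def __init__(self, threshold: int = HOT_THRESHOLD) -> None:
        self.threshold = threshold
        self.counts: Dict[int, int] = {}                     # executions per block address
        self.cache: Dict[int, Callable[[Regs], None]] = {}   # translation cache
        self.trace: List[str] = []                           # which path each run took

    def run(self, address: int, block: List[Instr], regs: Regs) -> None:
        if address in self.cache:
            # Cached translation exists: run it, no decoding needed.
            self.trace.append("cached")
            self.cache[address](regs)
            return
        self.counts[address] = self.counts.get(address, 0) + 1
        if self.counts[address] > self.threshold:
            # Block is hot: translate once, cache, and run the translation.
            self.cache[address] = translate(block)
            self.trace.append("translated")
            self.cache[address](regs)
        else:
            self.trace.append("interpreted")
            interpret(block, regs)


if __name__ == "__main__":
    morpher = Morpher(threshold=2)
    registers = {"eax": 0, "ecx": 10}
    code = [("add", "eax", 4), ("sub", "ecx", 5), ("add", "eax", 1)]
    for _ in range(5):
        morpher.run(0x1000, code, registers)
    print(morpher.trace, registers)
```

The sketch keys the cache by block address only; it does not model what happens when the underlying x86 code is overwritten, which the paper treats separately.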

Making a Translation

A simple example shows how the Code Morphing system translates a chunk of x86 code into equivalent code for the Crusoe processor’s VLIW engine. Assume that the filtering and path selection algorithms have chosen the following four x86 instructions, (A) through (D), for translation.

    A. addl %eax,(%esp)    // load data from stack, add to %eax
    B. addl %ebx,(%esp)    // ditto, for %ebx
    C. movl %esi,(%ebp)    // load %esi from memory
    D. subl %ecx,5         // subtract 5 from %ecx register

In a first pass, the frontend of the translation system decodes the x86 instructions and translates them into a simple sequence of atoms. At this stage, it is still fairly easy to discern the correspondence between the original and generated code. (Registers %r30 and %r31 are used as temporaries for the memory-load operations.)

    ld %r30,[%esp]          // load from stack, into temporary
    add.c %eax,%eax,%r30    // add to %eax, set condition codes
    ld %r31,[%esp]
    add.c %ebx,%ebx,%r31
    ld %esi,[%ebp]
    sub.c %ecx,%ecx,5

In a second pass, the optimizer applies well-known compiler optimizations to the code, such as common subexpression elimination, loop invariant removal, or dead code elimination (including unnecessary settings of the condition codes). This exemplifies optimizations that a hardware-only x86 implementation cannot do: a software-based translation system can actually eliminate atoms from the instruction stream, rather than just reorder them. In this example, all but the last setting of the condition codes are unnecessary (which allows greater flexibility in scheduling), and one of the load atoms is redundant because both loads read the same stack location, leaving fewer atoms to be executed.

    ld %r30,[%esp]          // load from stack only once
    add %eax,%eax,%r30
    add %ebx,%ebx,%r30      // reuse data loaded earlier
    ld %esi,[%ebp]
    sub.c %ecx,%ecx,5       // only this last condition code needed

In a final pass, the scheduler reorders the remaining atoms and groups them into individual molecules.
This process is similar to what out-of-order processors do in their dispatch hardware. However, by using software to schedule the code, it becomes feasible to use more effective scheduling algorithms and to consider a larger window of instructions than would be reasonable in hardware. After scheduling, the four original x86 instructions have been reduced to just two molecules:

    1. ld %r30,[%esp]; sub.c %ecx,%ecx,5
    2. ld %esi,[%ebp]; add %eax,%eax,%r30; add %ebx,%ebx,%r30

There are two important points to observe here:

• Though the molecules are executed in order by the hardware, they perform the work of the original x86 instructions out of order.
• The molecules explicitly encode the instruction-level parallelism, so they can be executed by a simple (and hence fast and low-power) VLIW engine; the hardware need not perform any complex instruction reordering itself.

At times, x86 instructions in memory get overwritten, either because the operating system is loading a new program or because an application is using self-modifying code. When this happens to code that has already been translated, the Code Morphing software needs to be notified so that it does not erroneously execute a translation of the old code. To this end, whenever the system translates a block of x86 code, it write-protects the page of x86 memory containing that code. It does so by setting a dedicated “translated” bit in that page’s entry in the processor’s memory management unit. (As with other details of the VLIW hardware, that bit is invisible to x86 software.) When a protected page is written to, the simplest remedy is to invalidate the affected translation(s). As the runtime system dynamically learns more about the program’s behavior, it switches to more sophisticated strategies (beyond the scope of this paper).

Code Morphing software can also adjust the Crusoe processor’s clock frequency and voltage on the fly (at a lower operating frequency, a lower voltage can be used).
Because power varies linearly with clock speed and with the square of the voltage, lowering both together can reduce power consumption roughly with the cube of the slowdown, whereas a conventional CPU that adjusts only its clock speed reduces power linearly. For example, assume an application program requires only 90% of the processor’s speed. On a conventional processor, throttling back the clock by 10% cuts power by 10%. If the voltage can also be lowered by about 10%, power scales as 0.9 × 0.9² ≈ 0.73, so under the same conditions Transmeta’s LongRun power management can reduce power by almost 30%.

Conclusion

Rather than “throwing hardware” at design problems, Transmeta’s designers chose an innovative approach that employs a unique combination of hardware and software. Using software to decompose complex instructions into simple atoms, and to schedule and optimize the atoms for parallel execution, saves millions of logic transistors and cuts power consumption on the order of 60–70% over conventional approaches, while at the same time enabling aggressive code optimization techniques that are simply not feasible in traditional x86 implementations. Transmeta’s Code Morphing software and fast VLIW hardware, working together, achieve low power consumption without sacrificing high performance for real-world applications.

References

http://www.lems.brown.edu/~iris/en291s9-02/lectures/Jeryl-crusoe.pdf
http://www.charmed.com/products/specs/datasheets/TM5800.pdf
http://www.transmeta.com/crusoe_docs/Crusoe_SWOptGuide_8-3-01.pdf
